Cation Language Reference
Lexical structure
The lexical structure of Cation describes which sequences of characters form valid tokens of the language. These tokens are the lowest-level building blocks of the language and are used to describe the rest of the language in subsequent chapters. A token is an identifier, keyword, punctuation mark, literal, or operator.
In most cases, tokens are generated from the characters of a Cation source file by considering the longest possible substring from the input text, within the constraints of the grammar that are specified below. This behavior is referred to as longest match or maximal munch.
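As an illustration of longest match, the following Python sketch tokenizes a small, hand-picked subset of the reserved operators. It is not the Cation lexer: the operator list, the tokenize name and the simplified identifier rule are assumptions made for this example, and the whitespace-sensitive distinction between prefix and infix operators is ignored.

```python
# Hypothetical operator subset, chosen only to show overlapping prefixes.
OPERATORS = ["|>", "<|", ">|", "|?", "|:", "->", "<-", "=>", "|", "<", ">", "-", "="]


def tokenize(source):
    """Split `source` into tokens, always taking the longest matching operator."""
    # Try longer operators first so that "|>" wins over "|" followed by ">".
    candidates = sorted(OPERATORS, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1
            continue
        if ch.isalnum() or ch == "_":
            # Simplified identifier/number rule: consume the whole alphanumeric run.
            j = i
            while j < len(source) and (source[j].isalnum() or source[j] == "_"):
                j += 1
            tokens.append(source[i:j])
            i = j
            continue
        for op in candidates:
            if source.startswith(op, i):
                tokens.append(op)
                i += len(op)
                break
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    return tokens
```

With this rule the input a|>b yields the three tokens a, |> and b, rather than splitting |> into | and >.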
Whitespace and comments
Whitespace has three uses:
- to separate tokens in the source file;
- to separate statements with indentation;
- to distinguish between prefix and infix operators (see Operators).
The following characters are considered whitespace: space (U+0020), line feed (U+000A), carriage return (U+000D), horizontal tab (U+0009), vertical tab (U+000B), form feed (U+000C) and null (U+0000).
Comments are treated as whitespace by the compiler. Single-line comments begin with -- and continue until a line feed (U+000A) or carriage return (U+000D). Multiline comments begin with {- and end with -}. Multiline comments can be nested, but the comment markers must be balanced.
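The balancing rule for nested comments can be checked with a depth counter. The Python sketch below is only an illustration under stated assumptions: the strip_comments name is hypothetical, string and character literals are not recognized, and a stray -} outside any comment is treated as an error.

```python
def strip_comments(source):
    """Replace each comment with one space; raise ValueError on unbalanced markers."""
    out = []
    depth = 0  # current nesting level of {- ... -}
    i = 0
    while i < len(source):
        two = source[i:i + 2]
        if two == "{-":
            depth += 1
            i += 2
        elif two == "-}":
            if depth == 0:
                raise ValueError("comment terminator without an opening marker")
            depth -= 1
            i += 2
            if depth == 0:
                out.append(" ")  # the whole comment acts as whitespace
        elif depth > 0:
            i += 1  # inside a multiline comment: skip the character
        elif two == "--":
            # Single-line comment: skip up to the next line feed or carriage return.
            while i < len(source) and source[i] not in "\n\r":
                i += 1
            out.append(" ")
        else:
            out.append(source[i])
            i += 1
    if depth != 0:
        raise ValueError("unterminated multiline comment")
    return "".join(out)
```

The counter only returns to zero when every {- has met its -}, which is what allows one comment to enclose another.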
Identifiers
Identifiers begin with an uppercase or lowercase letter A through Z, an underscore (_), a noncombining alphanumeric Unicode character in the Basic Multilingual Plane, or a character outside the Basic Multilingual Plane that isn’t in a Private Use Area. After the first character, digits and combining Unicode characters are also allowed.
To use ASCII punctuation characters in an identifier, or to use a reserved word as an identifier, put a backtick (`) before and after it. For example, data isn’t a valid identifier, but `data` is valid. The backticks aren’t considered part of the identifier; `x` and x have the same meaning.
Keywords
The following keywords are reserved and can’t be used as identifiers unless they’re escaped with backticks, as described above in Identifiers. Keywords other than let and lambda can be used without backtick escaping as field or variant names in data types, and as parameter names in a function declaration or function call.
- data: defines a new data type
- class: defines a new data type class
- fx: defines a new function (prefix function)
- infx: defines a new infix function
- lambda: defines a lambda function
- alias: defines a new alias for a function
- let: defines a new value
The following tokens are reserved as built-in operators and can’t be used in custom operators:
( ) { } [ ] . , : ; # => -> <- <| |> >| |? |: .\ `
Literals
A literal is the source code representation of a value of a type, such as a number or string.
The following are examples of literals:
42 -- Integer literal
42#U8 -- Tagged integer literal
3.14159 -- Floating-point literal
'A' -- Character literal
"Hello, world!" -- String literal
A literal without a tag doesn’t have a type on its own. Instead, Cation’s type inference attempts to infer a type for the literal. For example, in the declaration let x: U8 := 42, Cation uses the explicit type information (: U8) to infer that the type of the integer literal 42 is U8. If there isn’t suitable type information available, Cation infers that the literal’s type is one of the default built-in Cation literal types listed in the table below.
Literal | Default type |
---|---|
Integer | U64 |
Float | F64 |
Character | Char |
String | String |
When specifying the type annotation for a literal value with a tag, the annotation’s type must be a type that can be instantiated from that literal value.
For example, in the declaration let str := "Hello, world", the default inferred type of the string literal "Hello, world" is String. This can be changed with a type annotation or a tag: in both let str: AsciiString := "Hello, world" and let str := "Hello, world"#AsciiString, the literal is inferred to have the type AsciiString instead of String.
Integer literals
Integer literals represent integer values of some precision. By default, integer literals are expressed in decimal; you can specify an alternate base using a prefix:
- binary literals begin with 0b,
- octal literals begin with 0o,
- and hexadecimal literals begin with 0x.
Decimal literals contain the digits 0 through 9. Binary literals contain 0 and 1, octal literals contain 0 through 7, and hexadecimal literals contain 0 through 9 as well as A through F in upper- or lowercase.
Negative integer literals are expressed by prepending a minus sign (-) to an integer literal, as in -42.
Underscores (_) are allowed between digits for readability, but they’re ignored and therefore don’t affect the value of the literal. Integer literals can begin with leading zeros (0), but they’re likewise ignored and don’t affect the base or value of the literal.
Unless otherwise specified, the default inferred type of an integer literal is the Cation built-in type U64.
Cation has built-in types for various sizes of signed, unsigned and non-zero integers, as described in Integers.
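The rules for integer literals (base prefixes, ignored underscores, ignored leading zeros and an optional minus sign) can be summarized in a short Python sketch. The parse_integer_literal name is hypothetical, tags such as 42#U8 are not handled, and the result is an unbounded Python integer rather than a value of a Cation type.

```python
# Base prefixes as listed in this section; anything else is read as decimal.
PREFIXES = {"0b": 2, "0o": 8, "0x": 16}


def parse_integer_literal(text):
    """Return the value of an untagged integer literal."""
    sign = 1
    if text.startswith("-"):
        sign = -1
        text = text[1:]
    base = PREFIXES.get(text[:2], 10)
    digits = text[2:] if base != 10 else text
    # Underscores are only a readability aid and do not affect the value.
    digits = digits.replace("_", "")
    # int() rejects digits that are invalid for the base and ignores leading zeros.
    return sign * int(digits, base)
```

For example, parse_integer_literal("0b1010") and parse_integer_literal("0o12") both return 10, and parse_integer_literal("007") returns 7.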
Floating-point literals
Floating-point literals represent floating-point values of some precision.
By default, floating-point literals are expressed in decimal (with no prefix), but they can also be expressed in hexadecimal (with a 0x prefix).
Decimal floating-point literals consist of a sequence of decimal digits followed by either a decimal fraction, a decimal exponent, or both. The decimal fraction consists of a decimal point (.) followed by a sequence of decimal digits. The exponent consists of an upper- or lowercase e prefix followed by a sequence of decimal digits that indicates what power of 10 the value preceding the e is multiplied by. For example, 1.25e2 represents 1.25 × 10², which evaluates to 125.0. Similarly, 1.25e-2 represents 1.25 × 10⁻², which evaluates to 0.0125.
Hexadecimal floating-point literals consist of a 0x prefix, followed by an optional hexadecimal fraction, followed by a hexadecimal exponent. The hexadecimal fraction consists of a decimal point followed by a sequence of hexadecimal digits. The exponent consists of an upper- or lowercase p prefix followed by a sequence of decimal digits that indicates what power of 2 the value preceding the p is multiplied by. For example, 0xFp2 represents 15 × 2², which evaluates to 60. Similarly, 0xFp-2 represents 15 × 2⁻², which evaluates to 3.75.
Negative floating-point literals are expressed by prepending a minus sign (-) to a floating-point literal, as in -42.5.
Underscores (_) are allowed between digits for readability, but they’re ignored and therefore don’t affect the value of the literal. Floating-point literals can begin with leading zeros (0), but they’re likewise ignored and don’t affect the base or value of the literal.
Unless otherwise specified, the default inferred type of a floating-point literal is the Cation built-in type F64, which represents a 64-bit floating-point number.
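The worked values in this section can be reproduced with Python, whose float.fromhex accepts the same 0x and p notation. This is a sketch for checking the arithmetic only: the parse_float_literal name is hypothetical, and Python's parsers accept inputs (such as inf) that are not Cation literals.

```python
def parse_float_literal(text):
    """Evaluate a decimal or hexadecimal floating-point literal as a 64-bit float."""
    # Underscores are only a readability aid and do not affect the value.
    cleaned = text.replace("_", "")
    if cleaned.lstrip("-").startswith("0x"):
        # Hexadecimal form: the exponent after p is a power of 2.
        return float.fromhex(cleaned)
    # Decimal form: the exponent after e is a power of 10.
    return float(cleaned)
```

Here parse_float_literal("0xFp2") gives 60.0 and parse_float_literal("0xFp-2") gives 3.75, matching the examples above.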
Character literal
String literals
Binary data literals
Date-and-time literals
Custom literals
Syntactic constructs
Specifiers
Projections
Function calls are made using the projection operator, which is written as a space ( ); i.e. a prefix function fn is called by writing fn args, where args can be a named or unnamed data type matching the arguments in the function declaration. The arguments can be wrapped in (..) or left bare, but if multiple arguments are present they must be separated from each other with a comma , (the product operator, which composes arguments into a data type). Thus, a function always takes a single argument, which can be a named or unnamed data type.
Prefix functions may also be called using infix notation: (args).fn, or arg0.fn arg1, .., argLast.
An infix function infn is called as arg0 infn arg1, .., argLast. The call can be rewritten in prefix notation using the (infn arg0, .., argLast) or (infn) arg0, .., argLast forms.
Injections
Injections are branching operators, which allow code execution to branch based on the value of a co-product type. They are either pattern-matching (an equivalent of Rust match and Haskell case) or based on a boolean test.
Pattern matching is done with the >| infix operator, which takes a value of a co-product type on the left side, and a set of natural transformations matching each of the injections on the right:
value >|
pattern1 => code
pattern2 => code
_ => code -- default match
Branching based on boolean conditions uses the .. |? .. |: .. operator, a ternary operator corresponding to the if .. then .. else construction. It can also be chained as condition1 |? statement1 |: condition2 |? statement2 |: statementElse, corresponding to if condition1 then statement1 elseIf condition2 then statement2 else statementElse.
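For readers more familiar with other languages, the chained form of the boolean branching operator reads like a chained conditional expression. The Python sketch below is an analogy only; the classify function and its conditions are made up for the example.

```python
def classify(n):
    # Cation-style reading: n < 0 |? "negative" |: n == 0 |? "zero" |: "positive"
    # Conditions are tested left to right and the first true one selects the result.
    return "negative" if n < 0 else "zero" if n == 0 else "positive"
```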
Annotations
Operators
Built-in
Categorical limits
Operator , is the categorical product operator. Operator | is the categorical sum operator.
Mappings (morphisms and functors)
Operators -> and <- define functors: they project an object, or each object in a category (a value, collection, iterator or data type), to a different object or category.
For instance, when used in function definition, it projects all possible values of a function input as data type (which may be a composite anonymous type) to the set of return values defined by the return data type:
fx some: input Type -> output Type
Operator -> is also used inside key-value map collections, separating keys and their corresponding values.
Natural transformation mapping
Operator => is used to introduce generic arguments as a natural transformation.
Iteration operators
Operators |> and <| are used to build iterators: they map the items of an iterable collection into new values:
x <- 0..<100 |> mulTry x, 2
This results in a new iterator over the doubled values of the range from 0 to 99.
One may skip the initial x <- assignment; in this case the input value of the current iteration is passed as the first argument of the next expression:
0..<100 |> mulTry 2
One may also use the context operator _:
0..<100 |> _ mulTry 2
or reverse the order of the expressions:
mulTry 2 <| 0..<100
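A rough Python analogy for these forms is a lazy map over a range. In the sketch below, mul_try is a stand-in written for this example and is assumed to be plain multiplication; the behaviour of Cation's mulTry on overflow is not modelled.

```python
def mul_try(x, factor):
    # Stand-in for mulTry: plain multiplication (an assumption for this sketch).
    return x * factor


# x <- 0..<100 |> mulTry x, 2
doubled = (mul_try(x, 2) for x in range(100))

# 0..<100 |> mulTry 2   (the current item becomes the first argument)
doubled_implicit = map(lambda item: mul_try(item, 2), range(100))
```

Both objects are lazy iterators: no multiplication happens until the values are consumed.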
Lambda operator
The lambda operator .\, with the alias λ (Greek letter lambda), is a shorthand for creating lambda expressions.
It has four forms:
- single-line, end of line (trailing lambda):
.\args -> ret: expr
- single-line, alongside other expressions:
.\(args -> ret: expr), _
- multi-line, end of line (trailing lambda block):
.\args -> ret
expr1
expr2
-- ...
- multi-line, alongside other expressions:
.\(args -> ret
expr1
expr2
-- ...
), _
All forms may omit the return type part (-> ret); in this case the return type is inferred by the compiler:
.\args: expr
.\(args: expr), _
.\args
expr1
expr2
-- ...
.\(args
expr1
expr2
-- ...
), _
If the lambda expression has no inputs, the lambda operator is followed directly by the expression itself, with no colon:
.\expr
.\(expr), _
.\
expr1
expr2
-- ...
.\(
expr1
expr2
-- ...
), _
See also lambda specifier.
Standard library
Monadic operators
Operators !, !!, ? and ?? are used with monad types (like optionals/maybes or result types). The ! and ? operators mark the part of the expression whose monad value should be unwrapped; if the unwrap fails, the expression defaults to the one placed after the !! or ?? operator at the end of the same line. For instance, if value is of Maybe type, we can write value ?? 0 or value !! noValue. The first defaults the expression to zero; the second converts the return type of the function to a result type, creates an enum Error with the variant noValue, and returns that value if the maybe monad in value doesn't contain an actual value.
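The two defaulting behaviours can be pictured with Python's None standing in for an empty Maybe. This is a loose sketch under stated assumptions: the function names are hypothetical, and the ("ok", value) and ("error", variant) pairs are only a stand-in for a result type.

```python
def unwrap_or(value, default):
    """Models value ?? 0: use the default when the Maybe holds nothing."""
    return default if value is None else value


def unwrap_or_error(value, error_variant):
    """Models value !! noValue: yield a result-like pair instead of a bare value."""
    if value is None:
        return ("error", error_variant)
    return ("ok", value)
```

Note that a present value of 0 is still a value: unwrap_or(0, 5) returns 0, not 5.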
Arithmetic operators
Unlike many other languages, Cation requires overflow and underflow conditions in arithmetic, as well as division by zero, to be handled explicitly. This is required for termination analysis and helps avoid undefined behaviour. Thus, each arithmetic operator has multiple forms, which handle overflows and underflows differently. This approach is named checked arithmetic.
To construct arithmetic operators, a symbol representing mathematical operation (like +, -, *, /, %, ^) is postfixed with a symbol representing the way of handling overflow, underflow or zero divisions. Such symbols are:
- ? for converting the result of the operation into Maybe monad;
- ! for converting the result of the operation into Result::error monad variant with details on specific condition which has occurred;
- % for wrapping a value in case of overflow (modulo-arithmetic) and returning ArRes monad;
- ^ for saturating a value with a maximum possible value in case of overflow;
- with no postfix for extending the result type to the next bit dimension so it always fits.
Additionally, Cation allows native arithmetic operations when the operation itself and the resulting type guarantee that overflow and other exceptional conditions are impossible. These operations are:
- unsigned, signed and non-zero integer addition with the + operator, when the bit dimension of the resulting integer type is equal to or exceeds the sum of the bit dimensions of the inputs;
- signed integer subtraction with the - operator, when the bit dimension of the resulting integer type is equal to or exceeds the sum of the bit dimensions of the inputs;
- unsigned, signed and non-zero integer multiplication with the * operator, when the dimension of the resulting integer type is equal to or exceeds the product of the bit dimensions of the inputs;
- division with the / operator of non-zero unsigned integers;
- modulo division with the % operator of non-zero unsigned integers;
- unsigned, signed and non-zero potentiation with the ^ operator when the bit dimensions of the resulting integer type exceeds
Thus, the resulting table of the arithmetic operations is the following:
Operation | Native | Maybe | Result | Wrapping | Saturating | Extending |
---|---|---|---|---|---|---|
Addition | + | +? | +! | +% | +^ | + |
Subtraction | - | -? | -! | -% | -^ | - |
Multiplication | * | *? | *! | *% | *^ | * |
Division | / | /? | /! | n/a | n/a | n/a |
Modulo division | % | %? | %! | n/a | n/a | n/a |
Potentiation | ^ | ^? | ^! | ^% | ^^ | ^ |
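To make the addition row of the table concrete, the Python sketch below models the five overflow-handling forms for unsigned operands. It is an emulation under stated assumptions, not Cation semantics: the function names are hypothetical, None stands in for an empty Maybe, tagged pairs stand in for the Result and ArRes monads, and the "next bit dimension" is taken from the list of supported integer bit lengths.

```python
# Supported integer bit lengths, as listed in the Integers section.
BIT_LENGTHS = [8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 256, 512, 1024]


def exact_sum(a, b, bits):
    """Return the exact sum and whether it overflows an unsigned `bits`-wide type."""
    total = a + b
    return total, total > (1 << bits) - 1


def add_maybe(a, b, bits=8):       # models +?
    total, overflow = exact_sum(a, b, bits)
    return None if overflow else total


def add_result(a, b, bits=8):      # models +!
    total, overflow = exact_sum(a, b, bits)
    return ("error", "overflow") if overflow else ("ok", total)


def add_wrapping(a, b, bits=8):    # models +%
    total, overflow = exact_sum(a, b, bits)
    # Keep only the low `bits` bits and report whether wrapping happened.
    return total % (1 << bits), overflow


def add_saturating(a, b, bits=8):  # models +^
    total, _ = exact_sum(a, b, bits)
    return min(total, (1 << bits) - 1)


def add_extending(a, b, bits=8):   # models + with no postfix
    # The result moves to the next supported bit length, so it always fits.
    wider = BIT_LENGTHS[BIT_LENGTHS.index(bits) + 1]
    return a + b, wider
```

For two 8-bit operands 200 and 100, the exact sum 300 does not fit in 8 bits: the Maybe form yields nothing, the wrapping form yields 44, the saturating form yields 255, and the extending form yields 300 in a 16-bit type.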
Bitwise operators
Boolean logic operators
Ternary logic operators
String operators
Types
Built-in
Initial and terminal
Integers
Integers come in signed, unsigned and non-zero unsigned classes, each of which contains types with different bit lengths.
Supported bit lengths for integer types are:
Bits | Bytes | Unsigned | Signed | Non-zero | C equivalents | Rust equivalents |
---|---|---|---|---|---|---|
8 bits | 1 byte | U8 | I8 | N8 | (unsigned) char | u8 , i8 , NonZeroU8 |
16 bits | 2 bytes | U16 | I16 | N16 | (unsigned) short | u16 , i16 , NonZeroU16 |
24 bits | 3 bytes | U24 | I24 | N24 | n/a | n/a |
32 bits | 4 bytes | U32 | I32 | N32 | (unsigned) long | u32 , i32 , NonZeroU32 |
40 bits | 5 bytes | U40 | I40 | N40 | n/a | n/a |
48 bits | 6 bytes | U48 | I48 | N48 | n/a | n/a |
56 bits | 7 bytes | U56 | I56 | N56 | n/a | n/a |
64 bits | 8 bytes | U64 | I64 | N64 | (unsigned) long long | u64 , i64 , NonZeroU64 |
80 bits | 10 bytes | U80 | I80 | N80 | n/a | n/a |
96 bits | 12 bytes | U96 | I96 | N96 | n/a | n/a |
112 bits | 14 bytes | U112 | I112 | N112 | n/a | n/a |
128 bits | 16 bytes | U128 | I128 | N128 | n/a | u128 , i128 , NonZeroU128 |
256 bits | 32 bytes | U256 | I256 | N256 | n/a | n/a |
512 bits | 64 bytes | U512 | I512 | N512 | n/a | n/a |
1024 bits | 128 bytes | U1024 | I1024 | N1024 | n/a | n/a |
All integer types in Cation are co-product types, made with all their allowed values. This makes it possible to use injection operators to match them against patterns and ranges.
Floats
Supported bit lengths and encodings for floating-point types are:
Type name | Bytes | Encoding | Underlying Rust type |
---|---|---|---|
F16B | 2 | bfloat16 | bfloat::bf16 |
F16 | 2 | IEEE Half | apfloat::ieee::Half |
F32 | 4 | IEEE Single | apfloat::ieee::Single |
F64 | 8 | IEEE Double | apfloat::ieee::Double |
F80 | 10 | IEEE X87 Extended | apfloat::ieee::X87DoubleExtended |
F128 | 16 | IEEE Quad | apfloat::ieee::Quad |
F256 | 32 | IEEE Oct | apfloat::ieee::Oct |
Character
Cation has a single built-in Unicode character type, Char. It is the only built-in type with a variable bit length, due to the Unicode standard: its length varies from 8 to 32 bits in 8-bit steps.
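The 8-to-32-bit range in 8-bit steps matches the sizes produced by the UTF-8 encoding. Assuming that Char follows UTF-8 (this reference does not name the encoding), the lengths can be checked in Python; utf8_bit_length is a hypothetical helper.

```python
def utf8_bit_length(ch):
    """Number of bits needed to encode one character in UTF-8."""
    return 8 * len(ch.encode("utf-8"))


# One sample character per encoded length.
SAMPLES = {
    "A": 8,            # U+0041, Latin capital letter A
    "\u00e9": 16,      # U+00E9, Latin small letter e with acute
    "\u20ac": 24,      # U+20AC, euro sign
    "\U0001F600": 32,  # U+1F600, outside the Basic Multilingual Plane
}
```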
Ranges
Range types simplify the creation of collection types and are an efficient tool for loops and iterators. Cation comes with the following set of range types, each of which can be instantiated using a shorthand range expression.
Type name | Range operator |
---|---|
RangeAll | .. |
RangeTo | ..<N |
RangeToIncl | ..=N |
RangeFrom | M.. |
RangeFromTo | M..<N |
Standard library
Small integers
String types
Monads
Statements
Expressions
Expressions are separated either with a line feed character (U+000A) or, when placed one after the other on the same line, with a semicolon ;.
Lambda expressions
Lambda expressions have two forms: operator and specifier.
Collection comprehension
The grouping operator (..) constructs an anonymous data type.
Operators [..], {..} and {..->..} construct collection types.
Range expressions
Operators ..= and ..< help to create iterators or collections over ranges. They may be combined with step size information in the form 0<=2x<=100, where any other constant can be given instead of 2.
Context value
Operator _ means the context default value, like the one kept on the stack from a previous function call or a decomposition.
Result value
Operator $ means the result of the current expression or function. It can be used to assign the output value of a function, as in $ <- value, or to access results from previous iterations, as with $-1, which accesses the previous iteration result, or $3, which accesses the result of the third iteration from the beginning.
Patterns
Specifiers
Function
Data type
Data class
Value specifier
Lambda specifier
A lambda specifier starts with the lambda keyword, followed by a value name, a colon, the argument and return type definition, and the body:
let local: U8 := random
lambda sq: x U8 -> U32
pow 2 + local
Like any other specifier, it can be written on a single line:
let local: U8 := random
lambda sq: x U8 -> U32 := pow 2 + local
See also lambda operator.
Generics
Annotations
Attributes
Statements starting with @ are used to add attributes to other Cation statements. Attributes are a form of metaprogramming: they are similar to procedural macros in Rust or annotations in Java.
@id
@alias
@final
@override
@private
Tags
Operator # adds context tags to values; for instance, it is used to add an integer tag to co-product type variants (as in data Bool: false#0 | true#1) or to force a literal to be of a specific data type (as in ..100#U8, where the range is forced to be over the U8 type).