# Characters and tokens

We begin with how characters are interpreted by the assembler to form tokens. A **token** is an atomic unit within the Dauug|36 assembly language. Only three kinds of tokens are visible in programs:

- numbers (also called **constants**)
- identifiers
- other symbols

If you read the assembler source code, you’ll see the assembler also adds tokens to represent line numbers and an end-of-file marker. But these are implementation details of the assembler, as opposed to elements of the assembly language itself.

## Source code character set

Assembly language programs are written in 7-bit ASCII. The only supported character codes are 9 (tab), 10 (newline), and 32–126 (space through tilde). In the future, the UTF-8 encoding may be permitted. No other encodings will be supported.

Here is a summary of characters that may have unanticipated meanings:

| Symbol | Use |
|---|---|
| `0`–`9` | digits 0 through 9 |
| `a`–`z` | digits 10 through 35 |
| `A`–`Z` | digits 36 through 61 |
| `'` (apostrophe) | digit 62 |
| `.` (period) | digit 63 |
| `_` (underscore) | digit group separator |
| `` ` `` (backtick) | numeral base if not 10 |
| `~` (tilde) | negative sign for numeric constants |
| `;` | comment to end of line |
| `( )` | comment spanning any number of lines |
| tab | same as space |
| newline | end of statement (unless parenthesized) |

## Numbers

In Dauug|36 assembly language, a number is a sequence of one or more **digits** that is optionally followed by a backtick and radix. Because any number base from 1 through 64 can be used, the language provides 64 digits to represent the values 0 through 63: the characters `0` through `9`, the ASCII lowercase and uppercase letters, the apostrophe, and the period. Underscores may be mixed into the sequence of digits at will; these underscores have no effect other than to possibly improve readability. Whitespace is not allowed anywhere within numbers.

To write a number that is not in base 10, a backtick and radix are appended. The radix may either be written as a base 10 number between 1 and 64, or one of the following abbreviations may be used:

| Abbrev. | Same as | Derivation |
|---|---|---|
| `` `u `` | `` `1 `` | unary |
| `` `b `` | `` `2 `` | binary |
| `` `o `` | `` `8 `` | octal |
| `` `d `` | `` `10 `` | decimal |
| `` `h `` | `` `16 `` | hexadecimal |
| `` `t `` | `` `64 `` | tetrasexagesimal, tribble |

Note to world: the word *hexadecimal* does not start with *x*.

Here are a few ways of writing `19` other than the usual:

    1110111100111110001111111`1 0010011`b 201`3 19`10 13`h j`t

Digits must stay within the indicated radix for a number. For example, `` 19`o `` is not a valid number. Here is a table of digits:

| Digit | Value | Digit | Value | Digit | Value | Digit | Value |
|---|---|---|---|---|---|---|---|
| `0` | 00 | `g` | 16 | `w` | 32 | `M` | 48 |
| `1` | 01 | `h` | 17 | `x` | 33 | `N` | 49 |
| `2` | 02 | `i` | 18 | `y` | 34 | `O` | 50 |
| `3` | 03 | `j` | 19 | `z` | 35 | `P` | 51 |
| `4` | 04 | `k` | 20 | `A` | 36 | `Q` | 52 |
| `5` | 05 | `l` | 21 | `B` | 37 | `R` | 53 |
| `6` | 06 | `m` | 22 | `C` | 38 | `S` | 54 |
| `7` | 07 | `n` | 23 | `D` | 39 | `T` | 55 |
| `8` | 08 | `o` | 24 | `E` | 40 | `U` | 56 |
| `9` | 09 | `p` | 25 | `F` | 41 | `V` | 57 |
| `a` | 10 | `q` | 26 | `G` | 42 | `W` | 58 |
| `b` | 11 | `r` | 27 | `H` | 43 | `X` | 59 |
| `c` | 12 | `s` | 28 | `I` | 44 | `Y` | 60 |
| `d` | 13 | `t` | 29 | `J` | 45 | `Z` | 61 |
| `e` | 14 | `u` | 30 | `K` | 46 | `'` | 62 |
| `f` | 15 | `v` | 31 | `L` | 47 | `.` | 63 |

The backtick `` ` `` and the underscore `_` are not digits.
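The digit and radix rules above can be sketched as a small Python function. This is only an illustration, not the assembler's actual code: the name `parse_constant` is hypothetical, and the sketch deliberately rejects radix 1 because the text does not spell out how unary digits are counted. It does accept a leading `~` for negative constants.

```python
import string

# Digit alphabet from the table above: 0-9, a-z, A-Z, apostrophe, period.
DIGITS = string.digits + string.ascii_lowercase + string.ascii_uppercase + "'."

# Radix abbreviations from the text. "u" maps to 1 so that it reaches the
# range check below and is rejected with a clear message.
RADIX_ABBREVIATIONS = {"u": 1, "b": 2, "o": 8, "d": 10, "h": 16, "t": 64}


def parse_constant(text: str) -> int:
    """Return the integer value of a constant such as 13`h or ~10 (hypothetical helper)."""
    negative = text.startswith("~")
    body = text[1:] if negative else text
    digits, mark, radix_text = body.partition("`")
    if mark:
        radix = RADIX_ABBREVIATIONS.get(radix_text)
        if radix is None:
            if not (radix_text.isascii() and radix_text.isdigit()):
                raise ValueError(f"bad radix in {text!r}")
            radix = int(radix_text)
    else:
        radix = 10
    if not 2 <= radix <= 64:
        # Assumption of this sketch: unary (radix 1) is left unsupported.
        raise ValueError(f"radix {radix} is outside this sketch's range")
    digits = digits.replace("_", "")  # underscores only aid readability
    if not digits:
        raise ValueError(f"no digits in {text!r}")
    value = 0
    for ch in digits:
        d = DIGITS.find(ch)
        if d < 0 or d >= radix:
            raise ValueError(f"digit {ch!r} is not valid in base {radix}")
        value = value * radix + d
    return -value if negative else value
```

With this sketch, the non-unary spellings of 19 listed earlier all parse to the same value, and `` 19`o `` raises `ValueError` because `9` is not an octal digit.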

Like APL, Dauug|36 assembly requires a symbol to indicate a negative constant that is distinct from the symbol used for subtraction and negation. Negative constants in Dauug|36 are written using a tilde, so the statement

    x = ~10

will generate an `IMN` (immediate negative) instruction to set `x` to -10. In contrast, the statement

    x = -10

will generate an `S` (subtract) instruction to take 10 away from 0. There are places where this makes a difference.

Here are some examples using underscores to separate digit groups, tetrasexagesimal notation, and negative constants, all based on some important 36-bit quantities:

| Decimal | Tetrasexagesimal | Meaning |
|---|---|---|
| `~34_359_738_368` | `` ~w00000`t `` | most negative 36-bit number |
| `0` | `` 000000`t `` | zero |
| `34_359_738_367` | `` v.....`t `` | most positive signed 36-bit number |
| `68_719_476_735` | `` ......`t `` | most positive unsigned 36-bit number |
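Since 64 is 2 to the 6th power, a 36-bit quantity fits in exactly six base-64 digits. The following Python sketch goes the other way from parsing and renders an integer in tetrasexagesimal notation. The function name `to_tetrasexagesimal` and the default width of six digits are assumptions made for illustration; they are not part of the assembler.

```python
import string

# Digit alphabet: 0-9, a-z, A-Z, apostrophe, period (values 0 through 63).
DIGITS = string.digits + string.ascii_lowercase + string.ascii_uppercase + "'."


def to_tetrasexagesimal(value: int, width: int = 6) -> str:
    """Render value as a base-64 constant, zero-padded to `width` digits."""
    sign = "~" if value < 0 else ""  # tilde marks a negative constant
    magnitude = abs(value)
    chars = []
    while magnitude:
        magnitude, d = divmod(magnitude, 64)
        chars.append(DIGITS[d])
    text = "".join(reversed(chars)).rjust(width, "0")
    return f"{sign}{text}`t"
```

Applied to the four quantities in the table, this sketch reproduces the tetrasexagesimal column.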

## Identifiers

For a token to represent a number, Dauug|36 assembly language requires:

- at least one digit, *and*
- either all digits are between `0` and `9`, *or* a suffix with `` ` `` and a radix

An **identifier** is a token that consists only of digits and underscores but does not meet the requirements above to be a number. So the following are identifiers:

    x y Kitty a17 _city_

You would expect this in many languages. But Dauug|36 is more permissive, so the following are also identifiers:

    3.14159 a' a'' ... 'X' my.little.identifier 'twas 0.1

Again, the reason all of these are identifiers is that

- there is no radix mark `` ` ``, *and*
- a digit beyond base 10 (lowercase, uppercase, apostrophe, period) appears.
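The number-versus-identifier rule can be sketched in Python as follows. The function name `classify` is hypothetical, and the sketch takes one shortcut: any token with digits followed by a radix mark is reported as a number without checking that the radix and digits are actually valid.

```python
import string

# The 64 digit characters: 0-9, a-z, A-Z, apostrophe, period.
DIGITS = set(string.digits + string.ascii_lowercase + string.ascii_uppercase + "'.")


def classify(token: str) -> str:
    """Return 'number' or 'identifier' for a token made of digits, '_' and '`'."""
    digits_part, mark, _radix = token.partition("`")
    if any(ch not in DIGITS and ch != "_" for ch in digits_part):
        raise ValueError(f"{token!r} contains characters outside digits and '_'")
    has_digit = any(ch in DIGITS for ch in digits_part)
    if mark:
        if not has_digit:
            raise ValueError(f"{token!r} has a radix mark but no digits")
        return "number"  # radix validity is not checked in this sketch
    only_decimal = all(ch in string.digits for ch in digits_part if ch != "_")
    if has_digit and only_decimal:
        return "number"
    return "identifier"
```

Under these assumptions, `0.1` and `3.14159` come out as identifiers because the period is digit 63 and no radix mark is present, while `53_316_291_173` is a number.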

Identifiers are used as names for registers, labels, scopes, and CPU instructions.

### Treatment of underscores

The underscore `_`, which is not a digit, is treated a little differently between numbers and identifiers. Numbers are interpreted as if all underscores have been removed and aren’t part of any number. In identifiers, underscores are significant in determining names.

In other words, these numbers are all the same:

    53316291173 53_316_291_173 5__3316291173__

but these identifiers are all different:

    Dauug36 Dauug_36 Dauug__36 _Dauug36 Dauug36_ _Dauug_36_

## Comments

Comments begin with a semicolon and extend to the end of the line. For example:

    open_secret = 314159 ; one hundred thousand times pi

Those are called **single-line comments**, because they cannot span multiple lines. Also supported are **inline comments**, which begin and end explicitly with parentheses. An inline comment can span as many lines as desired and can be used to “comment out” a bunch of code, or to provide space within a source file for documentation. No parsing or syntax checking is done inside inline comments, although only the ASCII symbols are guaranteed to be supported. For example:

    x = (Dare I choose zero here?!?) 0

Inline comments do not nest. So how does one comment out a block of code or documentation that contains parentheses itself? This is done using multiple consecutive parentheses. Inline comments don’t technically begin with the `(` character, but with a series of one or more consecutive `(` characters. They end with an equal number of consecutive `)` characters. So `((` and `))` can enclose blocks that do not contain exactly two consecutive right parentheses. Notably but perhaps less usefully, `(` and `)` can enclose blocks that contain multiple consecutive left or right parentheses, but not isolated ones. In any event, the runs of parentheses that begin and end inline comments can be as long as available memory permits. For example:

    diameter = 10
    (((
    The circumference is exactly pi times the diameter, but this
    program does not use any floating point (so pi is going to be 3).
    )))
    double_diameter = diameter + diameter
    circumference = diameter + double_diameter

Inline comments can enable a line of source code to continue after the closing parenthesis (-es). Inline comments can also be placed around newlines in order to provide **line continuations**. For example:

    a = b + (added to) c
    twelve_hundred = (
    ) 1200
    fixed_point_thirty_six_bit_approximation_of_pi_divided_by_four (
    ) = 110010_010000_111111_011010_101000_100010`b

Every comment, whether single-line or inline, behaves as a single space in terms of the assembly language syntax. So identifiers, numbers, symbols, etc. cannot be spliced together through inline comments.
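As a rough illustration of the comment rules, here is a Python sketch that replaces every comment with a single space. It is not the assembler's implementation. The name `strip_comments` is hypothetical, and the sketch rests on two readings of the text: an inline comment closes at a run of exactly as many `)` as the run of `(` that opened it, and a `;` outside an inline comment hides everything up to the end of the line.

```python
def strip_comments(source: str) -> str:
    """Replace each single-line or inline comment with one space."""
    out = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch == ";":
            # Single-line comment: skip to the newline, which is kept.
            while i < n and source[i] != "\n":
                i += 1
            out.append(" ")
        elif ch == "(":
            # Count the opening run of left parentheses.
            depth = 0
            while i < n and source[i] == "(":
                depth += 1
                i += 1
            # Look for a run of exactly `depth` right parentheses.
            while True:
                if i >= n:
                    raise ValueError("unterminated inline comment")
                if source[i] == ")":
                    run = 0
                    while i < n and source[i] == ")":
                        run += 1
                        i += 1
                    if run == depth:
                        break
                else:
                    i += 1
            out.append(" ")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Because an inline comment that surrounds a newline collapses to one space, the line continuation examples above end up on a single logical line in this sketch.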

## Reserved words

Dauug|36 assembly language does not have any **reserved words**. This frees the programmer from needing to memorize a list of such words or worry about future releases adding unexpected reserved words. Instead, the structure of a line determines a word’s possible use. As an example, consider an ill-named loop counter named `jump` that counts from 0 through 9:

    unsigned jump
    jump = 0
    loop:
    cmp jump - 10
    jump >= done
    ;
    ; This is the middle of the loop. Do what you wish here.
    ;
    jump = jump + 1
    jump loop
    done:

The above code is legal and provides the intended loop. The structure of the six lines containing the token `jump` is such that two are correctly recognized as `JUMP` instructions, while the other four treat `jump` as a variable. Equally permissible would have been to instead name the `loop` label `jump` like this:

    unsigned i
    i = 0
    jump:
    cmp i - 10
    jump >= done
    ;
    ; This is the middle of the loop. Do what you wish here.
    ;
    i = i + 1
    jump jump
    done:

In the above, the assembler knows that in the line `jump jump`, the first `jump` is a CPU instruction, and the second `jump` is a label that indicates a branch destination.

*Aside*. If you were extra alert, you perhaps noticed comparisons that subtract 10 in the two preceding examples, where the variable is `unsigned` and the result will be less than zero at times. These are fully-supported, legal comparisons, and they do not set the `T`(emporal) or `R`(ange) flags. This is possible because the result of the subtraction is not kept and therefore needn’t fit within any particular destination register.

## Symbols

The assembly language has certain symbols, some of which consist of multiple consecutive characters. These could have ambiguous meanings; for instance, does `<=` express one idea or two, and if one idea, can it also be written as `< =`? The answer is that symbols are grouped from left to right as a line is scanned, and if there is more than one choice of grouping, the longest contiguous group prevails. Backtracking does not occur.

Here are the affected groups, called **multiple-character symbols**, in the language as of 30 June 2023:

| Symbol | Meaning |
|---|---|
| `++` | add with carry |
| `--` | subtract with carry |
| `~--` | reverse subtract with carry |
| `~-` | reverse subtract |
| `!&` | NAND |
| `!\|` | NOR |
| `!^` | XNOR |
| `<=` | less than or equal to |
| `==` | equal to |
| `!=` | not equal to |
| `>=` | greater than or equal to |
| `::` | scope or scope resolution |

Nearly all of the characters seen in these multiple-character symbols also have their own use when they appear separately:

| Symbol | Meaning |
|---|---|
| `+` | add |
| `-` | subtract |
| `&` | AND |
| `\|` | OR |
| `^` | XOR |
| `!` | NOT |
| `<` | less than |
| `=` | assignment |
| `>` | greater than |
| `:` | label |

So if you see, for instance, the assembler input:

    ~-----

although it does not represent valid syntax, you know that the symbols will be interpreted as if they had been written:

    ~-- -- -
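The left-to-right, longest-match grouping can be sketched in a few lines of Python. The function name `split_symbols` is hypothetical, the symbol list is copied from the two tables above, and the standalone `~` is included on the assumption that the negative-constant sign is scanned as a symbol of its own. Numbers and identifiers are not handled here.

```python
# Multiple-character and single-character symbols from the tables above,
# plus "~" (assumed here to be scanned as a one-character symbol).
SYMBOLS = [
    "++", "--", "~--", "~-", "!&", "!|", "!^", "<=", "==", "!=", ">=", "::",
    "+", "-", "&", "|", "^", "!", "<", "=", ">", ":", "~",
]

# Trying longer symbols first gives the longest-match behavior.
BY_LENGTH = sorted(SYMBOLS, key=len, reverse=True)


def split_symbols(text: str) -> list[str]:
    """Group symbol characters left to right, longest match first, no backtracking."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1  # whitespace separates symbols and is otherwise ignored
            continue
        for symbol in BY_LENGTH:
            if text.startswith(symbol, i):
                tokens.append(symbol)
                i += len(symbol)
                break
        else:
            raise ValueError(f"unexpected character {text[i]!r} at position {i}")
    return tokens
```

In this sketch, `split_symbols("~-----")` yields `~--`, `--`, `-`, matching the grouping shown above, while `<=` stays one symbol and `< =` becomes two.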