Assemblers

Old assembler

The original Dauug|36 assembler is deprecated, but still in the source tree under old-asm/. It is still the assembler used by the Dauug|36 virtual machine (vm/ in the source tree). The old assembler can also be used by the Dauug|36 electrical simulator (netsim/ in the source tree) by using the keyword old-asm in lieu of asm in .ns test scripts.

Pages 283–287 and 359–366 of the dissertation, in conjunction with the old assembler’s source code and regression tests, are the best source of information about the old assembler.

The Dauug|36 virtual machine is actually part of the old assembler and is obsolete for testing anything except the ALU. The problem with the VM is that it does little more than model the flow of data through the arithmetic logic unit. (It also implements the JUMP family of instructions.) This makes the VM useless for simulating privileged instructions, multitasking, operating systems, I/O, and so on.

An update to the VM to reflect the rest of the system is desirable, because logic-only testing can run orders of magnitude faster than modeling electrical components, propagation times, etc. But there is no practical extraction from the system netlist into C code for the VM, and the netlist is changing quickly as the architecture is extended. So an update to the virtual machine would require not only considerable labor up front, but also careful validation and ongoing maintenance to remain faithful to the circuit board design as it evolves.

XA|36 Cross assembler

The current Dauug|36 assembler, named XA|36, is written in C and can be found under cross-asm/ in the source tree. This is the cross assembler, so called because although the assembler assembles programs for the Dauug|36 architecture, the assembler itself runs on a different architecture (i.e., an x86 running a POSIX OS). Thus code produced by this assembler must either “cross” onto another physical system to run, or run in a simulator. The Dauug|36 electrical simulator uses the cross assembler as its default assembler.

The cross assembler was written in May 2023 and adds features such as function-scoped variables, ease of porting, and expandability that were not available via the old assembler.

The cross assembler is only partially written in C. As much as possible is human-written bytecode that guides a Turing machine-like process. Small patches of C code interpret this bytecode in the process of assembling Dauug|36 programs. The benefit to this approach is that when I write the self-hosted assembler, which will be written in Dauug|36 assembly language, only the small patches of C code from the cross assembler will require rewriting. The bytecode that is already present in the cross assembler can be copied and used unmodified in the self-hosted assembler. The cross assembler and self-hosted assembler will generate exactly the same object code for any assembly source code.

The scope and capabilities of the cross assembler are deliberately minimal and will remain so. For more information, see the self-hosted assembler section below.

Cross assembler source organization

These files contain the cross assembler source code:

asapi.c	API functions with external linkage
as.c	compiling this builds the assembler as a library
ascoll.c	collections and data structures for assembler
as.h	assembler API prototypes (unneeded for standalone assembler)
asglos.c	glossary generation (register-allocated constants)
aslex.c	bytecode-driven lexer to break source code into tokens
asmain.c	compiling this builds a standalone assembler
assyn.c	bytecode-driven assembler to convert tokens into executable code
asutil.c	small helper functions and macros
oclist.h	link into firmware source: opcode name (string) to opcode number
opcode.h	link into firmware source: opcode name (macro) to opcode number

The assembler is built by either compiling as.c, which #includes everything else it needs, or by compiling asmain.c, which merely adds a main() function after including as.c. The apparent lack of encapsulation and module separation may be concerning to some maintainers. The reasons I invoke the C compiler just once to build the assembler are these. First, I am sensitive to compilation time. Second, I don’t trust incremental builds.

The collections in ascoll.c stick to linked lists. Their time complexity is abysmal, but we seldom need to assemble the assembler. More important to me was that the source code is straightforward to audit and straightforward to port to assembly language. Some collections should be refactored: typedefs Reg, Label, and Keep should be descended from Scope instead of each pointing to a Scope.
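One way to read "descended from Scope" in C is to embed a Scope as the first member of each derived struct, so that a pointer to the derived node can also be handled as a pointer to its Scope. The sketch below is only an assumption about what such a refactoring could look like. The field names (next, name, number) and the functions scope_push and scope_find are hypothetical and are not taken from ascoll.c.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical base type: one node of a singly linked list. */
typedef struct Scope {
    struct Scope *next;   /* next node in the list, or NULL */
    const char *name;     /* identifier this node is known by */
} Scope;

/* Hypothetical derived type: a Reg "is a" Scope because Scope comes first. */
typedef struct Reg {
    Scope scope;          /* must remain the first member */
    int number;           /* illustrative payload */
} Reg;

/* Push any Scope-derived node onto the front of a list. */
void scope_push(Scope **head, Scope *node)
{
    node->next = *head;
    *head = node;
}

/* Linear search by name: O(n), in keeping with the linked-list approach. */
Scope *scope_find(Scope *head, const char *name)
{
    for (Scope *s = head; s != NULL; s = s->next)
        if (strcmp(s->name, name) == 0)
            return s;
    return NULL;
}
```

Because Scope is the first member, a Scope * returned by scope_find can be cast back to Reg * when the caller knows the node really is a Reg. Label and Keep could follow the same pattern.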

There are two finite state machines (FSMs) driven by bytecode, which is stored as const char * data. One is in aslex.c and determines token separation. In hindsight, perhaps this would have been better implemented in straight C, even though an assembly language version would need to be written. The other is in assyn.c and translates the syntax of assembly language programs into object code. The bytecode in assyn.c is probably a good idea. Both FSMs do a lot of unnecessary backtracking so that their source code is easier for a human to comprehend and audit.
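To make the idea of a const char * driven FSM concrete, here is a minimal sketch of a token separator whose transitions live entirely in a string, with a small patch of C that interprets it. Everything in it is hypothetical: the record layout, the names lex_code, lex_find, and lex_count, and the two-state design are illustrations and do not reflect the actual bytecode in aslex.c. It is also far simpler than the real lexer and does no backtracking.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical bytecode: 4-character records of
   (state, input class, next state, action).
   States:  'S' = between tokens, 'W' = inside a token.
   Classes: 's' = whitespace or end of input, 'c' = anything else.
   Actions: '.' = skip, '+' = extend token, '!' = finish token. */
static const char *const lex_code =
    "SsS."
    "ScW+"
    "WcW+"
    "WsS!";

/* Find the record matching (state, cls); NULL if the bytecode has none. */
const char *lex_find(char state, char cls)
{
    for (const char *p = lex_code; *p != '\0'; p += 4)
        if (p[0] == state && p[1] == cls)
            return p;
    return NULL;
}

/* Count the tokens in src and store the longest token length in *longest.
   Returns -1 if no rule exists for some (state, class) pair. */
int lex_count(const char *src, size_t *longest)
{
    char state = 'S';
    size_t len = 0;
    int tokens = 0;

    *longest = 0;
    for (size_t i = 0; ; i++) {
        unsigned char ch = (unsigned char)src[i];
        char cls = (ch == '\0' || isspace(ch)) ? 's' : 'c';
        const char *rec = lex_find(state, cls);
        if (rec == NULL)
            return -1;
        switch (rec[3]) {
        case '+':                 /* extend the current token */
            len++;
            break;
        case '!':                 /* finish the current token */
            tokens++;
            if (len > *longest)
                *longest = len;
            len = 0;
            break;
        default:                  /* '.': nothing to do */
            break;
        }
        state = rec[2];
        if (ch == '\0')           /* end of input was fed as class 's' */
            break;
    }
    return tokens;
}
```

In a design like this, only lex_find and lex_count would need rewriting for another host language, while the lex_code string could be carried over unchanged.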

These other files may be present:

as.o	object code for assembler library
fibclean.a	assembly program used for testing assembler
makefile	GNU Make script to build assembler
no-bu	list of files that don’t need to be backed up
notes/	data collected while implementing the assembler
preempt.a	assembly program used for testing assembler
xa	cross assembler executable

The cross assembler does not adhere to C’s “strict aliasing” rule, which causes incorrect operation if the C compiler does certain optimizations. I didn’t pay much attention to exactly where I broke the rule—maybe I’m thinking like an assembly language programmer—but you’ll find that makefile disables these optimizations for GCC. If you use a different compiler, you may need to alter something to make the cross assembler work correctly.
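As a generic illustration of the rule (not taken from the assembler source), the sketch below contrasts a read that violates strict aliasing with a memcpy-based alternative that is well defined. The function names are hypothetical, and the example assumes a 32-bit float. GCC's -fno-strict-aliasing option is the usual way to turn off the affected optimizations, but whether makefile uses exactly that flag is an assumption here.

```c
#include <stdint.h>
#include <string.h>

/* Assumption for this example only: float and uint32_t have the same size. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "example assumes 32-bit float");

/* Violates strict aliasing: reads a float object through a uint32_t lvalue.
   This is undefined behavior in ISO C. GCC's -fno-strict-aliasing turns off
   the type-based alias analysis that can otherwise miscompile such code. */
uint32_t bits_by_cast(const float *f)
{
    return *(const uint32_t *)f;
}

/* Well-defined alternative: copy the bytes into an object of the target type.
   Compilers typically reduce this memcpy to a single register move. */
uint32_t bits_by_memcpy(const float *f)
{
    uint32_t u;
    memcpy(&u, f, sizeof u);
    return u;
}
```

Code written in the second style does not depend on compiler flags, which would matter if the cross assembler were built with something other than GCC.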

The test files fibclean.a and preempt.a were forked at one time from the regression test cases with these names. They are not interchangeable with their same-named ancestors.

The .a suffix for assembler source code does not adhere to conventions on certain systems. It’s common to use .s for assembly language, but I chose a different suffix to preclude confusion between x86 assembly code and Dauug|36 assembly code. In so doing, I may have created new confusion, because many programs assume .a denotes a Unix archive (ar) file.

SHA|36 Self-hosted assembler

Future tool. Not implemented as of 6 July 2023.

The first major program written for the Dauug|36 architecture will be a real-time operating system. The second major Dauug|36 program, which will run on this operating system, will be a self-hosted assembler, tentatively named SHA|36. SHA|36 will replicate the cross assembler in this manner:

All source that assembles on XA|36 must also assemble on SHA|36.
All source that assembles on SHA|36 must also assemble on XA|36.
Object code from XA|36 and SHA|36 for the same source must be bit-for-bit identical.
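The third requirement can be checked mechanically once both assemblers exist. The sketch below assumes, purely for illustration, that each assembler's object code has been saved to a host file and opened as a stream. The name same_bytes is hypothetical, and nothing here reflects the actual Dauug|36 object format.

```c
#include <stdio.h>

/* Return 1 if both streams yield exactly the same bytes up to end of file,
   and 0 otherwise. A read error looks like end of file to this sketch. */
int same_bytes(FILE *a, FILE *b)
{
    int ca, cb;

    do {
        ca = fgetc(a);
        cb = fgetc(b);
        if (ca != cb)         /* differing byte, or one stream ended early */
            return 0;
    } while (ca != EOF);
    return 1;                 /* both streams ended together */
}
```

A length mismatch is caught by the same comparison, because EOF on one stream will not equal a data byte from the other.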

SHA|36 will be written in Dauug|36 assembly language, meaning that it can assemble itself on a Dauug|36 minicomputer or Dauug|36 simulator. This capability is called self-hosting. The challenge with self-hosting is that an SHA|36 executable will not be available to assemble the first copy of itself. Another tool has to exist first that can also assemble SHA|36, and that tool is the cross assembler XA|36 that was written in May 2023.

Establishing toolchain trust

We have a problem trusting XA|36 to faithfully assemble SHA|36 without introducing a backdoor, malware, or other exploitable defects. The problem is that XA|36 depends on GCC, an x86 host operating system, and many libraries. Not only do these amount to many millions of lines of source code, but trusting them would also require either an object code audit of every dependency, or an audit and build of every generation of source code for GCC and its dependencies, starting from GCC’s introduction in 1987. No one is capable of this, and no one will pay for this.

Instead of evaluating XA|36 for trustworthiness, I will do an instruction-by-instruction audit of SHA|36’s object code side-by-side with the SHA|36 source code. And so can anyone else later on. The outcome will be a “clean” executable for the self-hosted assembler that is known to exactly match its source code. SHA|36’s only dependency will be the Dauug|36 architecture and firmware, which we can also control.

For the benefit of the SHA|36 object code audit, which is a human-centered manual process, SHA|36 and its features need to be as small as they can reasonably be. Moreover, because the cross assembler XA|36 needs to perfectly replicate SHA|36, XA|36 and its features also need to be as small as they can reasonably be. This is why as of 6 July 2023, the source code for XA|36 comes to fewer than 3,000 lines, not counting the imported lists in oclist.h and opcode.h.

Firmware trust

To shield the firmware from corruption by a buggy or compromised C toolchain, the firmware generator may someday be rewritten in Dauug|36 assembly language. This project is not on my radar at this time.

For scope clarification, the firmware generator (which generates firmware) is not part of any of the Dauug|36 assemblers (which generate executable code). It is mentioned here because once a self-hosted assembler is working, the CPU and its firmware become part of the toolchain.

Further assemblers

SHA|36 is intended to provide the Dauug|36 community an easily-traceable, uncorrupted tool from which future applications, toolchains, and assemblers can be built. SHA|36 will provide an “object code firewall,” in the sense that programs assembled by SHA|36 should not require an object code audit, but only an audit of their source code.

To minimize the number and timespan of generations of toolchain code that must be audited, my hope is that all future assemblers for Dauug|36 use only language features that are supported by the XA|36 and SHA|36 twin assemblers. Future assemblers for Dauug|36 can implement as many advanced features as desired, such as assistance computing operands for the permutation instructions, but their source code should be written in a manner that allows SHA|36 to assemble these future assemblers.

Put differently, it should never be necessary to audit more than two generations of Dauug|36 assemblers for security. The first generation audited, Generation 1, will contain only fully object-code-audited assemblers such as SHA|36. The second generation audited, Generation 2, will contain every other assembler, every one of which will be assembled by SHA|36 or another Generation 1 assembler.

An assembler “generation” may contain more than one assembler pass, provided the same source code is used for each pass. For example, suppose that a future Generation 2 assembler named “X” optimizes register allocation by using graph coloring. When SHA|36 assembles X, a working copy of X will exist, but its registers won’t be efficiently allocated. It’s allowed for X to re-assemble itself so that fewer registers are used. The re-assembly is permissible, because re-assembly will not create additional source code that requires auditing.