Memory instructions

Opcode	P/U	Category	Description
`ADDLD`	user	memory	add then load
`LD`	user	memory	load
`LDSTO`	user	memory	load and store
`STO`	user	memory	store
`STO2`	user	memory	store twice

Preface

Dauug|36’s very short pipeline depth leaves it with an extremely limited selection of addressing modes.

Advantages

All instructions execute in just 4 clock cycles.
Having 512 registers per program makes up for lack of memory throughput.

Disadvantages

Register indirect is (almost) the only addressing mode for data.
Addresses must be computed to the last bit before any load or store instruction.

As with all things Dauug, exceptions to these restrictions exist. So in addition to the LD and STO instructions we expect to find, we also have ADDLD, LDSTO, and STO2 with peculiar characteristics that can be leveraged in specific cases.

Virtual memory

Dauug|36 uses paged virtual memory for data. All nonprivileged memory instructions honor a page table that maps a user program’s idea of an address in memory to an address within physical memory that the user has been granted access to use. This is generally transparent to the user—unless the user is hoping to find memory that belongs to another user or the operating system!

All five user memory instructions have privileged counterparts that bypass the page table to directly access physical memory with no restrictions. Their names can be predicted by replacing the substring LD with RDM (read data memory), and the substring STO with WDM (write data memory).
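The renaming rule can be applied mechanically. The sketch below is only an illustration of the substring rule stated above; `privileged_name` is a hypothetical helper, and apart from ADDRDM (which this page mentions by name) the resulting opcodes are predictions of the rule rather than names confirmed here.

```python
def privileged_name(user_opcode: str) -> str:
    """Apply the stated rule: LD becomes RDM, STO becomes WDM."""
    return user_opcode.replace("LD", "RDM").replace("STO", "WDM")


# The five user memory opcodes listed at the top of this page.
USER_OPCODES = ["ADDLD", "LD", "LDSTO", "STO", "STO2"]

for opcode in USER_OPCODES:
    print(opcode, "->", privileged_name(opcode))
```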

`ADDLD` Add then load

Syntax

dest = base addld offset

Register	Signedness
All	ignored
	1 opcode only

Flag	Set if and only if
`N`	bit 35 of the result is set
`Z`	all result bits are zero
`T`	flag does not change
`R`	flag does not change

ADDLD (add then load) approximates a base + offset scheme for reading from data memory, except the addition is somewhat broken. The advantage to using ADDLD is that for cases where the addition quirk is known to be harmless, a base + offset read from memory can be done in a single instruction instead of two.

ADDLD adds bits 0–5 of offset to the corresponding bits of base with wraparound, and adds bits 6–11 of offset to the corresponding bits of base with wraparound. This is done in the ALU’s alpha layer while the page table is converting the virtual page (bits 12–22 of base) to its corresponding physical page. Wraparound occurs in tribbles 0 and 1, because the alpha RAMs doing the addition act simultaneously and cannot intercommunicate. The two six-bit sums, together with the retrieved physical page, form the physical address from which a word is fetched from data memory. The fetched word is written to register dest. The N and Z flags are set as if dest is a signed register. T and R do not change.
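To make the broken addition concrete, here is a small Python model of the address computation described above. It is a sketch under assumptions, not a description of the hardware: the page table is modeled as a plain dictionary, the physical address is assumed to be the physical page followed by the two six-bit sums, and the names `addld_address` and `page_table` are invented for illustration.

```python
TRIBBLE_MASK = 0x3F   # six bits
PAGE_MASK = 0x7FF     # bits 12-22 of base are the virtual page (11 bits)


def addld_address(base, offset, page_table):
    """Model the address ADDLD reads from, per the description above."""
    # Tribble 0: bits 0-5 are added with wraparound; the carry is lost.
    low = ((base & TRIBBLE_MASK) + (offset & TRIBBLE_MASK)) & TRIBBLE_MASK
    # Tribble 1: bits 6-11 are added independently, also with wraparound.
    high = (((base >> 6) & TRIBBLE_MASK)
            + ((offset >> 6) & TRIBBLE_MASK)) & TRIBBLE_MASK
    # The virtual page comes from base alone; offset never affects it.
    physical_page = page_table[(base >> 12) & PAGE_MASK]
    # Assumed layout: physical page above the two six-bit sums.
    return (physical_page << 12) | (high << 6) | low


page_table = {0: 5}   # hypothetical mapping: virtual page 0 -> physical page 5

print(addld_address(60, 3, page_table) & 0xFFF)    # 63: no carry needed
print(addld_address(60, 4, page_table) & 0xFFF)    # 0, not 64: the carry is lost
print(addld_address(0, 4096, page_table) & 0xFFF)  # 0: offset bits above 11 are ignored
```

In this model an ordinary base + offset addition and ADDLD agree only when neither six-bit sum overflows and the offset fits in 12 bits.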

Safety for ADDLD comes via the use of “ADDLD-compatible” pointers, which are pointers to structures that either (i) do not cross 64-word boundaries, or (ii) are aligned on and fully within a power-of-two boundary not larger than 2¹² words. The convenience of this safety is that compatibility is established when memory for structures is allocated, rather than when ADDLD is used. Thus all that is necessary to make a pointer ADDLD-compatible is to use an ADDLD-compatible allocator.
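The two compatibility conditions can be checked mechanically. The following Python sketch is one reading of them, not an allocator: condition (ii) is interpreted as "the base is a multiple of some power of two P no larger than 4096, and the structure is no longer than P words". The names `quirky_sum` and `is_addld_compatible` are hypothetical.

```python
def quirky_sum(base, offset):
    """Virtual address ADDLD effectively uses: two carry-less six-bit adds."""
    low = ((base & 0x3F) + (offset & 0x3F)) & 0x3F
    high = (((base >> 6) & 0x3F) + ((offset >> 6) & 0x3F)) & 0x3F
    return (base & ~0xFFF) | (high << 6) | low


def is_addld_compatible(base, size):
    """True if a structure of `size` words at `base` meets (i) or (ii)."""
    if size < 1:
        return False
    # (i) The structure does not cross a 64-word boundary.
    if base // 64 == (base + size - 1) // 64:
        return True
    # (ii) Aligned on, and fully within, a power-of-two block up to 2**12.
    # Blocks of 64 words or fewer are already covered by (i).
    block = 128
    while block <= 4096:
        if size <= block and base % block == 0:
            return True
        block *= 2
    return False


# For a compatible pointer, ADDLD's sum matches ordinary addition
# for every offset inside the structure.
base, size = 4096 + 256, 200
print(is_addld_compatible(base, size))
print(all(quirky_sum(base, k) == base + k for k in range(size)))

# An incompatible pointer: 8 words starting at 60 cross a 64-word boundary.
print(is_addld_compatible(60, 8), quirky_sum(60, 4) == 64)
```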

Because ADDLD-compatible allocators fragment the free memory pool to satisfy alignment constraints, choosing ADDLD can increase a program’s data memory consumption. For block sizes of 1 through 10 words, the overhead is less than 7% when all blocks are of the same size. Some block sizes, such as 33 words, will have overhead approaching 100%, although this overhead could be reclaimed in part by partitioning free blocks according to their size.

Never do this

Never use ADDLD to access elements of an array that might grow beyond 4096 words. Your test cases will turn out great, but in the field your software will fail. And just try to debug that! This restriction does not apply to the privileged counterpart ADDRDM, which has no page table to get in the way.

Why there is no `ADDSTO` instruction

Dauug|36 supports a maximum of two operands per instruction, but ADDSTO (add then store) would require three operands—base, offset, and a word to write. This exceeds what the architecture can do in one instruction.

`LD` Load

Syntax

dest = ld addr

Register	Signedness
All	ignored
	1 opcode only

Flag	Set if and only if
`N`	bit 35 of the result is set
`Z`	all result bits are zero
`T`	flag does not change
`R`	flag does not change

LD (load) fetches a word from virtual data memory address addr and stores the result in register dest. The N and Z flags are set as if dest is a signed register. T and R do not change.

`LDSTO` Load and store

Syntax

dest = addr ldsto tval

Register	Signedness
All	ignored
	1 opcode only

Flag	Set if and only if
`N`	bit 35 of the result is set
`Z`	all result bits are zero
`T`	flag does not change
`R`	flag does not change

LDSTO (load and store) atomically fetches the word from virtual data memory address addr and stores the result in register dest, while storing the transposed value of register tval to the same virtual address addr. The N and Z flags are set as if dest is a signed register. T and R do not change.

LDSTO effectively rotates tval into memory with transposition, and what was in memory rotates to dest. It is permissible to use the same register for dest and tval, in which case LDSTO becomes a register-memory swap with transposition. (Technically addr could also use the same register, although that scenario seems unlikely to me.)

The point of LDSTO is not to save time, but to provide an atomic operation that can implement semaphores in shared memory. Transposition of the right operand is electrically unavoidable due to the instruction being limited to four CPU cycles, but it doesn’t matter much. In a simple semaphore, the right operand would be 0 or 1, which will not change value when transposed. In cases where transposition matters, TXOR can be added to write the intended value this way:

tval = 0 txor val
dest = addr ldsto tval

TXOR’s presence as a separate instruction does not break the atomicity of the LDSTO.
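The TXOR-then-LDSTO idiom can be modeled in a few lines of Python. This sketch rests on assumptions this page does not spell out: transposition is taken to mean treating the 36-bit word as a 6×6 bit matrix (six tribbles of six bits) and swapping rows with columns, and TXOR is taken to XOR its left operand with the transposed right operand. Both assumptions are consistent with 0 and 1 being unchanged by transposition and with transposing twice restoring the original value. Memory is modeled as a dictionary.

```python
def transpose36(word):
    """Transpose a 36-bit word viewed as a 6x6 bit matrix (assumed meaning)."""
    result = 0
    for row in range(6):
        for col in range(6):
            if (word >> (6 * row + col)) & 1:
                result |= 1 << (6 * col + row)
    return result


def txor(left, right):
    """Assumed TXOR behavior: left XOR transposed right."""
    return left ^ transpose36(right)


def ldsto(memory, addr, tval):
    """Return the old word at addr and write the transposed tval there."""
    old = memory[addr]
    memory[addr] = transpose36(tval)
    return old


memory = {100: 7}
val = 0o123456

tval = txor(0, val)              # tval = 0 txor val
dest = ldsto(memory, 100, tval)  # dest = addr ldsto tval

print(dest)                      # 7: the word previously in memory
print(memory[100] == val)        # True: the two transpositions cancel
```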

Code to obtain a lock on a semaphore by atomically writing a 1 over a 0 at location addr would look like the following. There is no problem writing a 1 over someone else’s 1, but LDSTO will let us know this happened so we don’t claim the semaphore.

waiting:
    tval = addr ldsto 1
    jump == ready           ; If tval == 0, we obtained the semaphore.
    yield
    nop                     ; YIELD does not take effect immediately.
    jump waiting            ; Needn't replace other user's 1 with our 1.

ready:
    ; Critical section goes here.

done:
    ; When it's time to release the semaphore, write 0 to addr.
    ; Since the semaphore is ours, we don't need to read what it was.
    sto addr = 0

Do not use LDSTO on write-protected memory locations, because your program will leave the semaphore unlocked while proceeding as if the lock had been acquired.
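The locking protocol, and the way it breaks on write-protected memory, can be sketched in Python. This is a single-threaded model for illustration only: `Word`, `try_acquire`, and `release` are hypothetical names, the write to a protected word is assumed to be silently dropped (as described for STO), and transposition is omitted because the semaphore only ever holds 0 or 1.

```python
class Word:
    """One word of data memory, optionally write-protected."""

    def __init__(self, value=0, write_protected=False):
        self.value = value
        self.write_protected = write_protected


def ldsto(word, tval):
    """Return the old value; store tval unless the word is write-protected."""
    old = word.value
    if not word.write_protected:
        word.value = tval
    return old


def try_acquire(semaphore):
    """Write 1 over the semaphore; we own it only if it previously held 0."""
    return ldsto(semaphore, 1) == 0


def release(semaphore):
    """Owner writes 0 back; no need to read the previous value."""
    ldsto(semaphore, 0)


# Normal case: the second contender sees the first contender's 1.
sem = Word()
print(try_acquire(sem), try_acquire(sem))   # True False
release(sem)
print(try_acquire(sem))                     # True

# Write-protected case: the 1 is never stored, so everyone "wins".
broken = Word(write_protected=True)
print(try_acquire(broken), try_acquire(broken))   # True True
```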

`STO` Store

Syntax

sto dest = val

Register	Signedness
All	ignored
	1 opcode only

No flags changed

STO (store) copies the data in register val to data memory at the virtual address contained in register dest. No flags are modified.

It is not an error if virtual address dest points into a write-protected physical page; however, in this situation STO will have no effect. (If you’re curious about write protection, every word of Dauug|36 physical data memory is accessible at two addresses that differ only at bit 35. The address with bit 35 set can be read from, but not written to. Privileged programs can easily overcome write protection by clearing bit 35, but user programs are stuck with how bit 35 is set in the page table.)
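The bit 35 aliasing can be pictured with a toy Python model of physical data memory. It is a sketch of the behavior described above and nothing more: `PhysicalMemory` is a hypothetical class, unwritten words are assumed to read as 0, and the page table that supplies bit 35 to user programs is not modeled.

```python
WRITE_PROTECT_BIT = 1 << 35


class PhysicalMemory:
    """Each word is reachable at two addresses differing only in bit 35."""

    def __init__(self):
        self.words = {}

    def load(self, addr):
        # Both aliases read the same underlying word.
        return self.words.get(addr & ~WRITE_PROTECT_BIT, 0)

    def store(self, addr, value):
        # A store through the alias with bit 35 set has no effect.
        if addr & WRITE_PROTECT_BIT:
            return
        self.words[addr] = value


ram = PhysicalMemory()
ram.store(0x1234, 42)                       # writable alias
ram.store(0x1234 | WRITE_PROTECT_BIT, 99)   # read-only alias: ignored
print(ram.load(0x1234))                     # 42
print(ram.load(0x1234 | WRITE_PROTECT_BIT)) # 42
```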

`STO2` Store twice

Syntax

sto2 dests = tval

Register	Signedness
All	ignored
	1 opcode only

No flags changed

STO2 (store twice) transposes and stores the word in register tval to two memory locations determined by dests, based on the following table. No flags are changed.

`dests` mod 4	addresses written to
0	dests, dests + 1
1	dests, dests + 1
2	dests, dests + 1
3	dests, dests − 3

The reason this instruction cycles addresses modulo 4 is that STO2 operates the data RAM in burst mode, and it’s the RAM itself that modifies the address for the second write.
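The table amounts to incrementing only the two lowest address bits. A minimal Python sketch of that rule follows; `sto2_addresses` is a hypothetical helper that models the address pair only, not the write itself.

```python
def sto2_addresses(dests):
    """Return the two addresses STO2 writes, per the table above."""
    # The second address keeps the upper bits of dests and increments
    # the lowest two bits modulo 4, as a burst within a 4-word group.
    second = (dests & ~3) | ((dests + 1) & 3)
    return dests, second


for dests in (8, 9, 10, 11):
    print(dests % 4, sto2_addresses(dests))
```

For `dests` values 8 through 11 this yields the pairs (8, 9), (9, 10), (10, 11), and (11, 8), matching the four rows of the table.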

The reason the value to write is transposed is that it has to be introduced via the ALU’s beta layer as a right operand. This isn’t much inconvenience, because most uses of STO2 are for filling memory with 0, which is its own transpose.

The purpose of STO2 is to speed memset loops, particularly when an operating system needs to erase a 4096-word memory page for privacy before a user program is allowed to access it. This is especially helpful during electrical simulations of Dauug|36 running an operating system, because STO2 improves the time needed to zero a page from 57 seconds (using already highly optimized code) to 31 seconds.
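A page-clearing loop built on STO2 might be modeled as follows. This Python sketch only counts store instructions; it says nothing about cycle counts or the loop's other instructions. The names `sto2_zero` and `zero_page` are hypothetical, the page base is assumed to be a multiple of 4096, and only the value 0 is written so that transposition can be ignored.

```python
PAGE_WORDS = 4096


def sto2_zero(memory, dests):
    """Model STO2 writing 0 (its own transpose) to two addresses."""
    second = (dests & ~3) | ((dests + 1) & 3)
    memory[dests] = 0
    memory[second] = 0


def zero_page(memory, page_base):
    """Clear one page, stepping by 2 so each STO2 covers dests and dests + 1."""
    issued = 0
    for addr in range(page_base, page_base + PAGE_WORDS, 2):
        sto2_zero(memory, addr)   # addr is even, so the pair never wraps
        issued += 1
    return issued


memory = [0o777] * (2 * PAGE_WORDS)   # two pages of leftover data
issued = zero_page(memory, PAGE_WORDS)

print(issued)                          # 2048 stores instead of 4096
print(any(memory[PAGE_WORDS:]))        # False: the second page is all zero
```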
