Memory instructions
Opcode | P/U | Category | Description |
ADDLD | user | memory | add then load |
LD | user | memory | load |
LDSTO | user | memory | load and store |
STO | user | memory | store |
STO2 | user | memory | store twice |
Preface
Dauug|36’s very short pipeline depth leaves it with an extremely limited selection of addressing modes.
Advantages
- All instructions execute in just 4 clock cycles.
- Having 512 registers per program makes up for lack of memory throughput.
Disadvantages
- Register indirect is (almost) the only addressing mode for data.
- Addresses must be computed to the last bit before any load or store instruction.
As with all things Dauug, exceptions to these restrictions exist. So in addition to the LD and STO instructions we expect to find, we also have ADDLD, LDSTO, and STO2 with peculiar characteristics that can be leveraged in specific cases.
Virtual memory
Dauug|36 uses paged virtual memory for data. All nonprivileged memory instructions honor a page table that maps a user program’s idea of an address in memory to an address within physical memory that the user has been granted access to use. This is generally transparent to the user—unless the user is hoping to find memory that belongs to another user or the operating system!
All five user memory instructions have privileged counterparts that bypass the page table to directly access physical memory with no restrictions. Their names can be predicted by replacing the substring LD with RDM (read data memory), and the substring STO with WDM (write data memory).
ADDLD
Add then load
Syntax |
dest = base addld offset |
Register | Signedness |
All | ignored |
1 opcode only |
Flag | Set if and only if |
N | bit 35 of the result is set |
Z | all result bits are zero |
T | flag does not change |
R | flag does not change |
ADDLD (add then load) approximates a base + offset scheme for reading from data memory, except the addition is somewhat broken. The advantage to using ADDLD is that for cases where the addition quirk is known to be harmless, a base + offset read from memory can be done in a single instruction instead of two.
ADDLD adds bits 0–5 of offset to the corresponding bits of base with wraparound, and adds bits 6–11 of offset to the corresponding bits of base with wraparound. This is done in the ALU’s alpha layer while the page table is converting the virtual page (bits 12–22 of base) to its corresponding physical page. Wraparound occurs in tribbles 0 and 1, because the alpha RAMs doing the addition act simultaneously and cannot intercommunicate, so no carry passes out of either tribble. The two six-bit sums, along with the retrieved physical page, form a physical address from which a word is fetched from data memory. This result is written to register dest. The N and Z flags are set as if dest is a signed register. T and R do not change.
Safety for ADDLD comes via the use of “ADDLD-compatible” pointers, which are pointers to structures that either (i) do not cross 64-word boundaries, or (ii) are aligned on and fully within a power-of-two boundary not larger than 2^12 (4096) words. The convenience of this safety is that compatibility is established when memory for structures is allocated, rather than when ADDLD is used. Thus all that is necessary to make a pointer ADDLD-compatible is to use an ADDLD-compatible allocator.
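As a rough illustration of the two rules, the following Python sketch tests whether a structure's placement is ADDLD-compatible and confirms by brute force that the tribble-wise sum then agrees with ordinary addition. All function names are hypothetical, and rule (ii) is interpreted here as "base is a multiple of some power of two up to 2^12, and the structure is no larger than that power of two".

```python
def addld_address(base, offset):
    """Tribble-wise sum as described for ADDLD; page bits come from base alone."""
    t0 = ((base & 0o77) + (offset & 0o77)) & 0o77
    t1 = (((base >> 6) & 0o77) + ((offset >> 6) & 0o77)) & 0o77
    return (base & ~0o7777) | (t1 << 6) | t0


def is_addld_compatible(base, size):
    """Hypothetical check of rules (i) and (ii) for a structure of `size` words."""
    if size < 1:
        return False
    last = base + size - 1
    # Rule (i): the structure does not cross a 64-word boundary.
    if base >> 6 == last >> 6:
        return True
    # Rule (ii): aligned on, and fully within, a power-of-two block
    # of at most 2**12 words. Blocks of 64 words or fewer fall under rule (i).
    for k in range(7, 13):
        block = 1 << k
        if base % block == 0 and size <= block:
            return True
    return False


def addld_is_exact(base, size):
    """True if ADDLD matches base + offset for every offset in the structure."""
    return all(addld_address(base, off) == base + off for off in range(size))


print(is_addld_compatible(60, 4), addld_is_exact(60, 4))            # True True
print(is_addld_compatible(62, 4), addld_is_exact(62, 4))            # False False
print(is_addld_compatible(0x1100, 256), addld_is_exact(0x1100, 256))  # True True
```

Under this reading, a compatible placement never produces a carry out of either tribble, which is why the check can be done once at allocation time.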
Because ADDLD-compatible allocators fragment the free memory pool to satisfy alignment constraints, choosing ADDLD can increase a program’s data memory consumption. For block sizes of 1 through 10 words, the overhead is less than 7% when all blocks are of the same size. Some block sizes, such as 33 words, will have overhead approaching 100%, although this overhead could be reclaimed in part by partitioning free blocks according to their size.
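One way to see where these percentages could come from is a simple packing model, sketched below in Python. The model is an assumption, not the documented allocator: equal-size blocks of up to 64 words are packed into 64-word cells so that none crosses a cell boundary, and larger blocks are rounded up to the next power of two. The function name is hypothetical.

```python
def addld_overhead(block_words):
    """Fractional memory overhead for equal-size blocks (assumed packing model)."""
    if not 1 <= block_words <= 4096:
        raise ValueError("block size must be 1..4096 words in this model")
    if block_words <= 64:
        # As many whole blocks as fit in one 64-word cell; the rest is wasted.
        used = (64 // block_words) * block_words
        return 64 / used - 1
    # Larger blocks occupy a slot rounded up to the next power of two.
    slot = 1 << (block_words - 1).bit_length()
    return slot / block_words - 1


for n in (5, 10, 33):
    print(n, round(100 * addld_overhead(n), 1), "%")
# 5 6.7 %
# 10 6.7 %
# 33 93.9 %
```

Under this model the worst case for sizes 1 through 10 is about 6.7%, and a 33-word block wastes about 94% extra, which is consistent with the figures quoted above.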
Never do this
Never use ADDLD to access elements of an array that might grow beyond 4096 words. Your test cases will turn out great, but in the field your software will fail. And just try to debug that! This restriction does not apply to the privileged counterpart ADDRDM, which has no page table to get in the way.
Why there is no ADDSTO instruction
Dauug|36 supports a maximum of two operands per instruction, but ADDSTO (add then store) would require three operands: base, offset, and a word to write. This exceeds what the architecture can do in one instruction.
LD
Load
Syntax |
dest = ld addr |
Register | Signedness |
All | ignored |
1 opcode only |
Flag | Set if and only if |
N | bit 35 of the result is set |
Z | all result bits are zero |
T | flag does not change |
R | flag does not change |
LD (load) fetches a word from virtual data memory address addr and stores the result in register dest. The N and Z flags are set as if dest is a signed register. T and R do not change.
LDSTO
Load and store
Syntax |
dest = addr ldsto tval |
Register | Signedness |
All | ignored |
1 opcode only |
Flag | Set if and only if |
N | bit 35 of the result is set |
Z | all result bits are zero |
T | flag does not change |
R | flag does not change |
LDSTO (load and store) atomically fetches the word from virtual data memory address addr and stores the result in register dest, while storing the transposed value of register tval to the same virtual address addr. The N and Z flags are set as if dest is a signed register. T and R do not change.
LDSTO effectively rotates tval into memory with transposition, and what was in memory rotates to dest. It is permissible to use the same register for dest and tval, in which case LDSTO becomes a register-memory swap with transposition. (Technically addr could also use the same register, although that scenario seems unlikely to me.)
The point of LDSTO is not to save time, but to provide an atomic operation that can implement semaphores in shared memory. Transposition of the right operand is electrically unavoidable due to the instruction being limited to four CPU cycles, but it doesn’t matter much. In a simple semaphore, the right operand would be 0 or 1, which will not change value when transposed. In cases where transposition matters, TXOR can be added to write the intended value this way:

    tval = 0 txor val
    dest = addr ldsto tval
TXOR’s presence as a separate instruction does not break the atomicity of the LDSTO.
Code to obtain a lock on a semaphore by atomically writing a 1 over a 0 at location addr would look like the following. There is no problem writing a 1 over someone else’s 1, but LDSTO will let us know this happened so we don’t claim the semaphore.

    waiting:
        tval = addr ldsto 1
        jump == ready       ; If tval == 0, we obtained the semaphore.
        yield
        nop                 ; YIELD does not take effect immediately.
        jump waiting        ; Needn't replace other user's 1 with our 1.
    ready:
        ; Critical section goes here.
    done:
        ; When it's time to release the semaphore, write 0 to addr.
        ; Since the semaphore is ours, we don't need to read what it was.
        sto addr = 0
Do not use LDSTO on write-protected memory locations, because your program will leave the semaphore unlocked while proceeding as if the lock were acquired.
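The locking protocol and the write-protection hazard can be modeled in a few lines of Python. This is a toy sketch under stated assumptions: the class and function names are hypothetical, atomicity is taken for granted because the model is single-threaded, transposition is omitted because only 0 and 1 are stored, and a write-protected location is assumed to return its old value while silently ignoring the store.

```python
class DataMemory:
    """Toy word-addressed memory (hypothetical model, not the real hardware)."""

    def __init__(self, write_protected=()):
        self.words = {}
        self.write_protected = set(write_protected)

    def sto(self, addr, value):
        # Stores to write-protected locations have no effect.
        if addr not in self.write_protected:
            self.words[addr] = value

    def ldsto(self, addr, value):
        # Return the old word and store the new one as a single step.
        old = self.words.get(addr, 0)
        self.sto(addr, value)
        return old


def try_acquire(mem, addr):
    """One pass of the 'waiting' loop: we own the semaphore if the old value was 0."""
    return mem.ldsto(addr, 1) == 0


def release(mem, addr):
    mem.sto(addr, 0)


mem = DataMemory()
print(try_acquire(mem, 100))  # True: we wrote 1 over 0
print(try_acquire(mem, 100))  # False: someone (here, us) already holds it
release(mem, 100)
print(try_acquire(mem, 100))  # True again after release

# Hazard: on a write-protected word, every caller believes it got the lock.
ro = DataMemory(write_protected={7})
print(try_acquire(ro, 7), try_acquire(ro, 7))  # True True
```

The final line shows why the warning matters: the load still reports 0, so two contenders both enter the critical section.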
STO
Store
Syntax |
sto dest = val |
Register | Signedness |
All | ignored |
1 opcode only |
No flags changed |
STO (store) copies the data in register val to data memory at the virtual address contained in register dest. No flags are modified.
It is not an error if virtual address dest points into a write-protected physical page; however, in this situation STO will have no effect. (If you’re curious about write protection, every word of Dauug|36 physical data memory is accessible at two addresses that differ only at bit 35. The address with bit 35 set can be read from, but not written to. Privileged programs can easily overcome write protection by clearing bit 35, but user programs are stuck with how bit 35 is set in the page table.)
STO2
Store twice
Syntax |
sto2 dests = tval |
Register | Signedness |
All | ignored |
1 opcode only |
No flags changed |
STO2 (store twice) transposes and stores the word in register tval to two memory locations determined by dests, based on the following table. No flags are changed.
dests mod 4 | addresses written to |
0 | dests, dests + 1 |
1 | dests, dests + 1 |
2 | dests, dests + 1 |
3 | dests, dests − 3 |
The reason this instruction cycles addresses modulo 4 is that STO2 operates the data RAM in burst mode, and it’s the RAM itself that modifies the address for the second write.
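The table above amounts to incrementing only the two low address bits for the second write. A minimal Python sketch of that rule follows; the function name is hypothetical and the model covers addresses only, not the transposition of the stored value.

```python
def sto2_addresses(dests):
    """Return the two addresses written by STO2, per the dests mod 4 table."""
    # The upper bits stay fixed; only the low two bits advance, wrapping 3 -> 0.
    second = (dests & ~3) | ((dests + 1) & 3)
    return dests, second


for d in (8, 9, 10, 11):
    print(d, sto2_addresses(d))
# 8 (8, 9)
# 9 (9, 10)
# 10 (10, 11)
# 11 (11, 8)
```

The last row is the wraparound case from the table, where the second write lands at dests − 3.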
The reason the value to write is transposed is that it has to be introduced via the ALU’s beta layer as a right operand. This isn’t much inconvenience, because most uses of STO2 are for filling memory with 0, which is its own transpose.
The purpose of STO2 is to speed memset loops, particularly when an operating system needs to erase a 4096-word memory page for privacy before a user program is allowed to access it. This is especially helpful during electrical simulations of Dauug|36 running an operating system, because STO2 reduces the time needed to zero a page from 57 seconds (using already highly optimized code) to 31 seconds.
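To make the saving concrete, here is a small Python model of zeroing one page with paired stores. It is a sketch only: the function name and the list-based memory are assumptions, the page base is assumed to be even so that every second write lands at dests + 1, and no timing is modeled.

```python
PAGE_WORDS = 4096


def zero_page_with_sto2(memory, page_base):
    """Zero one page using modeled STO2 operations; return how many were issued."""
    issued = 0
    for dests in range(page_base, page_base + PAGE_WORDS, 2):
        # dests is even, so dests mod 4 is 0 or 2 and the pair is dests, dests + 1.
        memory[dests] = 0
        memory[dests + 1] = 0
        issued += 1
    return issued


memory = [1] * (2 * PAGE_WORDS)
count = zero_page_with_sto2(memory, PAGE_WORDS)
print(count)                           # 2048
print(any(memory[PAGE_WORDS:]))        # False: the target page is all zeros
print(all(memory[:PAGE_WORDS]))        # True: the neighboring page is untouched
```

The model issues 2048 store operations instead of the 4096 that single-word stores would need for the same page.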