Proposed division instructions

Division is a beast, and there really isn’t a classical algorithm that works efficiently for all divisors. I will write a hybrid division function that triages the problem and selects a reasonably fast algorithm given the particulars of the problem. I will extend the firmware (that is, I will add ALU opcodes and layer operations) to improve execution time.

By “division,” I refer to a process involving all-unsigned 36-bit quantities that map a Numerator and Denominator onto a Quotient and a Remainder. In general across the subcases, Q is not known exactly until R is known exactly, so a single function will be written that computes both. There really isn’t a bunch of time efficiency that can be leveraged when you only care about just Q or just R.

The subcases to be detected and leveraged are:

small numerator (less than 64)
power-of-two denominator
small denominator (less than 64)
medium denominator (less than 4096)
large denominator
division by zero

Small numerator

When N < 64, division will be done very quickly by table lookup. The biggest trick will be exploiting theta to handle cases where D ≥ 64. Firmware would be extended to accelerate this.

Power-of-two denominator

Consider how code would be written to handle just the special case where D is 2. This would employ a right shift of 1 bit, which is notated from a left rotation perspective (43`o is 35 decimal). In assembly language, this looks like:

    cmp D - 2
    jump != not_2
    Q = D asr 434343_434343`o
    R = 0
    return
not_2:

This is a really minimal number of instructions. But if we want to generalize this to where D is any power of two, not one clock cycle needs to be added. Because the code would look like:

    shifts = glorp D
    jump +t not_po2
    Q = D asr shifts
    R = 0
    return
not_po2:

where GLORP is a yet-unnamed, single-instruction stacked unary macro that converts powers of two into a right shift control word, and sets the T flag when D isn’t a power of two. Firmware would be extended to do this.

What GLORP does—I promise not to name it that—internally is this. Each alpha RAM looks at its own tribble, and if that tribble is consistent with a power of two, that RAM outputs what it think the right shift control tribble would be. It also sends a bit to notify theta that a power of two is claimed. The beta RAMs output all 1s if any input bit is 1, and output 0 otherwise. In other words, beta is doing a bitwise OR from all alpha RAMs to all subwords in gamma. If only one alpha RAM sees a power of two, gamma will see the correct shift control word with all six tribbles filled. But if more than one alpha RAM sees a power of two, gamma will see (but not be able to detect) an incorrect shift pattern. Theta will catch the incorrect cases, setting T when the number of claiming alpha RAMs is not exactly 1.

Thus we’ll have a fast subcase to divide by any of the 36 powers of two, at no more computational cost than handling just one of the 36 cases.

Denominator less than 64

This case isn’t directly about the size of the denominator, but is actually that the binary representation of the denominator’s reciprocal is likely to repeat in short cycles. This likelihood is high enough to matter for numbers less than 64, and is lower enough for larger numbers to not be profitable on average for optimization.

Consider the problem of dividing by 47, which has 72-bit reciprocal:

000001010111001001100010000010101110010011000100000101011100100110001000`b

We only need a 36-bit reciprocal to do this, but this way you can see three full repetitions of the 23-bit cycle I’m talking about. This gives rise to a six-step process to almost exactly divide by 47:

Q = N × .101011`b
Q = Q + N × .000000_100100`b
Q = Q + N × .000000_000000_110001`b
Q = Q = Q + (Q shifted right 010111`b)
Q = Q = Q + (Q shifted right 111111`b)
Q = Q shifted right 000101`b

Steps 1–3 simply use the distributive law to multiply N by the first 23 bits of the reciprocal, skipping the five initial zero bits. You can find the three six-bit constants in order on the left side of the 72-bit reciprocal. Although only 18 bits is mentioned in these steps, we exploit our advantage that the rightmost 5 bits are 0s. So 23 bits of the reciprocal are accounted for in Q.

Step 4 shifts Q over 23 bits and replicates the reciprocal. This accounts for 46 bits of reciprocal, and since our word size is 36 bits, we’ve worked hard enough.

Step 5 does nothing, because a 36-bit word shifted right 63 bits will be 0. But for certain denominators, step 5 can be a faster process than step 3.

Step 6 is a 5-bit right shift, dividing the tentative quotient by 32.

These six steps combined, with suitable instructions added to the firmware, take only 22 instructions once given this 36-bit “control word” that catenates the parameters for each of the six steps right to left:

000101_111111_0101111_110001_100100_101011`b

Branches could be added to skip some of the steps (especially steps 3 and 5) if the corresponding tribble in the control word is zero. This would make worst-case execution time longer, but possibly shorten average execution time for most programs that divide. This decision can be made when the algorithm is written.

Once the almost-exact Q = N ÷ 47 is determined by the above process, the remainder is computed as R = N − Q × 47. The concern is that R may be too high by 47, in which case 47 is subtracted from R and Q is incremented. This concludes the division task. Note that the × 47 step only requires three CPU instructions, because the multiplication is short.

It would take only one stacked unary instruction to fetch the control word for a denominator less than 64; however, six denominators in this set fail to work. These are:

29 37 53 58 59 61

These denominators don’t work, because their binary representations don’t repeat quickly enough to encode as 18 bits in steps 1–3. These cases get forwarded to general algorithm for denominators less than 4096.

Denominator less than 4096

The division process for medium denominators is to have the firmware look up their reciprocal, then do long multiplication. The reciprocal can be obtained in two CPU instructions once firmware support is added. The first instruction replicates bits 6–11 of the denominator into each tribble, and the second instruction replicates bits 0–5 into each tribble and gets the reciprocal from the gamma RAMs. The long multiplication will take 35 instructions. Multiplying the quotient by the 12-bit divisor takes 7 or 8 instructions, depending on how I deal with a 6-bit shift. The remainder check and adjustments to the quotient and remainder take 4 instructions. So this method of division takes around 49 instructions.

Large denominator

We come to the general case where long division is necessary. This sounds scary, because human long division in base 10 is nightmarish, but long division in binary is a breeze. In this architecture—and possibly many others—long division only uses three instructions per quotient bit, plus a fixed number of instructions prior to and following the loop.

In the setup process, we align N and D at the left side of the word, and we’ll call these N’ and D’. For example, if N is

000000000000000001000010010111010100`b

then N’ is

100001001011101010000000000000000000`b

Computing N’ and D’ requires 10 CPU instructions total, because each will need an LLZ (light leading zeros stacked unary), HAM1 (Hamming weight part 1), HAM2, SWIZ (swizzle), and LSL (logical shift left). This alignment is necessary so that the T(emporal) flag correctly detects overflow after each subtraction in the loop.

Some bit lengths will also need computed, and the quotient will need to be preinitialized with a “stop bit” that appears in bit 35 when the desired number of iterations is finished. By defining the quotient as signed (although it isn’t), we can use a conditional jump to watch for the stop bit to appear.

The division loop itself would look like:

    unsigned N' D'
    signed Q

    ; code here is not shown

loop:
    N' = N' - D'
    Q = Q ++ Q
    jump <= loop

This is textbook binary long division using non-restoring subtraction. The Q line of the algorithm is an add with carry, shifting a 1 into Q from the right if the subtraction overflowed, and shifting 0 into Q otherwise.

I wouldn’t use long division to divide by 0 or 1, due to a few considerations that arise with computing quotient length and including a stop bit. I’ll have another method cover these cases. Due to its infrequent use, I would position the case of dividing by 0 where it avoids adding a branch instruction to as many other cases as possible.

After the loop, reparations are made in parallel for the overflowed subtraction, and the quotient and remainder are shifted and truncated as appropriate. The quotient, I have read, will always be odd, and the remainder may be positive, zero, or negative. This too is adjusted, and the long division is complete.

Long division is a worst-case scenario in terms of execution time, but there is some good news. Because the denominator is guaranteed to be at least 4096 (otherwise we will use a different method), the quotient will be a maximum of 24 bits, limiting the inner loop to 72 instructions at most. Allowing a generous 36 instructions for setup, teardown, and method selection would mean long division will require no more than 108 CPU instructions. My ego would like to see fewer than 100 instructions, but I can’t claim it until the code is written and tested.

Once again, even though my “long multiplication” example produces a 72-bit product, my discussion of “long division” only supports a 36-bit numerator. This is not a totally inconsistent perspective, but it depends on where you decide to make the concession. A routine to divide 72 bits by 36 bits can produce a result that doesn’t fit in 36 bits, so no matter what decision we make, the set consisting of division and multiplication will not be closed under some definition.

Division by zero

Dividing N by D is the process of finding some pair Q and R such that N = Q × D + R. When division is unsigned and D is not zero, there will be some maximum Q such that R is less than D, and we consider that pair to be the correct solution.

Although some people joke that even God can’t divide by zero, there is an obvious, unique solution for R when D is zero, in which case R happens to be N. So it’s clear what Dauug|36 needs to do in terms of R. It’s also clear that any value of Q forms a correct solution too, but for the purpose of standardization we should decide a value for Q that we can depend on obtaining.

I have not made a final decision, but I am leaning towards N; that is, N = N × 0 + N.

Here are a few things Dauug|36 will not support when dividing by zero:

raise a signal
send registered mail
throw an exception
email a professor
terminate the program
take out the grid
generate an interrupt
file a lawsuit
somehow keep R < D
babysit the coder

All from the above list are unnecessary functionality.

If a program asks the CPU to divide by zero, the CPU will divide by zero.

Case closed.