(Replying to PARENT post)
It offers a modification to Knuth's algorithm D which avoids all hardware division by utilizing an approximate reciprocal. The 3-by-2 division is particularly interesting as it allows you to skip the quotient loop (D3). It's a nice little optimization that I've spent some time tinkering with in my own code. This technique is notably used in GMP and other arbitrary precision integer libraries.
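For anyone curious, the reciprocal trick in the simpler 2-by-1 case looks roughly like the sketch below (the 3-by-2 division extends the same idea to a two-limb divisor). This follows the 2-by-1 variant as given in Möller and Granlund's "Improved division by invariant integers"; function names are mine, and a plain hardware divide is left in the one-time precomputation for brevity (the real thing avoids that too):

  #include <stdint.h>

  /* v = floor((2^64 - 1) / d) - 2^32, for a normalized divisor d (top bit set).
   * Computed once per divisor; the per-limb division below then needs no
   * hardware divide at all. */
  static uint32_t reciprocal_32(uint32_t d)
  {
      return (uint32_t)(UINT64_MAX / d - ((uint64_t)1 << 32));
  }

  /* Divide the two-limb value (u1:u0) by normalized d, given v = reciprocal_32(d).
   * Requires u1 < d so the quotient fits in a single limb. */
  static uint32_t div_2by1(uint32_t *rem, uint32_t u1, uint32_t u0,
                           uint32_t d, uint32_t v)
  {
      uint64_t q  = (uint64_t)u1 * v + (((uint64_t)u1 << 32) | u0);
      uint32_t q1 = (uint32_t)(q >> 32) + 1;   /* candidate quotient */
      uint32_t q0 = (uint32_t)q;
      uint32_t r  = u0 - q1 * d;               /* candidate remainder, mod 2^32 */

      if (r > q0) {        /* adjust down (at most once) */
          q1--;
          r += d;
      }
      if (r >= d) {        /* rare second adjustment up */
          q1++;
          r -= d;
      }
      *rem = r;
      return q1;
  }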
(Replying to PARENT post)
In some ways the assembly implementation was easier as you aren't forever fighting with the compiler to get it to generate the correct instructions.
The 32 x 32 to 64-bit multiply is one example. There isn't a direct way of expressing that in C: you do a 64 x 64-bit multiply of 32-bit variables cast to 64 bits and hope the compiler gets the hint that it should be doing a 32 x 32 to 64-bit multiply, not a 64 x 64-bit multiply. Checking that, of course, involves reading the assembly output.
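For illustration, the cast-and-hope idiom is something like:

  #include <stdint.h>

  /* The usual C idiom: widen the 32-bit operands and hope the compiler
   * notices that a single widening multiply (x86 MUL r32, ARM UMULL, ...)
   * is enough, rather than a full 64 x 64 multiply. Modern GCC and Clang
   * generally do recognise this pattern. */
  static inline uint64_t mul_32x32_64(uint32_t a, uint32_t b)
  {
      return (uint64_t)a * b;
  }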
The other thing you'd like is addition with carry. It is possible to express this in C and have the compiler recognise what you are doing and generate the correct instructions, but it is a pain! I noted the other day that Zig has @addWithOverflow, which is a step in the right direction.
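For example, one common way of spelling a multi-limb add in plain C and hoping the compiler turns the carry chain into ADC instructions (GCC and Clang's __builtin_add_overflow is another route) is roughly:

  #include <stddef.h>
  #include <stdint.h>

  /* Multi-limb addition written so that a good compiler can recognise the
   * carry chain and emit add-with-carry instructions. */
  static void add_n(uint64_t *r, const uint64_t *a, const uint64_t *b, size_t n)
  {
      uint64_t carry = 0;
      for (size_t i = 0; i < n; i++) {
          uint64_t s = a[i] + carry;
          carry = (s < carry);        /* carry out of a[i] + carry */
          s += b[i];
          carry += (s < b[i]);        /* carry out of s + b[i] */
          r[i] = s;
      }
  }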
I can see why the author of the article found so many things to criticize in the algorithm studied, but I think they are mostly an artifact of the fact that C (and most high-level languages) is a poor choice for low-level arithmetic.
(Replying to PARENT post)
32-bit version:
  const unsigned b = 65536; // Number base (16 bits).
  unsigned un1, un0,        // Norm. dividend LSD's.
           vn1, vn0,        // Norm. divisor digits.
           q1, q0,          // Quotient digits.
           un32, un21, un10,// Dividend digit pairs.
           rhat;            // A remainder.
So a properly adapted 64-bit version should use uint64_t and a 32-bit number base. I did not attempt to follow the author's arguments for why he advocates 32-bit variables or why that deviation from the original would be correct. Using 64-bit variables does not seem to be a bottleneck at all here, so why bother.
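For concreteness, a sketch of how those declarations might look once adapted (keeping the original's variable names; this is illustrative, not the article author's code):

  #include <stdint.h>

  // Sketch of the corresponding declarations for a 128-by-64-bit version:
  // uint64_t variables throughout, with a 32-bit number base.
  const uint64_t b = 1ULL << 32;  // Number base (32 bits).
  uint64_t un1, un0,              // Norm. dividend LSD's.
           vn1, vn0,              // Norm. divisor digits.
           q1, q0,                // Quotient digits.
           un32, un21, un10,      // Dividend digit pairs.
           rhat;                  // A remainder.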
(Replying to PARENT post)
It is a very tricky algorithm to get right. There are many edge cases that only happen for ~2^-64 fractions of the input space, so they are hard to find even with fuzz-testing. The best strategy is to implement it for small limbs first and fuzz that well (a sketch of the idea below).
[1] https://github.com/0xProject/OpenZKP/blob/master/algebra/u25...
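For what it's worth, with limbs that small the 2-by-1 case can even be checked exhaustively rather than fuzzed. A rough sketch, where divide_2by1_8 stands in for whatever small-limb instantiation of the real routine is being tested (here it is just a plain shift-and-subtract division so the example is self-contained):

  #include <stdint.h>
  #include <stdio.h>

  /* Placeholder routine under test: restoring shift-and-subtract division of
   * the two-limb value (u1:u0) by d, requiring u1 < d. Substitute the real
   * small-limb instantiation of your algorithm here. */
  static uint8_t divide_2by1_8(uint8_t *rem, uint8_t u1, uint8_t u0, uint8_t d)
  {
      uint8_t q = 0, r = u1;
      for (int i = 7; i >= 0; i--) {
          unsigned t = ((unsigned)r << 1) | ((u0 >> i) & 1);
          if (t >= d) { t -= d; q |= (uint8_t)(1u << i); }
          r = (uint8_t)t;
      }
      *rem = r;
      return q;
  }

  int main(void)
  {
      /* With 8-bit limbs the whole 2-by-1 input space is ~2^24 cases, so it
       * can be checked exhaustively against native division. */
      for (unsigned d = 1; d < 256; d++)
          for (unsigned u1 = 0; u1 < d; u1++)
              for (unsigned u0 = 0; u0 < 256; u0++) {
                  uint8_t r;
                  uint8_t q = divide_2by1_8(&r, (uint8_t)u1, (uint8_t)u0, (uint8_t)d);
                  unsigned u = (u1 << 8) | u0;
                  if (q != u / d || r != u % d) {
                      printf("mismatch: u=%u d=%u q=%u r=%u\n", u, d, q, r);
                      return 1;
                  }
              }
      puts("all cases ok");
      return 0;
  }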
(Replying to PARENT post)
These algorithms are probably easier to implement in assembly than in a higher-level language.
(Replying to PARENT post)
> The format above illustrates the style in which all of the algorithms throughout this book will be presented. Each algorithm we consider has been given an identifying letter (E in the preceding example), and the steps of the algorithm are identified by this letter followed by a number (E1, E2, E3). The chapters are divided into numbered sections; within a section the algorithms are designated by letter only, but when algorithms are referred to in other sections, the appropriate section number is attached. For example, we are now in Section 1.1; within this section Euclid’s algorithm is called Algorithm E, while in later sections it is referred to as Algorithm 1.1E.
In the case of this particular "Algorithm D", the letter "D" was probably chosen because it's an algorithm for "Division of nonnegative integers" (it follows algorithms "A", "S" and "M" for you-can-guess-what); indeed the same volume also has another "Algorithm D (Factoring with sieves)" (=Algorithm 4.5.4D), and an "Algorithm D (Division of polynomials over a field)" (=Algorithm 4.6.1D). I think that, since within a section it is referred to simply as "Algorithm D" several times, this convention is not very obvious unless one has looked at either the second page of the first volume or the appendix. Or I guess within a certain field a particular section of TAOCP is implicit, so it's clear from context.