@math{X(t)} and @math{Y(t)} are evaluated and multiplied at 7 points, giving
values of @math{W(t)} at those points.  In GMP the following points are used,

@quotation
@multitable {@m{t=-1/2,t=inf}M} {MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM}
@item Point @tab Value
@item @math{t=0} @tab @m{x_0y_0,x0 * y0}, which gives @ms{w,0} immediately
@item @math{t=1/2} @tab @m{(x_3+2x_2+4x_1+8x_0)(y_3+2y_2+4y_1+8y_0),(x3+2*x2+4*x1+8*x0) * (y3+2*y2+4*y1+8*y0)}
@item @math{t=-1/2} @tab @m{(-x_3+2x_2-4x_1+8x_0)(-y_3+2y_2-4y_1+8y_0),(-x3+2*x2-4*x1+8*x0) * (-y3+2*y2-4*y1+8*y0)}
@item @math{t=1} @tab @m{(x_3+x_2+x_1+x_0)(y_3+y_2+y_1+y_0),(x3+x2+x1+x0) * (y3+y2+y1+y0)}
@item @math{t=-1} @tab @m{(-x_3+x_2-x_1+x_0)(-y_3+y_2-y_1+y_0),(-x3+x2-x1+x0) * (-y3+y2-y1+y0)}
@item @math{t=2} @tab @m{(8x_3+4x_2+2x_1+x_0)(8y_3+4y_2+2y_1+y_0),(8*x3+4*x2+2*x1+x0) * (8*y3+4*y2+2*y1+y0)}
@item @m{t=\infty,t=inf} @tab @m{x_3y_3,x3 * y3}, which gives @ms{w,6} immediately
@end multitable
@end quotation
The number of additions and subtractions for Toom-4 is much larger than for
Toom-3.  But several subexpressions occur multiple times, for example
@m{x_2+x_0,x2+x0} occurs for both @math{t=1} and @math{t=-1}.
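
For illustration, the sharing at @math{t=1} and @math{t=-1} might be coded as
follows for word-size coefficients (a hypothetical sketch, not GMP's actual
evaluation code):

@example
/* Evaluate X(t) = x3*t^3 + x2*t^2 + x1*t + x0 at t=1 and t=-1,
   sharing the x2+x0 and x3+x1 subexpressions between the points.  */
void
eval_pm1 (long x0, long x1, long x2, long x3, long *xt1, long *xtm1)
@{
  long even = x2 + x0;          /* used by both points */
  long odd  = x3 + x1;
  *xt1  = even + odd;           /* X(1)  */
  *xtm1 = even - odd;           /* X(-1) */
@}
@end example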

Toom-4 is asymptotically @math{O(N^@W{1.404})}, the exponent being
@m{\log 7/\log 4, log(7)/log(4)}, representing 7 recursive multiplies of 1/4 the
original size each.

@node FFT Multiplication, Other Multiplication, Toom 4-Way Multiplication, Multiplication Algorithms
@subsection FFT Multiplication
@cindex FFT multiplication
@cindex Fast Fourier Transform

At large to very large sizes a Fermat style FFT multiplication is used,
following Sch@"onhage and Strassen (@pxref{References}).  Descriptions of FFTs
in various forms can be found in many textbooks, for instance Knuth section
4.3.3 part C or Lipson chapter IX@.  A brief description of the form used in
GMP is given here.

The multiplication done is @m{xy \bmod 2^N+1, x*y mod 2^N+1}, for a given
@math{N}.  A full product @m{xy,x*y} is obtained by choosing @m{N \ge
\mathop{\rm bits}(x)+\mathop{\rm bits}(y), N>=bits(x)+bits(y)} and padding
@math{x} and @math{y} with high zero limbs.  The modular product is the native
form for the algorithm, so padding to get a full product is unavoidable.

The algorithm follows a split, evaluate, pointwise multiply, interpolate and
combine similar to that described above for Karatsuba and Toom-3.  A @math{k}
parameter controls the split, with an FFT-@math{k} splitting into @math{2^k}
pieces of @math{M=N/2^k} bits each.  @math{N} must be a multiple of
@m{2^k\times@code{mp_bits_per_limb}, (2^k)*@nicode{mp_bits_per_limb}} so
the split falls on limb boundaries, avoiding bit shifts in the split and
combine stages.

The evaluations, pointwise multiplications, and interpolation, are all done
modulo @m{2^{N'}+1, 2^N'+1} where @math{N'} is @math{2M+k+3} rounded up to a
multiple of @math{2^k} and of @code{mp_bits_per_limb}.  The results of
interpolation will be the following negacyclic convolution of the input
pieces, and the choice of @math{N'} ensures these sums aren't truncated.
@tex
$$ w_n = \sum_{{i+j = b2^k+n}\atop{b=0,1}} (-1)^b x_i y_j $$
@end tex
@ifnottex

@example
         ---
         \          b
w[n] =   /      (-1)  * x[i] * y[j]
         ---
     i+j==b*2^k+n
        b=0,1
@end example

@end ifnottex
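
The parameter selection can be sketched as follows (hypothetical helper
names, @code{GMP_LIMB_BITS} standing in for @code{mp_bits_per_limb}, and
@math{N} assumed already a multiple of
@m{2^k\times@code{mp_bits_per_limb}, (2^k)*mp_bits_per_limb}):

@example
#define GMP_LIMB_BITS  64     /* assumed limb size */

/* Round x up to a multiple of the power of two m.  */
static unsigned long
round_up_pow2 (unsigned long x, unsigned long m)
@{
  return (x + m - 1) & -m;
@}

/* N' = 2M+k+3 rounded up to a multiple of 2^k and of the limb size;
   both are powers of two, so rounding to the larger one suffices.  */
unsigned long
fft_nprime (unsigned long N, int k)
@{
  unsigned long M = N >> k;             /* bits per piece */
  unsigned long step = 1UL << k;
  if (step < GMP_LIMB_BITS)
    step = GMP_LIMB_BITS;
  return round_up_pow2 (2*M + k + 3, step);
@}
@end example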

The points used for the evaluation are @math{g^i} for @math{i=0} to
@math{2^k-1} where @m{g=2^{2N'/2^k}, g=2^(2N'/2^k)}.  @math{g} is a
@m{2^k,2^k}th root of unity mod @m{2^{N'}+1,2^N'+1}, which produces necessary
cancellations at the interpolation stage, and it's also a power of 2 so the
fast Fourier transforms used for the evaluation and interpolation do only
shifts, adds and negations.

The pointwise multiplications are done modulo @m{2^{N'}+1, 2^N'+1} and either
recurse into a further FFT or use a plain multiplication (Toom-3, Karatsuba or
basecase), whichever is optimal at the size @math{N'}.  The interpolation is
an inverse fast Fourier transform.  The resulting set of sums of @m{x_iy_j,
x[i]*y[j]} are added at appropriate offsets to give the final result.

Squaring is the same, but @math{x} is the only input so it's one transform at
the evaluate stage and the pointwise multiplies are squares.  The
interpolation is the same.

For a mod @math{2^N+1} product, an FFT-@math{k} is an @m{O(N^{k/(k-1)}),
O(N^(k/(k-1)))} algorithm, the exponent representing @math{2^k} recursed
modular multiplies each @m{1/2^{k-1},1/2^(k-1)} the size of the original.
Each successive @math{k} is an asymptotic improvement, but overheads mean each
is only faster at bigger and bigger sizes.  In the code, @code{MUL_FFT_TABLE}
and @code{SQR_FFT_TABLE} are the thresholds where each @math{k} is used.  Each
new @math{k} effectively swaps some multiplying for some shifts, adds and
overheads.

A mod @math{2^N+1} product can be formed with a normal
@math{N@cross{}N@rightarrow{}2N} bit multiply plus a subtraction, so an FFT
and Toom-3 etc can be compared directly.  A @math{k=4} FFT at
@math{O(N^@W{1.333})} can be expected to be the first faster than Toom-3 at
@math{O(N^@W{1.465})}.  In practice this is what's found, with
@code{MUL_FFT_MODF_THRESHOLD} and @code{SQR_FFT_MODF_THRESHOLD} being between
300 and 1000 limbs, depending on the CPU@.  So far it's been found that only
very large FFTs recurse into pointwise multiplies above these sizes.

When an FFT is to give a full product, the change of @math{N} to @math{2N}
doesn't alter the theoretical complexity for a given @math{k}, but for the
purposes of considering where an FFT might be first used it can be assumed
that the FFT is recursing into a normal multiply and that on that basis it's
doing @math{2^k} recursed multiplies each @m{1/2^{k-2},1/2^(k-2)} the size of
the inputs, making it @m{O(N^{k/(k-2)}), O(N^(k/(k-2)))}.  This would mean
@math{k=7} at @math{O(N^@W{1.4})} would be the first FFT faster than Toom-3.
In practice @code{MUL_FFT_THRESHOLD} and @code{SQR_FFT_THRESHOLD} have been
found to be in the @math{k=8} range, somewhere between 3000 and 10000 limbs.

The way @math{N} is split into @math{2^k} pieces and then @math{2M+k+3} is
rounded up to a multiple of @math{2^k} and @code{mp_bits_per_limb} means that
when @math{2^k@ge{}@nicode{mp_bits_per_limb}} the effective @math{N} is a
multiple of @m{2^{2k-1},2^(2k-1)} bits.  The @math{+k+3} means some values of
@math{N} just under such a multiple will be rounded to the next.  The
complexity calculations above assume that a favourable size is used, meaning
one which isn't padded through rounding, and it's also assumed that the extra
@math{+k+3} bits are negligible at typical FFT sizes.

The practical effect of the @m{2^{2k-1},2^(2k-1)} constraint is to introduce a
step-effect into measured speeds.  For example @math{k=8} will round @math{N}
up to a multiple of 32768 bits, so for a 32-bit limb there'll be 512 limb
groups of sizes for which @code{mpn_mul_n} runs at the same speed.  Or for
@math{k=9} groups of 2048 limbs, @math{k=10} groups of 8192 limbs, etc.  In
practice it's been found each @math{k} is used at quite small multiples of its
size constraint and so the step effect is quite noticeable in a time versus
size graph.

The threshold determinations currently measure at the mid-points of size
steps, but this is sub-optimal since at the start of a new step it can happen
that it's better to go back to the previous @math{k} for a while.  Something
more sophisticated for @code{MUL_FFT_TABLE} and @code{SQR_FFT_TABLE} will be
needed.

@node Other Multiplication, Unbalanced Multiplication, FFT Multiplication, Multiplication Algorithms
@subsection Other Multiplication
@cindex Toom multiplication

The Toom algorithms described above (@pxref{Toom 3-Way Multiplication} and
@pxref{Toom 4-Way Multiplication}) generalize to split into an arbitrary
number of pieces, as per Knuth section 4.3.3 algorithm C@.  This is not
currently used.  The notes here are merely for interest.

In general a split into @math{r+1} pieces is made, and evaluations and
pointwise multiplications done at @m{2r+1,2*r+1} points.  A 4-way split does 7
pointwise multiplies, 5-way does 9, etc.  Asymptotically an @math{(r+1)}-way
algorithm is @m{O(N^{\log(2r+1)/\log(r+1)}), O(N^(log(2*r+1)/log(r+1)))}.  Only
the pointwise multiplications count towards big-@math{O} complexity, but the
time spent in the evaluate and interpolate stages grows with @math{r} and has
a significant practical impact, with the asymptotic advantage of each @math{r}
realized only at bigger and bigger sizes.  The overheads grow as
@m{O(Nr),O(N*r)}, whereas in an @math{r=2^k} FFT they grow only as @m{O(N \log
r), O(N*log(r))}.

Knuth algorithm C evaluates at points 0,1,2,@dots{},@m{2r,2*r}, but exercise 4
uses @math{-r},@dots{},0,@dots{},@math{r} and the latter saves some small
multiplies in the evaluate stage (or rather trades them for additions), and
has a further saving of nearly half the interpolate steps.  The idea is to
separate odd and even final coefficients and then perform algorithm C steps C7
and C8 on them separately.  The divisors at step C7 become @math{j^2} and the
multipliers at C8 become @m{2tj-j^2,2*t*j-j^2}.

Splitting odd and even parts through positive and negative points can be
thought of as using @math{-1} as a square root of unity.  If a 4th root of
unity was available then a further split and speedup would be possible, but no
such root exists for plain integers.  Going to complex integers with
@m{i=\sqrt{-1}, i=sqrt(-1)} doesn't help, essentially because in Cartesian
form it takes three real multiplies to do a complex multiply.  The existence
of @m{2^k,2^k}th roots of unity in a suitable ring or field lets the fast
Fourier transform keep splitting and get to @m{O(N \log r), O(N*log(r))}.

Floating point FFTs use complex numbers approximating Nth roots of unity.
Some processors have special support for such FFTs.  But these are not used in
GMP since it's very difficult to guarantee an exact result (to some number of
bits).  An occasional difference of 1 in the last bit might not matter to a
typical signal processing algorithm, but is of course of vital importance to
GMP.

@node Unbalanced Multiplication, , Other Multiplication, Multiplication Algorithms
@subsection Unbalanced Multiplication
@cindex Unbalanced multiplication

Multiplication of operands with different sizes, both below
@code{MUL_TOOM22_THRESHOLD}, is done with plain schoolbook multiplication
(@pxref{Basecase Multiplication}).

For really large operands, we invoke FFT directly.

For operands between these sizes, we use Toom inspired algorithms suggested by
Alberto Zanoni and Marco Bodrato.  The idea is to split the operands into
polynomials of different degree.  GMP currently splits the smaller operand
into 2 coefficients, i.e., a polynomial of degree 1, but the larger operand
can be split into 2, 3, or 4 coefficients, i.e., a polynomial of degree 1 to
3.

@c FIXME: This is mighty ugly, but a cleaner @need triggers texinfo bugs that
@c screws up layout here and there in the rest of the manual.
@c @tex
@c \goodbreak
@c @end tex

@node Division Algorithms, Greatest Common Divisor Algorithms, Multiplication Algorithms, Algorithms
@section Division Algorithms
@cindex Division algorithms

@menu
* Single Limb Division::
* Basecase Division::
* Divide and Conquer Division::
* Block-Wise Barrett Division::
* Exact Division::
* Exact Remainder::
* Small Quotient Division::
@end menu

@node Single Limb Division, Basecase Division, Division Algorithms, Division Algorithms
@subsection Single Limb Division

N@cross{}1 division is implemented using repeated 2@cross{}1 divisions from
high to low, either with a hardware divide instruction or a multiplication by
inverse, whichever is best on a given CPU.

The multiply by inverse follows ``Improved division by invariant integers'' by
M@"oller and Granlund (@pxref{References}) and is implemented as
@code{udiv_qrnnd_preinv} in @file{gmp-impl.h}.  The idea is to have a
fixed-point approximation to @math{1/d} (see @code{invert_limb}) and then
multiply by the high limb (plus one bit) of the dividend to get a quotient
@math{q}.  With @math{d} normalized (high bit set), @math{q} is no more than 1
too small.  Subtracting @m{qd,q*d} from the dividend gives a remainder, and
reveals whether @math{q} or @math{q-1} is correct.

The result is a division done with two multiplications and four or five
arithmetic operations.  On CPUs with low latency multipliers this can be much
faster than a hardware divide, though the cost of calculating the inverse at
the start may mean it's only better on inputs bigger than say 4 or 5 limbs.
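
As a concrete sketch, the 2@cross{}1 step might be rendered as follows,
assuming a 64-bit limb and the GCC/Clang @code{__int128} extension (this
follows the M@"oller and Granlund scheme, but is not the actual
@code{udiv_qrnnd_preinv} code):

@example
typedef unsigned long long limb_t;            /* assume 64 bits */

/* v = floor((B^2-1)/d) - B, for B=2^64 and d normalized.  */
limb_t
invert_limb_ref (limb_t d)
@{
  return (limb_t) (~(unsigned __int128) 0 / d);
@}

/* Divide u1:u0 by d (d normalized, u1 < d), returning the quotient
   and leaving the remainder in *r.  Two multiplies, a few adds.  */
limb_t
div2by1 (limb_t *r, limb_t u1, limb_t u0, limb_t d, limb_t v)
@{
  unsigned __int128 p = (unsigned __int128) v * u1;
  p += ((unsigned __int128) u1 << 64) | u0;
  limb_t q1 = (limb_t) (p >> 64) + 1;         /* candidate quotient */
  limb_t q0 = (limb_t) p;
  limb_t rem = u0 - q1 * d;                   /* mod 2^64 */
  if (rem > q0)                               /* q1 was 1 too big */
    @{
      q1--;
      rem += d;
    @}
  if (rem >= d)                               /* rare second adjust */
    @{
      q1++;
      rem -= d;
    @}
  *r = rem;
  return q1;
@}
@end example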

When a divisor must be normalized, either for the generic C
@code{__udiv_qrnnd_c} or the multiply by inverse, the division performed is
actually @m{a2^k,a*2^k} by @m{d2^k,d*2^k} where @math{a} is the dividend and
@math{k} is the power necessary to have the high bit of @m{d2^k,d*2^k} set.
The bit shifts for the dividend are usually accomplished ``on the fly''
meaning by extracting the appropriate bits at each step.  Done this way the
quotient limbs come out aligned ready to store.  When only the remainder is
wanted, an alternative is to take the dividend limbs unshifted and calculate
@m{r = a \bmod d2^k, r = a mod d*2^k} followed by an extra final step @m{r2^k
\bmod d2^k, r*2^k mod d*2^k}.  This can help on CPUs with poor bit shifts or
few registers.

The multiply by inverse can be done two limbs at a time.  The calculation is
basically the same, but the inverse is two limbs and the divisor treated as if
padded with a low zero limb.  This means more work, since the inverse will
need a 2@cross{}2 multiply, but the four 1@cross{}1s to do that are
independent and can therefore be done partly or wholly in parallel.  Likewise
for a 2@cross{}1 calculating @m{qd,q*d}.  The net effect is to process two
limbs with roughly the same two multiplies worth of latency that one limb at a
time gives.  This extends to 3 or 4 limbs at a time, though the extra work to
apply the inverse will almost certainly soon reach the limits of multiplier
throughput.

A similar approach in reverse can be taken to process just half a limb at a
time if the divisor is only a half limb.  In this case the 1@cross{}1 multiply
for the inverse effectively becomes two @m{{1\over2}\times1, (1/2)x1} for each
limb, which can be a saving on CPUs with a fast half limb multiply, or in fact
if the only multiply is a half limb, and especially if it's not pipelined.

@node Basecase Division, Divide and Conquer Division, Single Limb Division, Division Algorithms
@subsection Basecase Division

Basecase N@cross{}M division is like long division done by hand, but in base
@m{2\GMPraise{@code{mp_bits_per_limb}}, 2^mp_bits_per_limb}.  See Knuth
section 4.3.1 algorithm D, and @file{mpn/generic/sb_divrem_mn.c}.

Briefly stated, while the dividend remains larger than the divisor, a high
quotient limb is formed and the N@cross{}1 product @m{qd,q*d} subtracted at
the top end of the dividend.  With a normalized divisor (most significant bit
set), each quotient limb can be formed with a 2@cross{}1 division and a
1@cross{}1 multiplication plus some subtractions.  The 2@cross{}1 division is
by the high limb of the divisor and is done either with a hardware divide or a
multiply by inverse (the same as in @ref{Single Limb Division}) whichever is
faster.  Such a quotient is sometimes one too big, requiring an addback of the
divisor, but that happens rarely.

With Q=N@minus{}M being the number of quotient limbs, this is an
@m{O(QM),O(Q*M)} algorithm and will run at a speed similar to a basecase
Q@cross{}M multiplication, differing in fact only in the extra multiply and
divide for each of the Q quotient limbs.

@node Divide and Conquer Division, Block-Wise Barrett Division, Basecase Division, Division Algorithms
@subsection Divide and Conquer Division

For divisors larger than @code{DC_DIV_QR_THRESHOLD}, division is done by dividing.
Or to be precise by a recursive divide and conquer algorithm based on work by
Moenck and Borodin, Jebelean, and Burnikel and Ziegler (@pxref{References}).

The algorithm consists essentially of recognising that a 2N@cross{}N division
can be done with the basecase division algorithm (@pxref{Basecase Division}),
but using N/2 limbs as a base, not just a single limb.  This way the
multiplications that arise are (N/2)@cross{}(N/2) and can take advantage of
Karatsuba and higher multiplication algorithms (@pxref{Multiplication
Algorithms}).  The two ``digits'' of the quotient are formed by recursive
N@cross{}(N/2) divisions.

If the (N/2)@cross{}(N/2) multiplies are done with a basecase multiplication
then the work is about the same as a basecase division, but with more function
call overheads and with some subtractions separated from the multiplies.
These overheads mean that it's only when N/2 is above
@code{MUL_TOOM22_THRESHOLD} that divide and conquer is of use.

@code{DC_DIV_QR_THRESHOLD} is based on the divisor size N, so it will be somewhere
above twice @code{MUL_TOOM22_THRESHOLD}, but how much above depends on the
CPU@.  An optimized @code{mpn_mul_basecase} can lower @code{DC_DIV_QR_THRESHOLD} a
little by offering a ready-made advantage over repeated @code{mpn_submul_1}
calls.

Divide and conquer is asymptotically @m{O(M(N)\log N),O(M(N)*log(N))} where
@math{M(N)} is the time for an N@cross{}N multiplication done with FFTs.  The
actual time is a sum over multiplications of the recursed sizes, as can be
seen near the end of section 2.2 of Burnikel and Ziegler.  For example, within
the Toom-3 range, divide and conquer is @m{2.63M(N), 2.63*M(N)}.  With higher
algorithms the @math{M(N)} term improves and the multiplier tends to @m{\log
N, log(N)}.  In practice, at moderate to large sizes, a 2N@cross{}N division
is about 2 to 4 times slower than an N@cross{}N multiplication.

@node Block-Wise Barrett Division, Exact Division, Divide and Conquer Division, Division Algorithms
@subsection Block-Wise Barrett Division

For the largest divisions, a block-wise Barrett division algorithm is used.
Here, the divisor is inverted to a precision determined by the relative size of
the dividend and divisor.  Blocks of quotient limbs are then generated by
multiplying blocks from the dividend by the inverse.

Our block-wise algorithm computes a smaller inverse than in the plain Barrett
algorithm.  For a @math{2n/n} division, the inverse will be just @m{\lceil n/2
\rceil, ceil(n/2)} limbs.

@node Exact Division, Exact Remainder, Block-Wise Barrett Division, Division Algorithms
@subsection Exact Division

A so-called exact division is when the dividend is known to be an exact
multiple of the divisor.  Jebelean's exact division algorithm uses this
knowledge to make some significant optimizations (@pxref{References}).

The idea can be illustrated in decimal for example with 368154 divided by
543.  Because the low digit of the dividend is 4, the low digit of the
quotient must be 8.  This is arrived at from @m{4 \mathord{\times} 7 \bmod 10,
4*7 mod 10}, using the fact 7 is the modular inverse of 3 (the low digit of
the divisor), since @m{3 \mathord{\times} 7 \mathop{\equiv} 1 \bmod 10, 3*7
@equiv{} 1 mod 10}.  So @m{8\mathord{\times}543 = 4344,8*543=4344} can be
subtracted from the dividend leaving 363810.  Notice the low digit has become
zero.

The procedure is repeated at the second digit, with the next quotient digit 7
(@m{1 \mathord{\times} 7 \bmod 10, 1*7 mod 10}), subtracting
@m{7\mathord{\times}543 = 3801,7*543=3801}, leaving 325800.  And finally at
the third digit with quotient digit 6 (@m{8 \mathord{\times} 7 \bmod 10, 8*7
mod 10}), subtracting @m{6\mathord{\times}543 = 3258,6*543=3258} leaving 0.
So the quotient is 678.

Notice however that the multiplies and subtractions don't need to extend past
the low three digits of the dividend, since that's enough to determine the
three quotient digits.  For the last quotient digit no subtraction is needed
at all.  On a 2N@cross{}N division like this one, only about half the work of
a normal basecase division is necessary.

For an N@cross{}M exact division producing Q=N@minus{}M quotient limbs, the
saving over a normal basecase division is in two parts.  Firstly, each of the
Q quotient limbs needs only one multiply, not a 2@cross{}1 divide and
multiply.  Secondly, the crossproducts are reduced when @math{Q>M} to
@m{QM-M(M+1)/2,Q*M-M*(M+1)/2}, or when @math{Q@le{}M} to @m{Q(Q-1)/2,
Q*(Q-1)/2}.  Notice the savings are complementary.  If Q is big then many
divisions are saved, or if Q is small then the crossproducts reduce to a small
number.

The modular inverse used is calculated efficiently by @code{binvert_limb} in
@file{gmp-impl.h}.  This does four multiplies for a 32-bit limb, or six for a
64-bit limb.  @file{tune/modlinv.c} has some alternate implementations that
might suit processors better at bit twiddling than multiplying.
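
Both pieces can be sketched for a 64-bit limb as follows (a table-free
variant spending a few more multiplies than GMP's @code{binvert_limb};
@code{d} must be odd):

@example
typedef unsigned long long limb_t;

/* Inverse of odd d modulo 2^64: (3*d) XOR 2 is correct to 5 bits,
   and each Newton step v *= 2-v*d doubles that, 5->10->20->40->80.  */
limb_t
binvert_limb_ref (limb_t d)
@{
  limb_t v = (3 * d) ^ 2;
  v *= 2 - v * d;
  v *= 2 - v * d;
  v *= 2 - v * d;
  v *= 2 - v * d;
  return v;                   /* v*d == 1 mod 2^64 */
@}

/* Exact division n[0..size-1] / d for odd d: one low multiply by
   dinv and one high-half multiply per quotient limb, no division.  */
void
divexact_1_ref (limb_t *q, const limb_t *n, int size, limb_t d)
@{
  limb_t dinv = binvert_limb_ref (d);
  limb_t borrow = 0;
  int i;
  for (i = 0; i < size; i++)
    @{
      limb_t s = n[i] - borrow;
      limb_t b = (n[i] < borrow);             /* borrow out */
      limb_t qi = s * dinv;                   /* next quotient limb */
      q[i] = qi;
      unsigned __int128 p = (unsigned __int128) qi * d;
      borrow = (limb_t) (p >> 64) + b;        /* cancels in next limb */
    @}
@}
@end example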

The sub-quadratic exact division described by Jebelean in ``Exact Division
with Karatsuba Complexity'' is not currently implemented.  It uses a
rearrangement similar to the divide and conquer for normal division
(@pxref{Divide and Conquer Division}), but operating from low to high.  A
further possibility not currently implemented is ``Bidirectional Exact Integer
Division'' by Krandick and Jebelean which forms quotient limbs from both the
high and low ends of the dividend, and can halve once more the number of
crossproducts needed in a 2N@cross{}N division.

A special case exact division by 3 exists in @code{mpn_divexact_by3},
supporting Toom-3 multiplication and @code{mpq} canonicalizations.  It forms
quotient digits with a multiply by the modular inverse of 3 (which is
@code{0xAA..AAB}) and uses two comparisons to determine a borrow for the next
limb.  The multiplications don't need to be on the dependent chain, as long as
the effect of the borrows is applied, which can help chips with pipelined
multipliers.

@node Exact Remainder, Small Quotient Division, Exact Division, Division Algorithms
@subsection Exact Remainder
@cindex Exact remainder

If the exact division algorithm is done with a full subtraction at each stage
and the dividend isn't a multiple of the divisor, then low zero limbs are
produced but with a remainder in the high limbs.  For dividend @math{a},
divisor @math{d}, quotient @math{q}, and @m{b = 2
\GMPraise{@code{mp_bits_per_limb}}, b = 2^mp_bits_per_limb}, this remainder
@math{r} is of the form
@tex
$$ a = qd + r b^n $$
@end tex
@ifnottex

@example
a = q*d + r*b^n
@end example

@end ifnottex

@math{n} represents the number of zero limbs produced by the subtractions,
that being the number of limbs produced for @math{q}.  @math{r} will be in the
range @math{0@le{}r<d} and can be viewed as a remainder, but one shifted up by
a factor of @math{b^n}.

Carrying out full subtractions at each stage means the same number of cross
products must be done as a normal division, but there are still some single
limb divisions saved.  When @math{d} is a single limb some simplifications
arise, providing good speedups on a number of processors.

@code{mpn_divexact_by3}, @code{mpn_modexact_1_odd} and the @code{mpn_redc_X}
functions differ subtly in how they return @math{r}, leading to some negations
in the above formula, but all are essentially the same.

@cindex Divisibility algorithm
@cindex Congruence algorithm
Clearly @math{r} is zero when @math{a} is a multiple of @math{d}, and this
leads to divisibility or congruence tests which are potentially more efficient
than a normal division.

The factor of @math{b^n} on @math{r} can be ignored in a GCD when @math{d} is
odd, hence the use of @code{mpn_modexact_1_odd} by @code{mpn_gcd_1} and
@code{mpz_kronecker_ui} etc (@pxref{Greatest Common Divisor Algorithms}).

Montgomery's REDC method for modular multiplications uses operands of the form
of @m{xb^{-n}, x*b^-n} and @m{yb^{-n}, y*b^-n} and on calculating @m{(xb^{-n})
(yb^{-n}), (x*b^-n)*(y*b^-n)} uses the factor of @math{b^n} in the exact
remainder to reach a product in the same form @m{(xy)b^{-n}, (x*y)*b^-n}
(@pxref{Modular Powering Algorithm}).

Notice that @math{r} generally gives no useful information about the ordinary
remainder @math{a @bmod d} since @math{b^n @bmod d} could be anything.  If
however @math{b^n @equiv{} 1 @bmod d}, then @math{r} is the negative of the
ordinary remainder.  This occurs whenever @math{d} is a factor of
@math{b^n-1}, as for example with 3 in @code{mpn_divexact_by3}.  For a 32 or
64 bit limb other such factors include 5, 17 and 257, but no particular use
has been found for this.

@node Small Quotient Division, , Exact Remainder, Division Algorithms
@subsection Small Quotient Division

An N@cross{}M division where the number of quotient limbs Q=N@minus{}M is
small can be optimized somewhat.

An ordinary basecase division normalizes the divisor by shifting it to make
the high bit set, shifting the dividend accordingly, and shifting the
remainder back down at the end of the calculation.  This is wasteful if only a
few quotient limbs are to be formed.  Instead a division of just the top
@m{2Q,2*Q} limbs of the dividend by the top Q limbs of the divisor can be
used to form a trial quotient.  This requires only those limbs normalized, not
the whole of the divisor and dividend.

A multiply and subtract then applies the trial quotient to the M@minus{}Q
unused limbs of the divisor and N@minus{}Q dividend limbs (which includes Q
limbs remaining from the trial quotient division).  The starting trial
quotient can be 1 or 2 too big, but all cases of 2 too big and most cases of 1
too big are detected by first comparing the most significant limbs that will
arise from the subtraction.  An addback is done if the quotient still turns
out to be 1 too big.

This whole procedure is essentially the same as one step of the basecase
algorithm done in a Q limb base, though with the trial quotient test done only
with the high limbs, not an entire Q limb ``digit'' product.  The correctness
of this weaker test can be established by following the argument of Knuth
section 4.3.1 exercise 20 but with the @m{v_2 \GMPhat q > b \GMPhat r
+ u_2, v2*q>b*r+u2} condition appropriately relaxed.

@need 1000
@node Greatest Common Divisor Algorithms, Powering Algorithms, Division Algorithms, Algorithms
@section Greatest Common Divisor
@cindex Greatest common divisor algorithms
@cindex GCD algorithms

@menu
* Binary GCD::
* Lehmer's Algorithm::
* Subquadratic GCD::
* Extended GCD::
* Jacobi Symbol::
@end menu

@node Binary GCD, Lehmer's Algorithm, Greatest Common Divisor Algorithms, Greatest Common Divisor Algorithms
@subsection Binary GCD

At small sizes GMP uses an @math{O(N^2)} binary style GCD@.  This is described
in many textbooks, for example Knuth section 4.5.2 algorithm B@.  It simply
consists of successively reducing odd operands @math{a} and @math{b} using

@quotation
@math{a,b = @abs{}(a-b),@min{}(a,b)} @*
strip factors of 2 from @math{a}
@end quotation
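
For single-word operands the reduction might look like the following sketch
(not the actual @code{mpn_gcd_1} loop, and using GCC builtins for the bit
scans):

@example
unsigned long
gcd_binary (unsigned long a, unsigned long b)
@{
  /* assume a and b both non-zero */
  int twos = __builtin_ctzl (a | b);    /* common factors of 2 */
  a >>= __builtin_ctzl (a);             /* make a odd */
  do
    @{
      b >>= __builtin_ctzl (b);         /* strip factors of 2 */
      if (a > b)
        @{
          unsigned long t = a;
          a = b;
          b = t;
        @}
      b -= a;                           /* |a-b|, min(a,b) step */
    @}
  while (b != 0);
  return a << twos;
@}
@end example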

The Euclidean GCD algorithm, as per Knuth algorithms E and A, repeatedly
computes the quotient @m{q = \lfloor a/b \rfloor, q = floor(a/b)} and replaces
@math{a,b} by @math{b, a - q b}.  The binary algorithm has so far been found to
be faster than the Euclidean algorithm everywhere.  One reason the binary
method does well is that the implied quotient at each step is usually small,
so often only one or two subtractions are needed to get the same effect as a
division.  Quotients 1, 2 and 3 for example occur 67.7% of the time, see Knuth
section 4.5.3 Theorem E.

When the implied quotient is large, meaning @math{b} is much smaller than
@math{a}, then a division is worthwhile.  This is the basis for the initial
@math{a @bmod b} reductions in @code{mpn_gcd} and @code{mpn_gcd_1} (the latter
for both N@cross{}1 and 1@cross{}1 cases).  But after that initial reduction,
big quotients occur too rarely to make it worth checking for them.

@sp 1
The final @math{1@cross{}1} GCD in @code{mpn_gcd_1} is done in the generic C
code as described above.  For two N-bit operands, the algorithm takes about
0.68 iterations per bit.  For optimum performance some attention needs to be
paid to the way the factors of 2 are stripped from @math{a}.

Firstly it may be noted that in twos complement the number of low zero bits on
@math{a-b} is the same as @math{b-a}, so counting or testing can begin on
@math{a-b} without waiting for @math{@abs{}(a-b)} to be determined.

A loop stripping low zero bits tends not to branch predict well, since the
condition is data dependent.  But on average there are only a few low zeros, so
an option is to strip one or two bits arithmetically then loop for more (as
done for AMD K6).  Or use a lookup table to get a count for several bits then
loop for more (as done for AMD K7).  An alternative approach is to keep just
one of @math{a} or @math{b} odd and iterate

@quotation
@math{a,b = @abs{}(a-b), @min{}(a,b)} @*
@math{a = a/2} if even @*
@math{b = b/2} if even
@end quotation

This requires about 1.25 iterations per bit, but stripping of a single bit at
each step avoids any branching.  Repeating the bit strip reduces to about 0.9
iterations per bit, which may be a worthwhile tradeoff.

Generally with the above approaches a speed of perhaps 6 cycles per bit can be
achieved, which is still not terribly fast with for instance a 64-bit GCD
taking nearly 400 cycles.  It's this sort of time which means it's not usually
advantageous to combine a set of divisibility tests into a GCD.

Currently, the binary algorithm is used for GCD only when @math{N < 3}.

@node Lehmer's Algorithm, Subquadratic GCD, Binary GCD, Greatest Common Divisor Algorithms
@comment  node-name,  next,  previous,  up
@subsection Lehmer's algorithm

Lehmer's improvement of the Euclidean algorithms is based on the observation
that the initial part of the quotient sequence depends only on the most
significant parts of the inputs.  The variant of Lehmer's algorithm used in GMP
splits off the most significant two limbs, as suggested, e.g., in ``A
Double-Digit Lehmer-Euclid Algorithm'' by Jebelean (@pxref{References}).  The
quotients of two double-limb inputs are collected as a 2 by 2 matrix with
single-limb elements.  This is done by the function @code{mpn_hgcd2}.  The
resulting matrix is applied to the inputs using @code{mpn_mul_1} and
@code{mpn_submul_1}.  Each iteration usually reduces the inputs by almost one
limb.  In the rare case of a large quotient, no progress can be made by
examining just the most significant two limbs, and the quotient is computed
using plain division.

The resulting algorithm is asymptotically @math{O(N^2)}, just as the Euclidean
algorithm and the binary algorithm.  The quadratic part of the work is the
calls to @code{mpn_mul_1} and @code{mpn_submul_1}.  For small sizes, the
linear work is also significant.  There are roughly @math{N} calls to the
@code{mpn_hgcd2} function.  This function uses a couple of important
optimizations:

@itemize
@item
It uses the same relaxed notion of correctness as @code{mpn_hgcd} (see next
section).  This means that when called with the most significant two limbs of
two large numbers, the returned matrix does not always correspond exactly to
the initial quotient sequence for the two large numbers; the final quotient
may sometimes be one off.

@item
It takes advantage of the fact the quotients are usually small.  The division
operator is not used, since the corresponding assembler instruction is very
slow on most architectures.  (This code could probably be improved further;
it uses many branches that are unfriendly to prediction.)

@item
It switches from double-limb calculations to single-limb calculations half-way
through, when the input numbers have been reduced in size from two limbs to
one and a half.
@end itemize

@node Subquadratic GCD, Extended GCD, Lehmer's Algorithm, Greatest Common Divisor Algorithms
@subsection Subquadratic GCD

For inputs larger than @code{GCD_DC_THRESHOLD}, GCD is computed via the HGCD
(Half GCD) function, as a generalization to Lehmer's algorithm.

Let the inputs @math{a,b} be of size @math{N} limbs each.  Put @m{S=\lfloor N/2
\rfloor + 1, S = floor(N/2) + 1}.  Then HGCD(a,b) returns a transformation
matrix @math{T} with non-negative elements, and reduced numbers @math{(c;d) =
T^{-1} (a;b)}.  The reduced numbers @math{c,d} must be larger than @math{S}
limbs, while their difference @math{abs(c-d)} must fit in @math{S} limbs.  The
matrix elements will also be of size roughly @math{N/2}.

The HGCD base case uses Lehmer's algorithm, but with the above stop condition
that returns reduced numbers and the corresponding transformation matrix
half-way through.  For inputs larger than @code{HGCD_THRESHOLD}, HGCD is
computed recursively, using the divide and conquer algorithm in ``On
Sch@"onhage's algorithm and subquadratic integer GCD computation'' by M@"oller
(@pxref{References}).  The recursive algorithm consists of these main
steps.

@itemize
@item
Call HGCD recursively, on the most significant @math{N/2} limbs.  Apply the
resulting matrix @math{T_1} to the full numbers, reducing them to a size just
above @math{3N/2}.

@item
Perform a small number of division or subtraction steps to reduce the numbers
to size below @math{3N/2}.  This is essential mainly for the unlikely case of
large quotients.

@item
Call HGCD recursively, on the most significant @math{N/2} limbs of the reduced
numbers.  Apply the resulting matrix @math{T_2} to the full numbers, reducing
them to a size just above @math{N/2}.

@item
Compute @math{T = T_1 T_2}.

@item
Perform a small number of division and subtraction steps to satisfy the
requirements, and return.
@end itemize

GCD is then implemented as a loop around HGCD, similarly to Lehmer's
algorithm.  Where Lehmer repeatedly chops off the top two limbs, calls
@code{mpn_hgcd2}, and applies the resulting matrix to the full numbers, the
subquadratic GCD chops off the most significant third of the limbs (the
proportion is a tuning parameter, and @math{1/3} seems to be more efficient
than, e.g., @math{1/2}), calls @code{mpn_hgcd}, and applies the resulting
matrix.  Once the input numbers are reduced to size below
@code{GCD_DC_THRESHOLD}, Lehmer's algorithm is used for the rest of the work.

The asymptotic running time of both HGCD and GCD is @m{O(M(N)\log N),O(M(N)*log(N))},
where @math{M(N)} is the time for multiplying two @math{N}-limb numbers.

@comment  node-name,  next,  previous,  up
@node Extended GCD, Jacobi Symbol, Subquadratic GCD, Greatest Common Divisor Algorithms
@subsection Extended GCD

The extended GCD function, or GCDEXT, calculates @math{@gcd{}(a,b)} and also
cofactors @math{x} and @math{y} satisfying @m{ax+by=\gcd(a@C{}b),
a*x+b*y=gcd(a@C{}b)}.  All the algorithms used for plain GCD are extended to
handle this case.  The binary algorithm is used only for single-limb GCDEXT.
Lehmer's algorithm is used for sizes up to @code{GCDEXT_DC_THRESHOLD}.  Above
this threshold, GCDEXT is implemented as a loop around HGCD, but with more
book-keeping to keep track of the cofactors.  This gives the same asymptotic
running time as for GCD and HGCD, @m{O(M(N)\log N),O(M(N)*log(N))}.

One difference to plain GCD is that while the inputs @math{a} and @math{b} are
reduced as the algorithm proceeds, the cofactors @math{x} and @math{y} grow in
size.  This makes the tuning of the chopping-point more difficult.  The current
code chops off the most significant half of the inputs for the call to HGCD in
the first iteration, and the most significant two thirds for the remaining
calls.  This strategy could surely be improved.  Also the stop condition for the
loop, where Lehmer's algorithm is invoked once the inputs are reduced below
@code{GCDEXT_DC_THRESHOLD}, could maybe be improved by taking into account the
current size of the cofactors.

@node Jacobi Symbol, , Extended GCD, Greatest Common Divisor Algorithms
@subsection Jacobi Symbol
@cindex Jacobi symbol algorithm

@code{mpz_jacobi} and @code{mpz_kronecker} are currently implemented with a
simple binary algorithm similar to that described for the GCDs (@pxref{Binary
GCD}).  They're not very fast when both inputs are large.  Lehmer's multi-step
improvement or a binary based multi-step algorithm is likely to be better.

When one operand fits a single limb, and that includes @code{mpz_kronecker_ui}
and friends, an initial reduction is done with either @code{mpn_mod_1} or
@code{mpn_modexact_1_odd}, followed by the binary algorithm on a single limb.
The binary algorithm is well suited to a single limb, and the whole
calculation in this case is quite efficient.

In all the routines sign changes for the result are accumulated using some bit
twiddling, avoiding table lookups or conditional jumps.

@need 1000
@node Powering Algorithms, Root Extraction Algorithms, Greatest Common Divisor Algorithms, Algorithms
@section Powering Algorithms
@cindex Powering algorithms

@menu
* Normal Powering Algorithm::
* Modular Powering Algorithm::
@end menu

@node Normal Powering Algorithm, Modular Powering Algorithm, Powering Algorithms, Powering Algorithms
@subsection Normal Powering

Normal @code{mpz} or @code{mpf} powering uses a simple binary algorithm,
successively squaring and then multiplying by the base when a 1 bit is seen in
the exponent, as per Knuth section 4.6.3.  The ``left to right''
variant described there is used rather than algorithm A, since it's just as
easy and can be done with somewhat less temporary memory.

@node Modular Powering Algorithm, , Normal Powering Algorithm, Powering Algorithms
@subsection Modular Powering

Modular powering is implemented using a @math{2^k}-ary sliding window
algorithm, as per ``Handbook of Applied Cryptography'' algorithm 14.85
(@pxref{References}).  @math{k} is chosen according to the size of the
exponent.  Larger exponents use larger values of @math{k}, the choice being
made to minimize the average number of multiplications that must supplement
the squaring.

The modular multiplies and squares use either a simple division or the REDC
method by Montgomery (@pxref{References}).  REDC is a little faster,
essentially saving N single limb divisions in a fashion similar to an exact
remainder (@pxref{Exact Remainder}).
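
A word-size REDC sketch, assuming a 64-bit limb, the GCC/Clang @code{__int128}
extension, and a modulus below @math{2^{63}} so the intermediate sum fits;
@code{minv} is @m{-m^{-1} \bmod 2^{64}, -(1/m) mod 2^64}, obtainable with a
@code{binvert_limb} style calculation:

@example
typedef unsigned long long limb_t;

/* Given t = x*y with x,y < m (m odd, m < 2^63), return
   t/2^64 mod m.  */
limb_t
redc_1 (unsigned __int128 t, limb_t m, limb_t minv)
@{
  limb_t q = (limb_t) t * minv;              /* q = -t/m mod 2^64 */
  unsigned __int128 s = t + (unsigned __int128) q * m;
  limb_t r = (limb_t) (s >> 64);             /* low limb of s is 0 */
  return r >= m ? r - m : r;
@}
@end example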

@node Root Extraction Algorithms, Radix Conversion Algorithms, Powering Algorithms, Algorithms
@section Root Extraction Algorithms
@cindex Root extraction algorithms

@menu
* Square Root Algorithm::
* Nth Root Algorithm::
* Perfect Square Algorithm::
* Perfect Power Algorithm::
@end menu

@node Square Root Algorithm, Nth Root Algorithm, Root Extraction Algorithms, Root Extraction Algorithms
@subsection Square Root
@cindex Square root algorithm
@cindex Karatsuba square root algorithm

Square roots are taken using the ``Karatsuba Square Root'' algorithm by Paul
Zimmermann (@pxref{References}).

An input @math{n} is split into four parts of @math{k} bits each, so with
@math{b=2^k} we have @m{n = a_3b^3 + a_2b^2 + a_1b + a_0, n = a3*b^3 + a2*b^2
+ a1*b + a0}.  Part @ms{a,3} must be ``normalized'' so that either the high or
second highest bit is set.  In GMP, @math{k} is kept on a limb boundary and
the input is left shifted (by an even number of bits) to normalize.

The square root of the high two parts is taken, by recursive application of
the algorithm (bottoming out in a one-limb Newton's method),
@tex
$$ s',r' = \mathop{\rm sqrtrem} \> (a_3b + a_2) $$
@end tex
@ifnottex

@example
s1,r1 = sqrtrem (a3*b + a2)
@end example

@end ifnottex

This is an approximation to the desired root and is extended by a division to
give @math{s},@math{r},
@tex
$$\eqalign{
q,u &= \mathop{\rm divrem} \> (r'b + a_1, 2s') \cr
s &= s'b + q \cr
r &= ub + a_0 - q^2
}$$
@end tex
@ifnottex

@example
q,u = divrem (r1*b + a1, 2*s1)
s = s1*b + q
r = u*b + a0 - q^2
@end example

@end ifnottex

The normalization requirement on @ms{a,3} means at this point @math{s} is
either correct or 1 too big.  @math{r} is negative in the latter case, so
@tex
$$\eqalign{
\mathop{\rm if} \; r &< 0 \; \mathop{\rm then} \cr
r &\leftarrow r + 2s - 1 \cr
s &\leftarrow s - 1
}$$
@end tex
@ifnottex

@example
if r < 0 then
  r = r + 2*s - 1
  s = s - 1
@end example

@end ifnottex
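
The one-limb Newton's method at the bottom of the recursion might look as
follows (a sketch assuming a 64-bit limb, not GMP's actual base case):

@example
typedef unsigned long long limb_t;

limb_t
isqrt_limb (limb_t a)
@{
  if (a == 0)
    return 0;
  int bits = 64 - __builtin_clzll (a);
  limb_t x = (limb_t) 1 << ((bits + 1) / 2);  /* x >= sqrt(a) */
  for (;;)
    @{
      limb_t y = (x + a / x) / 2;             /* Newton step */
      if (y >= x)
        return x;                             /* converged: floor sqrt */
      x = y;
    @}
@}
@end example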

The algorithm is expressed in a divide and conquer form, but as noted in the
paper it can also be viewed as a discrete variant of Newton's method, or as a
variation on the schoolboy method (no longer taught) for square roots two
digits at a time.

If the remainder @math{r} is not required then usually only a few high limbs
of @math{r} and @math{u} need to be calculated to determine whether an
adjustment to @math{s} is required.  This optimization is not currently
implemented.

In the Karatsuba multiplication range this algorithm is @m{O({3\over2}
M(N/2)),O(1.5*M(N/2))}, where @math{M(n)} is the time to multiply two numbers
of @math{n} limbs.  In the FFT multiplication range this grows to a bound of
@m{O(6 M(N/2)),O(6*M(N/2))}.  In practice a factor of about 1.5 to 1.8 is
found in the Karatsuba and Toom-3 ranges, growing to 2 or 3 in the FFT range.

The algorithm does all its calculations in integers and the resulting
@code{mpn_sqrtrem} is used for both @code{mpz_sqrt} and @code{mpf_sqrt}.
The extended precision given by @code{mpf_sqrt_ui} is obtained by
padding with zero limbs.

@node Nth Root Algorithm, Perfect Square Algorithm, Square Root Algorithm, Root Extraction Algorithms
@subsection Nth Root
@cindex Root extraction algorithm
@cindex Nth root algorithm

Integer Nth roots are taken using Newton's method with the following
iteration, where @math{A} is the input and @math{n} is the root to be taken.
@tex
$$a_{i+1} = {1\over n} \left({A \over a_i^{n-1}} + (n-1)a_i \right)$$
@end tex
@ifnottex

@example
         1         A
a[i+1] = - * ( --------- + (n-1)*a[i] )
         n     a[i]^(n-1)
@end example

@end ifnottex

The initial approximation @m{a_1,a[1]} is generated bitwise by successively
powering a trial root with or without new 1 bits, aiming to be just above the
true root.  The iteration converges quadratically when started from a good
approximation.  When @math{n} is large more initial bits are needed to get
good convergence.  The current implementation is not particularly well
optimized.
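
For a single word the iteration can be replaced by a floating-point seed plus
a fix-up loop, giving a compact sketch (GMP itself builds the initial
approximation bitwise, as described above):

@example
#include <math.h>

typedef unsigned long long limb_t;    /* assume 64 bits */

/* Does x^n <= a?  The check after each multiply keeps p below 2^64
   before the next step, so the product cannot wrap.  */
static int
pow_le (limb_t x, unsigned n, limb_t a)
@{
  unsigned __int128 p = 1;
  while (n--)
    @{
      p *= x;
      if (p > a)
        return 0;
    @}
  return 1;
@}

limb_t
iroot (limb_t a, unsigned n)          /* floor(a^(1/n)), n >= 1 */
@{
  if (n == 1)
    return a;
  limb_t x = (limb_t) pow ((double) a, 1.0 / n);
  while (!pow_le (x, n, a))
    x--;                              /* seed slightly high */
  while (pow_le (x + 1, n, a))
    x++;                              /* or slightly low */
  return x;
@}
@end example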

@node Perfect Square Algorithm, Perfect Power Algorithm, Nth Root Algorithm, Root Extraction Algorithms
@subsection Perfect Square
@cindex Perfect square algorithm

A significant fraction of non-squares can be quickly identified by checking
whether the input is a quadratic residue modulo small integers.
@code{mpz_perfect_square_p} first tests the input mod 256, which means just
examining the low byte.  Only 44 different values occur for squares mod 256,
so 82.8% of inputs can be immediately identified as non-squares.
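
The byte test amounts to a 256-entry table lookup, built from the squares
themselves; a sketch:

@example
/* Non-zero if the low byte of n is possible for a square.  Only 44
   of the 256 entries ever get set.  */
static int
maybe_square_mod_256 (unsigned long n)
@{
  static unsigned char issq[256];
  static int ready = 0;
  if (!ready)
    @{
      int i;
      for (i = 0; i < 256; i++)
        issq[(i * i) & 0xFF] = 1;
      ready = 1;
    @}
  return issq[n & 0xFF];          /* 0 means definitely not a square */
@}
@end example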

On a 32-bit system similar tests are done mod 9, 5, 7, 13 and 17, for a total
99.25% of inputs identified as non-squares.  On a 64-bit system 97 is tested
too, for a total 99.62%.

These moduli are chosen because they're factors of @math{2^@W{24}-1} (or
@math{2^@W{48}-1} for 64-bits), and such a remainder can be quickly taken just
using additions (see @code{mpn_mod_34lsub1}).

When nails are in use moduli are instead selected by the @file{gen-psqr.c}
program and applied with an @code{mpn_mod_1}.  The same @math{2^@W{24}-1} or
@math{2^@W{48}-1} could be done with nails using some extra bit shifts, but
this is not currently implemented.

In any case each modulus is applied to the @code{mpn_mod_34lsub1} or
@code{mpn_mod_1} remainder and a table lookup identifies non-squares.  By
using a ``modexact'' style calculation, and suitably permuted tables, just one
multiply each is required, see the code for details.  Moduli are also combined
to save operations, so long as the lookup tables don't become too big.
@file{gen-psqr.c} does all the pre-calculations.

A square root must still be taken for any value that passes these tests, to
verify it's really a square and not one of the small fraction of non-squares
that get through (ie.@: a pseudo-square to all the tested bases).

Clearly more residue tests could be done, @code{mpz_perfect_square_p} only
uses a compact and efficient set.  Big inputs would probably benefit from more
residue testing, small inputs might be better off with less.  The assumed
distribution of squares versus non-squares in the input would affect such
considerations.

@node Perfect Power Algorithm, , Perfect Square Algorithm, Root Extraction Algorithms
@subsection Perfect Power
@cindex Perfect power algorithm

Detecting perfect powers is required by some factorization algorithms.
Currently @code{mpz_perfect_power_p} is implemented using repeated Nth root
extractions, though naturally only prime roots need to be considered.
(@xref{Nth Root Algorithm}.)

If a prime divisor @math{p} with multiplicity @math{e} can be found, then only
roots which are divisors of @math{e} need to be considered, much reducing the
work necessary.  To this end divisibility by a set of small primes is checked.

@node Radix Conversion Algorithms, Other Algorithms, Root Extraction Algorithms, Algorithms
@section Radix Conversion
@cindex Radix conversion algorithms

Radix conversions are less important than other algorithms.  A program
dominated by conversions should probably use a different data representation.

@menu
* Binary to Radix::
* Radix to Binary::
@end menu

@node Binary to Radix, Radix to Binary, Radix Conversion Algorithms, Radix Conversion Algorithms
@subsection Binary to Radix

Conversions from binary to a power-of-2 radix use a simple and fast
@math{O(N)} bit extraction algorithm.

Conversions from binary to other radices use one of two algorithms.  Sizes
below @code{GET_STR_PRECOMPUTE_THRESHOLD} use a basic @math{O(N^2)} method.
Repeated divisions by @math{b^n} are made, where @math{b} is the radix and
@math{n} is the biggest power that fits in a limb.  But instead of simply
using the remainder @math{r} from such divisions, an extra divide step is done
to give a fractional limb representing @math{r/b^n}.  The digits of @math{r}
can then be extracted using multiplications by @math{b} rather than divisions.
Special case code is provided for decimal, allowing multiplications by 10 to
optimize to shifts and adds.
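
The fraction trick might be sketched for one limb as follows (64-bit
arithmetic, radix 10 with @math{n=9} so @math{b^n = 10^9}; the @math{+1}
makes the fixed-point fraction slightly high, which the rounding analysis
permits):

@example
#include <stdint.h>

/* Write the 9 decimal digits of r < 10^9 into buf, using one
   division to set up and then only multiplies by 10.  */
void
get_digits (uint32_t r, char *buf)
@{
  /* fixed-point fraction r/10^9, scaled by 2^64, rounded up */
  uint64_t f = (uint64_t) ((((unsigned __int128) r << 64)
                            / 1000000000) + 1);
  int i;
  for (i = 0; i < 9; i++)
    @{
      unsigned __int128 p = (unsigned __int128) f * 10;
      buf[i] = '0' + (int) (p >> 64);   /* integer part is the digit */
      f = (uint64_t) p;                 /* keep the fraction */
    @}
@}
@end example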

Above @code{GET_STR_PRECOMPUTE_THRESHOLD} a sub-quadratic algorithm is used.
For an input @math{t}, powers @m{b^{n2^i},b^(n*2^i)} of the radix are
calculated, until a power between @math{t} and @m{\sqrt{t},sqrt(t)} is
reached.  @math{t} is then divided by that largest power, giving a quotient
which is the digits above that power, and a remainder which is those below.
These two parts are in turn divided by the second highest power, and so on
recursively.  When a piece has been divided down to less than
@code{GET_STR_DC_THRESHOLD} limbs, the basecase algorithm described above is
used.

The advantage of this algorithm is that big divisions can make use of the
sub-quadratic divide and conquer division (@pxref{Divide and Conquer
Division}), and big divisions tend to have less overheads than lots of
separate single limb divisions anyway.  But in any case the cost of
calculating the powers @m{b^{n2^i},b^(n*2^i)} must first be overcome.

@code{GET_STR_PRECOMPUTE_THRESHOLD} and @code{GET_STR_DC_THRESHOLD} represent
the same basic thing, the point where it becomes worth doing a big division to
cut the input in half.  @code{GET_STR_PRECOMPUTE_THRESHOLD} includes the cost
of calculating the radix power required, whereas @code{GET_STR_DC_THRESHOLD}
assumes that's already available, which is the case when recursing.

Since the base case produces digits from least to most significant but they
want to be stored from most to least, it's necessary to calculate in advance
how many digits there will be, or at least be sure not to underestimate that.
For GMP the number of input bits is multiplied by @code{chars_per_bit_exactly}
from @code{mp_bases}, rounding up.  The result is either correct or one too
big.

Examining some of the high bits of the input could increase the chance of
getting the exact number of digits, but an exact result every time would not
be practical, since in general the difference between numbers 100@dots{} and
99@dots{} is only in the last few bits and the work to identify 99@dots{}
might well be almost as much as a full conversion.

@code{mpf_get_str} doesn't currently use the algorithm described here, it
multiplies or divides by a power of @math{b} to move the radix point to just
above the highest non-zero digit (or at worst one above that location),
then multiplies by @math{b^n} to bring out digits.  This is @math{O(N^2)} and
is certainly not optimal.

The @math{r/b^n} scheme described above for using multiplications to bring out
digits might be useful for more than a single limb.  Some brief experiments
with it on the base case when recursing didn't give a noticeable improvement,
but perhaps that was only due to the implementation.  Something similar would
work for the sub-quadratic divisions too, though there would be the cost of
calculating a bigger radix power.

Another possible improvement for the sub-quadratic part would be to arrange
for radix powers that balanced the sizes of quotient and remainder produced,
ie.@: the highest power would be a @m{b^{nk},b^(n*k)} approximately equal to
@m{\sqrt{t},sqrt(t)}, not restricted to a @math{2^i} factor.  That ought to
smooth out a graph of times against sizes, but may or may not be a net
speedup.

@node Radix to Binary, , Binary to Radix, Radix Conversion Algorithms
@subsection Radix to Binary

@strong{This section needs to be rewritten, it currently describes the
algorithms used before GMP 4.3.}

Conversions from a power-of-2 radix into binary use a simple and fast
@math{O(N)} bitwise concatenation algorithm.

Conversions from other radices use one of two algorithms.  Sizes below
@code{SET_STR_PRECOMPUTE_THRESHOLD} use a basic @math{O(N^2)} method.  Groups
of @math{n} digits are converted to limbs, where @math{n} is the biggest
power of the base @math{b} which will fit in a limb, then those groups are
accumulated into the result by multiplying by @math{b^n} and adding.  This
saves multi-precision operations, as per Knuth section 4.4 part E
(@pxref{References}).  Some special case code is provided for decimal, giving
the compiler a chance to optimize multiplications by 10.
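
A sketch of the basecase for decimal with 64-bit limbs (groups of 9 digits
for simplicity, where 19 is the biggest number of decimal digits actually
fitting a 64-bit limb):

@example
#include <stdint.h>

/* Convert len decimal digits at s to limbs at rp, most significant
   digit first.  Returns the number of limbs produced.  O(N^2).  */
int
set_str_dec (uint64_t *rp, const char *s, int len)
@{
  int rn = 0;
  int i = 0;
  while (i < len)
    @{
      uint64_t group = 0, pow = 1;
      int j, k;
      for (j = 0; j < 9 && i < len; j++, i++)
        @{
          group = group * 10 + (s[i] - '0');
          pow *= 10;
        @}
      /* rp = rp * pow + group, one combined mul-and-add pass */
      unsigned __int128 acc = group;
      for (k = 0; k < rn; k++)
        @{
          acc += (unsigned __int128) rp[k] * pow;
          rp[k] = (uint64_t) acc;
          acc >>= 64;
        @}
      if ((uint64_t) acc != 0)
        rp[rn++] = (uint64_t) acc;
    @}
  return rn;
@}
@end example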

Above @code{SET_STR_PRECOMPUTE_THRESHOLD} a sub-quadratic algorithm is used.
First groups of @math{n} digits are converted into limbs.  Then adjacent
limbs are combined into limb pairs with @m{xb^n+y,x*b^n+y}, where @math{x}
and @math{y} are the limbs.  Adjacent limb pairs are combined into quads
similarly with @m{xb^{2n}+y,x*b^(2n)+y}.  This continues until a single block
remains, that being the result.

The advantage of this method is that the multiplications for each @math{x} are
big blocks, allowing Karatsuba and higher algorithms to be used.  But the cost
of calculating the powers @m{b^{n2^i},b^(n*2^i)} must be overcome.
@code{SET_STR_PRECOMPUTE_THRESHOLD} usually ends up quite big, around 5000 digits, and on
some processors much bigger still.

@code{SET_STR_PRECOMPUTE_THRESHOLD} is based on the input digits (and tuned
for decimal), though it might be better based on a limb count, so as to be
independent of the base.  But that sort of count isn't used by the base case
and so would need some sort of initial calculation or estimate.

The main reason @code{SET_STR_PRECOMPUTE_THRESHOLD} is so much bigger than the
corresponding @code{GET_STR_PRECOMPUTE_THRESHOLD} is that @code{mpn_mul_1} is
much faster than @code{mpn_divrem_1} (often by a factor of 5, or more).

@need 1000
@node Other Algorithms, Assembly Coding, Radix Conversion Algorithms, Algorithms
@section Other Algorithms

@menu
* Prime Testing Algorithm::
* Factorial Algorithm::
* Binomial Coefficients Algorithm::
* Fibonacci Numbers Algorithm::
* Lucas Numbers Algorithm::
* Random Number Algorithms::
@end menu

@node Prime Testing Algorithm, Factorial Algorithm, Other Algorithms, Other Algorithms
@subsection Prime Testing
@cindex Prime testing algorithms

The primality testing in @code{mpz_probab_prime_p} (@pxref{Number Theoretic
Functions}) first does some trial division by small factors and then uses the
Miller-Rabin probabilistic primality testing algorithm, as described in Knuth
section 4.5.4 algorithm P (@pxref{References}).

For an odd input @math{n}, and with @math{n = q@GMPmultiply{}2^k+1} where
@math{q} is odd, this algorithm selects a random base @math{x} and tests
whether @math{x^q @bmod{} n} is 1 or @math{-1}, or an @m{x^{q2^j} \bmod n,
x^(q*2^j) mod n} is @math{-1}, for @math{1@le{}j@le{}k-1}.  If so then @math{n}
is probably prime, if not then @math{n} is definitely composite.
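
A word-size rendition of one round of the test (a sketch using the GCC/Clang
@code{__int128} extension for the modular products; @code{mpz_probab_prime_p}
of course works on multi-limb numbers):

@example
#include <stdint.h>

/* One Miller-Rabin round: 1 if odd n > 2 is a probable prime to
   base x (1 < x < n-1), 0 if definitely composite.  */
int
strong_probable_prime (uint64_t n, uint64_t x)
@{
  uint64_t q = n - 1;
  int k = __builtin_ctzll (q);
  uint64_t e;
  int j;
  q >>= k;                              /* n-1 = q * 2^k, q odd */

  uint64_t y = 1;
  uint64_t b = x % n;
  for (e = q; e != 0; e >>= 1)          /* y = x^q mod n */
    @{
      if (e & 1)
        y = (uint64_t) ((unsigned __int128) y * b % n);
      b = (uint64_t) ((unsigned __int128) b * b % n);
    @}
  if (y == 1 || y == n - 1)
    return 1;
  for (j = 1; j < k; j++)               /* look for x^(q*2^j) = -1 */
    @{
      y = (uint64_t) ((unsigned __int128) y * y % n);
      if (y == n - 1)
        return 1;
    @}
  return 0;
@}
@end example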

Any prime @math{n} will pass the test, but some composites do too.  Such
composites are known as strong pseudoprimes to base @math{x}.  No @math{n} is
a strong pseudoprime to more than @math{1/4} of all bases (see Knuth exercise
22), hence with @math{x} chosen at random there's no more than a @math{1/4}
chance a ``probable prime'' will in fact be composite.

In fact strong pseudoprimes are quite rare, making the test much more
powerful than this analysis would suggest, but @math{1/4} is all that's proven
for an arbitrary @math{n}.

@node Factorial Algorithm, Binomial Coefficients Algorithm, Prime Testing Algorithm, Other Algorithms
@subsection Factorial
@cindex Factorial algorithm

Factorials are calculated by a combination of removal of twos, powering, and
binary splitting.  The procedure can be best illustrated with an example,

@quotation
@math{23! = 1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23}
@end quotation

@noindent
has factors of two removed,

@quotation
@math{23! = 2^{19}.1.1.3.1.5.3.7.1.9.5.11.3.13.7.15.1.17.9.19.5.21.11.23}
@end quotation

@noindent
and the resulting terms collected up according to their multiplicity,

@quotation
@math{23! = 2^{19}.(3.5)^3.(7.9.11)^2.(13.15.17.19.21.23)}
@end quotation

Each sequence such as @math{13.15.17.19.21.23} is evaluated by splitting into
every second term, as for instance @math{(13.17.21).(15.19.23)}, and the same
recursively on each half.  This is implemented iteratively using some bit
twiddling.
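
The splitting can be sketched recursively (word-size arithmetic only, so this
overflows quickly; GMP's iterative bit-twiddling version is equivalent):

@example
/* Product of t, t+step, t+2*step, ... below end, splitting into
   every second term: prod_seq (13, 24, 2) computes
   13.15.17.19.21.23 split as (13.17.21)*(15.19.23).  */
unsigned long long
prod_seq (unsigned long long t, unsigned long long end,
          unsigned long long step)
@{
  if (t + step >= end)
    return t;                           /* single term left */
  return prod_seq (t, end, 2 * step)
         * prod_seq (t + step, end, 2 * step);
@}
@end example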

Such splitting is more efficient than repeated N@cross{}1 multiplies since it
forms big multiplies, allowing Karatsuba and higher algorithms to be used.
And even below the Karatsuba threshold a big block of work can be more
efficient for the basecase algorithm.

Splitting into subsequences of every second term keeps the resulting products
more nearly equal in size than would the simpler approach of say taking the
first half and second half of the sequence.  Nearly equal products are more
efficient for the current multiply implementation.

@node Binomial Coefficients Algorithm, Fibonacci Numbers Algorithm, Factorial Algorithm, Other Algorithms
@subsection Binomial Coefficients
@cindex Binomial coefficient algorithm

Binomial coefficients @m{\left({n}\atop{k}\right), C(n@C{}k)} are calculated
by first arranging @math{k @le{} n/2} using @m{\left({n}\atop{k}\right) =
\left({n}\atop{n-k}\right), C(n@C{}k) = C(n@C{}n-k)} if necessary, and then
evaluating the following product simply from @math{i=2} to @math{i=k}.
@tex
$$ \left({n}\atop{k}\right) = (n-k+1) \prod_{i=2}^{k} {{n-k+i} \over i} $$
@end tex
@ifnottex

@example
                      k  (n-k+i)
C(n,k) = (n-k+1) * prod -------
                     i=2    i
@end example

@end ifnottex
- It's easy to show that each denominator @math{i} will divide the product so
- far, so the exact division algorithm is used (@pxref{Exact Division}).
- The numerators @math{n-k+i} and denominators @math{i} are first accumulated
- into as many as fit in a limb, to save multi-precision operations, though for
- @code{mpz_bin_ui} this applies only to the divisors, since @math{n} is an
- @code{mpz_t} and @math{n-k+i} in general won't fit in a limb at all.
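- As an illustration, the product scheme without the limb-sized accumulation
- is simply the following; @code{mpz_divexact_ui} is the exact division
- mentioned.
- @example
- #include <gmp.h>
- /* C(n,k) by the product formula, for 0 <= k <= n.  */
- void
- binomial (mpz_t r, unsigned long n, unsigned long k)
- @{
-   unsigned long i;
-   if (k > n - k)
-     k = n - k;                    /* C(n,k) = C(n,n-k) */
-   mpz_set_ui (r, k == 0 ? 1 : n - k + 1);
-   for (i = 2; i <= k; i++)
-     @{
-       mpz_mul_ui (r, r, n - k + i);
-       mpz_divexact_ui (r, r, i);  /* i divides the product so far */
-     @}
- @}
- @end example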
- @node Fibonacci Numbers Algorithm, Lucas Numbers Algorithm, Binomial Coefficients Algorithm, Other Algorithms
- @subsection Fibonacci Numbers
- @cindex Fibonacci number algorithm
- The Fibonacci functions @code{mpz_fib_ui} and @code{mpz_fib2_ui} are designed
- for calculating isolated @m{F_n,F[n]} or @m{F_n,F[n]},@m{F_{n-1},F[n-1]}
- values efficiently.
- For small @math{n}, a table of single limb values in @code{__gmp_fib_table} is
- used. On a 32-bit limb this goes up to @m{F_{47},F[47]}, or on a 64-bit limb
- up to @m{F_{93},F[93]}. For convenience the table starts at @m{F_{-1},F[-1]}.
- Beyond the table, values are generated with a binary powering algorithm,
- calculating a pair @m{F_n,F[n]} and @m{F_{n-1},F[n-1]} working from high to
- low across the bits of @math{n}. The formulas used are
- @tex
- $$\eqalign{
-  F_{2k+1} &= 4F_k^2 - F_{k-1}^2 + 2(-1)^k \cr
-  F_{2k-1} &= F_k^2 + F_{k-1}^2 \cr
-  F_{2k} &= F_{2k+1} - F_{2k-1}
- }$$
- @end tex
- @ifnottex
- @example
- F[2k+1] = 4*F[k]^2 - F[k-1]^2 + 2*(-1)^k
- F[2k-1] = F[k]^2 + F[k-1]^2
- F[2k] = F[2k+1] - F[2k-1]
- @end example
- @end ifnottex
- At each step, @math{k} is the high @math{b} bits of @math{n}. If the next bit
- of @math{n} is 0 then @m{F_{2k},F[2k]},@m{F_{2k-1},F[2k-1]} is used, or if
- it's a 1 then @m{F_{2k+1},F[2k+1]},@m{F_{2k},F[2k]} is used, and the process
- repeated until all bits of @math{n} are incorporated. Notice these formulas
- require just two squares per bit of @math{n}.
- It'd be possible to handle the first few @math{n} above the single limb table
- with simple additions, using the defining Fibonacci recurrence @m{F_{k+1} =
- F_k + F_{k-1}, F[k+1]=F[k]+F[k-1]}, but this is not done since it usually
- turns out to be faster for only about 10 or 20 values of @math{n}, and
- including a block of code for just those doesn't seem worthwhile. If they
- really mattered it'd be better to extend the data table.
- Using a table avoids lots of calculations on small numbers, and makes small
- @math{n} go fast. A bigger table would make more small @math{n} go fast; it's
- just a question of balancing size against desired speed. For GMP the code is
- kept compact, with the emphasis primarily on a good powering algorithm.
- @code{mpz_fib2_ui} returns both @m{F_n,F[n]} and @m{F_{n-1},F[n-1]}, but
- @code{mpz_fib_ui} is only interested in @m{F_n,F[n]}. In this case the last
- step of the algorithm can become one multiply instead of two squares. One of
- the following two formulas is used, according as @math{n} is odd or even.
- @tex
- $$\eqalign{
-  F_{2k} &= F_k (F_k + 2F_{k-1}) \cr
-  F_{2k+1} &= (2F_k + F_{k-1}) (2F_k - F_{k-1}) + 2(-1)^k
- }$$
- @end tex
- @ifnottex
- @example
- F[2k]   = F[k]*(F[k]+2*F[k-1])
- F[2k+1] = (2F[k]+F[k-1])*(2F[k]-F[k-1]) + 2*(-1)^k
- @end example
- @end ifnottex
- @m{F_{2k+1},F[2k+1]} here is the same as above, just rearranged to be a
- multiply. For interest, the @m{2(-1)^k, 2*(-1)^k} term both here and above
- can be applied just to the low limb of the calculation, without a carry or
- borrow into further limbs, which saves some code size. See comments with
- @code{mpz_fib_ui} and the internal @code{mpn_fib2_ui} for how this is done.
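- For illustration, the powering can be coded at the @code{mpz} level as
- follows. The real code works with @code{mpn} functions and starts from the
- table rather than from @m{F_1,F[1]}.
- @example
- #include <gmp.h>
- /* F[n] and F[n-1], for n >= 1, by binary powering.  */
- void
- fib2 (mpz_t fn, mpz_t fnm1, unsigned long n)
- @{
-   mpz_t sk, skm1, f2kp1, f2km1;
-   unsigned long k = 1;
-   int bit;
-   mpz_init (sk); mpz_init (skm1);
-   mpz_init (f2kp1); mpz_init (f2km1);
-   mpz_set_ui (fn, 1);                    /* F[1] */
-   mpz_set_ui (fnm1, 0);                  /* F[0] */
-   for (bit = 0; (n >> bit) > 1; bit++)   /* find the top bit of n */
-     ;
-   while (--bit >= 0)
-     @{
-       mpz_mul (sk, fn, fn);              /* F[k]^2, a square */
-       mpz_mul (skm1, fnm1, fnm1);        /* F[k-1]^2, a square */
-       mpz_add (f2km1, sk, skm1);         /* F[2k-1] */
-       mpz_mul_2exp (sk, sk, 2);
-       mpz_sub (f2kp1, sk, skm1);         /* 4*F[k]^2 - F[k-1]^2 */
-       if (k & 1)                         /* ... + 2*(-1)^k */
-         mpz_sub_ui (f2kp1, f2kp1, 2);
-       else
-         mpz_add_ui (f2kp1, f2kp1, 2);
-       if ((n >> bit) & 1)
-         @{
-           mpz_sub (fnm1, f2kp1, f2km1);  /* F[2k] */
-           mpz_set (fn, f2kp1);           /* F[2k+1] */
-           k = 2 * k + 1;
-         @}
-       else
-         @{
-           mpz_sub (fn, f2kp1, f2km1);    /* F[2k] */
-           mpz_set (fnm1, f2km1);         /* F[2k-1] */
-           k = 2 * k;
-         @}
-     @}
-   mpz_clear (sk); mpz_clear (skm1);
-   mpz_clear (f2kp1); mpz_clear (f2km1);
- @}
- @end example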
- @node Lucas Numbers Algorithm, Random Number Algorithms, Fibonacci Numbers Algorithm, Other Algorithms
- @subsection Lucas Numbers
- @cindex Lucas number algorithm
- @code{mpz_lucnum2_ui} derives a pair of Lucas numbers from a pair of Fibonacci
- numbers with the following simple formulas.
- @tex
- $$\eqalign{
-  L_k &= F_k + 2F_{k-1} \cr
-  L_{k-1} &= 2F_k - F_{k-1}
- }$$
- @end tex
- @ifnottex
- @example
- L[k] = F[k] + 2*F[k-1]
- L[k-1] = 2*F[k] - F[k-1]
- @end example
- @end ifnottex
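- In @code{mpz} terms, and given a Fibonacci pair routine such as the
- @code{fib2} sketch in the previous section, this is simply the following.
- @example
- /* L[n] and L[n-1] from the pair F[n], F[n-1].  */
- void
- lucnum2 (mpz_t ln, mpz_t lnm1, unsigned long n)
- @{
-   mpz_t fn, fnm1;
-   mpz_init (fn); mpz_init (fnm1);
-   fib2 (fn, fnm1, n);
-   mpz_mul_2exp (ln, fnm1, 1);
-   mpz_add (ln, ln, fn);          /* L[k] = F[k] + 2*F[k-1] */
-   mpz_mul_2exp (lnm1, fn, 1);
-   mpz_sub (lnm1, lnm1, fnm1);    /* L[k-1] = 2*F[k] - F[k-1] */
-   mpz_clear (fn); mpz_clear (fnm1);
- @}
- @end example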
- @code{mpz_lucnum_ui} is only interested in @m{L_n,L[n]}, and some work can be
- saved. Trailing zero bits on @math{n} can be handled with a single square
- each.
- @tex
- $$ L_{2k} = L_k^2 - 2(-1)^k $$
- @end tex
- @ifnottex
- @example
- L[2k] = L[k]^2 - 2*(-1)^k
- @end example
- @end ifnottex
- And the lowest 1 bit can be handled with one multiply of a pair of Fibonacci
- numbers, similar to what @code{mpz_fib_ui} does.
- @tex
- $$ L_{2k+1} = 5F_{k-1} (2F_k + F_{k-1}) - 4(-1)^k $$
- @end tex
- @ifnottex
- @example
- L[2k+1] = 5*F[k-1]*(2*F[k]+F[k-1]) - 4*(-1)^k
- @end example
- @end ifnottex
- @node Random Number Algorithms, , Lucas Numbers Algorithm, Other Algorithms
- @subsection Random Numbers
- @cindex Random number algorithms
- For the @code{urandomb} functions, random numbers are generated simply by
- concatenating bits produced by the generator. As long as the generator has
- good randomness properties this will produce well-distributed @math{N} bit
- numbers.
- For the @code{urandomm} functions, random numbers in a range @math{0@le{}R<N}
- are generated by taking values @math{R} of @m{lceil log_2 N rceil,
- ceil(log2(N))} bits each until one satisfies @math{R<N}. This will normally
- require only one or two attempts, but the attempts are limited in case the
- generator is somehow degenerate and produces only 1 bits or similar.
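- A sketch of that rejection loop using the documented functions follows; the
- real code also bounds the number of attempts, as noted.
- @example
- #include <gmp.h>
- /* Uniform random 0 <= r < n by rejection.  */
- void
- urandomm_sketch (mpz_t r, gmp_randstate_t state, const mpz_t n)
- @{
-   mp_bitcnt_t bits = mpz_sizeinbase (n, 2);  /* floor(log2(n))+1 */
-   do
-     mpz_urandomb (r, state, bits);
-   while (mpz_cmp (r, n) >= 0);               /* reject r >= n */
- @}
- @end example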
- @cindex Mersenne twister algorithm
- The Mersenne Twister generator is by Matsumoto and Nishimura
- (@pxref{References}). It has a non-repeating period of @math{2^@W{19937}-1},
- which is a Mersenne prime, hence the name of the generator. The state is 624
- words of 32-bits each, which is iterated with one XOR and shift for each
- 32-bit word generated, making the algorithm very fast. Randomness properties
- are also very good and this is the default algorithm used by GMP.
- @cindex Linear congruential algorithm
- Linear congruential generators are described in many text books, for instance
- Knuth volume 2 (@pxref{References}). With a modulus @math{M} and parameters
- @math{A} and @math{C}, an integer state @math{S} is iterated by the formula
- @math{S @leftarrow{} A@GMPmultiply{}S+C @bmod{} M}. At each step the new
- state is a linear function of the previous, mod @math{M}, hence the name of
- the generator.
- In GMP only moduli of the form @math{2^N} are supported, and the current
- implementation is not as well optimized as it could be. Overheads are
- significant when @math{N} is small, and when @math{N} is large clearly the
- multiply at each step will become slow. This is not a big concern, since the
- Mersenne Twister generator is better in every respect and is therefore
- recommended for all normal applications.
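- For reference, one step of the iteration with a @math{2^N} modulus is just
- a multiply, an add, and a truncation to the low @math{N} bits.
- @example
- #include <gmp.h>
- /* S <- A*S + C mod 2^N  */
- void
- lc_step (mpz_t s, const mpz_t a, const mpz_t c, mp_bitcnt_t n)
- @{
-   mpz_mul (s, s, a);
-   mpz_add (s, s, c);
-   mpz_fdiv_r_2exp (s, s, n);   /* keep the low N bits */
- @}
- @end example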
- For both generators the current state can be deduced by observing enough
- output and applying some linear algebra (over GF(2) in the case of the
- Mersenne Twister). This generally means raw output is unsuitable for
- cryptographic applications without further hashing or the like.
- @node Assembly Coding, , Other Algorithms, Algorithms
- @section Assembly Coding
- @cindex Assembly coding
- The assembly subroutines in GMP are the most significant source of speed at
- small to moderate sizes. At larger sizes algorithm selection becomes more
- important, but of course speedups in low level routines will still speed up
- everything proportionally.
- Carry handling and widening multiplies that are important for GMP can't be
- easily expressed in C@. GCC @code{asm} blocks help a lot and are provided in
- @file{longlong.h}, but hand coding low level routines invariably offers a
- speedup over generic C by a factor of anything from 2 to 10.
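- For example, a 64-bit widening multiply in the style of @file{longlong.h}'s
- @code{umul_ppmm} looks like the following on x86-64 (a sketch only).
- @example
- /* hi:lo = a*b, all full 64-bit operands.  */
- #define umul_ppmm_sketch(hi, lo, a, b)          \
-   __asm__ ("mulq %3"                            \
-            : "=a" (lo), "=d" (hi)               \
-            : "%0" (a), "rm" (b))
- @end example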
- @menu
- * Assembly Code Organisation::
- * Assembly Basics::
- * Assembly Carry Propagation::
- * Assembly Cache Handling::
- * Assembly Functional Units::
- * Assembly Floating Point::
- * Assembly SIMD Instructions::
- * Assembly Software Pipelining::
- * Assembly Loop Unrolling::
- * Assembly Writing Guide::
- @end menu
- @node Assembly Code Organisation, Assembly Basics, Assembly Coding, Assembly Coding
- @subsection Code Organisation
- @cindex Assembly code organisation
- @cindex Code organisation
- The various @file{mpn} subdirectories contain machine-dependent code, written
- in C or assembly. The @file{mpn/generic} subdirectory contains default code,
- used when there's no machine-specific version of a particular file.
- Each @file{mpn} subdirectory is for an ISA family. Generally 32-bit and
- 64-bit variants in a family cannot share code and have separate directories.
- Within a family further subdirectories may exist for CPU variants.
- In each directory a @file{nails} subdirectory may exist, holding code with
- nails support for that CPU variant. A @code{NAILS_SUPPORT} directive in each
- file indicates the nails values the code handles. Nails code only exists
- where it's faster, or promises to be faster, than plain code. There's no
- effort put into nails if they're not going to enhance a given CPU.
- @node Assembly Basics, Assembly Carry Propagation, Assembly Code Organisation, Assembly Coding
- @subsection Assembly Basics
- @code{mpn_addmul_1} and @code{mpn_submul_1} are the most important routines
- for overall GMP performance. All multiplications and divisions come down to
- repeated calls to these. @code{mpn_add_n}, @code{mpn_sub_n},
- @code{mpn_lshift} and @code{mpn_rshift} are next most important.
- On some CPUs assembly versions of the internal functions
- @code{mpn_mul_basecase} and @code{mpn_sqr_basecase} give significant speedups,
- mainly through avoiding function call overheads. They can also potentially
- make better use of a wide superscalar processor, as can bigger primitives like
- @code{mpn_addmul_2} or @code{mpn_addmul_4}.
- The restrictions on overlaps between sources and destinations
- (@pxref{Low-level Functions}) are designed to facilitate a variety of
- implementations. For example, knowing @code{mpn_add_n} won't have partly
- overlapping sources and destination means reading can be done far ahead of
- writing on superscalar processors, and loops can be vectorized on a vector
- processor, depending on the carry handling.
- @node Assembly Carry Propagation, Assembly Cache Handling, Assembly Basics, Assembly Coding
- @subsection Carry Propagation
- @cindex Assembly carry propagation
- The problem that presents most challenges in GMP is propagating carries from
- one limb to the next. In functions like @code{mpn_addmul_1} and
- @code{mpn_add_n}, carries are the only dependencies between limb operations.
- On processors with carry flags, a straightforward CISC style @code{adc} is
- generally best. AMD K6 @code{mpn_addmul_1} however is an example of an
- unusual set of circumstances where a branch works out better.
- On RISC processors generally an add and compare for overflow is used. This
- sort of thing can be seen in @file{mpn/generic/aors_n.c}. Some carry
- propagation schemes require 4 instructions, meaning at least 4 cycles per
- limb, but other schemes may use just 1 or 2. On wide superscalar processors
- performance may be completely determined by the number of dependent
- instructions between carry-in and carry-out for each limb.
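- In C the add-and-compare idiom looks like the following sketch, along the
- lines of @file{mpn/generic/aors_n.c}.
- @example
- #include <gmp.h>
- /* rp[] = up[] + vp[], returning the carry out of the top limb.  */
- mp_limb_t
- add_n_sketch (mp_limb_t *rp, const mp_limb_t *up,
-               const mp_limb_t *vp, mp_size_t n)
- @{
-   mp_limb_t cy = 0;
-   mp_size_t i;
-   for (i = 0; i < n; i++)
-     @{
-       mp_limb_t ul = up[i], vl = vp[i], sl, rl;
-       sl = ul + vl;                /* wraps on overflow */
-       rl = sl + cy;
-       cy = (sl < ul) | (rl < sl);  /* the compare detects a carry */
-       rp[i] = rl;
-     @}
-   return cy;
- @}
- @end example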
- On vector processors good use can be made of the fact that a carry bit only
- very rarely propagates more than one limb. When adding a single bit to a
- limb, there's only a carry out if that limb was @code{0xFF@dots{}FF} which on
- random data will be only 1 in @m{2\GMPraise{@code{mp_bits_per_limb}},
- 2^mp_bits_per_limb}. @file{mpn/cray/add_n.c} is an example of this: it adds
- all limbs in parallel, adds one set of carry bits in parallel and then only
- rarely needs to fall through to a loop propagating further carries.
- On the x86s, GCC (as of version 2.95.2) doesn't generate particularly good code
- for the RISC style idioms that are necessary to handle carry bits in
- C@. Often conditional jumps are generated where @code{adc} or @code{sbb} forms
- would be better. And so unfortunately almost any loop involving carry bits
- needs to be coded in assembly for best results.
- @node Assembly Cache Handling, Assembly Functional Units, Assembly Carry Propagation, Assembly Coding
- @subsection Cache Handling
- @cindex Assembly cache handling
- GMP aims to perform well both on operands that fit entirely in L1 cache and
- those which don't.
- Basic routines like @code{mpn_add_n} or @code{mpn_lshift} are often used on
- large operands, so L2 and main memory performance is important for them.
- @code{mpn_mul_1} and @code{mpn_addmul_1} are mostly used for multiply and
- square basecases, so L1 performance matters most for them, unless assembly
- versions of @code{mpn_mul_basecase} and @code{mpn_sqr_basecase} exist, in
- which case the remaining uses are mostly for larger operands.
- For L2 or main memory operands, memory access times will almost certainly be
- more than the calculation time. The aim therefore is to maximize memory
- throughput, by starting a load of the next cache line while processing the
- contents of the previous one. Clearly this is only possible if the chip has a
- lock-up free cache or some sort of prefetch instruction. Most current chips
- have both these features.
- Prefetching sources combines well with loop unrolling, since a prefetch can be
- initiated once per unrolled loop (or more than once if the loop covers more
- than one cache line).
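- Illustratively, with GCC's @code{__builtin_prefetch} and a 4-limb unroll
- this might look as follows; the distance of 16 limbs is arbitrary and would
- be tuned.
- @example
- #include <gmp.h>
- /* Sum the limbs, n assumed a multiple of 4 for simplicity.  */
- mp_limb_t
- sum_limbs (const mp_limb_t *up, mp_size_t n)
- @{
-   mp_limb_t s = 0;
-   mp_size_t i;
-   for (i = 0; i < n; i += 4)
-     @{
-       __builtin_prefetch (up + i + 16, 0);  /* read, 16 limbs ahead */
-       s += up[i] + up[i+1] + up[i+2] + up[i+3];
-     @}
-   return s;
- @}
- @end example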
- On CPUs without write-allocate caches, prefetching destinations will ensure
- individual stores don't go further down the cache hierarchy, which would
- limit bandwidth. Of course for calculations which are slow anyway, like
- @code{mpn_divrem_1}, write-throughs might be fine.
- The distance ahead to prefetch will be determined by memory latency versus
- throughput. The aim of course is to have data arriving continuously, at peak
- throughput. Some CPUs have limits on the number of fetches or prefetches in
- progress.
- If a special prefetch instruction doesn't exist then a plain load can be used,
- but in that case care must be taken not to attempt to read past the end of an
- operand, since that might produce a segmentation violation.
- Some CPUs or systems have hardware that detects sequential memory accesses and
- initiates suitable cache movements automatically, making life easy.
- @node Assembly Functional Units, Assembly Floating Point, Assembly Cache Handling, Assembly Coding
- @subsection Functional Units
- When choosing an approach for an assembly loop, consideration is given to
- what operations can execute simultaneously and what throughput can thereby be
- achieved. In some cases an algorithm can be tweaked to accommodate available
- resources.
- Loop control will generally require a counter and pointer updates, costing as
- much as 5 instructions, plus any delays a branch introduces. CPU addressing
- modes might reduce pointer updates, perhaps by allowing just one updating
- pointer and others expressed as offsets from it, or on CISC chips with all
- addressing done with the loop counter as a scaled index.
- The final loop control cost can be amortised by processing several limbs in
- each iteration (@pxref{Assembly Loop Unrolling}). This at least ensures loop
- control isn't a big fraction of the work done.
- Memory throughput is always a limit. If perhaps only one load or one store
- can be done per cycle then 3 cycles/limb will be the top speed for ``binary''
- operations like @code{mpn_add_n}, and any code achieving that is optimal.
- Integer resources can be freed up by having the loop counter in a float
- register, or by pressing the float units into use for some multiplying,
- perhaps doing every second limb on the float side (@pxref{Assembly Floating
- Point}).
- Float resources can be freed up by doing carry propagation on the integer
- side, or even by doing integer to float conversions in integers using bit
- twiddling.
- @node Assembly Floating Point, Assembly SIMD Instructions, Assembly Functional Units, Assembly Coding
- @subsection Floating Point
- @cindex Assembly floating Point
- Floating point arithmetic is used in GMP for multiplications on CPUs with poor
- integer multipliers. It's mostly useful for @code{mpn_mul_1},
- @code{mpn_addmul_1} and @code{mpn_submul_1} on 64-bit machines, and
- @code{mpn_mul_basecase} on both 32-bit and 64-bit machines.
- With IEEE 53-bit double precision floats, integer multiplications producing up
- to 53 bits will give exact results. Breaking a 64@cross{}64 multiplication
- into eight 16@cross{}@math{32@rightarrow{}48} bit pieces is convenient. With
- some care though six 21@cross{}@math{32@rightarrow{}53} bit products can be
- used, if one of the lower two 21-bit pieces also uses the sign bit.
- For the @code{mpn_mul_1} family of functions on a 64-bit machine, the
- invariant single limb is split at the start, into 3 or 4 pieces. Inside the
- loop, the bignum operand is split into 32-bit pieces. Fast conversion of
- these unsigned 32-bit pieces to floating point is highly machine-dependent.
- In some cases, reading the data into the integer unit, zero-extending to
- 64 bits, then transferring it back to the floating point unit via memory is
- the only option.
- Converting partial products back to 64-bit limbs is usually best done as a
- signed conversion. Since all values are smaller than @m{2^{53},2^53}, signed
- and unsigned are the same, but most processors lack unsigned conversions.
- @sp 2
- Here is a diagram showing 16@cross{}32 bit products for an @code{mpn_mul_1} or
- @code{mpn_addmul_1} with a 64-bit limb. The single limb operand V is split
- into four 16-bit parts. The multi-limb operand U is split in the loop into
- two 32-bit parts.
- @tex
- \global\newdimen\GMPbits \global\GMPbits=0.18em
- \def\GMPbox#1#2#3{%
-   \hbox{%
-     \hbox to 128\GMPbits{\hfil
-       \vbox{%
-         \hrule
-         \hbox to 48\GMPbits {\GMPvrule \hfil$#2$\hfil \vrule}%
-         \hrule}%
-       \hskip #1\GMPbits}%
-     \raise \GMPboxdepth \hbox{\hskip 2em #3}}}
- %
- \GMPdisplay{%
- \vbox{%
-   \hbox{%
-     \hbox to 128\GMPbits {\hfil
-       \vbox{%
-         \hrule
-         \hbox to 64\GMPbits{%
-           \GMPvrule \hfil$v48$\hfil
-           \vrule \hfil$v32$\hfil
-           \vrule \hfil$v16$\hfil
-           \vrule \hfil$v00$\hfil
-           \vrule}
-         \hrule}}%
-     \raise \GMPboxdepth \hbox{\hskip 2em V Operand}}
-   \vskip 0.5ex
-   \hbox{%
-     \hbox to 128\GMPbits {\hfil
-       \raise \GMPboxdepth \hbox{$\times$\hskip 1.5em}%
-       \vbox{%
-         \hrule
-         \hbox to 64\GMPbits {%
-           \GMPvrule \hfil$u32$\hfil
-           \vrule \hfil$u00$\hfil
-           \vrule}%
-         \hrule}}%
-     \raise \GMPboxdepth \hbox{\hskip 2em U Operand (one limb)}}%
-   \vskip 0.5ex
-   \hbox{\vbox to 2ex{\hrule width 128\GMPbits}}%
-   \GMPbox{0}{u00 \times v00}{$p00$\hskip 1.5em 48-bit products}%
-   \vskip 0.5ex
-   \GMPbox{16}{u00 \times v16}{$p16$}
-   \vskip 0.5ex
-   \GMPbox{32}{u00 \times v32}{$p32$}
-   \vskip 0.5ex
-   \GMPbox{48}{u00 \times v48}{$p48$}
-   \vskip 0.5ex
-   \GMPbox{32}{u32 \times v00}{$r32$}
-   \vskip 0.5ex
-   \GMPbox{48}{u32 \times v16}{$r48$}
-   \vskip 0.5ex
-   \GMPbox{64}{u32 \times v32}{$r64$}
-   \vskip 0.5ex
-   \GMPbox{80}{u32 \times v48}{$r80$}
- }}
- @end tex
- @ifnottex
- @example
- @group
- +---+---+---+---+
- |v48|v32|v16|v00| V operand
- +---+---+---+---+
- +-------+---+---+
- x | u32 | u00 | U operand (one limb)
- +---------------+
- ---------------------------------
- +-----------+
- | u00 x v00 | p00 48-bit products
- +-----------+
- +-----------+
- | u00 x v16 | p16
- +-----------+
- +-----------+
- | u00 x v32 | p32
- +-----------+
- +-----------+
- | u00 x v48 | p48
- +-----------+
- +-----------+
- | u32 x v00 | r32
- +-----------+
- +-----------+
- | u32 x v16 | r48
- +-----------+
- +-----------+
- | u32 x v32 | r64
- +-----------+
- +-----------+
- | u32 x v48 | r80
- +-----------+
- @end group
- @end example
- @end ifnottex
- @math{p32} and @math{r32} can be summed using floating-point addition, and
- likewise @math{p48} and @math{r48}. @math{p00} and @math{p16} can be summed
- with @math{r64} and @math{r80} from the previous iteration.
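- In C, and assuming a 64-bit @code{mp_limb_t}, the pieces and sums of one
- iteration might be sketched as follows (conversions to and from the
- integer unit omitted).
- @example
- #include <gmp.h>
- /* Eight exact 48-bit products for one limb u of U times the split
-    invariant limb v of V; each p+r sum is at most 49 bits, still
-    exact in a 53-bit double.  */
- void
- split_products (double p[4], mp_limb_t u, mp_limb_t v)
- @{
-   double v00 = (double) (v & 0xffff);
-   double v16 = (double) ((v >> 16) & 0xffff);
-   double v32 = (double) ((v >> 32) & 0xffff);
-   double v48 = (double) (v >> 48);
-   double u00 = (double) (u & 0xffffffff);
-   double u32 = (double) (u >> 32);
-   p[0] = u00 * v00;              /* p00, plus r64 of the previous limb */
-   p[1] = u00 * v16;              /* p16, plus r80 of the previous limb */
-   p[2] = u00 * v32 + u32 * v00;  /* p32 + r32 */
-   p[3] = u00 * v48 + u32 * v16;  /* p48 + r48 */
-   /* u32*v32 and u32*v48 are r64 and r80 for the next limb */
- @}
- @end example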
- For each loop then, four 49-bit quantities are transferred to the integer unit,
- aligned as follows,
- @tex
- % GMPbox here should be 49 bits wide, but use 51 to better show p16+r80'
- % crossing into the upper 64 bits.
- \def\GMPbox#1#2#3{%
-   \hbox{%
-     \hbox to 128\GMPbits {%
-       \hfil
-       \vbox{%
-         \hrule
-         \hbox to 51\GMPbits {\GMPvrule \hfil$#2$\hfil \vrule}%
-         \hrule}%
-       \hskip #1\GMPbits}%
-     \raise \GMPboxdepth \hbox{\hskip 1.5em $#3$\hfil}%
- }}
- \newbox\b \setbox\b\hbox{64 bits}%
- \newdimen\bw \bw=\wd\b \advance\bw by 2em
- \newdimen\x \x=128\GMPbits
- \advance\x by -2\bw
- \divide\x by4
- \GMPdisplay{%
- \vbox{%
-   \hbox to 128\GMPbits {%
-     \GMPvrule
-     \raise 0.5ex \vbox{\hrule \hbox to \x {}}%
-     \hfil 64 bits\hfil
-     \raise 0.5ex \vbox{\hrule \hbox to \x {}}%
-     \vrule
-     \raise 0.5ex \vbox{\hrule \hbox to \x {}}%
-     \hfil 64 bits\hfil
-     \raise 0.5ex \vbox{\hrule \hbox to \x {}}%
-     \vrule}%
-   \vskip 0.7ex
-   \GMPbox{0}{p00+r64'}{i00}
-   \vskip 0.5ex
-   \GMPbox{16}{p16+r80'}{i16}
-   \vskip 0.5ex
-   \GMPbox{32}{p32+r32}{i32}
-   \vskip 0.5ex
-   \GMPbox{48}{p48+r48}{i48}
- }}
- @end tex
- @ifnottex
- @example
- @group
- |-----64bits----|-----64bits----|
- +------------+
- | p00 + r64' | i00
- +------------+
- +------------+
- | p16 + r80' | i16
- +------------+
- +------------+
- | p32 + r32 | i32
- +------------+
- +------------+
- | p48 + r48 | i48
- +------------+
- @end group
- @end example
- @end ifnottex
- The challenge then is to sum these efficiently and add in a carry limb,
- generating a low 64-bit result limb and a high 33-bit carry limb (@math{i48}
- extends 33 bits into the high half).
- @node Assembly SIMD Instructions, Assembly Software Pipelining, Assembly Floating Point, Assembly Coding
- @subsection SIMD Instructions
- @cindex Assembly SIMD
- The single-instruction multiple-data support in current microprocessors is
- aimed at signal processing algorithms where each data point can be treated
- more or less independently. There's generally not much support for
- propagating the sort of carries that arise in GMP.
- SIMD multiplications of say four 16@cross{}16 bit multiplies only do as much
- work as one 32@cross{}32 from GMP's point of view, and need some shifts and
- adds besides. But of course if say the SIMD form is fully pipelined and uses
- less instruction decoding then it may still be worthwhile.
- On the x86 chips, MMX has so far found a use in @code{mpn_rshift} and
- @code{mpn_lshift}, and is used in a special case for 16-bit multipliers in the
- P55 @code{mpn_mul_1}. SSE2 is used for Pentium 4 @code{mpn_mul_1},
- @code{mpn_addmul_1}, and @code{mpn_submul_1}.
- @node Assembly Software Pipelining, Assembly Loop Unrolling, Assembly SIMD Instructions, Assembly Coding
- @subsection Software Pipelining
- @cindex Assembly software pipelining
- Software pipelining consists of scheduling instructions around the branch
- point in a loop. For example a loop might issue a load not for use in the
- present iteration but the next, thereby allowing extra cycles for the data to
- arrive from memory.
- Naturally this is wanted only when doing things like loads or multiplies that
- take several cycles to complete, and only where a CPU has multiple functional
- units so that other work can be done in the meantime.
- A pipeline with several stages will have a data value in progress at each
- stage and each loop iteration moves them along one stage. This is like
- juggling.
- If the latency of some instruction is greater than the loop time then it will
- be necessary to unroll, so one register has a result ready to use while
- another (or multiple others) are still in progress. (@pxref{Assembly Loop
- Unrolling}).
- @node Assembly Loop Unrolling, Assembly Writing Guide, Assembly Software Pipelining, Assembly Coding
- @subsection Loop Unrolling
- @cindex Assembly loop unrolling
- Loop unrolling consists of replicating code so that several limbs are
- processed in each loop. At a minimum this reduces loop overheads by a
- corresponding factor, but it can also allow better register usage, for example
- alternately using one register combination and then another. Judicious use of
- @command{m4} macros can help avoid lots of duplication in the source code.
- Any amount of unrolling can be handled with a loop counter that's decremented
- by @math{N} each time, stopping when the remaining count is less than the
- further @math{N} the loop will process. Or by subtracting @math{N} at the
- start, the termination condition becomes when the counter @math{C} is less
- than 0 (and the count of remaining limbs is @math{C+N}).
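- For example, with a 4-limb unroll the biased counter scheme is as follows
- (a copy loop for illustration; @code{mp_size_t} is a signed type).
- @example
- #include <gmp.h>
- void
- copy_unrolled (mp_limb_t *rp, const mp_limb_t *up, mp_size_t n)
- @{
-   mp_size_t c = n - 4;           /* subtract N at the start */
-   mp_size_t i = 0;
-   while (c >= 0)                 /* 4-limb unrolled body */
-     @{
-       rp[i] = up[i];     rp[i+1] = up[i+1];
-       rp[i+2] = up[i+2]; rp[i+3] = up[i+3];
-       i += 4;
-       c -= 4;
-     @}
-   for (c += 4; c > 0; c--, i++)  /* c+4 remaining limbs, 0 to 3 */
-     rp[i] = up[i];
- @}
- @end example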
- Alternately for a power of 2 unroll the loop count and remainder can be
- established with a shift and mask. This is convenient if also making a
- computed jump into the middle of a large loop.
- The limbs not a multiple of the unrolling can be handled in various ways, for
- example
- @itemize @bullet
- @item
- A simple loop at the end (or the start) to process the excess. Care will be
- wanted that it isn't too much slower than the unrolled part.
- @item
- A set of binary tests, for example after an 8-limb unrolling, test for 4 more
- limbs to process, then a further 2 more or not, and finally 1 more or not.
- This will probably take more code space than a simple loop.
- @item
- A @code{switch} statement, providing separate code for each possible excess,
- for example an 8-limb unrolling would have separate code for 0 remaining, 1
- remaining, etc, up to 7 remaining. This might take a lot of code, but may be
- the best way to optimize all cases in combination with a deep pipelined loop.
- @item
- A computed jump into the middle of the loop, thus making the first iteration
- handle the excess. This should make times smoothly increase with size, which
- is attractive, but setups for the jump and adjustments for pointers can be
- tricky and could become quite difficult in combination with deep pipelining.
- @end itemize
- @node Assembly Writing Guide, , Assembly Loop Unrolling, Assembly Coding
- @subsection Writing Guide
- @cindex Assembly writing guide
- This is a guide to writing software pipelined loops for processing limb
- vectors in assembly.
- First determine the algorithm and which instructions are needed. Code it
- without unrolling or scheduling, to make sure it works. On a 3-operand CPU
- try to write each new value to a new register; this will greatly simplify
- later steps.
- Then note for each instruction the functional unit and/or issue port
- requirements. If an instruction can use either of two units, like U0 or U1,
- then make a category ``U0/U1''. Count the total using each unit (or combined
- unit), and count all instructions.
- Figure out from those counts the best possible loop time. The goal will be to
- find a perfect schedule where instruction latencies are completely hidden.
- The total instruction count might be the limiting factor, or perhaps a
- particular functional unit. It might be possible to tweak the instructions to
- help the limiting factor.
- Suppose the loop time is @math{N}, then make @math{N} issue buckets, with the
- final loop branch at the end of the last. Now fill the buckets with dummy
- instructions using the functional units desired. Run this to make sure the
- intended speed is reached.
- Now replace the dummy instructions with the real instructions from the slow
- but correct loop you started with. The first will typically be a load
- instruction. Then the instruction using that value is placed in a bucket an
- appropriate distance down. Run the loop again, to check it still runs at
- target speed.
- Keep placing instructions, frequently measuring the loop. After a few you
- will need to wrap around from the last bucket back to the top of the loop. If
- you used the new-register for new-value strategy above then there will be no
- register conflicts. If not then take care not to clobber something already in
- use. Changing registers at this time is very error prone.
- The loop will overlap two or more of the original loop iterations, and the
- computation of one vector element result will be started in one iteration of
- the new loop, and completed one or several iterations later.
- The final step is to create feed-in and wind-down code for the loop. A good
- way to do this is to make a copy (or copies) of the loop at the start and
- delete those instructions which don't have valid antecedents, and at the end
- replicate and delete those whose results are unwanted (including any further
- loads).
- The loop will have a minimum number of limbs loaded and processed, so the
- feed-in code must test if the request size is smaller and skip either to a
- suitable part of the wind-down or to special code for small sizes.
- @node Internals, Contributors, Algorithms, Top
- @chapter Internals
- @cindex Internals
- @strong{This chapter is provided only for informational purposes and the
- various internals described here may change in future GMP releases.
- Applications expecting to be compatible with future releases should use only
- the documented interfaces described in previous chapters.}
- @menu
- * Integer Internals::
- * Rational Internals::
- * Float Internals::
- * Raw Output Internals::
- * C++ Interface Internals::
- @end menu
- @node Integer Internals, Rational Internals, Internals, Internals
- @section Integer Internals
- @cindex Integer internals
- @code{mpz_t} variables represent integers using sign and magnitude, in space
- dynamically allocated and reallocated. The fields are as follows.
- @table @asis
- @item @code{_mp_size}
- The number of limbs, or the negative of that when representing a negative
- integer. Zero is represented by @code{_mp_size} set to zero, in which case
- the @code{_mp_d} data is unused.
- @item @code{_mp_d}
- A pointer to an array of limbs which is the magnitude. These are stored
- ``little endian'' as per the @code{mpn} functions, so @code{_mp_d[0]} is the
- least significant limb and @code{_mp_d[ABS(_mp_size)-1]} is the most
- significant. Whenever @code{_mp_size} is non-zero, the most significant limb
- is non-zero.
- Currently there's always at least one limb allocated, so for instance
- @code{mpz_set_ui} never needs to reallocate, and @code{mpz_get_ui} can fetch
- @code{_mp_d[0]} unconditionally (though its value is then only wanted if
- @code{_mp_size} is non-zero).
- @item @code{_mp_alloc}
- @code{_mp_alloc} is the number of limbs currently allocated at @code{_mp_d},
- and naturally @code{_mp_alloc >= ABS(_mp_size)}. When an @code{mpz} routine
- is about to (or might be about to) increase @code{_mp_size}, it checks
- @code{_mp_alloc} to see whether there's enough space, and reallocates if not.
- @code{MPZ_REALLOC} is generally used for this.
- @end table
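- In @file{gmp.h} the structure is essentially as follows.
- @example
- typedef struct
- @{
-   int _mp_alloc;      /* limbs allocated at _mp_d */
-   int _mp_size;       /* limbs in use, negative for negative values */
-   mp_limb_t *_mp_d;   /* the limbs, least significant first */
- @} __mpz_struct;
- typedef __mpz_struct mpz_t[1];
- @end example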
- The various bitwise logical functions like @code{mpz_and} behave as if
- negative values were twos complement. But sign and magnitude is always used
- internally, and necessary adjustments are made during the calculations.
- Sometimes this isn't pretty, but sign and magnitude are best for other
- routines.
- Some internal temporary variables are set up with @code{MPZ_TMP_INIT} and these
- have @code{_mp_d} space obtained from @code{TMP_ALLOC} rather than the memory
- allocation functions. Care is taken to ensure that these are big enough that
- no reallocation is necessary (since it would have unpredictable consequences).
- @code{_mp_size} and @code{_mp_alloc} are @code{int}, although @code{mp_size_t}
- is usually a @code{long}. This is done to make the fields just 32 bits on
- some 64-bit systems, thereby saving a few bytes of data space but still
- providing plenty of range.
- @node Rational Internals, Float Internals, Integer Internals, Internals
- @section Rational Internals
- @cindex Rational internals
- @code{mpq_t} variables represent rationals using an @code{mpz_t} numerator and
- denominator (@pxref{Integer Internals}).
- The canonical form adopted is denominator positive (and non-zero), no common
- factors between numerator and denominator, and zero uniquely represented as
- 0/1.
- It's believed that casting out common factors at each stage of a calculation
- is best in general. A GCD is an @math{O(N^2)} operation so it's better to do
- a few small ones immediately than to delay and have to do a big one later.
- Knowing the numerator and denominator have no common factors can be used for
- example in @code{mpq_mul} to make only two cross GCDs necessary, not four.
- This general approach to common factors is badly sub-optimal in the presence
- of simple factorizations or little prospect for cancellation, but GMP has no
- way to know when this will occur. As per @ref{Efficiency}, that's left to
- applications. The @code{mpq_t} framework might still suit, with
- @code{mpq_numref} and @code{mpq_denref} for direct access to the numerator and
- denominator, or of course @code{mpz_t} variables can be used directly.
- @node Float Internals, Raw Output Internals, Rational Internals, Internals
- @section Float Internals
- @cindex Float internals
- Efficient calculation is the primary aim of GMP floats and the use of whole
- limbs and simple rounding facilitates this.
- @code{mpf_t} floats have a variable precision mantissa and a single machine
- word signed exponent. The mantissa is represented using sign and magnitude.
- @c FIXME: The arrow heads don't join to the lines exactly.
- @tex
- \global\newdimen\GMPboxwidth \GMPboxwidth=5em
- \global\newdimen\GMPboxheight \GMPboxheight=3ex
- \def\centreline{\hbox{\raise 0.8ex \vbox{\hrule \hbox{\hfil}}}}
- \GMPdisplay{%
- \vbox{%
-   \hbox to 5\GMPboxwidth {most significant limb \hfil least significant limb}
-   \vskip 0.7ex
-   \def\GMPcentreline#1{\hbox{\raise 0.5 ex \vbox{\hrule \hbox to #1 {}}}}
-   \hbox {
-     \hbox to 3\GMPboxwidth {%
-       \setbox 0 = \hbox{@code{_mp_exp}}%
-       \dimen0=3\GMPboxwidth
-       \advance\dimen0 by -\wd0
-       \divide\dimen0 by 2
-       \advance\dimen0 by -1em
-       \setbox1 = \hbox{$\rightarrow$}%
-       \dimen1=\dimen0
-       \advance\dimen1 by -\wd1
-       \GMPcentreline{\dimen0}%
-       \hfil
-       \box0%
-       \hfil
-       \GMPcentreline{\dimen1{}}%
-       \box1}
-     \hbox to 2\GMPboxwidth {\hfil @code{_mp_d}}}
-   \vskip 0.5ex
-   \vbox {%
-     \hrule
-     \hbox{%
-       \vrule height 2ex depth 1ex
-       \hbox to \GMPboxwidth {}%
-       \vrule
-       \hbox to \GMPboxwidth {}%
-       \vrule
-       \hbox to \GMPboxwidth {}%
-       \vrule
-       \hbox to \GMPboxwidth {}%
-       \vrule
-       \hbox to \GMPboxwidth {}%
-       \vrule}
-     \hrule
-   }
-   \hbox {%
-     \hbox to 0.8 pt {}
-     \hbox to 3\GMPboxwidth {%
-       \hfil $\cdot$} \hbox {$\leftarrow$ radix point\hfil}}
-   \hbox to 5\GMPboxwidth{%
-     \setbox 0 = \hbox{@code{_mp_size}}%
-     \dimen0 = 5\GMPboxwidth
-     \advance\dimen0 by -\wd0
-     \divide\dimen0 by 2
-     \advance\dimen0 by -1em
-     \dimen1 = \dimen0
-     \setbox1 = \hbox{$\leftarrow$}%
-     \setbox2 = \hbox{$\rightarrow$}%
-     \advance\dimen0 by -\wd1
-     \advance\dimen1 by -\wd2
-     \hbox to 0.3 em {}%
-     \box1
-     \GMPcentreline{\dimen0}%
-     \hfil
-     \box0
-     \hfil
-     \GMPcentreline{\dimen1}%
-     \box2}
- }}
- @end tex
- @ifnottex
- @example
- most least
- significant significant
- limb limb
- _mp_d
- |---- _mp_exp ---> |
- _____ _____ _____ _____ _____
- |_____|_____|_____|_____|_____|
- . <------------ radix point
- <-------- _mp_size --------->
- @sp 1
- @end example
- @end ifnottex
- @noindent
- The fields are as follows.
- @table @asis
- @item @code{_mp_size}
- The number of limbs currently in use, or the negative of that when
- representing a negative value. Zero is represented by @code{_mp_size} and
- @code{_mp_exp} both set to zero, and in that case the @code{_mp_d} data is
- unused. (In the future @code{_mp_exp} might be undefined when representing
- zero.)
- @item @code{_mp_prec}
- The precision of the mantissa, in limbs. In any calculation the aim is to
- produce @code{_mp_prec} limbs of result (the most significant being non-zero).
- @item @code{_mp_d}
- A pointer to the array of limbs which is the absolute value of the mantissa.
- These are stored ``little endian'' as per the @code{mpn} functions, so
- @code{_mp_d[0]} is the least significant limb and
- @code{_mp_d[ABS(_mp_size)-1]} the most significant.
- The most significant limb is always non-zero, but there are no other
- restrictions on its value, in particular the highest 1 bit can be anywhere
- within the limb.
- @code{_mp_prec+1} limbs are allocated to @code{_mp_d}, the extra limb being
- for convenience (see below). There are no reallocations during a calculation,
- only in a change of precision with @code{mpf_set_prec}.
- @item @code{_mp_exp}
- The exponent, in limbs, determining the location of the implied radix point.
- Zero means the radix point is just above the most significant limb. Positive
- values mean a radix point offset towards the lower limbs and hence a value
- @math{@ge{} 1}, as for example in the diagram above. Negative exponents mean
- a radix point further above the highest limb.
- Naturally the exponent can be any value; it doesn't have to fall within the
- limbs as the diagram shows, and can be a long way above or a long way below.
- Limbs other than those included in the @code{@{_mp_d,_mp_size@}} data
- are treated as zero.
- @end table
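- In @file{gmp.h} the structure is essentially as follows.
- @example
- typedef struct
- @{
-   int _mp_prec;       /* precision, in limbs */
-   int _mp_size;       /* limbs in use, negative for negative values */
-   mp_exp_t _mp_exp;   /* exponent, in limbs */
-   mp_limb_t *_mp_d;   /* the mantissa limbs, least significant first */
- @} __mpf_struct;
- @end example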
- The @code{_mp_size} and @code{_mp_prec} fields are @code{int}, although the
- @code{mp_size_t} type is usually a @code{long}. The @code{_mp_exp} field is
- usually @code{long}. This is done to make some fields just 32 bits on some
- 64-bit systems, thereby saving a few bytes of data space but still providing
- plenty of precision and a very large range.
- @sp 1
- @noindent
- The following various points should be noted.
- @table @asis
- @item Low Zeros
- The least significant limbs @code{_mp_d[0]} etc can be zero, though such low
- zeros can always be ignored. Routines likely to produce low zeros check and
- avoid them to save time in subsequent calculations, but for most routines
- they're quite unlikely and aren't checked.
- @item Mantissa Size Range
- The @code{_mp_size} count of limbs in use can be less than @code{_mp_prec} if
- the value can be represented in less. This means low precision values or
- small integers stored in a high precision @code{mpf_t} can still be operated
- on efficiently.
- @code{_mp_size} can also be greater than @code{_mp_prec}. Firstly a value is
- allowed to use all of the @code{_mp_prec+1} limbs available at @code{_mp_d},
- and secondly when @code{mpf_set_prec_raw} lowers @code{_mp_prec} it leaves
- @code{_mp_size} unchanged and so the size can be arbitrarily bigger than
- @code{_mp_prec}.
- @item Rounding
- All rounding is done on limb boundaries. Calculating @code{_mp_prec} limbs
- with the high limb non-zero will ensure the application's requested minimum
- precision is obtained.
- The use of simple ``trunc'' rounding towards zero is efficient, since there's
- no need to examine extra limbs and increment or decrement.
- @item Bit Shifts
- Since the exponent is in limbs, there are no bit shifts in basic operations
- like @code{mpf_add} and @code{mpf_mul}. When differing exponents are
- encountered all that's needed is to adjust pointers to line up the relevant
- limbs.
- Of course @code{mpf_mul_2exp} and @code{mpf_div_2exp} will require bit shifts,
- but the choice is between an exponent in limbs which requires shifts there, or
- one in bits which requires them almost everywhere else.
- @item Use of @code{_mp_prec+1} Limbs
- The extra limb on @code{_mp_d} (@code{_mp_prec+1} rather than just
- @code{_mp_prec}) helps when an @code{mpf} routine might get a carry from its
- operation. @code{mpf_add} for instance will do an @code{mpn_add} of
- @code{_mp_prec} limbs. If there's no carry then that's the result, but if
- there is a carry then it's stored in the extra limb of space and
- @code{_mp_size} becomes @code{_mp_prec+1}.
- Whenever @code{_mp_prec+1} limbs are held in a variable, the low limb is not
- needed for the intended precision, only the @code{_mp_prec} high limbs. But
- zeroing it out or moving the rest down is unnecessary. Subsequent routines
- reading the value will simply take the high limbs they need, and this will be
- @code{_mp_prec} if their target has that same precision. This is no more than
- a pointer adjustment, and must be checked anyway since the destination
- precision can be different from that of the sources.
- Copy functions like @code{mpf_set} will retain a full @code{_mp_prec+1} limbs
- if available. This ensures that a variable which has @code{_mp_size} equal to
- @code{_mp_prec+1} will get its full exact value copied. Strictly speaking
- this is unnecessary since only @code{_mp_prec} limbs are needed for the
- application's requested precision, but it's considered that an @code{mpf_set}
- from one variable into another of the same precision ought to produce an exact
- copy.
- @item Application Precisions
- @code{__GMPF_BITS_TO_PREC} converts an application-requested precision to an
- @code{_mp_prec}. The value in bits is rounded up to a whole limb, then an
- extra limb is added, since the most significant limb of @code{_mp_d} is only
- guaranteed to be non-zero and therefore might contain as little as one bit.
- @code{__GMPF_PREC_TO_BITS} does the reverse conversion, and removes the extra
- limb from @code{_mp_prec} before converting to bits. The net effect of
- reading back with @code{mpf_get_prec} is simply the precision rounded up to a
- multiple of @code{mp_bits_per_limb}.
- Note that the extra limb added here for the high only being non-zero is in
- addition to the extra limb allocated to @code{_mp_d}. For example with a
- 32-bit limb, an application request for 250 bits will be rounded up to 8
- limbs, then an extra added for the high being only non-zero, giving an
- @code{_mp_prec} of 9. @code{_mp_d} then gets 10 limbs allocated. Reading
- back with @code{mpf_get_prec} will take @code{_mp_prec} subtract 1 limb and
- multiply by 32, giving 256 bits.
- Strictly speaking, the fact the high limb has at least one bit means that a
- float with, say, 3 limbs of 32-bits each will be holding at least 65 bits, but
- for the purposes of @code{mpf_t} it's considered simply to be 64 bits, a nice
- multiple of the limb size.
- @end table
- @node Raw Output Internals, C++ Interface Internals, Float Internals, Internals
- @section Raw Output Internals
- @cindex Raw output internals
- @noindent
- @code{mpz_out_raw} uses the following format.
- @tex
- \global\newdimen\GMPboxwidth \GMPboxwidth=5em
- \global\newdimen\GMPboxheight \GMPboxheight=3ex
- \def\centreline{\hbox{\raise 0.8ex \vbox{\hrule \hbox{\hfil}}}}
- \GMPdisplay{%
- \vbox{%
-   \def\GMPcentreline#1{\hbox{\raise 0.5 ex \vbox{\hrule \hbox to #1 {}}}}
-   \vbox {%
-     \hrule
-     \hbox{%
-       \vrule height 2.5ex depth 1.5ex
-       \hbox to \GMPboxwidth {\hfil size\hfil}%
-       \vrule
-       \hbox to 3\GMPboxwidth {\hfil data bytes\hfil}%
-       \vrule}
-     \hrule}
- }}
- @end tex
- @ifnottex
- @example
- +------+------------------------+
- | size | data bytes |
- +------+------------------------+
- @end example
- @end ifnottex
- The size is 4 bytes written most significant byte first, being the number of
- subsequent data bytes, or the twos complement negative of that when a negative
- integer is represented. The data bytes are the absolute value of the integer,
- written most significant byte first.
- The most significant data byte is always non-zero, so the output is the same
- on all systems, irrespective of limb size.
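- For instance, the size header can be decoded portably as in this sketch.
- @example
- #include <stdint.h>
- /* Read the 4-byte big-endian size field.  */
- int32_t
- raw_size (const unsigned char *buf)
- @{
-   uint32_t u = ((uint32_t) buf[0] << 24) | ((uint32_t) buf[1] << 16)
-              | ((uint32_t) buf[2] << 8)  |  (uint32_t) buf[3];
-   return (int32_t) u;   /* twos complement negative when negative */
- @}
- @end example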
- In GMP 1, leading zero bytes were written to pad the data bytes to a multiple
- of the limb size. @code{mpz_inp_raw} will still accept this, for
- compatibility.
- The use of ``big endian'' for both the size and data fields is deliberate: it
- makes the data easy to read in a hex dump of a file. Unfortunately it also
- means that the limb data must be reversed when reading or writing, so neither
- a big endian nor little endian system can just read and write @code{_mp_d}.
- @node C++ Interface Internals, , Raw Output Internals, Internals
- @section C++ Interface Internals
- @cindex C++ interface internals
- A system of expression templates is used to ensure something like @code{a=b+c}
- turns into a simple call to @code{mpz_add} etc. For @code{mpf_class}
- the scheme also ensures the precision of the final
- destination is used for any temporaries within a statement like
- @code{f=w*x+y*z}. These are important features which a naive implementation
- cannot provide.
- A simplified description of the scheme follows. The true scheme is
- complicated by the fact that expressions have different return types. For
- detailed information, refer to the source code.
- To perform an operation, say, addition, we first define a ``function object''
- evaluating it,
- @example
- struct __gmp_binary_plus
- @{
- static void eval(mpf_t f, mpf_t g, mpf_t h) @{ mpf_add(f, g, h); @}
- @};
- @end example
- @noindent
- And an ``additive expression'' object,
- @example
- __gmp_expr<__gmp_binary_expr<mpf_class, mpf_class, __gmp_binary_plus> >
- operator+(const mpf_class &f, const mpf_class &g)
- @{
- return __gmp_expr
- <__gmp_binary_expr<mpf_class, mpf_class, __gmp_binary_plus> >(f, g);
- @}
- @end example
- The seemingly redundant @code{__gmp_expr<__gmp_binary_expr<@dots{}>>} is used to
- encapsulate any possible kind of expression into a single template type. In
- fact even @code{mpf_class} etc are @code{typedef} specializations of
- @code{__gmp_expr}.
- Next we define assignment of @code{__gmp_expr} to @code{mpf_class}.
- @example
- template <class T>
- mpf_class & mpf_class::operator=(const __gmp_expr<T> &expr)
- @{
- expr.eval(this->get_mpf_t(), this->precision());
- return *this;
- @}
- template <class Op>
- void __gmp_expr<__gmp_binary_expr<mpf_class, mpf_class, Op> >::eval
- (mpf_t f, mp_bitcnt_t precision)
- @{
- Op::eval(f, expr.val1.get_mpf_t(), expr.val2.get_mpf_t());
- @}
- @end example
- where @code{expr.val1} and @code{expr.val2} are references to the expression's
- operands (here @code{expr} is the @code{__gmp_binary_expr} stored within the
- @code{__gmp_expr}).
- This way, the expression is actually evaluated only at the time of assignment,
- when the required precision (that of @code{f}) is known. Furthermore the
- target @code{mpf_t} is now available, thus we can call @code{mpf_add} directly
- with @code{f} as the output argument.
- Compound expressions are handled by defining operators taking subexpressions
- as their arguments, like this:
- @example
- template <class T, class U>
- __gmp_expr
- <__gmp_binary_expr<__gmp_expr<T>, __gmp_expr<U>, __gmp_binary_plus> >
- operator+(const __gmp_expr<T> &expr1, const __gmp_expr<U> &expr2)
- @{
- return __gmp_expr
- <__gmp_binary_expr<__gmp_expr<T>, __gmp_expr<U>, __gmp_binary_plus> >
- (expr1, expr2);
- @}
- @end example
- And the corresponding specializations of @code{__gmp_expr::eval}:
- @example
- template <class T, class U, class Op>
- void __gmp_expr
- <__gmp_binary_expr<__gmp_expr<T>, __gmp_expr<U>, Op> >::eval
- (mpf_t f, mp_bitcnt_t precision)
- @{
- // declare two temporaries
- mpf_class temp1(expr.val1, precision), temp2(expr.val2, precision);
- Op::eval(f, temp1.get_mpf_t(), temp2.get_mpf_t());
- @}
- @end example
- The expression is thus recursively evaluated to any level of complexity and
- all subexpressions are evaluated to the precision of @code{f}.
- @node Contributors, References, Internals, Top
- @comment node-name, next, previous, up
- @appendix Contributors
- @cindex Contributors
- Torbj@"orn Granlund wrote the original GMP library and is still the main
- developer. Code not explicitly attributed to others, was contributed by
- Torbj@"orn. Several other individuals and organizations have contributed
- GMP. Here is a list in chronological order on first contribution:
- Gunnar Sj@"odin and Hans Riesel helped with mathematical problems in early
- versions of the library.
- Richard Stallman helped with the interface design and revised the first
- version of this manual.
- Brian Beuning and Doug Lea helped with testing of early versions of the
- library and made creative suggestions.
- John Amanatides of York University in Canada contributed the function
- @code{mpz_probab_prime_p}.
- Paul Zimmermann wrote the REDC-based @code{mpz_powm} code, the
- Sch@"onhage-Strassen FFT multiply code, and the Karatsuba square root code.
- He also improved the
- Toom3 code for GMP 4.2. Paul sparked the development of GMP 2, with his
- comparisons between bignum packages. The ECMNET project Paul is organizing
- was a driving force behind many of the optimizations in GMP 3. Paul also
- wrote the new GMP 4.3 nth root code (with Torbj@"orn).
- Ken Weber (Kent State University, Universidade Federal do Rio Grande do Sul)
- contributed now defunct versions of @code{mpz_gcd}, @code{mpz_divexact},
- @code{mpn_gcd}, and @code{mpn_bdivmod}, partially supported by CNPq (Brazil)
- grant 301314194-2.
- Per Bothner of Cygnus Support helped to set up GMP to use Cygnus' configure.
- He has also made valuable suggestions and tested numerous intermediary
- releases.
- Joachim Hollman was involved in the design of the @code{mpf} interface, and in
- the @code{mpz} design revisions for version 2.
- Bennet Yee contributed the initial versions of @code{mpz_jacobi} and
- @code{mpz_legendre}.
- Andreas Schwab contributed the files @file{mpn/m68k/lshift.S} and
- @file{mpn/m68k/rshift.S} (now in @file{.asm} form).
- Robert Harley of Inria, France and David Seal of ARM, England, suggested clever
- improvements for population count. Robert also wrote highly optimized
- Karatsuba and 3-way Toom multiplication functions for GMP 3, and contributed
- the ARM assembly code.
- Torsten Ekedahl of the Mathematical department of Stockholm University provided
- significant inspiration during several phases of the GMP development. His
- mathematical expertise helped improve several algorithms.
- Linus Nordberg wrote the new configure system based on autoconf and
- implemented the new random functions.
- Kevin Ryde worked on a large number of things: optimized x86 code, m4 asm
- macros, parameter tuning, speed measuring, the configure system, function
- inlining, divisibility tests, bit scanning, Jacobi symbols, Fibonacci and Lucas
- number functions, printf and scanf functions, perl interface, demo expression
- parser, the algorithms chapter in the manual, @file{gmpasm-mode.el}, and
- various miscellaneous improvements elsewhere.
- Kent Boortz made the Mac OS 9 port.
- Steve Root helped write the optimized alpha 21264 assembly code.
- Gerardo Ballabio wrote the @file{gmpxx.h} C++ class interface and the C++
- @code{istream} input routines.
- Jason Moxham rewrote @code{mpz_fac_ui}.
- Pedro Gimeno implemented the Mersenne Twister and made other random number
- improvements.
- Niels M@"oller wrote the sub-quadratic GCD and extended GCD code, the
- quadratic Hensel division code, and (with Torbj@"orn) the new divide and
- conquer division code for GMP 4.3. Niels also helped implement the new Toom
- multiply code for GMP 4.3 and implemented helper functions to simplify Toom
- evaluations for GMP 5.0. He wrote the original version of
- @code{mpn_mulmod_bnm1}.
- Alberto Zanoni and Marco Bodrato suggested the unbalanced multiply strategy,
- and found the optimal strategies for evaluation and interpolation in Toom
- multiplication.
- Marco Bodrato helped implement the new Toom multiply code for GMP 4.3 and
- implemented most of the new Toom multiply and squaring code for 5.0.
- He is the main author of the current @code{mpn_mulmod_bnm1} and
- @code{mpn_mullo_n}. Marco also wrote the functions @code{mpn_invert} and
- @code{mpn_invertappr}.
- David Harvey suggested the internal function @code{mpn_bdiv_dbm1}, implementing
- division relevant to Toom multiplication. He also worked on fast assembly
- sequences, in particular on a fast AMD64 @code{mpn_mul_basecase}.
- Martin Boij wrote @code{mpn_perfect_power_p}.
- (This list is chronological, not ordered after significance. If you have
- contributed to GMP but are not listed above, please tell
- @email{gmp-devel@@gmplib.org} about the omission!)
- The development of the floating point functions of GNU MP 2 was supported in
- part by the ESPRIT-BRA (Basic Research Activities) 6846 project POSSO
- (POlynomial System SOlving).
- The development of GMP 2, 3, and 4 was supported in part by the IDA Center for
- Computing Sciences.
- Thanks go to Hans Thorsen for donating an SGI system for the GMP test system
- environment.
- @node References, GNU Free Documentation License, Contributors, Top
- @comment node-name, next, previous, up
- @appendix References
- @cindex References
- @c FIXME: In tex, the @uref's are unhyphenated, which is good for clarity,
- @c but being long words they upset paragraph formatting (the preceding line
- @c can get badly stretched). Would like an conditional @* style line break
- @c if the uref is too long to fit on the last line of the paragraph, but it's
- @c not clear how to do that. For now explicit @texlinebreak{}s are used on
- @c paragraphs that come out bad.
- @section Books
- @itemize @bullet
- @item
- Jonathan M. Borwein and Peter B. Borwein, ``Pi and the AGM: A Study in
- Analytic Number Theory and Computational Complexity'', Wiley, 1998.
- @item
- Richard Crandall and Carl Pomerance, ``Prime Numbers: A Computational
- Perspective'', 2nd edition, Springer-Verlag, 2005.
- @texlinebreak{} @uref{http://math.dartmouth.edu/~carlp/}
- @item
- Henri Cohen, ``A Course in Computational Algebraic Number Theory'', Graduate
- Texts in Mathematics number 138, Springer-Verlag, 1993.
- @texlinebreak{} @uref{http://www.math.u-bordeaux.fr/~cohen/}
- @item
- Donald E. Knuth, ``The Art of Computer Programming'', volume 2,
- ``Seminumerical Algorithms'', 3rd edition, Addison-Wesley, 1998.
- @texlinebreak{} @uref{http://www-cs-faculty.stanford.edu/~knuth/taocp.html}
- @item
- John D. Lipson, ``Elements of Algebra and Algebraic Computing'',
- The Benjamin Cummings Publishing Company Inc, 1981.
- @item
- Alfred J. Menezes, Paul C. van Oorschot and Scott A. Vanstone, ``Handbook of
- Applied Cryptography'', @uref{http://www.cacr.math.uwaterloo.ca/hac/}
- @item
- Richard M. Stallman and the GCC Developer Community, ``Using the GNU Compiler
- Collection'', Free Software Foundation, 2008, available online
- @uref{http://gcc.gnu.org/onlinedocs/}, and in the GCC package
- @uref{ftp://ftp.gnu.org/gnu/gcc/}
- @end itemize
- @section Papers
- @itemize @bullet
- @item
- Yves Bertot, Nicolas Magaud and Paul Zimmermann, ``A Proof of GMP Square
- Root'', Journal of Automated Reasoning, volume 29, 2002, pp.@: 225-252. Also
- available online as INRIA Research Report 4475, June 2001,
- @uref{http://www.inria.fr/rrrt/rr-4475.html}
- @item
- Christoph Burnikel and Joachim Ziegler, ``Fast Recursive Division'',
- Max-Planck-Institut fuer Informatik Research Report MPI-I-98-1-022,
- @texlinebreak{} @uref{http://data.mpi-sb.mpg.de/internet/reports.nsf/NumberView/1998-1-022}
- @item
- Torbj@"orn Granlund and Peter L. Montgomery, ``Division by Invariant Integers
- using Multiplication'', in Proceedings of the SIGPLAN PLDI'94 Conference, June
- 1994. Also available @uref{ftp://ftp.cwi.nl/pub/pmontgom/divcnst.psa4.gz}
- (and .psl.gz).
- @item
- Niels M@"oller and Torbj@"orn Granlund, ``Improved division by invariant
- integers'', to appear.
- @item
- Torbj@"orn Granlund and Niels M@"oller, ``Division of integers large and
- small'', to appear.
- @item
- Tudor Jebelean,
- ``An algorithm for exact division'',
- Journal of Symbolic Computation,
- volume 15, 1993, pp.@: 169-180.
- Research report version available @texlinebreak{}
- @uref{ftp://ftp.risc.uni-linz.ac.at/pub/techreports/1992/92-35.ps.gz}
- @item
- Tudor Jebelean, ``Exact Division with Karatsuba Complexity - Extended
- Abstract'', RISC-Linz technical report 96-31, @texlinebreak{}
- @uref{ftp://ftp.risc.uni-linz.ac.at/pub/techreports/1996/96-31.ps.gz}
- @item
- Tudor Jebelean, ``Practical Integer Division with Karatsuba Complexity'',
- ISSAC 97, pp.@: 339-341. Technical report available @texlinebreak{}
- @uref{ftp://ftp.risc.uni-linz.ac.at/pub/techreports/1996/96-29.ps.gz}
- @item
- Tudor Jebelean, ``A Generalization of the Binary GCD Algorithm'', ISSAC 93,
- pp.@: 111-116. Technical report version available @texlinebreak{}
- @uref{ftp://ftp.risc.uni-linz.ac.at/pub/techreports/1993/93-01.ps.gz}
- @item
- Tudor Jebelean, ``A Double-Digit Lehmer-Euclid Algorithm for Finding the GCD
- of Long Integers'', Journal of Symbolic Computation, volume 19, 1995,
- pp.@: 145-157. Technical report version also available @texlinebreak{}
- @uref{ftp://ftp.risc.uni-linz.ac.at/pub/techreports/1992/92-69.ps.gz}
- @item
- Werner Krandick and Tudor Jebelean, ``Bidirectional Exact Integer Division'',
- Journal of Symbolic Computation, volume 21, 1996, pp.@: 441-455. Early
- technical report version also available
- @uref{ftp://ftp.risc.uni-linz.ac.at/pub/techreports/1994/94-50.ps.gz}
- @item
- Makoto Matsumoto and Takuji Nishimura, ``Mersenne Twister: A 623-dimensionally
- equidistributed uniform pseudorandom number generator'', ACM Transactions on
- Modelling and Computer Simulation, volume 8, January 1998, pp.@: 3-30.
- Available online @texlinebreak{}
- @uref{http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/mt.ps.gz} (or .pdf)
- @item
- R. Moenck and A. Borodin, ``Fast Modular Transforms via Division'',
- Proceedings of the 13th Annual IEEE Symposium on Switching and Automata
- Theory, October 1972, pp.@: 90-96. Reprinted as ``Fast Modular Transforms'',
- Journal of Computer and System Sciences, volume 8, number 3, June 1974,
- pp.@: 366-386.
- @item
- Niels M@"oller, ``On Sch@"onhage's algorithm and subquadratic integer GCD
- computation'', in Mathematics of Computation, volume 77, January 2008, pp.@:
- 589-607.
- @item
- Peter L. Montgomery, ``Modular Multiplication Without Trial Division'', in
- Mathematics of Computation, volume 44, number 170, April 1985.
- @item
- Arnold Sch@"onhage and Volker Strassen, ``Schnelle Multiplikation grosser
- Zahlen'', Computing 7, 1971, pp.@: 281-292.
- @item
- Kenneth Weber, ``The accelerated integer GCD algorithm'',
- ACM Transactions on Mathematical Software,
- volume 21, number 1, March 1995, pp.@: 111-122.
- @item
- Paul Zimmermann, ``Karatsuba Square Root'', INRIA Research Report 3805,
- November 1999, @uref{http://www.inria.fr/rrrt/rr-3805.html}
- @item
- Paul Zimmermann, ``A Proof of GMP Fast Division and Square Root
- Implementations'', @texlinebreak{}
- @uref{http://www.loria.fr/~zimmerma/papers/proof-div-sqrt.ps.gz}
- @item
- Dan Zuras, ``On Squaring and Multiplying Large Integers'', ARITH-11: IEEE
- Symposium on Computer Arithmetic, 1993, pp.@: 260-271. Reprinted as ``More
- on Multiplying and Squaring Large Integers'', IEEE Transactions on Computers,
- volume 43, number 8, August 1994, pp.@: 899-908.
- @end itemize
- @node GNU Free Documentation License, Concept Index, References, Top
- @appendix GNU Free Documentation License
- @cindex GNU Free Documentation License
- @cindex Free Documentation License
- @cindex Documentation license
- @include fdl-1.3.texi
- @node Concept Index, Function Index, GNU Free Documentation License, Top
- @comment node-name, next, previous, up
- @unnumbered Concept Index
- @printindex cp
- @node Function Index, , Concept Index, Top
- @comment node-name, next, previous, up
- @unnumbered Function and Type Index
- @printindex fn
- @bye
- @c Local variables:
- @c fill-column: 78
- @c compile-command: "make gmp.info"
- @c End: