source: trunk/third/gmp/tune/README @ 18191

Revision 18191, 18.5 KB checked in by ghudson, 22 years ago (diff)
This commit was generated by cvs2svn to compensate for changes in r18190, which included commits to RCS files with non-trunk default branches.
Line 
1Copyright 2000, 2001, 2002 Free Software Foundation, Inc.
2
3This file is part of the GNU MP Library.
4
5The GNU MP Library is free software; you can redistribute it and/or modify
6it under the terms of the GNU Lesser General Public License as published by
7the Free Software Foundation; either version 2.1 of the License, or (at your
8option) any later version.
9
10The GNU MP Library is distributed in the hope that it will be useful, but
11WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
12or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
13License for more details.
14
15You should have received a copy of the GNU Lesser General Public License
16along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
17the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
1802111-1307, USA.
19
20
21
22
23
24               GMP SPEED MEASURING AND PARAMETER TUNING
25
26
27The programs in this directory are for knowledgeable users who want to
28measure GMP routines on their machine, and perhaps tweak some settings or
29identify things that can be improved.
30
31The programs here are tools, not ready to run solutions.  Nothing is built
32in a normal "make all", but various Makefile targets described below exist.
33
34Relatively few systems and CPUs have been tested, so be sure to verify that
35results are sensible before relying on them.
36
37
38
39
40MISCELLANEOUS NOTES
41
42--enable-assert
43
44    Don't configure with --enable-assert, since the extra code added by
45    assertion checking may influence measurements.
46
47Direct mapped caches
48
49    Some effort has been made to accommodate CPUs with direct mapped caches,
50    by putting data blocks more or less contiguously on the stack.  But this
51    will depend on TMP_ALLOC using alloca, and even then it may or may not
52    be enough.
53
54FreeBSD 4.2 i486 getrusage
55
56    This getrusage seems to be a bit doubtful, it looks like it's
57    microsecond accurate, but sometimes ru_utime remains unchanged after a
58    time of many microseconds has elapsed.  It'd be good to detect this in
59    the time.c initializations, but for now the suggestion is to pretend it
60    doesn't exist.
61
62        ./configure ac_cv_func_getrusage=no
63
64NetBSD 1.4.1 m68k macintosh time base
65
66    On this system it's been found getrusage often goes backwards, making it
67    unusable (configure is setup to ignore it).  gettimeofday sometimes
68    doesn't update atomically when it crosses a 1 second boundary.  Not sure
69    what to do about this.  Expect intermittent failures.
70
71SCO OpenUNIX 8 /etc/hw
72
73    /etc/hw takes about a second to return the cpu frequency, which suggests
74    perhaps it's measuring each time it runs.  If this is annoying when
75    running the speed program repeatedly then set a GMP_CPU_FREQUENCY
76    environment variable (see TIME BASE section below).
77
78Low resolution timebase
79
80    Parameter tuning can be very time consuming if the only timebase
81    available is a 10 millisecond clock tick, to the point of being
82    unusable.  This is currently the case on VAX and ARM systems.
83
84
85
86
87PARAMETER TUNING
88
89The "tuneup" program runs some tests designed to find the best settings for
90various thresholds, like MUL_KARATSUBA_THRESHOLD.  Its output can be put
91into gmp-mparam.h.  The program is built and run with
92
93        make tune
94
95If the thresholds indicated are grossly different from the values in the
96selected gmp-mparam.h then there may be a performance boost in applicable
97size ranges by changing gmp-mparam.h accordingly.
98
99Be sure to do a full reconfigure and rebuild to get any newly set thresholds
100to take effect.  A partial rebuild is enough sometimes, but a fresh
101configure and make is certain to be correct.
102
103If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of
104the mpn subdirectories then the values from "make tune" should be similar.
105But check that the configured CPU is right and there are no machine specific
106effects causing a difference.
107
108It's hoped the compiler and options used won't have too much effect on
109thresholds, since for most CPUs they ultimately come down to comparisons
110between assembler subroutines.  Missing out on the longlong.h macros by not
111using gcc will probably have an effect.
112
113Some thresholds produced by the tune program are merely single values chosen
114from what's a range of sizes where two algorithms are pretty much the same
115speed.  When this happens the program is likely to give somewhat different
116values on successive runs.  This is noticeable on the toom3 thresholds for
117instance.
118
119
120
121
122SPEED PROGRAM
123
124The "speed" program can be used for measuring and comparing various
125routines, and producing tables of data or gnuplot graphs.  Compile it with
126
127        make speed
128
129(Or on DOS systems "make speed.exe".)
130
131Here are some examples of how to use it.  Check the code for all the
132options.
133
134Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05
135(whichever is greater).
136
137        ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n
138        gnuplot foo.gnuplot
139
140Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and
141showing under mpn_lshift the difference between it and mpn_add_n.
142
143        ./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1
144
145Using option -c for times in cycles is interesting but normally only
146necessary when looking carefully at assembler subroutines.  You might think
147it would always give an integer value, but this doesn't happen in practice,
148probably due to overheads in the time measurements.
149
150In the free-form output the "#" symbol against a measurement means the
151corresponding routine is fastest at that size.  This is a convenient visual
152cue when comparing different routines.  The graph data files <name>.data
153don't get this since it would upset gnuplot or other data viewers.
154
155
156
157
158TIME BASE
159
160The time measuring method is determined in time.c, based on what the
161configured host has available.  A cycle counter is preferred, possibly
162supplemented by another method if the counter has a limited range.  A
163microsecond accurate getrusage() or gettimeofday() will work quite well too.
164
165The cycle counters (except possibly on alpha) and gettimeofday() will depend
166on the machine being otherwise idle, or rather on other jobs not stealing
167CPU time from the measuring program.  Short routines (those that complete
168within a timeslice) should work even on a busy machine.
169
170Some trouble is taken by speed_measure() in common.c to avoid ill effects
171from sporadic interrupts, or other intermittent things (like cron waking up
172every minute).  But generally an idle machine will be necessary to be
173certain of consistent results.
174
175The CPU frequency is needed to convert between cycles and seconds, or for
176when a cycle counter is supplemented by getrusage() etc.  The speed program
177will convert as necessary according to the output format requested.  The
178tune program will work with either cycles or seconds.
179
180freq.c knows how to get the frequency on some systems, or can measure a
181cycle counter against gettimeofday() or getrusage(), but when that fails, or
182needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be
183used (in Hertz).  For example in "bash" on a 650 MHz machine,
184
185        export GMP_CPU_FREQUENCY=650e6
186
187A high precision time base makes it possible to get accurate measurements in
188a shorter time.
189
190
191
192
193EXAMPLE COMPARISONS - VARIOUS
194
195Here are some ideas for things that can be done with the speed program.
196
197There's always going to be a certain amount of overhead in the time
198measurements, due to reading the time base, and in the loop that runs a
199routine enough times to get a reading of the desired precision.  Noop
200functions taking various arguments are available to measure this.  The
201"overhead" printed by the speed program each time in its intro is the "noop"
202routine, but note that this is just for information, it isn't deducted from
203the times printed or anything.
204
205        ./speed -s 1 noop noop_wxs noop_wxys
206
207To see how many cycles per limb a routine is taking, look at the time
208increase when the size increments, using option -D.  This avoids fixed
209overheads in the measuring.  Also, remember many of the assembler routines
210have unrolled loops, so it might be necessary to compare times at, say, 16,
21132, 48, 64 etc to see what the unrolled part is taking, as opposed to any
212finishing off.
213
214        ./speed -s 16-64 -t 16 -C -D mpn_add_n
215
216The -C option on its own gives cycles per limb, but is really only useful at
217big sizes where fixed overheads are small compared to the code doing the
218real work.  Remember of course memory caching and/or page swapping will
219affect results at large sizes.
220
221        ./speed -s 500000 -C mpn_add_n
222
223Once a calculation stops fitting in the CPU data cache, it's going to start
224taking longer.  Exactly where this happens depends on the cache priming in
225the measuring routines, and on what sort of "least recently used" the
226hardware does.  Here's an example for a CPU with a 16kbyte L1 data cache and
22732-bit limb, showing a suddenly steeper curve for mpn_add_n at about 2000
228limbs.
229
230        ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n
231        gnuplot foo.gnuplot
232
233When a routine has an unrolled loop for, say, multiples of 8 limbs and then
234an ordinary loop for the remainder, it can happen that it's actually faster
235to do an operation on, say, 8 limbs than it is on 7 limbs.  The following
236draws a graph of mpn_sub_n, to see whether times smoothly increase with
237size.
238
239        ./speed -s 1-100 -c -P foo mpn_sub_n
240        gnuplot foo.gnuplot
241
242If mpn_lshift and mpn_rshift have special case code for shifts by 1, it
243ought to be faster (or at least not slower) than shifting by, say, 2 bits.
244
245        ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2
246
247An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and
248if the lshift isn't faster there's an obvious improvement that's possible.
249
250        ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self
251
252On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the
253destination is one of the sources is faster than a separate destination.
254Here's an example to see this.  ".1" selects dst==src1 for mpn_add_n (and
255mpn_sub_n), for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL.
256
257        ./speed -s 1-200 -c mpn_add_n mpn_add_n.1
258
259The gmp manual points out that divisions by powers of two should be done
260using a right shift because it'll be significantly faster than an actual
261division.  The following shows by what factor mpn_rshift is faster than
262mpn_divrem_1, using division by 32 as an example.
263
264        ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32
265
266
267
268
269EXAMPLE COMPARISONS - MULTIPLICATION
270
271mul_basecase takes a ".<r>" parameter which is the first (larger) size
272parameter.  For example to show speeds for 20x1 up to 20x15 in cycles,
273
274        ./speed -s 1-15 -c mpn_mul_basecase.20
275
276mul_basecase with no parameter does an NxN multiply, so for example to show
277speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20, in cycles,
278
279        ./speed -s 1-20 -c mpn_mul_basecase
280
281sqr_basecase is implemented by a "triangular" method on most CPUs, making it
282up to twice as fast as mul_basecase.  In practice loop overheads and the
283products on the diagonal mean it falls short of this.  Here's an example
284running the two and showing by what factor an NxN mul_basecase is slower
285than an NxN sqr_basecase.  (Some versions of sqr_basecase only allow sizes
286below SQR_KARATSUBA_THRESHOLD, so if it crashes at that point don't worry.)
287
288        ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase
289
290The technique described above with -CD for showing the time difference in
291cycles per limb between two size operations can be done on an NxN
292mul_basecase using -E to change the basis for the size increment to N*N.
293For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16
294doing 256 limbs.  The following therefore shows the per crossproduct speed
295of mul_basecase and sqr_basecase at around 20x20 limbs.
296
297        ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase
298
299Of course sqr_basecase isn't really doing NxN crossproducts, but it can be
300interesting to compare it to mul_basecase as if it was.  For sqr_basecase
301the -F option can be used to base the deltas on N*(N+1)/2 operations, which
302is the triangular products sqr_basecase does.  For example,
303
304        ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase
305
306Both -E and -F are preliminary and might change.  A consistent approach to
307using them when claiming certain per crossproduct or per triangularproduct
308speeds hasn't really been established, but the increment between speeds in
309the range karatsuba will call seems sensible, that being k to k/2.  For
310instance, if the karatsuba threshold was 20 for the multiply and 30 for the
311square,
312
313        ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase
314        ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase
315
316Two versions of toom3 interpolation and evaluation are available in
317mpn/generic/mul_n.c, using either a one-pass open-coded style or simple mpn
318subroutine calls.  The former is used on RISCs with lots of registers, the
319latter on other CPUs.  The two can be compared directly to check which is
320best.  Naturally it's sizes where toom3 is faster than karatsuba that are of
321interest.
322
323        ./speed -s 80-120 -c mpn_toom3_mul_n_mpn mpn_toom3_mul_n_open
324        ./speed -s 80-120 -c mpn_toom3_sqr_n_mpn mpn_toom3_sqr_n_open
325
326
327
328
329EXAMPLE COMPARISONS - MALLOC
330
331The gmp manual recommends application programs avoid excessive initializing
332and clearing of mpz_t variables (and mpq_t and mpf_t too).  Every new
333variable will at a minimum go through an init, a realloc for its first
334store, and finally a clear.  Quite how long that takes depends on the C
335library.  The following compares an mpz_init/realloc/clear to a 10 limb
336mpz_add.  Don't be surprised if the mallocing is quite slow.
337
338        ./speed -s 10 -c mpz_init_realloc_clear mpz_add
339
340On some systems malloc and free are much slower when dynamic linked.  The
341speed-dynamic program can be used to see this.  For example the following
342measures malloc/free, first static then dynamic.
343
344        ./speed -s 10 -c malloc_free
345        ./speed-dynamic -s 10 -c malloc_free
346
347Of course a real world program has big problems if it's doing so many
348mallocs and frees that it gets slowed down by a dynamic linked malloc.
349
350
351
352
353
354EXAMPLE COMPARISONS - STRING CONVERSIONS
355
356mpn_get_str does a binary to string conversion.  The base is specified with
357a ".<r>" parameter, or decimal by default.  Power of 2 bases are much faster
358than general bases.  The following compares decimal and hex for instance.
359
360        ./speed -s 1-20 -c mpn_get_str mpn_get_str.16
361
362Smaller bases need more divisions to split a given size number, and so are
363slower.  The following compares base 3 and base 9.  On small operands 9 will
364be nearly twice as fast, though at bigger sizes this reduces since in the
365current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit
366limbs) and those divisions come to dominate.
367
368        ./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9
369
370mpn_set_str does a string to binary conversion.  The base is specified with
371a ".<r>" parameter, or decimal by default.  Power of 2 bases are faster than
372general bases on large conversions.
373
374        ./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10
375
376mpn_set_str also has some special case code for decimal which is a bit
377faster than the general case, basically by giving the compiler a chance to
378optimize some multiplications by 10.
379
380        ./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11
381
382
383
384
385EXAMPLE COMPARISONS - GCDs
386
387mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both
388x and y are single limbs.  This isn't tuned currently, but a value can be
389established by a measurement like
390
391        ./speed -s 10-32 mpn_gcd_1.10
392
393This runs src[0] from 10 to 32 bits, and y fixed at 10 bits.  If the div
394threshold is high, say 31 so it's effectively disabled then a 32x10 bit gcd
395is done by nibbling away at the 32-bit operands bit-by-bit.  When the
396threshold is small, say 1 bit, then an initial x%y is done to reduce it to a
39710x10 bit operation.
398
399The threshold in mpn/generic/gcd_1.c or the various assembler
400implementations can be tweaked up or down until there's no more speedups on
401interesting combinations of sizes.  Note that this affects only a 1x1 limb
402operation and so isn't very important.  (An Nx1 limb operation always does
403an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.)
404
405
406
407
408SPEED PROGRAM EXTENSIONS
409
410Potentially lots of things could be made available in the program, but it's
411been left at only the things that have actually been wanted and are likely
412to be reasonably useful in the future.
413
414Extensions should be fairly easy to make though.  speed-ext.c is an example,
415in a style that should suit one-off tests, or new code fragments under
416development.
417
418many.pl is a script for generating a new speed program supplemented with
419alternate versions of the standard routines.  It can be used for measuring
420experimental code, or for comparing different implementations that exist
421within a CPU family.
422
423
424
425
426THRESHOLD EXAMINING
427
428The speed program can be used to examine the speeds of different algorithms
429to check the tune program has done the right thing.  For example to examine
430the karatsuba multiply threshold,
431
432        ./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n
433
434When examining the toom3 threshold, remember it depends on the karatsuba
435threshold, so the right karatsuba threshold needs to be compiled into the
436library first.  The tune program uses specially recompiled versions of
437mpn/mul_n.c etc for this reason, but the speed program simply uses the
438normal libgmp.la.
439
440Note further that the various routines may recurse into themselves on sizes
441far enough above applicable thresholds.  For example, mpn_kara_mul_n will
442recurse into itself on sizes greater than twice the compiled-in
443MUL_KARATSUBA_THRESHOLD.
444
445When doing the above comparison between mul_basecase and kara_mul_n what's
446probably of interest is mul_basecase versus a kara_mul_n that does one level
447of Karatsuba then calls to mul_basecase, but this only happens on sizes less
448than twice the compiled MUL_KARATSUBA_THRESHOLD.  A larger value for that
449setting can be compiled-in to avoid the problem if necessary.  The same
450applies to toom3 and DC, though in a trickier fashion.
451
452There are some upper limits on some of the thresholds, arising from arrays
453dimensioned according to a threshold (mpn_mul_n), or asm code with certain
454sized displacements (some x86 versions of sqr_basecase).  So putting huge
455values for the thresholds, even just for testing, may fail.
456
457
458
459
460FUTURE
461
462Make a program to check the time base is working properly, for small and
463large measurements.  Make it able to test each available method, including
464perhaps the apparent resolution of each.
465
466Make a general mechanism for specifying operand overlap, and a syntax like
467maybe "mpn_add_n.dst=src2" to select it.  Some measuring routines do this
468sort of thing with the "r" parameter currently.
469
470
471
472----------------
473Local variables:
474mode: text
475fill-column: 76
476End:
Note: See TracBrowser for help on using the repository browser.