source: trunk/third/bzip2/manual_4.html @ 17062

Revision 17062, 24.8 KB checked in by ghudson, 23 years ago (diff)
This commit was generated by cvs2svn to compensate for changes in r17061, which included commits to RCS files with non-trunk default branches.
Line 
1<HTML>
2<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
3<!-- Created on January, 5  2002 by texi2html 1.64 -->
4<!--
5Written by: Lionel Cons <Lionel.Cons@cern.ch> (original author)
6            Karl Berry  <karl@freefriends.org>
7            Olaf Bachmann <obachman@mathematik.uni-kl.de>
8            and many others.
9Maintained by: Olaf Bachmann <obachman@mathematik.uni-kl.de>
10Send bugs and suggestions to <texi2html@mathematik.uni-kl.de>
11 
12-->
13<HEAD>
14<TITLE>Untitled Document: 4. Miscellanea</TITLE>
15
16<META NAME="description" CONTENT="Untitled Document: 4. Miscellanea">
17<META NAME="keywords" CONTENT="Untitled Document: 4. Miscellanea">
18<META NAME="resource-type" CONTENT="document">
19<META NAME="distribution" CONTENT="global">
20<META NAME="Generator" CONTENT="texi2html 1.64">
21
22</HEAD>
23
24<BODY LANG="" BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#800080" ALINK="#FF0000">
25
26<A NAME="SEC43"></A>
27<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
28<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_3.html#SEC42"> &lt; </A>]</TD>
29<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC44"> &gt; </A>]</TD>
30<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[ &lt;&lt; ]</TD>
31<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD>
32<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt;&gt; ]</TD>
33<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD>
34<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD>
35<TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD>
36<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD>
37</TR></TABLE>
38<H1> 4. Miscellanea </H1>
39<!--docid::SEC43::-->
40<P>
41
42These are just some random thoughts of mine.  Your mileage may
43vary.
44</P><P>
45
46<HR SIZE="6">
47<A NAME="SEC44"></A>
48<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
49<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC43"> &lt; </A>]</TD>
50<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC45"> &gt; </A>]</TD>
51<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[ &lt;&lt; ]</TD>
52<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD>
53<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt;&gt; ]</TD>
54<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD>
55<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD>
56<TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD>
57<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD>
58</TR></TABLE>
59<H2> 4.1 Limitations of the compressed file format </H2>
60<!--docid::SEC44::-->
61<CODE>bzip2-1.0</CODE>, <CODE>0.9.5</CODE> and <CODE>0.9.0</CODE>
62use exactly the same file format as the previous
63version, <CODE>bzip2-0.1</CODE>.  This decision was made in the interests of
64stability.  Creating yet another incompatible compressed file format
65would create further confusion and disruption for users.
66<P>
67
68Nevertheless, this is not a painless decision.  Development
69work since the release of <CODE>bzip2-0.1</CODE> in August 1997
70has shown complexities in the file format which slow down
71decompression and, in retrospect, are unnecessary.  These are:
72<UL>
73<LI>The run-length encoder, which is the first of the
74      compression transformations, is entirely irrelevant.
75      The original purpose was to protect the sorting algorithm
76      from the very worst case input: a string of repeated
77      symbols.  But algorithm steps Q6a and Q6b in the original
78      Burrows-Wheeler technical report (SRC-124) show how
79      repeats can be handled without difficulty in block
80      sorting.
81<LI>The randomisation mechanism doesn't really need to be
82      there.  Udi Manber and Gene Myers published a suffix
83      array construction algorithm a few years back, which
84      can be employed to sort any block, no matter how
85      repetitive, in O(N log N) time.  Subsequent work by
86      Kunihiko Sadakane has produced a derivative O(N (log N)^2)
87      algorithm which usually outperforms the Manber-Myers
88      algorithm.
89<P>
90
91      I could have changed to Sadakane's algorithm, but I find
92      it to be slower than <CODE>bzip2</CODE>'s existing algorithm for
93      most inputs, and the randomisation mechanism protects
94      adequately against bad cases.  I didn't think it was
95      a good tradeoff to make.  Partly this is due to the fact
96      that I was not flooded with email complaints about
97      <CODE>bzip2-0.1</CODE>'s performance on repetitive data, so
98      perhaps it isn't a problem for real inputs.
99</P><P>
100
101      Probably the best long-term solution,
102      and the one I have incorporated into 0.9.5 and above,
103      is to use the existing sorting
104      algorithm initially, and fall back to a O(N (log N)^2)
105      algorithm if the standard algorithm gets into difficulties.
106<LI>The compressed file format was never designed to be
107      handled by a library, and I have had to jump though
108      some hoops to produce an efficient implementation of
109      decompression.  It's a bit hairy.  Try passing
110      <CODE>decompress.c</CODE> through the C preprocessor
111      and you'll see what I mean.  Much of this complexity
112      could have been avoided if the compressed size of
113      each block of data was recorded in the data stream.
114<LI>An Adler-32 checksum, rather than a CRC32 checksum,
115      would be faster to compute.
116</UL>
117It would be fair to say that the <CODE>bzip2</CODE> format was frozen
118before I properly and fully understood the performance
119consequences of doing so.
120<P>
121
122Improvements which I was able to incorporate into
1230.9.0, despite using the same file format, are:
124<UL>
125<LI>Single array implementation of the inverse BWT.  This
126      significantly speeds up decompression, presumably
127      because it reduces the number of cache misses.
128<LI>Faster inverse MTF transform for large MTF values.  The
129      new implementation is based on the notion of sliding blocks
130      of values.
131<LI><CODE>bzip2-0.9.0</CODE> now reads and writes files with <CODE>fread</CODE>
132      and <CODE>fwrite</CODE>; version 0.1 used <CODE>putc</CODE> and <CODE>getc</CODE>.
133      Duh!  Well, you live and learn.
134<P>
135
136</UL>
137Further ahead, it would be nice
138to be able to do random access into files.  This will
139require some careful design of compressed file formats.
140<P>
141
142<HR SIZE="6">
143<A NAME="SEC45"></A>
144<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
145<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC44"> &lt; </A>]</TD>
146<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC46"> &gt; </A>]</TD>
147<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[ &lt;&lt; ]</TD>
148<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD>
149<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt;&gt; ]</TD>
150<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD>
151<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD>
152<TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD>
153<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD>
154</TR></TABLE>
155<H2> 4.2 Portability issues </H2>
156<!--docid::SEC45::-->
157After some consideration, I have decided not to use
158GNU <CODE>autoconf</CODE> to configure 0.9.5 or 1.0.
159<P>
160
161<CODE>autoconf</CODE>, admirable and wonderful though it is,
162mainly assists with portability problems between Unix-like
163platforms.  But <CODE>bzip2</CODE> doesn't have much in the way
164of portability problems on Unix; most of the difficulties appear
165when porting to the Mac, or to Microsoft's operating systems.
166<CODE>autoconf</CODE> doesn't help in those cases, and brings in a
167whole load of new complexity.
168</P><P>
169
170Most people should be able to compile the library and program
171under Unix straight out-of-the-box, so to speak, especially
172if you have a version of GNU C available.
173</P><P>
174
175There are a couple of <CODE>__inline__</CODE> directives in the code.  GNU C
176(<CODE>gcc</CODE>) should be able to handle them.  If you're not using
177GNU C, your C compiler shouldn't see them at all.
178If your compiler does, for some reason, see them and doesn't
179like them, just <CODE>#define</CODE> <CODE>__inline__</CODE> to be <CODE>/* */</CODE>.  One
180easy way to do this is to compile with the flag <CODE>-D__inline__=</CODE>,
181which should be understood by most Unix compilers.
182</P><P>
183
184If you still have difficulties, try compiling with the macro
185<CODE>BZ_STRICT_ANSI</CODE> defined.  This should enable you to build the
186library in a strictly ANSI compliant environment.  Building the program
187itself like this is dangerous and not supported, since you remove
188<CODE>bzip2</CODE>'s checks against compressing directories, symbolic links,
189devices, and other not-really-a-file entities.  This could cause
190filesystem corruption!
191</P><P>
192
193One other thing: if you create a <CODE>bzip2</CODE> binary for public
194distribution, please try and link it statically (<CODE>gcc -s</CODE>).  This
195avoids all sorts of library-version issues that others may encounter
196later on.
197</P><P>
198
199If you build <CODE>bzip2</CODE> on Win32, you must set <CODE>BZ_UNIX</CODE> to 0 and
200<CODE>BZ_LCCWIN32</CODE> to 1, in the file <CODE>bzip2.c</CODE>, before compiling.
201Otherwise the resulting binary won't work correctly.
202</P><P>
203
204<HR SIZE="6">
205<A NAME="SEC46"></A>
206<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
207<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC45"> &lt; </A>]</TD>
208<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC47"> &gt; </A>]</TD>
209<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[ &lt;&lt; ]</TD>
210<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD>
211<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt;&gt; ]</TD>
212<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD>
213<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD>
214<TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD>
215<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD>
216</TR></TABLE>
217<H2> 4.3 Reporting bugs </H2>
218<!--docid::SEC46::-->
219I tried pretty hard to make sure <CODE>bzip2</CODE> is
220bug free, both by design and by testing.  Hopefully
221you'll never need to read this section for real.
222<P>
223
224Nevertheless, if <CODE>bzip2</CODE> dies with a segmentation
225fault, a bus error or an internal assertion failure, it
226will ask you to email me a bug report.  Experience with
227version 0.1 shows that almost all these problems can
228be traced to either compiler bugs or hardware problems.
229<UL>
230<LI>
231Recompile the program with no optimisation, and see if it
232works.  And/or try a different compiler.
233I heard all sorts of stories about various flavours
234of GNU C (and other compilers) generating bad code for
235<CODE>bzip2</CODE>, and I've run across two such examples myself.
236<P>
237
2382.7.X versions of GNU C are known to generate bad code from
239time to time, at high optimisation levels. 
240If you get problems, try using the flags
241<CODE>-O2</CODE> <CODE>-fomit-frame-pointer</CODE> <CODE>-fno-strength-reduce</CODE>.
242You should specifically <EM>not</EM> use <CODE>-funroll-loops</CODE>.
243</P><P>
244
245You may notice that the Makefile runs six tests as part of
246the build process.  If the program passes all of these, it's
247a pretty good (but not 100%) indication that the compiler has
248done its job correctly.
249<LI>
250If <CODE>bzip2</CODE> crashes randomly, and the crashes are not
251repeatable, you may have a flaky memory subsystem.  <CODE>bzip2</CODE>
252really hammers your memory hierarchy, and if it's a bit marginal,
253you may get these problems.  Ditto if your disk or I/O subsystem
254is slowly failing.  Yup, this really does happen.
255<P>
256
257Try using a different machine of the same type, and see if
258you can repeat the problem.
259<LI>This isn't really a bug, but ... If <CODE>bzip2</CODE> tells
260you your file is corrupted on decompression, and you
261obtained the file via FTP, there is a possibility that you
262forgot to tell FTP to do a binary mode transfer.  That absolutely
263will cause the file to be non-decompressible.  You'll have to transfer
264it again.
265</UL>
266<P>
267
268If you've incorporated <CODE>libbzip2</CODE> into your own program
269and are getting problems, please, please, please, check that the
270parameters you are passing in calls to the library, are
271correct, and in accordance with what the documentation says
272is allowable.  I have tried to make the library robust against
273such problems, but I'm sure I haven't succeeded.
274</P><P>
275
276Finally, if the above comments don't help, you'll have to send
277me a bug report.  Now, it's just amazing how many people will
278send me a bug report saying something like
279<TABLE><tr><td>&nbsp;</td><td class=display><pre style="font-family: serif">   bzip2 crashed with segmentation fault on my machine
280</pre></td></tr></table>and absolutely nothing else.  Needless to say, a such a report
281is <EM>totally, utterly, completely and comprehensively 100% useless;
282a waste of your time, my time, and net bandwidth</EM>.
283With no details at all, there's no way I can possibly begin
284to figure out what the problem is.
285</P><P>
286
287The rules of the game are: facts, facts, facts.  Don't omit
288them because "oh, they won't be relevant".  At the bare
289minimum:
290<TABLE><tr><td>&nbsp;</td><td class=display><pre style="font-family: serif">   Machine type.  Operating system version. 
291   Exact version of <CODE>bzip2</CODE> (do <CODE>bzip2 -V</CODE>). 
292   Exact version of the compiler used. 
293   Flags passed to the compiler.
294</pre></td></tr></table>However, the most important single thing that will help me is
295the file that you were trying to compress or decompress at the
296time the problem happened.  Without that, my ability to do anything
297more than speculate about the cause, is limited.
298</P><P>
299
300Please remember that I connect to the Internet with a modem, so
301you should contact me before mailing me huge files.
302</P><P>
303
304<HR SIZE="6">
305<A NAME="SEC47"></A>
306<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
307<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC46"> &lt; </A>]</TD>
308<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC48"> &gt; </A>]</TD>
309<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[ &lt;&lt; ]</TD>
310<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD>
311<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt;&gt; ]</TD>
312<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD>
313<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD>
314<TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD>
315<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD>
316</TR></TABLE>
317<H2> 4.4 Did you get the right package? </H2>
318<!--docid::SEC47::-->
319<P>
320
321<CODE>bzip2</CODE> is a resource hog.  It soaks up large amounts of CPU cycles
322and memory.  Also, it gives very large latencies.  In the worst case, you
323can feed many megabytes of uncompressed data into the library before
324getting any compressed output, so this probably rules out applications
325requiring interactive behaviour.
326</P><P>
327
328These aren't faults of my implementation, I hope, but more
329an intrinsic property of the Burrows-Wheeler transform (unfortunately). 
330Maybe this isn't what you want.
331</P><P>
332
333If you want a compressor and/or library which is faster, uses less
334memory but gets pretty good compression, and has minimal latency,
335consider Jean-loup
336Gailly's and Mark Adler's work, <CODE>zlib-1.1.3</CODE> and
337<CODE>gzip-1.2.4</CODE>.  Look for them at
338</P><P>
339
340<CODE>http://www.zlib.org</CODE> and
341<CODE>http://www.gzip.org</CODE> respectively.
342</P><P>
343
344For something faster and lighter still, you might try Markus F X J
345Oberhumer's <CODE>LZO</CODE> real-time compression/decompression library, at
346<BR> <CODE>http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html</CODE>.
347</P><P>
348
349If you want to use the <CODE>bzip2</CODE> algorithms to compress small blocks
350of data, 64k bytes or smaller, for example on an on-the-fly disk
351compressor, you'd be well advised not to use this library.  Instead,
352I've made a special library tuned for that kind of use.  It's part of
353<CODE>e2compr-0.40</CODE>, an on-the-fly disk compressor for the Linux
354<CODE>ext2</CODE> filesystem.  Look at
355<CODE>http://www.netspace.net.au/~reiter/e2compr</CODE>.
356</P><P>
357
358<HR SIZE="6">
359<A NAME="SEC48"></A>
360<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
361<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC47"> &lt; </A>]</TD>
362<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC49"> &gt; </A>]</TD>
363<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[ &lt;&lt; ]</TD>
364<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD>
365<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt;&gt; ]</TD>
366<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD>
367<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD>
368<TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD>
369<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD>
370</TR></TABLE>
371<H2> 4.5 Testing </H2>
372<!--docid::SEC48::-->
373<P>
374
375A record of the tests I've done.
376</P><P>
377
378First, some data sets:
379<UL>
380<LI>B: a directory containing 6001 files, one for every length in the
381      range 0 to 6000 bytes.  The files contain random lowercase
382      letters.  18.7 megabytes.
383<LI>H: my home directory tree.  Documents, source code, mail files,
384      compressed data.  H contains B, and also a directory of
385      files designed as boundary cases for the sorting; mostly very
386      repetitive, nasty files.  565 megabytes.
387<LI>A: directory tree holding various applications built from source:
388      <CODE>egcs</CODE>, <CODE>gcc-2.8.1</CODE>, KDE, GTK, Octave, etc.
389      2200 megabytes.
390</UL>
391The tests conducted are as follows.  Each test means compressing
392(a copy of) each file in the data set, decompressing it and
393comparing it against the original.
394<P>
395
396First, a bunch of tests with block sizes and internal buffer
397sizes set very small,
398to detect any problems with the
399blocking and buffering mechanisms. 
400This required modifying the source code so as to try to
401break it.
402<OL>
403<LI>Data set H, with
404      buffer size of 1 byte, and block size of 23 bytes.
405<LI>Data set B, buffer sizes 1 byte, block size 1 byte.
406<LI>As (2) but small-mode decompression.
407<LI>As (2) with block size 2 bytes.
408<LI>As (2) with block size 3 bytes.
409<LI>As (2) with block size 4 bytes.
410<LI>As (2) with block size 5 bytes.
411<LI>As (2) with block size 6 bytes and small-mode decompression.
412<LI>H with buffer size of 1 byte, but normal block
413      size (up to 900000 bytes).
414</OL>
415Then some tests with unmodified source code.
416<OL>
417<LI>H, all settings normal.
418<LI>As (1), with small-mode decompress.
419<LI>H, compress with flag <CODE>-1</CODE>.
420<LI>H, compress with flag <CODE>-s</CODE>, decompress with flag <CODE>-s</CODE>.
421<LI>Forwards compatibility: H, <CODE>bzip2-0.1pl2</CODE> compressing,
422      <CODE>bzip2-0.9.5</CODE> decompressing, all settings normal.
423<LI>Backwards compatibility: H, <CODE>bzip2-0.9.5</CODE> compressing,
424      <CODE>bzip2-0.1pl2</CODE> decompressing, all settings normal.
425<LI>Bigger tests: A, all settings normal.
426<LI>As (7), using the fallback (Sadakane-like) sorting algorithm.
427<LI>As (8), compress with flag <CODE>-1</CODE>, decompress with flag
428      <CODE>-s</CODE>.
429<LI>H, using the fallback sorting algorithm.
430<LI>Forwards compatibility: A, <CODE>bzip2-0.1pl2</CODE> compressing,
431      <CODE>bzip2-0.9.5</CODE> decompressing, all settings normal.
432<LI>Backwards compatibility: A, <CODE>bzip2-0.9.5</CODE> compressing,
433      <CODE>bzip2-0.1pl2</CODE> decompressing, all settings normal.
434<LI>Misc test: about 400 megabytes of <CODE>.tar</CODE> files with
435      <CODE>bzip2</CODE> compiled with Checker (a memory access error
436       detector, like Purify).
437<LI>Misc tests to make sure it builds and runs ok on non-Linux/x86
438      platforms.
439</OL>
440These tests were conducted on a 225 MHz IDT WinChip machine, running
441Linux 2.0.36.  They represent nearly a week of continuous computation.
442All tests completed successfully.
443<P>
444
445<HR SIZE="6">
446<A NAME="SEC49"></A>
447<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
448<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC48"> &lt; </A>]</TD>
449<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt; ]</TD>
450<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[ &lt;&lt; ]</TD>
451<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD>
452<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt;&gt; ]</TD>
453<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD>
454<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD>
455<TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD>
456<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD>
457</TR></TABLE>
458<H2> 4.6 Further reading </H2>
459<!--docid::SEC49::-->
460<CODE>bzip2</CODE> is not research work, in the sense that it doesn't present
461any new ideas.  Rather, it's an engineering exercise based on existing
462ideas.
463<P>
464
465Four documents describe essentially all the ideas behind <CODE>bzip2</CODE>:
466<TABLE><tr><td>&nbsp;</td><td class=example><pre>Michael Burrows and D. J. Wheeler:
467  "A block-sorting lossless data compression algorithm"
468   10th May 1994.
469   Digital SRC Research Report 124.
470   ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
471   If you have trouble finding it, try searching at the
472   New Zealand Digital Library, http://www.nzdl.org.
473
474Daniel S. Hirschberg and Debra A. LeLewer
475  "Efficient Decoding of Prefix Codes"
476   Communications of the ACM, April 1990, Vol 33, Number 4.
477   You might be able to get an electronic copy of this
478      from the ACM Digital Library.
479
480David J. Wheeler
481   Program bred3.c and accompanying document bred3.ps.
482   This contains the idea behind the multi-table Huffman
483   coding scheme.
484   ftp://ftp.cl.cam.ac.uk/users/djw3/
485
486Jon L. Bentley and Robert Sedgewick
487  "Fast Algorithms for Sorting and Searching Strings"
488   Available from Sedgewick's web page,
489   www.cs.princeton.edu/~rs
490</pre></td></tr></table>The following paper gives valuable additional insights into the
491algorithm, but is not immediately the basis of any code
492used in bzip2.
493<TABLE><tr><td>&nbsp;</td><td class=example><pre>Peter Fenwick:
494   Block Sorting Text Compression
495   Proceedings of the 19th Australasian Computer Science Conference,
496     Melbourne, Australia.  Jan 31 - Feb 2, 1996.
497   ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps
498</pre></td></tr></table>Kunihiko Sadakane's sorting algorithm, mentioned above,
499is available from:
500<TABLE><tr><td>&nbsp;</td><td class=example><pre>http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz
501</pre></td></tr></table>The Manber-Myers suffix array construction
502algorithm is described in a paper
503available from:
504<TABLE><tr><td>&nbsp;</td><td class=example><pre>http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps
505</pre></td></tr></table>Finally, the following paper documents some recent investigations
506I made into the performance of sorting algorithms:
507<TABLE><tr><td>&nbsp;</td><td class=example><pre>Julian Seward:
508   On the Performance of BWT Sorting Algorithms
509   Proceedings of the IEEE Data Compression Conference 2000
510     Snowbird, Utah.  28-30 March 2000.
511</pre></td></tr></table></P><P>
512
513<HR SIZE="6">
514<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
515<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[ &lt;&lt; ]</TD>
516<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt;&gt; ]</TD>
517<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD>
518<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD>
519<TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD>
520<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD>
521</TR></TABLE>
522<BR> 
523<FONT SIZE="-1">
524This document was generated
525by <I>Julian Seward</I> on <I>January, 5  2002</I>
526using <A HREF="http://www.mathematik.uni-kl.de/~obachman/Texi2html
527"><I>texi2html</I></A>
528
529</BODY>
530</HTML>
Note: See TracBrowser for help on using the repository browser.