source: trunk/third/bzip2/manual.texi @ 17062

Revision 17062, 89.2 KB checked in by ghudson, 23 years ago
This commit was generated by cvs2svn to compensate for changes in r17061, which included commits to RCS files with non-trunk default branches.
1\input texinfo  @c                                  -*- Texinfo -*-
2@setfilename bzip2.info
3
4@ignore
5This file documents bzip2 version 1.0.2, and associated library
6libbzip2, written by Julian Seward (jseward@acm.org).
7
8Copyright (C) 1996-2002 Julian R Seward
9
10Permission is granted to make and distribute verbatim copies of
11this manual provided the copyright notice and this permission notice
12are preserved on all copies.
13
14Permission is granted to copy and distribute translations of this manual
15into another language, under the above conditions for verbatim copies.
16@end ignore
17
18@ifinfo
19@format
20START-INFO-DIR-ENTRY
21* Bzip2: (bzip2).               A program and library for data compression.
22END-INFO-DIR-ENTRY
23@end format
24
25@end ifinfo
26
27@iftex
28@c @finalout
29@settitle bzip2 and libbzip2
30@titlepage
31@title bzip2 and libbzip2
32@subtitle a program and library for data compression
33@subtitle copyright (C) 1996-2002 Julian Seward
34@subtitle version 1.0.2 of 30 December 2001
35@author Julian Seward
36
37@end titlepage
38
39@parindent 0mm
40@parskip 2mm
41
42@end iftex
43@node Top,,, (dir)
44
45The following text is the License for this software.  You should
46find it identical to that contained in the file LICENSE in the
47source distribution.
48
49@bf{------------------ START OF THE LICENSE ------------------}
50
51This program, @code{bzip2},
52and associated library @code{libbzip2}, are
53Copyright (C) 1996-2002 Julian R Seward.  All rights reserved.
54
55Redistribution and use in source and binary forms, with or without
56modification, are permitted provided that the following conditions
57are met:
58@itemize @bullet
59@item
60   Redistributions of source code must retain the above copyright
61   notice, this list of conditions and the following disclaimer.
62@item
63   The origin of this software must not be misrepresented; you must
64   not claim that you wrote the original software.  If you use this
65   software in a product, an acknowledgment in the product
66   documentation would be appreciated but is not required.
67@item
68   Altered source versions must be plainly marked as such, and must
69   not be misrepresented as being the original software.
70@item
71   The name of the author may not be used to endorse or promote
72   products derived from this software without specific prior written
73   permission.
74@end itemize
75THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS
76OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
77WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
78ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
79DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
80DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
81GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
82INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
83WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
84NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
85SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
86
87Julian Seward, Cambridge, UK.
88
89@code{jseward@@acm.org}
90
91@code{bzip2}/@code{libbzip2} version 1.0.2 of 30 December 2001.
92
93@bf{------------------ END OF THE LICENSE ------------------}
94
95Web sites:
96
97@code{http://sources.redhat.com/bzip2}
98
99@code{http://www.cacheprof.org}
100
101PATENTS: To the best of my knowledge, @code{bzip2} does not use any patented
102algorithms.  However, I do not have the resources available to carry out
103a full patent search.  Therefore I cannot give any guarantee of the
104above statement.
105
106
107
108
109
110
111
112@chapter Introduction
113
114@code{bzip2} compresses files using the Burrows-Wheeler
115block-sorting text compression algorithm, and Huffman coding.
116Compression is generally considerably better than that
117achieved by more conventional LZ77/LZ78-based compressors,
118and approaches the performance of the PPM family of statistical compressors.
119
120@code{bzip2} is built on top of @code{libbzip2}, a flexible library
121for handling compressed data in the @code{bzip2} format.  This manual
122describes both how to use the program and
123how to work with the library interface.  Most of the
124manual is devoted to this library, not the program,
125which is good news if your interest is only in the program.
126
127Chapter 2 describes how to use @code{bzip2}; this is the only part
128you need to read if you just want to know how to operate the program.
129Chapter 3 describes the programming interfaces in detail, and
130Chapter 4 records some miscellaneous notes which I thought
131ought to be recorded somewhere.
132
133
134@chapter How to use @code{bzip2}
135
136This chapter contains a copy of the @code{bzip2} man page,
137and nothing else.
138
139@quotation
140
141@unnumberedsubsubsec NAME
142@itemize
143@item @code{bzip2}, @code{bunzip2}
144- a block-sorting file compressor, v1.0.2
145@item @code{bzcat}
146- decompresses files to stdout
147@item @code{bzip2recover}
148- recovers data from damaged bzip2 files
149@end itemize
150
151@unnumberedsubsubsec SYNOPSIS
152@itemize
153@item @code{bzip2} [ -cdfkqstvzVL123456789 ] [ filenames ...  ]
154@item @code{bunzip2} [ -fkvsVL ] [ filenames ...  ]
155@item @code{bzcat} [ -s ] [ filenames ...  ]
156@item @code{bzip2recover} filename
157@end itemize
158
159@unnumberedsubsubsec DESCRIPTION
160
161@code{bzip2} compresses files using the Burrows-Wheeler block sorting
162text compression algorithm, and Huffman coding.  Compression is
163generally considerably better than that achieved by more conventional
164LZ77/LZ78-based compressors, and approaches the performance of the PPM
165family of statistical compressors.
166
167The command-line options are deliberately very similar to those of GNU
168@code{gzip}, but they are not identical.
169
170@code{bzip2} expects a list of file names to accompany the command-line
171flags.  Each file is replaced by a compressed version of itself, with
172the name @code{original_name.bz2}.  Each compressed file has the same
173modification date, permissions, and, when possible, ownership as the
174corresponding original, so that these properties can be correctly
175restored at decompression time.  File name handling is naive in the
176sense that there is no mechanism for preserving original file names,
177permissions, ownerships or dates in filesystems which lack these
178concepts, or have serious file name length restrictions, such as MS-DOS.
179
180@code{bzip2} and @code{bunzip2} will by default not overwrite existing
181files.  If you want this to happen, specify the @code{-f} flag.
182
183If no file names are specified, @code{bzip2} compresses from standard
184input to standard output.  In this case, @code{bzip2} will decline to
185write compressed output to a terminal, as this would be entirely
186incomprehensible and therefore pointless.
187
188@code{bunzip2} (or @code{bzip2 -d}) decompresses all
189specified files.  Files which were not created by @code{bzip2}
190will be detected and ignored, and a warning issued. 
191@code{bzip2} attempts to guess the filename for the decompressed file
192from that of the compressed file as follows:
193@itemize
194@item @code{filename.bz2 } becomes @code{filename}
195@item @code{filename.bz  } becomes @code{filename}
196@item @code{filename.tbz2} becomes @code{filename.tar}
197@item @code{filename.tbz } becomes @code{filename.tar}
198@item @code{anyothername } becomes @code{anyothername.out}
199@end itemize
200If the file does not end in one of the recognised endings,
201@code{.bz2}, @code{.bz},
202@code{.tbz2} or @code{.tbz}, @code{bzip2} complains that it cannot
203guess the name of the original file, and uses the original name
204with @code{.out} appended.
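
The suffix rules above can be sketched in a few lines of C.  This is only an illustration of the table, not the code @code{bzip2} itself uses; the helper name @code{guess_output_name} is hypothetical, and a name consisting of nothing but a suffix is treated here as unrecognised.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: maps a compressed file name to the name that the
   rules above would give the decompressed file.
   Returns 1 if a recognised ending was found, 0 if ".out" was appended. */
int guess_output_name(const char *in, char *out, size_t out_size)
{
    static const struct { const char *from; const char *to; } rules[] = {
        { ".bz2",  ""     },
        { ".bz",   ""     },
        { ".tbz2", ".tar" },
        { ".tbz",  ".tar" },
    };
    size_t in_len = strlen(in);

    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t suf_len = strlen(rules[i].from);
        if (in_len > suf_len &&
            strcmp(in + in_len - suf_len, rules[i].from) == 0) {
            /* Copy everything before the suffix, then the replacement. */
            snprintf(out, out_size, "%.*s%s",
                     (int)(in_len - suf_len), in, rules[i].to);
            return 1;
        }
    }
    /* No recognised ending: keep the whole name and append ".out". */
    snprintf(out, out_size, "%s.out", in);
    return 0;
}
```

Called with @code{"filename.tbz2"} it writes @code{filename.tar} and returns 1; called with a name that has none of the four endings it appends @code{.out} and returns 0, which is the case where @code{bzip2} prints its complaint.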
205
206As with compression, supplying no
207filenames causes decompression from standard input to standard output.
208
209@code{bunzip2} will correctly decompress a file which is the
210concatenation of two or more compressed files.  The result is the
211concatenation of the corresponding uncompressed files.  Integrity
212testing (@code{-t}) of concatenated compressed files is also supported.
213
214You can also compress or decompress files to the standard output by
215giving the @code{-c} flag.  Multiple files may be compressed and
216decompressed like this.  The resulting outputs are fed sequentially to
217stdout.  Compression of multiple files in this manner generates a stream
218containing multiple compressed file representations.  Such a stream
219can be decompressed correctly only by @code{bzip2} version 0.9.0 or
220later.  Earlier versions of @code{bzip2} will stop after decompressing
221the first file in the stream.
222
223@code{bzcat} (or @code{bzip2 -dc}) decompresses all specified files to
224the standard output.
225
226@code{bzip2} will read arguments from the environment variables
227@code{BZIP2} and @code{BZIP}, in that order, and will process them
228before any arguments read from the command line.  This gives a
229convenient way to supply default arguments.
230
231Compression is always performed, even if the compressed file is slightly
232larger than the original.  Files of less than about one hundred bytes
233tend to get larger, since the compression mechanism has a constant
234overhead in the region of 50 bytes.  Random data (including the output
235of most file compressors) is coded at about 8.05 bits per byte, giving
236an expansion of around 0.5%.
237
238As a self-check for your protection, @code{bzip2} uses 32-bit CRCs to
239make sure that the decompressed version of a file is identical to the
240original.  This guards against corruption of the compressed data, and
241against undetected bugs in @code{bzip2} (hopefully very unlikely).  The
242chance of data corruption going undetected is microscopic, about one
243chance in four billion for each file processed.  Be aware, though, that
244the check occurs upon decompression, so it can only tell you that
245something is wrong.  It can't help you recover the original uncompressed
246data.  You can use @code{bzip2recover} to try to recover data from
247damaged files.
248
249Return values: 0 for a normal exit, 1 for environmental problems (file
250not found, invalid flags, I/O errors, &c), 2 to indicate a corrupt
251compressed file, 3 for an internal consistency error (eg, bug) which
252caused @code{bzip2} to panic.
253
254
255@unnumberedsubsubsec OPTIONS
256@table @code
257@item -c  --stdout
258Compress or decompress to standard output.
259@item -d  --decompress
260Force decompression.  @code{bzip2}, @code{bunzip2} and @code{bzcat} are
261really the same program, and the decision about what actions to take is
262done on the basis of which name is used.  This flag overrides that
263mechanism, and forces bzip2 to decompress.
264@item -z --compress
265The complement to @code{-d}: forces compression, regardless of the
266invocation name.
267@item -t --test
268Check integrity of the specified file(s), but don't decompress them.
269This really performs a trial decompression and throws away the result.
270@item -f --force
271Force overwrite of output files.  Normally, @code{bzip2} will not overwrite
272existing output files.  Also forces @code{bzip2} to break hard links
273to files, which it otherwise wouldn't do.
274
275@code{bzip2} normally declines to decompress files which don't have the
276correct magic header bytes.  If forced (@code{-f}), however, it will
277pass such files through unmodified.  This is how GNU @code{gzip}
278behaves.
279@item -k --keep
280Keep (don't delete) input files during compression
281or decompression.
282@item -s --small
283Reduce memory usage, for compression, decompression and testing.  Files
284are decompressed and tested using a modified algorithm which only
285requires 2.5 bytes per block byte.  This means any file can be
286decompressed in 2300k of memory, albeit at about half the normal speed.
287
288During compression, @code{-s} selects a block size of 200k, which limits
289memory use to around the same figure, at the expense of your compression
290ratio.  In short, if your machine is low on memory (8 megabytes or
291less), use @code{-s} for everything.  See MEMORY MANAGEMENT below.
292@item -q --quiet
293Suppress non-essential warning messages.  Messages pertaining to
294I/O errors and other critical events will not be suppressed.
295@item -v --verbose
296Verbose mode -- show the compression ratio for each file processed.
297Further @code{-v}'s increase the verbosity level, spewing out lots of
298information which is primarily of interest for diagnostic purposes.
299@item -L --license -V --version
300Display the software version, license terms and conditions.
301@item -1 (or --fast) to -9 (or --best)
302Set the block size to 100 k, 200 k ..  900 k when compressing.  Has no
303effect when decompressing.  See MEMORY MANAGEMENT below.
304The @code{--fast} and @code{--best} aliases are primarily for GNU
305@code{gzip} compatibility.  In particular, @code{--fast} doesn't make
306things significantly faster.  And @code{--best} merely selects the
307default behaviour.
308@item --
309Treats all subsequent arguments as file names, even if they start
310with a dash.  This is so you can handle files with names beginning
311with a dash, for example: @code{bzip2 -- -myfilename}.
312@item --repetitive-fast
313@item --repetitive-best
314These flags are redundant in versions 0.9.5 and above.  They provided
315some coarse control over the behaviour of the sorting algorithm in
316earlier versions, which was sometimes useful.  0.9.5 and above have an
317improved algorithm which renders these flags irrelevant.
318@end table
319
320
321@unnumberedsubsubsec MEMORY MANAGEMENT
322
323@code{bzip2} compresses large files in blocks.  The block size affects
324both the compression ratio achieved, and the amount of memory needed for
325compression and decompression.  The flags @code{-1} through @code{-9}
326specify the block size to be 100,000 bytes through 900,000 bytes (the
327default) respectively.  At decompression time, the block size used for
328compression is read from the header of the compressed file, and
329@code{bunzip2} then allocates itself just enough memory to decompress
330the file.  Since block sizes are stored in compressed files, it follows
331that the flags @code{-1} to @code{-9} are irrelevant to and so ignored
332during decompression.
333
334Compression and decompression requirements, in bytes, can be estimated
335as:
336@example
337     Compression:   400k + ( 8 x block size )
338
339     Decompression: 100k + ( 4 x block size ), or
340                    100k + ( 2.5 x block size )
341@end example
342Larger block sizes give rapidly diminishing marginal returns.  Most of
343the compression comes from the first two or three hundred k of block
344size, a fact worth bearing in mind when using @code{bzip2} on small machines.
345It is also important to appreciate that the decompression memory
346requirement is set at compression time by the choice of block size.
347
348For files compressed with the default 900k block size, @code{bunzip2}
349will require about 3700 kbytes to decompress.  To support decompression
350of any file on a 4 megabyte machine, @code{bunzip2} has an option to
351decompress using approximately half this amount of memory, about 2300
352kbytes.  Decompression speed is also halved, so you should use this
353option only where necessary.  The relevant flag is @code{-s}.
354
355In general, try to use the largest block size that memory constraints allow,
356since that maximises the compression achieved.  Compression and
357decompression speed are virtually unaffected by block size.
358
359Another significant point applies to files which fit in a single block
360-- that means most files you'd encounter using a large block size.  The
361amount of real memory touched is proportional to the size of the file,
362since the file is smaller than a block.  For example, compressing a file
36320,000 bytes long with the flag @code{-9} will cause the compressor to
364allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
365kbytes of it.  Similarly, the decompressor will allocate 3700k but only
366touch 100k + 20000 * 4 = 180 kbytes.
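
The estimates above are easy to turn into code.  The sketch below assumes that "k" in these formulas means 1000 bytes, which is the reading under which the @code{-9} figures (7600k and 3700k) come out exactly; the function names are hypothetical and are not part of @code{libbzip2}.

```c
/* "k" is taken to be 1000 bytes (an assumption; see the text above). */
#define K 1000L

/* Estimated bytes needed to compress with the given block size,
   or touched when the whole input is smaller than one block. */
long compress_bytes(long block_bytes)
{
    return 400 * K + 8 * block_bytes;
}

/* Estimated bytes needed to decompress; small != 0 models the -s flag,
   which uses 2.5 bytes per block byte instead of 4. */
long decompress_bytes(long block_bytes, int small)
{
    if (small)
        return 100 * K + (5 * block_bytes) / 2;
    return 100 * K + 4 * block_bytes;
}
```

Passing 900000 reproduces the default-block-size figures, and passing 20000 reproduces the 560 kbyte and 180 kbyte "touched" figures from the example in the preceding paragraph.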
367
368Here is a table which summarises the maximum memory usage for different
369block sizes.  Also recorded is the total compressed size for 14 files of
370the Calgary Text Compression Corpus totalling 3,141,622 bytes.  This
371column gives some feel for how compression varies with block size.
372These figures tend to understate the advantage of larger block sizes for
373larger files, since the Corpus is dominated by smaller files.
374@example
375          Compress   Decompress   Decompress   Corpus
376   Flag     usage      usage       -s usage     Size
377
378    -1      1200k       500k         350k      914704
379    -2      2000k       900k         600k      877703
380    -3      2800k      1300k         850k      860338
381    -4      3600k      1700k        1100k      846899
382    -5      4400k      2100k        1350k      845160
383    -6      5200k      2500k        1600k      838626
384    -7      6000k      2900k        1850k      834096
385    -8      6800k      3300k        2100k      828642
386    -9      7600k      3700k        2350k      828642
387@end example
388
389@unnumberedsubsubsec RECOVERING DATA FROM DAMAGED FILES
390
391@code{bzip2} compresses files in blocks, usually 900kbytes long.  Each
392block is handled independently.  If a media or transmission error causes
393a multi-block @code{.bz2} file to become damaged, it may be possible to
394recover data from the undamaged blocks in the file.
395
396The compressed representation of each block is delimited by a 48-bit
397pattern, which makes it possible to find the block boundaries with
398reasonable certainty.  Each block also carries its own 32-bit CRC, so
399damaged blocks can be distinguished from undamaged ones.
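
To make the idea of searching for block boundaries concrete, here is a minimal sketch of a bit-level scan.  It rests on two assumptions that this chapter does not state: that the 48-bit pattern is @code{0x314159265359}, and that bits are packed most-significant-bit first.  Because blocks are not byte-aligned, the scan has to slide one bit at a time.  The function name is hypothetical, and this is not the @code{bzip2recover} source.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed block delimiter and a mask for a 48-bit sliding window. */
#define BLOCK_MAGIC UINT64_C(0x314159265359)
#define MASK48      UINT64_C(0xFFFFFFFFFFFF)

/* Scans buf bit by bit (most significant bit first) for BLOCK_MAGIC.
   Stores the starting bit offset of each match in starts[] (at most
   max_starts of them) and returns the total number of matches. */
size_t find_block_starts(const unsigned char *buf, size_t len,
                         uint64_t *starts, size_t max_starts)
{
    uint64_t window = 0;
    size_t found = 0;

    for (size_t i = 0; i < len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            /* Shift the next bit into the low end of the window. */
            window = ((window << 1) | ((buf[i] >> bit) & 1u)) & MASK48;
            uint64_t bits_seen = (uint64_t)i * 8 + (uint64_t)(7 - bit) + 1;
            if (bits_seen >= 48 && window == BLOCK_MAGIC) {
                if (found < max_starts)
                    starts[found] = bits_seen - 48;
                found++;
            }
        }
    }
    return found;
}
```

A match only says that a block probably starts there; the per-block CRC is what decides whether the block that follows is intact.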
400
401@code{bzip2recover} is a simple program whose purpose is to search for
402blocks in @code{.bz2} files, and write each block out into its own
403@code{.bz2} file.  You can then use @code{bzip2 -t} to test the
404integrity of the resulting files, and decompress those which are
405undamaged.
406
407@code{bzip2recover}
408takes a single argument, the name of the damaged file, and writes a
409number of files @code{rec00001file.bz2}, @code{rec00002file.bz2}, etc,
410containing the extracted blocks.  The output filenames are designed so
411that the use of wildcards in subsequent processing -- for example,
412@code{bzip2 -dc rec*file.bz2 > recovered_data} -- processes the files in
413the correct order.
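
The reason the wildcard picks the files up in order is the fixed-width, zero-padded block number.  A small sketch of that naming scheme follows; the helper name is hypothetical, and the format string is inferred from the example names above rather than copied from the @code{bzip2recover} source.

```c
#include <stdio.h>

/* Builds a name of the form rec00001file.bz2 from a block number and the
   name of the damaged file.  Returns what snprintf returns: the length
   the full name needs, which may exceed out_size if it was truncated. */
int recovered_name(char *out, size_t out_size, int block_no,
                   const char *damaged)
{
    return snprintf(out, out_size, "rec%05d%s", block_no, damaged);
}
```

With five digits, block 2 sorts before block 10 in a plain string comparison, which would not be the case for unpadded numbers.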
414
415@code{bzip2recover} should be of most use dealing with large @code{.bz2}
416files, as these will contain many blocks.  It is clearly futile to use
417it on damaged single-block files, since a damaged block cannot be
418recovered.  If you wish to minimise any potential data loss through
419media or transmission errors, you might consider compressing with a
420smaller block size.
421
422
423@unnumberedsubsubsec PERFORMANCE NOTES
424
425The sorting phase of compression gathers together similar strings in the
426file.  Because of this, files containing very long runs of repeated
427symbols, like "aabaabaabaab ..."  (repeated several hundred times) may
428compress more slowly than normal.  Versions 0.9.5 and above fare much
429better than previous versions in this respect.  The ratio between
430worst-case and average-case compression time is in the region of 10:1.
431For previous versions, this figure was more like 100:1.  You can use the
432@code{-vvvv} option to monitor progress in great detail, if you want.
433
434Decompression speed is unaffected by these phenomena.
435
436@code{bzip2} usually allocates several megabytes of memory to operate
437in, and then charges all over it in a fairly random fashion.  This means
438that performance, both for compressing and decompressing, is largely
439determined by the speed at which your machine can service cache misses.
440Because of this, small changes to the code to reduce the miss rate have
441been observed to give disproportionately large performance improvements.
442I imagine @code{bzip2} will perform best on machines with very large
443caches.
444
445
446@unnumberedsubsubsec CAVEATS
447
448I/O error messages are not as helpful as they could be.  @code{bzip2}
449tries hard to detect I/O errors and exit cleanly, but the details of
450what the problem is sometimes seem rather misleading.
451
452This manual page pertains to version 1.0.2 of @code{bzip2}.  Compressed
453data created by this version is entirely forwards and backwards
454compatible with the previous public releases, versions 0.1pl2, 0.9.0,
4550.9.5, 1.0.0 and 1.0.1, but with the following exception: 0.9.0 and
456above can correctly decompress multiple concatenated compressed files.
4570.1pl2 cannot do this; it will stop after decompressing just the first
458file in the stream.
459
460@code{bzip2recover} versions prior to this one, 1.0.2, used 32-bit
461integers to represent bit positions in compressed files, so it could not
462handle compressed files more than 512 megabytes long.  Version 1.0.2 and
463above uses 64-bit ints on some platforms which support them (GNU
464supported targets, and Windows).  To establish whether or not
465@code{bzip2recover} was built with such a limitation, run it without
466arguments.  In any event you can build yourself an unlimited version if
467you can recompile it with @code{MaybeUInt64} set to be an unsigned
46864-bit integer.
469
470
471
472@unnumberedsubsubsec AUTHOR
473Julian Seward, @code{jseward@@acm.org}.
474
475@code{http://sources.redhat.com/bzip2}
476
477The ideas embodied in @code{bzip2} are due to (at least) the following
478people: Michael Burrows and David Wheeler (for the block sorting
479transformation), David Wheeler (again, for the Huffman coder), Peter
480Fenwick (for the structured coding model in the original @code{bzip},
481and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
482(for the arithmetic coder in the original @code{bzip}).  I am much
483indebted for their help, support and advice.  See the manual in the
484source distribution for pointers to sources of documentation.  Christian
485von Roques encouraged me to look for faster sorting algorithms, so as to
486speed up compression.  Bela Lubkin encouraged me to improve the
487worst-case compression performance.  The @code{bz*} scripts are derived
488from those of GNU @code{gzip}.  Many people sent patches, helped with
489portability problems, lent machines, gave advice and were generally
490helpful.
491
492@end quotation
493
494
495
496
497@chapter Programming with @code{libbzip2}
498
499This chapter describes the programming interface to @code{libbzip2}.
500
501For general background information, particularly about memory
502use and performance aspects, you'd be well advised to read Chapter 2
503as well.
504
505@section Top-level structure
506
507@code{libbzip2} is a flexible library for compressing and decompressing
508data in the @code{bzip2} data format.  Although packaged as a single
509entity, it helps to regard the library as three separate parts: the
510low-level interface, the high-level interface, and some utility
511functions.
512
513The structure of @code{libbzip2}'s interfaces is similar to
514that of Jean-loup Gailly's and Mark Adler's excellent @code{zlib}
515library.
516
517All externally visible symbols have names beginning @code{BZ2_}.
518This is new in version 1.0.  The intention is to minimise pollution
519of the namespaces of library clients.
520
521@subsection Low-level summary
522
523This interface provides services for compressing and decompressing
524data in memory.  There's no provision for dealing with files, streams
525or any other I/O mechanisms, just straight memory-to-memory work.
526In fact, this part of the library can be compiled without inclusion
527of @code{stdio.h}, which may be helpful for embedded applications.
528
529The low-level part of the library has no global variables and
530is therefore thread-safe.
531
532Six routines make up the low level interface:
533@code{BZ2_bzCompressInit}, @code{BZ2_bzCompress}, and @* @code{BZ2_bzCompressEnd}
534for compression,
535and a corresponding trio @code{BZ2_bzDecompressInit}, @* @code{BZ2_bzDecompress}
536and @code{BZ2_bzDecompressEnd} for decompression. 
537The @code{*Init} functions allocate
538memory for compression/decompression and do other
539initialisations, whilst the @code{*End} functions close down operations
540and release memory.
541
542The real work is done by @code{BZ2_bzCompress} and @code{BZ2_bzDecompress}. 
543These compress and decompress data from a user-supplied input buffer
544to a user-supplied output buffer.  These buffers can be any size;
545arbitrary quantities of data are handled by making repeated calls
546to these functions.  This is a flexible mechanism allowing a
547consumer-pull style of activity, or producer-push, or a mixture of
548both.
549
550
551
552@subsection High-level summary
553
554This interface provides some handy wrappers around the low-level
555interface to facilitate reading and writing @code{bzip2} format
556files (@code{.bz2} files).  The routines provide hooks to facilitate
557reading files in which the @code{bzip2} data stream is embedded
558within some larger-scale file structure, or where there are
559multiple @code{bzip2} data streams concatenated end-to-end.
560
561For reading files, @code{BZ2_bzReadOpen}, @code{BZ2_bzRead},
562@code{BZ2_bzReadClose} and @* @code{BZ2_bzReadGetUnused} are supplied.  For
563writing files, @code{BZ2_bzWriteOpen}, @code{BZ2_bzWrite} and
564@code{BZ2_bzWriteClose} are available.
565
566As with the low-level library, no global variables are used
567so the library is per se thread-safe.  However, if I/O errors
568occur whilst reading or writing the underlying compressed files,
569you may have to consult @code{errno} to determine the cause of
570the error.  In that case, you'd need a C library which correctly
571supports @code{errno} in a multithreaded environment.
572
573To make the library a little simpler and more portable,
574@code{BZ2_bzReadOpen} and @code{BZ2_bzWriteOpen} require you to pass them file
575handles (@code{FILE*}s) which have previously been opened for reading or
576writing respectively.  That avoids portability problems associated with
577file operations and file attributes, whilst not being much of an
578imposition on the programmer.
579
580
581
582@subsection Utility functions summary
583For very simple needs, @code{BZ2_bzBuffToBuffCompress} and
584@code{BZ2_bzBuffToBuffDecompress} are provided.  These compress
585data in memory from one buffer to another buffer in a single
586function call.  You should assess whether these functions
587fulfill your memory-to-memory compression/decompression
588requirements before investing effort in understanding the more
589general but more complex low-level interface.
590
591Yoshioka Tsuneo (@code{QWF00133@@niftyserve.or.jp} /
592@code{tsuneo-y@@is.aist-nara.ac.jp}) has contributed some functions to
593give better @code{zlib} compatibility.  These functions are
594@code{BZ2_bzopen}, @code{BZ2_bzread}, @code{BZ2_bzwrite}, @code{BZ2_bzflush},
595@code{BZ2_bzclose},
596@code{BZ2_bzerror} and @code{BZ2_bzlibVersion}.  You may find these functions
597more convenient for simple file reading and writing, than those in the
598high-level interface.  These functions are not (yet) officially part of
599the library, and are minimally documented here.  If they break, you
600get to keep all the pieces.  I hope to document them properly when time
601permits.
602
603Yoshioka also contributed modifications to allow the library to be
604built as a Windows DLL.
605
606
607@section Error handling
608
609The library is designed to recover cleanly in all situations, including
610the worst-case situation of decompressing random data.  I'm not
611100% sure that it can always do this, so you might want to add
612a signal handler to catch segmentation violations during decompression
613if you are feeling especially paranoid.  I would be interested in
614hearing more about the robustness of the library to corrupted
615compressed data.
616
617Version 1.0 is much more robust in this respect than
6180.9.0 or 0.9.5.  Investigations with Checker (a tool for
619detecting problems with memory management, similar to Purify)
620indicate that, at least for the few files I tested, all single-bit
621errors in the compressed data are caught properly, with no
622segmentation faults, no reads of uninitialised data and no
623out of range reads or writes.  So it's certainly much improved,
624although I wouldn't claim it to be totally bombproof.
625
626The file @code{bzlib.h} contains all definitions needed to use
627the library.  In particular, you should definitely not include
628@code{bzlib_private.h}.
629
630In @code{bzlib.h}, the various return values are defined.  The following
631list is not intended as an exhaustive description of the circumstances
632in which a given value may be returned -- those descriptions are given
633later.  Rather, it is intended to convey the rough meaning of each
634return value.  The first five values are normal and not intended to
635denote an error situation.
636@table @code
637@item BZ_OK
638The requested action was completed successfully.
639@item BZ_RUN_OK
640@itemx BZ_FLUSH_OK
641@itemx BZ_FINISH_OK
642In @code{BZ2_bzCompress}, the requested flush/finish/nothing-special action
643was completed successfully.
644@item BZ_STREAM_END
645Compression of data was completed, or the logical stream end was
646detected during decompression.
647@end table
648
649The following return values indicate an error of some kind.
650@table @code
651@item BZ_CONFIG_ERROR
652Indicates that the library has been improperly compiled on your
653platform -- a major configuration error.  Specifically, it means
654that @code{sizeof(char)}, @code{sizeof(short)} and @code{sizeof(int)}
655are not 1, 2 and 4 respectively, as they should be.  Note that the
656library should still work properly on 64-bit platforms which follow
657the LP64 programming model -- that is, where @code{sizeof(long)}
658and @code{sizeof(void*)} are 8.  Under LP64, @code{sizeof(int)} is
659still 4, so @code{libbzip2}, which doesn't use the @code{long} type,
660is OK.
661@item BZ_SEQUENCE_ERROR
662When using the library, it is important to call the functions in the
663correct sequence and with data structures (buffers etc) in the correct
664states.  @code{libbzip2} checks as much as it can to ensure this is
665happening, and returns @code{BZ_SEQUENCE_ERROR} if not.  Code which
666complies precisely with the function semantics, as detailed below,
667should never receive this value; such an event denotes buggy code
668which you should investigate.
669@item BZ_PARAM_ERROR
670Returned when a parameter to a function call is out of range
671or otherwise manifestly incorrect.  As with @code{BZ_SEQUENCE_ERROR},
672this denotes a bug in the client code.  The distinction between
673@code{BZ_PARAM_ERROR} and @code{BZ_SEQUENCE_ERROR} is a bit hazy, but still worth
674making.
675@item BZ_MEM_ERROR
676Returned when a request to allocate memory failed.  Note that the
677quantity of memory needed to decompress a stream cannot be determined
678until the stream's header has been read.  So @code{BZ2_bzDecompress} and
679@code{BZ2_bzRead} may return @code{BZ_MEM_ERROR} even though some of
680the compressed data has been read.  The same is not true for
compression; once @code{BZ2_bzCompressInit} or @code{BZ2_bzWriteOpen} has
successfully completed, @code{BZ_MEM_ERROR} cannot occur.
683@item BZ_DATA_ERROR
684Returned when a data integrity error is detected during decompression.
685Most importantly, this means when stored and computed CRCs for the
686data do not match.  This value is also returned upon detection of any
687other anomaly in the compressed data.
688@item BZ_DATA_ERROR_MAGIC
689As a special case of @code{BZ_DATA_ERROR}, it is sometimes useful to
690know when the compressed stream does not start with the correct
691magic bytes (@code{'B' 'Z' 'h'}). 
692@item BZ_IO_ERROR
693Returned by @code{BZ2_bzRead} and @code{BZ2_bzWrite} when there is an error
694reading or writing in the compressed file, and by @code{BZ2_bzReadOpen}
695and @code{BZ2_bzWriteOpen} for attempts to use a file for which the
696error indicator (viz, @code{ferror(f)}) is set.
697On receipt of @code{BZ_IO_ERROR}, the caller should consult
698@code{errno} and/or @code{perror} to acquire operating-system
699specific information about the problem.
700@item BZ_UNEXPECTED_EOF
701Returned by @code{BZ2_bzRead} when the compressed file finishes
702before the logical end of stream is detected.
703@item BZ_OUTBUFF_FULL
704Returned by @code{BZ2_bzBuffToBuffCompress} and
705@code{BZ2_bzBuffToBuffDecompress} to indicate that the output data
706will not fit into the output buffer provided.
707@end table
708
709
710
711@section Low-level interface
712
713@subsection @code{BZ2_bzCompressInit}
714@example
715typedef
716   struct @{
717      char *next_in;
718      unsigned int avail_in;
719      unsigned int total_in_lo32;
720      unsigned int total_in_hi32;
721
722      char *next_out;
723      unsigned int avail_out;
724      unsigned int total_out_lo32;
725      unsigned int total_out_hi32;
726
727      void *state;
728
729      void *(*bzalloc)(void *,int,int);
730      void (*bzfree)(void *,void *);
731      void *opaque;
732   @}
733   bz_stream;
734
735int BZ2_bzCompressInit ( bz_stream *strm,
736                         int blockSize100k,
737                         int verbosity,
738                         int workFactor );
739
740@end example
741
742Prepares for compression.  The @code{bz_stream} structure
743holds all data pertaining to the compression activity. 
744A @code{bz_stream} structure should be allocated and initialised
745prior to the call.
746The fields of @code{bz_stream}
747comprise the entirety of the user-visible data.  @code{state}
748is a pointer to the private data structures required for compression.
749
750Custom memory allocators are supported, via fields @code{bzalloc},
751@code{bzfree},
and @code{opaque}.  The value
@code{opaque} is passed as the first argument to
all calls to @code{bzalloc} and @code{bzfree}, but is
otherwise ignored by the library.
756The call @code{bzalloc ( opaque, n, m )} is expected to return a
757pointer @code{p} to
758@code{n * m} bytes of memory, and @code{bzfree ( opaque, p )}
759should free
760that memory.
761
762If you don't want to use a custom memory allocator, set @code{bzalloc},
763@code{bzfree} and
764@code{opaque} to @code{NULL},
765and the library will then use the standard @code{malloc}/@code{free}
766routines.
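
As an illustration of the allocator hooks just described, here is a
minimal sketch of a pair of functions with the signatures required by
@code{bzalloc} and @code{bzfree}.  The names @code{alloc_stats},
@code{counting_alloc} and @code{counting_free} are hypothetical; the
sketch simply forwards to @code{malloc}/@code{free} and uses
@code{opaque} to reach a bookkeeping record.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping record, reached through the opaque pointer. */
typedef struct {
    long live_blocks;   /* blocks handed out and not yet freed */
    long total_calls;   /* successful allocation calls seen so far */
} alloc_stats;

/* Matches the bzalloc signature: return n * m bytes, or NULL on failure. */
static void *counting_alloc(void *opaque, int n, int m)
{
    alloc_stats *st = (alloc_stats *)opaque;
    void *p;

    if (n < 0 || m < 0)
        return NULL;
    /* Multiply in size_t so that a large n * m cannot overflow int. */
    p = malloc((size_t)n * (size_t)m);
    if (p != NULL && st != NULL) {
        st->live_blocks++;
        st->total_calls++;
    }
    return p;
}

/* Matches the bzfree signature: release a block obtained from counting_alloc. */
static void counting_free(void *opaque, void *p)
{
    alloc_stats *st = (alloc_stats *)opaque;

    if (p == NULL)
        return;
    if (st != NULL)
        st->live_blocks--;
    free(p);
}
```

To use such a pair, one would set @code{strm.bzalloc = counting_alloc},
@code{strm.bzfree = counting_free} and point @code{strm.opaque} at an
@code{alloc_stats} record before calling @code{BZ2_bzCompressInit}.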
767
768Before calling @code{BZ2_bzCompressInit}, fields @code{bzalloc},
769@code{bzfree} and @code{opaque} should
770be filled appropriately, as just described.  Upon return, the internal
771state will have been allocated and initialised, and @code{total_in_lo32},
772@code{total_in_hi32}, @code{total_out_lo32} and
773@code{total_out_hi32} will have been set to zero. 
774These four fields are used by the library
775to inform the caller of the total amount of data passed into and out of
776the library, respectively.  You should not try to change them.
As of version 1.0, 64-bit counts are maintained, even on 32-bit
platforms, using the @code{_hi32} fields to store the upper 32 bits
of the count.  So, for example, the total amount of data in
is @code{(total_in_hi32 << 32) + total_in_lo32}, evaluated in an
integer type at least 64 bits wide.
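
In C, the shift has to be done in a 64-bit type, because shifting a
32-bit @code{unsigned int} left by 32 is undefined.  A minimal sketch,
assuming a compiler that provides @code{uint64_t}; the helper name
@code{bz_total64} is hypothetical:

```c
#include <stdint.h>

/* Hypothetical helper: combine a _hi32/_lo32 field pair into one
   64-bit count.  The cast must happen before the shift. */
static uint64_t bz_total64(unsigned int hi32, unsigned int lo32)
{
    return ((uint64_t)hi32 << 32) | (uint64_t)lo32;
}
```

For example, @code{bz_total64(strm.total_in_hi32, strm.total_in_lo32)}
gives the number of bytes consumed so far.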
781
782Parameter @code{blockSize100k} specifies the block size to be used for
783compression.  It should be a value between 1 and 9 inclusive, and the
actual block size used is 100000 x this figure.  9 gives the best
compression but takes the most memory.
786
787Parameter @code{verbosity} should be set to a number between 0 and 4
788inclusive.  0 is silent, and greater numbers give increasingly verbose
789monitoring/debugging output.  If the library has been compiled with
790@code{-DBZ_NO_STDIO}, no such output will appear for any verbosity
791setting.
792
793Parameter @code{workFactor} controls how the compression phase behaves
794when presented with worst case, highly repetitive, input data.  If
795compression runs into difficulties caused by repetitive data, the
796library switches from the standard sorting algorithm to a fallback
797algorithm.  The fallback is slower than the standard algorithm by
798perhaps a factor of three, but always behaves reasonably, no matter how
799bad the input.
800
801Lower values of @code{workFactor} reduce the amount of effort the
standard algorithm will expend before resorting to the fallback.  You
should set this parameter carefully: too low, and many inputs will be
handled by the fallback algorithm and so compress rather slowly; too
high, and your average-to-worst case compression times can become very
large.  The default value of 30 gives reasonable behaviour over a wide
range of circumstances.
808
809Allowable values range from 0 to 250 inclusive.  0 is a special case,
810equivalent to using the default value of 30.
811
812Note that the compressed output generated is the same regardless of
813whether or not the fallback algorithm is used.
814
815Be aware also that this parameter may disappear entirely in future
816versions of the library.  In principle it should be possible to devise a
817good way to automatically choose which algorithm to use.  Such a
818mechanism would render the parameter obsolete.
819
820Possible return values:
821@display
822      @code{BZ_CONFIG_ERROR}
823         if the library has been mis-compiled
824      @code{BZ_PARAM_ERROR}
825         if @code{strm} is @code{NULL}
         or @code{blockSize100k} < 1 or @code{blockSize100k} > 9
827         or @code{verbosity} < 0 or @code{verbosity} > 4
828         or @code{workFactor} < 0 or @code{workFactor} > 250
829      @code{BZ_MEM_ERROR}
830         if not enough memory is available
831      @code{BZ_OK}
832         otherwise
833@end display
834Allowable next actions:
835@display
836      @code{BZ2_bzCompress}
837         if @code{BZ_OK} is returned
838      no specific action needed in case of error
839@end display
840
841@subsection @code{BZ2_bzCompress}
842@example
843   int BZ2_bzCompress ( bz_stream *strm, int action );
844@end example
845Provides more input and/or output buffer space for the library.  The
846caller maintains input and output buffers, and calls @code{BZ2_bzCompress} to
847transfer data between them.
848
849Before each call to @code{BZ2_bzCompress}, @code{next_in} should point at
850the data to be compressed, and @code{avail_in} should indicate how many
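bytes the library may read.  @code{BZ2_bzCompress} updates @code{next_in},
@code{avail_in} and @code{total_in} to reflect the number of bytes it
has read.  (Here and below, @code{total_in} and @code{total_out} are
shorthand for the corresponding @code{_lo32}/@code{_hi32} field pairs.)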
854
855Similarly, @code{next_out} should point to a buffer in which the
856compressed data is to be placed, with @code{avail_out} indicating how
857much output space is available.  @code{BZ2_bzCompress} updates
858@code{next_out}, @code{avail_out} and @code{total_out} to reflect the
859number of bytes output.
860
861You may provide and remove as little or as much data as you like on each
862call of @code{BZ2_bzCompress}.  In the limit, it is acceptable to supply and
863remove data one byte at a time, although this would be terribly
864inefficient.  You should always ensure that at least one byte of output
865space is available at each call.
866
867A second purpose of @code{BZ2_bzCompress} is to request a change of mode of the
868compressed stream. 
869
870Conceptually, a compressed stream can be in one of four states: IDLE,
871RUNNING, FLUSHING and FINISHING.  Before initialisation
872(@code{BZ2_bzCompressInit}) and after termination (@code{BZ2_bzCompressEnd}), a
873stream is regarded as IDLE.
874
875Upon initialisation (@code{BZ2_bzCompressInit}), the stream is placed in the
RUNNING state.  While more input remains to be supplied, calls to
@code{BZ2_bzCompress} should pass @code{BZ_RUN} as the requested action.
@code{BZ_FLUSH} and @code{BZ_FINISH} are also accepted in this state, as
described below.
879
880At some point, the calling program will have provided all the input data
881it wants to.  It will then want to finish up -- in effect, asking the
882library to process any data it might have buffered internally.  In this
883state, @code{BZ2_bzCompress} will no longer attempt to read data from
884@code{next_in}, but it will want to write data to @code{next_out}.
885Because the output buffer supplied by the user can be arbitrarily small,
886the finishing-up operation cannot necessarily be done with a single call
887of @code{BZ2_bzCompress}.
888
889Instead, the calling program passes @code{BZ_FINISH} as an action to
890@code{BZ2_bzCompress}.  This changes the stream's state to FINISHING.  Any
891remaining input (ie, @code{next_in[0 .. avail_in-1]}) is compressed and
892transferred to the output buffer.  To do this, @code{BZ2_bzCompress} must be
893called repeatedly until all the output has been consumed.  At that
894point, @code{BZ2_bzCompress} returns @code{BZ_STREAM_END}, and the stream's
895state is set back to IDLE.  @code{BZ2_bzCompressEnd} should then be
896called.
897
898Just to make sure the calling program does not cheat, the library makes
899a note of @code{avail_in} at the time of the first call to
900@code{BZ2_bzCompress} which has @code{BZ_FINISH} as an action (ie, at the
901time the program has announced its intention to not supply any more
902input).  By comparing this value with that of @code{avail_in} over
903subsequent calls to @code{BZ2_bzCompress}, the library can detect any
904attempts to slip in more data to compress.  Any calls for which this is
905detected will return @code{BZ_SEQUENCE_ERROR}.  This indicates a
906programming mistake which should be corrected.
907
908Instead of asking to finish, the calling program may ask
909@code{BZ2_bzCompress} to take all the remaining input, compress it and
910terminate the current (Burrows-Wheeler) compression block.  This could
911be useful for error control purposes.  The mechanism is analogous to
912that for finishing: call @code{BZ2_bzCompress} with an action of
@code{BZ_FLUSH}, remove output data, and persist with the
@code{BZ_FLUSH} action until the value @code{BZ_RUN_OK} is returned.  As
915with finishing, @code{BZ2_bzCompress} detects any attempt to provide more
916input data once the flush has begun.
917
918Once the flush is complete, the stream returns to the normal RUNNING
919state.
920
921This all sounds pretty complex, but isn't really.  Here's a table
922which shows which actions are allowable in each state, what action
923will be taken, what the next state is, and what the non-error return
924values are.  Note that you can't explicitly ask what state the
925stream is in, but nor do you need to -- it can be inferred from the
926values returned by @code{BZ2_bzCompress}.
927@display
928IDLE/@code{any}           
929      Illegal.  IDLE state only exists after @code{BZ2_bzCompressEnd} or
930      before @code{BZ2_bzCompressInit}.
931      Return value = @code{BZ_SEQUENCE_ERROR}
932
933RUNNING/@code{BZ_RUN}     
934      Compress from @code{next_in} to @code{next_out} as much as possible.
935      Next state = RUNNING
936      Return value = @code{BZ_RUN_OK}
937
938RUNNING/@code{BZ_FLUSH}   
939      Remember current value of @code{next_in}.  Compress from @code{next_in}
940      to @code{next_out} as much as possible, but do not accept any more input. 
941      Next state = FLUSHING
942      Return value = @code{BZ_FLUSH_OK}
943
944RUNNING/@code{BZ_FINISH} 
945      Remember current value of @code{next_in}.  Compress from @code{next_in}
946      to @code{next_out} as much as possible, but do not accept any more input.
947      Next state = FINISHING
948      Return value = @code{BZ_FINISH_OK}
949
950FLUSHING/@code{BZ_FLUSH} 
951      Compress from @code{next_in} to @code{next_out} as much as possible,
952      but do not accept any more input. 
953      If all the existing input has been used up and all compressed
954      output has been removed
955         Next state = RUNNING; Return value = @code{BZ_RUN_OK}
956      else
957         Next state = FLUSHING; Return value = @code{BZ_FLUSH_OK}
958
959FLUSHING/other     
960      Illegal.
961      Return value = @code{BZ_SEQUENCE_ERROR}
962
963FINISHING/@code{BZ_FINISH} 
964      Compress from @code{next_in} to @code{next_out} as much as possible,
      but do not accept any more input.
966      If all the existing input has been used up and all compressed
967      output has been removed
968         Next state = IDLE; Return value = @code{BZ_STREAM_END}
969      else
         Next state = FINISHING; Return value = @code{BZ_FINISH_OK}
971
972FINISHING/other
973      Illegal.
974      Return value = @code{BZ_SEQUENCE_ERROR}
975@end display
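
The state table can also be read as a small transition function.  The
following is a toy model of the table as written above, not the
library's own code: all of the type and function names are
hypothetical, and the flag @code{drained} stands for the condition
``all the existing input has been used up and all compressed output
has been removed''.

```c
/* Toy model of the state table; these names are not part of libbzip2. */
typedef enum { ST_IDLE, ST_RUNNING, ST_FLUSHING, ST_FINISHING } model_state;
typedef enum { ACT_RUN, ACT_FLUSH, ACT_FINISH } model_action;
typedef enum {
    RET_SEQUENCE_ERROR, RET_RUN_OK, RET_FLUSH_OK,
    RET_FINISH_OK, RET_STREAM_END
} model_result;

/* Apply one action to *st and report the modelled return value.
   On an illegal action the state is left unchanged. */
static model_result model_step(model_state *st, model_action act, int drained)
{
    switch (*st) {
    case ST_RUNNING:
        if (act == ACT_RUN)
            return RET_RUN_OK;
        if (act == ACT_FLUSH) {
            *st = ST_FLUSHING;
            return RET_FLUSH_OK;
        }
        *st = ST_FINISHING;               /* act == ACT_FINISH */
        return RET_FINISH_OK;

    case ST_FLUSHING:
        if (act != ACT_FLUSH)
            return RET_SEQUENCE_ERROR;
        if (drained) {
            *st = ST_RUNNING;
            return RET_RUN_OK;
        }
        return RET_FLUSH_OK;

    case ST_FINISHING:
        if (act != ACT_FINISH)
            return RET_SEQUENCE_ERROR;
        if (drained) {
            *st = ST_IDLE;
            return RET_STREAM_END;
        }
        return RET_FINISH_OK;

    case ST_IDLE:
    default:
        return RET_SEQUENCE_ERROR;        /* IDLE accepts no action */
    }
}
```

Stepping a model stream through RUN, FINISH, FINISH reproduces the
usual sequence of return values: @code{BZ_RUN_OK}, @code{BZ_FINISH_OK}
and finally @code{BZ_STREAM_END}.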
976
977That still looks complicated?  Well, fair enough.  The usual sequence
978of calls for compressing a load of data is:
979@itemize @bullet
980@item Get started with @code{BZ2_bzCompressInit}.
981@item Shovel data in and shlurp out its compressed form using zero or more
982calls of @code{BZ2_bzCompress} with action = @code{BZ_RUN}.
983@item Finish up. 
984Repeatedly call @code{BZ2_bzCompress} with action = @code{BZ_FINISH},
985copying out the compressed output, until @code{BZ_STREAM_END} is returned.
986@item Close up and go home.  Call @code{BZ2_bzCompressEnd}.
987@end itemize
988If the data you want to compress fits into your input buffer all
989at once, you can skip the calls of @code{BZ2_bzCompress ( ..., BZ_RUN )} and
990just do the @code{BZ2_bzCompress ( ..., BZ_FINISH )} calls.
991
992All required memory is allocated by @code{BZ2_bzCompressInit}.  The
993compression library can accept any data at all (obviously).  So you
994shouldn't get any error return values from the @code{BZ2_bzCompress} calls.
995If you do, they will be @code{BZ_SEQUENCE_ERROR}, and indicate a bug in
996your programming.
997
998Trivial other possible return values:
999@display
1000      @code{BZ_PARAM_ERROR}   
         if @code{strm} is @code{NULL}, or @code{strm->state} is @code{NULL}
1002@end display
1003
1004@subsection @code{BZ2_bzCompressEnd}
1005@example
1006int BZ2_bzCompressEnd ( bz_stream *strm );
1007@end example
1008Releases all memory associated with a compression stream.
1009
1010Possible return values:
1011@display
   @code{BZ_PARAM_ERROR}    if @code{strm} is @code{NULL} or @code{strm->state} is @code{NULL}
1013   @code{BZ_OK}    otherwise
1014@end display
1015
1016
1017@subsection @code{BZ2_bzDecompressInit}
1018@example
1019int BZ2_bzDecompressInit ( bz_stream *strm, int verbosity, int small );
1020@end example
1021Prepares for decompression.  As with @code{BZ2_bzCompressInit}, a
1022@code{bz_stream} record should be allocated and initialised before the
1023call.  Fields @code{bzalloc}, @code{bzfree} and @code{opaque} should be
1024set if a custom memory allocator is required, or made @code{NULL} for
1025the normal @code{malloc}/@code{free} routines.  Upon return, the internal
1026state will have been initialised, and @code{total_in} and
1027@code{total_out} will be zero.
1028
1029For the meaning of parameter @code{verbosity}, see @code{BZ2_bzCompressInit}.
1030
1031If @code{small} is nonzero, the library will use an alternative
1032decompression algorithm which uses less memory but at the cost of
1033decompressing more slowly (roughly speaking, half the speed, but the
1034maximum memory requirement drops to around 2300k).  See Chapter 2 for
1035more information on memory management.
1036
1037Note that the amount of memory needed to decompress
1038a stream cannot be determined until the stream's header has been read,
1039so even if @code{BZ2_bzDecompressInit} succeeds, a subsequent
1040@code{BZ2_bzDecompress} could fail with @code{BZ_MEM_ERROR}.
1041
1042Possible return values:
1043@display
1044      @code{BZ_CONFIG_ERROR}
1045         if the library has been mis-compiled
1046      @code{BZ_PARAM_ERROR}
1047         if @code{(small != 0 && small != 1)}
1048         or @code{(verbosity < 0 || verbosity > 4)}
      @code{BZ_MEM_ERROR}
         if insufficient memory is available
      @code{BZ_OK}
         otherwise
1051@end display
1052
1053Allowable next actions:
1054@display
1055      @code{BZ2_bzDecompress}
1056         if @code{BZ_OK} was returned
1057      no specific action required in case of error
1058@end display
1059
1060 
1061
1062@subsection @code{BZ2_bzDecompress}
1063@example
1064int BZ2_bzDecompress ( bz_stream *strm );
1065@end example
Provides more input and/or output buffer space for the library.  The
1067caller maintains input and output buffers, and uses @code{BZ2_bzDecompress}
1068to transfer data between them.
1069
1070Before each call to @code{BZ2_bzDecompress}, @code{next_in}
1071should point at the compressed data,
1072and @code{avail_in} should indicate how many bytes the library
1073may read.  @code{BZ2_bzDecompress} updates @code{next_in}, @code{avail_in}
1074and @code{total_in}
1075to reflect the number of bytes it has read.
1076
1077Similarly, @code{next_out} should point to a buffer in which the uncompressed
1078output is to be placed, with @code{avail_out} indicating how much output space
is available.  @code{BZ2_bzDecompress} updates @code{next_out},
1080@code{avail_out} and @code{total_out} to reflect
1081the number of bytes output.
1082
1083You may provide and remove as little or as much data as you like on
1084each call of @code{BZ2_bzDecompress}. 
1085In the limit, it is acceptable to
1086supply and remove data one byte at a time, although this would be
1087terribly inefficient.  You should always ensure that at least one
1088byte of output space is available at each call.
1089
1090Use of @code{BZ2_bzDecompress} is simpler than @code{BZ2_bzCompress}.
1091
1092You should provide input and remove output as described above, and
1093repeatedly call @code{BZ2_bzDecompress} until @code{BZ_STREAM_END} is
1094returned.  Appearance of @code{BZ_STREAM_END} denotes that
1095@code{BZ2_bzDecompress} has detected the logical end of the compressed
1096stream.  @code{BZ2_bzDecompress} will not produce @code{BZ_STREAM_END} until
1097all output data has been placed into the output buffer, so once
1098@code{BZ_STREAM_END} appears, you are guaranteed to have available all
1099the decompressed output, and @code{BZ2_bzDecompressEnd} can safely be
1100called.
1101
In case of an error return value, you should call @code{BZ2_bzDecompressEnd}
1103to clean up and release memory.
1104
1105Possible return values:
1106@display
1107      @code{BZ_PARAM_ERROR}
         if @code{strm} is @code{NULL} or @code{strm->state} is @code{NULL}
1109         or @code{strm->avail_out < 1}
1110      @code{BZ_DATA_ERROR}
1111         if a data integrity error is detected in the compressed stream
1112      @code{BZ_DATA_ERROR_MAGIC}
1113         if the compressed stream doesn't begin with the right magic bytes
1114      @code{BZ_MEM_ERROR}
1115         if there wasn't enough memory available
1116      @code{BZ_STREAM_END}
         if the logical end of the data stream was detected and all
         output has been consumed, eg @code{strm->avail_out > 0}
1119      @code{BZ_OK}
1120         otherwise
1121@end display
1122Allowable next actions:
1123@display
1124      @code{BZ2_bzDecompress}
1125         if @code{BZ_OK} was returned
1126      @code{BZ2_bzDecompressEnd}
1127         otherwise
1128@end display
1129
1130
1131@subsection @code{BZ2_bzDecompressEnd}
1132@example
1133int BZ2_bzDecompressEnd ( bz_stream *strm );
1134@end example
1135Releases all memory associated with a decompression stream.
1136
1137Possible return values:
1138@display
1139      @code{BZ_PARAM_ERROR}
         if @code{strm} is @code{NULL} or @code{strm->state} is @code{NULL}
1141      @code{BZ_OK}
1142         otherwise
1143@end display
1144
1145Allowable next actions:
1146@display
1147      None.
1148@end display
1149
1150
1151@section High-level interface
1152
1153This interface provides functions for reading and writing
1154@code{bzip2} format files.  First, some general points.
1155
1156@itemize @bullet
1157@item All of the functions take an @code{int*} first argument,
1158  @code{bzerror}.
1159  After each call, @code{bzerror} should be consulted first to determine
1160  the outcome of the call.  If @code{bzerror} is @code{BZ_OK},
1161  the call completed
1162  successfully, and only then should the return value of the function
1163  (if any) be consulted.  If @code{bzerror} is @code{BZ_IO_ERROR},
1164  there was an error
1165  reading/writing the underlying compressed file, and you should
1166  then consult @code{errno}/@code{perror} to determine the
1167  cause of the difficulty.
1168  @code{bzerror} may also be set to various other values; precise details are
1169  given on a per-function basis below.
1170@item If @code{bzerror} indicates an error
1171  (ie, anything except @code{BZ_OK} and @code{BZ_STREAM_END}),
1172  you should immediately call @code{BZ2_bzReadClose} (or @code{BZ2_bzWriteClose},
1173  depending on whether you are attempting to read or to write)
1174  to free up all resources associated
1175  with the stream.  Once an error has been indicated, behaviour of all calls
1176  except @code{BZ2_bzReadClose} (@code{BZ2_bzWriteClose}) is undefined. 
1177  The implication is that (1) @code{bzerror} should
1178  be checked after each call, and (2) if @code{bzerror} indicates an error,
1179  @code{BZ2_bzReadClose} (@code{BZ2_bzWriteClose}) should then be called to clean up.
1180@item The @code{FILE*} arguments passed to
1181   @code{BZ2_bzReadOpen}/@code{BZ2_bzWriteOpen} 
1182  should be set to binary mode.
1183  Most Unix systems will do this by default, but other platforms,
1184  including Windows and Mac, will not.  If you omit this, you may
1185  encounter problems when moving code to new platforms.
1186@item Memory allocation requests are handled by
1187  @code{malloc}/@code{free}. 
1188  At present
1189  there is no facility for user-defined memory allocators in the file I/O
1190  functions (could easily be added, though).
1191@end itemize
1192
1193
1194
1195@subsection @code{BZ2_bzReadOpen}
1196@example
1197   typedef void BZFILE;
1198
1199   BZFILE *BZ2_bzReadOpen ( int *bzerror, FILE *f,
                            int verbosity, int small,
1201                            void *unused, int nUnused );
1202@end example
1203Prepare to read compressed data from file handle @code{f}.  @code{f}
1204should refer to a file which has been opened for reading, and for which
the error indicator (@code{ferror(f)}) is not set.  If @code{small} is 1,
1206the library will try to decompress using less memory, at the expense of
1207speed.
1208
1209For reasons explained below, @code{BZ2_bzRead} will decompress the
1210@code{nUnused} bytes starting at @code{unused}, before starting to read
1211from the file @code{f}.  At most @code{BZ_MAX_UNUSED} bytes may be
1212supplied like this.  If this facility is not required, you should pass
@code{NULL} and @code{0} for @code{unused} and @code{nUnused}
1214respectively.
1215
1216For the meaning of parameters @code{small} and @code{verbosity},
1217see @code{BZ2_bzDecompressInit}.
1218
1219The amount of memory needed to decompress a file cannot be determined
1220until the file's header has been read.  So it is possible that
1221@code{BZ2_bzReadOpen} returns @code{BZ_OK} but a subsequent call of
1222@code{BZ2_bzRead} will return @code{BZ_MEM_ERROR}.
1223
1224Possible assignments to @code{bzerror}:
1225@display
1226      @code{BZ_CONFIG_ERROR}
1227         if the library has been mis-compiled
1228      @code{BZ_PARAM_ERROR}
1229         if @code{f} is @code{NULL}
1230         or @code{small} is neither @code{0} nor @code{1}                 
1231         or @code{(unused == NULL && nUnused != 0)}
1232         or @code{(unused != NULL && !(0 <= nUnused <= BZ_MAX_UNUSED))}
1233      @code{BZ_IO_ERROR}   
1234         if @code{ferror(f)} is nonzero
1235      @code{BZ_MEM_ERROR}   
1236         if insufficient memory is available
1237      @code{BZ_OK}
1238         otherwise.
1239@end display
1240
1241Possible return values:
1242@display
1243      Pointer to an abstract @code{BZFILE}       
1244         if @code{bzerror} is @code{BZ_OK}   
1245      @code{NULL}
1246         otherwise
1247@end display
1248
1249Allowable next actions:
1250@display
1251      @code{BZ2_bzRead}
1252         if @code{bzerror} is @code{BZ_OK}   
      @code{BZ2_bzReadClose}
1254         otherwise
1255@end display
1256
1257
1258@subsection @code{BZ2_bzRead}
1259@example
1260   int BZ2_bzRead ( int *bzerror, BZFILE *b, void *buf, int len );
1261@end example
1262Reads up to @code{len} (uncompressed) bytes from the compressed file
1263@code{b} into
1264the buffer @code{buf}.  If the read was successful,
1265@code{bzerror} is set to @code{BZ_OK}
1266and the number of bytes read is returned.  If the logical end-of-stream
1267was detected, @code{bzerror} will be set to @code{BZ_STREAM_END},
1268and the number
1269of bytes read is returned.  All other @code{bzerror} values denote an error.
1270
1271@code{BZ2_bzRead} will supply @code{len} bytes,
1272unless the logical stream end is detected
1273or an error occurs.  Because of this, it is possible to detect the
1274stream end by observing when the number of bytes returned is
1275less than the number
1276requested.  Nevertheless, this is regarded as inadvisable; you should
1277instead check @code{bzerror} after every call and watch out for
1278@code{BZ_STREAM_END}.
1279
1280Internally, @code{BZ2_bzRead} copies data from the compressed file in chunks
1281of size @code{BZ_MAX_UNUSED} bytes
1282before decompressing it.  If the file contains more bytes than strictly
1283needed to reach the logical end-of-stream, @code{BZ2_bzRead} will almost certainly
read some of the trailing data before signalling @code{BZ_STREAM_END}.
To collect the read but unused data once @code{BZ_STREAM_END} has
1286appeared, call @code{BZ2_bzReadGetUnused} immediately before @code{BZ2_bzReadClose}.
1287
1288Possible assignments to @code{bzerror}:
1289@display
1290      @code{BZ_PARAM_ERROR}
1291         if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0}
1292      @code{BZ_SEQUENCE_ERROR}
1293         if @code{b} was opened with @code{BZ2_bzWriteOpen}
1294      @code{BZ_IO_ERROR}
1295         if there is an error reading from the compressed file
1296      @code{BZ_UNEXPECTED_EOF}
1297         if the compressed file ended before the logical end-of-stream was detected
1298      @code{BZ_DATA_ERROR}
1299         if a data integrity error was detected in the compressed stream
1300      @code{BZ_DATA_ERROR_MAGIC}
1301         if the stream does not begin with the requisite header bytes (ie, is not
1302         a @code{bzip2} data file).  This is really a special case of @code{BZ_DATA_ERROR}.
1303      @code{BZ_MEM_ERROR}
1304         if insufficient memory was available
1305      @code{BZ_STREAM_END}
1306         if the logical end of stream was detected.
1307      @code{BZ_OK}
1308         otherwise.
1309@end display
1310
1311Possible return values:
1312@display
1313      number of bytes read
1314         if @code{bzerror} is @code{BZ_OK} or @code{BZ_STREAM_END}
1315      undefined
1316         otherwise
1317@end display
1318
1319Allowable next actions:
1320@display
1321      collect data from @code{buf}, then @code{BZ2_bzRead} or @code{BZ2_bzReadClose}
1322         if @code{bzerror} is @code{BZ_OK}
1323      collect data from @code{buf}, then @code{BZ2_bzReadClose} or @code{BZ2_bzReadGetUnused}
         if @code{bzerror} is @code{BZ_STREAM_END}
1325      @code{BZ2_bzReadClose}
1326         otherwise
1327@end display
1328
1329
1330
1331@subsection @code{BZ2_bzReadGetUnused}
1332@example
1333   void BZ2_bzReadGetUnused ( int* bzerror, BZFILE *b,
1334                              void** unused, int* nUnused );
1335@end example
1336Returns data which was read from the compressed file but was not needed
1337to get to the logical end-of-stream.  @code{*unused} is set to the address
1338of the data, and @code{*nUnused} to the number of bytes.  @code{*nUnused} will
1339be set to a value between @code{0} and @code{BZ_MAX_UNUSED} inclusive.
1340
1341This function may only be called once @code{BZ2_bzRead} has signalled
1342@code{BZ_STREAM_END} but before @code{BZ2_bzReadClose}.
1343
1344Possible assignments to @code{bzerror}:
1345@display
1346      @code{BZ_PARAM_ERROR}
1347         if @code{b} is @code{NULL}
1348         or @code{unused} is @code{NULL} or @code{nUnused} is @code{NULL}
1349      @code{BZ_SEQUENCE_ERROR}
1350         if @code{BZ_STREAM_END} has not been signalled
1351         or if @code{b} was opened with @code{BZ2_bzWriteOpen}
      @code{BZ_OK}
1353         otherwise
1354@end display
1355
1356Allowable next actions:
1357@display
1358      @code{BZ2_bzReadClose}
1359@end display
1360
1361
1362@subsection @code{BZ2_bzReadClose}
1363@example
1364   void BZ2_bzReadClose ( int *bzerror, BZFILE *b );
1365@end example
1366Releases all memory pertaining to the compressed file @code{b}. 
1367@code{BZ2_bzReadClose} does not call @code{fclose} on the underlying file
1368handle, so you should do that yourself if appropriate.
1369@code{BZ2_bzReadClose} should be called to clean up after all error
1370situations.
1371
1372Possible assignments to @code{bzerror}:
1373@display
1374      @code{BZ_SEQUENCE_ERROR}
         if @code{b} was opened with @code{BZ2_bzWriteOpen}
1376      @code{BZ_OK}
1377         otherwise
1378@end display
1379
1380Allowable next actions:
1381@display
1382      none
1383@end display
1384
1385
1386
1387@subsection @code{BZ2_bzWriteOpen}
1388@example
1389   BZFILE *BZ2_bzWriteOpen ( int *bzerror, FILE *f,
1390                             int blockSize100k, int verbosity,
1391                             int workFactor );
1392@end example
1393Prepare to write compressed data to file handle @code{f}. 
1394@code{f} should refer to
1395a file which has been opened for writing, and for which the error
indicator (@code{ferror(f)}) is not set.
1397
1398For the meaning of parameters @code{blockSize100k},
1399@code{verbosity} and @code{workFactor}, see
1400@* @code{BZ2_bzCompressInit}.
1401
1402All required memory is allocated at this stage, so if the call
1403completes successfully, @code{BZ_MEM_ERROR} cannot be signalled by a
1404subsequent call to @code{BZ2_bzWrite}.
1405
1406Possible assignments to @code{bzerror}:
1407@display
1408      @code{BZ_CONFIG_ERROR}
1409         if the library has been mis-compiled
1410      @code{BZ_PARAM_ERROR}
1411         if @code{f} is @code{NULL}
1412         or @code{blockSize100k < 1} or @code{blockSize100k > 9}
1413      @code{BZ_IO_ERROR}
1414         if @code{ferror(f)} is nonzero
1415      @code{BZ_MEM_ERROR}
1416         if insufficient memory is available
1417      @code{BZ_OK}
1418         otherwise
1419@end display
1420
1421Possible return values:
1422@display
1423      Pointer to an abstract @code{BZFILE} 
1424         if @code{bzerror} is @code{BZ_OK}   
1425      @code{NULL}
1426         otherwise
1427@end display
1428
1429Allowable next actions:
1430@display
1431      @code{BZ2_bzWrite}
1432         if @code{bzerror} is @code{BZ_OK}
1433         (you could go directly to @code{BZ2_bzWriteClose}, but this would be pretty pointless)
1434      @code{BZ2_bzWriteClose}
1435         otherwise
1436@end display
1437
1438
1439
1440@subsection @code{BZ2_bzWrite}
1441@example
1442   void BZ2_bzWrite ( int *bzerror, BZFILE *b, void *buf, int len );
1443@end example
1444Absorbs @code{len} bytes from the buffer @code{buf}, eventually to be
1445compressed and written to the file.
1446
1447Possible assignments to @code{bzerror}:
1448@display
1449      @code{BZ_PARAM_ERROR}
1450         if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0}
1451      @code{BZ_SEQUENCE_ERROR}
         if @code{b} was opened with @code{BZ2_bzReadOpen}
1453      @code{BZ_IO_ERROR}
1454         if there is an error writing the compressed file.
1455      @code{BZ_OK}
1456         otherwise
1457@end display
1458
1459
1460
1461
1462@subsection @code{BZ2_bzWriteClose}
1463@example
   void BZ2_bzWriteClose ( int *bzerror, BZFILE* b,
                           int abandon,
                           unsigned int* nbytes_in,
                           unsigned int* nbytes_out );

   void BZ2_bzWriteClose64 ( int *bzerror, BZFILE* b,
                             int abandon,
                             unsigned int* nbytes_in_lo32,
                             unsigned int* nbytes_in_hi32,
                             unsigned int* nbytes_out_lo32,
                             unsigned int* nbytes_out_hi32 );
1475@end example
1476
1477Compresses and flushes to the compressed file all data so far supplied
1478by @code{BZ2_bzWrite}.  The logical end-of-stream markers are also written, so
1479subsequent calls to @code{BZ2_bzWrite} are illegal.  All memory associated
1480with the compressed file @code{b} is released. 
1481@code{fflush} is called on the
1482compressed file, but it is not @code{fclose}'d.
1483
1484If @code{BZ2_bzWriteClose} is called to clean up after an error, the only
1485action is to release the memory.  The library records the error codes
1486issued by previous calls, so this situation will be detected
1487automatically.  There is no attempt to complete the compression
1488operation, nor to @code{fflush} the compressed file.  You can force this
1489behaviour to happen even in the case of no error, by passing a nonzero
1490value to @code{abandon}.
1491
1492If @code{nbytes_in} is non-null, @code{*nbytes_in} will be set to be the
1493total volume of uncompressed data handled.  Similarly, @code{nbytes_out}
1494will be set to the total volume of compressed data written.  For
1495compatibility with older versions of the library, @code{BZ2_bzWriteClose}
1496only yields the lower 32 bits of these counts.  Use
1497@code{BZ2_bzWriteClose64} if you want the full 64 bit counts.  These
1498two functions are otherwise absolutely identical.
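
If you use @code{BZ2_bzWriteClose64}, you will usually want to glue the two halves of each count back together.  Here is a minimal sketch of that step; the helper name @code{bz_count64} is made up for illustration and is not part of the library, and it assumes @code{unsigned int} is 32 bits wide.

```c
#include <stdint.h>

/* Join the low and high 32-bit halves reported by BZ2_bzWriteClose64
   into a single 64-bit byte count.  Illustrative helper only. */
uint64_t bz_count64(unsigned int lo32, unsigned int hi32)
{
    return ((uint64_t)hi32 << 32) | (uint64_t)lo32;
}
```

You would call it once for the @code{nbytes_in_lo32}/@code{nbytes_in_hi32} pair and once for the @code{nbytes_out} pair.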
1499
1500
1501Possible assignments to @code{bzerror}:
1502@display
1503      @code{BZ_SEQUENCE_ERROR}
1504         if @code{b} was opened with @code{BZ2_bzReadOpen}
1505      @code{BZ_IO_ERROR}
1506         if there is an error writing the compressed file
1507      @code{BZ_OK}
1508         otherwise
1509@end display
1510
1511@subsection Handling embedded compressed data streams
1512
1513The high-level library facilitates use of
1514@code{bzip2} data streams which form some part of a surrounding, larger
1515data stream.
1516@itemize @bullet
1517@item For writing, the library takes an open file handle, writes
1518compressed data to it, @code{fflush}es it but does not @code{fclose} it.
1519The calling application can write its own data before and after the
1520compressed data stream, using that same file handle.
1521@item Reading is more complex, and the facilities are not as general
1522as they could be since generality is hard to reconcile with efficiency.
1523@code{BZ2_bzRead} reads from the compressed file in blocks of size
1524@code{BZ_MAX_UNUSED} bytes, and in doing so probably will overshoot
1525the logical end of compressed stream.
1526To recover this data once decompression has
1527ended, call @code{BZ2_bzReadGetUnused} after the last call of @code{BZ2_bzRead}
1528(the one returning @code{BZ_STREAM_END}) but before calling
1529@code{BZ2_bzReadClose}.
1530@end itemize
1531
1532This mechanism makes it easy to decompress multiple @code{bzip2}
streams placed end-to-end.  At the end of one stream, when @code{BZ2_bzRead}
1534returns @code{BZ_STREAM_END}, call @code{BZ2_bzReadGetUnused} to collect the
1535unused data (copy it into your own buffer somewhere). 
1536That data forms the start of the next compressed stream.
1537To start uncompressing that next stream, call @code{BZ2_bzReadOpen} again,
1538feeding in the unused data via the @code{unused}/@code{nUnused}
1539parameters.
1540Keep doing this until @code{BZ_STREAM_END} return coincides with the
1541physical end of file (@code{feof(f)}).  In this situation
1542@code{BZ2_bzReadGetUnused}
1543will of course return no data.
1544
1545This should give some feel for how the high-level interface can be used.
1546If you require extra flexibility, you'll have to bite the bullet and get
1547to grips with the low-level interface.
1548
1549@subsection Standard file-reading/writing code
1550Here's how you'd write data to a compressed file:
@example
FILE*   f;
BZFILE* b;
int     nBuf;
char    buf[ /* whatever size you like */ ];
int     bzerror;

f = fopen ( "myfile.bz2", "wb" );
if (!f) @{
   /* handle error */
@}
/* blockSize100k = 9, verbosity = 0, workFactor = 0 (use the default) */
b = BZ2_bzWriteOpen ( &bzerror, f, 9, 0, 0 );
if (bzerror != BZ_OK) @{
   BZ2_bzWriteClose ( &bzerror, b, 0, NULL, NULL );
   /* handle error */
@}

while ( /* condition */ ) @{
   /* get data to write into buf, and set nBuf appropriately */
   BZ2_bzWrite ( &bzerror, b, buf, nBuf );
   if (bzerror == BZ_IO_ERROR) @{
      BZ2_bzWriteClose ( &bzerror, b, 0, NULL, NULL );
      /* handle error */
   @}
@}

BZ2_bzWriteClose ( &bzerror, b, 0, NULL, NULL );
if (bzerror == BZ_IO_ERROR) @{
   /* handle error */
@}
@end example
1583And to read from a compressed file:
@example
FILE*   f;
BZFILE* b;
int     nBuf;
char    buf[ /* whatever size you like */ ];
int     bzerror;

f = fopen ( "myfile.bz2", "rb" );
if (!f) @{
   /* handle error */
@}
/* verbosity = 0, small = 0, no unused data from a previous stream */
b = BZ2_bzReadOpen ( &bzerror, f, 0, 0, NULL, 0 );
if (bzerror != BZ_OK) @{
   BZ2_bzReadClose ( &bzerror, b );
   /* handle error */
@}

bzerror = BZ_OK;
while (bzerror == BZ_OK && /* arbitrary other conditions */) @{
   nBuf = BZ2_bzRead ( &bzerror, b, buf, /* size of buf */ );
   if (bzerror == BZ_OK || bzerror == BZ_STREAM_END) @{
      /* do something with buf[0 .. nBuf-1] */
   @}
@}
if (bzerror != BZ_STREAM_END) @{
   BZ2_bzReadClose ( &bzerror, b );
   /* handle error */
@} else @{
   BZ2_bzReadClose ( &bzerror, b );
@}
@end example
1616
1617
1618
1619@section Utility functions
1620@subsection @code{BZ2_bzBuffToBuffCompress}
1621@example
1622   int BZ2_bzBuffToBuffCompress( char*         dest,
1623                                 unsigned int* destLen,
1624                                 char*         source,
1625                                 unsigned int  sourceLen,
1626                                 int           blockSize100k,
1627                                 int           verbosity,
1628                                 int           workFactor );
1629@end example
1630Attempts to compress the data in @code{source[0 .. sourceLen-1]}
1631into the destination buffer, @code{dest[0 .. *destLen-1]}.
1632If the destination buffer is big enough, @code{*destLen} is
1633set to the size of the compressed data, and @code{BZ_OK} is
1634returned.  If the compressed data won't fit, @code{*destLen}
1635is unchanged, and @code{BZ_OUTBUFF_FULL} is returned.
1636
1637Compression in this manner is a one-shot event, done with a single call
1638to this function.  The resulting compressed data is a complete
1639@code{bzip2} format data stream.  There is no mechanism for making
1640additional calls to provide extra input data.  If you want that kind of
1641mechanism, use the low-level interface.
1642
1643For the meaning of parameters @code{blockSize100k}, @code{verbosity}
1644and @code{workFactor}, @* see @code{BZ2_bzCompressInit}.
1645
1646To guarantee that the compressed data will fit in its buffer, allocate
1647an output buffer of size 1% larger than the uncompressed data, plus
1648six hundred extra bytes.
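
As a rough sketch, the sizing rule above can be written as a small helper.  The name @code{bz_compress_bound} is invented here and is not a library function; the 1% is rounded up so that tiny inputs still get their extra byte, and no attempt is made to guard against overflow for inputs close to the largest representable size.

```c
#include <stddef.h>

/* Output buffer size that follows the rule above: the input size, plus
   1% of it (rounded up), plus 600 bytes.  Illustrative helper only. */
size_t bz_compress_bound(size_t sourceLen)
{
    return sourceLen + (sourceLen + 99) / 100 + 600;
}
```

Remember that @code{*destLen} is an @code{unsigned int}, so the result still has to fit in one before you pass it on.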
1649
@code{BZ2_bzBuffToBuffCompress} will not write data at or
1651beyond @code{dest[*destLen]}, even in case of buffer overflow.
1652
1653Possible return values:
1654@display
1655      @code{BZ_CONFIG_ERROR}
1656         if the library has been mis-compiled
1657      @code{BZ_PARAM_ERROR}
1658         if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL}
1659         or @code{blockSize100k < 1} or @code{blockSize100k > 9}
1660         or @code{verbosity < 0} or @code{verbosity > 4}
1661         or @code{workFactor < 0} or @code{workFactor > 250}
1662      @code{BZ_MEM_ERROR}
1663         if insufficient memory is available
1664      @code{BZ_OUTBUFF_FULL}
1665         if the size of the compressed data exceeds @code{*destLen}
1666      @code{BZ_OK}
1667         otherwise
1668@end display
1669
1670
1671
1672@subsection @code{BZ2_bzBuffToBuffDecompress}
1673@example
1674   int BZ2_bzBuffToBuffDecompress ( char*         dest,
1675                                    unsigned int* destLen,
1676                                    char*         source,
1677                                    unsigned int  sourceLen,
1678                                    int           small,
1679                                    int           verbosity );
1680@end example
1681Attempts to decompress the data in @code{source[0 .. sourceLen-1]}
1682into the destination buffer, @code{dest[0 .. *destLen-1]}.
1683If the destination buffer is big enough, @code{*destLen} is
1684set to the size of the uncompressed data, and @code{BZ_OK} is
returned.  If the uncompressed data won't fit, @code{*destLen}
1686is unchanged, and @code{BZ_OUTBUFF_FULL} is returned.
1687
1688@code{source} is assumed to hold a complete @code{bzip2} format
1689data stream.  @* @code{BZ2_bzBuffToBuffDecompress} tries to decompress
1690the entirety of the stream into the output buffer.
1691
1692For the meaning of parameters @code{small} and @code{verbosity},
1693see @code{BZ2_bzDecompressInit}.
1694
1695Because the compression ratio of the compressed data cannot be known in
1696advance, there is no easy way to guarantee that the output buffer will
1697be big enough.  You may of course make arrangements in your code to
1698record the size of the uncompressed data, but such a mechanism is beyond
1699the scope of this library.
1700
1701@code{BZ2_bzBuffToBuffDecompress} will not write data at or
1702beyond @code{dest[*destLen]}, even in case of buffer overflow.
1703
1704Possible return values:
1705@display
1706      @code{BZ_CONFIG_ERROR}
1707         if the library has been mis-compiled
1708      @code{BZ_PARAM_ERROR}
1709         if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL}
1710         or @code{small != 0 && small != 1}
1711         or @code{verbosity < 0} or @code{verbosity > 4}
1712      @code{BZ_MEM_ERROR}
1713         if insufficient memory is available
1714      @code{BZ_OUTBUFF_FULL}
         if the size of the uncompressed data exceeds @code{*destLen}
1716      @code{BZ_DATA_ERROR}
1717         if a data integrity error was detected in the compressed data
1718      @code{BZ_DATA_ERROR_MAGIC}
1719         if the compressed data doesn't begin with the right magic bytes
1720      @code{BZ_UNEXPECTED_EOF}
1721         if the compressed data ends unexpectedly
1722      @code{BZ_OK}
1723         otherwise
1724@end display
1725
1726
1727
1728@section @code{zlib} compatibility functions
1729Yoshioka Tsuneo has contributed some functions to
1730give better @code{zlib} compatibility.  These functions are
1731@code{BZ2_bzopen}, @code{BZ2_bzread}, @code{BZ2_bzwrite}, @code{BZ2_bzflush},
1732@code{BZ2_bzclose},
1733@code{BZ2_bzerror} and @code{BZ2_bzlibVersion}.
1734These functions are not (yet) officially part of
1735the library.  If they break, you get to keep all the pieces.
1736Nevertheless, I think they work ok.
1737@example
1738typedef void BZFILE;
1739
1740const char * BZ2_bzlibVersion ( void );
1741@end example
1742Returns a string indicating the library version.
1743@example
1744BZFILE * BZ2_bzopen  ( const char *path, const char *mode );
1745BZFILE * BZ2_bzdopen ( int        fd,    const char *mode );
1746@end example
1747Opens a @code{.bz2} file for reading or writing, using either its name
1748or a pre-existing file descriptor.
1749Analogous to @code{fopen} and @code{fdopen}.
1750@example         
1751int BZ2_bzread  ( BZFILE* b, void* buf, int len );
1752int BZ2_bzwrite ( BZFILE* b, void* buf, int len );
1753@end example
1754Reads/writes data from/to a previously opened @code{BZFILE}.
1755Analogous to @code{fread} and @code{fwrite}.
1756@example
1757int  BZ2_bzflush ( BZFILE* b );
1758void BZ2_bzclose ( BZFILE* b );
1759@end example
1760Flushes/closes a @code{BZFILE}.  @code{BZ2_bzflush} doesn't actually do
1761anything.  Analogous to @code{fflush} and @code{fclose}.
1762
1763@example
const char * BZ2_bzerror ( BZFILE *b, int *errnum );
1765@end example
Returns a string describing the most recent error status of
1767@code{b}, and also sets @code{*errnum} to its numerical value.
1768
1769
1770@section Using the library in a @code{stdio}-free environment
1771
1772@subsection Getting rid of @code{stdio}
1773
1774In a deeply embedded application, you might want to use just
1775the memory-to-memory functions.  You can do this conveniently
1776by compiling the library with preprocessor symbol @code{BZ_NO_STDIO}
1777defined.  Doing this gives you a library containing only the following
1778eight functions:
1779
1780@code{BZ2_bzCompressInit}, @code{BZ2_bzCompress}, @code{BZ2_bzCompressEnd} @*
1781@code{BZ2_bzDecompressInit}, @code{BZ2_bzDecompress}, @code{BZ2_bzDecompressEnd} @*
1782@code{BZ2_bzBuffToBuffCompress}, @code{BZ2_bzBuffToBuffDecompress}
1783
1784When compiled like this, all functions will ignore @code{verbosity}
1785settings.
1786
1787@subsection Critical error handling
1788@code{libbzip2} contains a number of internal assertion checks which
1789should, needless to say, never be activated.  Nevertheless, if an
1790assertion should fail, behaviour depends on whether or not the library
1791was compiled with @code{BZ_NO_STDIO} set.
1792
1793For a normal compile, an assertion failure yields the message
1794@example
1795   bzip2/libbzip2: internal error number N.
1796   This is a bug in bzip2/libbzip2, 1.0.2, 30-Dec-2001.
1797   Please report it to me at: jseward@@acm.org.  If this happened
1798   when you were using some program which uses libbzip2 as a
1799   component, you should also report this bug to the author(s)
1800   of that program.  Please make an effort to report this bug;
1801   timely and accurate bug reports eventually lead to higher
1802   quality software.  Thanks.  Julian Seward, 30 December 2001.
1803@end example
1804where @code{N} is some error code number.  If @code{N == 1007}, it also
1805prints some extra text advising the reader that unreliable memory is
often associated with internal error 1007.  (This is a
frequently observed phenomenon with versions 1.0.0/1.0.1.)
1808
1809@code{exit(3)} is then called.
1810
1811For a @code{stdio}-free library, assertion failures result
1812in a call to a function declared as:
1813@example
1814   extern void bz_internal_error ( int errcode );
1815@end example
1816The relevant code is passed as a parameter.  You should supply
1817such a function.
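
One possible shape for such a function is sketched below.  It is only an illustration: @code{bz_recovery_point}, @code{bz_last_internal_error} and @code{bz_guarded_call} are names invented here, and jumping out with @code{longjmp} is my assumption about what suits an embedded setting, not something the library prescribes.  The one firm requirement is that the handler must not return into the library.

```c
#include <setjmp.h>

/* Application-owned recovery point and the last error code seen.
   Both names are illustrative, not part of libbzip2. */
jmp_buf bz_recovery_point;
volatile int bz_last_internal_error = 0;

/* Called by a BZ_NO_STDIO build of libbzip2 when an internal assertion
   fails.  Records the code and jumps out instead of returning. */
void bz_internal_error(int errcode)
{
    bz_last_internal_error = errcode;
    longjmp(bz_recovery_point, 1);
}

/* Example caller: runs work() with the recovery point armed.  Returns 0
   if work() finished, or the recorded error code if an assertion fired.
   Every library call would have to go through a guard like this. */
int bz_guarded_call(void (*work)(void))
{
    if (setjmp(bz_recovery_point) != 0)
        return bz_last_internal_error;
    work();
    return 0;
}
```

After a nonzero return, throw away any @code{bz_stream} records that @code{work} was using; memory the library had allocated for them is simply lost in this sketch.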
1818
1819In either case, once an assertion failure has occurred, any
1820@code{bz_stream} records involved can be regarded as invalid.
1821You should not attempt to resume normal operation with them.
1822
1823You may, of course, change critical error handling to suit
1824your needs.  As I said above, critical errors indicate bugs
1825in the library and should not occur.  All "normal" error
1826situations are indicated via error return codes from functions,
1827and can be recovered from.
1828
1829
1830@section Making a Windows DLL
1831Everything related to Windows has been contributed by Yoshioka Tsuneo
1832@* (@code{QWF00133@@niftyserve.or.jp} /
1833@code{tsuneo-y@@is.aist-nara.ac.jp}), so you should send your queries to
1834him (but perhaps Cc: me, @code{jseward@@acm.org}).
1835
1836My vague understanding of what to do is: using Visual C++ 5.0,
1837open the project file @code{libbz2.dsp}, and build.  That's all.
1838
1839If you can't
1840open the project file for some reason, make a new one, naming these files:
1841@code{blocksort.c}, @code{bzlib.c}, @code{compress.c},
1842@code{crctable.c}, @code{decompress.c}, @code{huffman.c}, @*
1843@code{randtable.c} and @code{libbz2.def}.  You will also need
1844to name the header files @code{bzlib.h} and @code{bzlib_private.h}.
1845
If you don't use VC++, you may need to define the preprocessor symbol
1847@code{_WIN32}.
1848
1849Finally, @code{dlltest.c} is a sample program using the DLL.  It has a
1850project file, @code{dlltest.dsp}.
1851
1852If you just want a makefile for Visual C, have a look at
1853@code{makefile.msc}.
1854
1855Be aware that if you compile @code{bzip2} itself on Win32, you must set
1856@code{BZ_UNIX} to 0 and @code{BZ_LCCWIN32} to 1, in the file
1857@code{bzip2.c}, before compiling.  Otherwise the resulting binary won't
1858work correctly.
1859
1860I haven't tried any of this stuff myself, but it all looks plausible.
1861
1862
1863
1864@chapter Miscellanea
1865
1866These are just some random thoughts of mine.  Your mileage may
1867vary.
1868
1869@section Limitations of the compressed file format
1870@code{bzip2-1.0}, @code{0.9.5} and @code{0.9.0}
1871use exactly the same file format as the previous
1872version, @code{bzip2-0.1}.  This decision was made in the interests of
1873stability.  Creating yet another incompatible compressed file format
1874would create further confusion and disruption for users.
1875
1876Nevertheless, this is not a painless decision.  Development
1877work since the release of @code{bzip2-0.1} in August 1997
1878has shown complexities in the file format which slow down
1879decompression and, in retrospect, are unnecessary.  These are:
1880@itemize @bullet
1881@item The run-length encoder, which is the first of the
1882      compression transformations, is entirely irrelevant.
1883      The original purpose was to protect the sorting algorithm
1884      from the very worst case input: a string of repeated
1885      symbols.  But algorithm steps Q6a and Q6b in the original
1886      Burrows-Wheeler technical report (SRC-124) show how
1887      repeats can be handled without difficulty in block
1888      sorting.
1889@item The randomisation mechanism doesn't really need to be
1890      there.  Udi Manber and Gene Myers published a suffix
1891      array construction algorithm a few years back, which
1892      can be employed to sort any block, no matter how
1893      repetitive, in O(N log N) time.  Subsequent work by
1894      Kunihiko Sadakane has produced a derivative O(N (log N)^2)
1895      algorithm which usually outperforms the Manber-Myers
1896      algorithm.
1897
1898      I could have changed to Sadakane's algorithm, but I find
1899      it to be slower than @code{bzip2}'s existing algorithm for
1900      most inputs, and the randomisation mechanism protects
1901      adequately against bad cases.  I didn't think it was
1902      a good tradeoff to make.  Partly this is due to the fact
1903      that I was not flooded with email complaints about
1904      @code{bzip2-0.1}'s performance on repetitive data, so
1905      perhaps it isn't a problem for real inputs.
1906
1907      Probably the best long-term solution,
1908      and the one I have incorporated into 0.9.5 and above,
1909      is to use the existing sorting
      algorithm initially, and fall back to an O(N (log N)^2)
1911      algorithm if the standard algorithm gets into difficulties.
1912@item The compressed file format was never designed to be
      handled by a library, and I have had to jump through
1914      some hoops to produce an efficient implementation of
1915      decompression.  It's a bit hairy.  Try passing
1916      @code{decompress.c} through the C preprocessor
1917      and you'll see what I mean.  Much of this complexity
1918      could have been avoided if the compressed size of
1919      each block of data was recorded in the data stream.
1920@item An Adler-32 checksum, rather than a CRC32 checksum,
1921      would be faster to compute.
1922@end itemize
1923It would be fair to say that the @code{bzip2} format was frozen
1924before I properly and fully understood the performance
1925consequences of doing so.
1926
1927Improvements which I was able to incorporate into
0.9.0, despite using the same file format, are:
1929@itemize @bullet
1930@item Single array implementation of the inverse BWT.  This
1931      significantly speeds up decompression, presumably
1932      because it reduces the number of cache misses.
1933@item Faster inverse MTF transform for large MTF values.  The
1934      new implementation is based on the notion of sliding blocks
1935      of values.
1936@item @code{bzip2-0.9.0} now reads and writes files with @code{fread}
1937      and @code{fwrite}; version 0.1 used @code{putc} and @code{getc}.
1938      Duh!  Well, you live and learn.
1939
1940@end itemize
1941Further ahead, it would be nice
1942to be able to do random access into files.  This will
1943require some careful design of compressed file formats.
1944
1945
1946
1947@section Portability issues
1948After some consideration, I have decided not to use
1949GNU @code{autoconf} to configure 0.9.5 or 1.0.
1950
1951@code{autoconf}, admirable and wonderful though it is,
1952mainly assists with portability problems between Unix-like
1953platforms.  But @code{bzip2} doesn't have much in the way
1954of portability problems on Unix; most of the difficulties appear
1955when porting to the Mac, or to Microsoft's operating systems.
1956@code{autoconf} doesn't help in those cases, and brings in a
1957whole load of new complexity.
1958
1959Most people should be able to compile the library and program
1960under Unix straight out-of-the-box, so to speak, especially
1961if you have a version of GNU C available.
1962
1963There are a couple of @code{__inline__} directives in the code.  GNU C
1964(@code{gcc}) should be able to handle them.  If you're not using
1965GNU C, your C compiler shouldn't see them at all.
1966If your compiler does, for some reason, see them and doesn't
1967like them, just @code{#define} @code{__inline__} to be @code{/* */}.  One
1968easy way to do this is to compile with the flag @code{-D__inline__=},
1969which should be understood by most Unix compilers.
1970
1971If you still have difficulties, try compiling with the macro
1972@code{BZ_STRICT_ANSI} defined.  This should enable you to build the
1973library in a strictly ANSI compliant environment.  Building the program
1974itself like this is dangerous and not supported, since you remove
1975@code{bzip2}'s checks against compressing directories, symbolic links,
1976devices, and other not-really-a-file entities.  This could cause
1977filesystem corruption!
1978
1979One other thing: if you create a @code{bzip2} binary for public
distribution, please try and link it statically (@code{gcc -static}).  This
1981avoids all sorts of library-version issues that others may encounter
1982later on.
1983
1984If you build @code{bzip2} on Win32, you must set @code{BZ_UNIX} to 0 and
1985@code{BZ_LCCWIN32} to 1, in the file @code{bzip2.c}, before compiling.
1986Otherwise the resulting binary won't work correctly.
1987
1988
1989
1990@section Reporting bugs
1991I tried pretty hard to make sure @code{bzip2} is
1992bug free, both by design and by testing.  Hopefully
1993you'll never need to read this section for real.
1994
1995Nevertheless, if @code{bzip2} dies with a segmentation
1996fault, a bus error or an internal assertion failure, it
1997will ask you to email me a bug report.  Experience with
1998version 0.1 shows that almost all these problems can
1999be traced to either compiler bugs or hardware problems.
2000@itemize @bullet
2001@item
2002Recompile the program with no optimisation, and see if it
2003works.  And/or try a different compiler.
2004I heard all sorts of stories about various flavours
2005of GNU C (and other compilers) generating bad code for
2006@code{bzip2}, and I've run across two such examples myself.
2007
2.7.X versions of GNU C are known to generate bad code from
2009time to time, at high optimisation levels. 
2010If you get problems, try using the flags
2011@code{-O2} @code{-fomit-frame-pointer} @code{-fno-strength-reduce}.
2012You should specifically @emph{not} use @code{-funroll-loops}.
2013
2014You may notice that the Makefile runs six tests as part of
2015the build process.  If the program passes all of these, it's
2016a pretty good (but not 100%) indication that the compiler has
2017done its job correctly.
2018@item
2019If @code{bzip2} crashes randomly, and the crashes are not
2020repeatable, you may have a flaky memory subsystem.  @code{bzip2}
2021really hammers your memory hierarchy, and if it's a bit marginal,
2022you may get these problems.  Ditto if your disk or I/O subsystem
2023is slowly failing.  Yup, this really does happen.
2024
2025Try using a different machine of the same type, and see if
2026you can repeat the problem.
2027@item This isn't really a bug, but ... If @code{bzip2} tells
2028you your file is corrupted on decompression, and you
2029obtained the file via FTP, there is a possibility that you
2030forgot to tell FTP to do a binary mode transfer.  That absolutely
2031will cause the file to be non-decompressible.  You'll have to transfer
2032it again.
2033@end itemize
2034
2035If you've incorporated @code{libbzip2} into your own program
and are getting problems, please, please, please, check that the
parameters you are passing in calls to the library are
correct, and in accordance with what the documentation says
2039is allowable.  I have tried to make the library robust against
2040such problems, but I'm sure I haven't succeeded.
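
As an example of the kind of check I mean, here is a sketch of a pre-flight test for the arguments to @code{BZ2_bzBuffToBuffCompress}, using the ranges given in that function's list of @code{BZ_PARAM_ERROR} conditions.  The helper @code{bz_compress_args_ok} is hypothetical; the library reports these problems itself through @code{BZ_PARAM_ERROR}, so this merely makes the check explicit in your own code.

```c
#include <stddef.h>

/* Returns 1 if the arguments lie inside the ranges this manual lists for
   BZ2_bzBuffToBuffCompress, 0 otherwise.  Illustrative helper only. */
int bz_compress_args_ok(const char *dest, const unsigned int *destLen,
                        int blockSize100k, int verbosity, int workFactor)
{
    if (dest == NULL || destLen == NULL)        return 0;
    if (blockSize100k < 1 || blockSize100k > 9) return 0;
    if (verbosity < 0 || verbosity > 4)         return 0;
    if (workFactor < 0 || workFactor > 250)     return 0;
    return 1;
}
```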
2041
2042Finally, if the above comments don't help, you'll have to send
2043me a bug report.  Now, it's just amazing how many people will
2044send me a bug report saying something like
2045@display
2046   bzip2 crashed with segmentation fault on my machine
2047@end display
and absolutely nothing else.  Needless to say, such a report
2049is @emph{totally, utterly, completely and comprehensively 100% useless;
2050a waste of your time, my time, and net bandwidth}.
2051With no details at all, there's no way I can possibly begin
2052to figure out what the problem is.
2053
2054The rules of the game are: facts, facts, facts.  Don't omit
2055them because "oh, they won't be relevant".  At the bare
2056minimum:
2057@display
2058   Machine type.  Operating system version. 
2059   Exact version of @code{bzip2} (do @code{bzip2 -V}). 
2060   Exact version of the compiler used. 
2061   Flags passed to the compiler.
2062@end display
2063However, the most important single thing that will help me is
2064the file that you were trying to compress or decompress at the
time the problem happened.  Without that, my ability to do anything
more than speculate about the cause is limited.
2067
2068Please remember that I connect to the Internet with a modem, so
2069you should contact me before mailing me huge files.
2070
2071
2072@section Did you get the right package?
2073
2074@code{bzip2} is a resource hog.  It soaks up large amounts of CPU cycles
2075and memory.  Also, it gives very large latencies.  In the worst case, you
2076can feed many megabytes of uncompressed data into the library before
2077getting any compressed output, so this probably rules out applications
2078requiring interactive behaviour.
2079
2080These aren't faults of my implementation, I hope, but more
2081an intrinsic property of the Burrows-Wheeler transform (unfortunately). 
2082Maybe this isn't what you want.
2083
2084If you want a compressor and/or library which is faster, uses less
2085memory but gets pretty good compression, and has minimal latency,
2086consider Jean-loup
2087Gailly's and Mark Adler's work, @code{zlib-1.1.3} and
2088@code{gzip-1.2.4}.  Look for them at
2089
2090@code{http://www.zlib.org} and
2091@code{http://www.gzip.org} respectively.
2092
2093For something faster and lighter still, you might try Markus F X J
2094Oberhumer's @code{LZO} real-time compression/decompression library, at
2095@* @code{http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html}.
2096
2097If you want to use the @code{bzip2} algorithms to compress small blocks
2098of data, 64k bytes or smaller, for example on an on-the-fly disk
2099compressor, you'd be well advised not to use this library.  Instead,
2100I've made a special library tuned for that kind of use.  It's part of
2101@code{e2compr-0.40}, an on-the-fly disk compressor for the Linux
2102@code{ext2} filesystem.  Look at
2103@code{http://www.netspace.net.au/~reiter/e2compr}.
2104
2105
2106
2107@section Testing
2108
2109A record of the tests I've done.
2110
2111First, some data sets:
2112@itemize @bullet
2113@item B: a directory containing 6001 files, one for every length in the
2114      range 0 to 6000 bytes.  The files contain random lowercase
2115      letters.  18.7 megabytes.
2116@item H: my home directory tree.  Documents, source code, mail files,
2117      compressed data.  H contains B, and also a directory of
2118      files designed as boundary cases for the sorting; mostly very
2119      repetitive, nasty files.  565 megabytes.
2120@item A: directory tree holding various applications built from source:
2121      @code{egcs}, @code{gcc-2.8.1}, KDE, GTK, Octave, etc.
2122      2200 megabytes.
2123@end itemize
2124The tests conducted are as follows.  Each test means compressing
2125(a copy of) each file in the data set, decompressing it and
2126comparing it against the original.
2127
2128First, a bunch of tests with block sizes and internal buffer
2129sizes set very small,
2130to detect any problems with the
2131blocking and buffering mechanisms. 
2132This required modifying the source code so as to try to
2133break it.
2134@enumerate
2135@item Data set H, with
2136      buffer size of 1 byte, and block size of 23 bytes.
2137@item Data set B, buffer sizes 1 byte, block size 1 byte.
2138@item As (2) but small-mode decompression.
2139@item As (2) with block size 2 bytes.
2140@item As (2) with block size 3 bytes.
2141@item As (2) with block size 4 bytes.
2142@item As (2) with block size 5 bytes.
2143@item As (2) with block size 6 bytes and small-mode decompression.
2144@item H with buffer size of 1 byte, but normal block
2145      size (up to 900000 bytes).
2146@end enumerate
Then some tests with unmodified source code.
@enumerate
@item H, all settings normal.
@item As (1), with small-mode decompress.
@item H, compress with flag @code{-1}.
@item H, compress with flag @code{-s}, decompress with flag @code{-s}.
@item Forwards compatibility: H, @code{bzip2-0.1pl2} compressing,
      @code{bzip2-0.9.5} decompressing, all settings normal.
@item Backwards compatibility:  H, @code{bzip2-0.9.5} compressing,
      @code{bzip2-0.1pl2} decompressing, all settings normal.
@item Bigger tests: A, all settings normal.
@item As (7), using the fallback (Sadakane-like) sorting algorithm.
@item As (8), compress with flag @code{-1}, decompress with flag
      @code{-s}.
@item H, using the fallback sorting algorithm.
@item Forwards compatibility: A, @code{bzip2-0.1pl2} compressing,
      @code{bzip2-0.9.5} decompressing, all settings normal.
@item Backwards compatibility:  A, @code{bzip2-0.9.5} compressing,
      @code{bzip2-0.1pl2} decompressing, all settings normal.
@item Misc test: about 400 megabytes of @code{.tar} files with
      @code{bzip2} compiled with Checker (a memory access error
      detector, like Purify).
@item Misc tests to make sure it builds and runs OK on non-Linux/x86
      platforms.
@end enumerate
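The flag combinations listed above can be tried against an installed
@code{bzip2} program.  The following sketch assumes a @code{bzip2}
executable on the search path and uses only the documented flags
@code{-c}, @code{-d}, @code{-s} and @code{-1} to @code{-9};
@code{cli_round_trip} is a hypothetical helper, not part of the
distribution.

```python
import shutil
import subprocess


def cli_round_trip(data, compress_flag="-9", small_decompress=False):
    """Pipe data through bzip2 and back; True if it survives intact."""
    packed = subprocess.run(["bzip2", "-c", compress_flag], input=data,
                            stdout=subprocess.PIPE, check=True).stdout
    args = ["bzip2", "-d", "-c"]
    if small_decompress:
        args.append("-s")  # small-mode decompression
    restored = subprocess.run(args, input=packed,
                              stdout=subprocess.PIPE, check=True).stdout
    return restored == data


# Only run the checks when a bzip2 executable is available.
if shutil.which("bzip2"):
    payload = b"the quick brown fox " * 500
    assert cli_round_trip(payload)              # all settings normal
    assert cli_round_trip(payload, "-1")        # compress with -1
    assert cli_round_trip(payload, "-s", True)  # -s both ways
```
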
These tests were conducted on a 225 MHz IDT WinChip machine, running
Linux 2.0.36.  They represent nearly a week of continuous computation.
All tests completed successfully.


@section Further reading
@code{bzip2} is not research work, in the sense that it doesn't present
any new ideas.  Rather, it's an engineering exercise based on existing
ideas.

Four documents describe essentially all the ideas behind @code{bzip2}:
@example
Michael Burrows and D. J. Wheeler:
  "A block-sorting lossless data compression algorithm"
   10th May 1994.
   Digital SRC Research Report 124.
   ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
   If you have trouble finding it, try searching at the
   New Zealand Digital Library, http://www.nzdl.org.

Daniel S. Hirschberg and Debra A. Lelewer:
  "Efficient Decoding of Prefix Codes"
   Communications of the ACM, April 1990, Vol 33, Number 4.
   You might be able to get an electronic copy of this
   from the ACM Digital Library.

David J. Wheeler:
   Program bred3.c and accompanying document bred3.ps.
   This contains the idea behind the multi-table Huffman
   coding scheme.
   ftp://ftp.cl.cam.ac.uk/users/djw3/

Jon L. Bentley and Robert Sedgewick:
  "Fast Algorithms for Sorting and Searching Strings"
   Available from Sedgewick's web page,
   www.cs.princeton.edu/~rs
@end example
The following paper gives valuable additional insights into the
algorithm, but is not directly the basis of any code
used in @code{bzip2}.
@example
Peter Fenwick:
   Block Sorting Text Compression
   Proceedings of the 19th Australasian Computer Science Conference,
     Melbourne, Australia.  Jan 31 - Feb 2, 1996.
   ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps
@end example
Kunihiko Sadakane's sorting algorithm, mentioned above,
is available from:
@example
http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz
@end example
The Manber-Myers suffix array construction
algorithm is described in a paper
available from:
@example
http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps
@end example
Finally, the following paper documents some recent investigations
I made into the performance of sorting algorithms:
@example
Julian Seward:
   On the Performance of BWT Sorting Algorithms
   Proceedings of the IEEE Data Compression Conference 2000
     Snowbird, Utah.  28-30 March 2000.
@end example


@contents

@bye