source: trunk/third/rx/COOKOFF @ 10430

Revision 10430, 6.1 KB checked in by ghudson, 28 years ago (diff)
This commit was generated by cvs2svn to compensate for changes in r10429, which included commits to RCS files with non-trunk default branches.
Line 
1                     The Regexp Library Cook-off
2
3Rx is, among other things, an implementation of the interface
4specified by POSIX for programming with regular expressions.  Some
5other implementations are GNU regex.c and Henry Spencer's regex
6library.
7
8If you are maintaining a program or library that includes a regexp
9matcher, you might want to consider which regexp implementation is
10best.  Regexp matchers are very complicated; they are hard to get
11right, hard to make fast and efficient, and hard to evaluate.
12Therefore, choosing the best implementation for your needs is no easy
13task; neither is maintaining an implementation.
14
15To my knowledge, there are no comprehensive, free-software test suites
16to help you evaluate regex function implementations.  This release of
17Rx includes some tests to try to help fix that.  The release includes
18test programs which you can use to measure some aspects of the
19correctness and performance of your favorite POSIX regexp library.  If
20you use these, please consider adding new tests to the collection and
21sending them to the author of Rx.
22
23Below is a summary of what I learned comparing GNU regex and Rx on one
24system.  There are also some hints for running the tests yourself.
25Performance numbers refer to tests done on a P90 using GCC 2.7.0 to
26compile with debugging information and optimization ("gcc -g -O").
27
28Caution: the results so far are pretty one-sided in favor of Rx, but I
29doubt the real situation is that simple.  If you find a case where GNU
30regex wins, please contribute it to the test suite.
31
32
33                      CORRECTNESS RESULTS SO FAR
34
35GNU Regex is not properly POSIX, as illustrated by the test
36"rgx-tests/backrefs".  That test uses the regexp:
37
38        (abc|abcd)(d|)
39
40and compares it to the string:
41
42        abcd
43
44Regex correctly says that the string matches the pattern, but
45incorrectly sets the "pmatch" data returned by regexec.  According to
46POSIX,
47
48        pmatch[1] == (0, 4) "abcd"
49        pmatch[2] == (4, 4) ""
50
51but GNU regex reports:
52
53        pmatch[1] == (0, 3) "abc"
54        pmatch[2] == (3, 4) "d"
55
56That bug occurs because GNU regex doesn't do enough backtracking.  It
57accepts the match it finds because it is a longest match, but it does
58not properly enforce the leftmost-longest criteria for
59sub-expressions.
60
61I haven't found any cases where Rx fails to return the values
62specified by POSIX (except for internationalization features which
63aren't implemented yet).
64
65
66
67                      PERFORMANCE RESULTS SO FAR
68
69Rx is a little bit smaller than regex:
70
71    POSIX functions from librx.a:
72        text    data
73        32757   2241
74
75    regex.o:
76        text    data
77        43333   522
78
79Rx normally uses more memory at run-time, though the amount is
80configurable.
81
82
83Some speed tests:
84
85
86* regcomp012
87
88This test stresses the performance of "regcomp".  The performance of
89"regexec" doesn't matter much to its outcome.  On this test, GNU regex
90is about 10 times faster than Rx.
91
92
93
94* regexec012
95
96This test stresses the performance of "regexec" on a case artificially
97created to maximize the amount of backtracking GNU regex will have to
98do.  Examples like this seem to be rare in real life, but not
99non-existent.
100
101This test is run three times, each time doubling the length of the
102string being checked.  In the shortest case, GNU regex is just a
103little bit faster -- but with each doubling of the string length, Rx's
104run-time goes up quadratically, while GNU Regex's grows exponentially.
105Therefore, as the test gets longer, Rx quickly becomes much faster.
106
107This is easy to see using a graph made by "gnuplot".  Instructions for
108making such a plot are given later in this file.
109
110
111
112* longlitstr
113
114This test uses a simple regular expression: a literal string.  It
115compares that to a rather long input string.  So, this test stresses
116the performance of the inner loop of regexec, with no backtracking at
117all.
118
119On this test, Rx ran 3 times faster than GNU regex.
120
121
122
123* n-shortlitstr
124
125This test also uses a simple regular expression: a literal string.  It
126compares that to a largish number of input strings, each a line taken
127from the source of a texinfo manual.  So, this test stresses
128the performance of the overall performance of regexec -- not just the
129inner loop, but also the overhead cost of calling and returning from regexec.
130
131On this test, Rx ran about 3 times faster than GNU regex.
132
133
134
135* longdotstr
136
137This test uses the same long test string as longlitstr, but this time
138the regexp contains a .*.  On this test, Rx went a little about 3x
139faster than GNU regex.
140
141This test was added because during the course of development it was
142discovered that this Rx was incredibly slow on this test.  A new
143optimization (a generalization of the "fastmap" strategy) was added to
144fix the problem.
145
146
147
148                    RUNNING THE CORRECTNESS TESTS
149
150The directory "rgx-tests" contains programs that check the correctness
151of a regexp implementation.  Every test gets its own subdirectory.  A
152simple system of shell scripts is used to run the tests.  These
153commands illustrate how the tests were run for Rx and GNU regex on the
154author's system:
155
156Tests must be run from this directory:
157
158        % cd rgx-tests
159
160Compile the tests for GNU rx:
161
162        % ./=each ./=compile ../../rx/inst-rxposix.h ../../=build/rx/librx.a
163       
164Run the tests for GNU rx:
165
166        % ./=each ./=doit ,rx
167
168Examine the results by eye:
169
170        % cat */,rx
171        [....]
172
173Search for errors.  Error reports begin with "###"
174
175        % grep -s  "###" */,rx
176        [....]
177
178Compile the tests for GNU regex:
179
180        % ./=each ./=compile ../../regex/regex.h ../../=build/regex/regex.o
181       
182Run the tests for GNU regex:
183
184        % ./=each ./=doit ,gnu
185
186Examine the results by eye:
187
188        % cat */,gnu
189        [....]
190
191Search for errors.  Error reports begin with "###"
192
193        % grep -s  "###" */,gnu
194        [....]
195
196
197
198                    RUNNING THE PERFORMANCE TESTS
199
200The performance tests are run much like the correctness tests:
201
202        % cd rgx-perf
203        % ./=each ./=compile ../../rx/inst-rxposix.h ../../=build/rx/librx.a
204        % ./=each ./=doit ,rx
205        % ./=each ./=compile ../../regex/regex.h ../../=build/regex/regex.o
206        % ./=each ./=doit ,gnu
207
208Most of the timing results are best read by eye.
209
210One of the performance tests, the "regexec012" test, produces results
211suitable for graphing with GNU plot:
212
213        % gnuplot
214        gnuplot> plot "regexec012/,rx-plot" with linespoints, "regexec012/,gnu-plot" with linespoints
215        gnuplot> quit
216
Note: See TracBrowser for help on using the repository browser.