1 | The Regexp Library Cook-off |
---|
2 | |
---|
3 | Rx is, among other things, an implementation of the interface |
---|
4 | specified by POSIX for programming with regular expressions. Some |
---|
5 | other implementations are GNU regex.c and Henry Spencer's regex |
---|
6 | library. |
---|
7 | |
---|
8 | If you are maintaining a program or library that includes a regexp |
---|
9 | matcher, you might want to consider which regexp implementation is |
---|
10 | best. Regexp matchers are very complicated; they are hard to get |
---|
11 | right, hard to make fast and efficient, and hard to evaluate. |
---|
12 | Therefore, choosing the best implementation for your needs is no easy |
---|
13 | task; neither is maintaining an implementation. |
---|
14 | |
---|
15 | To my knowledge, there are no comprehensive, free-software test suites |
---|
16 | to help you evaluate regex function implementations. This release of |
---|
17 | Rx includes some tests to try to help fix that. The release includes |
---|
18 | test programs which you can use to measure some aspects of the |
---|
19 | correctness and performance of your favorite POSIX regexp library. If |
---|
20 | you use these, please consider adding new tests to the collection and |
---|
21 | sending them to the author of Rx. |
---|
22 | |
---|
23 | Below is a summary of what I learned comparing GNU regex and Rx on one |
---|
24 | system. There are also some hints for running the tests yourself. |
---|
25 | Performance numbers refer to tests done on a P90 using GCC 2.7.0 to |
---|
26 | compile with debugging information and optimization ("gcc -g -O"). |
---|
27 | |
---|
28 | Caution: the results so far are pretty one-sided in favor of Rx, but I |
---|
29 | doubt the real situation is that simple. If you find a case where GNU |
---|
30 | regex wins, please contribute it to the test suite. |
---|
31 | |
---|
32 | |
---|
33 | CORRECTNESS RESULTS SO FAR |
---|
34 | |
---|
35 | GNU Regex is not properly POSIX, as illustrated by the test |
---|
36 | "rgx-tests/backrefs". That test uses the regexp: |
---|
37 | |
---|
38 | (abc|abcd)(d|) |
---|
39 | |
---|
40 | and compares it to the string: |
---|
41 | |
---|
42 | abcd |
---|
43 | |
---|
44 | Regex correctly says that the string matches the pattern, but |
---|
45 | incorrectly sets the "pmatch" data returned by regexec. According to |
---|
46 | POSIX, |
---|
47 | |
---|
48 | pmatch[1] == (0, 4) "abcd" |
---|
49 | pmatch[2] == (4, 4) "" |
---|
50 | |
---|
51 | but GNU regex reports: |
---|
52 | |
---|
53 | pmatch[1] == (0, 3) "abc" |
---|
54 | pmatch[2] == (3, 4) "d" |
---|
55 | |
---|
56 | That bug occurs because GNU regex doesn't do enough backtracking. It |
---|
57 | accepts the match it finds because it is a longest match, but it does |
---|
58 | not properly enforce the leftmost-longest criteria for |
---|
59 | sub-expressions. |
---|
60 | |
---|
61 | I haven't found any cases where Rx fails to return the values |
---|
62 | specified by POSIX (except for internationalization features which |
---|
63 | aren't implemented yet). |
---|
64 | |
---|
65 | |
---|
66 | |
---|
67 | PERFORMANCE RESULTS SO FAR |
---|
68 | |
---|
69 | Rx is a little bit smaller than regex: |
---|
70 | |
---|
71 | POSIX functions from librx.a: |
---|
72 | text data |
---|
73 | 32757 2241 |
---|
74 | |
---|
75 | regex.o: |
---|
76 | text data |
---|
77 | 43333 522 |
---|
78 | |
---|
79 | Rx normally uses more memory at run-time, though the amount is |
---|
80 | configurable. |
---|
81 | |
---|
82 | |
---|
83 | Some speed tests: |
---|
84 | |
---|
85 | |
---|
86 | * regcomp012 |
---|
87 | |
---|
88 | This test stresses the performance of "regcomp". The performance of |
---|
89 | "regexec" doesn't matter much to its outcome. On this test, GNU regex |
---|
90 | is about 10 times faster than Rx. |
---|
91 | |
---|
92 | |
---|
93 | |
---|
94 | * regexec012 |
---|
95 | |
---|
96 | This test stresses the performance of "regexec" on a case artificially |
---|
97 | created to maximize the amount of backtracking GNU regex will have to |
---|
98 | do. Examples like this seem to be rare in real life, but not |
---|
99 | non-existent. |
---|
100 | |
---|
101 | This test is run three times, each time doubling the length of the |
---|
102 | string being checked. In the shortest case, GNU regex is just a |
---|
103 | little bit faster -- but with each doubling of the string length, Rx's |
---|
104 | run-time goes up quadratically, while GNU Regex's grows exponentially. |
---|
105 | Therefore, as the test gets longer, Rx quickly becomes much faster. |
---|
106 | |
---|
107 | This is easy to see using a graph made by "gnuplot". Instructions for |
---|
108 | making such a plot are given later in this file. |
---|
109 | |
---|
110 | |
---|
111 | |
---|
112 | * longlitstr |
---|
113 | |
---|
114 | This test uses a simple regular expression: a literal string. It |
---|
115 | compares that to a rather long input string. So, this test stresses |
---|
116 | the performance of the inner loop of regexec, with no backtracking at |
---|
117 | all. |
---|
118 | |
---|
119 | On this test, Rx ran 3 times faster than GNU regex. |
---|
120 | |
---|
121 | |
---|
122 | |
---|
123 | * n-shortlitstr |
---|
124 | |
---|
125 | This test also uses a simple regular expression: a literal string. It |
---|
126 | compares that to a largish number of input strings, each a line taken |
---|
127 | from the source of a texinfo manual. So, this test stresses |
---|
128 | the performance of the overall performance of regexec -- not just the |
---|
129 | inner loop, but also the overhead cost of calling and returning from regexec. |
---|
130 | |
---|
131 | On this test, Rx ran about 3 times faster than GNU regex. |
---|
132 | |
---|
133 | |
---|
134 | |
---|
135 | * longdotstr |
---|
136 | |
---|
137 | This test uses the same long test string as longlitstr, but this time |
---|
138 | the regexp contains a .*. On this test, Rx went a little about 3x |
---|
139 | faster than GNU regex. |
---|
140 | |
---|
141 | This test was added because during the course of development it was |
---|
142 | discovered that this Rx was incredibly slow on this test. A new |
---|
143 | optimization (a generalization of the "fastmap" strategy) was added to |
---|
144 | fix the problem. |
---|
145 | |
---|
146 | |
---|
147 | |
---|
148 | RUNNING THE CORRECTNESS TESTS |
---|
149 | |
---|
150 | The directory "rgx-tests" contains programs that check the correctness |
---|
151 | of a regexp implementation. Every test gets its own subdirectory. A |
---|
152 | simple system of shell scripts is used to run the tests. These |
---|
153 | commands illustrate how the tests were run for Rx and GNU regex on the |
---|
154 | author's system: |
---|
155 | |
---|
156 | Tests must be run from this directory: |
---|
157 | |
---|
158 | % cd rgx-tests |
---|
159 | |
---|
160 | Compile the tests for GNU rx: |
---|
161 | |
---|
162 | % ./=each ./=compile ../../rx/inst-rxposix.h ../../=build/rx/librx.a |
---|
163 | |
---|
164 | Run the tests for GNU rx: |
---|
165 | |
---|
166 | % ./=each ./=doit ,rx |
---|
167 | |
---|
168 | Examine the results by eye: |
---|
169 | |
---|
170 | % cat */,rx |
---|
171 | [....] |
---|
172 | |
---|
173 | Search for errors. Error reports begin with "###" |
---|
174 | |
---|
175 | % grep -s "###" */,rx |
---|
176 | [....] |
---|
177 | |
---|
178 | Compile the tests for GNU regex: |
---|
179 | |
---|
180 | % ./=each ./=compile ../../regex/regex.h ../../=build/regex/regex.o |
---|
181 | |
---|
182 | Run the tests for GNU regex: |
---|
183 | |
---|
184 | % ./=each ./=doit ,gnu |
---|
185 | |
---|
186 | Examine the results by eye: |
---|
187 | |
---|
188 | % cat */,gnu |
---|
189 | [....] |
---|
190 | |
---|
191 | Search for errors. Error reports begin with "###" |
---|
192 | |
---|
193 | % grep -s "###" */,gnu |
---|
194 | [....] |
---|
195 | |
---|
196 | |
---|
197 | |
---|
198 | RUNNING THE PERFORMANCE TESTS |
---|
199 | |
---|
200 | The performance tests are run much like the correctness tests: |
---|
201 | |
---|
202 | % cd rgx-perf |
---|
203 | % ./=each ./=compile ../../rx/inst-rxposix.h ../../=build/rx/librx.a |
---|
204 | % ./=each ./=doit ,rx |
---|
205 | % ./=each ./=compile ../../regex/regex.h ../../=build/regex/regex.o |
---|
206 | % ./=each ./=doit ,gnu |
---|
207 | |
---|
208 | Most of the timing results are best read by eye. |
---|
209 | |
---|
210 | One of the performance tests, the "regexec012" test, produces results |
---|
211 | suitable for graphing with GNU plot: |
---|
212 | |
---|
213 | % gnuplot |
---|
214 | gnuplot> plot "regexec012/,rx-plot" with linespoints, "regexec012/,gnu-plot" with linespoints |
---|
215 | gnuplot> quit |
---|
216 | |
---|