Stability
---------
* ibex_open should never crash, and should never return NULL without
errno being set. Should check for errors when reading.
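
A rough sketch of what that contract could look like, not the real
ibex_open. Everything here is made up for illustration: the demo_index
struct, demo_index_open, the "ibx!" signature and the 8-byte header are
all assumptions. It also assumes a POSIX fopen that sets errno on failure.
The point is only that every NULL return path leaves errno set, and that
short or failed reads are detected.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an index handle; not the real ibex struct. */
typedef struct {
    unsigned long word_count;
} demo_index;

/* Made-up 4-byte file signature, for illustration only. */
#define DEMO_MAGIC "ibx!"

/* Open a demo index file. Returns NULL with errno set on every failure. */
demo_index *demo_index_open(const char *path)
{
    unsigned char header[8];    /* 4 bytes magic + 4 bytes big-endian count */
    demo_index *idx = NULL;
    FILE *f;
    int saved;

    assert(path != NULL);

    f = fopen(path, "rb");
    if (f == NULL)
        return NULL;            /* POSIX fopen has already set errno */

    errno = 0;
    if (fread(header, 1, sizeof header, f) != sizeof header) {
        /* Real I/O error: keep its errno. Short file: report EINVAL. */
        saved = ferror(f) ? (errno ? errno : EIO) : EINVAL;
        goto fail;
    }
    if (memcmp(header, DEMO_MAGIC, 4) != 0) {
        saved = EINVAL;         /* not one of our files */
        goto fail;
    }

    idx = malloc(sizeof *idx);
    if (idx == NULL) {
        saved = ENOMEM;
        goto fail;
    }
    idx->word_count = ((unsigned long)header[4] << 24) |
                      ((unsigned long)header[5] << 16) |
                      ((unsigned long)header[6] << 8) |
                      (unsigned long)header[7];
    fclose(f);
    return idx;

fail:
    fclose(f);                  /* may clobber errno, so restore it after */
    errno = saved;
    return NULL;
}
```

A caller can then rely on strerror(errno) whenever it gets NULL back.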


Performance
-----------
* Profiling, keep thinking about data structures, etc.

* Check memory usage

* See if writing the "inverse image" of long ref streams helps
compression without hurting performance now (i.e., if a word appears in
more than half of the files, write out the list of files it _doesn't_
appear in). (I tried this before, and it wasn't working well, but the
file format and data structures have changed a lot.)

* We could save a noticeable chunk of time if normalize_word computed
the hash of the word as it went, so that we could pass it into
g_hash_table_insert somehow.

* Make a copy of the buffer to be indexed (or provide an interface for
the caller to say that ibex can munge the provided data) and then use
that rather than constantly copying things?
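
For the "inverse image" item above, a minimal in-memory sketch of the
decision. This is not the ibex file format: ref_stream,
ref_stream_build and ref_stream_contains are hypothetical names, and it
assumes file ids are 0..nfiles-1 and that each word's refs arrive sorted
ascending with no duplicates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory form of one word's reference stream. */
typedef struct {
    int inverted;     /* 1: ids[] lists the files the word is NOT in */
    size_t count;     /* number of entries in ids[] */
    unsigned *ids;    /* malloc'd, ascending */
} ref_stream;

/* Store refs[] directly, or store its complement when the word appears
 * in more than half of the nfiles files. Returns 0 on success, -1 on
 * bad input or allocation failure. */
int ref_stream_build(const unsigned *refs, size_t nrefs, size_t nfiles,
                     ref_stream *out)
{
    size_t n = 0, r = 0;
    unsigned id;

    assert(out != NULL);
    if (nrefs > nfiles)
        return -1;

    /* "more than half", written so that it cannot overflow */
    out->inverted = nrefs > nfiles - nrefs;
    out->count = out->inverted ? nfiles - nrefs : nrefs;
    out->ids = malloc((out->count ? out->count : 1) * sizeof *out->ids);
    if (out->ids == NULL)
        return -1;

    if (!out->inverted) {
        if (nrefs > 0)
            memcpy(out->ids, refs, nrefs * sizeof *refs);
        return 0;
    }

    /* Walk every file id; keep the ones that are missing from refs[]. */
    for (id = 0; id < nfiles; id++) {
        if (r < nrefs && refs[r] == id)
            r++;
        else if (n < out->count)    /* guard against unsorted input */
            out->ids[n++] = id;
    }
    return 0;
}

/* Does the word appear in file `id`? Linear scan, for clarity only. */
int ref_stream_contains(const ref_stream *s, unsigned id)
{
    size_t i;
    int listed = 0;

    for (i = 0; i < s->count; i++) {
        if (s->ids[i] == id) {
            listed = 1;
            break;
        }
    }
    return listed != s->inverted;
}
```

With 5 files and a word in files 0, 1, 3 and 4, this stores the single
id 2 with the inverted flag set. Whether that helps on disk depends on
how the real stream is encoded, which this sketch does not model.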


Functionality
-------------
* ibex file locking

* specify file mode in ibex_open

* ibex_find* need to normalize the search words... should this be done
by the caller or by ibex_find?

* There needs to be some way to do a secondary search after getting
results back from ibex_find* (i.e., for "foo near bar"). This either has
to be done by ibex, or requires us to export the normalize interface.

* Does there need to be an ibex_find_any, or is that easy enough for the
caller to do?

* utf8_trans needs to cover at least two more code pages. This is
tricky because it's not clear whether some of the letters there should
be translated to ASCII or left as UTF-8. This requires some
investigation.

* ibex_index_* need to ignore HTML tags.
    NAME = [A-Za-z][A-Za-z0-9.-]*
    </?{NAME}(\s*{NAME}(\s*=\s*({NAME}|"[^"]*"|'[^']*')))*>
    <!(--([^-]|-[^-])*--\s*)*>

ugh. ok, simplifying, we get:
    <[^!]([^"'>]*("[^"]*"|'[^']*'))*[^"'>]*> or
    <!(--([^-]|-[^-])*--\s*)*>

which is still not simple. sigh.

* ibex_index_* need to recognize and ignore "non-text". Particularly
BinHex and uuencoding.
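
For the HTML tag item above, one possible approach is a small
hand-written scanner in place of a regex. The sketch below loosely
follows the simplified patterns: an ordinary tag runs from '<' to the
next '>' that is outside quotes, and a "<!" declaration is only accepted
if it is made up of "--...--" comments. skip_tag and strip_tags are
hypothetical names, not existing ibex functions, and things like
<!DOCTYPE ...> are deliberately left alone, as in the patterns.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* If a complete tag starts at p, return a pointer just past its '>'.
 * Otherwise return NULL. */
const char *skip_tag(const char *p)
{
    if (p[0] != '<' || p[1] == '\0')
        return NULL;
    p++;

    if (*p == '!') {
        /* <! followed by zero or more "--...--" comments, then '>' */
        p++;
        while (p[0] == '-' && p[1] == '-') {
            const char *end = strstr(p + 2, "--");
            if (end == NULL)
                return NULL;            /* unterminated comment */
            p = end + 2;
            while (isspace((unsigned char)*p))
                p++;
        }
        return *p == '>' ? p + 1 : NULL;
    }

    /* Ordinary tag: scan to '>', stepping over quoted attribute values. */
    while (*p != '\0' && *p != '>') {
        if (*p == '"' || *p == '\'') {
            const char *close = strchr(p + 1, *p);
            if (close == NULL)
                return NULL;            /* unterminated quote */
            p = close + 1;
        } else {
            p++;
        }
    }
    return *p == '>' ? p + 1 : NULL;
}

/* Copy `in` to `out`, replacing each tag with one space so that the
 * words on either side stay separate. `out` needs strlen(in) + 1 bytes. */
void strip_tags(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        const char *end = (*in == '<') ? skip_tag(in) : NULL;
        if (end != NULL) {
            *out++ = ' ';
            in = end;
        } else {
            *out++ = *in++;     /* plain text, or a '<' that is no tag */
        }
    }
    *out = '\0';
}
```

A stray '<' with no closing '>' (as in "x < y") is copied through
unchanged, so the indexer would still see the surrounding words.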