Stability
---------
* ibex_open should never crash, and should never return NULL without
  errno being set. Should check for errors when reading.
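
A minimal sketch of the errno contract described above, assuming POSIX errno values. The names demo_index, demo_magic and demo_index_open are hypothetical stand-ins, not the real ibex API or file format; the point is only that every failure path sets errno after cleanup, so the caller never sees NULL with errno == 0.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the real index handle. */
typedef struct {
    FILE *fp;
} demo_index;

/* Assumed 4-byte file header, for illustration only. */
static const char demo_magic[4] = { 'i', 'b', 'e', 'x' };

/* Failure helper: close what is open, then set errno last so that
 * fclose() cannot overwrite it. Always returns NULL. */
static demo_index *fail(FILE *fp, int err)
{
    if (fp != NULL)
        fclose(fp);
    errno = err;
    return NULL;
}

/* Returns NULL on failure, always with errno set to a nonzero value. */
demo_index *demo_index_open(const char *path)
{
    char magic[sizeof demo_magic];
    demo_index *ib;
    FILE *fp;

    errno = 0;
    fp = fopen(path, "rb");
    if (fp == NULL)
        return fail(NULL, errno != 0 ? errno : EIO);

    errno = 0;
    if (fread(magic, 1, sizeof magic, fp) != sizeof magic)
        /* Real I/O errors keep their errno; a truncated file is EINVAL. */
        return fail(fp, (ferror(fp) && errno != 0) ? errno : EINVAL);

    if (memcmp(magic, demo_magic, sizeof magic) != 0)
        return fail(fp, EINVAL);        /* not an index file */

    ib = malloc(sizeof *ib);
    if (ib == NULL)
        return fail(fp, ENOMEM);

    ib->fp = fp;
    return ib;
}
```

With this shape a caller can test for NULL and pass errno straight to perror or strerror.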


Performance
-----------
* Profiling, keep thinking about data structures, etc.

* Check memory usage

* See if writing the "inverse image" of long ref streams helps
  compression without hurting performance now. (i.e., if a word appears in
  more than half of the files, write out the list of files it _doesn't_
  appear in). (I tried this before, and it wasn't working well, but the
  file format and data structures have changed a lot.)

* We could save a noticeable chunk of time if normalize_word computed
  the hash of the word and then we could pass that into
  g_hash_table_insert somehow.

* Make a copy of the buffer to be indexed (or provide an interface for
  the caller to say ibex can munge the provided data) and then use that
  rather than constantly copying things. ?
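
The "inverse image" item above can be sketched as follows. This is an illustration under assumptions, not the ibex file format: file references are taken to be sorted, duplicate-free integers in the range 0..nfiles-1, and refs_encode, refs_decode and the inverted flag are hypothetical names.

```c
#include <stddef.h>

/* Writes to out[] every id in 0..nfiles-1 that is NOT in ids[].
 * ids[] must be sorted ascending without duplicates.
 * Returns the number of ids written. */
static size_t complement(const unsigned *ids, size_t n,
                         unsigned nfiles, unsigned *out)
{
    size_t i = 0, k = 0;
    unsigned f;

    for (f = 0; f < nfiles; f++) {
        if (i < n && ids[i] == f)
            i++;                /* f is listed, so leave it out */
        else
            out[k++] = f;       /* f is not listed, so it is in the complement */
    }
    return k;
}

static size_t copy_ids(const unsigned *ids, size_t n, unsigned *out)
{
    size_t i;

    for (i = 0; i < n; i++)
        out[i] = ids[i];
    return n;
}

/* Stores the shorter of refs[] and its complement in out[], which
 * needs room for nfiles entries. *inverted records which one it was. */
size_t refs_encode(const unsigned *refs, size_t nrefs, unsigned nfiles,
                   unsigned *out, int *inverted)
{
    *inverted = nrefs > nfiles / 2;     /* word is in more than half the files */
    if (*inverted)
        return complement(refs, nrefs, nfiles, out);
    return copy_ids(refs, nrefs, out);
}

/* Reverses refs_encode: complementing twice restores the original list. */
size_t refs_decode(const unsigned *stored, size_t nstored, unsigned nfiles,
                   int inverted, unsigned *out)
{
    if (inverted)
        return complement(stored, nstored, nfiles, out);
    return copy_ids(stored, nstored, out);
}
```

The stored list is never longer than half the file count, at the cost of one flag per word and an O(nfiles) pass when decoding an inverted list. Whether that trade pays off is exactly what the item proposes to measure.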
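
For the normalize_word hashing item above: g_hash_table_insert takes only a key and a value, so there is no direct way to hand it a precomputed hash. The sketch below therefore uses a small chained table of its own as a stand-in, only to show the shape of the idea. Everything here is hypothetical: normalize_and_hash does plain ASCII lowercasing rather than whatever the real normalize_word does, and word_table, entry and table_add are made-up names.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64
#define MAXWORD  32

typedef struct entry {
    struct entry *next;
    unsigned hash;              /* cached: computed once while normalizing */
    unsigned count;             /* how many times the word was added */
    char word[MAXWORD];
} entry;

typedef struct {
    entry *buckets[NBUCKETS];
} word_table;

/* Lowercases src into dst (at most dstlen - 1 characters, dstlen >= 1)
 * and computes a djb2-style hash of the result in the same pass. */
unsigned normalize_and_hash(const char *src, char *dst, size_t dstlen)
{
    unsigned h = 5381;
    size_t i;

    for (i = 0; src[i] != '\0' && i + 1 < dstlen; i++) {
        dst[i] = (char)tolower((unsigned char)src[i]);
        h = h * 33u + (unsigned char)dst[i];
    }
    dst[i] = '\0';
    return h;
}

/* Adds word using the caller-supplied hash; the table never walks the
 * string to hash it again. Returns the entry, or NULL if out of memory. */
entry *table_add(word_table *t, const char *word, unsigned hash)
{
    entry **slot = &t->buckets[hash % NBUCKETS];
    entry *e;

    for (e = *slot; e != NULL; e = e->next) {
        if (e->hash == hash && strcmp(e->word, word) == 0) {
            e->count++;
            return e;
        }
    }

    e = calloc(1, sizeof *e);
    if (e == NULL)
        return NULL;
    e->hash = hash;
    e->count = 1;
    strncpy(e->word, word, MAXWORD - 1);    /* calloc left the final NUL */
    e->next = *slot;
    *slot = e;
    return e;
}
```

Staying with GHashTable, one possible route is to store keys as small structs that carry the cached hash and to register a hash function that simply returns that field. That is a design option to evaluate, not something the current code is known to do.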


Functionality
-------------
* ibex file locking

* specify file mode in ibex_open

* ibex_find* need to normalize the search words... should this be done
  by the caller or by ibex_find?

* Needs to be some way to do a secondary search after getting results
  back from ibex_find* (i.e., for "foo near bar"). This either has to be
  done by ibex, or requires us to export the normalize interface.

* Does there need to be an ibex_find_any, or is that easy enough for the
  caller to do?

* utf8_trans needs to cover at least two more code pages. This is
  tricky because it's not clear whether some of the letters there should
  be translated to ASCII or left as UTF-8. This requires some
  investigation.

* ibex_index_* need to ignore HTML tags.
      NAME = [A-Za-z][A-Za-z0-9.-]*
      </?{NAME}(\s*{NAME}(\s*=\s*({NAME}|"[^"]*"|'[^']*'))?)*>
      <!(--([^-]|-[^-])*--\s*)*>

  ugh. ok, simplifying, we get:
      <[^!]([^"'>]*("[^"]*"|'[^']*'))*[^"'>]*> or
      <!(--([^-]|-[^-])*--\s*)*>

  which is still not simple. sigh.

* ibex_index_* need to recognize and ignore "non-text". Particularly
  BinHex and uuencoding.
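
For the HTML-tag item above, a hand-written scanner may be easier to live with than the regexes. The sketch below is one possible approach, not existing ibex code: strip_tags and starts_markup are hypothetical names, each tag or comment is replaced by a single space so adjacent words stay separated, and comments are simplified to "everything from <!-- to the next -->" rather than the full SGML rule.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A '<' only opens markup when followed by a letter, '/', '!' or '?',
 * so text such as "a < b" is left alone. */
static int starts_markup(const char *p)
{
    return p[0] == '<' &&
           (isalpha((unsigned char)p[1]) ||
            p[1] == '/' || p[1] == '!' || p[1] == '?');
}

/* Copies src to dst, replacing every tag or comment with one space.
 * dst needs room for strlen(src) + 1 bytes. Returns the output length. */
size_t strip_tags(const char *src, char *dst)
{
    const char *p = src;
    size_t out = 0;

    while (*p != '\0') {
        if (!starts_markup(p)) {
            dst[out++] = *p++;
            continue;
        }

        if (strncmp(p, "<!--", 4) == 0) {
            /* Comment: skip to the closing "-->", or to the end of the
             * input if the comment is unterminated. */
            const char *end = strstr(p + 4, "-->");
            p = (end != NULL) ? end + 3 : p + strlen(p);
        } else {
            /* Tag: skip to '>', ignoring any '>' inside quotes. */
            char quote = 0;

            p++;
            while (*p != '\0' && (quote != 0 || *p != '>')) {
                if (quote != 0) {
                    if (*p == quote)
                        quote = 0;
                } else if (*p == '"' || *p == '\'') {
                    quote = *p;
                }
                p++;
            }
            if (*p == '>')
                p++;
        }
        dst[out++] = ' ';
    }
    dst[out] = '\0';
    return out;
}
```

The quote tracking does the same job as the "[^"]*"|'[^']*' alternatives in the regexes: a '>' inside an attribute value does not end the tag.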
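
For the "non-text" item above, line-level checks could catch the common cases. The sketch assumes input already split into lines without trailing newlines, a uuencode block framed by a "begin <mode> <name>" line and an "end" line, and the conventional BinHex 4.0 banner text. The function names are hypothetical, and BinHex is only detected here, not skipped.

```c
#include <stddef.h>
#include <string.h>

/* Matches "begin <mode> <name>": the word "begin", a space, three or
 * four octal digits, a space, and a non-empty file name. */
int is_uu_begin(const char *line)
{
    int digits = 0;

    if (strncmp(line, "begin ", 6) != 0)
        return 0;
    line += 6;
    while (*line >= '0' && *line <= '7') {
        line++;
        digits++;
    }
    if (digits < 3 || digits > 4)
        return 0;
    return line[0] == ' ' && line[1] != '\0';
}

/* Matches the banner that conventionally precedes BinHex 4.0 data. */
int is_binhex_banner(const char *line)
{
    static const char banner[] = "(This file must be converted with BinHex";

    return strncmp(line, banner, sizeof banner - 1) == 0;
}

/* Counts the lines that would be indexed: everything from a uuencode
 * "begin" line through the matching "end" line is skipped. */
size_t count_indexable_lines(const char *const *lines, size_t n)
{
    size_t i, kept = 0;
    int in_uu = 0;

    for (i = 0; i < n; i++) {
        if (in_uu) {
            if (strcmp(lines[i], "end") == 0)
                in_uu = 0;
        } else if (is_uu_begin(lines[i])) {
            in_uu = 1;
        } else {
            kept++;
        }
    }
    return kept;
}
```

Requiring the octal mode keeps ordinary sentences that start with "begin" from being treated as attachments.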