Abstract
We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form. Our first compressed data structure retrieves the occ occurrences of a pattern P[1, p] within a text T[1, n] in O(p + occ log1+ε n) time for any chosen ε, 0 < ε < 1 This data structure uses at most 5nHk(T) + o(n) bits of storage, where Hk(T) is the kth order empirical entropy of T. The space usage is &(n) bits in the worst case and o(n) bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows-Wheeler Transform, and can be regarded as a compressed suffix array. Our second compressed data structure achieves O(p + occ) query time using O(nHk(T)logε n) + o(n) bits of storage for any chosen ε, 0 < ε < 1. Therefore, it provides optimal output-sensitive query time using o(n log n) bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows-Wheeler Transform and the LZ78 algorithm.
| Original language | English |
|---|---|
| Pages (from-to) | 552-581 |
| Number of pages | 30 |
| Journal | Journal of the ACM |
| Volume | 52 |
| Issue number | 4 |
| DOIs | |
| Publication status | Published - 2005 |
| Externally published | Yes |
Keywords
- Burrows-Wheeler transform
- Full-text indexing
- Indexing data structure
- Lempel-Ziv compressor
- Pattern searching
- Suffix array
- Suffix tree
- Text compression
Fingerprint
Dive into the research topics of 'Indexing compressed text'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver