Skip to main navigation Skip to search Skip to main content

Indexing compressed text

  • Paolo Ferragina
  • , Giovanni Manzini

Research output: Contribution to journalArticlepeer-review

Abstract

We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form. Our first compressed data structure retrieves the occ occurrences of a pattern P[1, p] within a text T[1, n] in O(p + occ log1+ε n) time for any chosen ε, 0 < ε < 1 This data structure uses at most 5nHk(T) + o(n) bits of storage, where Hk(T) is the kth order empirical entropy of T. The space usage is &(n) bits in the worst case and o(n) bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows-Wheeler Transform, and can be regarded as a compressed suffix array. Our second compressed data structure achieves O(p + occ) query time using O(nHk(T)logε n) + o(n) bits of storage for any chosen ε, 0 < ε < 1. Therefore, it provides optimal output-sensitive query time using o(n log n) bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows-Wheeler Transform and the LZ78 algorithm.

Original languageEnglish
Pages (from-to)552-581
Number of pages30
JournalJournal of the ACM
Volume52
Issue number4
DOIs
Publication statusPublished - 2005
Externally publishedYes

Keywords

  • Burrows-Wheeler transform
  • Full-text indexing
  • Indexing data structure
  • Lempel-Ziv compressor
  • Pattern searching
  • Suffix array
  • Suffix tree
  • Text compression

Fingerprint

Dive into the research topics of 'Indexing compressed text'. Together they form a unique fingerprint.

Cite this