Indexing compressed text

Paolo Ferragina, Giovanni Manzini

Risultato della ricerca: Contributo su rivistaArticolo in rivistapeer review

Abstract

We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form. Our first compressed data structure retrieves the occ occurrences of a pattern P[1, p] within a text T[1, n] in O(p + occ log1+ε n) time for any chosen ε, 0 < ε < 1 This data structure uses at most 5nHk(T) + o(n) bits of storage, where Hk(T) is the kth order empirical entropy of T. The space usage is &(n) bits in the worst case and o(n) bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows-Wheeler Transform, and can be regarded as a compressed suffix array. Our second compressed data structure achieves O(p + occ) query time using O(nHk(T)logε n) + o(n) bits of storage for any chosen ε, 0 < ε < 1. Therefore, it provides optimal output-sensitive query time using o(n log n) bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows-Wheeler Transform and the LZ78 algorithm.

Lingua originaleInglese
pagine (da-a)552-581
Numero di pagine30
RivistaJournal of the ACM
Volume52
Numero di pubblicazione4
DOI
Stato di pubblicazionePubblicato - 2005
Pubblicato esternamente

Fingerprint

Entra nei temi di ricerca di 'Indexing compressed text'. Insieme formano una fingerprint unica.

Cita questo