Compressed representations of sequences and full-text indexes

  • Paolo Ferragina
  • , Giovanni Manzini
  • , Veli Mäkinen
  • , Gonzalo Navarro

Risultato della ricerca: Contributo su rivistaArticolo in rivistapeer review

Abstract

Given a sequence S = s1s2 . . . sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq , as well as answer rank and select queries on S, in constant time. H0(S) is the zero-order empirical entropy of S and nH0(S) provides an information-theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r , we can still represent S in nH0(S) + o(n log r) bits and answer queries in O(log r/ log log n) time. Another contribution of this article is to show how to combine our compressed representation of integer sequences with a compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet. Specifically, we design a variant of the FM-index that indexes a string T [1, n] within nHk (T)+o(n) bits of storage, where Hk (T) is the kth-order empirical entropy of T . This space bound holds simultaneously for all k = a log n, constant 0 < a < 1, and = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P[1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log1+e n) time for any constant 0 < e < 1; and reports a text substring of length in O(+ log1+e n) time. Compared to all previous works, our index is the first that removes the alphabet-size dependance from all query times, in particular, counting time is linear in the pattern length. Still, our index uses essentially the same space of the kth-order entropy of the text T, which is the best space obtained in previous work. We can also handle larger alphabets of size = O(n), for any 0 < β < 1, by paying o(n log) extra space and multiplying all query times by O(log/log log n).

Lingua originaleInglese
Numero di articolo1240243
RivistaACM Transactions on Algorithms
Volume3
Numero di pubblicazione2
DOI
Stato di pubblicazionePubblicato - 1 mag 2007
Pubblicato esternamente

Fingerprint

Entra nei temi di ricerca di 'Compressed representations of sequences and full-text indexes'. Insieme formano una fingerprint unica.

Cita questo