TY - JOUR
T1 - Compressed representations of sequences and full-text indexes
AU - Ferragina, Paolo
AU - Manzini, Giovanni
AU - Mäkinen, Veli
AU - Navarro, Gonzalo
PY - 2007/5/1
Y1 - 2007/5/1
N2 - Given a sequence S = s1s2 . . . sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq , as well as answer rank and select queries on S, in constant time. H0(S) is the zero-order empirical entropy of S and nH0(S) provides an information-theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r , we can still represent S in nH0(S) + o(n log r) bits and answer queries in O(log r/ log log n) time. Another contribution of this article is to show how to combine our compressed representation of integer sequences with a compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet. Specifically, we design a variant of the FM-index that indexes a string T [1, n] within nHk (T)+o(n) bits of storage, where Hk (T) is the kth-order empirical entropy of T . This space bound holds simultaneously for all k = a log n, constant 0 < a < 1, and = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P[1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log1+e n) time for any constant 0 < e < 1; and reports a text substring of length in O(+ log1+e n) time. Compared to all previous works, our index is the first that removes the alphabet-size dependance from all query times, in particular, counting time is linear in the pattern length. Still, our index uses essentially the same space of the kth-order entropy of the text T, which is the best space obtained in previous work. We can also handle larger alphabets of size = O(n), for any 0 < β < 1, by paying o(n log) extra space and multiplying all query times by O(log/log log n).
AB - Given a sequence S = s1s2 . . . sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq , as well as answer rank and select queries on S, in constant time. H0(S) is the zero-order empirical entropy of S and nH0(S) provides an information-theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r , we can still represent S in nH0(S) + o(n log r) bits and answer queries in O(log r/ log log n) time. Another contribution of this article is to show how to combine our compressed representation of integer sequences with a compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet. Specifically, we design a variant of the FM-index that indexes a string T [1, n] within nHk (T)+o(n) bits of storage, where Hk (T) is the kth-order empirical entropy of T . This space bound holds simultaneously for all k = a log n, constant 0 < a < 1, and = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P[1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log1+e n) time for any constant 0 < e < 1; and reports a text substring of length in O(+ log1+e n) time. Compared to all previous works, our index is the first that removes the alphabet-size dependance from all query times, in particular, counting time is linear in the pattern length. Still, our index uses essentially the same space of the kth-order entropy of the text T, which is the best space obtained in previous work. We can also handle larger alphabets of size = O(n), for any 0 < β < 1, by paying o(n log) extra space and multiplying all query times by O(log/log log n).
KW - Burrows-Wheeler transform
KW - Compression boosting
KW - Entropy
KW - Rank and select
KW - Text compression
KW - Text indexing
KW - Wavelet tree
UR - https://www.scopus.com/pages/publications/34250171723
U2 - 10.1145/1240233.1240243
DO - 10.1145/1240233.1240243
M3 - Article
SN - 1549-6325
VL - 3
JO - ACM Transactions on Algorithms
JF - ACM Transactions on Algorithms
IS - 2
M1 - 1240243
ER -