Boosting textual compression in optimal linear time

Paolo Ferragina, Raffaele Giancarlo, Giovanni Manzini, Marinella Sciortino

Risultato della ricerca: Contributo su rivistaArticolo in rivistapeer review

Abstract

We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression performance guarantee. It displays the following remarkable properties: (a) it can turn any memoryless compressor into a compression algorithm that uses the "best possible" contexts; (b) it is very simple and optimal in terms of time; and (c) it admits a decompression algorithm again optimal in time. To the best of our knowledge, this is the first boosting technique displaying these properties. Technically, our boosting technique builds upon three main ingredients: the Burrows-Wheeler Transform, the Suffix Tree data structure, and a greedy algorithm to process them. Specifically, we show that there exists a proper partition of the Burrows-Wheeler Transform of a string 5 that shows a deep combinatorial relation with the kth order entropy of s. That partition can be identified via a greedy processing of the suffix tree of s with the aim of minimizing a proper objective function over its nodes. The final compressed string is then obtained by compressing individually each substring of the partition by means of the base compressor we wish to boost. Our boosting technique is inherently combinatorial because it does not need to assume any prior probabilistic model about the source emitting s, and it does not deploy any training, parameter estimation and learning. Various corollaries are derived from this main achievement. Among the others, we show analytically that using our booster, we get better compression algorithms than some of the best existing ones, that is, LZ77, LZ78, PPMC and the ones derived from the Burrows-Wheeler Transform. Further, we settle analytically some long-standing open problems about the algorithmic structure and the performance of BWT-based compressors. Namely, we provide the first family of BWT algorithms that do not use Move-To-Front or Symbol Ranking as a part of the compression process.

Lingua originaleInglese
pagine (da-a)688-713
Numero di pagine26
RivistaJournal of the ACM
Volume52
Numero di pubblicazione4
DOI
Stato di pubblicazionePubblicato - 2005
Pubblicato esternamente

Fingerprint

Entra nei temi di ricerca di 'Boosting textual compression in optimal linear time'. Insieme formano una fingerprint unica.

Cita questo