On compressing the textual web

Paolo Ferragina, Giovanni Manzini

Research output: Chapter in book/report/conference proceedings, Conference contribution, peer-reviewed

Abstract

Nowadays we know how to effectively compress most basic components of any modern search engine, such as the graphs arising from the Web structure and/or its usage, the posting lists, and the dictionary of terms. But we are not aware of any study that has addressed in depth the issue of compressing the raw Web pages. Many Web applications use simple compression algorithms (e.g. gzip, or word-based Move-to-Front or Huffman coders) and conclude that, even compressed, raw data take more space than Inverted Lists. In this paper we investigate two typical scenarios of use of data compression for large Web collections. In the first scenario, the compressed pages are stored on disk and we only need to support the fast scanning of large parts of the compressed collection (such as for map-reduce paradigms). In the second scenario, we consider fast access to individual pages of a compressed collection that is distributed among the RAMs of many PCs (such as for search engines and miners). For the first scenario, we provide a thorough experimental comparison among state-of-the-art compressors, thus indicating pros and cons of the available solutions. For the second scenario, we compare compressed-storage solutions with the new technology of compressed self-indexes [45]. Our results show that Web pages are more compressible than expected and, consequently, that some common beliefs in this area should be reconsidered. Our results are novel for the large spectrum of tested approaches and the size of datasets, and provide a threefold contribution: a nontrivial baseline for designing new compressed-storage solutions, a guide for software developers faced with Web-page storage, and a natural complement to the recent figures on Inverted List-compression achieved by [57, 58].

Original language: English
Title of host publication: WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining
Pages: 391-400
Number of pages: 10
Publication status: Published - 2010
Externally published
Event: 3rd ACM International Conference on Web Search and Data Mining, WSDM 2010 - New York City, NY, United States
Duration: 3 Feb 2010 to 6 Feb 2010

Publication series

Name: WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining

Conference: 3rd ACM International Conference on Web Search and Data Mining, WSDM 2010
Country/Territory: United States
City: New York City, NY
Period: 3/02/10 to 6/02/10
