Which compression format you should use depends on your application. Do you want to maximize its speed, or are you more concerned with keeping storage costs down? In general, you should try several strategies and benchmark them with representative datasets to find the best approach.
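As a rough illustration of that kind of benchmark, the sketch below times two codecs from the Python standard library on a sample and reports the compressed-to-original size ratio. It is only a starting point under stated assumptions: the `benchmark` function and the `sample` data are made up for this example, and the standard-library gzip and bzip2 implementations stand in for the codecs your cluster actually uses, so the absolute numbers will differ.

```python
import bz2
import gzip
import time


def benchmark(data: bytes) -> dict:
    """Return {codec: (compressed_size / original_size, seconds)} for each codec."""
    codecs = {
        "gzip": gzip.compress,
        "bzip2": bz2.compress,
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = (len(compressed) / len(data), elapsed)
    return results


# Hypothetical stand-in for a representative sample; replace it with
# bytes read from your real data before drawing any conclusions.
sample = b"2024-01-01 12:00:00 INFO request served in 12 ms\n" * 20000

for name, (ratio, seconds) in benchmark(sample).items():
    print(f"{name:6s} ratio={ratio:.4f} time={seconds:.4f}s")
```

Highly repetitive input like this sample compresses far better than typical production data, which is why the measurement is only meaningful on a representative dataset.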

For large, unbounded files, like log files, the options are:

  • Store the files uncompressed.
  • Use a compression format that supports splitting, like bzip2 (although bzip2 is fairly slow), or one that can be indexed to support splitting, like LZO.
  • Split the file into chunks in the application and compress each chunk separately using any supported compression format. In this case, you should choose the chunk size so that the compressed chunks are approximately the size of an HDFS block.
  • Use a SequenceFile, which supports compression and splitting.
  • Use an Avro data file, which supports compression and splitting, just like a SequenceFile, but has the added advantage of being readable and writable from many languages, not just Java.
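The chunking option in the list above can be sketched as follows. This is a minimal illustration, not a Hadoop API: `compress_in_chunks` is a hypothetical helper, gzip is just one possible codec, and `estimated_ratio` (compressed size divided by raw size) is assumed to have been measured on sample data beforehand. Chunks are cut on line boundaries so that no record spans two chunks.

```python
import gzip


def compress_in_chunks(lines, target_compressed_size, estimated_ratio):
    """Yield independently gzip-compressed chunks made of whole lines.

    Each chunk holds about target_compressed_size / estimated_ratio raw
    bytes, so its compressed size lands near the target.
    """
    raw_limit = int(target_compressed_size / estimated_ratio)
    buffer, buffered = [], 0
    for line in lines:
        buffer.append(line)
        buffered += len(line)
        if buffered >= raw_limit:
            yield gzip.compress(b"".join(buffer))
            buffer, buffered = [], 0
    if buffer:
        # Flush the final, possibly smaller, chunk.
        yield gzip.compress(b"".join(buffer))


# Small demo values; in practice the target would be your HDFS block size.
records = [f"event {i}\n".encode() for i in range(10000)]
chunks = list(
    compress_in_chunks(records, target_compressed_size=16 * 1024, estimated_ratio=0.25)
)

# Every chunk decompresses on its own, so each can be processed separately.
restored = b"".join(gzip.decompress(chunk) for chunk in chunks)
print(len(chunks), restored == b"".join(records))
```

Because the ratio is only an estimate, the compressed chunks will be approximately, not exactly, the target size, which matches the "approximately the size of an HDFS block" guidance.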

For large files, you should not compress the whole file with a format that does not support splitting: a single map task then has to process the entire file, so you lose locality and make MapReduce applications very inefficient. For archival purposes, consider the Hadoop archive format, although it does not support compression.

