What features in a data set would lead to Huffman coding outperforming Deflate compression?
I am looking at lossless data compression for numerical raster data sets such as terrestrial elevation data. I am seeing a non-trivial number of cases where Huffman coding produces smaller outputs than Deflate compression. This result surprises me because, for most of the data I've looked at in the past, Deflate has beaten Huffman by a comfortable margin.
I am using the standard Java API for Deflate (level 6) and a home-grown implementation of Huffman coding. The techniques I use are described at https://gwlucastrig.github.io/GridfourDocs/notes/GridfourDataCompressionAlgorithms.html. The implementation compresses blocks of 11000 bytes and, for each block, selects whichever of Huffman or Deflate produces the smaller output. Huffman was selected for about 7700 blocks (36%) and Deflate for about 14000 blocks (64%). The only other clue I have is that the average first-order entropy of the Huffman-selected blocks was 5.83 bits/symbol, while the average for the Deflate-selected blocks was 6.16 bits/symbol.
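To make the selection step concrete, here is a minimal sketch of the per-block comparison. The Deflate side uses `java.util.zip.Deflater` at level 6, as in my code; the Huffman side is stubbed out with a simple length estimate built from the block's byte frequencies (it ignores the cost of storing the code table, which my actual encoder does have to pay). Class and method names here are just for illustration, not my production code.

```java
import java.util.PriorityQueue;
import java.util.zip.Deflater;

public class MethodSelectionSketch {

    // Compressed size in bytes using Deflate at level 6 (java.util.zip).
    static int deflateLength(byte[] block) {
        Deflater deflater = new Deflater(6);
        deflater.setInput(block);
        deflater.finish();
        byte[] scratch = new byte[block.length + 64];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(scratch); // only the count matters here
        }
        deflater.end();
        return total;
    }

    // Estimated Huffman payload size in bytes: build the code tree from the
    // block's byte frequencies and sum the resulting code lengths.  This
    // estimate omits the code-table overhead that a real encoder must store.
    static int huffmanLengthEstimate(byte[] block) {
        int[] counts = new int[256];
        for (byte b : block) {
            counts[b & 0xFF]++;
        }
        PriorityQueue<Integer> queue = new PriorityQueue<>();
        for (int c : counts) {
            if (c > 0) {
                queue.add(c);
            }
        }
        if (queue.size() < 2) {
            // Degenerate block with zero or one distinct symbol: about 1 bit/symbol.
            return (block.length + 7) / 8;
        }
        long totalBits = 0;
        while (queue.size() > 1) {
            int merged = queue.poll() + queue.poll();
            // Each merge lengthens the code of every symbol under the merged
            // node by one bit, adding 'merged' bits to the encoded length.
            totalBits += merged;
            queue.add(merged);
        }
        return (int) ((totalBits + 7) / 8);
    }

    // Choose whichever method yields the smaller output for this block.
    static String selectMethod(byte[] block) {
        return huffmanLengthEstimate(block) < deflateLength(block) ? "HUFFMAN" : "DEFLATE";
    }
}
```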
A bit more detail on the compression test is available at https://gwlucastrig.github.io/GridfourDocs/notes/EntropyMetricForDataCompressionCaseStudies.html#method_selection.
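For clarity, the first-order entropy figures quoted above are the usual order-0 Shannon entropy computed over the byte values in each block. A rough sketch of that calculation (names are illustrative only):

```java
public class EntropySketch {

    // Order-0 (first-order) Shannon entropy of a block, in bits per symbol:
    // H = -sum over i of p_i * log2(p_i), where p_i is the relative frequency
    // of byte value i within the block.
    public static double firstOrderEntropy(byte[] block) {
        int[] counts = new int[256];
        for (byte b : block) {
            counts[b & 0xFF]++;
        }
        double entropy = 0;
        double n = block.length;
        for (int c : counts) {
            if (c > 0) {
                double p = c / n;
                entropy -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return entropy;
    }
}
```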