Hacker News

Building a data compression utility in Haskell using Huffman codes

by lazamaron 7/4/2024, 4:29:12 AM with 16 comments

by mrkeenon 7/4/2024, 7:49:31 AM
There exists an array-based, in-place algorithm for this, reducing the need to allocate trees and chase pointers.
I mention this only because, when I learned the tree-based approach at uni, I simply wasn't aware that there was another way to do it, and I'm wondering how many of you that's true for as well.
While the tree approach is intuitive and illuminating, it probably makes more sense to work with in-place arrays, since the situations when you care most about compression are probably the situations when you have a lot of data and want to run fast.
```
  In-Place Calculation of Minimum-Redundancy Codes
  Moffat, Katajainen.  1995.
  http://hjemmesider.diku.dk/~jyrki/Paper/WADS95.pdf
```
by trompon 7/4/2024, 8:44:08 AM
> To make it unambiguous we must make sure that no code word is a prefix of another code word.
Technically, this is not quite correct. The class of so-called uniquely decodable codes is unambigous, and a superset of the prefix codes. One simple example of a uniquely decodable code is the reverse of a prefix code. For the example in the article that would be
```
    a 1
    b 00
    c 10
```
While the code for a is a prefix of the code of c, one can still unambiguously decode any code sequence by processing it in reverse order. It would be interesting to see a uniquely decodable code that is neither a prefix code nor one in reverse.
by atlintotson 7/4/2024, 8:26:49 AM
This is great! Are there any other similar tutorials going through writing a Haskell program, but with some more advanced features (monad transformers, lenses, etc)
by goldfishgoldon 7/4/2024, 1:29:59 PM
Coursera’s functional programming course (in Scala) includes a pretty similar Huffman coding assignment with autograder if anybody wants to take a stab at it themselves.
https://www.coursera.org/learn/scala-functional-programming?...
by banish-m4on 7/4/2024, 8:13:23 AM
Last time I used Huffman codes, it was to run a MICMAC processor macroprogram (assembly text) in the fewest number of microcycles and to use the fewest microinstructions in the microprogram (microcode). So starting with a histogram of the macroinstructions executed (IIRC, I first wrote an interpreter in C to count how many of each were executed), I crafted a progressive decoding microcode program to implement all of the required ISA macro-operations. IIRC, the macro instruction ISA I created was bit-granular instead of byte-oriented. In the real world, it would've been slow and inconvenient. What's nice about Huffman codes is that you can vary the prefix depth based on the distribution of values, so you don't have to have lopsided codes based on 1 bit prefixes.
Also, the microprogram had to deal with branch prediction because it was a non-superscalar pipelined processor model. Guess the wrong branch, and enjoy wasting cycles on a pipeline stall while the correct branch filters forward.
by tankfeederon 7/4/2024, 6:03:18 AM
https://rosettacode.org/wiki/Huffman_coding
by ykonstanton 7/4/2024, 9:22:33 AM
Hey, since this is likely to attract Haskell programmers: how fast is Haskell these days for a programmer intent on writing optimized code? I am particularly interested in its performance for numerical crunching like matrix operations and other stuff that benefit from SIMD.
by alwinaugustinon 7/4/2024, 6:59:07 AM
Thanks for sharing. Very nice and insightful.
by chvrchbvrneron 7/4/2024, 10:15:07 AM
I think there is a typo in the table of the "Creating prefix-free codes" section. D should be '0010' (not '0110').
Otherwise a great read, thanks!
by lo_zamoyskion 7/5/2024, 3:22:08 PM
Btw, what is that on the girl's shirt in the image?
Direct link: https://lazamar.github.io/images/data-compressor.svg
by londons_exploreon 7/4/2024, 7:18:15 AM
For all readers, arithmetic codes are better in nearly all ways. They can be implemented in less RAM and code, they compress and decompress to a better ratio, and the probabilities of different symbols appearing can be dynamically updated during the stream far more easily.
The only reason Huffman codes are used is they were invented first and arithmetic codes were patented. That patent has now expired, so we should use the better design.
by lynx23on 7/4/2024, 6:47:06 AM
Very nice read, thanks for sharing!
by bdahzon 7/4/2024, 10:01:53 AM
How is the performance when compared to similar implementations in C/C++ or Rust?
by benreesmanon 7/4/2024, 8:58:38 AM
Haskell is a really nice language. In general I don’t identify as an X programmer for any value of X: I tend to write in a half dozen languages daily and they all suck in their own special way.
But on two separate occasions I made important career decisions with opportunity cost to work with highly lethal GHC contributors: those people are just really good.
If Haskell sucks like all languages it’s because Haskell excels at using computers to compute something: Haskell considers data shuffling a strictly secondary concern compared to doing actual computations.
by polterguy1000on 7/4/2024, 6:33:33 PM
[flagged]
by 2-3-7-43-1807on 7/4/2024, 10:21:43 AM
and yet another episode in the series of "look, what I did with Haskell"