compression: Implement @ciscorn's dictionary approach #3398
Massive savings. Thanks so much @ciscorn for providing the initial code for choosing the dictionary.

This adds a bit of time to the build, both to find the dictionary and because (for reasons I don't fully understand) the binary search in the compress() function no longer worked and had to be replaced with a linear search. I think this is because the intended invariant is that codebook entries encoding to the same number of bits are ordered by ascending value; however, I misplaced the transition from "words" to "byte/char values", so the codebook entries for words are in word order rather than their code order. Because this price is only paid at build time, I didn't try to pin down exactly where the correct fix belongs.

I also commented out a line that prints the "estimated total memory size": at least on the unix build with TRANSLATION=ja, it led to a build-time KeyError while computing the codebook size for all the strings. I think this occurs because some single Unicode code point ('ァ') no longer appears as itself in the compressed strings, since it is always replaced by a word.

As promised, this seems to save hundreds of bytes in the German translation on the Trinket M0.

Testing performed:
- built trinket_m0 in several languages
- built and ran the unix port in several languages (en, de_DE, ja) and ran simple error-producing code like ./micropython -c '1/0'
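The binary-vs-linear search issue above can be sketched as follows. This is an illustrative toy, not the actual makeqstrdata code: a codebook can be searched by value only when the sorted-within-each-code-length invariant holds, and once word entries land in word order rather than code order, only a linear scan stays correct.

```python
# Hypothetical codebook of (symbol, bit-string) pairs. The word entry
# sits in word order, not sorted order, so bisecting by symbol would
# miss it; a linear scan does not care about ordering.
codebook = [
    ("e", "00"),            # frequent single characters...
    ("t", "01"),
    (" must be ", "100"),   # ...then a word entry, out of sorted order
    ("a", "101"),
]

def lookup_linear(sym):
    """Order-insensitive codebook lookup, paid only at build time."""
    for s, bits in codebook:
        if s == sym:
            return bits
    raise KeyError(sym)
```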
Most users and the CI system run in configurations where Python configures stdout and stderr in UTF-8 mode. Windows is different, defaulting to values like CP1252. This led to a build failure on Windows, because makeqstrdata printed Unicode strings to its stdout, expecting them to be encoded as UTF-8.

This script writes a compiler input file to stdout and may print messages to stderr for a log or console. Explicitly configure stdout to use utf-8 for consistent behavior on all platforms, and configure stderr so that any log/diagnostic messages that cannot be displayed correctly are still shown, instead of raising an error while trying to print the diagnostic information.

I considered setting both encodings to ascii, but that would just be occasionally inconvenient to developers like me who want to show diagnostic info on stderr and in comments while working with the compression code.

Closes: adafruit#3408
Small improvements to the dictionary compression
Could you explain briefly how this works? It's not clear to me from the code what units it now compresses.
Also, how does it compare to MicroPython's approach? micropython#5861
I improved the initial comment on this PR and also added an explanation in the comments of translate.h:
MicroPython's approach is very English-centric. It literally looks for the most frequent "words", where "words" are surrounded by whitespace on either side. It has hard-coded assumptions that all code points are below 128, so it uses Huffman codes below 128 for the characters themselves and above 128 for the words, and performs Huffman compression on those values. The table itself is not limited to 256 bytes and does not have an explicit table of lengths or ends, but only because it can reuse the high bit of each byte in the table to indicate "is last letter in word".

By contrast, @ciscorn's scheme (like our existing scheme) works for arbitrary Unicode text and is not limited by a definition of a "word". This is especially important for the Japanese translation, since whitespace doesn't set word boundaries the way it does in English and other languages you might be familiar with.
If we went back to the "word length" method of computing the offsets into the dictionary table, we could move beyond 256 bytes of table. Otherwise, I think this accomplishes everything MicroPython does and more.
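For concreteness, the layout described above can be sketched as a flat dictionary string plus per-entry lengths, with symbols at 128 and above selecting entries. The names and exact layout here are illustrative, not the actual translate.h structures:

```python
# Illustrative layout: up to 256 code points of dictionary text, plus a
# per-entry length table; decoded symbol i (i >= 128) selects entry i - 128.
DICTIONARY = " must be  argument specifi"
LENGTHS = [9, 9, 8]

def expand(symbol):
    """Map a decoded Huffman symbol to its text."""
    if symbol < 128:
        return chr(symbol)          # ordinary character
    idx = symbol - 128
    start = sum(LENGTHS[:idx])      # the "word length" method of offsets
    return DICTIONARY[start:start + LENGTHS[idx]]
```

Summing lengths trades a little decode time for not storing explicit offsets, which is what would let the flat table grow past 256 bytes.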
Ok, thank you for the comments! I think it's a good balance between a fully greedy solution and one that won't take forever. :-)
Massive savings. Thanks so much @ciscorn for providing the initial code for choosing the dictionary at #3370 (comment)
Now, a "dictionary" is chosen based on the frequency of strings of various lengths within the input corpus, according to a somewhat magic weight function. Some Huffman codes, starting at 128, are set aside to indicate indices into this dictionary. The dictionary itself is up to 256 code points. The big advance (compared to bigrams) is that the dictionary items are of arbitrary length. For instance, on Trinket M0 in English, here are some of the dictionary items:
[' must be ', ' argument', ' specifi', 'attribut', 'support', ...]
. " must be " occurs 21 times and gets a Huffman coding of 11001101 which is very economical (fewer bits than the single code point "%", for instance).This adds a bit of time to the build, in order to compute the dictionary.
I also deleted the line that prints the "estimated total memory size": at least on the unix build with TRANSLATION=ja, it led to a build-time KeyError while computing the codebook size for all the strings. I think this occurs because some single Unicode code point ('ァ') no longer appears as itself in the compressed strings, since it is always replaced by a word. Because this line did not account for ngrams before or for words now, it was inaccurate anyway.
As promised, this seems to save hundreds of bytes in the German translation on the Trinket M0.
Testing performed:
- built trinket_m0 in several languages
- built and ran the unix port in several languages (en, de_DE, ja) and ran simple error-producing code like ./micropython -c '1/0'