compression: Implement @ciscorn's dictionary approach #3398

Merged
merged 6 commits into adafruit:main from jepler:better-dictionary-compression on Sep 16, 2020

Conversation


@jepler jepler commented Sep 12, 2020

Massive savings. Thanks so much, @ciscorn, for providing the initial code for choosing the dictionary at #3370 (comment).

Now, a "dictionary" is chosen based on the frequency of strings of various lengths within the input corpus, according to a somewhat magic weight function. Some huffman codes, starting at 128, are set aside to indicate indices into this dictionary. The dictionary itself is up to 256 code points. The big advance (compared to bigrams) is that the dictionary items are arbitrary length. For instance, on Trinket M0 in English, here are some of the dictionary items: [' must be ', ' argument', ' specifi', 'attribut', 'support', ...]. " must be " occurs 21 times and gets a Huffman coding of 11001101 which is very economical (fewer bits than the single code point "%", for instance).

This adds a bit of time to the build, in order to compute the dictionary.

I also deleted the line that printed the "estimated total memory size" -- at least on the unix build with TRANSLATION=ja, it caused a build-time KeyError while computing the codebook size for all the strings. I think this happens because some single Unicode code point ('ァ') no longer appears as itself in the compressed strings, since it is always replaced by a word. Because this line did not account for ngrams before, or for words now, it was inaccurate anyway.

As promised, this seems to save hundreds of bytes in the German translation on the Trinket M0.

Testing performed:

  • built trinket_m0 (en, de_DE)
  • built and ran the unix port in several languages (en, de_DE, ja) and ran simple error-producing code like ./micropython -c '1/0', verifying there were no display errors

Massive savings.  Thanks so much @ciscorn for providing the initial
code for choosing the dictionary.

This adds a bit of time to the build, both to find the dictionary
and because (for reasons I don't fully understand) the binary
search in the compress() function no longer worked and had to be
replaced with a linear search.

I think this is because the intended invariant is that codebook
entries that encode to the same number of bits are ordered by
ascending value.  However, I misplaced the transition from "words"
to "byte/char values", so the codebook entries for words are in word
order rather than code order.

Because this price is only paid at build time, I didn't care to determine
exactly where the correct fix was.

I also commented out a line to produce the "estimated total memory size"
-- at least on the unix build with TRANSLATION=ja, this led to a build
time KeyError trying to compute the codebook size for all the strings.
I think this occurs because some single unicode code point ('ァ') is
no longer present as itself in the compressed strings, due to always
being replaced by a word.

As promised, this seems to save hundreds of bytes in the German translation
on the trinket m0.

Testing performed:
 - built trinket_m0 in several languages
 - built and ran unix port in several languages (en, de_DE, ja) and ran
   simple error-producing code like ./micropython -c '1/0'
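
For illustration (hypothetical names, not the real compress()): a canonical Huffman codebook can only be binary-searched if entries with the same code length are kept in ascending value order; once the word entries land in dictionary order instead, a linear scan is the safe build-time fallback.

```python
def huffman_code_for(symbol, codebook):
    """Return (code, bit_length) for `symbol` from a list of
    (symbol, code, bit_length) tuples by linear scan; slower than
    bisecting, but correct even when the word entries are not in
    code order."""
    for sym, code, nbits in codebook:
        if sym == symbol:
            return code, nbits
    raise KeyError(f"{symbol!r} not in codebook")
```
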
@jepler jepler changed the title from "compression: Implement ciscorn's dictionary approach" to "compression: Implement @ciscorn's dictionary approach" on Sep 12, 2020
Most users and the CI system are running in configurations where Python
configures stdout and stderr in UTF-8 mode.  However, Windows is different,
setting values like CP1252.  This led to a build failure on Windows, because
makeqstrdata printed Unicode strings to its stdout, expecting them to be
encoded as UTF-8.

This script is writing (stdout) to a compiler input file and potentially
printing messages (stderr) to a log or console.  Explicitly configure stdout to
use utf-8 to get consistent behavior on all platforms, and configure stderr so
that if any log/diagnostic messages are printed that cannot be displayed
correctly, they are still displayed instead of creating an error while trying
to print the diagnostic information.

I considered setting both encodings to ascii, but that would just be
occasionally inconvenient to developers like me who want to show diagnostic
info on stderr and in comments while working with the compression code.

Closes: adafruit#3408
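
A minimal sketch of the idea, assuming a Python 3.7+ build environment (the actual change lives in the build script this commit touches):

```python
import sys

# Generated compiler input must be byte-identical on every platform,
# so force UTF-8 on stdout regardless of the console code page.
sys.stdout.reconfigure(encoding="utf-8")

# Diagnostics on stderr should never abort the build just because the
# console (e.g. CP1252 on Windows) can't render a character.
sys.stderr.reconfigure(errors="backslashreplace")
```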

ciscorn commented Sep 13, 2020

@jepler Thank you for implementing my idea!

I've created a pull request against your repo (jepler#3) with further optimizations.

Small improvements to the dictionary compression
@jepler jepler requested a review from tannewt September 14, 2020 17:20

tannewt commented Sep 14, 2020

Could you explain briefly how this works? It's not clear to me from the code what units it now compresses.


tannewt commented Sep 14, 2020

Also, how does it compare to MicroPython's approach? micropython#5861


jepler commented Sep 15, 2020

I improved the initial comment on this PR and also added an explanation in the comments of translate.h:

+// - code points starting at 128 (word_start) and potentially extending
+//   to 255 (word_end) (but never interfering with the target
+//   language's used code points) stand for dictionary entries in a
+//   dictionary with size up to 256 code points.  The dictionary entries
+//   are computed with a heuristic based on frequent substrings of 2 to
+//   9 code points.  These are called "words" but are not, grammatically
+//   speaking, words.  They're just spans of code points that frequently
+//   occur together.
+//
+// - dictionary entries are non-overlapping, and the _ending_ index of each
+//   entry is stored in an array.  Since the index given is the ending
+//   index, the array is called "wends".
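
A small sketch of how the "wends" layout can be read back (hypothetical names and data; the real decoder is written in C, this is just the idea in Python): entry i spans from the previous entry's end to wends[i] within one flat run of dictionary text.

```python
# Hypothetical flattened dictionary: all entries stored back to back,
# with wends[i] giving the ending index of entry i.
words = " must be  argument specifiattribut"
wends = [9, 18, 26, 34]

def word_entry(i):
    # Entry i starts where entry i-1 ended and stops at wends[i].
    start = wends[i - 1] if i > 0 else 0
    return words[start:wends[i]]

assert word_entry(0) == " must be "
assert word_entry(1) == " argument"
```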

MicroPython's approach is very English-centric. It literally looks for the most frequent "words", where a "word" is surrounded by whitespace on either side. It has hard-coded assumptions that all code points are below 128, so it uses Huffman codes below 128 for the characters themselves and codes at 128 and above for the words, and does Huffman compression on those values. The table itself is not limited to 256 bytes and does not have an explicit table of lengths or ends, but only because it can reuse the high bit of each byte in the table to indicate "is last letter in word".

By contrast, @ciscorn's scheme (like our existing scheme) works for arbitrary Unicode code points, and is not limited by any definition of a "word". This is especially important for the Japanese translation, where whitespace doesn't play the same role in marking word boundaries as it does in English and other languages you might be familiar with:

msgid "%q must be >= 0"
msgstr "%qは0以上でなければなりません"

If we went back to the "word length" method of computing the offsets into the dictionary table, we could move beyond 256 bytes of table. Otherwise, I think this accomplishes everything MicroPython's approach does, and more.

@jepler jepler requested a review from tannewt September 16, 2020 13:00
@jepler jepler force-pushed the better-dictionary-compression branch from 23009bd to a8e98cd on September 16, 2020

@tannewt tannewt left a comment


Ok, thank you for the comments! I think it's a good balance between a fully greedy solution and one that won't take forever. :-)

@tannewt tannewt merged commit 750bc1e into adafruit:main Sep 16, 2020
@jepler jepler deleted the better-dictionary-compression branch November 3, 2021 21:09