compression: Implement @ciscorn's dictionary approach #3398

Merged
merged 6 commits into adafruit:main from jepler:better-dictionary-compression on Sep 16, 2020

Conversation


@jepler jepler commented Sep 12, 2020

Massive savings. Thanks so much, @ciscorn, for providing the initial code for choosing the dictionary at #3370 (comment).

Now, a "dictionary" is chosen based on the frequency of strings of various lengths within the input corpus, according to a somewhat magic weight function. Some huffman codes, starting at 128, are set aside to indicate indices into this dictionary. The dictionary itself is up to 256 code points. The big advance (compared to bigrams) is that the dictionary items are arbitrary length. For instance, on Trinket M0 in English, here are some of the dictionary items: [' must be ', ' argument', ' specifi', 'attribut', 'support', ...]. " must be " occurs 21 times and gets a Huffman coding of 11001101 which is very economical (fewer bits than the single code point "%", for instance).

This adds a bit of time to the build, in order to compute the dictionary.

I also deleted the line that printed the "estimated total memory size" -- at least on the unix build with TRANSLATION=ja, it caused a build-time KeyError while computing the codebook size for all the strings. I think this happens because some single Unicode code point ('ァ') no longer appears as itself in the compressed strings, since it is always replaced by a word. Because this line did not account for ngrams before, or for words now, it was inaccurate anyway.

As promised, this seems to save hundreds of bytes in the German translation on the Trinket M0.

Testing performed:

  • built trinket_m0 (en, de_DE)
  • built and ran the unix port in several languages (en, de_DE, ja) and ran simple error-producing code like ./micropython -c '1/0', verifying there were no display errors

Massive savings.  Thanks so much @ciscorn for providing the initial
code for choosing the dictionary.

This adds a bit of time to the build, both to find the dictionary
and because (for reasons I don't fully understand) the binary
search in the compress() function no longer worked and had to be
replaced with a linear search.

I think this is because the intended invariant is that codebook
entries that encode to the same number of bits are ordered by
ascending value.  However, I misplaced the transition from "words"
to "byte/char values", so the codebook entries for words are in word
order rather than code order.

Because this price is only paid at build time, I didn't care to determine
exactly where the correct fix was.

I also commented out a line to produce the "estimated total memory size"
-- at least on the unix build with TRANSLATION=ja, this led to a build
time KeyError trying to compute the codebook size for all the strings.
I think this occurs because some single unicode code point ('ァ') is
no longer present as itself in the compressed strings, due to always
being replaced by a word.

As promised, this seems to save hundreds of bytes in the German translation
on the trinket m0.

Testing performed:
 - built trinket_m0 in several languages
 - built and ran unix port in several languages (en, de_DE, ja) and ran
   simple error-producing code like ./micropython -c '1/0'
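
For illustration (hypothetical names, not the real compress()): a canonical Huffman codebook can only be binary-searched if entries with the same code length are kept in ascending value order; once the word entries land in dictionary order instead, a linear scan is the safe build-time fallback.

```python
def huffman_code_for(symbol, codebook):
    """Return (code, bit_length) for `symbol` from a list of
    (symbol, code, bit_length) tuples by linear scan; slower than
    bisecting, but correct even when the word entries are not in
    code order."""
    for sym, code, nbits in codebook:
        if sym == symbol:
            return code, nbits
    raise KeyError(f"{symbol!r} not in codebook")
```
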
@jepler jepler changed the title from "compression: Implement ciscorn's dictionary approach" to "compression: Implement @ciscorn's dictionary approach" on Sep 12, 2020
Most users and the CI system are running in configurations where Python
configures stdout and stderr in UTF-8 mode.  However, Windows is different,
setting values like CP1252.  This led to a build failure on Windows, because
makeqstrdata printed Unicode strings to its stdout, expecting them to be
encoded as UTF-8.

This script is writing (stdout) to a compiler input file and potentially
printing messages (stderr) to a log or console.  Explicitly configure stdout to
use utf-8 to get consistent behavior on all platforms, and configure stderr so
that if any log/diagnostic messages are printed that cannot be displayed
correctly, they are still displayed instead of creating an error while trying
to print the diagnostic information.

I considered setting both encodings to ascii, but that would just be
occasionally inconvenient to developers like me who want to show diagnostic
info on stderr and in comments while working with the compression code.

Closes: adafruit#3408
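
A minimal sketch of the idea, assuming a Python 3.7+ build environment (the actual change lives in the build script this commit touches):

```python
import sys

# Generated compiler input must be byte-identical on every platform,
# so force UTF-8 on stdout regardless of the console code page.
sys.stdout.reconfigure(encoding="utf-8")

# Diagnostics on stderr should never abort the build just because the
# console (e.g. CP1252 on Windows) can't render a character.
sys.stderr.reconfigure(errors="backslashreplace")
```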

ciscorn commented Sep 13, 2020

@jepler Thank you for implementing my idea!

I've created a pull request against your repo (jepler#3) with further optimizations.

Small improvements to the dictionary compression
@jepler jepler requested a review from tannewt September 14, 2020 17:20

tannewt commented Sep 14, 2020

Could you explain briefly how this works? It's not clear to me from the code what units it now compresses.


tannewt commented Sep 14, 2020

Also, how does it compare to MicroPython's approach? micropython#5861


jepler commented Sep 15, 2020

I improved the initial comment on this PR and also added an explanation in the comments of translate.h:

+// - code points starting at 128 (word_start) and potentially extending
+//   to 255 (word_end) (but never interfering with the target
+//   language's used code points) stand for dictionary entries in a
+//   dictionary with size up to 256 code points.  The dictionary entries
+//   are computed with a heuristic based on frequent substrings of 2 to
+//   9 code points.  These are called "words" but are not, grammatically
+//   speaking, words.  They're just spans of code points that frequently
+//   occur together.
+//
+// - dictionary entries are non-overlapping, and the _ending_ index of each
+//   entry is stored in an array.  Since the index given is the ending
+//   index, the array is called "wends".
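
A small sketch of how the "wends" layout can be read back (hypothetical names and data; the real decoder is written in C, this is just the idea in Python): entry i spans from the previous entry's end to wends[i] within one flat run of dictionary text.

```python
# Hypothetical flattened dictionary: all entries stored back to back,
# with wends[i] giving the ending index of entry i.
words = " must be  argument specifiattribut"
wends = [9, 18, 26, 34]

def word_entry(i):
    # Entry i starts where entry i-1 ended and stops at wends[i].
    start = wends[i - 1] if i > 0 else 0
    return words[start:wends[i]]

assert word_entry(0) == " must be "
assert word_entry(1) == " argument"
```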

MicroPython's approach is very English-centric. It literally looks for the most frequent "words", where a "word" is surrounded by whitespace on either side. It has hard-coded assumptions that all code points are below 128, so it uses Huffman codes below 128 for the characters themselves and codes at 128 and above for the words, and does Huffman compression on those values. The table itself is not limited to 256 bytes and does not have an explicit table of lengths or ends, but only because it can reuse the high bit of each byte in the table to indicate "is last letter in word".

By contrast, @ciscorn's scheme (like our existing scheme) works for arbitrary Unicode code points, and is not limited by any definition of a "word". This is especially important for the Japanese translation, where whitespace doesn't play the same role in marking word boundaries as it does in English and other languages you might be familiar with:

msgid "%q must be >= 0"
msgstr "%qは0以上でなければなりません"

If we went back to the "word length" method of computing the offsets into the dictionary table, we could move beyond 256 bytes of table. Otherwise, I think this accomplishes everything MicroPython's approach does, and more.

@jepler jepler requested a review from tannewt September 16, 2020 13:00
@jepler jepler force-pushed the better-dictionary-compression branch from 23009bd to a8e98cd on September 16, 2020

@tannewt tannewt left a comment


Ok, thank you for the comments! I think it's a good balance between a fully greedy solution and one that won't take forever. :-)

@tannewt tannewt merged commit 750bc1e into adafruit:main Sep 16, 2020
@jepler jepler deleted the better-dictionary-compression branch November 3, 2021 21:09