
translation: Compress as unicode, not bytes #2345


Merged: 3 commits, Dec 3, 2019


@jepler jepler commented Dec 2, 2019

By treating each Unicode code point as a single symbol for Huffman compression, the overall compression ratio can be somewhat improved without changing the algorithm. On the decompression side, when decoded values above 127 are encountered, they need to be converted from a 16-bit Unicode code point into a UTF-8 byte sequence.
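The conversion step on the decompression side can be sketched as follows. This is a minimal Python illustration, not the C code from this PR; the function name `utf8_from_code_point` is hypothetical, and the sketch assumes every value fits in 16 bits, as the description states.

```python
def utf8_from_code_point(cp: int) -> bytes:
    """Encode one code point below 0x10000 as a UTF-8 byte sequence.

    Hypothetical helper for illustration only. Code points above 0xFFFF
    (which need 4 bytes) and surrogates are rejected, since this sketch
    assumes the 16-bit values described in the PR.
    """
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("only 16-bit code points are handled in this sketch")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded as UTF-8")
    if cp < 0x80:
        # ASCII: a single byte, unchanged
        return bytes([cp])
    if cp < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


message = "Àn xià rènhé jiàn jìnrù REPL."
encoded = b"".join(utf8_from_code_point(ord(ch)) for ch in message)
assert encoded == message.encode("utf-8")
```

Values below 128 pass through as a single byte, so an all-ASCII translation never reaches the multi-byte branches.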

Doing this returns approximately 1.5kB of flash storage with the zh_Latn_pinyin translation. (292 -> 1768 bytes remaining in my build of trinket_m0)

Other "more ASCII" translations benefit less, and in fact zh_Latn_pinyin is no longer the most constrained translation! (de_DE 1156 -> 1384 bytes free in flash, I didn't check others before pushing for CI)

English is slightly pessimized (2840 -> 2788 bytes free), probably mostly because the "values" array was changed from uint8_t to uint16_t, which is not strictly required for an all-ASCII translation. This could probably be avoided for English, but as English is not the most constrained translation it doesn't really matter.

Testing performed: built for feather nRF52840 express and trinket m0 in English and zh_Latn_pinyin; ran and verified the localized messages such as

Àn xià rènhé jiàn jìnrù REPL. Shǐyòng CTRL-D chóngxīn jiāzài.

and

Press any key to enter the REPL. Use CTRL-D to reload.

were properly displayed.

@jepler jepler requested a review from tannewt December 2, 2019 15:51
@dhalbert
Collaborator

dhalbert commented Dec 2, 2019

Suppose we made the choice of byte vs code point be a compile-time option? Then we could choose the best one for the particular translation.

EDIT: The Python script could just have a list of languages for which to use code points. Or it could even do the work twice and pick the smaller one.
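The "do the work twice and pick the smaller one" idea could look roughly like the sketch below. It is not the project's build script: the function names are hypothetical, and the size model is an assumption (Huffman payload bits plus one "values" table entry per distinct symbol, 1 byte per entry for bytes and 2 bytes per entry for code points).

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a frequency table."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    tiebreak = count()  # keeps heap tuples comparable when frequencies tie
    heap = [(f, next(tiebreak), [sym]) for sym, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        # Every symbol under the merged node gets one bit deeper.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), syms1 + syms2))
    return lengths


def estimated_size(messages, as_code_points):
    """Estimate total bytes under the assumed size model."""
    if as_code_points:
        streams = [[ord(c) for c in m] for m in messages]
    else:
        streams = [list(m.encode("utf-8")) for m in messages]
    freqs = Counter(sym for stream in streams for sym in stream)
    lengths = huffman_code_lengths(freqs)
    payload_bits = sum(freqs[sym] * lengths[sym] for sym in freqs)
    entry_bytes = 2 if as_code_points else 1  # assumed table entry width
    return (payload_bits + 7) // 8 + entry_bytes * len(freqs)


def pick_encoding(messages):
    """Compress both ways and report the smaller estimate."""
    by_bytes = estimated_size(messages, as_code_points=False)
    by_code_points = estimated_size(messages, as_code_points=True)
    if by_code_points < by_bytes:
        return ("code_points", by_code_points)
    return ("bytes", by_bytes)
```

Under this model an all-ASCII translation always comes out as `"bytes"`, because the payload is identical and only the table entries grow.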

@jepler jepler removed the request for review from tannewt December 2, 2019 16:15
@jepler
Author

jepler commented Dec 2, 2019

I may have mistranscribed the values earlier, or the translation size differs from trinket_m0 to pirkey_m0 (pirkey_m0 finished its Actions sooner, so I chose it for this table). Here are the savings from the GitHub Actions logs, and all boards are slightly improved. In this particular build, de_DE is now the most resource constrained.

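pirkey_m0 free flash (bytes)

| Translation | Before | After | Saved |
| --- | --- | --- | --- |
| pl | 4464 | 4900 | 436 |
| en_US | 5440 | 5488 | 48 |
| pt_BR | 5204 | 5300 | 96 |
| de_DE | 3856 | 4052 | 196 |
| it_IT | 4704 | 4792 | 88 |
| fr | 3992 | 4280 | 288 |
| zh_Latn_pinyin | 3028 | 4440 | 1412 |
| fil | 4780 | 4824 | 44 |
| ID | 5112 | 5156 | 44 |
| ko | 4280 | 4972 | 692 |
| es | 4516 | 4664 | 148 |
| en_x_pirate | 5416 | 5460 | 44 |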

@jepler jepler requested a review from tannewt December 2, 2019 19:28
@dhalbert
Collaborator

dhalbert commented Dec 2, 2019

OK, my previous comments are now moot, given the new results.

If a translation only has unicode code points 255 and below, the "values"
array can be 8 bits instead of 16 bits.  This reclaims some code size,
e.g., in a local build, trinket_m0 / en_US reclaimed 112 bytes and de_DE
reclaimed 104 bytes.  However, languages like zh_Latn_pinyin, which use
code points above 255, did not benefit.
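
The width choice described in this commit message can be sketched in Python as below. The function name `values_element_type` is hypothetical and this is not the code from the commit; it only illustrates the rule that an 8-bit "values" array suffices when no code point exceeds 255.

```python
def values_element_type(translated_text: str) -> str:
    """Pick the C element type for the "values" array (illustrative only).

    Assumes, as in this PR, that every code point fits in 16 bits.
    """
    max_cp = max((ord(c) for c in translated_text), default=0)
    if max_cp > 0xFFFF:
        raise ValueError("code points above 0xFFFF are outside this sketch")
    # Code points 0..255 fit in one byte, so the table can stay 8 bits wide.
    return "uint8_t" if max_cp <= 0xFF else "uint16_t"


assert values_element_type("Press any key to enter the REPL.") == "uint8_t"
assert values_element_type("Drücken Sie eine Taste") == "uint8_t"  # ü is U+00FC
assert values_element_type("Shǐyòng CTRL-D") == "uint16_t"  # ǐ is U+01D0
```

With an 8-bit table, values from 128 to 255 are still code points rather than raw bytes, so they still go through the UTF-8 conversion on output.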
Member

@tannewt tannewt left a comment


Super awesome! Thank you so much for adding this.

I had originally added support for all byte values in order to handle compressing dynamically loaded strings but we haven't added that. I'm happy to trade that future for this reality. Thanks!

@tannewt tannewt merged commit 15886b1 into adafruit:master Dec 3, 2019
@jepler jepler deleted the compressed-unicode branch November 3, 2021 21:09