translation: Compress as unicode, not bytes #2345

jepler · 2019-12-02T15:51:30Z

By treating each Unicode code point as a single symbol for Huffman compression, the overall compression ratio can be somewhat improved without changing the algorithm. On the decompression side, when a decoded value above 127 is encountered, it needs to be converted from a 16-bit Unicode code point into a UTF-8 byte sequence.
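
To illustrate the decompression-side conversion described above, here is a minimal Python sketch of turning a 16-bit code point into UTF-8 bytes. It is not the code from this PR (the real decompressor is C); the function name `code_point_to_utf8` is hypothetical, and the sketch assumes values never exceed 0xFFFF and does not reject surrogates.

```python
def code_point_to_utf8(cp: int) -> bytes:
    """Encode a code point in the range 0..0xFFFF as 1 to 3 UTF-8 bytes."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("expected a 16-bit code point")
    if cp < 0x80:
        # ASCII: emitted unchanged as a single byte.
        return bytes([cp])
    if cp < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

A 16-bit value never needs more than three UTF-8 bytes, so a decoder along these lines only has to handle the three cases shown.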

Doing this frees approximately 1.5 kB of flash storage with the zh_Latn_pinyin translation (292 -> 1768 bytes free in my build of trinket_m0).

Other "more ASCII" translations benefit less, and in fact zh_Latn_pinyin is no longer the most constrained translation! (de_DE: 1156 -> 1384 bytes free in flash; I didn't check others before pushing for CI.)

English is slightly pessimized, 2840 -> 2788 bytes free, probably mostly because the "values" array was changed from uint8_t to uint16_t, which is not strictly required for an all-ASCII translation. This could probably be avoided in this case, but as English is not the most constrained translation it doesn't really matter.

Testing performed: built for feather nRF52840 express and trinket m0 in English and zh_Latn_pinyin; ran and verified the localized messages such as

Àn xià rènhé jiàn jìnrù REPL. Shǐyòng CTRL-D chóngxīn jiāzài.

and

Press any key to enter the REPL. Use CTRL-D to reload.

were properly displayed.


dhalbert · 2019-12-02T15:53:21Z

Suppose we made the choice of byte vs code point be a compile-time option? Then we could choose the best one for the particular translation.

EDIT: The Python script could just have a list of languages for which to use code points. Or it could even do the work twice and pick the smaller one.
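
The "do the work twice and pick the smaller one" idea could be sketched roughly as follows. This is not the project's build script: `huffman_bits`, `estimate_cost`, and `pick_symbol_unit` are hypothetical names, and the cost model (packed code bits plus one table entry per distinct symbol, 1 byte wide for bytes and 2 bytes wide for code points) is a simplifying assumption.

```python
import heapq
from collections import Counter


def huffman_bits(symbols):
    """Total bits needed to Huffman-code `symbols` (code table not included)."""
    freq = Counter(symbols)
    if len(freq) < 2:
        # Zero or one distinct symbol: charge one bit per occurrence.
        return sum(freq.values())
    heap = list(freq.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged  # each merge adds one bit to every symbol beneath it
        heapq.heappush(heap, merged)
    return total


def estimate_cost(symbols, bytes_per_value):
    """Assumed cost model: packed code bits plus a table of symbol values."""
    symbols = list(symbols)
    return (huffman_bits(symbols) + 7) // 8 + len(set(symbols)) * bytes_per_value


def pick_symbol_unit(messages):
    """Return (unit, costs), where unit is whichever option is estimated smaller."""
    text = "".join(messages)
    costs = {
        "bytes": estimate_cost(text.encode("utf-8"), 1),   # uint8_t values
        "code_points": estimate_cost(map(ord, text), 2),   # uint16_t values
    }
    return min(costs, key=costs.get), costs
```

For an all-ASCII translation the two symbol streams are identical, so in this model the only difference is the wider table, which is why bytes would win there.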

jepler · 2019-12-02T16:56:29Z

I may have mistranscribed the values earlier, or the translation size differs from trinket_m0 to pirkey_m0 (pirkey_m0 finished its Actions sooner, so I chose it for this table). Here are the savings from the GitHub Actions logs; all translations are at least slightly improved. In this particular build, de_DE is now the most resource-constrained.

pirkey_m0	Free flash (bytes)
Translation	Before	After	Saved
pl	4464	4900	436
en_US	5440	5488	48
pt_BR	5204	5300	96
de_DE	3856	4052	196
it_IT	4704	4792	88
fr	3992	4280	288
zh_Latn_pinyin	3028	4440	1412
fil	4780	4824	44
ID	5112	5156	44
ko	4280	4972	692
es	4516	4664	148
en_x_pirate	5416	5460	44

dhalbert · 2019-12-02T20:04:43Z

OK, my previous comments are now moot, given the new results.

If a translation uses only Unicode code points 255 and below, the "values" array can be 8 bits wide instead of 16 bits. This reclaims some code size, e.g., in a local build, trinket_m0 / en_US reclaimed 112 bytes and de_DE reclaimed 104 bytes. However, languages like zh_Latn_pinyin, which use code points above 255, did not benefit.
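
As a rough sketch of the width selection described above (not the actual build-script code; `values_c_type` is a hypothetical helper name), the choice depends only on the largest code point used by the translation:

```python
def values_c_type(messages):
    """Pick the C element type for the "values" array from the largest code point."""
    max_cp = max((ord(ch) for msg in messages for ch in msg), default=0)
    # Code points up to 255 fit in one byte; anything larger needs 16 bits.
    return "uint8_t" if max_cp <= 0xFF else "uint16_t"
```

With 8-bit values, entries between 128 and 255 are still code points rather than raw bytes, so they would still need converting to UTF-8 when decoded.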

tannewt

Super awesome! Thank you so much for adding this.

I had originally added support for all byte values in order to handle compressing dynamically loaded strings but we haven't added that. I'm happy to trade that future for this reality. Thanks!

jepler requested a review from tannewt December 2, 2019 15:51

jepler removed the request for review from tannewt December 2, 2019 16:15

makeqstrdata: fix printing of 'increased length' message

879e104

jepler requested a review from tannewt December 2, 2019 19:28

tannewt approved these changes Dec 3, 2019

tannewt merged commit 15886b1 into adafruit:master Dec 3, 2019

jepler deleted the compressed-unicode branch November 3, 2021 21:09
