translation: Compress as unicode, not bytes #2345
By treating each Unicode code point as a single symbol for Huffman compression, the overall compression rate can be somewhat improved without changing the algorithm. On the decompression side, when compressed values above 127 are encountered, they need to be converted from a 16-bit Unicode code point into a UTF-8 byte sequence.
Doing this returns approximately 1.5kB of flash storage with the zh_Latn_pinyin translation (292 -> 1768 bytes remaining in my build of trinket_m0).
Other "more ASCII" translations benefit less, and in fact zh_Latn_pinyin is no longer the most constrained translation! (de_DE 1156 -> 1384 bytes free in flash; I didn't check others before pushing for CI.)
English is slightly pessimized, 2840 -> 2788 bytes, probably mostly because the "values" array was changed from uint8_t to uint16_t, which is not strictly required for an all-ASCII translation. This could probably be avoided in this case, but as English is not the most constrained translation it doesn't really matter.
Testing performed: built for feather nRF52840 express and trinket m0 in English and zh_Latn_pinyin; ran and verified that localized messages such as "Àn xià rènhé jiàn jìnrù REPL. Shǐyòng CTRL-D chóngxīn jiāzài." and "Press any key to enter the REPL. Use CTRL-D to reload." were properly displayed.
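The decompression-side step described above (turning a 16-bit code point back into UTF-8 bytes) can be sketched roughly as follows. This is a Python illustration only, not the C code from this PR; the function name `utf8_from_code_point` is hypothetical, and it assumes every value fits in 16 bits and is not a surrogate.

```python
def utf8_from_code_point(cp: int) -> bytes:
    """Encode one 16-bit code point (0..0xFFFF) as a UTF-8 byte sequence.

    Hypothetical helper for illustration; surrogates are not handled.
    """
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("code point does not fit in 16 bits")
    if cp < 0x80:
        # ASCII: emitted unchanged as a single byte.
        return bytes([cp])
    if cp < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def decode_symbols(code_points) -> bytes:
    """Concatenate the UTF-8 encoding of each decompressed symbol."""
    return b"".join(utf8_from_code_point(cp) for cp in code_points)
```

For example, `decode_symbols(map(ord, "Shǐyòng"))` should give the same bytes as `"Shǐyòng".encode("utf-8")`, since each code point is expanded independently.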
What if we made the choice of bytes vs. code points a compile-time option? Then we could choose the best one for the particular translation. EDIT: The Python script could just have a list of languages for which to use code points. Or it could even do the work twice and pick the smaller one.
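The "do the work twice and pick the smaller one" idea could look something like the sketch below. Everything here is an assumption made for illustration: the names `huffman_bits`, `estimate_cost`, and `choose_mode` are hypothetical, and the cost model (packed Huffman bits plus one table entry per distinct symbol) is a simplification rather than the accounting the real build script uses.

```python
import heapq
from collections import Counter


def huffman_bits(symbols) -> int:
    """Total encoded length in bits under an optimal Huffman code."""
    counts = Counter(symbols)
    if not counts:
        return 0
    if len(counts) == 1:
        # A lone symbol still needs one bit per occurrence.
        return sum(counts.values())
    heap = list(counts.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        # Each merge adds one bit to every symbol beneath it.
        total += a + b
        heapq.heappush(heap, a + b)
    return total


def estimate_cost(text: str, by_code_point: bool) -> int:
    """Rough size in bytes: packed bits plus the symbol table (assumed model)."""
    if by_code_point:
        symbols = [ord(c) for c in text]
    else:
        symbols = list(text.encode("utf-8"))
    distinct = set(symbols)
    # Assumed: table entries are 1 byte unless some symbol exceeds 255.
    width = 2 if distinct and max(distinct) > 255 else 1
    return (huffman_bits(symbols) + 7) // 8 + len(distinct) * width


def choose_mode(text: str) -> str:
    """Pick whichever symbol type gives the smaller estimate."""
    by_cp = estimate_cost(text, by_code_point=True)
    by_byte = estimate_cost(text, by_code_point=False)
    return "code_point" if by_cp < by_byte else "byte"
```

On a very short string the wider table entries can outweigh the savings, so a per-translation comparison like this only becomes meaningful when run over the full set of translated messages.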
I may have mistranscribed the values earlier, or the translation size differs from trinket_m0 to pirkey_m0 (pirkey_m0 finished its Actions sooner, so I chose it for this table). Here are the savings from the GitHub Actions logs; all boards are slightly improved. In this particular build, de_DE is now the most resource-constrained.
OK, my previous comments are now moot, given the new results.
If a translation only has unicode code points 255 and below, the "values" array can be 8 bits instead of 16 bits. This reclaims some code size, e.g., in a local build, trinket_m0 / en_US reclaimed 112 bytes and de_DE reclaimed 104 bytes. However, languages like zh_Latn_pinyin, which use code points above 255, did not benefit.
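A minimal sketch of that width selection, assuming the build script has the full set of translated messages available when it emits the table. The function name `values_element_type` is hypothetical; only the `uint8_t` / `uint16_t` choice comes from the discussion above.

```python
def values_element_type(messages) -> str:
    """Return the C element type needed for the "values" array.

    Hypothetical helper: uses uint8_t when every code point is 255 or
    below, and falls back to uint16_t otherwise.
    """
    highest = max((ord(ch) for msg in messages for ch in msg), default=0)
    return "uint8_t" if highest <= 255 else "uint16_t"
```

Under this rule a translation using only characters such as "ü" (U+00FC) keeps the 8-bit table, while one containing "ǐ" (U+01D0) needs the 16-bit table. Values from 128 to 255 stored in an 8-bit table would still have to be expanded to two UTF-8 bytes on decompression.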
Super awesome! Thank you so much for adding this.
I had originally added support for all byte values in order to handle compressing dynamically loaded strings but we haven't added that. I'm happy to trade that future for this reality. Thanks!