You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Previously, all words in the (deduplicated) bitset would be stored raw -- a full
64 bits (8 bytes). Now, those words that are equivalent to others through a
specific mapping are stored separately and "mapped" to the original when
loading; this shrinks the table sizes significantly, as each mapped word is
stored in 2 bytes (a 4x decrease from the previous).
The new encoding is also potentially non-optimal: the "mapped" byte is
frequently repeated, as in practice many mapped words use the same base word.
Currently we only support two forms of mapping: rotation and inversion. Note
that these are both guaranteed to map transitively if at all, and supporting
mappings for which this is not true may require a more interesting algorithm for
choosing the optimal pairing.
Updated sizes:
Alphabetic : 2622 bytes (- 414 bytes)
Case_Ignorable : 1803 bytes (- 330 bytes)
Cased : 808 bytes (- 126 bytes)
Cc : 32 bytes
Grapheme_Extend: 1508 bytes (- 252 bytes)
Lowercase : 901 bytes (- 84 bytes)
N : 1064 bytes (- 156 bytes)
Uppercase : 838 bytes (- 96 bytes)
White_Space : 91 bytes (- 6 bytes)
Total table sizes: 9667 bytes (-1,464 bytes)
0 commit comments