You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[Compression] Implement a new strategy for finding frequently used substrings.
The code-book compression (compression using a dictionary) shortens words by
encoding references to a string table. This commit changes the code that
constructs the string table. Before we were scanning all of the substrings in
the input upto a certain length and we sorted them by frequency. The
disadvantage of that approach was that we encoded parts of substrings multiple
times. For example, the word "Collection" and the substring "ollec" had the same
frequency. Attempts to prune the list was too compute intensive and not very
effective (we checked if "ollec" is a substring of "Collection" and if they had
a similar frequency).
This commit implements a completely different approach. We now partition the
long words into tokens. For example, the string "ArrayType10Collection" is split
to "Array" + "Type" + "10Collection.
This method is very effective and with the updated tables we can now reduce the
size of the string table by 57%! This change also reduces the size of the string
table by 1/3.
With this change (the auto-generated h files, that are not included in this
commit) the size of the swift dylib on x86 is reduced from 4.4MB to 3.6MB.
0 commit comments