
Commit 1ea703c

[Compression] Implement a new strategy for finding frequently used substrings.
The code-book compression (compression using a dictionary) shortens words by encoding references into a string table. This commit changes the code that constructs the string table.

Previously, we scanned all of the substrings in the input up to a certain length and sorted them by frequency. The disadvantage of that approach was that we encoded parts of substrings multiple times: for example, the word "Collection" and the substring "ollec" had the same frequency. Attempts to prune the list were too compute-intensive and not very effective (we checked whether "ollec" is a substring of "Collection" and whether the two had a similar frequency).

This commit implements a completely different approach: we now partition long words into tokens. For example, the string "ArrayType10Collection" is split into "Array" + "Type" + "10Collection". This method is very effective, and with the updated tables we can now reduce the size of the string table by 57%! This change also reduces the size of the string table by 1/3. With this change (and the auto-generated header files, which are not included in this commit), the size of the Swift dylib on x86 is reduced from 4.4MB to 3.6MB.
1 parent 2a27acd commit 1ea703c
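As an illustration of the tokenization described above (not part of the commit), the sketch below assumes the getTokens generator added in the diff further down is in scope. Note that the generator also applies length thresholds, so a short token such as "Type" delimits the partition but is not emitted.

# Illustrative sketch only: assumes the getTokens() generator from the diff
# below. It partitions a long identifier into coarse tokens instead of
# enumerating every overlapping substring.
word = "ArrayType10Collection"
print(list(getTokens(word)))
# -> ['Array', '10Collection']  ("Type" falls below the length thresholds
#    inside getTokens, so it is dropped rather than emitted)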

File tree

1 file changed: +62 -11 lines changed


utils/name-compression/CBCGen.py

Lines changed: 62 additions & 11 deletions
@@ -15,30 +15,80 @@ def collect_top_entries(val):
   Collect the most frequent substrings and organize them in a table.
   """
   # sort items by hit rate.
-  lst = sorted(hist.items(), key=lambda x: x[1] * len(x[0]) , reverse=True)[0:val]
+  lst = sorted(hist.items(), key=lambda x: x[1] , reverse=True)[0:val]

   # Strip out entries with a small number of hits.
   # These entries are not likely to help the compressor and can extend the compile
   # time of the mangler unnecessarily.
-  lst = filter(lambda p: p[1] > 500, lst)
+  lst = filter(lambda p: p[1] > 15 and len(p[0]) > 3, lst)
   return lst

+def getTokens(line):
+  """
+  Split the incoming line into independent parts. The tokenizer has rules for
+  extracting identifiers (strings that start with digits followed by letters),
+  rules for detecting words (strings that start with upper case letters and
+  continue with lower case letters) and rules to glue swift mangling tokens
+  into subsequent words.
+  """
+  # String builder.
+  sb = ""
+  # The last character.
+  Last = ""
+  for ch in line:
+    if Last.isupper():
+      # Uppercase letter to digits -> starts a new token.
+      if ch.isdigit():
+        if len(sb) > 3:
+          yield sb
+        sb = ""
+        sb += ch
+        Last = ch
+        continue
+      # Uppercase letter to lowercase or uppercase -> continue.
+      Last = ch
+      sb += ch
+      continue
+
+    # Digit -> continue.
+    if Last.isdigit():
+      Last = ch
+      sb += ch
+      continue
+
+    # Lowercase letter to digit or uppercase letter -> stop.
+    if Last.islower():
+      if ch.isdigit() or ch.isupper():
+        if len(sb) > 4:
+          yield sb
+        sb = ""
+        sb += ch
+        Last = ch
+        continue
+      Last = ch
+      sb += ch
+      continue
+
+    # Just append unclassified characters to the token.
+    if len(sb) > 3:
+      yield sb
+    sb = ""
+    sb += ch
+    Last = ch
+  yield sb
+
 def addLine(line):
   """
   Extract all of the possible substrings from \p line and insert them into
   the substring dictionary. This method knows to ignore the _T swift prefix.
   """
   if not line.startswith("__T"): return

-  # strip the "__T" for the prefix calculations
+  # Strip the "__T" for the prefix calculations.
   line = line[3:]

-  max_string_length = 9
-  string_len = len(line)
-  for seg_len in xrange(3, max_string_length):
-    for start_idx in xrange(string_len - seg_len):
-      substr = line[start_idx:start_idx+seg_len]
-      hist[substr] += 1
-
+  # Add all of the tokens in the word to the histogram.
+  for tok in getTokens(line):
+    hist[tok] += 1

 # Read all of the input files and add the substrings into the table.
 for f in filenames:
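The new loop above feeds every token into a frequency histogram. A minimal usage sketch (not part of the commit), assuming hist is a collections.defaultdict(int), as the hist[tok] += 1 updates imply, and using made-up mangled names:

# Sketch only: populating the token histogram the same way addLine does,
# assuming the getTokens generator from the hunk above is in scope.
from collections import defaultdict

hist = defaultdict(int)
for name in ["__TArrayType10Collection", "__TArrayType9Generator"]:  # made-up names
  if not name.startswith("__T"):
    continue
  for tok in getTokens(name[3:]):  # strip the "__T" prefix, then tokenize
    hist[tok] += 1

# Most frequent tokens first, mirroring collect_top_entries.
print(sorted(hist.items(), key=lambda x: x[1], reverse=True))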
@@ -54,7 +104,8 @@ def addLine(line):
 encoders = [c for c in charset] # alphabet without the escape chars.
 enc_len = len(encoders)

-# Take the most frequent entries from the table.
+# Take the most frequent entries from the table that fit into the range of
+# our indices (assuming two characters for indices).
 table = collect_top_entries(enc_len * enc_len)

 # Calculate the reverse mapping between the char to its index.
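The updated comment caps the table at enc_len * enc_len entries because a reference into the string table is spelled with two characters drawn from the charset. A minimal sketch of that two-character indexing, using a placeholder charset rather than the alphabet the script actually computes:

# Sketch only: two index characters address at most enc_len * enc_len entries,
# which is why collect_top_entries(enc_len * enc_len) is the cap. The charset
# here is a placeholder, not the script's real alphabet.
charset = "abcdefghijklmnopqrstuvwxyz"
enc_len = len(charset)

def encode_index(idx):
  # Map a table index to a two-character reference.
  assert 0 <= idx < enc_len * enc_len
  return charset[idx // enc_len] + charset[idx % enc_len]

def decode_index(pair):
  # Recover the table index from its two-character reference.
  return charset.index(pair[0]) * enc_len + charset.index(pair[1])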

0 commit comments
