Commit c54dc86

jeffhostetler authored and Git for Windows Build Agent committed
t/lib-unicode-nfc-nfd: helper prereqs for testing unicode nfc/nfd

Create a set of prereqs to help understand how file names are handled
by the filesystem when they contain NFC and NFD Unicode characters.

Signed-off-by: Jeff Hostetler <[email protected]>

1 parent 02e0197 commit c54dc86
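
To make the NFC/NFD distinction concrete, here is a minimal, self-contained sketch (not part of the commit) that builds both spellings of "é" and dumps their bytes. The helper name `hex_bytes` is hypothetical. The sketch uses octal escapes instead of the `\x` escapes that appear in the helper script, on the assumption that it may be run under a shell whose `printf` implements only the octal form specified by POSIX.

```shell
# Sketch: the NFC and NFD spellings of "é" are different byte strings.

# U+00E9 encoded as UTF-8 (c3 a9), written with octal escapes.
utf8_nfc=$(printf '\303\251')
# U+0065 followed by U+0301 encoded as UTF-8 (65 cc 81).
utf8_nfd=$(printf '\145\314\201')

# hex_bytes (hypothetical helper): print the bytes of "$1" as lowercase
# hex with all whitespace removed, so that platform-specific spacing in
# od output does not matter.
hex_bytes () {
	printf '%s' "$1" | od -An -t x1 | tr -d '[:space:]'
}

echo "NFC: $(hex_bytes "$utf8_nfc")"
echo "NFD: $(hex_bytes "$utf8_nfd")"
```

Both strings render as the same glyph, which is why the prereqs in this commit compare raw bytes from `od` rather than relying on how a name is displayed.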

1 file changed, +167 -0 lines changed

t/lib-unicode-nfc-nfd.sh

@@ -0,0 +1,167 @@
# Help detect how Unicode NFC and NFD are handled on the filesystem.

# A simple character that has an NFD form.
#
# NFC:       U+00e9 LATIN SMALL LETTER E WITH ACUTE
# UTF8(NFC): \xc3 \xa9
#
# NFD:       U+0065 LATIN SMALL LETTER E
#            U+0301 COMBINING ACUTE ACCENT
# UTF8(NFD): \x65 + \xcc \x81
#
utf8_nfc=$(printf "\xc3\xa9")
utf8_nfd=$(printf "\x65\xcc\x81")

# Is the OS or the filesystem "Unicode composition sensitive"?
#
# That is, does the OS or the filesystem allow files to exist with
# both the NFC and NFD spellings? Or, does the OS/FS lie to us and
# tell us that the NFC and NFD forms are equivalent?
#
# This may be independent of what type of filesystem we have,
# since it might be handled by the OS at a layer above the FS.
# For example, testing shows that macOS reports a collision on
# APFS, HFS+, and FAT32.
#
# This does not tell us how the Unicode pathname will be spelled
# on disk, but rather only that the two spellings "collide". We
# will examine the actual on-disk spelling in a later prereq.
#
test_lazy_prereq UNICODE_COMPOSITION_SENSITIVE '
	mkdir trial_${utf8_nfc} &&
	mkdir trial_${utf8_nfd}
'

# Is the spelling of an NFC pathname preserved on disk?
#
# On macOS, NFC paths are converted into NFD on HFS+ and FAT32,
# but are preserved on APFS. As we have established
# above, this is independent of "composition sensitivity".
#
# 0000000 63 5f c3 a9
#
# (/usr/bin/od output contains different amounts of whitespace
# on different platforms, so we need the wildcards here.)
#
test_lazy_prereq UNICODE_NFC_PRESERVED '
	mkdir c_${utf8_nfc} &&
	ls | od -t x1 | grep "63 *5f *c3 *a9"
'

# Is the spelling of an NFD pathname preserved on disk?
#
# 0000000 64 5f 65 cc 81
#
test_lazy_prereq UNICODE_NFD_PRESERVED '
	mkdir d_${utf8_nfd} &&
	ls | od -t x1 | grep "64 *5f *65 *cc *81"
'

# The following _DOUBLE_ forms are more for my curiosity,
# but there may be quirks lurking when there are multiple
# combining characters in non-canonical order.

# Unicode also allows multiple combining characters
# that can be decomposed in pieces.
#
# NFC:        U+1f67 GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI
# UTF8(NFC):  \xe1 \xbd \xa7
#
# NFD1:       U+1f61 GREEK SMALL LETTER OMEGA WITH DASIA
#             U+0342 COMBINING GREEK PERISPOMENI
# UTF8(NFD1): \xe1 \xbd \xa1 + \xcd \x82
#
# But U+1f61 decomposes into
# NFD2:       U+03c9 GREEK SMALL LETTER OMEGA
#             U+0314 COMBINING REVERSED COMMA ABOVE
# UTF8(NFD2): \xcf \x89 + \xcc \x94
#
# Yielding:   \xcf \x89 + \xcc \x94 + \xcd \x82
#
# Note that I've used the canonical ordering of the
# combining characters. It is also possible to
# swap them. My testing shows that the non-standard
# ordering also causes a collision in mkdir. However,
# the resulting names don't draw correctly on the
# terminal (implying that the on-disk format also has
# them out of order).
#
greek_nfc=$(printf "\xe1\xbd\xa7")
greek_nfd1=$(printf "\xe1\xbd\xa1\xcd\x82")
greek_nfd2=$(printf "\xcf\x89\xcc\x94\xcd\x82")

# See if a double decomposition also collides.
#
test_lazy_prereq UNICODE_DOUBLE_COMPOSITION_SENSITIVE '
	mkdir trial_${greek_nfc} &&
	mkdir trial_${greek_nfd2}
'

# See if the NFC spelling appears on the disk.
#
test_lazy_prereq UNICODE_DOUBLE_NFC_PRESERVED '
	mkdir c_${greek_nfc} &&
	ls | od -t x1 | grep "63 *5f *e1 *bd *a7"
'

# See if the NFD spelling appears on the disk.
#
test_lazy_prereq UNICODE_DOUBLE_NFD_PRESERVED '
	mkdir d_${greek_nfd2} &&
	ls | od -t x1 | grep "64 *5f *cf *89 *cc *94 *cd *82"
'

# The following is for debugging. I found it useful when
# trying to understand the various (OS, FS) quirks WRT
# Unicode and how composition/decomposition is handled.
# For example, when trying to understand how (macOS, APFS)
# and (macOS, HFS) and (macOS, FAT32) compare.
#
# It is rather noisy, so it is disabled by default.
#
if test "$unicode_debug" = "true"
then
	if test_have_prereq UNICODE_COMPOSITION_SENSITIVE
	then
		echo NFC and NFD are distinct on this OS/filesystem.
	else
		echo NFC and NFD are aliases on this OS/filesystem.
	fi

	if test_have_prereq UNICODE_NFC_PRESERVED
	then
		echo NFC maintains original spelling.
	else
		echo NFC is modified.
	fi

	if test_have_prereq UNICODE_NFD_PRESERVED
	then
		echo NFD maintains original spelling.
	else
		echo NFD is modified.
	fi

	if test_have_prereq UNICODE_DOUBLE_COMPOSITION_SENSITIVE
	then
		echo DOUBLE NFC and NFD are distinct on this OS/filesystem.
	else
		echo DOUBLE NFC and NFD are aliases on this OS/filesystem.
	fi

	if test_have_prereq UNICODE_DOUBLE_NFC_PRESERVED
	then
		echo Double NFC maintains original spelling.
	else
		echo Double NFC is modified.
	fi

	if test_have_prereq UNICODE_DOUBLE_NFD_PRESERVED
	then
		echo Double NFD maintains original spelling.
	else
		echo Double NFD is modified.
	fi
fi
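
The same questions can be asked outside Git's test framework with a small standalone script. What follows is a rough sketch under stated assumptions, not the commit's code: the function names `probe_collision` and `probe_preserved` are hypothetical, `mktemp -d` is assumed to be available, and the probe only describes whichever filesystem hosts the temporary directory. It mirrors the single-character prereqs above: try to create both spellings side by side, then compare the bytes that `ls` reports against the bytes that were requested.

```shell
# Octal escapes for U+00E9 (NFC) and U+0065 U+0301 (NFD), as UTF-8.
utf8_nfc=$(printf '\303\251')
utf8_nfd=$(printf '\145\314\201')

# probe_collision (hypothetical): print "distinct" if directories with
# the NFC and NFD spellings can coexist, or "aliases" if the second
# mkdir fails. Returns non-zero if the probe could not be set up.
probe_collision () {
	dir=$(mktemp -d) || return 1
	if ! mkdir "$dir/trial_$utf8_nfc"
	then
		rm -rf "$dir"
		return 1
	fi
	if mkdir "$dir/trial_$utf8_nfd" 2>/dev/null
	then
		echo distinct
	else
		echo aliases
	fi
	rm -rf "$dir"
}

# probe_preserved (hypothetical): create a directory named "$1" and
# print "preserved" if the listing reports the same bytes that were
# requested, or "modified" otherwise.
probe_preserved () {
	dir=$(mktemp -d) || return 1
	mkdir "$dir/$1" || { rm -rf "$dir"; return 1; }
	want=$(printf '%s' "$1" | od -An -t x1 | tr -d '[:space:]')
	got=$(ls "$dir" | tr -d '\n' | od -An -t x1 | tr -d '[:space:]')
	rm -rf "$dir"
	if test "$got" = "$want"
	then
		echo preserved
	else
		echo modified
	fi
}

echo "NFC vs NFD: $(probe_collision)"
echo "NFC spelling: $(probe_preserved "c_$utf8_nfc")"
echo "NFD spelling: $(probe_preserved "d_$utf8_nfd")"
```

Going by the comments in the helper script, one would expect "aliases" on macOS with APFS, HFS+, or FAT32, and a "modified" NFC spelling on HFS+ and FAT32. The sketch compares whole hex strings instead of using `grep` with wildcards, which sidesteps the whitespace differences in `od` output that the script's comments mention.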