stdlib: Improve String comparison performance #7339

ymanton · 2017-02-08T19:28:27Z

This patch attempts to recover some of the performance lost on Linux due to ef974af. A more detailed description is provided in the commit message, but the short summary is that we can use a memcmp-like approach to compare the binary representations of two strings until we hit a difference, after which we can fall back to the more expensive ICU routines. A memcmp-like routine, written in C, is provided which does a modest job of optimizing the comparison by comparing uintptr_t-sized blocks at a time when possible.

The performance lost due to ef974af strongly depends on which version of ICU Swift is built against; versions of ICU prior to ICU 53 are particularly bad when dealing with ASCII strings so in those environments ef974af is particularly costly. I've tested on both Ubuntu 14.04 (with ICU 52) and Ubuntu 16.04 (with ICU 55) with a simple micro-benchmark @djones6 came up with that tests various scenarios; results are as follows:

Ubuntu 16.04

Test	Swift 3.1	Swift 3.1+patch	3.1 / 3.1+patch
ASCII lexically equal, binary equal	439.02	26	16.89
ASCII lexically not equal, same length	491.1	149.80	3.28
ASCII lexically not equal, different lengths	496.91	148.91	3.34
UTF16 lexically equal, binary equal	35.84	27.43	1.31
UTF16 lexically equal, binary not equal, same length	267.98	232.45	1.15
UTF16 lexically not equal, same length	265.44	227.90	1.16

Ubuntu 14.04

Test	Swift 3.1	Swift 3.1+patch	3.1 / 3.1+patch
ASCII lexically equal, binary equal	8418.33	19.79	425.31
ASCII lexically not equal, same length	8038.33	400.19	20.07
ASCII lexically not equal, different lengths	8037.67	399.96	20.10
UTF16 lexically equal, binary equal	22.25	21.38	1.04
UTF16 lexically equal, binary not equal, same length	224.21	175.08	1.28
UTF16 lexically not equal, same length	220.91	171.83	1.29

(Measurements are elapsed time in seconds for 1 million comparisons)

I'm in the process of re-writing this micro-benchmark to conform to the conventions used in swift/benchmarks (as suggested by @dabrahams) however it seems that benchmarking is not supported on Linux and requires more than a few changes to the build system to get working so in the interest of getting early feedback on this patch itself I put together this PR. Any feedback would be appreciated.

moiseev · 2017-02-08T20:58:54Z

/cc @dabrahams

moiseev · 2017-02-08T21:01:51Z

@swift-ci Please test Linux platform

gottesmm · 2017-02-08T22:54:55Z

@ymanton One thing that we have discussed is switching the benchmark suite to build using swiftpm. I may have a patch around that does this. It would enable this to be built on Linux with some small modifications. Interested?

ymanton · 2017-02-08T23:49:45Z

@gottesmm Sure, that would be helpful.

jrose-apple · 2017-02-09T01:43:47Z

Is this correct? "abcdèfg" and "abcdëfg" have a common byte prefix of "abcde" (when in decomposed form, anyway), but would comparing starting with the combining character produce a different result than comparing starting with the base character?

dabrahams · 2017-02-09T23:17:18Z

@ymanton Could you please write and executes a test case that answers @jrose-apple's question? Thanks

moiseev · 2017-02-09T23:36:48Z

/cc @milseman

ymanton · 2017-02-10T12:05:11Z

would comparing starting with the combining character produce a different result than comparing starting with the base character?

I put together a quick test that takes every composed character that is formed by the cross product of [a-zA-Z] x [U+0300-U+036F] (i.e. pairing every latin alphabetical character with every combining diacritical mark) and in the middle of "abcd" and "ef", then I compared every one of those strings against every other, both for equality and order -- the results were identical with and without this patch.

I can't claim to be a Unicode expert, and my attempts at reading through the Unicode Collation Algorithm spec in the hopes of finding statements that says this is legitimate have yet to succeed, but it seems to pass the sniff test anyway. I'll try to come up with something more thorough, perhaps utilizing the UCA conformance tests.

Having said that, here are some alternatives that come to mind:

On hitting a mismatch we can always restart ICU collation from offset 0, but this would make all unequal string comparisons slower than they are right now.
We could check if the mismatch involves simple characters (e.g. in the ASCII range) and only start ICU collation off at the mismatch offset in those cases. This would penalize a smaller subset of unequal string comparisons.
The inverse of the above would be checking if the mismatch involves combining or modifying characters only restarting ICU collation at offset 0 in those cases.
We could backtrack if the mismatch involves combining or modifying characters and find the first standalone character in order to restart ICU collation at that position. The backtracking itself might be fairly expensive in terms of tests per character, but I imagine in most cases you'd be backtracking past a small number of characters.

jrose-apple · 2017-02-10T16:48:41Z

The Collation algorithm TR has this under "What Collation Is Not":

Collation order is not preserved under concatenation or substring operations, in general.

For example, the fact that x is less than y does not mean that x + z is less than y + z, because characters may form contractions across the substring or concatenation boundaries. In summary:

x < y does not imply that xz < yz
x < y does not imply that zx < zy
xz < yz does not imply that x < y
zx < zy does not imply that x < y

However, it's unclear whether this applies to non-localized collations. Further down, under "Default Unicode Collation Element Table":

In the Default Unicode Collation Element Table, contractions are necessary where a canonical decomposable character requires a distinct primary weight in the table, so that the canonical-equivalent character sequences are given the same weights. For example, Indic two-part vowels have primary weights as units, and their canonical-equivalent sequence of vowel parts must be given the same primary weight by means of a contraction entry in the table. The same applies to a number of precomposed Cyrillic characters with diacritic marks and to a small number of Arabic letters with madda or hamza marks.

The Default Unicode Collation Element Table is constructed to be consistent with the Unicode Normalization algorithm, and to respect the Unicode character properties. It is not, however, merely algorithmically derivable based on considerations of canonical equivalence and an inspection of character properties, because the assignment of levels also takes into account characteristics of particular scripts. For example, the combining marks generally have secondary collation elements; however, the Indic combining vowels are given non-zero Level 1 weights, because they are as significant in sorting as the consonants.

The simplest example where this would go wrong in a localized comparison is "abcde" vs. "abche" in Spanish locales, where "ch" is treated as a single ~~letter~~ collation element. It's a little unclear to me whether we'd be allowed to tailor the default table in such a way that we use no "contractions" to avoid this problem, and then secondarily unclear whether that would be desirable.

The mitigation of only doing this for ASCII mismatches is probably good enough, but someone would need to double-check that none of the multi-character collation element mappings in the default table have non-initial ASCII characters. I think that would be a sufficient condition but you or @dabrahams should probably check my reading.

I like the idea of having a little Swift program that can run through the UCA conformance tests. We could even add that to our validation tests.

jrose-apple · 2017-02-10T16:51:59Z

Ah, by the way, please don't take my comments as dismissal or disrespect. I/We definitely appreciate you coming in with a patch to improve things. :-)

jrose-apple · 2017-02-10T16:52:59Z

Ah, one workflow note: we usually take patches to the master branch first, and then cherry-pick them to the release branch after we've seen that there are no issues.

dabrahams · 2017-02-11T23:23:32Z

Per the string manifesto, we really don't want to be using the UCA for non-localized comparisons anyway. However, the exact semantics of string compArison are most likely going to be changing soon And we should only change them once. Michael is working in this area. Michael, will you be able to use anything from this patch? Sent from my moss-covered three-handled family gradunza

…

On Feb 8, 2017, at 5:44 PM, Jordan Rose ***@***.***> wrote: Is this correct? "abcdèfg" and "abcdëfg" have a common byte prefix of "abcde" (when in decomposed form, anyway), but would comparing starting with the combining character produce a different result than comparing starting with the base character? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

milseman · 2017-02-13T21:27:16Z

I am working on the comparison algorithm and semantics for String in Swift 4. However, if there is a semantics-preserving optimization, such as this PR, we should take it.

My main concern with this PR, because I am not a Unicode expert, is whether returning the offset of the first code unit difference is correct. I believe that returning the offset of the code unit that began the new grapheme would be. The tight loop could be adjusted to remember the offset of the last non-combining character and return that instead.

@ymanton, does this PR exhibit different behavior if the compared strings are canonicalized differently? That is, if their normal forms differ or they are not even in a normal form? To use the collation TR's example, do strings containing the following equivalent graphemes compare equivalently no matter how they are expressed?

ự	U+1EF1 LATIN SMALL LETTER U WITH HORN AND DOT BELOW
ụ ◌̛	U+1EE5 LATIN SMALL LETTER U WITH DOT BELOW, U+031B COMBINING HORN
u ◌̛ ◌̣	U+0075 LATIN SMALL LETTER U, U+031B COMBINING HORN, U+0323 COMBINING DOT BELOW
ư ◌̣	U+01B0 LATIN SMALL LETTER U WITH HORN, U+0323 COMBINING DOT BELOW
u ◌̣ ◌̛	U+0075 LATIN SMALL LETTER U, U+0323 COMBINING DOT BELOW, U+031B COMBINING HORN

ymanton · 2017-02-14T06:51:36Z

@milseman,

@ymanton, does this PR exhibit different behavior if the compared strings are canonicalized differently? That is, if their normal forms differ or they are not even in a normal form? To use the collation TR's example, do strings containing the following equivalent graphemes compare equivalently no matter how they are expressed?

The given examples all compare equally to one another. Only two of the five exercise this PR. ự, ụ ◌̛ and ư ◌̣ all have unique leading code points so the UCA is used right off the bat. Only u ◌̛ ◌̣ and u ◌̣ ◌̛ are interesting here and for these two the first code point matches via memcmp and the remaining code points are passed on to the UCA where ◌̛ ◌̣ and ◌̣ ◌̛ come out equal as well, as expected.

milseman · 2017-02-14T19:22:20Z

Ah, the case I was most worried about was actually not equality but preserving < when normal forms differ and we're breaking in the middle of a grapheme.

For example, is there some x, y, z, and w such that:

UCA states that xyw < xzw, but either z < y or zw < yw.

In the case of z < y, if the strings are presented as xwy and xwz, then UCA breaking at the z and y would result in a different sort order than the whole grapheme.

In the case of zw < yw, if the strings are presented as xyw and xzw, then UCA breaking at the yw and zw would result in a different sort order than the whole grapheme.

x, y, z and w above need not be confined to only unicode code points, but might be arbitrary sequences of code points while still inside a grapheme.

If so, then running UCA comparison while breaking within a grapheme would be incorrect.

ematejska · 2017-02-20T20:33:19Z

@swift-ci please test

swift-ci · 2017-02-20T20:48:31Z

Build failed
Jenkins build - Swift Test OS X Platform
Git Commit - f2c21d34b164f8d146f37225ebbda07b01ef57b7
Test requested by - @ematejska

swift-ci · 2017-02-20T20:54:16Z

Build failed
Jenkins build - Swift Test Linux Platform
Git Commit - f2c21d34b164f8d146f37225ebbda07b01ef57b7
Test requested by - @ematejska

tkremenek · 2017-02-20T21:15:12Z

@swift-ci clean test

swift-ci · 2017-02-20T21:31:31Z

Build failed
Jenkins build - Swift Test OS X Platform
Git Commit - f2c21d34b164f8d146f37225ebbda07b01ef57b7
Test requested by - @tkremenek

ymanton · 2017-02-20T22:07:04Z

Fixed the OS X build break.

tkremenek · 2017-02-20T22:07:47Z

@swift-ci test

swift-ci · 2017-02-20T22:08:04Z

Build failed
Jenkins build - Swift Test Linux Platform
Git Commit - f2c21d34b164f8d146f37225ebbda07b01ef57b7
Test requested by - @tkremenek

swift-ci · 2017-02-20T22:08:09Z

Build failed
Jenkins build - Swift Test OS X Platform
Git Commit - f2c21d34b164f8d146f37225ebbda07b01ef57b7
Test requested by - @tkremenek

ymanton · 2017-02-20T22:17:47Z

@milseman, I think we are safe on this point for most cases, particularly for combining characters, because grapheme sort keys are in most cases composed via concatenation. The code points x, y, z, y have sort keys k(x), k(y), k(z), k(w). The grapheme clusters xyw and xzw have the sort keys k(x)+k(y)+k(w) and k(x)+k(z)+k(w) respectively -- i.e. the grapheme cluster's sort key is the concatenation of the code point sort keys. This is a bit of a simplification as UCA sorting is weighted, but I think the end result is the same none the less.

For xyw < xzw to be true k(y) < k(z) has to be true, since the other two keys and their relative positions are the same, which means that z < y or zw < yw cannot be true.

The one place where we can run into problems is with contractions, as @jrose-apple mentioned a while back. This is where the characters ch form a single collation element who's sort key k(ch) is not a combination of the individual characters' sort keys k(c) and k(h). For this grapheme cluster we will indeed get different results if we invoke UCA in the middle of the grapheme. Unfortunately, such contractions are not limited to localized sorting, as the DUCET contains many such contractions. To deal with this we can explore some of the earlier suggestions that came up as the workaround is effectively the same, we want to avoid invoking the UCA in the middle of a contraction.

dabrahams · 2017-02-21T01:43:28Z

Sent from my moss-covered three-handled family gradunza

On Feb 14, 2017, at 9:23 AM, Michael Ilseman ***@***.***> wrote: Ah, the case I was most worried about was actually not equality but preserving < when normal forms differ and we're breaking in the middle of a grapheme. For example, is there some x``y``z``w such that: UCA states that xyw < xzw, but either z < y or zw < 'yw'. In the case of z < y, if the strings are presented as xwy and xwz, then UCA breaking at the z and y would result in a different sort order than the whole grapheme. In the case of zw < yw, if the strings are presented as xyw and xzw, then UCA breaking at the yw and zw would result in a different sort order than the whole grapheme. x``y``z and w above need not be confined to only unicode code points, but might be arbitrary sequences of code points while still inside a grapheme. If so, then breaking comparison at the z would produce the wrong sort order.

You don't need to actually find such an example to prove there is a problem; it is sufficient that the UCA plus the kinds of things expressible by collation tables are able to produce such a result, and IIUC, they are.

…

ymanton · 2017-02-27T20:49:47Z

I went ahead and put together a test case that runs through the appropriate UCA conformance test (for variable weighting set to non-ignorable). Once I was satisfied that the test was correct and that current Swift 3.1 passed all 145996 comparisons I ran it again with this PR. This uncovered a few hundred failures caused by surrogate code points. In the discussion up to this point I had forgotten that we'd be dealing with UTF16 encoded strings and that surrogates would come into play, but that's a relatively easy problem to deal with. Excluding failures involving surrogates we had 14 failures, all involving contractions and failing as expected.

So in order to properly order contractions and surrogate pairs we can't invoke the UCA with UTF16 strings that start in the middle of a contraction or surrogate pair. Rather than carry out a lot of expensive operations to figure out if we're in the middle of a contraction or surrogate pair, we simply step back a fixed number of code units, equal to the longest possible contraction, or the length of a surrogate pair (2), whichever is greater, minus 1 (while taking care that we don't step back past the start of the strings). This should produce a correct result at the cost of re-comparing a few characters that we know are equal, which is likely much cheaper than calculating a more precise number of code units to step back.

In order to facilitate this I added a new routine to find the longest possible contraction defined by the root collator and cached its length in a static global. The search should therefore only be performed once-ish however there is no synchronization here to guarantee that. I believe the routine is thread-safe and idempotent and should produce the same results even if more than one thread manages to sneak in during the initialization window, nevertheless I'd welcome more scrutiny on this or any other part of the patch.

With the above this PR passes the conformance test cleanly.

(Side note: I tried writing the conformance test in the form of a test case should we want it to be part of the regression suite, however it uses Foundation to parse the input file and Foundation seems to be unavailable to test cases on Linux (?), so for now it's a standalone program.)

tkremenek · 2017-03-03T22:44:18Z

@swift-ci test

swift-ci · 2017-03-03T22:44:37Z

Build failed
Jenkins build - Swift Test OS X Platform
Git Commit - ad82190bafda48a0fa6790bd6881183da0fe7e71
Test requested by - @tkremenek

swift-ci · 2017-03-03T22:45:52Z

Build failed
Jenkins build - Swift Test Linux Platform
Git Commit - ad82190bafda48a0fa6790bd6881183da0fe7e71
Test requested by - @tkremenek

airspeedswift · 2017-03-08T01:34:20Z

Hi @ymanton – it looks like this isn't building for OS X still, and also needs a bit more discussion and alignment with the changes @milseman is planning on making for the Swift 4 string comparison, so is probably getting late for 3.1. Do you think you could rebase the PR onto master? Generally, all changes should be done on master, and then cherry-picked onto a release branch if needed.

ymanton · 2017-03-08T21:19:29Z

Do you think you could rebase the PR onto master?

Will do. I've gotten access to an OS X machine finally so I'll rebase on master and make sure it builds and runs cleanly everywhere in short order.

(On a side note, it would be nice to get this into 3.1 if possible. The performance we can recover, particularly for ASCII and especially on Ubuntu 14.04, would put Swift 3.1 in a better place until Swift 4 lands in my humble opinion.)

This patch improves String comparison performance by introducing a memcmp-like helper routine that will perform a binary comparison on two buffers of equal length N and return the offset of the first difference if any, or N if they contain no differences. The helper is invoked from String operators == (for non-Objective-C runtimes only) and < (for all runtimes) when possible, the critera being that the Strings must have the same element size and must use contiguous storage. If we detect a mismatch in the binary representation of the Strings we fall back to using the ICU routines to determine the result of the comparison. Strings of differing binary length are handled by invoking the helper to check up to the length of the shorter string and falling back to ICU to deal with the remainder. As a special case for UTF16 Strings, before falling back to ICU to handle mismatched sub-strings, we take care to insure that we don't pass sub-strings that begin in the middle of a collation element.

ymanton · 2017-03-10T16:46:18Z

Rebased against master and fixed the OS X issues.

ymanton · 2017-03-13T18:14:20Z

I modified the conformance test to run without Foundation on Linux, but unfortunately (independent of this PR) I don't think it will make a good test case. First, you need to use the input file that was packaged with the ICU source of the version of ICU you're building against; that's the only input that seems to have any hope of passing the test. (For example, on Ubuntu 16.04 with ICU 55 only the input file from ICU 55 passes. Input from ICU 52 and 58 both fail.) This is because the input file changes to account for bug fixes and changes in the default collation table. Second, even a matching input file is not guaranteed to pass, again because of bugs. (For example, on 14.04 with ICU 52 even the input file from ICU 52 fails.) The conformance test included with ICU contains code to check for and ignore expected failures.

In short, I think this test is only useful when run manually, against a corresponding input file the user manually retrieves, and only to do before/after testing to see how the number of failures changes. Anyhow, here's the code for anyone who finds it useful in the future.

https://gist.github.com/ymanton/78cc08d805b6e6db65dbed6d43a6eb4a

ymanton · 2017-03-21T14:36:13Z

Any comment? Can I bug someone to run the regression tests?

CodaFi · 2017-03-21T15:11:29Z

@swift-ci please test

swift-ci · 2017-03-21T15:21:02Z

Build failed
Jenkins build - Swift Test Linux Platform
Git Commit - 7a8ba2d
Test requested by - @CodaFi

CodaFi · 2017-03-21T15:26:09Z

The SIL thing isn't yours.

djones6 · 2017-03-27T09:52:57Z

What's the next step here - does another build need to be requested?

CodaFi · 2017-03-27T15:06:04Z

@swift-ci please test Linux platform

ymanton · 2017-04-19T19:19:02Z

Ping ✋

milseman · 2017-04-19T20:36:33Z

@airspeedswift while we will probably change and unify the sort order semantics for String in Swift 4, do you want to take a change like this for Linux in Swift 3?

ymanton · 2017-05-17T16:28:59Z

Any decision on this?

milseman · 2017-05-17T16:45:55Z

@airspeedswift ping?

While we will probably change and unify the sort order semantics for String in Swift 4, do you want to take a change like this for Linux in Swift 3? Alternatively, would a modified approach (such as mentioned earlier in this thread) be acceptable?

CodaFi · 2017-09-14T16:53:12Z

This patch has fallen behind, still doesn't have tests, and may not make sense given the new string model. Is it still worthwhile to have? Should it be closed until we can figure out a more comprehensive fix later?

@milseman @airspeedswift

swift-ci · 2017-09-14T16:53:44Z

Build failed
Swift Test Linux Platform
Git Sha - 7a8ba2d

CodaFi · 2017-09-14T16:56:43Z

(Oh hai Swift-ci)

swift-ci · 2017-09-14T17:09:42Z

Build failed
Swift Test OS X Platform
Git Sha - 7a8ba2d

CodaFi · 2017-09-22T19:51:40Z

Going to close this due to age. Please feel free to reopen it @ymanton.

ymanton force-pushed the swift-3.1-string-memcmp branch from f2c21d3 to ad82190 Compare February 20, 2017 22:06

ymanton force-pushed the swift-3.1-string-memcmp branch 2 times, most recently from 16190fb to 7954ce5 Compare February 27, 2017 20:46

ymanton force-pushed the swift-3.1-string-memcmp branch from bd0f226 to 9044103 Compare February 28, 2017 15:54

ymanton force-pushed the swift-3.1-string-memcmp branch from 9044103 to 7a8ba2d Compare March 10, 2017 16:42

ymanton changed the base branch from swift-3.1-branch to master March 10, 2017 16:44

milseman requested a review from airspeedswift March 28, 2017 00:00

This was referenced Jul 31, 2017

Enable Foundation to build against a new libicu swiftlang/swift-corelibs-foundation#1144

Closed

Enable Foundation to build against a new libicu #11266

Closed

CodaFi closed this Sep 22, 2017

stdlib: Improve String comparison performance #7339

stdlib: Improve String comparison performance #7339

Uh oh!

Conversation

ymanton commented Feb 8, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

moiseev commented Feb 8, 2017

Uh oh!

moiseev commented Feb 8, 2017

Uh oh!

gottesmm commented Feb 8, 2017

Uh oh!

ymanton commented Feb 8, 2017

Uh oh!

jrose-apple commented Feb 9, 2017

Uh oh!

dabrahams commented Feb 9, 2017

Uh oh!

moiseev commented Feb 9, 2017

Uh oh!

ymanton commented Feb 10, 2017

Uh oh!

jrose-apple commented Feb 10, 2017

Collation order is not preserved under concatenation or substring operations, in general.

Uh oh!

jrose-apple commented Feb 10, 2017

Uh oh!

jrose-apple commented Feb 10, 2017

Uh oh!

dabrahams commented Feb 11, 2017 via email

Uh oh!

milseman commented Feb 13, 2017

Uh oh!

ymanton commented Feb 14, 2017

Uh oh!

milseman commented Feb 14, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ematejska commented Feb 20, 2017

Uh oh!

swift-ci commented Feb 20, 2017

Uh oh!

swift-ci commented Feb 20, 2017

Uh oh!

tkremenek commented Feb 20, 2017

Uh oh!

swift-ci commented Feb 20, 2017

Uh oh!

ymanton commented Feb 20, 2017

Uh oh!

tkremenek commented Feb 20, 2017

Uh oh!

swift-ci commented Feb 20, 2017

Uh oh!

swift-ci commented Feb 20, 2017

Uh oh!

ymanton commented Feb 20, 2017

Uh oh!

dabrahams commented Feb 21, 2017 via email

Uh oh!

ymanton commented Feb 27, 2017

Uh oh!

tkremenek commented Mar 3, 2017

Uh oh!

swift-ci commented Mar 3, 2017

Uh oh!

swift-ci commented Mar 3, 2017

Uh oh!

airspeedswift commented Mar 8, 2017

Uh oh!

ymanton commented Mar 8, 2017

Uh oh!

ymanton commented Mar 10, 2017

Uh oh!

ymanton commented Mar 13, 2017

Uh oh!

ymanton commented Mar 21, 2017

Uh oh!

CodaFi commented Mar 21, 2017

ymanton commented Feb 8, 2017 •

edited

Loading

milseman commented Feb 14, 2017 •

edited

Loading