Skip to content

stdlib: Improve String comparison performance #7339

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

ymanton
Copy link

@ymanton ymanton commented Feb 8, 2017

This patch attempts to recover some of the performance lost on Linux due to ef974af. A more detailed description is provided in the commit message, but the short summary is that we can use a memcmp-like approach to compare the binary representations of two strings until we hit a difference, after which we can fall back to the more expensive ICU routines. A memcmp-like routine, written in C, is provided which does a modest job of optimizing the comparison by comparing uintptr_t-sized blocks at a time when possible.

The performance lost due to ef974af strongly depends on which version of ICU Swift is built against; versions of ICU prior to ICU 53 are particularly bad when dealing with ASCII strings so in those environments ef974af is particularly costly. I've tested on both Ubuntu 14.04 (with ICU 52) and Ubuntu 16.04 (with ICU 55) with a simple micro-benchmark @djones6 came up with that tests various scenarios; results are as follows:

Ubuntu 16.04

Test Swift 3.1 Swift 3.1+patch 3.1 / 3.1+patch
ASCII lexically equal, binary equal 439.02 26 16.89
ASCII lexically not equal, same length 491.1 149.80 3.28
ASCII lexically not equal, different lengths 496.91 148.91 3.34
UTF16 lexically equal, binary equal 35.84 27.43 1.31
UTF16 lexically equal, binary not equal, same length 267.98 232.45 1.15
UTF16 lexically not equal, same length 265.44 227.90 1.16

Ubuntu 14.04

Test Swift 3.1 Swift 3.1+patch 3.1 / 3.1+patch
ASCII lexically equal, binary equal 8418.33 19.79 425.31
ASCII lexically not equal, same length 8038.33 400.19 20.07
ASCII lexically not equal, different lengths 8037.67 399.96 20.10
UTF16 lexically equal, binary equal 22.25 21.38 1.04
UTF16 lexically equal, binary not equal, same length 224.21 175.08 1.28
UTF16 lexically not equal, same length 220.91 171.83 1.29

(Measurements are elapsed time in seconds for 1 million comparisons)

I'm in the process of re-writing this micro-benchmark to conform to the conventions used in swift/benchmarks (as suggested by @dabrahams) however it seems that benchmarking is not supported on Linux and requires more than a few changes to the build system to get working so in the interest of getting early feedback on this patch itself I put together this PR. Any feedback would be appreciated.

@moiseev
Copy link
Contributor

moiseev commented Feb 8, 2017

/cc @dabrahams

@moiseev
Copy link
Contributor

moiseev commented Feb 8, 2017

@swift-ci Please test Linux platform

@gottesmm
Copy link
Contributor

gottesmm commented Feb 8, 2017

@ymanton One thing that we have discussed is switching the benchmark suite to build using swiftpm. I may have a patch around that does this. It would enable this to be built on Linux with some small modifications. Interested?

@ymanton
Copy link
Author

ymanton commented Feb 8, 2017

@gottesmm Sure, that would be helpful.

@jrose-apple
Copy link
Contributor

Is this correct? "abcdèfg" and "abcdëfg" have a common byte prefix of "abcde" (when in decomposed form, anyway), but would comparing starting with the combining character produce a different result than comparing starting with the base character?

@dabrahams
Copy link
Contributor

@ymanton Could you please write and executes a test case that answers @jrose-apple's question? Thanks

@moiseev
Copy link
Contributor

moiseev commented Feb 9, 2017

/cc @milseman

@ymanton
Copy link
Author

ymanton commented Feb 10, 2017

would comparing starting with the combining character produce a different result than comparing starting with the base character?

I put together a quick test that takes every composed character that is formed by the cross product of [a-zA-Z] x [U+0300-U+036F] (i.e. pairing every latin alphabetical character with every combining diacritical mark) and in the middle of "abcd" and "ef", then I compared every one of those strings against every other, both for equality and order -- the results were identical with and without this patch.

I can't claim to be a Unicode expert, and my attempts at reading through the Unicode Collation Algorithm spec in the hopes of finding statements that says this is legitimate have yet to succeed, but it seems to pass the sniff test anyway. I'll try to come up with something more thorough, perhaps utilizing the UCA conformance tests.

Having said that, here are some alternatives that come to mind:

  • On hitting a mismatch we can always restart ICU collation from offset 0, but this would make all unequal string comparisons slower than they are right now.
  • We could check if the mismatch involves simple characters (e.g. in the ASCII range) and only start ICU collation off at the mismatch offset in those cases. This would penalize a smaller subset of unequal string comparisons.
  • The inverse of the above would be checking if the mismatch involves combining or modifying characters only restarting ICU collation at offset 0 in those cases.
  • We could backtrack if the mismatch involves combining or modifying characters and find the first standalone character in order to restart ICU collation at that position. The backtracking itself might be fairly expensive in terms of tests per character, but I imagine in most cases you'd be backtracking past a small number of characters.

@jrose-apple
Copy link
Contributor

The Collation algorithm TR has this under "What Collation Is Not":

Collation order is not preserved under concatenation or substring operations, in general.

For example, the fact that x is less than y does not mean that x + z is less than y + z, because characters may form contractions across the substring or concatenation boundaries. In summary:

x < y does not imply that xz < yz
x < y does not imply that zx < zy
xz < yz does not imply that x < y
zx < zy does not imply that x < y

However, it's unclear whether this applies to non-localized collations. Further down, under "Default Unicode Collation Element Table":

In the Default Unicode Collation Element Table, contractions are necessary where a canonical decomposable character requires a distinct primary weight in the table, so that the canonical-equivalent character sequences are given the same weights. For example, Indic two-part vowels have primary weights as units, and their canonical-equivalent sequence of vowel parts must be given the same primary weight by means of a contraction entry in the table. The same applies to a number of precomposed Cyrillic characters with diacritic marks and to a small number of Arabic letters with madda or hamza marks.

The Default Unicode Collation Element Table is constructed to be consistent with the Unicode Normalization algorithm, and to respect the Unicode character properties. It is not, however, merely algorithmically derivable based on considerations of canonical equivalence and an inspection of character properties, because the assignment of levels also takes into account characteristics of particular scripts. For example, the combining marks generally have secondary collation elements; however, the Indic combining vowels are given non-zero Level 1 weights, because they are as significant in sorting as the consonants.

The simplest example where this would go wrong in a localized comparison is "abcde" vs. "abche" in Spanish locales, where "ch" is treated as a single letter collation element. It's a little unclear to me whether we'd be allowed to tailor the default table in such a way that we use no "contractions" to avoid this problem, and then secondarily unclear whether that would be desirable.

The mitigation of only doing this for ASCII mismatches is probably good enough, but someone would need to double-check that none of the multi-character collation element mappings in the default table have non-initial ASCII characters. I think that would be a sufficient condition but you or @dabrahams should probably check my reading.

I like the idea of having a little Swift program that can run through the UCA conformance tests. We could even add that to our validation tests.

@jrose-apple
Copy link
Contributor

Ah, by the way, please don't take my comments as dismissal or disrespect. I/We definitely appreciate you coming in with a patch to improve things. :-)

@jrose-apple
Copy link
Contributor

Ah, one workflow note: we usually take patches to the master branch first, and then cherry-pick them to the release branch after we've seen that there are no issues.

@dabrahams
Copy link
Contributor

dabrahams commented Feb 11, 2017 via email

@milseman
Copy link
Member

I am working on the comparison algorithm and semantics for String in Swift 4. However, if there is a semantics-preserving optimization, such as this PR, we should take it.

My main concern with this PR, because I am not a Unicode expert, is whether returning the offset of the first code unit difference is correct. I believe that returning the offset of the code unit that began the new grapheme would be. The tight loop could be adjusted to remember the offset of the last non-combining character and return that instead.

@ymanton, does this PR exhibit different behavior if the compared strings are canonicalized differently? That is, if their normal forms differ or they are not even in a normal form? To use the collation TR's example, do strings containing the following equivalent graphemes compare equivalently no matter how they are expressed?

ự	U+1EF1 LATIN SMALL LETTER U WITH HORN AND DOT BELOW
ụ ◌̛	U+1EE5 LATIN SMALL LETTER U WITH DOT BELOW, U+031B COMBINING HORN
u ◌̛ ◌̣	U+0075 LATIN SMALL LETTER U, U+031B COMBINING HORN, U+0323 COMBINING DOT BELOW
ư ◌̣	U+01B0 LATIN SMALL LETTER U WITH HORN, U+0323 COMBINING DOT BELOW
u ◌̣ ◌̛	U+0075 LATIN SMALL LETTER U, U+0323 COMBINING DOT BELOW, U+031B COMBINING HORN

@ymanton
Copy link
Author

ymanton commented Feb 14, 2017

@milseman,

@ymanton, does this PR exhibit different behavior if the compared strings are canonicalized differently? That is, if their normal forms differ or they are not even in a normal form? To use the collation TR's example, do strings containing the following equivalent graphemes compare equivalently no matter how they are expressed?

The given examples all compare equally to one another. Only two of the five exercise this PR. , ụ ◌̛ and ư ◌̣ all have unique leading code points so the UCA is used right off the bat. Only u ◌̛ ◌̣ and u ◌̣ ◌̛ are interesting here and for these two the first code point matches via memcmp and the remaining code points are passed on to the UCA where ◌̛ ◌̣ and ◌̣ ◌̛ come out equal as well, as expected.

@milseman
Copy link
Member

milseman commented Feb 14, 2017

Ah, the case I was most worried about was actually not equality but preserving < when normal forms differ and we're breaking in the middle of a grapheme.

For example, is there some x, y, z, and w such that:

UCA states that xyw < xzw, but either z < y or zw < yw.

In the case of z < y, if the strings are presented as xwy and xwz, then UCA breaking at the z and y would result in a different sort order than the whole grapheme.

In the case of zw < yw, if the strings are presented as xyw and xzw, then UCA breaking at the yw and zw would result in a different sort order than the whole grapheme.

x, y, z and w above need not be confined to only unicode code points, but might be arbitrary sequences of code points while still inside a grapheme.

If so, then running UCA comparison while breaking within a grapheme would be incorrect.

@ematejska
Copy link
Contributor

@swift-ci please test

@swift-ci
Copy link
Contributor

Build failed
Jenkins build - Swift Test OS X Platform
Git Commit - f2c21d34b164f8d146f37225ebbda07b01ef57b7
Test requested by - @ematejska

@swift-ci
Copy link
Contributor

Build failed
Jenkins build - Swift Test Linux Platform
Git Commit - f2c21d34b164f8d146f37225ebbda07b01ef57b7
Test requested by - @ematejska

@tkremenek
Copy link
Member

@swift-ci clean test

@swift-ci
Copy link
Contributor

Build failed
Jenkins build - Swift Test OS X Platform
Git Commit - f2c21d34b164f8d146f37225ebbda07b01ef57b7
Test requested by - @tkremenek

@ymanton ymanton force-pushed the swift-3.1-string-memcmp branch from f2c21d3 to ad82190 Compare February 20, 2017 22:06
@ymanton
Copy link
Author

ymanton commented Feb 20, 2017

Fixed the OS X build break.

@tkremenek
Copy link
Member

@swift-ci test

@swift-ci
Copy link
Contributor

Build failed
Jenkins build - Swift Test Linux Platform
Git Commit - f2c21d34b164f8d146f37225ebbda07b01ef57b7
Test requested by - @tkremenek

@swift-ci
Copy link
Contributor

Build failed
Jenkins build - Swift Test OS X Platform
Git Commit - f2c21d34b164f8d146f37225ebbda07b01ef57b7
Test requested by - @tkremenek

@ymanton
Copy link
Author

ymanton commented Feb 20, 2017

@milseman, I think we are safe on this point for most cases, particularly for combining characters, because grapheme sort keys are in most cases composed via concatenation. The code points x, y, z, y have sort keys k(x), k(y), k(z), k(w). The grapheme clusters xyw and xzw have the sort keys k(x)+k(y)+k(w) and k(x)+k(z)+k(w) respectively -- i.e. the grapheme cluster's sort key is the concatenation of the code point sort keys. This is a bit of a simplification as UCA sorting is weighted, but I think the end result is the same none the less.

For xyw < xzw to be true k(y) < k(z) has to be true, since the other two keys and their relative positions are the same, which means that z < y or zw < yw cannot be true.

The one place where we can run into problems is with contractions, as @jrose-apple mentioned a while back. This is where the characters ch form a single collation element who's sort key k(ch) is not a combination of the individual characters' sort keys k(c) and k(h). For this grapheme cluster we will indeed get different results if we invoke UCA in the middle of the grapheme. Unfortunately, such contractions are not limited to localized sorting, as the DUCET contains many such contractions. To deal with this we can explore some of the earlier suggestions that came up as the workaround is effectively the same, we want to avoid invoking the UCA in the middle of a contraction.

@dabrahams
Copy link
Contributor

dabrahams commented Feb 21, 2017 via email

@ymanton ymanton force-pushed the swift-3.1-string-memcmp branch 2 times, most recently from 16190fb to 7954ce5 Compare February 27, 2017 20:46
@ymanton
Copy link
Author

ymanton commented Feb 27, 2017

I went ahead and put together a test case that runs through the appropriate UCA conformance test (for variable weighting set to non-ignorable). Once I was satisfied that the test was correct and that current Swift 3.1 passed all 145996 comparisons I ran it again with this PR. This uncovered a few hundred failures caused by surrogate code points. In the discussion up to this point I had forgotten that we'd be dealing with UTF16 encoded strings and that surrogates would come into play, but that's a relatively easy problem to deal with. Excluding failures involving surrogates we had 14 failures, all involving contractions and failing as expected.

So in order to properly order contractions and surrogate pairs we can't invoke the UCA with UTF16 strings that start in the middle of a contraction or surrogate pair. Rather than carry out a lot of expensive operations to figure out if we're in the middle of a contraction or surrogate pair, we simply step back a fixed number of code units, equal to the longest possible contraction, or the length of a surrogate pair (2), whichever is greater, minus 1 (while taking care that we don't step back past the start of the strings). This should produce a correct result at the cost of re-comparing a few characters that we know are equal, which is likely much cheaper than calculating a more precise number of code units to step back.

In order to facilitate this I added a new routine to find the longest possible contraction defined by the root collator and cached its length in a static global. The search should therefore only be performed once-ish however there is no synchronization here to guarantee that. I believe the routine is thread-safe and idempotent and should produce the same results even if more than one thread manages to sneak in during the initialization window, nevertheless I'd welcome more scrutiny on this or any other part of the patch.

With the above this PR passes the conformance test cleanly.

(Side note: I tried writing the conformance test in the form of a test case should we want it to be part of the regression suite, however it uses Foundation to parse the input file and Foundation seems to be unavailable to test cases on Linux (?), so for now it's a standalone program.)

@ymanton ymanton force-pushed the swift-3.1-string-memcmp branch from bd0f226 to 9044103 Compare February 28, 2017 15:54
@tkremenek
Copy link
Member

@swift-ci test

@swift-ci
Copy link
Contributor

swift-ci commented Mar 3, 2017

Build failed
Jenkins build - Swift Test OS X Platform
Git Commit - ad82190bafda48a0fa6790bd6881183da0fe7e71
Test requested by - @tkremenek

@swift-ci
Copy link
Contributor

swift-ci commented Mar 3, 2017

Build failed
Jenkins build - Swift Test Linux Platform
Git Commit - ad82190bafda48a0fa6790bd6881183da0fe7e71
Test requested by - @tkremenek

@airspeedswift
Copy link
Member

Hi @ymanton – it looks like this isn't building for OS X still, and also needs a bit more discussion and alignment with the changes @milseman is planning on making for the Swift 4 string comparison, so is probably getting late for 3.1. Do you think you could rebase the PR onto master? Generally, all changes should be done on master, and then cherry-picked onto a release branch if needed.

@ymanton
Copy link
Author

ymanton commented Mar 8, 2017

Do you think you could rebase the PR onto master?

Will do. I've gotten access to an OS X machine finally so I'll rebase on master and make sure it builds and runs cleanly everywhere in short order.

(On a side note, it would be nice to get this into 3.1 if possible. The performance we can recover, particularly for ASCII and especially on Ubuntu 14.04, would put Swift 3.1 in a better place until Swift 4 lands in my humble opinion.)

This patch improves String comparison performance by
introducing a memcmp-like helper routine that will perform
a binary comparison on two buffers of equal length N and
return the offset of the first difference if any, or N if
they contain no differences.

The helper is invoked from String operators == (for
non-Objective-C runtimes only) and < (for all runtimes)
when possible, the critera being that the Strings must have
the same element size and must use contiguous storage. If
we detect a mismatch in the binary representation of the
Strings we fall back to using the ICU routines to determine
the result of the comparison. Strings of differing binary
length are handled by invoking the helper to check up to
the length of the shorter string and falling back to ICU
to deal with the remainder.

As a special case for UTF16 Strings, before falling back
to ICU to handle mismatched sub-strings, we take care to
insure that we don't pass sub-strings that begin in the
middle of a collation element.
@ymanton ymanton force-pushed the swift-3.1-string-memcmp branch from 9044103 to 7a8ba2d Compare March 10, 2017 16:42
@ymanton ymanton changed the base branch from swift-3.1-branch to master March 10, 2017 16:44
@ymanton
Copy link
Author

ymanton commented Mar 10, 2017

Rebased against master and fixed the OS X issues.

@ymanton
Copy link
Author

ymanton commented Mar 13, 2017

I modified the conformance test to run without Foundation on Linux, but unfortunately (independent of this PR) I don't think it will make a good test case. First, you need to use the input file that was packaged with the ICU source of the version of ICU you're building against; that's the only input that seems to have any hope of passing the test. (For example, on Ubuntu 16.04 with ICU 55 only the input file from ICU 55 passes. Input from ICU 52 and 58 both fail.) This is because the input file changes to account for bug fixes and changes in the default collation table. Second, even a matching input file is not guaranteed to pass, again because of bugs. (For example, on 14.04 with ICU 52 even the input file from ICU 52 fails.) The conformance test included with ICU contains code to check for and ignore expected failures.

In short, I think this test is only useful when run manually, against a corresponding input file the user manually retrieves, and only to do before/after testing to see how the number of failures changes. Anyhow, here's the code for anyone who finds it useful in the future.

https://gist.github.com/ymanton/78cc08d805b6e6db65dbed6d43a6eb4a

@ymanton
Copy link
Author

ymanton commented Mar 21, 2017

Any comment? Can I bug someone to run the regression tests?

@CodaFi
Copy link
Contributor

CodaFi commented Mar 21, 2017

@swift-ci please test

@swift-ci
Copy link
Contributor

Build failed
Jenkins build - Swift Test Linux Platform
Git Commit - 7a8ba2d
Test requested by - @CodaFi

@CodaFi
Copy link
Contributor

CodaFi commented Mar 21, 2017

The SIL thing isn't yours.

@djones6
Copy link

djones6 commented Mar 27, 2017

What's the next step here - does another build need to be requested?

@CodaFi
Copy link
Contributor

CodaFi commented Mar 27, 2017

@swift-ci please test Linux platform

@milseman milseman requested a review from airspeedswift March 28, 2017 00:00
@ymanton
Copy link
Author

ymanton commented Apr 19, 2017

Ping ✋

@milseman
Copy link
Member

@airspeedswift while we will probably change and unify the sort order semantics for String in Swift 4, do you want to take a change like this for Linux in Swift 3?

@ymanton
Copy link
Author

ymanton commented May 17, 2017

Any decision on this?

@milseman
Copy link
Member

@airspeedswift ping?

While we will probably change and unify the sort order semantics for String in Swift 4, do you want to take a change like this for Linux in Swift 3? Alternatively, would a modified approach (such as mentioned earlier in this thread) be acceptable?

@CodaFi
Copy link
Contributor

CodaFi commented Sep 14, 2017

This patch has fallen behind, still doesn't have tests, and may not make sense given the new string model. Is it still worthwhile to have? Should it be closed until we can figure out a more comprehensive fix later?

@milseman @airspeedswift

@swift-ci
Copy link
Contributor

Build failed
Swift Test Linux Platform
Git Sha - 7a8ba2d

@CodaFi
Copy link
Contributor

CodaFi commented Sep 14, 2017

(Oh hai Swift-ci)

@swift-ci
Copy link
Contributor

Build failed
Swift Test OS X Platform
Git Sha - 7a8ba2d

@CodaFi
Copy link
Contributor

CodaFi commented Sep 22, 2017

Going to close this due to age. Please feel free to reopen it @ymanton.

@CodaFi CodaFi closed this Sep 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.