-
Notifications
You must be signed in to change notification settings - Fork 10.5k
stdlib: Improve String comparison performance #7339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/cc @dabrahams |
@swift-ci Please test Linux platform |
@ymanton One thing that we have discussed is switching the benchmark suite to build using swiftpm. I may have a patch around that does this. It would enable this to be built on Linux with some small modifications. Interested? |
@gottesmm Sure, that would be helpful. |
Is this correct? "abcdèfg" and "abcdëfg" have a common byte prefix of "abcde" (when in decomposed form, anyway), but would comparing starting with the combining character produce a different result than comparing starting with the base character? |
@ymanton Could you please write and executes a test case that answers @jrose-apple's question? Thanks |
/cc @milseman |
I put together a quick test that takes every composed character that is formed by the cross product of I can't claim to be a Unicode expert, and my attempts at reading through the Unicode Collation Algorithm spec in the hopes of finding statements that says this is legitimate have yet to succeed, but it seems to pass the sniff test anyway. I'll try to come up with something more thorough, perhaps utilizing the UCA conformance tests. Having said that, here are some alternatives that come to mind:
|
The Collation algorithm TR has this under "What Collation Is Not":
However, it's unclear whether this applies to non-localized collations. Further down, under "Default Unicode Collation Element Table":
The simplest example where this would go wrong in a localized comparison is "abcde" vs. "abche" in Spanish locales, where "ch" is treated as a single The mitigation of only doing this for ASCII mismatches is probably good enough, but someone would need to double-check that none of the multi-character collation element mappings in the default table have non-initial ASCII characters. I think that would be a sufficient condition but you or @dabrahams should probably check my reading. I like the idea of having a little Swift program that can run through the UCA conformance tests. We could even add that to our validation tests. |
Ah, by the way, please don't take my comments as dismissal or disrespect. I/We definitely appreciate you coming in with a patch to improve things. :-) |
Ah, one workflow note: we usually take patches to the master branch first, and then cherry-pick them to the release branch after we've seen that there are no issues. |
Per the string manifesto, we really don't want to be using the UCA for non-localized comparisons anyway. However, the exact semantics of string compArison are most likely going to be changing soon And we should only change them once. Michael is working in this area. Michael, will you be able to use anything from this patch?
Sent from my moss-covered three-handled family gradunza
… On Feb 8, 2017, at 5:44 PM, Jordan Rose ***@***.***> wrote:
Is this correct? "abcdèfg" and "abcdëfg" have a common byte prefix of "abcde" (when in decomposed form, anyway), but would comparing starting with the combining character produce a different result than comparing starting with the base character?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
I am working on the comparison algorithm and semantics for String in Swift 4. However, if there is a semantics-preserving optimization, such as this PR, we should take it. My main concern with this PR, because I am not a Unicode expert, is whether returning the offset of the first code unit difference is correct. I believe that returning the offset of the code unit that began the new grapheme would be. The tight loop could be adjusted to remember the offset of the last non-combining character and return that instead. @ymanton, does this PR exhibit different behavior if the compared strings are canonicalized differently? That is, if their normal forms differ or they are not even in a normal form? To use the collation TR's example, do strings containing the following equivalent graphemes compare equivalently no matter how they are expressed?
|
The given examples all compare equally to one another. Only two of the five exercise this PR. |
Ah, the case I was most worried about was actually not equality but preserving For example, is there some UCA states that In the case of In the case of
If so, then running UCA comparison while breaking within a grapheme would be incorrect. |
@swift-ci please test |
Build failed |
Build failed |
@swift-ci clean test |
Build failed |
f2c21d3
to
ad82190
Compare
Fixed the OS X build break. |
@swift-ci test |
Build failed |
Build failed |
@milseman, I think we are safe on this point for most cases, particularly for combining characters, because grapheme sort keys are in most cases composed via concatenation. The code points For The one place where we can run into problems is with contractions, as @jrose-apple mentioned a while back. This is where the characters |
Sent from my moss-covered three-handled family gradunza
On Feb 14, 2017, at 9:23 AM, Michael Ilseman ***@***.***> wrote:
Ah, the case I was most worried about was actually not equality but preserving < when normal forms differ and we're breaking in the middle of a grapheme.
For example, is there some x``y``z``w such that:
UCA states that xyw < xzw, but either z < y or zw < 'yw'.
In the case of z < y, if the strings are presented as xwy and xwz, then UCA breaking at the z and y would result in a different sort order than the whole grapheme.
In the case of zw < yw, if the strings are presented as xyw and xzw, then UCA breaking at the yw and zw would result in a different sort order than the whole grapheme.
x``y``z and w above need not be confined to only unicode code points, but might be arbitrary sequences of code points while still inside a grapheme.
If so, then breaking comparison at the z would produce the wrong sort order.
You don't need to actually find such an example to prove there is a problem; it is sufficient that the UCA plus the kinds of things expressible by collation tables are able to produce such a result, and IIUC, they are.
… |
16190fb
to
7954ce5
Compare
I went ahead and put together a test case that runs through the appropriate UCA conformance test (for variable weighting set to non-ignorable). Once I was satisfied that the test was correct and that current Swift 3.1 passed all 145996 comparisons I ran it again with this PR. This uncovered a few hundred failures caused by surrogate code points. In the discussion up to this point I had forgotten that we'd be dealing with UTF16 encoded strings and that surrogates would come into play, but that's a relatively easy problem to deal with. Excluding failures involving surrogates we had 14 failures, all involving contractions and failing as expected. So in order to properly order contractions and surrogate pairs we can't invoke the UCA with UTF16 strings that start in the middle of a contraction or surrogate pair. Rather than carry out a lot of expensive operations to figure out if we're in the middle of a contraction or surrogate pair, we simply step back a fixed number of code units, equal to the longest possible contraction, or the length of a surrogate pair (2), whichever is greater, minus 1 (while taking care that we don't step back past the start of the strings). This should produce a correct result at the cost of re-comparing a few characters that we know are equal, which is likely much cheaper than calculating a more precise number of code units to step back. In order to facilitate this I added a new routine to find the longest possible contraction defined by the root collator and cached its length in a static global. The search should therefore only be performed once-ish however there is no synchronization here to guarantee that. I believe the routine is thread-safe and idempotent and should produce the same results even if more than one thread manages to sneak in during the initialization window, nevertheless I'd welcome more scrutiny on this or any other part of the patch. With the above this PR passes the conformance test cleanly. (Side note: I tried writing the conformance test in the form of a test case should we want it to be part of the regression suite, however it uses Foundation to parse the input file and Foundation seems to be unavailable to test cases on Linux (?), so for now it's a standalone program.) |
bd0f226
to
9044103
Compare
@swift-ci test |
Build failed |
Build failed |
Hi @ymanton – it looks like this isn't building for OS X still, and also needs a bit more discussion and alignment with the changes @milseman is planning on making for the Swift 4 string comparison, so is probably getting late for 3.1. Do you think you could rebase the PR onto master? Generally, all changes should be done on master, and then cherry-picked onto a release branch if needed. |
Will do. I've gotten access to an OS X machine finally so I'll rebase on master and make sure it builds and runs cleanly everywhere in short order. (On a side note, it would be nice to get this into 3.1 if possible. The performance we can recover, particularly for ASCII and especially on Ubuntu 14.04, would put Swift 3.1 in a better place until Swift 4 lands in my humble opinion.) |
This patch improves String comparison performance by introducing a memcmp-like helper routine that will perform a binary comparison on two buffers of equal length N and return the offset of the first difference if any, or N if they contain no differences. The helper is invoked from String operators == (for non-Objective-C runtimes only) and < (for all runtimes) when possible, the critera being that the Strings must have the same element size and must use contiguous storage. If we detect a mismatch in the binary representation of the Strings we fall back to using the ICU routines to determine the result of the comparison. Strings of differing binary length are handled by invoking the helper to check up to the length of the shorter string and falling back to ICU to deal with the remainder. As a special case for UTF16 Strings, before falling back to ICU to handle mismatched sub-strings, we take care to insure that we don't pass sub-strings that begin in the middle of a collation element.
9044103
to
7a8ba2d
Compare
Rebased against master and fixed the OS X issues. |
I modified the conformance test to run without Foundation on Linux, but unfortunately (independent of this PR) I don't think it will make a good test case. First, you need to use the input file that was packaged with the ICU source of the version of ICU you're building against; that's the only input that seems to have any hope of passing the test. (For example, on Ubuntu 16.04 with ICU 55 only the input file from ICU 55 passes. Input from ICU 52 and 58 both fail.) This is because the input file changes to account for bug fixes and changes in the default collation table. Second, even a matching input file is not guaranteed to pass, again because of bugs. (For example, on 14.04 with ICU 52 even the input file from ICU 52 fails.) The conformance test included with ICU contains code to check for and ignore expected failures. In short, I think this test is only useful when run manually, against a corresponding input file the user manually retrieves, and only to do before/after testing to see how the number of failures changes. Anyhow, here's the code for anyone who finds it useful in the future. https://gist.github.com/ymanton/78cc08d805b6e6db65dbed6d43a6eb4a |
Any comment? Can I bug someone to run the regression tests? |
@swift-ci please test |
Build failed |
The SIL thing isn't yours. |
What's the next step here - does another build need to be requested? |
@swift-ci please test Linux platform |
Ping ✋ |
@airspeedswift while we will probably change and unify the sort order semantics for String in Swift 4, do you want to take a change like this for Linux in Swift 3? |
Any decision on this? |
@airspeedswift ping? While we will probably change and unify the sort order semantics for String in Swift 4, do you want to take a change like this for Linux in Swift 3? Alternatively, would a modified approach (such as mentioned earlier in this thread) be acceptable? |
This patch has fallen behind, still doesn't have tests, and may not make sense given the new string model. Is it still worthwhile to have? Should it be closed until we can figure out a more comprehensive fix later? |
Build failed |
(Oh hai Swift-ci) |
Build failed |
Going to close this due to age. Please feel free to reopen it @ymanton. |
This patch attempts to recover some of the performance lost on Linux due to ef974af. A more detailed description is provided in the commit message, but the short summary is that we can use a
memcmp
-like approach to compare the binary representations of two strings until we hit a difference, after which we can fall back to the more expensive ICU routines. Amemcmp
-like routine, written in C, is provided which does a modest job of optimizing the comparison by comparinguintptr_t
-sized blocks at a time when possible.The performance lost due to ef974af strongly depends on which version of ICU Swift is built against; versions of ICU prior to ICU 53 are particularly bad when dealing with ASCII strings so in those environments ef974af is particularly costly. I've tested on both Ubuntu 14.04 (with ICU 52) and Ubuntu 16.04 (with ICU 55) with a simple micro-benchmark @djones6 came up with that tests various scenarios; results are as follows:
Ubuntu 16.04
Ubuntu 14.04
(Measurements are elapsed time in seconds for 1 million comparisons)
I'm in the process of re-writing this micro-benchmark to conform to the conventions used in
swift/benchmarks
(as suggested by @dabrahams) however it seems that benchmarking is not supported on Linux and requires more than a few changes to the build system to get working so in the interest of getting early feedback on this patch itself I put together this PR. Any feedback would be appreciated.