String bridging and internal NUL [SR-2225], take four #1253

xwu · 2017-10-08T16:31:05Z

This PR is a rebased version of #604, itself a rebased version of #549, a revision to #496.
It resolves SR-2225 and, apparently, rdar://problem/19034601.

What's going on in this PR

Existing bridging is buggy

The existing implementation of bridging from _NSCFString to String was broken in several ways:

The "fast path" blew away internal NUL characters because it used facilities for C strings.
Merely removing the "fast path" strangely caused tests to fail. Debugging revealed that the reason for failing tests was that the default "slow path" was also broken.
The "slow path" used CFStringGetLength() and CFStringGetCharacters(). However, if the CFString in question contains multi-byte characters and was initialized using bytes, then neither of these CF functions behaves as documented:
- CFStringGetLength() seems to return the length in bytes, not the length in UTF16 code pairs.
- CFStringGetCharacters() seems to return the bytes, not the UTF16 code pairs.

As a consequence, if UTF8-encoded bytes that encode internal NUL characters and multi-byte characters were used to initialize an NSString, the "fast path" would clobber bridging in one way, and the "slow path" would clobber bridging in a different way.

Revised bridging

Since CF UTF16-related functions are erratic, I use a buffer to store UTF8-encoded bytes. If some way can be found to determine (reliably, not relying on the result of a buggy implementation in CF) the length in bytes of a CFString that uses UTF8-encoded bytes for storage, then a fast path can be restored that doesn't require allocating a buffer.

Modified tests

After implementing this revised bridging, several tests began to fail. On further examination, these tests were found to be incorrect because they relied on multiple tandem NUL characters being lost on bridging. On a Mac, Darwin Foundation fails these tests as well. I have corrected the tests.

This PR additionally adds a test for constructing an NSString with data that contains an internal NUL character.

Miscellaneous changes

This PR rolls in the previously discussed (#496) change in init?(data:encoding:).

xwu · 2017-10-08T16:35:09Z

/cc @phausler @parkera

spevans · 2017-10-08T17:00:13Z

@swift-ci please test

xwu · 2017-10-08T18:49:51Z

@spevans Interesting discrepancy in the type of kCFStringEncodingUTF8 between Linux and Darwin; pretty sure that it wasn't there last year. Should build now--please trigger tests again :)

phausler · 2017-10-08T18:56:20Z

Nah that type difference has been there since day 1 of swift-corelibs sadly

phausler · 2017-10-08T18:58:13Z

I have a feeling this is not the full fix btw the problems you outlined about the CF functions are likely incorrect abstractions due to the lack of factory patterns: where on Darwin we create strings much differently than how it is done here: I need to take a further look at this change to understand more of what is going on (will take a further review of it on Monday)

spevans · 2017-10-08T19:08:50Z

@swift-ci please test

parkera · 2017-10-09T18:24:32Z

Foundation/String.swift

+            // FIXME: If we had a reliable way to determine the length of `cf`
+            //        in bytes, then we wouldn't have to allocate a new buffer
+            //        if `CFStringGetCStringPtr(_:_:)` doesn't return nil.
+            if let buffer = CFStringGetCharactersPtr(cf) {


The recommended pattern is documented here:

https://developer.apple.com/library/content/documentation/CoreFoundation/Conceptual/CFStrings/Articles/AccessingContents.html

Specifically:

CFStringRef str; const UniChar *chars; str = CFStringCreateWithCString(NULL, "Hello World", kCFStringEncodingMacRoman); chars = CFStringGetCharactersPtr(str); if (chars == NULL) { CFIndex length = CFStringGetLength(str); UniChar *buffer = malloc(length * sizeof(UniChar)); CFStringGetCharacters(str, CFRangeMake(0, length), buffer); // Process the characters... free(buffer); }

As I write above in the PR description, this recommended pattern cannot be used because CFStringGetCharacters is buggy and returns bytes instead of UTF16 characters if the underlying storage is UTF8-based; attempting to use the recommended pattern clobbers multibyte Unicode strings. We must here check if CFStringGetCharactersPtr returns NULL, and if so, retrieve UTF8-encoded bytes instead.

parkera · 2017-10-09T18:25:09Z

Foundation/String.swift

-            let str = CFStringGetCStringPtr(cf, CFStringEncoding(kCFStringEncodingUTF8))
-            if str != nil {
-                result = String(cString: str!)
+            let length = CFStringGetLength(cf) // This value is not always the length in characters, in spite of what the documentation says


I think the disconnect here comes from the Foundation definition of characters (UTF16 code points) vs the one Swift uses (grapheme clusters).

This is a rebase of something I worked on over a year ago, so memories are no longer fresh, but here's what I recall from that time:

CFStringGetLength is simply buggy on its own terms. That is, it's not a UTF16 vs. grapheme clusters issue. It's that CFStringGetLength returns length in bytes if the underlying storage is UTF8-based, which is plainly contrary to its documented behavior. This buggy behavior is convenient in that we could actually rely on it for a bridging fast path here that avoids allocating yet another buffer, but then we would be relying on the buggy behavior of CFStringGetLength never being fixed. That seems unwise to me, and I have avoided doing so here.

That is not the behavior on Darwin, can you provide an example of how this is failing or behaving differently on Linux? if so it is definitely a more serious bug than you are attempting to address here and we should fix that first before hand imho.

It's going to take me a bit of time to reconstruct the test cases that led to that conclusion when I last looked into this in 2016. However, it was very much the same behavior on Darwin and on Linux at the time. The gist of how I came to that conclusion was as follows.

The existing bridging code is:

// ... else if type(of: source) == _NSCFString.self { let cf = unsafeBitCast(source, to: CFString.self) let str = CFStringGetCStringPtr(cf, CFStringEncoding(kCFStringEncodingUTF8)) // (A) if str != nil { // (B) result = String(cString: str!) } else { // (C) let length = CFStringGetLength(cf) let buffer = UnsafeMutablePointer<UniChar>.allocate(capacity: length) CFStringGetCharacters(cf, CFRangeMake(0, length), buffer) let str = String._fromCodeUnitSequence(UTF16.self, input: UnsafeBufferPointer(start: buffer, count: length)) buffer.deinitialize(count: length) buffer.deallocate(capacity: length) result = str } }

(A) and (B) clearly blow away any internal NUL characters. By contrast, (C) does not. Removing (A) and (B) should result in inefficient but correct bridging.

However, removing (A) and (B), leaving (C) alone, actually caused a bunch of Foundation tests to fail, even on macOS. When I looked into the cause of the failures, I found that in certain circumstances when CFStringGetCStringPtr doesn't return nil, CFStringGetLength and CFStringGetCharacters return incorrect results that cause result to be a clobbered version of source. Those circumstances only occurred when source had multibyte characters.

parkera · 2017-10-09T18:26:46Z

TestFoundation/TestJSONSerialization.swift

@@ -1373,8 +1373,8 @@ extension TestJSONSerialization {
            outputStream.open()
            let result = try JSONSerialization.writeJSONObject(dict, toStream: outputStream, options: [])
            outputStream.close()
-            if(result > -1) {
-                XCTAssertEqual(NSString(bytes: buffer, length: buffer.count, encoding: String.Encoding.utf8.rawValue), "{\"a\":{\"b\":1}}")
+            if result > -1 {


Let's leave these diffs out of this PR for clarity.

The parentheses here are trivial, but the change in the subsequent line is required for the test to pass as the test itself is buggy. We discussed this in #604 a year ago.

alblue · 2017-11-10T11:43:33Z

There's a conflict in the test script. Can we rebase this and revisit the change, or should we close it for now?

xwu · 2017-11-11T00:29:45Z

I will rebase and investigate outstanding reviewer questions shortly.

alblue · 2017-11-28T09:55:03Z

Closing, since this has gone stale - please reopen it once rebased and issues resolved.

xwu added 2 commits October 8, 2017 10:38

[SR-2225] String bridging and internal NUL

804a70e

Remove reason for failure in TestNSString comparison test

76e274f

Insert numeric casts for Linux

5ea91f8

parkera reviewed Oct 9, 2017

View reviewed changes

alblue closed this Nov 28, 2017

xwu mentioned this pull request Sep 2, 2018

SR-7455: Allow NUL in Strings to match Darwin #1677

Merged

String bridging and internal NUL [SR-2225], take four #1253

String bridging and internal NUL [SR-2225], take four #1253

Uh oh!

Conversation

xwu commented Oct 8, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What's going on in this PR

Existing bridging is buggy

Revised bridging

Modified tests

Miscellaneous changes

Uh oh!

xwu commented Oct 8, 2017

Uh oh!

spevans commented Oct 8, 2017

Uh oh!

xwu commented Oct 8, 2017

Uh oh!

phausler commented Oct 8, 2017

Uh oh!

phausler commented Oct 8, 2017

Uh oh!

spevans commented Oct 8, 2017

Uh oh!

parkera Oct 9, 2017

Choose a reason for hiding this comment

Uh oh!

parkera Oct 9, 2017

Choose a reason for hiding this comment

Uh oh!

xwu Oct 9, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

parkera Oct 9, 2017

Choose a reason for hiding this comment

Uh oh!

xwu Oct 9, 2017

Choose a reason for hiding this comment

Uh oh!

phausler Oct 9, 2017

Choose a reason for hiding this comment

Uh oh!

xwu Oct 10, 2017

Choose a reason for hiding this comment

Uh oh!

parkera Oct 9, 2017

Choose a reason for hiding this comment

Uh oh!

xwu Oct 9, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alblue commented Nov 10, 2017

Uh oh!

xwu commented Nov 11, 2017

Uh oh!

alblue commented Nov 28, 2017

Uh oh!

Uh oh!

xwu commented Oct 8, 2017 •

edited

Loading

xwu Oct 9, 2017 •

edited

Loading

xwu Oct 9, 2017 •

edited

Loading