Skip to content

String bridging and internal NUL [SR-2225], take four #1253

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

xwu
Copy link
Contributor

@xwu xwu commented Oct 8, 2017

This PR is a rebased version of #604, itself a rebased version of #549, a revision to #496.
It resolves SR-2225 and, apparently, rdar://problem/19034601.

What's going on in this PR

Existing bridging is buggy

The existing implementation of bridging from _NSCFString to String was broken in several ways:

  1. The "fast path" blew away internal NUL characters because it used facilities for C strings.
  2. Merely removing the "fast path" strangely caused tests to fail. Debugging revealed that the reason for failing tests was that the default "slow path" was also broken.
  3. The "slow path" used CFStringGetLength() and CFStringGetCharacters(). However, if the CFString in question contains multi-byte characters and was initialized using bytes, then neither of these CF functions behaves as documented:
    • CFStringGetLength() seems to return the length in bytes, not the length in UTF16 code pairs.
    • CFStringGetCharacters() seems to return the bytes, not the UTF16 code pairs.

As a consequence, if UTF8-encoded bytes that encode internal NUL characters and multi-byte characters were used to initialize an NSString, the "fast path" would clobber bridging in one way, and the "slow path" would clobber bridging in a different way.

Revised bridging

Since CF UTF16-related functions are erratic, I use a buffer to store UTF8-encoded bytes. If some way can be found to determine (reliably, not relying on the result of a buggy implementation in CF) the length in bytes of a CFString that uses UTF8-encoded bytes for storage, then a fast path can be restored that doesn't require allocating a buffer.

Modified tests

After implementing this revised bridging, several tests began to fail. On further examination, these tests were found to be incorrect because they relied on multiple tandem NUL characters being lost on bridging. On a Mac, Darwin Foundation fails these tests as well. I have corrected the tests.

This PR additionally adds a test for constructing an NSString with data that contains an internal NUL character.

Miscellaneous changes

This PR rolls in the previously discussed (#496) change in init?(data:encoding:).

@xwu
Copy link
Contributor Author

xwu commented Oct 8, 2017

/cc @phausler @parkera

@spevans
Copy link
Contributor

spevans commented Oct 8, 2017

@swift-ci please test

@xwu
Copy link
Contributor Author

xwu commented Oct 8, 2017

@spevans Interesting discrepancy in the type of kCFStringEncodingUTF8 between Linux and Darwin; pretty sure that it wasn't there last year. Should build now--please trigger tests again :)

@phausler
Copy link
Contributor

phausler commented Oct 8, 2017

Nah that type difference has been there since day 1 of swift-corelibs sadly

@phausler
Copy link
Contributor

phausler commented Oct 8, 2017

I have a feeling this is not the full fix btw the problems you outlined about the CF functions are likely incorrect abstractions due to the lack of factory patterns: where on Darwin we create strings much differently than how it is done here: I need to take a further look at this change to understand more of what is going on (will take a further review of it on Monday)

@spevans
Copy link
Contributor

spevans commented Oct 8, 2017

@swift-ci please test

// FIXME: If we had a reliable way to determine the length of `cf`
// in bytes, then we wouldn't have to allocate a new buffer
// if `CFStringGetCStringPtr(_:_:)` doesn't return nil.
if let buffer = CFStringGetCharactersPtr(cf) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically:

CFStringRef str;
const UniChar *chars;
 
str = CFStringCreateWithCString(NULL, "Hello World", kCFStringEncodingMacRoman);
chars = CFStringGetCharactersPtr(str);
if (chars == NULL) {
    CFIndex length = CFStringGetLength(str);
    UniChar *buffer = malloc(length * sizeof(UniChar));
    CFStringGetCharacters(str, CFRangeMake(0, length), buffer);
    // Process the characters...
    free(buffer);
}

Copy link
Contributor Author

@xwu xwu Oct 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I write above in the PR description, this recommended pattern cannot be used because CFStringGetCharacters is buggy and returns bytes instead of UTF16 characters if the underlying storage is UTF8-based; attempting to use the recommended pattern clobbers multibyte Unicode strings. We must here check if CFStringGetCharactersPtr returns NULL, and if so, retrieve UTF8-encoded bytes instead.

let str = CFStringGetCStringPtr(cf, CFStringEncoding(kCFStringEncodingUTF8))
if str != nil {
result = String(cString: str!)
let length = CFStringGetLength(cf) // This value is not always the length in characters, in spite of what the documentation says
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the disconnect here comes from the Foundation definition of characters (UTF16 code points) vs the one Swift uses (grapheme clusters).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a rebase of something I worked on over a year ago, so memories are no longer fresh, but here's what I recall from that time:

CFStringGetLength is simply buggy on its own terms. That is, it's not a UTF16 vs. grapheme clusters issue. It's that CFStringGetLength returns length in bytes if the underlying storage is UTF8-based, which is plainly contrary to its documented behavior. This buggy behavior is convenient in that we could actually rely on it for a bridging fast path here that avoids allocating yet another buffer, but then we would be relying on the buggy behavior of CFStringGetLength never being fixed. That seems unwise to me, and I have avoided doing so here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not the behavior on Darwin, can you provide an example of how this is failing or behaving differently on Linux? if so it is definitely a more serious bug than you are attempting to address here and we should fix that first before hand imho.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's going to take me a bit of time to reconstruct the test cases that led to that conclusion when I last looked into this in 2016. However, it was very much the same behavior on Darwin and on Linux at the time. The gist of how I came to that conclusion was as follows.

The existing bridging code is:

// ...
        else if type(of: source) == _NSCFString.self {
            let cf = unsafeBitCast(source, to: CFString.self)
            let str = CFStringGetCStringPtr(cf, CFStringEncoding(kCFStringEncodingUTF8)) // (A)
            if str != nil { // (B)
                result = String(cString: str!)
            } else { // (C)
                let length = CFStringGetLength(cf)
                let buffer = UnsafeMutablePointer<UniChar>.allocate(capacity: length)
                CFStringGetCharacters(cf, CFRangeMake(0, length), buffer)
                
                let str = String._fromCodeUnitSequence(UTF16.self, input: UnsafeBufferPointer(start: buffer, count: length))
                buffer.deinitialize(count: length)
                buffer.deallocate(capacity: length)
                result = str
            }
        }

(A) and (B) clearly blow away any internal NUL characters. By contrast, (C) does not. Removing (A) and (B) should result in inefficient but correct bridging.

However, removing (A) and (B), leaving (C) alone, actually caused a bunch of Foundation tests to fail, even on macOS. When I looked into the cause of the failures, I found that in certain circumstances when CFStringGetCStringPtr doesn't return nil, CFStringGetLength and CFStringGetCharacters return incorrect results that cause result to be a clobbered version of source. Those circumstances only occurred when source had multibyte characters.

@@ -1373,8 +1373,8 @@ extension TestJSONSerialization {
outputStream.open()
let result = try JSONSerialization.writeJSONObject(dict, toStream: outputStream, options: [])
outputStream.close()
if(result > -1) {
XCTAssertEqual(NSString(bytes: buffer, length: buffer.count, encoding: String.Encoding.utf8.rawValue), "{\"a\":{\"b\":1}}")
if result > -1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave these diffs out of this PR for clarity.

Copy link
Contributor Author

@xwu xwu Oct 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parentheses here are trivial, but the change in the subsequent line is required for the test to pass as the test itself is buggy. We discussed this in #604 a year ago.

@alblue
Copy link
Contributor

alblue commented Nov 10, 2017

There's a conflict in the test script. Can we rebase this and revisit the change, or should we close it for now?

@xwu
Copy link
Contributor Author

xwu commented Nov 11, 2017

I will rebase and investigate outstanding reviewer questions shortly.

@alblue
Copy link
Contributor

alblue commented Nov 28, 2017

Closing, since this has gone stale - please reopen it once rebased and issues resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants