-
Notifications
You must be signed in to change notification settings - Fork 1.2k
String bridging and internal NUL [SR-2225], take four #1253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@swift-ci please test |
@spevans Interesting discrepancy in the type of |
Nah that type difference has been there since day 1 of swift-corelibs sadly |
I have a feeling this is not the full fix btw the problems you outlined about the CF functions are likely incorrect abstractions due to the lack of factory patterns: where on Darwin we create strings much differently than how it is done here: I need to take a further look at this change to understand more of what is going on (will take a further review of it on Monday) |
@swift-ci please test |
// FIXME: If we had a reliable way to determine the length of `cf` | ||
// in bytes, then we wouldn't have to allocate a new buffer | ||
// if `CFStringGetCStringPtr(_:_:)` doesn't return nil. | ||
if let buffer = CFStringGetCharactersPtr(cf) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The recommended pattern is documented here:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically:
CFStringRef str;
const UniChar *chars;
str = CFStringCreateWithCString(NULL, "Hello World", kCFStringEncodingMacRoman);
chars = CFStringGetCharactersPtr(str);
if (chars == NULL) {
CFIndex length = CFStringGetLength(str);
UniChar *buffer = malloc(length * sizeof(UniChar));
CFStringGetCharacters(str, CFRangeMake(0, length), buffer);
// Process the characters...
free(buffer);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I write above in the PR description, this recommended pattern cannot be used because CFStringGetCharacters
is buggy and returns bytes instead of UTF16 characters if the underlying storage is UTF8-based; attempting to use the recommended pattern clobbers multibyte Unicode strings. We must here check if CFStringGetCharactersPtr
returns NULL
, and if so, retrieve UTF8-encoded bytes instead.
let str = CFStringGetCStringPtr(cf, CFStringEncoding(kCFStringEncodingUTF8)) | ||
if str != nil { | ||
result = String(cString: str!) | ||
let length = CFStringGetLength(cf) // This value is not always the length in characters, in spite of what the documentation says |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the disconnect here comes from the Foundation definition of characters (UTF16 code points) vs the one Swift uses (grapheme clusters).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a rebase of something I worked on over a year ago, so memories are no longer fresh, but here's what I recall from that time:
CFStringGetLength
is simply buggy on its own terms. That is, it's not a UTF16 vs. grapheme clusters issue. It's that CFStringGetLength
returns length in bytes if the underlying storage is UTF8-based, which is plainly contrary to its documented behavior. This buggy behavior is convenient in that we could actually rely on it for a bridging fast path here that avoids allocating yet another buffer, but then we would be relying on the buggy behavior of CFStringGetLength
never being fixed. That seems unwise to me, and I have avoided doing so here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is not the behavior on Darwin, can you provide an example of how this is failing or behaving differently on Linux? if so it is definitely a more serious bug than you are attempting to address here and we should fix that first before hand imho.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's going to take me a bit of time to reconstruct the test cases that led to that conclusion when I last looked into this in 2016. However, it was very much the same behavior on Darwin and on Linux at the time. The gist of how I came to that conclusion was as follows.
The existing bridging code is:
// ...
else if type(of: source) == _NSCFString.self {
let cf = unsafeBitCast(source, to: CFString.self)
let str = CFStringGetCStringPtr(cf, CFStringEncoding(kCFStringEncodingUTF8)) // (A)
if str != nil { // (B)
result = String(cString: str!)
} else { // (C)
let length = CFStringGetLength(cf)
let buffer = UnsafeMutablePointer<UniChar>.allocate(capacity: length)
CFStringGetCharacters(cf, CFRangeMake(0, length), buffer)
let str = String._fromCodeUnitSequence(UTF16.self, input: UnsafeBufferPointer(start: buffer, count: length))
buffer.deinitialize(count: length)
buffer.deallocate(capacity: length)
result = str
}
}
(A) and (B) clearly blow away any internal NUL characters. By contrast, (C) does not. Removing (A) and (B) should result in inefficient but correct bridging.
However, removing (A) and (B), leaving (C) alone, actually caused a bunch of Foundation tests to fail, even on macOS. When I looked into the cause of the failures, I found that in certain circumstances when CFStringGetCStringPtr
doesn't return nil
, CFStringGetLength
and CFStringGetCharacters
return incorrect results that cause result
to be a clobbered version of source
. Those circumstances only occurred when source
had multibyte characters.
@@ -1373,8 +1373,8 @@ extension TestJSONSerialization { | |||
outputStream.open() | |||
let result = try JSONSerialization.writeJSONObject(dict, toStream: outputStream, options: []) | |||
outputStream.close() | |||
if(result > -1) { | |||
XCTAssertEqual(NSString(bytes: buffer, length: buffer.count, encoding: String.Encoding.utf8.rawValue), "{\"a\":{\"b\":1}}") | |||
if result > -1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave these diffs out of this PR for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parentheses here are trivial, but the change in the subsequent line is required for the test to pass as the test itself is buggy. We discussed this in #604 a year ago.
There's a conflict in the test script. Can we rebase this and revisit the change, or should we close it for now? |
I will rebase and investigate outstanding reviewer questions shortly. |
Closing, since this has gone stale - please reopen it once rebased and issues resolved. |
This PR is a rebased version of #604, itself a rebased version of #549, a revision to #496.
It resolves SR-2225 and, apparently, rdar://problem/19034601.
What's going on in this PR
Existing bridging is buggy
The existing implementation of bridging from _NSCFString to String was broken in several ways:
As a consequence, if UTF8-encoded bytes that encode internal NUL characters and multi-byte characters were used to initialize an NSString, the "fast path" would clobber bridging in one way, and the "slow path" would clobber bridging in a different way.
Revised bridging
Since CF UTF16-related functions are erratic, I use a buffer to store UTF8-encoded bytes. If some way can be found to determine (reliably, not relying on the result of a buggy implementation in CF) the length in bytes of a CFString that uses UTF8-encoded bytes for storage, then a fast path can be restored that doesn't require allocating a buffer.
Modified tests
After implementing this revised bridging, several tests began to fail. On further examination, these tests were found to be incorrect because they relied on multiple tandem NUL characters being lost on bridging. On a Mac, Darwin Foundation fails these tests as well. I have corrected the tests.
This PR additionally adds a test for constructing an NSString with data that contains an internal NUL character.
Miscellaneous changes
This PR rolls in the previously discussed (#496) change in
init?(data:encoding:)
.