-
Notifications
You must be signed in to change notification settings - Fork 1.2k
[SR-2225] String bridging and internal NUL (rebased) #604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@parkera Any post-Swift 3 thoughts on this issue? |
I'll take a look at this soon. Thanks @xwu |
|
||
// FIXME: If we had a reliable way to determine the length of `cf` | ||
// in bytes, then we wouldn't have to allocate a new buffer | ||
// if `CFStringGetCStringPtr(_:_:)` doesn't return nil. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The correct way to obtain the backing buffers is to expect nil and switch on either the 8 bit or 16 bit backing variants of CFString more like the previous code did, else you have a distinct chance that the CFStringGetBytes function will have to do a full walk of the unichars and convert each one.
This is roughly how a c implementation can be the fastest at converting to either a UTF8 or UTF16 buffer with an applier pseudo-lambda
void applyBuffer(CFStringRef str, void (*applier)(const void *buffer, CFStringEncoding)) {
const char *cptr = CFStringGetCStringPtr(str, kCFStringEncodingUTF8);
if (cptr != NULL) {
applier(cptr, kCFStringEncodingUTF8);
} else {
const UniChar *uptr = CFStringGetCharactersPtr(str);
if (uptr != NULL) {
applier(uptr, kCFStringEncodingUTF16);
} else {
CFIndex length = CFStringGetLength(str);
UniChar stack_characters[1024] = {0}; // adjust stack size according to taste
UniChar *characters = &stack_characters[0];
if (length > sizeof(stack_characters) / sizeof(stack_characters[0])) {
characters = malloc(sizeof(UniChar) * length);
}
CFStringGetCharacters(str, CFRangeMake(0, length), characters);
applier(characters, kCFStringEncodingUTF16);
if (characters != &stack_characters[0]) {
free(characters);
}
}
}
}
CFString, NSString and struct String all have dual backing implementations that either are fast paths for 8 bit storage or 16 bit storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat. I will study this code and implement accordingly. Currently traveling so can't promise quick turnaround.
@@ -33,7 +33,7 @@ class TestNSStream : XCTestCase { | |||
} | |||
|
|||
func test_InputStreamWithData(){ | |||
let message: NSString = "Hello, playground" | |||
let message: NSString = "Hello, playground\0\0\0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be better to write it's own tests instead of hijacking other un-related unit tests (this applies to all of the unit test changes here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, these are meant to be minimal changes to make these tests pass, not a way of slipping in additional tests; they relied on buggy behavior in their current form and must be changed once string bridging is fixed. The alternative is to alter some of the hardcoded buffer lengths and offsets, but it's potato po-tah-to IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be concrete, these tests expect NSStrings of 17 characters to roundtrip through a 20-character buffer. After string bridging is fixed, we must either make the string 20 characters long or the buffer 17 characters long. Not modifying the test is not an option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm interesting, running this test on Darwin (in it's original form) fails.
This looks like the test is busted and should be fixed. Which I feel like the correct fix for the test is to have the buffer be 17 bytes and not add null bytes onto the NSString since that is definitely an uncommon edge case behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sorry for not communicating that fact more clearly (the fact that these tests fail on Darwin). It's stuffed in the middle of my PR description. Happy to modify things the other way if you prefer; notably, however you fix it, the test passes with the currently busted string bridging, so I'm definitely not sneaking in any test cases under the radar :)
@phausler See if these latest changes adequately address your comments. |
This pull request has gone stale, closing. Please rebase before re-opening. |
(A rebased version of #549.)
What's going on in this PR
Existing bridging is buggy
The existing implementation of bridging from
_NSCFString
toString
is broken in several ways:CFStringGetLength()
andCFStringGetCharacters()
. However, if theCFString
in question has been initialized using bytes and contains multi-byte characters, neither of these CF functions behaves as documented:CFStringGetLength()
returns the length in bytes, not the length in UTF16 code pairs as it's supposed to.CFStringGetCharacters()
returns the bytes, not the UTF16 code pairs as it's supposed to.As a consequence, if UTF8-encoded bytes that encode internal NUL characters are used to initialize an
NSString
, the "fast path" would clobber bridging in one way; but, if those bytes encode multi-byte characters, the "slow path" would clobber bridging in a different way.Revised bridging
Since CF UTF16-related functions are unreliable, I allocate a buffer to store UTF8-encoded bytes. If some way can be found to determine (reliably, not relying on the result of a buggy implementation in CF) the length in bytes of a CFString that uses UTF8-encoded bytes for storage, then a fast path can be restored that doesn't require allocating a buffer.
Modified tests
After implementing this revised bridging, five tests began to fail. On further examination, these tests were found to be incorrect because they relied on multiple tandem NUL characters being lost on bridging. On a Mac, Darwin Foundation would fail these five tests as well. I have corrected the tests.
This PR also adds a test for constructing an
NSString
with data that contains an internal NUL character.Miscellaneous changes
This PR rolls in the previously discussed (#496) change in
init?(data:encoding:)
.This PR also fixes two nits: a typo in a test name and inconsistent use of an unindented
//
.