Skip to content

[SR-2225] String bridging and internal NUL (rebased) #604

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

[SR-2225] String bridging and internal NUL (rebased) #604

wants to merge 2 commits into from

Conversation

xwu
Copy link
Contributor

@xwu xwu commented Aug 29, 2016

(A rebased version of #549.)

What's going on in this PR

Existing bridging is buggy

The existing implementation of bridging from _NSCFString to String is broken in several ways:

  1. The "fast path" blows away internal NUL characters because it uses facilities for C strings.
  2. Merely removing the "fast path" causes tests to fail. The underlying reason for the failing tests is that the default "slow path" is also broken.
  3. The "slow path" uses CFStringGetLength() and CFStringGetCharacters(). However, if the CFString in question has been initialized using bytes and contains multi-byte characters, neither of these CF functions behaves as documented:
    • CFStringGetLength() returns the length in bytes, not the length in UTF16 code pairs as it's supposed to.
    • CFStringGetCharacters() returns the bytes, not the UTF16 code pairs as it's supposed to.

As a consequence, if UTF8-encoded bytes that encode internal NUL characters are used to initialize an NSString, the "fast path" would clobber bridging in one way; but, if those bytes encode multi-byte characters, the "slow path" would clobber bridging in a different way.

Revised bridging

Since CF UTF16-related functions are unreliable, I allocate a buffer to store UTF8-encoded bytes. If some way can be found to determine (reliably, not relying on the result of a buggy implementation in CF) the length in bytes of a CFString that uses UTF8-encoded bytes for storage, then a fast path can be restored that doesn't require allocating a buffer.

Modified tests

After implementing this revised bridging, five tests began to fail. On further examination, these tests were found to be incorrect because they relied on multiple tandem NUL characters being lost on bridging. On a Mac, Darwin Foundation would fail these five tests as well. I have corrected the tests.

This PR also adds a test for constructing an NSString with data that contains an internal NUL character.

Miscellaneous changes

This PR rolls in the previously discussed (#496) change in init?(data:encoding:).
This PR also fixes two nits: a typo in a test name and inconsistent use of an unindented //.

@xwu
Copy link
Contributor Author

xwu commented Sep 20, 2016

@parkera Any post-Swift 3 thoughts on this issue?

@parkera
Copy link
Contributor

parkera commented Sep 22, 2016

I'll take a look at this soon. Thanks @xwu

@parkera parkera self-assigned this Sep 22, 2016

// FIXME: If we had a reliable way to determine the length of `cf`
// in bytes, then we wouldn't have to allocate a new buffer
// if `CFStringGetCStringPtr(_:_:)` doesn't return nil.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The correct way to obtain the backing buffers is to expect nil and switch on either the 8 bit or 16 bit backing variants of CFString more like the previous code did, else you have a distinct chance that the CFStringGetBytes function will have to do a full walk of the unichars and convert each one.

This is roughly how a c implementation can be the fastest at converting to either a UTF8 or UTF16 buffer with an applier pseudo-lambda

void applyBuffer(CFStringRef str, void (*applier)(const void *buffer, CFStringEncoding)) {
    const char *cptr = CFStringGetCStringPtr(str, kCFStringEncodingUTF8);
    if (cptr != NULL) {
        applier(cptr, kCFStringEncodingUTF8);
    } else {
        const UniChar *uptr = CFStringGetCharactersPtr(str);
        if (uptr != NULL) {
            applier(uptr, kCFStringEncodingUTF16);
        } else {
            CFIndex length = CFStringGetLength(str);
            UniChar stack_characters[1024] = {0}; // adjust stack size according to taste
            UniChar *characters = &stack_characters[0];
            if (length > sizeof(stack_characters) / sizeof(stack_characters[0])) {
                characters = malloc(sizeof(UniChar) * length);
            }
            CFStringGetCharacters(str, CFRangeMake(0, length), characters);
            applier(characters, kCFStringEncodingUTF16);
            if (characters != &stack_characters[0]) {
                free(characters);
            }
        }
    }
}

CFString, NSString and struct String all have dual backing implementations that either are fast paths for 8 bit storage or 16 bit storage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat. I will study this code and implement accordingly. Currently traveling so can't promise quick turnaround.

@@ -33,7 +33,7 @@ class TestNSStream : XCTestCase {
}

func test_InputStreamWithData(){
let message: NSString = "Hello, playground"
let message: NSString = "Hello, playground\0\0\0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be better to write it's own tests instead of hijacking other un-related unit tests (this applies to all of the unit test changes here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, these are meant to be minimal changes to make these tests pass, not a way of slipping in additional tests; they relied on buggy behavior in their current form and must be changed once string bridging is fixed. The alternative is to alter some of the hardcoded buffer lengths and offsets, but it's potato po-tah-to IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be concrete, these tests expect NSStrings of 17 characters to roundtrip through a 20-character buffer. After string bridging is fixed, we must either make the string 20 characters long or the buffer 17 characters long. Not modifying the test is not an option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm interesting, running this test on Darwin (in it's original form) fails.

This looks like the test is busted and should be fixed. Which I feel like the correct fix for the test is to have the buffer be 17 bytes and not add null bytes onto the NSString since that is definitely an uncommon edge case behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry for not communicating that fact more clearly (the fact that these tests fail on Darwin). It's stuffed in the middle of my PR description. Happy to modify things the other way if you prefer; notably, however you fix it, the test passes with the currently busted string bridging, so I'm definitely not sneaking in any test cases under the radar :)

@xwu
Copy link
Contributor Author

xwu commented Oct 29, 2016

@phausler See if these latest changes adequately address your comments.

@alblue
Copy link
Contributor

alblue commented Oct 5, 2017

This pull request has gone stale, closing. Please rebase before re-opening.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants