Skip to content

Make NSString(contentsOf:usedEncoding:) tests work on macOS #1149

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 2, 2017

Conversation

spevans
Copy link
Contributor

@spevans spevans commented Aug 1, 2017

  • Add the NSString-UTF32-{LE,BE}-data.txt tests to the copy resources.

  • Add the BOM marker 0xFEFF to the start of the comparison string.

Im not sure if it is correct to have the BOM as the first character in a string but with this change the tests now work on macOS and still work ok on Linux.

string.character(at: 0) returns the BOM on macOS, does that sound correct?

The UTF32 changes were merged in #928

@ianpartridge
Copy link
Contributor

Thanks - I'll give this a test here.

@ianpartridge
Copy link
Contributor

Confirmed that the failing tests now pass for me, running TestFoundation on Xcode 9 beta 4.

@ianpartridge
Copy link
Contributor

@swift-ci please test

@spevans
Copy link
Contributor Author

spevans commented Aug 1, 2017

@ianpartridge Does it seems correct to you that the first character of the string would be the BOM? I thought that once it had been parsed by the underlying library the String should be just be unicode and independent of encoding.

@ianpartridge
Copy link
Contributor

I think what's happening is that Swift just treats the first character of the string as the codepoint that represents a BOM, not a BOM itself?

let str: String = "\u{FEFF}NSString"
print(str.count)
print(str)
print(Array(str))
print(Array(str.unicodeScalars))
print(Array(str.utf8))
print(Array(str.utf16))

prints

9
NSString
["", "N", "S", "S", "t", "r", "i", "n", "g"]
["\u{FEFF}", "N", "S", "S", "t", "r", "i", "n", "g"]
[239, 187, 191, 78, 83, 83, 116, 114, 105, 110, 103]
[65279, 78, 83, 83, 116, 114, 105, 110, 103]

I don't know what the Unicode specification says about this - @milseman would know for sure.

@ianpartridge
Copy link
Contributor

If I am reading http://www.unicode.org/versions/Unicode10.0.0/ch23.pdf#G19635 correctly, the relevant bit says:

[...] where Unicode text has known byte order, initial U+FEFF characters are not required, but for backward compatibility are to be interpreted as zero width no-break spaces

So I think Swift is doing the right thing here by just treating the "BOM" as a Unicode scalar with the semantics of a "zero width no-break space".

@spevans
Copy link
Contributor Author

spevans commented Aug 1, 2017

I just did a test on latest snapshow, linux v macOS

$ ~/swift-DEVELOPMENT-SNAPSHOT-2017-07-31-a-ubuntu16.04/usr/bin/swift
Welcome to Swift version 4.0-dev (LLVM ee9f1a5743, Clang 9a7d2d2f21, Swift 106f4bec0a). Type :help for assistance.
  1> print("\u{FEFF}NSString" == "NSString")
true
  2>  


$ Library/Developer/Toolchains/swift-DEVELOPMENT-SNAPSHOT-2017-07-31-a.xctoolchain/usr/bin/swift
Welcome to Apple Swift version 4.0-dev (LLVM ee9f1a5743, Clang 9a7d2d2f21, Swift 106f4bec0a). Type :help for assistance.
  1> print("\u{FEFF}NSString" == "NSString")
false
  2>  

I think this could cause problems later on down the road, Im just not sure which one is wrong or if Linux is just being more generous in allowing it.

@ianpartridge
Copy link
Contributor

This might be caused by different underlying versions of ICU being used?

@spevans
Copy link
Contributor Author

spevans commented Aug 1, 2017

Ah yes, you could be right about that

@spevans
Copy link
Contributor Author

spevans commented Aug 1, 2017

@milseman @parkera Is it ok that Linux and macOS treat strings slightly differently, possibly due to different underlying ICU versions?

@ianpartridge
Copy link
Contributor

I don't think it is OK. I think it's a bug on Linux.

@spevans
Copy link
Contributor Author

spevans commented Aug 2, 2017

@ianpartridge I think you may be right and I suspect the bug is lower down in stdlib rather than Foundation. I think this is ok to merge (to fix macOS) and I will follow up with a question about BOMs to swift-dev.

@ianpartridge
Copy link
Contributor

I'm not sure this PR is a fix... Shouldn't NSString(contentsOf:usedEncoding:) detect the BOM in the file and not include it in the resulting string?

- Add the NSString-UTF32-{LE,BE}-data.txt tests to the copy resources.

- Skip BOM header when passing data to CFStringCreateWithBytes().
@spevans spevans force-pushed the pr_nsstring_bom_fix branch from 0507518 to abe7693 Compare August 2, 2017 14:09
@spevans
Copy link
Contributor Author

spevans commented Aug 2, 2017

@ianpartridge You are correct, if passing encoding as utf{16,32} the BOM is examined and skipped. If passing utf{16,32}{LE, BE} then its assumed there is no BOM and it needs to be skipped manually. Ive added the skip and it now works on Xcode9b4

@ianpartridge
Copy link
Contributor

This makes sense to me - let's try Linux 🙂

@ianpartridge
Copy link
Contributor

@swift-ci please test

@ianpartridge
Copy link
Contributor

I tested this PR on Xcode myself, and confirmed the tests all now pass 🍾

@ianpartridge ianpartridge merged commit 55d7e65 into swiftlang:master Aug 2, 2017
@ianpartridge
Copy link
Contributor

P.S. I think this means that the Linux tests were only passing because of the other bug you found: print("\u{FEFF}NSString" == "NSString") // true

Pure luck!

@spevans
Copy link
Contributor Author

spevans commented Aug 2, 2017

Yes indeed, I think there is still a unicode issue/difference between macOS and Linux but that can be look at seperately.

@spevans spevans deleted the pr_nsstring_bom_fix branch August 9, 2017 06:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants