Make NSString(contentsOf:usedEncoding:) tests work on macOS #1149

spevans · 2017-08-01T11:45:27Z

Add the NSString-UTF32-{LE,BE}-data.txt tests to the copy resources.
Add the BOM marker 0xFEFF to the start of the comparison string.

I'm not sure it is correct to have the BOM as the first character in a string, but with this change the tests now work on macOS and still work on Linux.

string.character(at: 0) returns the BOM on macOS, does that sound correct?

The UTF32 changes were merged in #928

ianpartridge · 2017-08-01T11:59:56Z

Thanks - I'll give this a test here.

ianpartridge · 2017-08-01T12:19:35Z

Confirmed that the failing tests now pass for me, running TestFoundation on Xcode 9 beta 4.

ianpartridge · 2017-08-01T12:19:44Z

@swift-ci please test

spevans · 2017-08-01T12:30:44Z

@ianpartridge Does it seem correct to you that the first character of the string would be the BOM? I thought that once it had been parsed by the underlying library, the String should just be Unicode and independent of encoding.

ianpartridge · 2017-08-01T12:51:48Z

I think what's happening is that Swift just treats the first character of the string as the codepoint that represents a BOM, not a BOM itself?

let str: String = "\u{FEFF}NSString"
print(str.count)
print(str)
print(Array(str))
print(Array(str.unicodeScalars))
print(Array(str.utf8))
print(Array(str.utf16))

prints

9
NSString
["", "N", "S", "S", "t", "r", "i", "n", "g"]
["\u{FEFF}", "N", "S", "S", "t", "r", "i", "n", "g"]
[239, 187, 191, 78, 83, 83, 116, 114, 105, 110, 103]
[65279, 78, 83, 83, 116, 114, 105, 110, 103]

I don't know what the Unicode specification says about this - @milseman would know for sure.

ianpartridge · 2017-08-01T13:00:23Z

If I am reading http://www.unicode.org/versions/Unicode10.0.0/ch23.pdf#G19635 correctly, the relevant bit says:

[...] where Unicode text has known byte order, initial U+FEFF characters are not required, but for backward compatibility are to be interpreted as zero width no-break spaces

So I think Swift is doing the right thing here by just treating the "BOM" as a Unicode scalar with the semantics of a "zero width no-break space".

spevans · 2017-08-01T14:22:41Z

I just ran a test on the latest snapshot, Linux vs. macOS:

$ ~/swift-DEVELOPMENT-SNAPSHOT-2017-07-31-a-ubuntu16.04/usr/bin/swift
Welcome to Swift version 4.0-dev (LLVM ee9f1a5743, Clang 9a7d2d2f21, Swift 106f4bec0a). Type :help for assistance.
  1> print("\u{FEFF}NSString" == "NSString")
true
  2>  


$ Library/Developer/Toolchains/swift-DEVELOPMENT-SNAPSHOT-2017-07-31-a.xctoolchain/usr/bin/swift
Welcome to Apple Swift version 4.0-dev (LLVM ee9f1a5743, Clang 9a7d2d2f21, Swift 106f4bec0a). Type :help for assistance.
  1> print("\u{FEFF}NSString" == "NSString")
false
  2>

I think this could cause problems further down the road. I'm just not sure which one is wrong, or whether Linux is just being more generous in allowing it.
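To take the platform's `==` implementation out of the picture, the two strings can be compared scalar by scalar. This is only a minimal sketch, not something from this PR; `droppingLeadingBOM` is a hypothetical helper introduced for illustration.

```swift
let withBOM = "\u{FEFF}NSString"
let plain = "NSString"

// Compare the Unicode scalars directly rather than relying on String's ==,
// which is the operation that differed between the two toolchains above.
let sameScalars = withBOM.unicodeScalars.elementsEqual(plain.unicodeScalars)
print(sameScalars)  // false: the scalar views have 9 and 8 elements

// Hypothetical helper: drops a single leading U+FEFF scalar, if present.
func droppingLeadingBOM(_ s: String) -> String {
    var scalars = s.unicodeScalars
    guard let first = scalars.first, first.value == 0xFEFF else { return s }
    scalars.removeFirst()
    return String(scalars)
}

let stripped = droppingLeadingBOM(withBOM)
print(stripped.unicodeScalars.elementsEqual(plain.unicodeScalars))  // true
```

Comparing the scalar views avoids whatever canonical-equivalence handling sits behind `==`, so it should give the same answer on both platforms.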

ianpartridge · 2017-08-01T14:30:09Z

This might be caused by different underlying versions of ICU being used?

spevans · 2017-08-01T14:35:06Z

Ah yes, you could be right about that

spevans · 2017-08-01T19:29:41Z

@milseman @parkera Is it ok that Linux and macOS treat strings slightly differently, possibly due to different underlying ICU versions?

ianpartridge · 2017-08-02T08:13:57Z

I don't think it is OK. I think it's a bug on Linux.

spevans · 2017-08-02T10:13:46Z

@ianpartridge I think you may be right and I suspect the bug is lower down in stdlib rather than Foundation. I think this is ok to merge (to fix macOS) and I will follow up with a question about BOMs to swift-dev.

ianpartridge · 2017-08-02T11:09:24Z

I'm not sure this PR is a fix... Shouldn't NSString(contentsOf:usedEncoding:) detect the BOM in the file and not include it in the resulting string?

- Add the NSString-UTF32-{LE,BE}-data.txt tests to the copy resources.
- Skip BOM header when passing data to CFStringCreateWithBytes().

spevans · 2017-08-02T14:12:06Z

@ianpartridge You are correct. If the encoding is passed as utf{16,32}, the BOM is examined and skipped. If it is passed as utf{16,32}{LE,BE}, it's assumed there is no BOM, so any BOM present needs to be skipped manually. I've added the skip and it now works on Xcode 9 beta 4.
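As a rough illustration of the manual skip described above (a sketch only, not the code in this PR, which passes the data to CFStringCreateWithBytes()), the following pure-Swift example detects a leading BOM and decodes the remaining UTF-32LE bytes. `DetectedEncoding`, `detectBOM` and `decodeUTF32LE` are hypothetical names made up for this example.

```swift
// Hypothetical types and helpers for illustration only.
enum DetectedEncoding {
    case utf8, utf16LE, utf16BE, utf32LE, utf32BE
}

// Returns the encoding implied by a leading BOM and the BOM's byte length,
// or nil if the data does not start with a recognised BOM.
func detectBOM(_ bytes: [UInt8]) -> (encoding: DetectedEncoding, bomLength: Int)? {
    // Check the 4-byte UTF-32 marks first: the UTF-32LE BOM (FF FE 00 00)
    // begins with the same two bytes as the UTF-16LE BOM (FF FE).
    if bytes.starts(with: [0xFF, 0xFE, 0x00, 0x00]) { return (.utf32LE, 4) }
    if bytes.starts(with: [0x00, 0x00, 0xFE, 0xFF]) { return (.utf32BE, 4) }
    if bytes.starts(with: [0xEF, 0xBB, 0xBF]) { return (.utf8, 3) }
    if bytes.starts(with: [0xFF, 0xFE]) { return (.utf16LE, 2) }
    if bytes.starts(with: [0xFE, 0xFF]) { return (.utf16BE, 2) }
    return nil
}

// Decodes little-endian UTF-32 code units with no BOM handling at all,
// mirroring an explicit "LE" encoding that assumes the BOM is absent.
func decodeUTF32LE(_ payload: ArraySlice<UInt8>) -> String {
    var scalars = String.UnicodeScalarView()
    var i = payload.startIndex
    while i + 3 < payload.endIndex {
        let b0 = UInt32(payload[i])
        let b1 = UInt32(payload[i + 1]) << 8
        let b2 = UInt32(payload[i + 2]) << 16
        let b3 = UInt32(payload[i + 3]) << 24
        // Invalid scalar values are silently dropped in this sketch.
        if let scalar = Unicode.Scalar(b0 | b1 | b2 | b3) {
            scalars.append(scalar)
        }
        i += 4
    }
    return String(scalars)
}

// "NS" in UTF-32LE, preceded by a BOM.
let fileBytes: [UInt8] = [0xFF, 0xFE, 0x00, 0x00,
                          0x4E, 0x00, 0x00, 0x00,
                          0x53, 0x00, 0x00, 0x00]

let bom = detectBOM(fileBytes)
let skip = bom?.bomLength ?? 0

let unskipped = decodeUTF32LE(fileBytes[0...])   // BOM kept as U+FEFF
let decoded = decodeUTF32LE(fileBytes[skip...])  // BOM bytes skipped

print(unskipped.unicodeScalars.map { $0.value })  // [65279, 78, 83]
print(decoded.unicodeScalars.map { $0.value })    // [78, 83]
```

Without the skip, the decoder has no reason to treat the first code unit specially, so U+FEFF ends up as the first scalar of the string, which matches what the macOS tests were seeing.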

ianpartridge · 2017-08-02T14:36:51Z

This makes sense to me - let's try Linux 🙂

ianpartridge · 2017-08-02T14:37:13Z

@swift-ci please test

ianpartridge · 2017-08-02T15:21:11Z

I tested this PR on Xcode myself, and confirmed the tests all now pass 🍾

ianpartridge · 2017-08-02T15:22:57Z

P.S. I think this means that the Linux tests were only passing because of the other bug you found: print("\u{FEFF}NSString" == "NSString") // true

Pure luck!

spevans · 2017-08-02T15:26:46Z

Yes indeed. I think there is still a Unicode issue/difference between macOS and Linux, but that can be looked at separately.

Make NSString(contentsOf:usedEncoding:) tests work on macOS

abe7693

- Add the NSString-UTF32-{LE,BE}-data.txt tests to the copy resources.
- Skip BOM header when passing data to CFStringCreateWithBytes().

spevans force-pushed the pr_nsstring_bom_fix branch from 0507518 to abe7693 August 2, 2017 14:09

ianpartridge merged commit 55d7e65 into swiftlang:master Aug 2, 2017

spevans deleted the pr_nsstring_bom_fix branch August 9, 2017 06:36
