-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Unicode × Linux #2368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The head ref may contain hidden characters: "unicode\u2010linux"
Unicode × Linux #2368
Conversation
// The macOS file system enforces NFD. | ||
// It can find everything without any help. | ||
return path | ||
#else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this will tank the performance on non-macOS platforms. Can you check how corelibs and llvm::path handle these? I wonder if there's a API in corelibs that we can leverage here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think corelibs
is just as broken in this regard, since most uses here are immediately followed by calls to FileManager
, whose failures clued me into this in the first place.
I don’t find anything relevant in corelibs when searching for “Unicode”, or when perusing the FileManager and URL sources. Maybe someone from that team would know better? @millenomi?
I find nothing related in llvm::sys::path
or llvm::sys::unicode
either.
Those lower levels may simply not care. The package manager is in an unusual situation, in that it looks up paths that it neither received from the file system nor placed there itself (and also in that it cannot defer this responsibility to a higher‐level client). This is the gap in which lookup becomes necessary. SwiftPM’s unique case is compounded by the fact that the same string literal in a manifest may need to match differing byte representations when checked out on different devices, so even telling users to be explicit with their literals ("\u{...}\u{...}\u{...}"
) cannot work around it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent for corelibs is to mimic Darwin. On Darwin, normalization occurs at the filesystem access level.
These should likely be contributions to FileManager to supplement its .fileSystemRepresentation… method. FileManager should accept any path as input and normalize it internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i.e. this is the wrong place; we should make FileManager better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree that this should be solved at the FileManager
level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should make FileManager better.
This is good to hear.
Unfortunately, my Linux machine is not large enough to develop core-libs-foundation
. I can just barely manage to test SwiftPM because it works in isolation, without any of the other swift-...
repositories. So that probably means someone else will have to be the one to sink it from here.
fileSystemRepresentation(withPath:)
looks like it intends to do the same thing, but at the moment it appears to basically just forward to CFStringGetFileSystemRepresentation
, which in turn currently does nothing outside of macOS. I don’t know if that is the right place for it to live. On macOS it basically only needs to decompose the string, but on other platforms it must actually look things up in the file system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't checked but maybe it's possible to develop corelibs-foundation against as a standalone project if you install a toolchain on Linux?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just doing a drive-by review (@SDGGiesbrecht's suggestion). I don't have a lot of context here, so these are just general questions.
Is canonical equivalence what you need for all platforms, or do you need compatibility equivalence? If you're trying to define a new model for SPM, then I agree with canonical equivalence. If you're trying to match an existing system's (legacy) model, then you might need compatibility equivalence.
Tangent: what is the story for unpaired surrogates in file names, which can happen on Windows?
/// It is sufficient if whatever scalars are returned | ||
/// *will be considered equal to the real ones by the file system* | ||
/// when future file system calls are made.* | ||
func resolveUnicode(_ path: AbsolutePath) -> AbsolutePath |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, a "unicode resolved" path would be reflected in the type system as such. Have you thought about ways to present this distinction?
Otherwise, be sure to liberally assert anywhere assuming a resolved path is in fact resolved (if it comes up). I like as many checks as possible to guard against the omission of an API call, especially as code evolves over multiple contributors.
Example:
extension FileSystem {
internal func _assertUnicodeResolved(_ path: AboslutePath) {
#if os(macOS)
return
#else
assert(path == self.resolveUnicode(path))
}
}
which can be sprinkled about as needed. If the idea is that all AbsolutePath
-receiving APIs on this protocol have to be resolved, then a separate (or phantom) type becomes more appealing. Another option is to have the protocol requirements be something like uncheckedGetFileInfo(_:)
and have getFileInfo(_:)
be in an extension that runs this assert before calling uncheckedGetFileInfo(_:)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if this is about putting the path into the conformer's preferred normal form, then this might be better called normalizeUnicode
or something, unless normalization is the only kind of resolution imaginable for file systems (it might be).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look, @milseman. All very good questions.
Is canonical equivalence what you need for all platforms, or do you need compatibility equivalence? If you're trying to define a new model for SPM, then I agree with canonical equivalence. If you're trying to match an existing system's (legacy) model, then you might need compatibility equivalence.
I’m not aiming at supporting old code pages. (If someone else wants to extend it for that, I leave it up to them.) The aim here is about resolving the discrepancy between path components entered in the manifest, which use Swift String
’s Unicode equivalence, and the file paths stored by the platform, which outside of macOS do not attempt to handle Unicode equivalence. A manifest should be equally loadable on every platform. Right now you get “file not found” errors when you go back and forth between platforms.
Tangent: what is the story for unpaired surrogates in file names, which can happen on Windows?
I have not thought about them and I have done nothing special to handle them. Whatever String
does ought to be fine, at least for now. In contrast to what this aims to fix, unpaired surrogates would indicate that the string itself is already “broken” to begin with.
Ideally, a "unicode resolved" path would be reflected in the type system as such. Have you thought about ways to present this distinction?
Initially I wanted to do it that way. Unfortunately, I realized that is itself unsafe, because the “resolvedness” of a path can be invalidated over time as file operations occur, some of which are hard for SwiftPM to track (such as Git). Since resolution must be redone before each operation anyway, the type‐system doesn’t actually help us much.
Otherwise, be sure to liberally assert anywhere assuming a resolved path is in fact resolved (if it comes up).
The assertion would have to do all the same work to check whether the resolution is valid. (And we cannot just record a flag in AbsolutePath
for the same reason a separate type would not work.) Calling resolveUnicode(_:)
at the start of each LocalFileSystem
method, effectively accomplishes the same safety otherwise afforded by assert
.
Another option is to have the protocol requirements be something like uncheckedGetFileInfo(:) and have getFileInfo(:) be in an extension that runs this assert before calling uncheckedGetFileInfo(_:).
The two private underscored methods are basically what you mean by unchecked
. They only exist because they need a variant callable from within resolveUnicode(_:)
that doesn’t wind up circular. Nothing else should be using them.
Also, if this is about putting the path into the conformer's preferred normal form, then this might be better called normalizeUnicode or something, unless normalization is the only kind of resolution imaginable for file systems (it might be).
Normalization‐related stuff is the only resolution I am imagining here. However, it is actually not normalization, but the reverse. On platforms besides macOS, the file on disk may have a name that is neither NFD nor NFC, and we still need to be able to find it.
#if os(macOS) | ||
// The macOS file system enforces NFD. | ||
// It can find everything without any help. | ||
return path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this contingent on APFS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea. File systems largely aren’t my area of expertise. Based on this note on the Uncyclopedia article for APFS, I suspect you are at least generally right that my #if os(macOS)
shortcut here uses the wrong check and not always valid. I have absolutely no idea if there is a more accurate check, or if the shortcut needs to be dropped.
@@ -224,41 +234,78 @@ public extension FileSystem { | |||
func getFileInfo(_ path: AbsolutePath) throws -> FileInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the default implementations above assert that the path is resolved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the other comment about types and assertions.
// Our semantics doesn't consider directories. | ||
return (self.isFile(path) || self.isSymlink(path)) && FileManager.default.isExecutableFile(atPath: path.pathString) | ||
} | ||
|
||
func exists(_ path: AbsolutePath, followSymlink: Bool) -> Bool { | ||
/// This method is Unicode‐naïve. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good indication that an assert would be nice; at the very least it is documentation about what you mean by "naive", as in it assumes the caller handled such details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the other comment about types and assertions.
It is private
. The documentation was a brief note to other contributors about why it is distinct from the neighbouring method. I could elaborate it with more of what I said in the other comment if any of you think it would be helpful.
let directoryName = directory.basename.unicodeScalars.map({ "U+\($0.value)" }) | ||
XCTFail("Unable to handle directory named “\(directoryName)”: \(error)") | ||
} | ||
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're looking for a more complete programmatic data set to drive this, the stdlib uses Unicode's published normalization tests at https://raw.githubusercontent.com/apple/swift/master/test/stdlib/Inputs/NormalizationTest.txt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lightweight test is really only there to check that Unicode equivalence is being considered at all. String
does all the heavy lifting when it comes to the definition of equivalence. (The text fixture we are trying to get working on Linux is already much meaner too.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also try U+0F73
which is FCD, has a CCC of 0, and yet expands into two different scalars, neither of which have a CCC of 0.
I have looked into porting this over to Foundation, but my (ancient and tiny) Linux machine just isn’t big enough. I am closing the PR, but the diffs and explanations will remain, so if someone else decides to give it a shot at some point, they will still have something to work from. |
This PR builds on the work done in #2356, activating that test on Linux and enabling it to pass. (Disclaimer: It passes on my Linux machine (Ubuntu 14). CI’s flavour of Linux could theoretically differ.)
The core of this PR is a single method added to
LocalFileSystem
, calledresolveUnicode(_:)
, which accepts anyAbsolutePath
and adjusts its scalar representation to match the scalar representation of whatever the file system actually has. Essentially, this allows a directory called “x” + “◌̄” + “◌̱” to be found by asking for “x” + “◌̱” + “◌̄” and vice versa, since the two are canonically equivalent. It is written in a platform‐agnostic way, which assumes nothing about the file system’s own preferred normalization form, and ought to work on any platform (though it is fast‐pathed on macOS, whose NFD enforcement makes it irrelevant).resolveUnicode(_:)
is then put to use in two situations:LocalFileSystem
’s methods use it internally to acquire proper paths before trying to do any file system operations. This enables all client code to never have to think about it, and it covers any and all file operations done directly by the package manager.In the course of that work, looking closer at the internals suggested a few gaps in testing. So three new tests were added:
LocalFileSystem
.TSCBasic
will eventually move to another repository, so it may as well have a test that moves with it.corelibs-foundation
team, since the twoXCTAssert
s disabled on Linux seem to demonstrate bugs inFoundation
’s normalization methods. @millenomi (The bugs do not affect SwiftPM; they just forced me to set up the test manually instead.)