Reference implementation of Unicode Security (UTS39) #11569
Conversation
* uncommon codepoints, confusables, mixed-script
* trying to follow UTS39
Thank you @mrluc, this is amazing! I think the next step is to break this apart into smaller PRs. Here are my general thoughts so far:
👍 cool! That breakdown + order of PRs, to integrate into Elixir tokenization, all sounds doable.
And yes, I will just rip out that placeholder 'mathy whitelist' for now -- in the future, it'll be easier to allow 'Technical' categories or some such, once we have these UTS39 protections in place; probably people doing very-mathy Nx work will want that eventually, if they're not already clamoring for it.
Hi @mrluc, I was re-reading this PR after reading UTS39 and I realized that your mixed-script detection considers whether the script was used anywhere else in the file. I think we should not change that, so the next step is for you to track whether any non-ASCII identifier is used and, if so, invoke the Unicode validation. During the Unicode validation, we can perform both the confusable and mixed-script checks. :)
Awesome, thanks for giving it a look.
Oh, yeah, I guess I overlooked that when responding to your feedback! 🤦 Yeah, I guess both are stateful. For anyone reading and wondering why it's stateful -- the mixed-script detection itself isn't, that's done by the 'resolved script set', and 'mixed-script confusability' isn't inherently stateful either, it's just consulting a lookup of characters that are confusable cross-script. However, stopping here would mean that, for instance, every use of a Latin vowel would be flagged as they're all potential mixed-script confusables. Clearly a filter is needed, and the standard implies this:
So our filter for 'potentially problematic mixed-script confusability' uses the same heuristic as Rust -- we consider whether there have been any non-confusable uses of a script, i.e., characters from script X that are clearly from script X and not some other script; in our case, within the same file. 👍 re the next PR including both checks, and probably being closer to the reference implementation as a result.
@mrluc I have been thinking more about this and I wonder if we should rather require character sets to be explicitly opted into? Something like this in your mix.exs:
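The example config from this comment didn't survive extraction; here is a hypothetical sketch of what such a mix.exs option could look like (the option name `allowed_scriptsets` and its shape are illustrative assumptions, not the actual proposal):

```elixir
# hypothetical mix.exs sketch -- option name and shape are illustrative only
def project do
  [
    app: :my_app,
    version: "0.1.0",
    elixirc_options: [
      # e.g. compiled files restricted to Latin identifiers by default...
      allowed_scriptsets: [:latin]
      # ...while test/eval files could opt into all scriptsets
    ]
  ]
end
```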
This makes everything safe by default but gives users control too. Compiled files will be restricted to Latin by default, but test files/eval files can use all scriptsets. It is most likely an addition to all the other work that needs to be done.
@josevalim from a safety point of view, this sounds promising as an extension mechanism, as for instance it could provide a way of ensuring that 'only these scripts are used in our codebase' (and dependencies? if my intuition of what elixirc_options does is right, but maybe not) -- which I could see removing the need for the whole-file check like 'the only use of Script X is via confusable characters...'; if you don't want Script X, don't configure it! From an implementation point of view, I think I understand when you say 'most likely an addition to all other work that needs to be done' -- I could be wrong though. So this is what it'd most naturally mean to me: Currently, if it's unicode, in validation we do UTS 39 5.2's 'Highly Restrictive' only, which is (pseudocode):
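The pseudocode block from the original comment was lost in extraction; here is a sketch of the 'Highly Restrictive' check following UTS 39 5.2's definition (helper names are illustrative):

```
scripts = resolved_script_set(identifier)
if single_script?(scripts)                          -> ok
else if scripts ⊆ {Latin, Han, Hiragana, Katakana}  -> ok   # Japanese
else if scripts ⊆ {Latin, Han, Bopomofo}            -> ok   # Chinese
else if scripts ⊆ {Latin, Han, Hangul}              -> ok   # Korean
else                                                -> warn(mixed_script)
```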
And with the proposed additional security, in validation we'd do an initial check first (pseudocode):
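That comment's pseudocode was also lost in extraction; a sketch of how a configured-scriptsets check could run before the Highly Restrictive check (names are illustrative):

```
allowed = configured_scriptsets()          # e.g. [Latin] by default
scripts = resolved_script_set(identifier)
if not scripts ⊆ allowed -> warn(disallowed_script)
else                     -> highly_restrictive_check(identifier)
```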
That would give a configurable mechanism to let programmers of all languages get secure identifiers, maybe with a one-liner in their config, but it is also, as you say, secure by default and lets teams completely rule out scripts that shouldn't appear in their codebase.
This is all merged now, thank you @mrluc! I am quite happy with the implementation and footprint of C3, so I think we can postpone for now the idea of declaring scriptsets upfront. I am also interested in supporting some mathematical symbols, but I would like to do so in a structured way: i.e., we support all symbols that belong to a specific category that has been vetted against our Unicode Security practices. If you are aware of, or want to investigate, what this category might be, it would be very appreciated! Thank you! ❤️
This is a reference implementation, for Elixir, of the three main protections from the Unicode Technical Standard on Security (UTS39).
The standard is pretty involved -- we include docs and tests that should help readers understand the 3 main pieces of those protections, and then there are lots of comments citing UTS39 directly in the implementation itself.
The PR description is organized as follows:
Existing behavior in Elixir and other languages
Without the protections from UTS39, the following potentially confusing tokens generate no warnings in Elixir:
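The original example tokens were lost in extraction; here is an illustrative case using classic Latin/Cyrillic confusables (not necessarily the exact tokens from the PR):

```elixir
# Latin "a" (U+0061) and Cyrillic "а" (U+0430) render identically
# but are distinct codepoints, so these bind two different variables:
a = 1
а = 2
```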
Languages differ in how they handle this:
The 'Trojan Source' researchers were assigned a second CVE for this in October 2021, CVE-2021-42694, and their recommendation is:
The first part of that recommendation is covered by @josevalim's recent PR that prevents 'bidi' in source code; this PR is intended to address the second recommendation.
What this PR changes
Additional warnings
We emit warnings based on implementing the protections from the UTS39 standard, like so:
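The warning output was lost in extraction; the exact text emitted isn't reproduced here, but a compiler diagnostic for a Latin/Cyrillic confusable pair might look roughly like this (format is a sketch, not the verbatim message):

```
warning: confusable identifier: "а" looks like "a" on line 1, but they are written using different characters
  example.ex:2
```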
The number and kind of warnings correspond pretty closely to the same examples in Rust.
One difference: ideally, Rust would treat

力=1; カ=1

as confusable only, emitting just a 'confusable' warning -- but instead it emits both 'confusable' and 'mixed-script confusable' warnings, even though both characters resolve to {Japanese} and are thus not mixed-script per UTS39. This appears to be because the unicode-security crate can't/doesn't benefit from the unicode-scripts crate's script resolution logic, which is in Rust, when computing the mixed-script-confusables table, and thus (likely by accident) uses a different definition of what mixed-script is -- one based only on the Scripts.txt file.

Whitelist of uncommon codepoints
This PR also demonstrates adding a whitelist of math-like symbols; we wanted to add uncommon codepoint protection in a way that wouldn't make it hard to support mathy symbols in the future, so this branch also allows Elixir identifiers and functions to use mathy symbols from that whitelist, like:
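The example list didn't survive extraction; here is an illustrative sketch, assuming a math symbol such as ∆ (U+2206 INCREMENT) is on the whitelist (the actual whitelisted codepoints may differ):

```elixir
# assumes ∆ (U+2206) is on the math-symbol whitelist -- illustrative only
∆t = 0.01
velocity = fn ∆x -> ∆x / ∆t end
```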
Reference: Rust implementation
The Rust implementation was a very useful reference to read the spec with.
unicode-rs/unicode-scripts, unicode-rs/unicode-security (don't overlook scripts/unicode.py in each of those, especially the security one; they're part of the implementation) and, in the rust repo, the on-by-default non_ascii_idents.rs lint, which uses the capabilities from those crates to implement those 3 protections for Rust.

Potential next steps
I titled this "reference implementation" because, minimally, it's valuable as an example of how we can add full UTS39 protections.
It's also valuable to me ... for freeing up my 'shop time', since this turned out to be a high-quality rabbit hole! 😆 So maybe I can stop tinkerin' with this soon.
However, if we're interested in moving forward with this PR, the next steps I'd have are as follows: