
Reference implementation of Unicode Security (UTS39) #11569



@mrluc (Contributor) commented Jan 14, 2022

This is a reference implementation, for Elixir, of the three main protections from the Unicode security standard, UTS39 (Unicode Technical Standard #39, Unicode Security Mechanisms).

The standard is pretty involved -- we include docs and tests that should help readers understand the 3 main pieces of those protections, and then there are lots of comments citing UTS39 directly in the implementation itself.

This PR description is organized as follows:

  • Existing behavior (in Elixir and other languages)
  • What this PR changes
  • Note on following along with the Rust implementation (also we may have found a bug)
  • Next steps, maybe

Existing behavior in Elixir and other languages

Without the protections from UTS39, the following potentially confusing tokens generate no warnings in Elixir:

# one of these has a non-ascii char that looks like an exclamation point
assert! = 1; assertǃ = 1;

# one of these uses a Cyrillic vowel, which could be confusing:
аdmin = 1; admin = 1

# this character renders visually invisible in most clients, but is allowed in idents
# (visibility checked in: github, gists, iTerm2, emacs GUI, slack, replit; 4 'try lang X' playgrounds)
 = 1
# and could be confusing if present in source code:
fooㅤ= 1  # <-- invisible character hiding at end of identifier
[
  foo,  # <-- invisible character hiding after comma
] = values

Languages differ in how they handle this:

  • 🛶 Languages that allow Unicode in code are largely in the same boat as Elixir! For instance, most or all of the above elicits no warning in Ruby, Go, Julia, Python 3 (the mixed-script example only), and JavaScript/TypeScript.
  • Erlang, PHP, and others are ASCII-only for identifiers.
  • Perl requires a declaration up at the top of the file if you use non-resolving scripts.
  • 🏅 Rust, however, allows Unicode and also does great at warning on this sort of thing.

The 'Trojan Source' folks were granted a 2nd CVE for this in Oct 2021, CVE-2021-42694, and their recommendation is:

"Compilers, interpreters, and build pipelines supporting Unicode should throw errors or warnings for unterminated bidirectional control characters in comments or string literals, and for identifiers with mixed-script confusable characters." (emphasis added)

The first part of that recommendation is covered by @josevalim's recent PR that prevents 'bidi' in source code; this PR is intended to address the second recommendation.

What this PR changes

Additional warnings

We emit warnings based on implementing the protections from the UTS39 standard, like so:

iex(1)> assert! = 1; assertǃ = 1;
warning: confusable identifier: 'assertǃ' looks like 'assert!' on line 1
  iex:5:14

warning: identifier 'assertǃ' has uncommon codepoint \u1C3
  iex:5:14

1
iex(2)> 力=1; カ=1
warning: confusable identifier: 'カ' looks like '力' on line 2
  iex:9:6

1
iex(3)> аdmin=1; admin=1
warning: The only uses of Cyrillic in this file are mixed-script confusables, like 'аdmin' on line 3:
 \u0430 'а' {Cyrillic} <- mixed-script confusable
 \u0064 'd' {Latin}
 \u006D 'm' {Latin}
 \u0069 'i' {Latin}
 \u006E 'n' {Latin}
 Resolved script set (intersection): {∅}
  iex:1:1

warning: confusable identifier: 'admin' looks like 'аdmin' on line 3
  iex:1:10

The number and kind of warnings correspond pretty closely to the same examples in Rust.

  • Actually, we believe we have found one minor/ticky-tacky issue we'll want to contribute back to Rust for completeness -- we think Rust should behave like our implementation on the Japanese example above, 力=1; カ=1, and emit only a 'confusable' warning -- but instead it emits both 'confusable' and 'mixed-script confusable' warnings, even though both characters resolve to {Japanese} and are thus not mixed-script per UTS39 (see the script-resolution sketch after this list).
  • I think I see where this happens, and I suspect it's because of the split between Python code to generate Rust code, and the compile/runtime Rust code. The Python code in the unicode-security crate can't/doesn't benefit from the unicode-scripts crate's script resolution logic, which is in Rust, when computing the mixed-script-confusables table, and thus (likely by accident) uses a different definition of what mixed-script is -- one based only on the Scripts.txt file.
  • It's NBD -- 'Rust may be emitting a redundant warning in rare circumstances', heh, who cares -- but was fun to trace out.
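
To make the 'resolved script set' mentioned above concrete, here is a minimal sketch of the UTS39 5.1 resolution step, using a tiny hand-written table of augmented script sets. The module name and atoms are illustrative, and the real tables are derived from Scripts.txt and ScriptExtensions.txt rather than written by hand:

defmodule ScriptSetSketch do
  # Augmented script sets for a few codepoints (UTS39 5.1): Han characters are
  # augmented with Japanese and Korean, Katakana with Japanese.
  @augmented %{
    ?力 => MapSet.new([:han, :japanese, :korean]),
    ?カ => MapSet.new([:katakana, :japanese]),
    ?a => MapSet.new([:latin]),
    ?а => MapSet.new([:cyrillic])  # U+0430, the Cyrillic lookalike of 'a'
  }

  # The resolved script set of an identifier is the intersection of the
  # augmented script sets of its characters; an empty intersection means
  # the identifier is mixed-script.
  def resolve(identifier) do
    identifier
    |> String.to_charlist()
    |> Enum.map(&Map.fetch!(@augmented, &1))
    |> Enum.reduce(&MapSet.intersection/2)
  end
end

# ScriptSetSketch.resolve("力カ") gives a set containing only :japanese -- not mixed-script.
# ScriptSetSketch.resolve("аa") gives the empty set -- mixed-script.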

Whitelist of uncommon codepoints

This PR also demonstrates adding a whitelist of math-like symbols; we wanted to add uncommon codepoint protection in a way that wouldn't make it hard to support mathy symbols in the future, so this branch also allows Elixir identifiers and functions to use mathy symbols from that whitelist, like:

defmodule Mathish do
  # rocket-equation-style example: the function name uses a whitelisted math symbol
  def Δv(isp, g, wet, dry), do: isp * g * ln(wet / dry)
end

iex(6)> φ = (1 + :math.sqrt(5)) / 2
1.618033988749895
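
For illustration only, a rough sketch of the kind of membership check such a whitelist implies. The ranges, module, and function names below are hypothetical, not the ones used in this branch:

defmodule MathWhitelistSketch do
  # Hypothetical whitelist: lowercase Greek letters plus a couple of math operators.
  @ranges [0x03B1..0x03C9, 0x2202..0x2202, 0x2211..0x2212]

  def allowed?(codepoint), do: Enum.any?(@ranges, &(codepoint in &1))
end

# MathWhitelistSketch.allowed?(?φ)     #=> true
# MathWhitelistSketch.allowed?(0x2211) #=> true  (∑)
# MathWhitelistSketch.allowed?(?©)     #=> false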

Reference: Rust implementation

The Rust implementation was a very useful reference to read the spec with.

  • If you'd like to do that too, note that it requires looking in 3 repos: unicode-rs/unicode-scripts, unicode-rs/unicode-security (don't overlook scripts/unicode.py in each of those, especially the security one; they're part of the implementation) and in the rust repo, the on-by-default non_ascii_idents.rs lint, which uses the capabilities from those crates to implement those 3 protections for Rust.

Potential next steps

I titled this "reference implementation" because, minimally, it's valuable as an example of how we can add full UTS39 protections.

It's also valuable to me ... for freeing up my 'shop time', since this turned out to be a high-quality rabbit hole! 😆 So maybe I can stop tinkerin' with this soon.

However, if we're interested in moving forward with this PR, the next steps I'd have are as follows:

  • Language design question for Jose/core folks -- rip out the whitelist for now? Or go ahead and expand it, to a v1?
    • If we wanted to expand it, we could add a handful of ranges, using a reference like this Julia listing, without supporting all of those, as there's a wacky amount of stuff there -- like ©, or like the {bold, italic, bold+italic} versions of many math letters, etc. 😄 I'm a bit hamstrung there as I know some people feel very strongly about the utility of mathlike symbols, but that ain't me! (If we wanted to get a nice initial list, I was thinking we could grab the mathiest Julia repos out there and do some unicode regex to find what kinds of chars are most used.)
  • Integration -- is implementing in Tokenizer okay, or do we want to move it around? It made sense to me, but for completeness, the other major way I can imagine us integrating these checks would be as a compilation tracer, which would be a way to let us expand checks to be multi-file. I don't actually think that's necessary (for instance, moving to multi-file would simultaneously tighten the confusability checks, while loosening the mixed-script-confusability checks, as a single non-confusable char in a script in any file in a crate whitelists every usage in Rust).
  • Re-check performance. My intuition is that it should be fine; a previous PR that ran confusability detection on every token tested compiling 5 big repos, 5x each, before and after, and the compilation impact was 1%. Now most checks are avoided, as we only run the Confusable and Mixed-Script-Confusables checks if the more lightweight UncommonCodepoints check tells us that non-ascii characters are present (a rough sketch of that gating follows this list).
  • incorporate PR feedback; the main impl. is like 400-500 lines so there's probably lots of opportunities to tweak things...
    • the tests are against the actual warning output, so I put them in 'warnings_test.exs', but we could move them to the tokenization-related tests instead
  • Strings generally -- ScriptSetResolution could, in the future, be something we could expose for strings. Example: "I want to allow unicode usernames, not just ASCII, but I also don't want those crazy Cthulhu-lookin' unicode hacks!" If we wanted to do that, we'd look at implementing restriction-level detection, i.e. detecting which level, 1-5, an arbitrary string is at (UTS39).
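
To show the gating mentioned in the performance bullet above, here is a minimal sketch, under the assumption that the identifiers of a file are available as strings; the module and function names are placeholders, not the actual tokenizer API:

defmodule GatingSketch do
  def check(identifiers) do
    # cheap scan: did we see any non-ASCII character at all?
    any_non_ascii? =
      Enum.any?(identifiers, fn ident ->
        ident |> String.to_charlist() |> Enum.any?(&(&1 > 127))
      end)

    # the heavier checks only run when the cheap scan found something
    if any_non_ascii? do
      confusable_warnings(identifiers) ++ mixed_script_confusable_warnings(identifiers)
    else
      []
    end
  end

  # Stand-ins for the real checks described in this PR.
  defp confusable_warnings(_identifiers), do: []
  defp mixed_script_confusable_warnings(_identifiers), do: []
end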

@josevalim (Member) commented Jan 15, 2022

Thank you @mrluc, this is amazing! I think the next step is to break this apart into smaller PRs. Here are my general thoughts so far:

  1. Why do we need to allow those specific Math items? Is it because they are not allowed today? My suggestion is that we always perform changes based on Unicode properties, instead of individual characters.

  2. What do you think about simply not allowing any of the restricted identifiers by default? We already exclude all patterns on tokenization; I would also remove all characters with \p{Identifier_Status=Restricted} from Elixir. You can do so by returning 2 here for characters that match said description:

    [" Pattern_Syntax" <> _] -> 2
    If you agree with this, I would start with this in my PR. (EDIT: In a nutshell, get every character in IdentifierType.txt that is not Recommended or Inclusion and remove it altogether from the tokenizer -- a parsing sketch follows this list.)

  3. For mixed scripts, we can also warn/check for them in the tokenizer and upfront. Every time we tokenize an identifier, we can check if Ascii is false, and if so, check for mixed:

    {Kind, Acc, Rest, Length, Ascii, Special} ->
    - this would avoid forcing a pass later. This could be a second PR once the above is done.

  4. Finally, in the same function above, we could store in the State whether ASCII was ever false. If it was, then and only then do we run the confusable checks. This could be the third PR.
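
Regarding suggestion 2 above, here is a rough sketch, under my reading of it, of how the restricted codepoints could be collected from IdentifierType.txt; the file path and module name are illustrative, and the real change would go through Elixir's existing Unicode generation script:

defmodule RestrictedSketch do
  # Collect every codepoint whose IdentifierType is neither Recommended nor
  # Inclusion, i.e. the characters the tokenizer would reject outright.
  def restricted_codepoints(path \\ "IdentifierType.txt") do
    path
    |> File.stream!()
    |> Stream.map(&String.replace(&1, ~r/#.*/, ""))  # strip comments
    |> Stream.map(&String.trim/1)
    |> Stream.reject(&(&1 == ""))
    |> Stream.flat_map(fn line ->
      [codepoints, types] = line |> String.split(";") |> Enum.map(&String.trim/1)
      types = String.split(types)

      if "Recommended" in types or "Inclusion" in types do
        []
      else
        case String.split(codepoints, "..") do
          [single] ->
            [String.to_integer(single, 16)]

          [first, last] ->
            Enum.to_list(String.to_integer(first, 16)..String.to_integer(last, 16))
        end
      end
    end)
    |> Enum.to_list()
  end
end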

@mrluc (Contributor, Author) commented Jan 15, 2022

👍 cool! That breakdown + order of PRs, to integrate into Elixir tokenization, all sounds doable.

  1. early DENY of UncommonCodepoints
  2. early WARN on MixedScriptConfusables -- edit: do MixedScriptConfusables in step 3 instead; d'oh, overlooked that when writing this
  3. track Ascii over all tokens, and only if needed, do the full pass required to WARN on confusability detection

And yes, will just rip out that placeholder 'mathy whitelist' for now -- in the future, it'll be easier to allow 'Technical' categories or somesuch, once we have these uts39 protections in place; probably people doing very-mathy Nx may want that eventually, if they're not already clamoring for it.

@mrluc changed the title from "Draft of implementing Unicode Security (UTS39)" to "Reference implementation of Unicode Security (UTS39)" on Jan 15, 2022
@mrluc closed this on Jan 15, 2022
@josevalim (Member) commented:

Hi @mrluc, I was re-reading this PR after reading UTS39 and I realized that your mixed-script detection considers whether the script was used anywhere else in the file. I think we should not change that, so the next step is for you to track whether any non-ASCII identifier is used and, if so, invoke the Unicode validation. During that validation, we can perform both confusable and mixed-script checks. :)
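
As context for the confusable check mentioned here: UTS39 compares 'skeletons' -- each character is mapped to a prototype from confusables.txt, and two identifiers are confusable when their skeletons are equal. A minimal sketch with a tiny hand-written prototype table (the real table is generated from confusables.txt, and the full algorithm also NFD-normalizes before and after the mapping):

defmodule ConfusableSketch do
  # Two-entry excerpt of the prototype mapping, covering the examples in this PR.
  @prototype %{
    0x01C3 => "!",  # LATIN LETTER RETROFLEX CLICK (ǃ) -> !
    0x0430 => "a"   # CYRILLIC SMALL LETTER A (а) -> a
  }

  def skeleton(identifier) do
    identifier
    |> String.to_charlist()
    |> Enum.map_join(fn cp -> Map.get(@prototype, cp, <<cp::utf8>>) end)
  end

  def confusable?(a, b), do: a != b and skeleton(a) == skeleton(b)
end

# ConfusableSketch.confusable?("assertǃ", "assert!")  #=> true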

@mrluc (Contributor, Author) commented Jan 17, 2022

Awesome, thanks for giving it a look.

your mixed-script detection considers if the script was used anywhere else in the file

Oh, yeah, I guess I overlooked that when responding to your feedback! 🤦 Yeah, I guess both are stateful.

For anyone reading and wondering why it's stateful: the mixed-script detection itself isn't -- that's done by the 'resolved script set' -- and 'mixed-script confusability' isn't inherently stateful either; it's just consulting a lookup of characters that are confusable cross-script. However, stopping there would mean that, for instance, every use of a Latin vowel would be flagged, as they're all potential mixed-script confusables. Clearly a filter is needed, and the standard implies this:

Note that due to the number of mappings provided by the confusables data, the above algorithm is likely to flag a large number of legitimate strings as potential mixed-script confusables.

So our filter for 'potentially problematic mixed-script confusability' uses the same heuristic as Rust: we consider whether there have been any non-confusable uses of a script, i.e. characters from Script X that are clearly from Script X and not some other script -- in our case, within the same file.
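
A minimal sketch of that filter, with toy data and illustrative names (the real implementation works over per-file character data from the tokenizer, and its tables come from Scripts.txt and confusables.txt):

defmodule MixedScriptFilterSketch do
  # Toy data: the script of each character, and whether the character appears
  # in the cross-script confusables table.
  @script %{?а => :cyrillic, ?б => :cyrillic, ?a => :latin, ?d => :latin}
  @cross_script_confusable MapSet.new([?а, ?a])

  # A script is worth warning about only if every one of its characters seen in
  # the file is a cross-script confusable, i.e. it is never used unambiguously.
  def suspicious_scripts(chars_seen_in_file) do
    chars_seen_in_file
    |> Enum.group_by(&Map.fetch!(@script, &1))
    |> Enum.filter(fn {_script, chars} ->
      Enum.all?(chars, &MapSet.member?(@cross_script_confusable, &1))
    end)
    |> Enum.map(fn {script, _chars} -> script end)
  end
end

# Cyrillic appears only via the confusable 'а', so it is flagged:
#   MixedScriptFilterSketch.suspicious_scripts([?а, ?a, ?d])      #=> [:cyrillic]
# Once a clearly-Cyrillic 'б' appears, Cyrillic is no longer flagged:
#   MixedScriptFilterSketch.suspicious_scripts([?а, ?б, ?a, ?d])  #=> []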

👍 re the next PR including both checks, and probably being closer to the reference as a result.

@josevalim (Member) commented:

@mrluc I have been thinking more about this and I wonder if we should rather require character sets to be explicitly opted in? Something like this in your mix.exs:

elixirc_options: [identifier_scriptsets: [:cyrillic, :han, :hiragana]]

This makes everything safe by default but gives the user control too. Compiled files will be restricted to Latin by default, but test files/eval files can use all scriptsets. It is most likely an addition to all other work that needs to be done.

@mrluc (Contributor, Author) commented Feb 10, 2022

@mrluc I have been thinking more about this and I wonder if we should rather require character sets to be explicitly opted in? Something like this in your mix.exs:

elixirc_options: [identifier_scriptsets: [:cyrillic, :han, :hiragana]]

This makes everything safe by default but gives the user control too. Compiled files will be restricted to Latin by default, but test files/eval files can use all scriptsets. It is most likely an addition to all other work that needs to be done.

@josevalim from a safety point of view, this sounds promising as an extension mechanism: for instance, it could provide a way of ensuring that 'only these scripts are used in our codebase' (and dependencies? if my intuition of what elixirc_options does is right, but maybe not) -- which I could see removing the need for the whole-file check along the lines of 'the only use of Script X is via confusable characters...'; if you don't want Script X, don't configure it!

From an implementation point of view, I think I understand when you say 'most likely an addition to all other work that needs to be done' -- I could be wrong though. So this is what it'd most naturally mean to me:

Currently, if it's unicode, in validation we do UTS 39 5.2's 'Highly Restrictive' only, which is (pseudocode):

  if resolve(ident) do
    :ok
  else
    if any?(highly_restrictive_exceptions, &cover?(ident, &1)), do: :ok, else: :mixed_script_err
  end

And with the proposed additional security, in validation we'd do an initial check first (pseudocode):

if cover?(ident, configured_scriptsets) do
  if resolve(ident) do
    :ok
  else
    if any?(highly_restrictive_exceptions, &cover?(ident, &1)), do: :ok, else: :mixed_script_err
  end
else
  :script_not_allowed_by_project_config
end

That would give a configurable mechanism to let programmers of all languages get secure identifiers, maybe with a 1-liner in their config, but is also, as you say, secure by default and lets teams completely rule out scripts that shouldn't appear in their codebase.

@josevalim (Member) commented:

This is all merged now, thank you @mrluc! I am quite happy with the implementation and footprint of C3, so I think we can postpone for now the idea of declaring scriptsets upfront.

I am also interested in supporting some mathematical symbols, but I would like to do so in a structured way: i.e. we support all symbols that belong to a specific category that has been vetted against our Unicode Security practices. If you are aware of what this category might be, or want to investigate it, that would be very appreciated!

Thank you! ❤️
