[SR-331] Generate fix-its for confusable characters #732

amackworth · 2015-12-22T19:53:25Z

Addresses issue SR-331 by adding warning fix-its for possible Unicode confusions. This check only runs after either the lexer encounters an invalid character or an identifier fails to resolve. (All tests pass with utils/build-script -RT on OS X 10.11.)

A few possible issues that I'd love feedback on:

- This patch includes changes to both the lexer and the type checker. Should I break it into two separate commits?
  - Relatedly, I had to expose two UTF-8 helper functions from Lexer.cpp so that TypeCheckConstraints.cpp could use them. Should these be copied into the latter instead, or moved to a shared location?
  - Additionally, I had to duplicate the warning message across both DiagnosticsParse.def and DiagnosticsSema.def. Should these also be extracted to another shared location?
- I created a new test file, as I wasn't sure that any of the existing ones were close enough in topic.
- These checks rely on some slightly hacky macros. Would there be a better way to refactor?

Finally, given this is my first contribution, I welcome any guidance to match the style of the rest of the project!

gribozavr · 2015-12-22T19:56:34Z

include/swift/Parse/Confusables.def

+#define CONFUSABLE(confused, expected)
+#endif
+
+CONFUSABLE(0x2010, 0x2d)


It would help to describe the procedure you used to generate this table, so that it can be updated for a new version of Unicode when it comes out.

Please also mention in comments the Unicode version you used this time.

amackworth · 2015-12-22T21:54:44Z

@gribozavr: Just added the version/source to the table.

Generating this table was a somewhat messy process. I wrote a quick Python script that takes the list of punctuation from Tokens.def plus a few other basic operators, scans the list at http://www.unicode.org/Public/security/8.0.0/confusables.txt, and outputs the matching pairs. I also had to trim the list manually to avoid collisions.

I could also post the script here if you'd like, but it would take quite a bit of cleanup. :)
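For readers following along, the procedure described above could be sketched roughly as follows. This is not the script used for the patch: the `ASCII_TARGETS` set is a hypothetical stand-in for the punctuation taken from Tokens.def, the function names are invented for illustration, and the sample lines only imitate the semicolon-separated layout of confusables.txt. Collisions are handled here by keeping the first mapping seen, which is an assumption rather than the manual trimming mentioned above.

```python
# Hypothetical stand-in for the punctuation collected from Tokens.def plus
# a few basic operators; the real list is not shown in this thread.
ASCII_TARGETS = set("-+*/=<>!&|^~.,:;?()[]{}")


def parse_confusables(lines, targets=ASCII_TARGETS):
    """Yield (confused, expected) code point pairs whose target is in `targets`."""
    seen = set()
    for line in lines:
        # Drop the trailing comment and surrounding whitespace.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2:
            continue
        source, target = fields[0].split(), fields[1].split()
        # Only single code point to single code point mappings are useful here.
        if len(source) != 1 or len(target) != 1:
            continue
        confused, expected = int(source[0], 16), int(target[0], 16)
        if expected > 0x7F or chr(expected) not in targets:
            continue
        # Skip ASCII sources and keep only the first mapping per character.
        if confused < 0x80 or confused in seen:
            continue
        seen.add(confused)
        yield confused, expected


def format_entries(pairs):
    """Render pairs in the CONFUSABLE(confused, expected) form used by the table."""
    return ["CONFUSABLE(0x{:x}, 0x{:x})".format(c, e) for c, e in pairs]


# Illustrative sample lines in the confusables.txt layout.
SAMPLE = [
    "# sample header comment",
    "2010 ;\t002D ;\tMA\t# HYPHEN -> HYPHEN-MINUS",
    "2025 ;\t002E 002E ;\tMA\t# multi-character target, skipped",
    "0430 ;\t0061 ;\tMA\t# target is a letter, skipped",
]
entries = format_entries(parse_confusables(SAMPLE))
print("\n".join(entries))
```

To run it against the real data, the same function can be fed an open file object, for example `with open('confusables.txt', encoding='utf-8') as f: parse_confusables(f)`.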

gribozavr · 2015-12-23T00:01:53Z

@amackworth We usually have all those tools checked in as a part of the repository, and, where feasible, have them run as a part of the build. Check out ./lib/Basic/UnicodeExtendedGraphemeClusters.cpp.gyb for an example.

In this case, I think it is feasible to run it as a part of the build if you turn the header into a function in a .cpp file that takes a Unicode scalar and returns a replacement, or 0 if the scalar is not confusable. This will also improve code size.
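As a rough illustration of that suggestion, a generator could emit the lookup function as C++ source text from a list of pairs. The sketch below is an assumption about how that might look, not the gyb template from the patch: `emit_lookup_function` is a hypothetical helper, the emitted signature is a guess, and the function name is borrowed from later in this review. Only the 0x2010 to 0x2d pair comes from the table excerpt above.

```python
def emit_lookup_function(pairs, name="tryConvertConfusableCharacterToASCII"):
    """Return C++ source for a function mapping a scalar to its ASCII look-alike.

    The generated function returns 0 when the scalar is not confusable.
    """
    lines = [
        "uint32_t {}(uint32_t codepoint) {{".format(name),
        "  switch (codepoint) {",
    ]
    # Sort so the generated file is stable from one run to the next.
    for confused, expected in sorted(pairs):
        lines.append("  case 0x{:x}: return 0x{:x};".format(confused, expected))
    lines += [
        "  default: return 0;  // not confusable",
        "  }",
        "}",
    ]
    return "\n".join(lines)


source = emit_lookup_function([(0x2010, 0x2D)])
print(source)
```

A switch keeps the table out of every translation unit that would otherwise include the header, which is one plausible reading of the code size remark.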

jrose-apple · 2015-12-23T01:31:31Z

include/swift/AST/DiagnosticsSema.def

@@ -436,6 +436,9 @@ ERROR(unspaced_unary_operator,sema_nb,none,

 ERROR(use_unresolved_identifier,sema_nb,none,
      "use of unresolved %select{identifier|operator}1 %0", (Identifier, bool))
+WARNING(confusable_character,sema_nb,none,


Rather than include this in two different Diagnostics files, please move it to DiagnosticsCommon.def.

Oh, and it should also be a NOTE rather than a WARNING because it should be attached to the previous error.

jrose-apple · 2015-12-23T01:41:26Z

For bonus points, it would be awesome™ to recover as if the user had typed the operator in question.

amackworth · 2015-12-24T21:25:34Z

Following @gribozavr's suggestion, the table of confusable characters is now generated by gyb as part of the build process. (I wasn't sure where to put the confusables.txt file, so it's living in lib/Parse for now.)

Additionally, as @jrose-apple proposed, the fix-it now "deconfuses" the entire identifier at once, rather than generating errors for each individual possibly-confused character.
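The whole-identifier behaviour can be sketched in a few lines. This is a Python illustration of the logic only, not the C++ in the patch: `CONFUSABLES`, `try_convert_confusable`, and `deconfuse` are hypothetical names, and the table holds just the one pair quoted earlier in this thread.

```python
# Hypothetical excerpt of the table; only this pair appears in the thread.
CONFUSABLES = {0x2010: 0x2D}


def try_convert_confusable(codepoint):
    """Return the ASCII replacement for `codepoint`, or 0 if there is none."""
    return CONFUSABLES.get(codepoint, 0)


def deconfuse(text):
    """Return `text` with every confusable replaced, or None if nothing changed."""
    changed = False
    out = []
    for ch in text:
        replacement = try_convert_confusable(ord(ch))
        if replacement:
            out.append(chr(replacement))
            changed = True
        else:
            out.append(ch)
    return "".join(out) if changed else None


print(deconfuse("a\u2010b"))
```

Returning None when nothing changed lets a caller emit a single fix-it for the whole name only when at least one character was actually replaced.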

Thank you for the feedback!

gribozavr · 2015-12-24T21:44:30Z

@amackworth Thanks! utils/UnicodeData is the place for Unicode data files.

gribozavr · 2015-12-24T21:45:40Z

lib/Parse/Confusables.cpp.gyb

+    noPrefix = hexString[2:]
+    modifiedHex.append((((4 - len(noPrefix)) * "0") + noPrefix).upper())
+
+f = open('confusables.txt', 'r')


with open('confusables.txt', 'r') as f: ?

amackworth · 2015-12-24T23:08:11Z

@gribozavr: Done! 😄

gribozavr · 2015-12-24T23:13:47Z

include/swift/Parse/Confusables.h

+namespace confusable {
+  /// Given a UTF-8 codepoint, determines whether it appears on the Unicode
+  /// specification table of confusable characters and maps to punctuation,
+  /// and returns either the expected ASCII character or ~0U.


Why not use 0 as the "not confusable" marker? It would make the call site simpler:

if (uint32_t replacement = tryConvertConfusableCharacterToASCII(c)) { ...

Ah, right! Makes sense.

amackworth · 2015-12-26T04:49:04Z

So, I just fixed the StringRef issue, in addition to realizing that I didn't actually need to export EncodeToUTF8! I also removed the assert at the beginning of said helper to avoid branching, but I'm more than a little worried about that, given that it changes the contractual semantics of the function. Finally, I added a doc string for validateUTF8CharacterAndAdvance.

I'm still working on possibly recovering as if the user had typed in the expected expression as @jrose-apple suggested, but I'm not sure how to restart the name lookup process since the Identifier is immutable from the perspective of swiftSema.

(cc @gribozavr)

amackworth · 2016-01-04T17:34:13Z

Just checking in post-holiday on the status of this PR, and if there's anything else you'd like me to change! In particular, I'd really appreciate any suggestions for how to recover from the warning. (cc @gribozavr and @jrose-apple)

lattner · 2016-01-10T04:40:20Z

Hi @amackworth, @gribozavr is out on vacation this week, but will be back next week.

One comment from me: please do not use gyb for C++ source files. It would be better to use the C preprocessor directly, probably with a ".def" style approach like we do for other things in swift/include.

gribozavr · 2016-01-11T03:12:26Z

@lattner The reason to use gyb in this case is that it allows us to construct the C++ source directly from tables published with the Unicode spec. The .def file approach requires us to check in a separate file derived from the vanilla tables.

jrose-apple · 2016-01-11T17:45:55Z

Sorry, haven't had time to look at this properly, but still planning to give you an answer about recovery.

lattner · 2016-01-11T22:38:36Z

@gribozavr Ok, that's a nice win, but seriously, please do not use gyb for c++ files.

jrose-apple · 2016-01-15T22:56:20Z

@lattner, would an intermediate step of .gyb to .def be good enough? Our build system already supports .gyb, and having to manually update a .def file, even with a script checked into utils/, seems like an unnecessary extra step.

jrose-apple · 2016-01-15T22:57:36Z

lib/Parse/Lexer.cpp

-static bool EncodeToUTF8(unsigned CharValue,
-                         SmallVectorImpl<char> &Result) {
-  assert(CharValue >= 0x80 && "Single-byte encoding should be already handled");
+bool EncodeToUTF8(unsigned CharValue, SmallVectorImpl<char> &Result) {


This isn't used outside the file anymore, so you can leave it alone.

jrose-apple · 2016-01-15T23:09:31Z

lib/Parse/Lexer.cpp

+            .fixItReplaceChars(getSourceLoc(CurPtr-1),
+                               getSourceLoc(tmp),
+                               expectedChar);
+        }


Recovering here would mean jumping to the top of the switch with the expected character instead, but I'm not sure we're set up to handle that very well. We probably assume all over the place that all the ASCII characters are only one byte wide. I guess it's okay not to do anything here.

lattner · 2016-01-17T19:16:47Z

> @lattner, would an intermediate step of .gyb to .def be good enough?

Yes, I think the best thing in this case is to have a python (or whatever) script in swift/utils that is manually run to produce the .def file. The .def file would be checked into the tree.

dabrahams · 2016-01-18T09:04:42Z

On Sun Jan 17 2016, Chris Lattner <notifications-AT-i.8713187.xyz> wrote:

> > @lattner, would an intermediate step of .gyb to .def be good enough?
>
> Yes, I think the best thing in this case is to have a python (or
> whatever) script in swift/utils that is manually run to produce the
> .def file. The .def file would be checked into the tree.

Why is this better than just using gyb?

lattner · 2016-01-18T17:17:33Z

Because gyb is a necessary evil used in the stdlib. We want to eliminate its use over time, not spread it to other parts of the code base.

More broadly, we have a solution to this sort of problem established in the LLVM community, and we should use that solution.

jrose-apple · 2016-01-19T17:59:28Z

We do have a solution, but that solution is TableGen. Not everything fits in a .def file. I'd argue that gyb is better than TableGen.

lattner · 2016-01-19T18:08:51Z

TableGen isn't the only approach. LLVM uses perfect shuffle and other utils that generate a .cpp or .h file that is checked into the tree. Take that approach (but write the script in Python if that is what you want). I agree that writing a TableGen backend is the wrong way to go.

jrose-apple · 2016-01-19T18:14:41Z

Why is "run a script manually whenever something changes" better than "run a script as part of the build whenever something changes"?

lattner · 2016-01-19T18:18:43Z

There is virtue in keeping the build machinery (e.g. cmake goop) as simple as possible, and making builds run fast. I'm not concerned about this in terms of build time, but it is a slippery slope that we should definitely not slide down.

gribozavr · 2016-01-19T19:29:05Z

We already have all the CMake code that supports C++ and gyb; it is required anyway for Swift code that uses gyb. There would be nothing to simplify in CMake if C++ code didn't use gyb.

dabrahams · 2016-01-20T00:21:56Z

On Tue Jan 19 2016, Dmitri Gribenko <notifications-AT-i.8713187.xyz> wrote:

> We already have all the CMake code that supports C++ and gyb, it is
> required anyway for Swift code that uses gyb. There's nothing to
> simplify in CMake if C++ code didn't use gyb.

Sometimes gyb is the right tool for the job. If switching away from it
is going to create maintenance headaches, require infrastructure work,
or make development error-prone, I don't understand why we would switch.

lattner · 2016-01-20T01:03:34Z

I completely understand that the cmake goop already exists. That doesn't address the slippery slope, nor does it address the desire to make gyb go away entirely.

Further, these tables do not change frequently. I do not see a reason to regenerate the output as part of the build process.

gribozavr reviewed Dec 22, 2015

jrose-apple reviewed Dec 23, 2015

gribozavr reviewed Dec 24, 2015

[SR-331] Generate fix-its for confusable characters.

0b0f512

amackworth changed the title ~~[SR-331] Generate fix-its for confusable characters.~~ [SR-331] Generate fix-its for confusable characters Dec 26, 2015

jrose-apple reviewed Jan 15, 2016

robinkunde mentioned this pull request Apr 27, 2017

SR-331: Diagnostic notes and fixits for unicode confusables #9070

Merged

CodaFi closed this May 6, 2017

swift-ci mentioned this pull request Dec 22, 2015

[SR-331] Swift should have fixits for similar looking characters #42953

Closed
