Skip to content

[SR-331] Generate fix-its for confusable characters #732

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

[SR-331] Generate fix-its for confusable characters #732

wants to merge 1 commit into from

Conversation

amackworth
Copy link

Addresses issue SR-331 by adding warning fix-its for possible Unicode confusions. This check only runs after either the lexer encounters an invalid character or an identifier fails to resolve. (All tests pass with utils/build-script -RT on OS X 10.11.)

A few possible issues that I'd love feedback on:

  • This patch includes changes to both the lexer and type checker. Should I break it into two different commits?
    • Relatedly, I had to expose two UTF-8 helper functions from Lexer.cpp such that TypeCheckConstraints.ccp could use them. Should these be copied into the latter instead or moved to a shared location?
    • Additionally, I had to duplicate the warning message across both DiagnosticsParse.def and DiagnosticsSema.def. Should these also be extracted to another shared location?
  • I created a new test file as I wasn't sure that any of the existing ones were close enough in topic.
  • These checks rely on some slightly hacky macros. Would there be a better way to refactor?

Finally, given this is my first contribution, I welcome any guidance to match the style of the rest of the project!

#define CONFUSABLE(confused, expected)
#endif

CONFUSABLE(0x2010, 0x2d)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would help to describe the procedure you used to generate this table, so that it can be updated for a new version of Unicode when it comes out.

Please also mention in comments the Unicode version you used this time.

@amackworth
Copy link
Author

@gribozavr: Just added the version/source to the table.

Generating this table was a bit of a messy process, mainly involving a quick Python script that was given the list of punctuation from Tokens.def plus a few other basic operators and went through the list from http://www.unicode.org/Public/security/8.0.0/confusables.txt, outputting that list of pairs. I also had to manually trim the list to avoid collisions.

I could also post the script here if you'd like, but it would take quite a bit of cleanup. :)

@gribozavr
Copy link
Contributor

@amackworth We usually have all those tools checked in as a part of the repository, and, where feasible, have them run as a part of the build. Check out ./lib/Basic/UnicodeExtendedGraphemeClusters.cpp.gyb for an example.

In this case, I think it is feasible to run it as a part of the build if you turn the header into a function in a .cpp file that takes a unicode scalar and returns you a replacement or 0 if it is not confusable. This will also improve code size.

@@ -436,6 +436,9 @@ ERROR(unspaced_unary_operator,sema_nb,none,

ERROR(use_unresolved_identifier,sema_nb,none,
"use of unresolved %select{identifier|operator}1 %0", (Identifier, bool))
WARNING(confusable_character,sema_nb,none,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than include this in two different Diagnostics files, please move it to DiagnosticsCommon.def.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and it should also be a NOTE rather than a WARNING because it should be attached to the previous error.

@jrose-apple
Copy link
Contributor

For bonus points, it would be awesome™ to recover as if the user had typed the operator in question.

@amackworth
Copy link
Author

Following @gribozavr's suggestion, the table of confusable characters is now generated by gyb as part of the build process. (I wasn't sure where to put the confusables.txt file, so it's living in lib/Parse for now.)

Additionally, as @jrose-apple proposed, the fix-it now "deconfuses" the entire identifier at once, rather than generating errors for each individual possibly-confused character.

Thank you for the feedback!

@gribozavr
Copy link
Contributor

@amackworth Thanks! utils/UnicodeData is the place for Unicode data files.

noPrefix = hexString[2:]
modifiedHex.append((((4 - len(noPrefix)) * "0") + noPrefix).upper())

f = open('confusables.txt', 'r')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with open('confusables.txt', 'r') as f: ?

@amackworth
Copy link
Author

@gribozavr: Done! 😄

namespace confusable {
/// Given a UTF-8 codepoint, determines whether it appears on the Unicode
/// specification table of confusable characters and maps to punctuation,
/// and either returns either the expected ASCII character or ~0U.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use 0 as the "not confusable" marker? It would make call site simpler:

if (uint32_t replacement = tryConvertConfusableCharacterToASCII(c)) { ...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right! Makes sense.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

@amackworth
Copy link
Author

So, I just fixed the StringRef issue, in addition to realizing that I didn't actually need to export EncodeToUTF8! I also removed the assert at the beginning of said helper to avoid branching, but I'm more than a little worried about that, given that it changes the contractual semantics of the function. Finally, I added a doc string for validateUTF8CharacterAndAdvance.

I'm still working on possibly recovering as if the user had typed in the expected expression as @jrose-apple suggested, but I'm not sure how to restart the name lookup process since the Identifier is immutable from the perspective of swiftSema.

(cc @gribozavr)

@amackworth amackworth changed the title [SR-331] Generate fix-its for confusable characters. [SR-331] Generate fix-its for confusable characters Dec 26, 2015
@amackworth
Copy link
Author

Just checking in post-holiday on the status of this PR, and if there's anything else you'd like me to change! In particular, I'd really appreciate any suggestions for how to recover from the warning. (cc @gribozavr and @jrose-apple)

@lattner
Copy link
Contributor

lattner commented Jan 10, 2016

hi @amackworth, @gribozavr is out on vacation this week, but will be back next week.

One comment from me: please do not use gyb for C++ source files. It would be better to use the C preprocessor directly, probably with a ".def" style approach like we do for other things in swift/include.

@gribozavr
Copy link
Contributor

@lattner The reason to use gyb in this case is that it allows us to construct the C++ source directly from tables published with the Unicode spec. The .def file approach requires us to check in a separate file derived from the vanilla tables.

@jrose-apple
Copy link
Contributor

Sorry, haven't had time to look at this properly, but still planning to give you an answer about recovery.

@lattner
Copy link
Contributor

lattner commented Jan 11, 2016

@gribozavr Ok, that's a nice win, but seriously, please do not use gyb for c++ files.

@jrose-apple
Copy link
Contributor

@lattner, would an intermediate step of .gyb to .def be good enough? Our build system already supports .gyb, and having to manually update a .def file, even with a script checked into utils/, seems like an unnecessary extra step.

static bool EncodeToUTF8(unsigned CharValue,
SmallVectorImpl<char> &Result) {
assert(CharValue >= 0x80 && "Single-byte encoding should be already handled");
bool EncodeToUTF8(unsigned CharValue, SmallVectorImpl<char> &Result) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't used outside the file anymore, so you can leave it alone.

.fixItReplaceChars(getSourceLoc(CurPtr-1),
getSourceLoc(tmp),
expectedChar);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recovering here would mean jumping to the top of the switch with the expected character instead, but I'm not sure we're set up to handle that very well. We probably assume all over the place that all the ASCII characters are only one byte wide. I guess it's okay not to do anything here.

@lattner
Copy link
Contributor

lattner commented Jan 17, 2016

@lattner, would an intermediate step of .gyb to .def be good enough?

Yes, I think the best thing in this case is to have a python (or whatever) script in swift/utils that is manually run to produce the .def file. The .def file would be checked into the tree.

@dabrahams
Copy link
Contributor

on Sun Jan 17 2016, Chris Lattner <notifications-AT-i.8713187.xyz> wrote:

@lattner, would an intermediate step of .gyb to .def be good enough?

Yes, I think the best thing in this case is to have a python (or
whatever) script in swift/utils that is manually run to produce the
.def file. The .def file would be checked into the tree.

Why is this better than just using gyb?

@lattner
Copy link
Contributor

lattner commented Jan 18, 2016

Because gyb is a necessary evil used in the stdlib. We want to eliminate its use over time, not spread it to other parts of the code base.

More broadly, we have a solution to this sort of problem established in the LLVM community, and we should use that solution.

@jrose-apple
Copy link
Contributor

We do have a solution, but that solution is TableGen. Not everything fits in a .def file. I'd argue that gyb is better than TableGen.

@lattner
Copy link
Contributor

lattner commented Jan 19, 2016

Tablegen isn't the only approach. LLVM uses perfect shuffle and other utils that generate a .cpp or .h file that is checked into the tree. Take that approach (but write the script in python if that is what you want) I agree that writing a tblgen backend is the wrong way to go.

@jrose-apple
Copy link
Contributor

Why is "run a script manually whenever something changes" better than "run a script as part of the build whenever something changes"?

@lattner
Copy link
Contributor

lattner commented Jan 19, 2016

There is virtue in keeping the build machinery (e.g. cmake goop) as simple as possible, and making builds run fast. I'm not concerned about this in terms of build time, but it is a slippery slope that we should definitely not slide down.

@gribozavr
Copy link
Contributor

We already have all the CMake code that supports C++ and gyb, it is required anyway for Swift code that uses gyb. There's nothing to simplify in CMake if C++ code didn't use gyb.

@dabrahams
Copy link
Contributor

on Tue Jan 19 2016, Dmitri Gribenko <notifications-AT-i.8713187.xyz> wrote:

We already have all the CMake code that supports C++ and gyb, it is
required anyway for Swift code that uses gyb. There's nothing to
simplify in CMake if C++ code didn't use gyb.

Sometimes gyb is the right tool for the job. If switching away from it
is going to create maintenance headaches, require infrastructure work,
or make development error-prone, I don't understand why we would switch.

@lattner
Copy link
Contributor

lattner commented Jan 20, 2016

I completely understand that the cmake goop already exists. That doesn't address the slippery slope, nor does it address the desire to make gyb go away entirely.

Further, these tables do not change frequently. I do not see a reason to regenerate the output as part of the build process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants