-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[SR-331] Generate fix-its for confusable characters #732
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
#define CONFUSABLE(confused, expected) | ||
#endif | ||
|
||
CONFUSABLE(0x2010, 0x2d) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would help to describe the procedure you used to generate this table, so that it can be updated for a new version of Unicode when it comes out.
Please also mention in comments the Unicode version you used this time.
@gribozavr: Just added the version/source to the table. Generating this table was a bit of a messy process, mainly involving a quick Python script that was given the list of punctuation from I could also post the script here if you'd like, but it would take quite a bit of cleanup. :) |
@amackworth We usually have all those tools checked in as a part of the repository, and, where feasible, have them run as a part of the build. Check out In this case, I think it is feasible to run it as a part of the build if you turn the header into a function in a .cpp file that takes a unicode scalar and returns you a replacement or 0 if it is not confusable. This will also improve code size. |
@@ -436,6 +436,9 @@ ERROR(unspaced_unary_operator,sema_nb,none, | |||
|
|||
ERROR(use_unresolved_identifier,sema_nb,none, | |||
"use of unresolved %select{identifier|operator}1 %0", (Identifier, bool)) | |||
WARNING(confusable_character,sema_nb,none, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than include this in two different Diagnostics files, please move it to DiagnosticsCommon.def.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and it should also be a NOTE rather than a WARNING because it should be attached to the previous error.
For bonus points, it would be awesome™ to recover as if the user had typed the operator in question. |
Following @gribozavr's suggestion, the table of confusable characters is now generated by Additionally, as @jrose-apple proposed, the fix-it now "deconfuses" the entire identifier at once, rather than generating errors for each individual possibly-confused character. Thank you for the feedback! |
@amackworth Thanks! |
noPrefix = hexString[2:] | ||
modifiedHex.append((((4 - len(noPrefix)) * "0") + noPrefix).upper()) | ||
|
||
f = open('confusables.txt', 'r') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with open('confusables.txt', 'r') as f:
?
@gribozavr: Done! 😄 |
namespace confusable { | ||
/// Given a UTF-8 codepoint, determines whether it appears on the Unicode | ||
/// specification table of confusable characters and maps to punctuation, | ||
/// and either returns either the expected ASCII character or ~0U. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use 0
as the "not confusable" marker? It would make call site simpler:
if (uint32_t replacement = tryConvertConfusableCharacterToASCII(c)) { ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right! Makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
So, I just fixed the I'm still working on possibly recovering as if the user had typed in the expected expression as @jrose-apple suggested, but I'm not sure how to restart the name lookup process since the (cc @gribozavr) |
Just checking in post-holiday on the status of this PR, and if there's anything else you'd like me to change! In particular, I'd really appreciate any suggestions for how to recover from the warning. (cc @gribozavr and @jrose-apple) |
hi @amackworth, @gribozavr is out on vacation this week, but will be back next week. One comment from me: please do not use gyb for C++ source files. It would be better to use the C preprocessor directly, probably with a ".def" style approach like we do for other things in swift/include. |
@lattner The reason to use gyb in this case is that it allows us to construct the C++ source directly from tables published with the Unicode spec. The .def file approach requires us to check in a separate file derived from the vanilla tables. |
Sorry, haven't had time to look at this properly, but still planning to give you an answer about recovery. |
@gribozavr Ok, that's a nice win, but seriously, please do not use gyb for c++ files. |
@lattner, would an intermediate step of .gyb to .def be good enough? Our build system already supports .gyb, and having to manually update a .def file, even with a script checked into utils/, seems like an unnecessary extra step. |
static bool EncodeToUTF8(unsigned CharValue, | ||
SmallVectorImpl<char> &Result) { | ||
assert(CharValue >= 0x80 && "Single-byte encoding should be already handled"); | ||
bool EncodeToUTF8(unsigned CharValue, SmallVectorImpl<char> &Result) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't used outside the file anymore, so you can leave it alone.
.fixItReplaceChars(getSourceLoc(CurPtr-1), | ||
getSourceLoc(tmp), | ||
expectedChar); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recovering here would mean jumping to the top of the switch with the expected character instead, but I'm not sure we're set up to handle that very well. We probably assume all over the place that all the ASCII characters are only one byte wide. I guess it's okay not to do anything here.
Yes, I think the best thing in this case is to have a python (or whatever) script in swift/utils that is manually run to produce the .def file. The .def file would be checked into the tree. |
on Sun Jan 17 2016, Chris Lattner <notifications-AT-i.8713187.xyz> wrote:
Why is this better than just using gyb? |
Because gyb is a necessary evil used in the stdlib. We want to eliminate its use over time, not spread it to other parts of the code base. More broadly, we have a solution to this sort of problem established in the LLVM community, and we should use that solution. |
We do have a solution, but that solution is TableGen. Not everything fits in a .def file. I'd argue that gyb is better than TableGen. |
Tablegen isn't the only approach. LLVM uses perfect shuffle and other utils that generate a .cpp or .h file that is checked into the tree. Take that approach (but write the script in python if that is what you want) I agree that writing a tblgen backend is the wrong way to go. |
Why is "run a script manually whenever something changes" better than "run a script as part of the build whenever something changes"? |
There is virtue in keeping the build machinery (e.g. cmake goop) as simple as possible, and making builds run fast. I'm not concerned about this in terms of build time, but it is a slippery slope that we should definitely not slide down. |
We already have all the CMake code that supports C++ and gyb, it is required anyway for Swift code that uses gyb. There's nothing to simplify in CMake if C++ code didn't use gyb. |
on Tue Jan 19 2016, Dmitri Gribenko <notifications-AT-i.8713187.xyz> wrote:
Sometimes gyb is the right tool for the job. If switching away from it |
I completely understand that the cmake goop already exists. That doesn't address the slippery slope, nor does it address the desire to make gyb go away entirely. Further, these tables do not change frequently. I do not see a reason to regenerate the output as part of the build process. |
Addresses issue SR-331 by adding warning fix-its for possible Unicode confusions. This check only runs after either the lexer encounters an invalid character or an identifier fails to resolve. (All tests pass with
utils/build-script -RT
on OS X 10.11.)A few possible issues that I'd love feedback on:
Lexer.cpp
such thatTypeCheckConstraints.ccp
could use them. Should these be copied into the latter instead or moved to a shared location?DiagnosticsParse.def
andDiagnosticsSema.def
. Should these also be extracted to another shared location?Finally, given this is my first contribution, I welcome any guidance to match the style of the rest of the project!