-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[clang] Introduce "binary" StringLiteral for #embed data #127629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
StringLiteral is used as internal data of EmbedExpr and we directly use it as an initializer if a single EmbedExpr appears in the initializer list of a char array. It is fast and convenient, but it is causing problems when string literal character values are checked because #embed data values are within a range [0-2^(char width)] but ordinary StringLiteral is of maybe signed char type. This PR introduces new kind of StringLiteral to hold binary data coming from an embedded resource to mitigate these problems. The new kind of StringLiteral is not assumed to have signed char type. The new kind of StringLiteral also helps to prevent crashes when trying to find StringLiteral token locations since these simply do not exist for binary data. Fixes llvm#119256
@llvm/pr-subscribers-clang Author: Mariya Podchishchaeva (Fznamznon) ChangesStringLiteral is used as internal data of EmbedExpr and we directly use it as an initializer if a single EmbedExpr appears in the initializer list of a char array. It is fast and convenient, but it is causing problems when string literal character values are checked because #embed data values are within a range [0-2^(char width)] but ordinary StringLiteral is of maybe signed char type. Fixes #119256 Full diff: https://github.com/llvm/llvm-project/pull/127629.diff 5 Files Affected:
diff --git a/clang/include/clang/AST/Expr.h b/clang/include/clang/AST/Expr.h
index cd584d9621a22..cf6f63b8711b8 100644
--- a/clang/include/clang/AST/Expr.h
+++ b/clang/include/clang/AST/Expr.h
@@ -1752,7 +1752,8 @@ enum class StringLiteralKind {
UTF8,
UTF16,
UTF32,
- Unevaluated
+ Unevaluated,
+ Binary
};
/// StringLiteral - This represents a string literal expression, e.g. "foo"
@@ -4965,9 +4966,9 @@ class EmbedExpr final : public Expr {
assert(EExpr && CurOffset != ULLONG_MAX &&
"trying to dereference an invalid iterator");
IntegerLiteral *N = EExpr->FakeChildNode;
- StringRef DataRef = EExpr->Data->BinaryData->getBytes();
N->setValue(*EExpr->Ctx,
- llvm::APInt(N->getValue().getBitWidth(), DataRef[CurOffset],
+ llvm::APInt(N->getValue().getBitWidth(),
+ EExpr->Data->BinaryData->getCodeUnit(CurOffset),
N->getType()->isSignedIntegerType()));
// We want to return a reference to the fake child node in the
// EmbedExpr, not the local variable N.
diff --git a/clang/lib/AST/Expr.cpp b/clang/lib/AST/Expr.cpp
index 6f570139630d8..2747480f00d68 100644
--- a/clang/lib/AST/Expr.cpp
+++ b/clang/lib/AST/Expr.cpp
@@ -1104,6 +1104,7 @@ unsigned StringLiteral::mapCharByteWidth(TargetInfo const &Target,
switch (SK) {
case StringLiteralKind::Ordinary:
case StringLiteralKind::UTF8:
+ case StringLiteralKind::Binary:
CharByteWidth = Target.getCharWidth();
break;
case StringLiteralKind::Wide:
@@ -1216,6 +1217,7 @@ void StringLiteral::outputString(raw_ostream &OS) const {
switch (getKind()) {
case StringLiteralKind::Unevaluated:
case StringLiteralKind::Ordinary:
+ case StringLiteralKind::Binary:
break; // no prefix.
case StringLiteralKind::Wide:
OS << 'L';
@@ -1332,11 +1334,17 @@ StringLiteral::getLocationOfByte(unsigned ByteNo, const SourceManager &SM,
const LangOptions &Features,
const TargetInfo &Target, unsigned *StartToken,
unsigned *StartTokenByteOffset) const {
+ // No source location of bytes for binary literals since they don't come from
+ // source.
+ if (getKind() == StringLiteralKind::Binary)
+ return getStrTokenLoc(0);
+
assert((getKind() == StringLiteralKind::Ordinary ||
getKind() == StringLiteralKind::UTF8 ||
getKind() == StringLiteralKind::Unevaluated) &&
"Only narrow string literals are currently supported");
+
// Loop over all of the tokens in this string until we find the one that
// contains the byte we're looking for.
unsigned TokNo = 0;
diff --git a/clang/lib/Parse/ParseInit.cpp b/clang/lib/Parse/ParseInit.cpp
index 63b1d7bd9db53..471b3eaf28287 100644
--- a/clang/lib/Parse/ParseInit.cpp
+++ b/clang/lib/Parse/ParseInit.cpp
@@ -445,7 +445,7 @@ ExprResult Parser::createEmbedExpr() {
Context.MakeIntValue(Str.size(), Context.getSizeType());
QualType ArrayTy = Context.getConstantArrayType(
Ty, ArraySize, nullptr, ArraySizeModifier::Normal, 0);
- return StringLiteral::Create(Context, Str, StringLiteralKind::Ordinary,
+ return StringLiteral::Create(Context, Str, StringLiteralKind::Binary,
false, ArrayTy, StartLoc);
};
diff --git a/clang/lib/Sema/SemaInit.cpp b/clang/lib/Sema/SemaInit.cpp
index 6a76e6d74a4b0..013e57df6615c 100644
--- a/clang/lib/Sema/SemaInit.cpp
+++ b/clang/lib/Sema/SemaInit.cpp
@@ -106,6 +106,7 @@ static StringInitFailureKind IsStringInit(Expr *Init, const ArrayType *AT,
return SIF_None;
[[fallthrough]];
case StringLiteralKind::Ordinary:
+ case StringLiteralKind::Binary:
// char array can be initialized with a narrow string.
// Only allow char x[] = "foo"; not char x[] = L"foo";
if (ElemTy->isCharType())
diff --git a/clang/test/Preprocessor/embed_constexpr.c b/clang/test/Preprocessor/embed_constexpr.c
new file mode 100644
index 0000000000000..e444dfec158b5
--- /dev/null
+++ b/clang/test/Preprocessor/embed_constexpr.c
@@ -0,0 +1,21 @@
+// RUN: %clang_cc1 %s -fsyntax-only --embed-dir=%S/Inputs -verify -std=c23
+
+static constexpr unsigned char data[] = {
+#embed "big_char.txt"
+};
+
+static constexpr char data1[] = {
+#embed "big_char.txt" // expected-error {{constexpr initializer evaluates to 255 which is not exactly representable in type 'const char'}}
+};
+
+static constexpr int data2[] = {
+#embed "big_char.txt"
+};
+
+static constexpr unsigned data3[] = {
+#embed "big_char.txt" suffix(, -1) // expected-error {{constexpr initializer evaluates to -1 which is not exactly representable in type 'const unsigned int'}}
+};
+
+static constexpr int data4[] = {
+#embed "big_char.txt" suffix(, -1)
+};
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
llvm::APInt(N->getValue().getBitWidth(), | ||
EExpr->Data->BinaryData->getCodeUnit(CurOffset), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you change that line (and nothing else) does it still works?
I have a feeling this is enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish it did. This particular line helps for generic case of #embed
inside of an initializer list. The rest that this patch is adding is for #embed "fast path" where we simply put StringLiteral to the initializer list instead of EmbedExpr when a single #embed is used to initialize char array.
When we do that, the iterators of EmbedExpr won't be in use and the fail from #119256 is still in place.
ping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! It also needs a release note for the fix. :-)
In general, I think this seems reasonable, but I'd like confirmation from @cor3ntin given his somewhat recent thinking about unevaluated string literals and where those end up touching the rest of the compiler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay
No changelog because we want to backport?
LGTM, but i think this could use some comments
@@ -1752,7 +1752,8 @@ enum class StringLiteralKind { | |||
UTF8, | |||
UTF16, | |||
UTF32, | |||
Unevaluated | |||
Unevaluated, | |||
Binary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment explaining this is for embed?
I'm sorry it took me a while to understand how this patch works.
(The reason is that this allows us to not "cast" to char in getCodeUnitS()
- which is only used in C23 mode)
Maybe also add a comment in getCodeUnitS
and/or CheckC23ConstexprInitStringLiteral
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some comments in 8ff7d03 . Do you think it turned out helpful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, Thanks!
No changelog because I forgot. Should we backport though? |
This seems simple enough to warrant backporting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@@ -1752,7 +1752,8 @@ enum class StringLiteralKind { | |||
UTF8, | |||
UTF16, | |||
UTF32, | |||
Unevaluated | |||
Unevaluated, | |||
Binary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, Thanks!
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/51/builds/12901 Here is the relevant piece of the build log for the reference
|
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/72/builds/9355 Here is the relevant piece of the build log for the reference
|
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/168/builds/9949 Here is the relevant piece of the build log for the reference
|
Ooops, I'll fix the bots quickly |
The bots are upset by llvm#127629 . Fix that.
The bots are upset by #127629 . Fix that.
…(#132201) The bots are upset by llvm/llvm-project#127629 . Fix that.
StringLiteral is used as internal data of EmbedExpr and we directly use it as an initializer if a single EmbedExpr appears in the initializer list of a char array. It is fast and convenient, but it is causing problems when string literal character values are checked because #embed data values are within a range [0-2^(char width)] but ordinary StringLiteral is of maybe signed char type. This PR introduces new kind of StringLiteral to hold binary data coming from an embedded resource to mitigate these problems. The new kind of StringLiteral is not assumed to have signed char type. The new kind of StringLiteral also helps to prevent crashes when trying to find StringLiteral token locations since these simply do not exist for binary data. Fixes llvm#119256
The bots are upset by llvm#127629 . Fix that.
StringLiteral is used as internal data of EmbedExpr and we directly use it as an initializer if a single EmbedExpr appears in the initializer list of a char array. It is fast and convenient, but it is causing problems when string literal character values are checked because #embed data values are within a range [0-2^(char width)] but ordinary StringLiteral is of maybe signed char type. This PR introduces new kind of StringLiteral to hold binary data coming from an embedded resource to mitigate these problems. The new kind of StringLiteral is not assumed to have signed char type. The new kind of StringLiteral also helps to prevent crashes when trying to find StringLiteral token locations since these simply do not exist for binary data. Fixes llvm#119256
The bots are upset by llvm#127629 . Fix that.
…(#132201) The bots are upset by llvm/llvm-project#127629 . Fix that.
StringLiteral is used as internal data of EmbedExpr and we directly use it as an initializer if a single EmbedExpr appears in the initializer list of a char array. It is fast and convenient, but it is causing problems when string literal character values are checked because #embed data values are within a range [0-2^(char width)] but ordinary StringLiteral is of maybe signed char type.
This PR introduces new kind of StringLiteral to hold binary data coming from an embedded resource to mitigate these problems. The new kind of StringLiteral is not assumed to have signed char type. The new kind of StringLiteral also helps to prevent crashes when trying to find StringLiteral token locations since these simply do not exist for binary data.
Fixes #119256