[llvm-rc] Concatenate consecutive string tokens in windres mode #68685
Conversation
@llvm/pr-subscribers-platform-windows

Author: Martin Storsjö (mstorsjo)

Changes

A number of constructs allow more than one string literal for what represents one single string - version info values, string tables, user data resources, etc. This mostly works like how a C compiler merges consecutive string literals, like "foo" "bar" producing the same as "foobar". This is beneficial for producing strings with some elements provided by the preprocessor.

MS rc.exe only supports this in a few fixed locations (most of which are already supported), while GNU windres supports this essentially anywhere in any string. See b989fcb for one recent change that extended support for this in one specific resource. A reasonable use case for multiple concatenated string literals that GNU windres accepts is `1 ICON DIR "/name.ico"`, where the directory is provided via the preprocessor, expanding to another string literal; this is #51286.

To support this, extend the tokenizer to concatenate consecutive quoted string tokens. The tokenizer only stores StringRefs that point into the input file - including the surrounding quotes and a potential leading L to denote wide strings. In order to concatenate consecutive string tokens into one, we just fuse the StringRef ranges (including whatever whitespace there is between the tokens) and store that within the single string token. At the end, when we are going to output something based on the string contents, we can split the single StringRef into a vector of StringRefs and then loop over them. (A minimal standalone sketch of this fuse-and-split idea follows after the diff below.)

The string tokens are currently used within ResourceFileWriter either via the processString function, outputting UTF16 data to write into the output, or the stripQuotes function, giving a StringRef to the raw contents of the string, e.g. for use for loading a file. Intercept all these cases and pass them via functions that split the multi-string token into one or more strings for each string literal, and then process them one at a time with the existing functions.

This contains tests for the following cases:
This change also allows simplifying the explicit support for multiple string literals that was added earlier for constructs such as version info and string tables, in later cleanup commits.

Full diff: https://github.com/llvm/llvm-project/pull/68685.diff

8 Files Affected:
diff --git a/llvm/test/tools/llvm-rc/Inputs/accelerators-split-token.rc b/llvm/test/tools/llvm-rc/Inputs/accelerators-split-token.rc
new file mode 100644
index 000000000000000..24377e014caf5e3
--- /dev/null
+++ b/llvm/test/tools/llvm-rc/Inputs/accelerators-split-token.rc
@@ -0,0 +1,5 @@
+1 ACCELERATORS
+LANGUAGE 5, 1
+{
+ "^" "a", 18
+}
diff --git a/llvm/test/tools/llvm-rc/Inputs/split-path.rc b/llvm/test/tools/llvm-rc/Inputs/split-path.rc
new file mode 100644
index 000000000000000..4f60b6f78ae15c5
--- /dev/null
+++ b/llvm/test/tools/llvm-rc/Inputs/split-path.rc
@@ -0,0 +1 @@
+100 ICON "subdir" "/icon-new.ico"
diff --git a/llvm/test/tools/llvm-rc/Inputs/tokens.rc b/llvm/test/tools/llvm-rc/Inputs/tokens.rc
index 6a781202a7e3721..7e1c6e923d25e14 100644
--- a/llvm/test/tools/llvm-rc/Inputs/tokens.rc
+++ b/llvm/test/tools/llvm-rc/Inputs/tokens.rc
@@ -4,6 +4,9 @@ identifier-with-dashes
"RC string test.",L"Another RC string test.'&{",42,100
+"concatenated" L"string"
+"token"
+
Block Comment Ident /*block /* // comment */ ifier
Line Comment // Identifier /*
diff --git a/llvm/test/tools/llvm-rc/accelerators-split-token.test b/llvm/test/tools/llvm-rc/accelerators-split-token.test
new file mode 100644
index 000000000000000..9414c787e0f24a2
--- /dev/null
+++ b/llvm/test/tools/llvm-rc/accelerators-split-token.test
@@ -0,0 +1,13 @@
+; RUN: llvm-rc -no-preprocess /FO %t -- %p/Inputs/accelerators-split-token.rc
+; RUN: llvm-readobj %t | FileCheck %s
+
+; CHECK: Resource type (int): ACCELERATOR (ID 9)
+; CHECK-NEXT: Resource name (int): 1
+; CHECK-NEXT: Data version: 0
+; CHECK-NEXT: Memory flags: 0x30
+; CHECK-NEXT: Language ID: 1029
+; CHECK-NEXT: Version (major): 0
+; CHECK-NEXT: Version (minor): 0
+; CHECK-NEXT: Characteristics: 0
+; CHECK-NEXT: Data size: 8
+; CHECK-NEXT: Data:: (80 00 01 00 12 00 00 00)
diff --git a/llvm/test/tools/llvm-rc/split-path.test b/llvm/test/tools/llvm-rc/split-path.test
new file mode 100644
index 000000000000000..adfcfa2af0add57
--- /dev/null
+++ b/llvm/test/tools/llvm-rc/split-path.test
@@ -0,0 +1,6 @@
+; RUN: rm -rf %t
+; RUN: mkdir %t
+; RUN: cd %t
+; RUN: mkdir subdir
+; RUN: cp %p/Inputs/icon-new.ico subdir
+; RUN: llvm-rc -no-preprocess /FO %t/split-path.res -- %p/Inputs/split-path.rc
diff --git a/llvm/test/tools/llvm-rc/tokenizer.test b/llvm/test/tools/llvm-rc/tokenizer.test
index 8486f8bd786909e..4d327e4084b27e8 100644
--- a/llvm/test/tools/llvm-rc/tokenizer.test
+++ b/llvm/test/tools/llvm-rc/tokenizer.test
@@ -35,6 +35,8 @@
; CHECK-NEXT: Int: 42; int value = 42
; CHECK-NEXT: Comma: ,
; CHECK-NEXT: Int: 100; int value = 100
+; CHECK-NEXT: String: "concatenated" L"string"
+; CHECK-NEXT: "token"
; CHECK-NEXT: Identifier: Block
; CHECK-NEXT: Identifier: Comment
; CHECK-NEXT: Identifier: Ident
diff --git a/llvm/tools/llvm-rc/ResourceFileWriter.cpp b/llvm/tools/llvm-rc/ResourceFileWriter.cpp
index dd9db338ece255b..31d42489841f694 100644
--- a/llvm/tools/llvm-rc/ResourceFileWriter.cpp
+++ b/llvm/tools/llvm-rc/ResourceFileWriter.cpp
@@ -109,6 +109,73 @@ static bool stripQuotes(StringRef &Str, bool &IsLongString) {
return true;
}
+static void splitStrings(StringRef Str, SmallVectorImpl<StringRef> &Result) {
+ if (!Str.starts_with_insensitive("L\"") &&
+ !Str.starts_with_insensitive("\"")) {
+ Result.push_back(Str);
+ return;
+ }
+ while (!Str.empty()) {
+ size_t Pos = 0;
+ if (Str.starts_with_insensitive("L"))
+ ++Pos;
+ assert(Pos < Str.size() && Str[Pos] == '"' &&
+ "Strings should be enclosed in quotes.");
+ ++Pos;
+ while (Pos < Str.size()) {
+ if (Str[Pos] == '"') {
+ // Found a quote
+ if (Pos + 1 == Str.size()) {
+ // At the end of the string; enqueue the rest
+ Result.push_back(Str);
+ return;
+ }
+
+ if (Str[Pos + 1] == '"') {
+ // Double quote, skip past it
+ Pos += 2;
+ continue;
+ }
+
+ // Single quote in the middle of a string, terminating the current one.
+ Result.push_back(Str.take_front(Pos + 1));
+ Str = Str.drop_front(Pos + 1);
+ Str = Str.ltrim(); // Skip past whitespace
+ Pos = 0;
+ // Restart the outer loop, starting a new string
+ break;
+ }
+
+ // Not a quote, advance
+ ++Pos;
+ }
+ assert(Pos == 0 && "Should have consumed a string");
+ }
+ llvm_unreachable("Misquoted strings");
+}
+
+static bool extractQuotedStrings(StringRef &Str, SmallString<0> &Buf) {
+ // Extract a StringRef from a parser token that contains one or more
+ // quoted strings.
+ SmallVector<StringRef, 3> SubStrs;
+ splitStrings(Str, SubStrs);
+ bool IsLongString;
+ if (SubStrs.size() == 1) {
+ assert(Str == SubStrs[0]);
+ return stripQuotes(Str, IsLongString);
+ } else {
+ // If the string token consisted of multiple quoted strings, run stripQuotes
+ // on each of them and concatenate the result in the user provided buffer.
+ for (StringRef S : SubStrs) {
+ if (!stripQuotes(S, IsLongString))
+ return false;
+ Buf.append(S);
+ }
+ Str = Buf.str();
+ return true;
+ }
+}
+
static UTF16 cp1252ToUnicode(unsigned char C) {
static const UTF16 Map80[] = {
0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
@@ -360,6 +427,18 @@ static Error processString(StringRef Str, NullHandlingMethod NullHandler,
return Error::success();
}
+static Error processStrings(StringRef Str, NullHandlingMethod NullHandler,
+ SmallVectorImpl<UTF16> &Result, int CodePage) {
+ SmallVector<StringRef, 3> SubStrs;
+ splitStrings(Str, SubStrs);
+ for (const auto &S : SubStrs) {
+ bool IsLongString;
+ RETURN_IF_ERROR(
+ processString(S, NullHandler, IsLongString, Result, CodePage));
+ }
+ return Error::success();
+}
+
uint64_t ResourceFileWriter::writeObject(const ArrayRef<uint8_t> Data) {
uint64_t Result = tell();
FS->write((const char *)Data.begin(), Data.size());
@@ -368,10 +447,8 @@ uint64_t ResourceFileWriter::writeObject(const ArrayRef<uint8_t> Data) {
Error ResourceFileWriter::writeCString(StringRef Str, bool WriteTerminator) {
SmallVector<UTF16, 128> ProcessedString;
- bool IsLongString;
- RETURN_IF_ERROR(processString(Str, NullHandlingMethod::CutAtNull,
- IsLongString, ProcessedString,
- Params.CodePage));
+ RETURN_IF_ERROR(processStrings(Str, NullHandlingMethod::CutAtNull,
+ ProcessedString, Params.CodePage));
for (auto Ch : ProcessedString)
writeInt<uint16_t>(Ch);
if (WriteTerminator)
@@ -400,8 +477,8 @@ void ResourceFileWriter::writeRCInt(RCInt Value) {
}
Error ResourceFileWriter::appendFile(StringRef Filename) {
- bool IsLong;
- stripQuotes(Filename, IsLong);
+ SmallString<0> Buf;
+ extractQuotedStrings(Filename, Buf);
auto File = loadFile(Filename);
if (!File)
@@ -645,8 +722,8 @@ Error ResourceFileWriter::writeSingleAccelerator(
}
StringRef Str = Obj.Event.getString();
- bool IsWide;
- stripQuotes(Str, IsWide);
+ SmallString<0> Buf;
+ extractQuotedStrings(Str, Buf);
if (Str.size() == 0 || Str.size() > 2)
return createAccError(
@@ -706,8 +783,8 @@ Error ResourceFileWriter::writeAcceleratorsBody(const RCResource *Base) {
Error ResourceFileWriter::writeBitmapBody(const RCResource *Base) {
StringRef Filename = cast<BitmapResource>(Base)->BitmapLoc;
- bool IsLong;
- stripQuotes(Filename, IsLong);
+ SmallString<0> Buf;
+ extractQuotedStrings(Filename, Buf);
auto File = loadFile(Filename);
if (!File)
@@ -884,8 +961,8 @@ Error ResourceFileWriter::visitIconOrCursorResource(const RCResource *Base) {
Type = IconCursorGroupType::Cursor;
}
- bool IsLong;
- stripQuotes(FileStr, IsLong);
+ SmallString<0> Buf;
+ extractQuotedStrings(FileStr, Buf);
auto File = loadFile(FileStr);
if (!File)
@@ -1323,10 +1400,9 @@ Error ResourceFileWriter::writeStringTableBundleBody(const RCResource *Base) {
// (which is not null-terminated).
SmallVector<UTF16, 128> Data;
if (Res->Bundle.Data[ID]) {
- bool IsLongString;
for (StringRef S : *Res->Bundle.Data[ID])
- RETURN_IF_ERROR(processString(S, NullHandlingMethod::CutAtDoubleNull,
- IsLongString, Data, Params.CodePage));
+ RETURN_IF_ERROR(processStrings(S, NullHandlingMethod::CutAtDoubleNull,
+ Data, Params.CodePage));
if (AppendNull)
Data.push_back('\0');
}
@@ -1372,21 +1448,25 @@ Error ResourceFileWriter::writeUserDefinedBody(const RCResource *Base) {
continue;
}
- SmallVector<UTF16, 128> ProcessedString;
- bool IsLongString;
- RETURN_IF_ERROR(
- processString(Elem.getString(), NullHandlingMethod::UserResource,
- IsLongString, ProcessedString, Params.CodePage));
+ SmallVector<StringRef, 3> SubStrs;
+ splitStrings(Elem.getString(), SubStrs);
+ for (const auto &S : SubStrs) {
+ SmallVector<UTF16, 128> ProcessedString;
+ bool IsLongString;
+ RETURN_IF_ERROR(processString(S, NullHandlingMethod::UserResource,
+ IsLongString, ProcessedString,
+ Params.CodePage));
+
+ for (auto Ch : ProcessedString) {
+ if (IsLongString) {
+ writeInt(Ch);
+ continue;
+ }
- for (auto Ch : ProcessedString) {
- if (IsLongString) {
- writeInt(Ch);
- continue;
+ RETURN_IF_ERROR(checkNumberFits<uint8_t>(
+ Ch, "Character in narrow string in user-defined resource"));
+ writeInt<uint8_t>(Ch);
}
-
- RETURN_IF_ERROR(checkNumberFits<uint8_t>(
- Ch, "Character in narrow string in user-defined resource"));
- writeInt<uint8_t>(Ch);
}
}
diff --git a/llvm/tools/llvm-rc/ResourceScriptToken.cpp b/llvm/tools/llvm-rc/ResourceScriptToken.cpp
index a8f40abdf8a680c..4422dbc75f8d378 100644
--- a/llvm/tools/llvm-rc/ResourceScriptToken.cpp
+++ b/llvm/tools/llvm-rc/ResourceScriptToken.cpp
@@ -188,6 +188,16 @@ Expected<std::vector<RCToken>> Tokenizer::run() {
return getStringError("Integer invalid or too large: " +
Token.value().str());
}
+ } else if (TokenKind == Kind::String && !Result.empty()) {
+ RCToken &Prev = Result.back();
+ if (Prev.kind() == Kind::String) {
+ StringRef PrevStr = Prev.value();
+ StringRef Concat(PrevStr.data(), Token.value().data() +
+ Token.value().size() -
+ PrevStr.data());
+ Result.pop_back();
+ Token = RCToken(TokenKind, Concat);
+ }
}
Result.push_back(Token);
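A minimal standalone sketch of the fuse-then-split approach described in the PR description above, assuming well-formed quoted string tokens that all point into one contiguous input buffer; the helper names are hypothetical, and unlike the real splitStrings in the diff it ignores escaped `""` pairs and unquoted tokens:

```cpp
// Illustrative sketch only - simplified versions of what Tokenizer::run and
// splitStrings in this patch do.
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"

using namespace llvm;

// Fuse two consecutive string tokens into one StringRef spanning both
// literals and the whitespace between them (both tokens must point into the
// same input buffer).
static StringRef fuseTokens(StringRef Prev, StringRef Next) {
  return StringRef(Prev.data(), Next.data() + Next.size() - Prev.data());
}

// Split a fused token back into the individual quoted literals, so that each
// one can be handed to stripQuotes/processString as before.
static void splitFused(StringRef Str, SmallVectorImpl<StringRef> &Out) {
  while (!Str.empty()) {
    // Skip an optional L prefix and the opening quote.
    size_t Pos = Str.starts_with_insensitive("L") ? 2 : 1;
    while (Pos < Str.size() && Str[Pos] != '"')
      ++Pos;
    Out.push_back(Str.take_front(Pos + 1)); // Includes the closing quote.
    Str = Str.drop_front(Pos + 1).ltrim();  // Skip whitespace between literals.
  }
}
```

With these, two adjacent tokens such as `"foo"` and `L"bar"` fuse into a single StringRef spanning `"foo" L"bar"` in the input buffer, and splitFused later recovers the two original literals for processing.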
This way of storing strings and re-parsing it later seems a bit fragile to me. For example, if I change your accelerators-split-token.rc test to use something like:
Apparently yes - I had thought that it would be mostly robust, as what goes into it should be proper tokens only, but you managed to poke a hole in it :-)
Oops; let me have a look at that ...
Practically speaking - it would most probably be fine; the amounts of data processed by llvm-rc are minuscule anyway. But that would probably be a much bigger refactoring, something I didn't really want to get into for this somewhat small feature.
After thinking more about ways to make this better: you're right, if we'd want this solution to be robust, we would probably need to re-invoke the original tokenizer when splitting the string at the end; that's not very elegant. I thought about if we could extend

I considered if one actually would concatenate the payload of the string literals when we want to append them to the existing token. However that's problematic; we currently need to keep the

One compromise would be to keep the current split logic at the end, but not create a

WDYT about the alternatives above?
The usual behavior (like in C) is that such concatenation promotes the whole string to be a wide string, and testing with windres confirms that it follows that pattern. I used the following test:
windres will interpret

BTW,
Or maybe the parser could take care of that when it reads a string token.
Oh, right, that explains things. Thanks!
So I guess that leaves us with two choices:
Do we have an opinion on this from users of llvm-rc in MSVC-like environments - @zmodem @nico @aganea, or others? Cautiously, I would prefer the former - faithfully emulating
A number of constructs allow more than one string literal for what represents one single string - version info values, string tables, user data resources, etc. This mostly works like how a C compiler merges consecutive string literals, like "foo" "bar" as producing the same as "foobar". This is beneficial for producing strings with some elements provided by the preprocessor.

MS rc.exe only supports this in a few fixed locations (most of which are already supported), while GNU windres supports this essentially anywhere in any string. See b989fcb for one recent change that extended support for this in one specific resource.

A reasonable use case for multiple concatenated string literals that GNU windres accepts is `1 ICON DIR "/name.ico"`, where the directory is provided via the preprocessor, expanding to another string literal; this is llvm#51286.

To support this, extend the tokenizer to concatenate consecutive quoted string tokens, when invoked in windres mode.

For cases where the difference between narrow and wide strings is significant (e.g. in userdata resources), GNU windres treats `L"aaa" "bbb"` the same as `L"aaabbb"`, contrary to rc.exe which treats it as the wide string `"aaa"` followed by the narrow string `"bbb"`. Similarly, windres rejects the sequence `"aaa" L"bbb"`. However, in contexts where the end result actually is UTF16, GNU windres does accept such mismatched string representations. Therefore, it seems clear that GNU windres doesn't do the merging on the string token level.

Many of the existing windres tests happen to use tag-stringtable-basic.rc as test input; this file contains a case of `"wor" L"ld"` which now is rejected by the tokenizer in windres mode - therefore switch those tests, where the actual input file doesn't matter, to a different file.
36df28d to aca0072
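The narrow/wide rule described in the commit message above can be summarized in a small sketch. This is a hypothetical illustration (the helper name and error text are made up, not code from this patch); it just encodes the behavior observed from GNU windres:

```cpp
// Hypothetical sketch of the GNU-windres-style rule described above: a
// leading wide literal makes the whole concatenation wide (L"aaa" "bbb"
// behaves like L"aaabbb"), while a narrow literal followed by a wide one
// ("aaa" L"bbb") is rejected. For more than two literals, fold left to right.
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Error.h"

using namespace llvm;

static Expected<bool> concatenationIsWide(StringRef Prev, StringRef Next) {
  bool PrevWide = Prev.starts_with_insensitive("L\"");
  bool NextWide = Next.starts_with_insensitive("L\"");
  if (!PrevWide && NextWide)
    return createStringError(inconvertibleErrorCode(),
                             "cannot concatenate narrow and wide literals");
  // Either both literals agree, or the wide first literal promotes the result.
  return PrevWide;
}
```

Under that rule, the `"wor" L"ld"` case in tag-stringtable-basic.rc is rejected, which is why the tests using that file as input were switched to a different one in this version of the patch.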
Actually, it turns out that GNU windres doesn't seem to be doing this on the tokenization level - and

And it turns out that we do have such an example among our tests, in tag-stringtable-basic.rc. I did try implementing this - have a look at this commit. I had to update a bunch of those tests that use that input file to use a different one as dummy input.

While this works, I'm not entirely satisfied, as this makes the windres mode error out on a bunch of stuff that GNU windres actually does handle just fine.

So with that, I'm considering a totally different approach; instead of trying to match GNU windres in general by merging string literals everywhere, just handle it in the specific cases we've run into. The cases in #51286 are all about filenames to
For future cross-referencability - the alternative implementation is #68881.
Abandoning this one, as we went with #68881 instead.
Btw, I just recently saw http://www.os2museum.com/wp/from-the-annals-of-preprocessor-hackery/ which touches upon this exact issue, and the lengths some people have gone to in order to work around the fact that MS rc.exe doesn't concatenate consecutive string tokens in general.