
[llvm-rc] Concatenate consecutive string tokens in windres mode #68685

Closed
wants to merge 1 commit

Conversation

mstorsjo
Member

@mstorsjo mstorsjo commented Oct 10, 2023

A number of constructs allow more than one string literal for what represents a single string - version info values, string tables, user data resources, etc. This mostly works like a C compiler merging consecutive string literals, so that "foo" "bar" produces the same as "foobar". This is beneficial for producing strings with some elements provided by the preprocessor.

MS rc.exe only supports this in a few fixed locations (most of which are already supported), while GNU windres supports this essentially anywhere in any string. See b989fcb for one recent change that extended support for this in one specific resource.

A reasonable use case for multiple concatenated string literals that GNU windres accepts is 1 ICON DIR "/name.ico", where the directory is provided via the preprocessor, expanding to another string literal; this is #51286.

To support this, extend the tokenizer to concatenate consecutive quoted string tokens, when invoked in windres mode.

For cases where the difference between narrow and wide strings is significant (e.g. in userdata resources), GNU windres treats L"aaa" "bbb" the same as L"aaabbb", contrary to rc.exe which treats it as the wide string "aaa" followed by the narrow string "bbb". Similarly, windres rejects the sequence "aaa" L"bbb".

However, in contexts where the end result actually is UTF16, GNU windres does accept such mismatched string representations. Therefore, it seems clear that GNU windres doesn't do the merging on the string token level.

Many of the existing windres tests happen to use tag-stringtable-basic.rc as test input; this file contains a case of "wor" L"ld" which is now rejected by the tokenizer in windres mode - therefore switch those tests, where the actual input file doesn't matter, to a different file.

@mstorsjo mstorsjo added the llvm-tools and platform:windows labels on Oct 10, 2023
@mstorsjo
Member Author

CC @jeremyd2019 @mati865

@llvmbot
Member

llvmbot commented Oct 10, 2023

@llvm/pr-subscribers-platform-windows

Author: Martin Storsjö (mstorsjo)

Changes

A number of constructs allow more than one string literal for what represents a single string - version info values, string tables, user data resources, etc. This mostly works like a C compiler merging consecutive string literals, so that "foo" "bar" produces the same as "foobar". This is beneficial for producing strings with some elements provided by the preprocessor.

MS rc.exe only supports this in a few fixed locations (most of which are already supported), while GNU windres supports this essentially anywhere in any string. See b989fcb for one recent change that extended support for this in one specific resource.

A reasonable use case for multiple concatenated string literals that GNU windres accepts is 1 ICON DIR "/name.ico", where the directory is provided via the preprocessor, expanding to another string literal; this is #51286.

To support this, extend the tokenizer to concatenate consecutive quoted string tokens.

The tokenizer only stores StringRefs that point into the input file - including the surrounding quotes and potential leading L to denote wide strings. In order to concatenate consecutive string tokens into one, we just fuse the StringRef ranges (including whatever whitespace there is between the tokens) and store that within the single string token.
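
For illustration only (not part of the PR's diff), here is a minimal standalone sketch of the range-fusing idea, with std::string_view standing in for llvm::StringRef; since both tokens point into the same input buffer, a single view spanning from the start of the first literal to the end of the second (including the whitespace between them) is well-defined:

// Standalone sketch: fuse two views into the same buffer into one view that
// covers both tokens and the whitespace between them.
#include <cassert>
#include <iostream>
#include <string_view>

int main() {
  std::string_view Input = R"("concatenated"  L"string")";
  std::string_view First = Input.substr(0, 14);  // "concatenated"
  std::string_view Second = Input.substr(16, 9); // L"string"

  // Keep the start of the first token, extend to the end of the second.
  std::string_view Fused(First.data(),
                         Second.data() + Second.size() - First.data());
  assert(Fused == Input);
  std::cout << Fused << '\n'; // prints: "concatenated"  L"string"
}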

At the end, when we are going to output something based on the string contents, we can split the single StringRef into a vector of StringRefs and then loop over them. The string tokens are currently used within ResourceFileWriter either via the processString function, outputting UTF16 data to write into the output, or the stripQuotes function, giving a StringRef to the raw contents of the string, e.g. for loading a file.

Intercept all these cases and pass them via functions that split the multi-string token into its individual string literals, and then process them one at a time with the existing functions.

This contains tests for the following cases:

  • A tokenizer test
  • A test of including an icon from a path built from multiple tokens, which is the actual intended use case.
  • A test of an accelerator string provided as multiple strings. (This isn't a case that users would reasonably use, but due to the tokenizer change, ResourceFileWriter::writeSingleAccelerator could now encounter such input data.)

This change also allows simplifying the previously added explicit support for multiple string literals in constructs such as version info and string tables, in later cleanup commits.


Full diff: https://github.com/llvm/llvm-project/pull/68685.diff

8 Files Affected:

  • (added) llvm/test/tools/llvm-rc/Inputs/accelerators-split-token.rc (+5)
  • (added) llvm/test/tools/llvm-rc/Inputs/split-path.rc (+1)
  • (modified) llvm/test/tools/llvm-rc/Inputs/tokens.rc (+3)
  • (added) llvm/test/tools/llvm-rc/accelerators-split-token.test (+13)
  • (added) llvm/test/tools/llvm-rc/split-path.test (+6)
  • (modified) llvm/test/tools/llvm-rc/tokenizer.test (+2)
  • (modified) llvm/tools/llvm-rc/ResourceFileWriter.cpp (+108-28)
  • (modified) llvm/tools/llvm-rc/ResourceScriptToken.cpp (+10)
diff --git a/llvm/test/tools/llvm-rc/Inputs/accelerators-split-token.rc b/llvm/test/tools/llvm-rc/Inputs/accelerators-split-token.rc
new file mode 100644
index 000000000000000..24377e014caf5e3
--- /dev/null
+++ b/llvm/test/tools/llvm-rc/Inputs/accelerators-split-token.rc
@@ -0,0 +1,5 @@
+1 ACCELERATORS
+LANGUAGE 5, 1
+{
+  "^" "a", 18
+}
diff --git a/llvm/test/tools/llvm-rc/Inputs/split-path.rc b/llvm/test/tools/llvm-rc/Inputs/split-path.rc
new file mode 100644
index 000000000000000..4f60b6f78ae15c5
--- /dev/null
+++ b/llvm/test/tools/llvm-rc/Inputs/split-path.rc
@@ -0,0 +1 @@
+100 ICON "subdir" "/icon-new.ico"
diff --git a/llvm/test/tools/llvm-rc/Inputs/tokens.rc b/llvm/test/tools/llvm-rc/Inputs/tokens.rc
index 6a781202a7e3721..7e1c6e923d25e14 100644
--- a/llvm/test/tools/llvm-rc/Inputs/tokens.rc
+++ b/llvm/test/tools/llvm-rc/Inputs/tokens.rc
@@ -4,6 +4,9 @@ identifier-with-dashes
 
 "RC string test.",L"Another RC string test.'&{",42,100
 
+"concatenated"  L"string"
+"token"
+
 Block Comment Ident /*block /* // comment */ ifier
 
 Line Comment // Identifier /*
diff --git a/llvm/test/tools/llvm-rc/accelerators-split-token.test b/llvm/test/tools/llvm-rc/accelerators-split-token.test
new file mode 100644
index 000000000000000..9414c787e0f24a2
--- /dev/null
+++ b/llvm/test/tools/llvm-rc/accelerators-split-token.test
@@ -0,0 +1,13 @@
+; RUN: llvm-rc -no-preprocess /FO %t -- %p/Inputs/accelerators-split-token.rc
+; RUN: llvm-readobj %t | FileCheck %s
+
+; CHECK: Resource type (int): ACCELERATOR (ID 9)
+; CHECK-NEXT: Resource name (int): 1
+; CHECK-NEXT: Data version: 0
+; CHECK-NEXT: Memory flags: 0x30
+; CHECK-NEXT: Language ID: 1029
+; CHECK-NEXT: Version (major): 0
+; CHECK-NEXT: Version (minor): 0
+; CHECK-NEXT: Characteristics: 0
+; CHECK-NEXT: Data size: 8
+; CHECK-NEXT: Data:: (80 00 01 00 12 00 00 00)
diff --git a/llvm/test/tools/llvm-rc/split-path.test b/llvm/test/tools/llvm-rc/split-path.test
new file mode 100644
index 000000000000000..adfcfa2af0add57
--- /dev/null
+++ b/llvm/test/tools/llvm-rc/split-path.test
@@ -0,0 +1,6 @@
+; RUN: rm -rf %t
+; RUN: mkdir %t
+; RUN: cd %t
+; RUN: mkdir subdir
+; RUN: cp %p/Inputs/icon-new.ico subdir
+; RUN: llvm-rc -no-preprocess /FO %t/split-path.res -- %p/Inputs/split-path.rc
diff --git a/llvm/test/tools/llvm-rc/tokenizer.test b/llvm/test/tools/llvm-rc/tokenizer.test
index 8486f8bd786909e..4d327e4084b27e8 100644
--- a/llvm/test/tools/llvm-rc/tokenizer.test
+++ b/llvm/test/tools/llvm-rc/tokenizer.test
@@ -35,6 +35,8 @@
 ; CHECK-NEXT:  Int: 42; int value = 42
 ; CHECK-NEXT:  Comma: ,
 ; CHECK-NEXT:  Int: 100; int value = 100
+; CHECK-NEXT:  String: "concatenated"  L"string"
+; CHECK-NEXT:  "token"
 ; CHECK-NEXT:  Identifier: Block
 ; CHECK-NEXT:  Identifier: Comment
 ; CHECK-NEXT:  Identifier: Ident
diff --git a/llvm/tools/llvm-rc/ResourceFileWriter.cpp b/llvm/tools/llvm-rc/ResourceFileWriter.cpp
index dd9db338ece255b..31d42489841f694 100644
--- a/llvm/tools/llvm-rc/ResourceFileWriter.cpp
+++ b/llvm/tools/llvm-rc/ResourceFileWriter.cpp
@@ -109,6 +109,73 @@ static bool stripQuotes(StringRef &Str, bool &IsLongString) {
   return true;
 }
 
+static void splitStrings(StringRef Str, SmallVectorImpl<StringRef> &Result) {
+  if (!Str.starts_with_insensitive("L\"") &&
+      !Str.starts_with_insensitive("\"")) {
+    Result.push_back(Str);
+    return;
+  }
+  while (!Str.empty()) {
+    size_t Pos = 0;
+    if (Str.starts_with_insensitive("L"))
+      ++Pos;
+    assert(Pos < Str.size() && Str[Pos] == '"' &&
+           "Strings should be enclosed in quotes.");
+    ++Pos;
+    while (Pos < Str.size()) {
+      if (Str[Pos] == '"') {
+        // Found a quote
+        if (Pos + 1 == Str.size()) {
+          // At the end of the string; enqueue the rest
+          Result.push_back(Str);
+          return;
+        }
+
+        if (Str[Pos + 1] == '"') {
+          // Double quote, skip past it
+          Pos += 2;
+          continue;
+        }
+
+        // Single quote in the middle of a string, terminating the current one.
+        Result.push_back(Str.take_front(Pos + 1));
+        Str = Str.drop_front(Pos + 1);
+        Str = Str.ltrim(); // Skip past whitespace
+        Pos = 0;
+        // Restart the outer loop, starting a new string
+        break;
+      }
+
+      // Not a quote, advance
+      ++Pos;
+    }
+    assert(Pos == 0 && "Should have consumed a string");
+  }
+  llvm_unreachable("Misquoted strings");
+}
+
+static bool extractQuotedStrings(StringRef &Str, SmallString<0> &Buf) {
+  // Extract a StringRef from a parser token that contains one or more
+  // quoted strings.
+  SmallVector<StringRef, 3> SubStrs;
+  splitStrings(Str, SubStrs);
+  bool IsLongString;
+  if (SubStrs.size() == 1) {
+    assert(Str == SubStrs[0]);
+    return stripQuotes(Str, IsLongString);
+  } else {
+    // If the string token consisted of multiple quoted strings, run stripQuotes
+    // on each of them and concatenate the result in the user provided buffer.
+    for (StringRef S : SubStrs) {
+      if (!stripQuotes(S, IsLongString))
+        return false;
+      Buf.append(S);
+    }
+    Str = Buf.str();
+    return true;
+  }
+}
+
 static UTF16 cp1252ToUnicode(unsigned char C) {
   static const UTF16 Map80[] = {
       0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
@@ -360,6 +427,18 @@ static Error processString(StringRef Str, NullHandlingMethod NullHandler,
   return Error::success();
 }
 
+static Error processStrings(StringRef Str, NullHandlingMethod NullHandler,
+                            SmallVectorImpl<UTF16> &Result, int CodePage) {
+  SmallVector<StringRef, 3> SubStrs;
+  splitStrings(Str, SubStrs);
+  for (const auto &S : SubStrs) {
+    bool IsLongString;
+    RETURN_IF_ERROR(
+        processString(S, NullHandler, IsLongString, Result, CodePage));
+  }
+  return Error::success();
+}
+
 uint64_t ResourceFileWriter::writeObject(const ArrayRef<uint8_t> Data) {
   uint64_t Result = tell();
   FS->write((const char *)Data.begin(), Data.size());
@@ -368,10 +447,8 @@ uint64_t ResourceFileWriter::writeObject(const ArrayRef<uint8_t> Data) {
 
 Error ResourceFileWriter::writeCString(StringRef Str, bool WriteTerminator) {
   SmallVector<UTF16, 128> ProcessedString;
-  bool IsLongString;
-  RETURN_IF_ERROR(processString(Str, NullHandlingMethod::CutAtNull,
-                                IsLongString, ProcessedString,
-                                Params.CodePage));
+  RETURN_IF_ERROR(processStrings(Str, NullHandlingMethod::CutAtNull,
+                                 ProcessedString, Params.CodePage));
   for (auto Ch : ProcessedString)
     writeInt<uint16_t>(Ch);
   if (WriteTerminator)
@@ -400,8 +477,8 @@ void ResourceFileWriter::writeRCInt(RCInt Value) {
 }
 
 Error ResourceFileWriter::appendFile(StringRef Filename) {
-  bool IsLong;
-  stripQuotes(Filename, IsLong);
+  SmallString<0> Buf;
+  extractQuotedStrings(Filename, Buf);
 
   auto File = loadFile(Filename);
   if (!File)
@@ -645,8 +722,8 @@ Error ResourceFileWriter::writeSingleAccelerator(
   }
 
   StringRef Str = Obj.Event.getString();
-  bool IsWide;
-  stripQuotes(Str, IsWide);
+  SmallString<0> Buf;
+  extractQuotedStrings(Str, Buf);
 
   if (Str.size() == 0 || Str.size() > 2)
     return createAccError(
@@ -706,8 +783,8 @@ Error ResourceFileWriter::writeAcceleratorsBody(const RCResource *Base) {
 
 Error ResourceFileWriter::writeBitmapBody(const RCResource *Base) {
   StringRef Filename = cast<BitmapResource>(Base)->BitmapLoc;
-  bool IsLong;
-  stripQuotes(Filename, IsLong);
+  SmallString<0> Buf;
+  extractQuotedStrings(Filename, Buf);
 
   auto File = loadFile(Filename);
   if (!File)
@@ -884,8 +961,8 @@ Error ResourceFileWriter::visitIconOrCursorResource(const RCResource *Base) {
     Type = IconCursorGroupType::Cursor;
   }
 
-  bool IsLong;
-  stripQuotes(FileStr, IsLong);
+  SmallString<0> Buf;
+  extractQuotedStrings(FileStr, Buf);
   auto File = loadFile(FileStr);
 
   if (!File)
@@ -1323,10 +1400,9 @@ Error ResourceFileWriter::writeStringTableBundleBody(const RCResource *Base) {
     // (which is not null-terminated).
     SmallVector<UTF16, 128> Data;
     if (Res->Bundle.Data[ID]) {
-      bool IsLongString;
       for (StringRef S : *Res->Bundle.Data[ID])
-        RETURN_IF_ERROR(processString(S, NullHandlingMethod::CutAtDoubleNull,
-                                      IsLongString, Data, Params.CodePage));
+        RETURN_IF_ERROR(processStrings(S, NullHandlingMethod::CutAtDoubleNull,
+                                       Data, Params.CodePage));
       if (AppendNull)
         Data.push_back('\0');
     }
@@ -1372,21 +1448,25 @@ Error ResourceFileWriter::writeUserDefinedBody(const RCResource *Base) {
       continue;
     }
 
-    SmallVector<UTF16, 128> ProcessedString;
-    bool IsLongString;
-    RETURN_IF_ERROR(
-        processString(Elem.getString(), NullHandlingMethod::UserResource,
-                      IsLongString, ProcessedString, Params.CodePage));
+    SmallVector<StringRef, 3> SubStrs;
+    splitStrings(Elem.getString(), SubStrs);
+    for (const auto &S : SubStrs) {
+      SmallVector<UTF16, 128> ProcessedString;
+      bool IsLongString;
+      RETURN_IF_ERROR(processString(S, NullHandlingMethod::UserResource,
+                                    IsLongString, ProcessedString,
+                                    Params.CodePage));
+
+      for (auto Ch : ProcessedString) {
+        if (IsLongString) {
+          writeInt(Ch);
+          continue;
+        }
 
-    for (auto Ch : ProcessedString) {
-      if (IsLongString) {
-        writeInt(Ch);
-        continue;
+        RETURN_IF_ERROR(checkNumberFits<uint8_t>(
+            Ch, "Character in narrow string in user-defined resource"));
+        writeInt<uint8_t>(Ch);
       }
-
-      RETURN_IF_ERROR(checkNumberFits<uint8_t>(
-          Ch, "Character in narrow string in user-defined resource"));
-      writeInt<uint8_t>(Ch);
     }
   }
 
diff --git a/llvm/tools/llvm-rc/ResourceScriptToken.cpp b/llvm/tools/llvm-rc/ResourceScriptToken.cpp
index a8f40abdf8a680c..4422dbc75f8d378 100644
--- a/llvm/tools/llvm-rc/ResourceScriptToken.cpp
+++ b/llvm/tools/llvm-rc/ResourceScriptToken.cpp
@@ -188,6 +188,16 @@ Expected<std::vector<RCToken>> Tokenizer::run() {
         return getStringError("Integer invalid or too large: " +
                               Token.value().str());
       }
+    } else if (TokenKind == Kind::String && !Result.empty()) {
+      RCToken &Prev = Result.back();
+      if (Prev.kind() == Kind::String) {
+        StringRef PrevStr = Prev.value();
+        StringRef Concat(PrevStr.data(), Token.value().data() +
+                                             Token.value().size() -
+                                             PrevStr.data());
+        Result.pop_back();
+        Token = RCToken(TokenKind, Concat);
+      }
     }
 
     Result.push_back(Token);

@cjacek
Contributor

cjacek commented Oct 10, 2023

This way of storing strings and re-parsing it later seems a bit fragile to me. For example, if I change your accelerators-split-token.rc test to use something like: "^" /* "b" */ "a", 18, it will crash on assert. Is memory allocation in the tokenizer a no-go?

@mstorsjo
Member Author

This way of storing strings and re-parsing it later seems a bit fragile to me.

Apparently so - I had thought it would be mostly robust, as what goes into it should only be proper tokens, but apparently you managed to poke a hole in it :-)

For example, if I change your accelerators-split-token.rc test to use something like: "^" /* "b" */ "a", 18, it will crash on assert.

Oops; let me have a look at that...

Is memory allocation in the tokenizer a no-go?

Practically speaking - it would most probably be fine; the amounts of data processed by llvm-rc are minuscule anyway. But that would probably be a much bigger refactoring, something I didn't really want to get into for this somewhat small feature.

@mstorsjo
Member Author

Is memory allocation in the tokenizer a no-go?

Practically speaking - it would most probably be fine; the amounts of data processed by llvm-rc are minuscule anyway. But that would probably be a much bigger refactoring, something I didn't really want to get into for this somewhat small feature.

After thinking more about ways to make this better: you're right, if we want this solution to be robust, we would probably need to re-invoke the original tokenizer when splitting the string at the end; that's not very elegant.

I thought about whether we could extend RCToken to contain a vector of StringRefs - but I think that would become quite messy for everything that isn't a string token. What do you think?

I considered actually concatenating the payload of the string literals when appending them to the existing token. However, that's problematic; we currently need to keep the L"" or "" around the strings until the end, since we need to know whether the data was defined as a long string or not.

One compromise would be to keep the current split logic at the end, but instead of creating a StringRef that spans the input file across all the sections, allocate an extra string buffer and append the fully quoted string literals there. It's essentially the same solution as the current PR, but with only controlled whitespace between the string literals. It still requires doing some amount of parsing at the end to split them, but it should be at least marginally less brittle than what I have right now.
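
As a rough illustration of this compromise (a hypothetical helper, not code from this PR), the buffer-based concatenation could look roughly like this, keeping the quotes and any leading L of each literal and inserting exactly one space between them:

// Sketch: copy each fully quoted literal into an owned buffer, separated by a
// single controlled space, so the later split never has to skip comments or
// arbitrary whitespace from the input file.
#include <string>
#include <string_view>
#include <vector>

std::string concatQuotedLiterals(const std::vector<std::string_view> &Literals) {
  std::string Buf;
  for (std::string_view Lit : Literals) {
    if (!Buf.empty())
      Buf += ' ';      // controlled separator between literals
    Buf.append(Lit);   // keep surrounding quotes and any leading L
  }
  return Buf;
}

// concatQuotedLiterals({R"("foo")", R"(L"bar")"}) yields: "foo" L"bar"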

WDYT about the alternatives above?

@cjacek
Contributor

cjacek commented Oct 11, 2023

However, that's problematic; we currently need to keep the L"" or "" around the strings until the end, since we need to know whether the data was defined as a long string or not.

The usual behavior (as in C) is that such concatenation promotes the whole string to a wide string, and testing with windres confirms that it follows that pattern. I used the following test:

1 test
{
   "aaaa",
   L"bbbb" "cccc",
}

windres will interpret "aaaa" as a narrow string, but L"bbbb" "cccc" as a wide string, the same as L"bbbbcccc". Implementing that behavior seems right to me, and it means that the concatenation in the tokenizer would not need to preserve the 'L' part of each component separately and should be able to just concatenate the payload.

BTW, "bbbb" L"cccc" is considered a syntax error by windres.

@cjacek
Contributor

cjacek commented Oct 11, 2023

Or maybe the parser could take care of that when it reads a string token.

@mstorsjo
Member Author

The usual behavior (as in C) is that such concatenation promotes the whole string to a wide string, and testing with windres confirms that it follows that pattern. I used the following test:

1 test
{
   "aaaa",
   L"bbbb" "cccc",
}

windres will interpret "aaaa" as a narrow string, but L"bbbb" "cccc" as a wide string, the same as L"bbbbcccc". Implementing that behavior seems right to me, and it means that the concatenation in the tokenizer would not need to preserve the 'L' part of each component separately and should be able to just concatenate the payload.

BTW, "bbbb" L"cccc" is considered a syntax error by windres.

Oh, right, that explains things. Thanks!

rc.exe, which generally doesn't concatenate string tokens but only allows consecutive string literals in some contexts, interprets it differently; with your example, in L"bbbb" "cccc", the "cccc" ends up as a narrow string. So to achieve the GNU windres behaviour, we don't also need to preserve the rc.exe behaviour - and then this becomes much easier to implement, just as you're suggesting. Thanks!
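
To make the rule concrete, here is a hedged sketch of the merging behaviour described above (a hypothetical helper, not the code in this PR): only the leading literal may carry the L prefix, a later L-prefixed literal is rejected, and the payloads are simply concatenated, staying wide if the first literal was wide.

// Sketch of the windres-style promotion rule for concatenated literals.
#include <optional>
#include <string>
#include <string_view>
#include <vector>

struct MergedString {
  bool IsWide = false;
  std::string Payload; // contents without quotes or the L prefix
};

std::optional<MergedString> mergeLiterals(std::vector<std::string_view> Lits) {
  MergedString Out;
  for (size_t I = 0; I < Lits.size(); ++I) {
    std::string_view Lit = Lits[I];
    bool Wide = !Lit.empty() && (Lit.front() == 'L' || Lit.front() == 'l');
    if (Wide) {
      if (I != 0)
        return std::nullopt; // "aaa" L"bbb": rejected, like windres
      Out.IsWide = true;
      Lit.remove_prefix(1);
    }
    // Strip the surrounding quotes; real code would also unescape "" pairs.
    Lit.remove_prefix(1);
    Lit.remove_suffix(1);
    Out.Payload.append(Lit);
  }
  return Out;
}

// mergeLiterals({R"(L"bbbb")", R"("cccc")"}) -> wide "bbbbcccc"
// mergeLiterals({R"("bbbb")", R"(L"cccc")"}) -> std::nullopt (syntax error)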

So I guess that leaves us with two choices:

  • We add a windres/rc distinction into the general parsing process in llvm-rc, and emulate both behaviours. (Currently we have the rc.exe-style behaviour, so we'd just need to make the new token merging optional.) It's slightly sad that we'd have to do this, as we've managed without it so far, but it's probably not a big deal.
  • We conclude that L"bbbb" "cccc" actually meaning a narrow "cccc" in rc.exe, or conversely being able to write "bbbb" L"cccc" as is possible with rc.exe (and llvm-rc right now), is a theoretical corner case, and that probably nobody is doing that in practice.

Do we have an opinion on this from users of llvm-rc in MSVC-like environments - @zmodem @nico @aganea, or others?

Cautiously, I would prefer the former - faithfully emulating rc.exe's quirks when invoked as llvm-rc, even if the corner cases are kinda theoretical.

A number of constructs allow more than one string literal
for what represents one single string - version info values,
string tables, user data resources, etc. This mostly works
like how a C compiler merges consecutive string literals,
like "foo" "bar" as producing the same as "foobar". This is
beneficial for producing strings with some elements provided
by the preprocessor.

MS rc.exe only supports this in a few fixed locations (most of
which are already supported), while GNU windres supports this
essentially anywhere in any string. See
b989fcb for one recent change
that extended support for this in one specific resource.

A reasonable use case for multiple concatenated string literals
that GNU windres accepts is `1 ICON DIR "/name.ico"`, where the
directory is provided via the preprocessor, expanding to another
string literal; this is llvm#51286.

To support this, extend the tokenizer to concatenate consecutive
quoted string tokens, when invoked in windres mode.

For cases where the difference between narrow and wide strings
is significant (e.g. in userdata resources), GNU windres treats
`L"aaa" "bbb"` the same as `L"aaabbb"`, contrary to rc.exe
which treats it as the wide string `"aaa"` followed by the
narrow string `"bbb"`. Similarly, windres rejects the sequence
`"aaa" L"bbb"`.

However, in contexts where the end result actually is UTF16,
GNU windres does accept such mismatched string representations.
Therefore, it seems clear that GNU windres doesn't do the
merging on the string token level.

Many of the existing windres tests happen to use
tag-stringtable-basic.rc as test input; this file contains a
case of `"wor" L"ld"` which now is rejected by the tokenizer in
windres mode - therefore switch those tests, where the actual input
file doesn't matter, to a different file.
@mstorsjo mstorsjo force-pushed the llvm-rc-concat-string-tokens branch from 36df28d to aca0072 on October 12, 2023 at 12:04
@mstorsjo mstorsjo changed the title [llvm-rc] Concatenate consecutive string tokens [llvm-rc] Concatenate consecutive string tokens in windres mode Oct 12, 2023
@mstorsjo
Member Author

The usual behavior (as in C) is that such concatenation promotes the whole string to a wide string, and testing with windres confirms that it follows that pattern. I used the following test:

1 test
{
   "aaaa",
   L"bbbb" "cccc",
}

windres will interpret "aaaa" as a narrow string, but L"bbbb" "cccc" as a wide string, the same as L"bbbbcccc". Implementing that behavior seems right to me, and it means that the concatenation in the tokenizer would not need to preserve the 'L' part of each component separately and should be able to just concatenate the payload.

BTW, "bbbb" L"cccc" is considered a syntax error by windres.

Actually, it turns out that GNU windres doesn't seem to be doing this on the tokenization level - and "bbbb" L"cccc" isn't considered a syntax error in every context, only in the context of your example. E.g. this is totally ok:

STRINGTABLE
{
  1 "bbbb" L"cccc"
}

And it turns out that we do have such an example among our tests, in tag-stringtable-basic.rc, and this file happens to be the dummy test input file for lots of the windres-specific tests right now.

I did try implementing this - have a look at this commit. I had to update a bunch of the tests that use that input file to use a different one as dummy input.

While this works, I'm not entirely satisfied, as it makes the windres mode error out on a bunch of constructs that GNU windres actually handles just fine. So with that, I'm considering a totally different approach: instead of trying to match GNU windres in general by merging string literals everywhere, just handle it in the specific cases we've run into. The cases in #51286 are all about filenames for ICON resources or manifests. So I'll post a separate PR to show what that solution looks like.

@mstorsjo
Member Author

For future cross-referencing: the alternative implementation is #68881.

@mstorsjo
Member Author

Abandoning this one, as we went with #68881 instead.

@mstorsjo mstorsjo closed this Oct 16, 2023
@mstorsjo
Member Author

Btw, I just recently saw http://www.os2museum.com/wp/from-the-annals-of-preprocessor-hackery/ which touches upon this exact issue, and the lengths some people have gone to in order to work around the fact that MS rc.exe doesn't concatenate consecutive string tokens in general.
