Custom preprocessor #5

Open
squeek502 opened this issue Sep 20, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@squeek502
Owner

squeek502 commented Sep 20, 2023

This would be a solution for the "Unavoidable divergences from the Windows RC compiler", which are all preprocessor-related.

Most notably, it would be necessary for UTF-16 support.

This would likely be a large undertaking. A starting point might be to fork arocc.


EDIT: Experimenting with this here: https://github.com/squeek502/preresinator

@squeek502
Owner Author

squeek502 commented Oct 21, 2023

Some notes on potential UTF-16 handling I had locally that are relevant:

  • Convert UTF-16 files to UTF-8
  • Surround converted files with
    #pragma code_page(65001)
    // file contents
    #pragma code_page(DEFAULT)
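
A minimal sketch of the convert-and-wrap idea above, written in Python for brevity. The function name `wrap_utf16_as_utf8` is hypothetical, and detecting UTF-16 by its byte order mark is an assumption made for illustration; it is not taken from resinator or rc.exe.

```python
import codecs


def wrap_utf16_as_utf8(raw: bytes) -> bytes:
    """Convert BOM-marked UTF-16 input to UTF-8 wrapped in code_page pragmas.

    Input without a UTF-16 BOM is returned unchanged. Detection by BOM only
    is an assumption of this sketch.
    """
    if raw.startswith(codecs.BOM_UTF16_LE):
        text = raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    elif raw.startswith(codecs.BOM_UTF16_BE):
        text = raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    else:
        return raw

    # Make sure the closing pragma starts on its own line.
    if not text.endswith("\n"):
        text += "\n"

    wrapped = (
        "#pragma code_page(65001)\n"
        + text
        + "#pragma code_page(DEFAULT)\n"
    )
    return wrapped.encode("utf-8")
```

Strict decoding is used here, so invalid UTF-16 (for example an unpaired surrogate) raises `UnicodeDecodeError` instead of being silently replaced.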
    

Need to ensure that the generated #pragma code_page lines aren't ignored because they are part of an included file:

  • Could do something like using #line 1 "<built-in>" and then documenting that such lines should not be ignored even if they are 'within' an included file
  • Can't really do something like putting the generated #pragma code_page's in the 'root' file since that wouldn't handle the case of an included file including a UTF-16 file
  • Could do something like (where root-file.rc is the root file and file.rc is any arbitrary included file)
    #line 1 "root-file.rc" // note that the 1 here is a totally fake line number since the following line is not part of the original file at all
    #pragma code_page(65001)
    #line 1 "file.rc"
    // file.rc contents
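
The last option can be sketched as a small helper that emits the fake `#line` marker for the root file ahead of the generated pragma. This is only an illustration: `emit_utf16_include` and its parameters are hypothetical names, the contents are assumed to have already been converted to UTF-8 text, and file names containing quotes or backslashes are not escaped.

```python
def emit_utf16_include(root_name: str, included_name: str, contents: str) -> str:
    """Build preprocessor output for an included file that was UTF-16.

    The generated pragma is attributed to the root file so that it is not
    treated as part of the included file.
    """
    lines = [
        # The line number 1 is fake: the pragma is not in the root file at all.
        f'#line 1 "{root_name}"',
        "#pragma code_page(65001)",
        # Switch back to the real included file before its contents.
        f'#line 1 "{included_name}"',
        contents.rstrip("\n"),
    ]
    return "\n".join(lines) + "\n"
```

Because the marker is emitted per include, this also covers an included file that itself includes a UTF-16 file, which the "put it in the root file" option does not.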
    

Also need to keep https://squeek502.github.io/resinator/windows/input-and-output-code-pages.html in mind. rc.exe encodes the output of the preprocessor as UTF-16, having already handled all the #pragma code_page directives and done things like replacing invalid codepoints with <U+FFFD>. This type of approach may end up being necessary to reach full compatibility without resinator-specific behavior in the preprocessed output.

@mehrdadn

Hi, just wanted to make a suggestion: would it be possible to simply handle the cases that are easy, and leave the more difficult ones for later? Particularly, being able to parse UTF-16 resource files that don't have any problematic #pragma code_page directives would be a great step forward and quite helpful, because it would allow the usage of build steps that output UTF-16 lacking such directives.

@squeek502
Owner Author

squeek502 commented Feb 11, 2025

Unfortunately, the baseline would be a C preprocessor that can handle UTF-16 encoded files as input, which is not necessarily easy on its own (for context, the Microsoft compilers are the only existing ones I'm aware of that support UTF-16 [or, at least, I know clang and gcc don't]).

Beyond that, decisions need to be made about the encoding of the output of the C preprocessor, and how that interacts with the resource compiler (this is where #pragma code_page starts being relevant). The 'easiest' strategy here would be for the C preprocessor to output UTF-16 and for the resource compiler to also be able to ingest UTF-16 (this is how rc.exe works; the preprocessor always outputs UTF-16 and so the resource compiler only has to care about ingesting UTF-16), but I don't necessarily like that solution too much.

@mehrdadn

> Unfortunately, the baseline would be a C preprocessor that can handle UTF-16 encoded files as input, which is not necessarily easy on its own (for context, the Microsoft compilers are the only existing ones I'm aware of that support UTF-16 [or, at least, I know clang and gcc don't]).

I feel I'm missing something, but what about just converting UTF-16 files to UTF-8 and then feeding them to your C preprocessor?

@squeek502
Owner Author

squeek502 commented Feb 12, 2025

That would be one way to go, but it would be the C preprocessor that would need to do the conversion, since that's what's handling #include, etc.

It would also need some way to let the resource compiler know how the output should be interpreted, either:

  • Consume UTF-16 and output UTF-8, and then mark that section as being UTF-8 in some way
  • Consume UTF-16 and output UTF-16 (internally converting to UTF-8 and then back to UTF-16), and either also mark this or let the resource compiler infer the encoding of UTF-16 sections
  • Handle all #pragma code_page directives and output as a single standardized encoding

Consider this example:

#pragma code_page(1252)
#include "windows1252.rc"

// no need to set code page, the UTF-16 encoding is inferred
#include "utf16.rc"

// code page 1252 is still active
#include "windows1252_again.rc"

Running rc.exe /p test.rc (to only run the preprocessor) results in:

#pragma code_page 1252

<contents of windows1252.rc interpreted as Windows-1252 and outputted as UTF-16>

<contents of utf16.rc interpreted as UTF-16 and outputted as UTF-16>

<contents of windows1252_again.rc interpreted as Windows-1252 and outputted as UTF-16>

This approach of rc.exe simplifies things from the perspective of the resource compiler (it can ignore the #pragma code_page since everything has already been converted to UTF-16), but means that the C preprocessor is the one dealing with the #pragma code_page directives, which complicates the C preprocessor side of things.
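
To make the rc.exe-style behavior concrete, here is a rough Python model of a preprocessor that tracks the active code page itself and emits everything as UTF-16. It is a sketch under several assumptions: the `("code_page", n)` / `("include", bytes)` event list, the function names, the initial code page of 1252, and BOM-based UTF-16 detection are all made up for illustration, and Python's codecs stand in for Windows' conversions (they may differ on undefined bytes).

```python
import codecs


def decode_included(raw: bytes, active_code_page: int) -> str:
    """Decode one file: UTF-16 is inferred, anything else uses the code page."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        body = raw[len(codecs.BOM_UTF16_LE):]
        return body.decode("utf-16-le", errors="replace")
    # errors="replace" substitutes U+FFFD for bytes the codec cannot map.
    return raw.decode(f"cp{active_code_page}", errors="replace")


def preprocess_to_utf16(events, initial_code_page: int = 1252) -> bytes:
    """Model a preprocessor that handles code pages and outputs UTF-16LE."""
    active = initial_code_page
    pieces = []
    for kind, value in events:
        if kind == "code_page":
            # Equivalent of seeing `#pragma code_page(value)`.
            active = value
        elif kind == "include":
            # A UTF-16 include does not change the active code page.
            pieces.append(decode_included(value, active))
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return "".join(pieces).encode("utf-16-le")
```

Fed the three includes from the example (Windows-1252, UTF-16, Windows-1252 again), the consumer sees a single UTF-16 stream and never needs to look at a code page, which is the simplification described above; the cost is that all of the code page bookkeeping lives in the preprocessor.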

(as an aside, note also that there are rc.exe quirks that are caused by the preprocessor 'speaking' UTF-16, e.g. this and this)

@mehrdadn

Ahh... I see. Thanks for the explanation, that's definitely annoying!

@squeek502
Owner Author

squeek502 commented Feb 12, 2025

No worries, writing this out has helped clarify my thoughts about the problem. I'm thinking this might be the most viable path to getting initial UTF-16 support:

> [Make the C preprocessor] consume UTF-16 and output UTF-16 (internally converting to UTF-8 and then back to UTF-16), and either also mark this or let the resource compiler infer the encoding of UTF-16 sections

That would allow the preprocessor to not have any rc.exe-specific stuff, and resinator is decently equipped to handle the input being partially UTF-16 encoded.
