RE2::GlobalReplace case insensitive fail with OR-ed pattern AB|AC #477

Closed
bigF93 opened this issue Feb 19, 2024 · 2 comments

bigF93 commented Feb 19, 2024

Hi,
It seems that commit b2af9b3 changed the behaviour of case-insensitive calls to RE2::GlobalReplace.
Before b2af9b3, the replace call was successful, but it now appears to have stopped working.
I am using Ubuntu 22 with g++.

To reproduce the issue, build and run the code snippet below:

#include <cstdint>
#include <iomanip>
#include <iostream>
#include <string>

#include <re2/re2.h>

int main()
{
    // Example of an array of bytes from which we want to remove some pairs of bytes, such as 0xa5 0xd1 and 0xa5 0x64.
    unsigned char rawBytes[] = { 0xa5, 0xd1, 0xa5, 0xd1, 0x61, 0x63, 0xa5, 0x64 };

    std::string constructedString( reinterpret_cast<char*>( rawBytes ), sizeof( rawBytes ) / sizeof( rawBytes[0] ) );

    // Print the input bytes in hex.
    for ( const auto& elem : constructedString ) {
        std::cout << "0x" << std::hex << std::setw(2) << std::setfill('0') << static_cast<int>(static_cast<unsigned char>(elem)) << " ";
    }
    std::cout << "\n";

    // Latin-1 encoding, case-insensitive matching.
    RE2::Options opts{};
    opts.set_encoding( RE2::Options::Encoding::EncodingLatin1 );
    opts.set_case_sensitive( false );
    opts.set_posix_syntax( false );
    opts.set_perl_classes( false );
    RE2 regex{ "\xa5\xd1|\xa5\x64", opts };

    int result = RE2::GlobalReplace( &constructedString, regex, "" );

    // Print the bytes that remain after the replacement.
    for ( const std::uint8_t elem : constructedString ) {
        std::cout << "0x" << std::hex << std::setw(2) << std::setfill('0') << static_cast<int>(elem) << " ";
    }

    // std::hex is sticky, so switch back to decimal before printing the count.
    std::cout << std::dec << "\nReplacements made = " << result << "\n";
    return 0;
}

Thank you!

junyer commented Feb 19, 2024

Many thanks for the report! It seems that you have uncovered an ancient bug: this line fails to preserve the Latin1 flag. The case sensitivity doesn't appear to matter and, as such, I'm simplifying the options in the test that I'm committing with the fix.

copybara-service bot pushed a commit that referenced this issue Feb 19, 2024
Regexp::LeadingString didn't output Latin1 into flags.
In the given pattern, 0xA5 should be factored out, but
shouldn't lose its Latin1-ness in the process. Because
that was happening, the prefix for accel was 0xC2 0xA5
instead of 0xA5. Note that the former doesn't occur in
the given input and so replacements weren't occurring.

Fixes #477.

Change-Id: Icd36ba0905684d93d6db58e8047c9787918d0cf4

junyer commented Feb 19, 2024

I noticed some additional, seemingly long-standing sadness around Latin-1 handling that does involve case sensitivity. T_T

junyer reopened this Feb 19, 2024
copybara-service bot pushed a commit that referenced this issue Feb 19, 2024
It turned out that case folding assumed UTF-8 mode, so
we would fold, say, 0xD1 to 0xF1 even in Latin-1 mode.

Fixes #477.

Change-Id: I73aa5c8e33ee0c6041c54e3a7268635915960f64