Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Hyperscan] Broken UTF-8 support #1308

Open
sanyarnd opened this issue Jan 9, 2023 · 19 comments
Open

[Hyperscan] Broken UTF-8 support #1308

sanyarnd opened this issue Jan 9, 2023 · 19 comments
Labels

Comments

@sanyarnd
Copy link

sanyarnd commented Jan 9, 2023

Something is broken with Hyperscan build. The following code doesn't work.

import static org.bytedeco.hyperscan.global.hyperscan.HS_FLAG_UCP;
import org.bytedeco.hyperscan.hs_compile_error_t;
import org.bytedeco.hyperscan.hs_database_t;
import static org.bytedeco.hyperscan.global.hyperscan.HS_FLAG_UTF8;
import static org.bytedeco.hyperscan.global.hyperscan.HS_MODE_BLOCK;
import static org.bytedeco.hyperscan.global.hyperscan.hs_compile;

public class Sample {
    public static void main(String[] args) {
        String[] patterns = {"ok.*", "not ok кириллица.*"};

        for (var pattern : patterns) {
            try (var db = new hs_database_t();
                 var err = new hs_compile_error_t()) {
                var code = hs_compile(pattern, HS_FLAG_UTF8 | HS_FLAG_UCP, HS_MODE_BLOCK, null, db, err);

                var errMsg = "";
                if (!err.isNull()) {
                    errMsg = err.message().getString();
                }
                System.out.printf("Word: '%s'; retcode '%d'; message '%s' %n", pattern, code, errMsg);
            }
        }
    }
}

Output:

Word: 'ok.*'; retcode '0'; message '' 
Word: 'not ok кириллица.*'; retcode '-4'; message 'Expression is not valid UTF-8.' 

Meanwhile, https://github.com/gliwka/hyperscan-java binaries work as expected

@saudet
Copy link
Member

saudet commented Jan 9, 2023

That just means your platform's default encoding isn't UTF-8. Do you need to use UTF-8? If your platform encoding does all you need, not forcing UTF-8 should work just fine.

@sanyarnd
Copy link
Author

Can you clarify a bit?
I am using Ubuntu and it's using UTF-8 by default for everything, source file is also UTF-8, same goes for JVM.

@saudet
Copy link
Member

saudet commented Jan 10, 2023

Your JVM is obviously not using UTF-8. Check the value of the file.encoding system property.

@saudet
Copy link
Member

saudet commented Jan 10, 2023

That's funny, I get the same error here, but if I remove HS_FLAG_UTF8 it works fine. Can you confirm? That sounds like a bug in Hyperscan...

@saudet
Copy link
Member

saudet commented Jan 10, 2023

Sounds like issue intel/hyperscan#362 maybe?

@sanyarnd
Copy link
Author

Yeah everything works ok without the flag.
Seems like the case indeed. Strange it's not reproduced with other bindings.

@sanyarnd
Copy link
Author

sanyarnd commented Jan 10, 2023

Looking through code it seems it's not the reason

    // Invalid one-byte caseS.
    // Valid one-byte caseS.
    {"\x7f", false},
    {"\x7f", true}, // \x7f is valid

I have Cyrillic symbols, so it's not case.

Also note that turning the flag off disables caseless UTF-8 matching, for example, the following test fails now:

text="Ошибка" pattern=".*ош.*"

Again, the other bindings work as expected, so I tend to think there's something off with bindings itself.

I have the following set of flags:

HS_FLAG_CASELESS | HS_FLAG_SINGLEMATCH | HS_FLAG_UCP | HS_FLAG_UTF8

@saudet
Copy link
Member

saudet commented Jan 10, 2023

We can get UTF-8 bytes with String.getBytes() and call functions directly with a pointer to that and it still fails. So unless there's something wrong with how the JVM encodes in UTF-8, it's a problem with Hyperscan itself. Other bindings may offer some sort of workaround for that. Someone should investigate that.

@saudet
Copy link
Member

saudet commented Jan 10, 2023

Oh look at that. When building from the develop branch, the error doesn't show up anymore. So the bug has been fixed since 5.4.0. I guess we could upgrade the presets to the develop branch since they don't appear to make releases anymore...

@saudet
Copy link
Member

saudet commented Feb 26, 2023

Looks like that's been fixed in 5.4.1. Please give it a try with the snapshots: http://bytedeco.org/builds/

@saudet
Copy link
Member

saudet commented May 1, 2023

Ah, no, this is still happening with 5.4.1, but only when it gets built on CentOS 7. When I build locally with Fedora 36, this isn't happening. So I guess this is a bug in the compiler... Well, CentOS 7 EOL is next year, and then I think I'll just switch all builds to Ubuntu, and that should take care of this bug. If you need something before that, either build from source with a compiler that works, or figure out how to make this work with compilers available for CentOS 7!

@mmimica
Copy link

mmimica commented Jun 20, 2023

The problem perists with 5.4.2-1.5.9 when using maven central artifact.

When I compiled the preset myself it worked fine, so it must have been something the way it was compiled.

import org.bytedeco.hyperscan.hs_database_t;

import static org.bytedeco.hyperscan.global.hyperscan.*;

public class Sample {
    public static void main(String[] args) {
        try (var db = new hs_database_t();
             var err = new hs_compile_error_t()) {
            var code = hs_compile("测", HS_FLAG_UTF8, HS_MODE_BLOCK, null, db, err);
            var errMsg = "";
            if (!err.isNull()) {
                errMsg = err.message().getString();
            }
            System.out.printf("Word: '%s'; retcode '%d'; message '%s' %n", "测", code, errMsg);
        }
    }
}

outputs Word: '测'; retcode '-4'; message 'Expression is not valid UTF-8.'

@mmimica
Copy link

mmimica commented Jun 20, 2023

More explicitly, hs_compile(new BytePointer("测".getBytes(StandardCharsets.UTF_8)), HS_FLAG_UTF8, HS_MODE_BLOCK, null, db, err) fails too. It is a valid 3-byte UTF-8 sequence btw.

@saudet
Copy link
Member

saudet commented Jun 26, 2023

This should be fixed now. Please give it a try with the snapshots: http://bytedeco.org/builds/

@mmimica
Copy link

mmimica commented Jun 27, 2023

Still fails.

@saudet
Copy link
Member

saudet commented Jun 27, 2023

I wonder which version of the compiler fixes that...

@mmimica
Copy link

mmimica commented Jun 29, 2023

So far I figured the bug occurs in Parser.cpp which is generated by Ragel. Centos 7 uses Development (!) version 7 of Ragel, while Ragel 6 seems to work fine. I haven't managed to install Ragel 6 on Centos 7.

@mmimica
Copy link

mmimica commented Jun 29, 2023

Ok, I found it. It is Ragel. Version 6 has to be used. I had it compiled manually because no package for Centos 7 exists. Also had to build cmake, bmake, libfsm and kelbt. What a fun day.

EDIT:
Here is ragel 6.10 binary. It's self-contained.
ragel.gz

@saudet
Copy link
Member

saudet commented Jun 29, 2023

Yes, I see, thanks. I forgot to update the builds to Ubuntu. Now the snapshots work. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants