[Hyperscan] Broken UTF-8 support #1308

sanyarnd · 2023-01-09T21:19:35Z

Something is broken with Hyperscan build. The following code doesn't work.

import static org.bytedeco.hyperscan.global.hyperscan.HS_FLAG_UCP;
import org.bytedeco.hyperscan.hs_compile_error_t;
import org.bytedeco.hyperscan.hs_database_t;
import static org.bytedeco.hyperscan.global.hyperscan.HS_FLAG_UTF8;
import static org.bytedeco.hyperscan.global.hyperscan.HS_MODE_BLOCK;
import static org.bytedeco.hyperscan.global.hyperscan.hs_compile;

public class Sample {
    public static void main(String[] args) {
        String[] patterns = {"ok.*", "not ok кириллица.*"};

        for (var pattern : patterns) {
            try (var db = new hs_database_t();
                 var err = new hs_compile_error_t()) {
                var code = hs_compile(pattern, HS_FLAG_UTF8 | HS_FLAG_UCP, HS_MODE_BLOCK, null, db, err);

                var errMsg = "";
                if (!err.isNull()) {
                    errMsg = err.message().getString();
                }
                System.out.printf("Word: '%s'; retcode '%d'; message '%s' %n", pattern, code, errMsg);
            }
        }
    }
}

Output:

Word: 'ok.*'; retcode '0'; message '' 
Word: 'not ok кириллица.*'; retcode '-4'; message 'Expression is not valid UTF-8.'

Meanwhile, https://github.com/gliwka/hyperscan-java binaries work as expected

The text was updated successfully, but these errors were encountered:

saudet · 2023-01-09T23:48:06Z

That just means your platform's default encoding isn't UTF-8. Do you need to use UTF-8? If your platform encoding does all you need, not forcing UTF-8 should work just fine.

sanyarnd · 2023-01-10T00:20:54Z

Can you clarify a bit?
I am using Ubuntu and it's using UTF-8 by default for everything, source file is also UTF-8, same goes for JVM.

saudet · 2023-01-10T00:32:30Z

Your JVM is obviously not using UTF-8. Check the value of the file.encoding system property.

saudet · 2023-01-10T01:48:58Z

That's funny, I get the same error here, but if I remove HS_FLAG_UTF8 it works fine. Can you confirm? That sounds like a bug in Hyperscan...

saudet · 2023-01-10T02:01:48Z

Sounds like issue intel/hyperscan#362 maybe?

sanyarnd · 2023-01-10T08:54:18Z

Yeah everything works ok without the flag.
Seems like the case indeed. Strange it's not reproduced with other bindings.

sanyarnd · 2023-01-10T09:16:04Z

Looking through code it seems it's not the reason

    // Invalid one-byte caseS.
    // Valid one-byte caseS.
    {"\x7f", false},
    {"\x7f", true}, // \x7f is valid

I have Cyrillic symbols, so it's not case.

Also note that turning the flag off disables caseless UTF-8 matching, for example, the following test fails now:

text="Ошибка" pattern=".*ош.*"

Again, the other bindings work as expected, so I tend to think there's something off with bindings itself.

I have the following set of flags:

HS_FLAG_CASELESS | HS_FLAG_SINGLEMATCH | HS_FLAG_UCP | HS_FLAG_UTF8

saudet · 2023-01-10T10:27:54Z

We can get UTF-8 bytes with String.getBytes() and call functions directly with a pointer to that and it still fails. So unless there's something wrong with how the JVM encodes in UTF-8, it's a problem with Hyperscan itself. Other bindings may offer some sort of workaround for that. Someone should investigate that.

saudet · 2023-01-10T10:58:11Z

Oh look at that. When building from the develop branch, the error doesn't show up anymore. So the bug has been fixed since 5.4.0. I guess we could upgrade the presets to the develop branch since they don't appear to make releases anymore...

…LVM 15.0.7

saudet · 2023-02-26T02:41:56Z

Looks like that's been fixed in 5.4.1. Please give it a try with the snapshots: http://bytedeco.org/builds/

saudet · 2023-05-01T03:30:41Z

Ah, no, this is still happening with 5.4.1, but only when it gets built on CentOS 7. When I build locally with Fedora 36, this isn't happening. So I guess this is a bug in the compiler... Well, CentOS 7 EOL is next year, and then I think I'll just switch all builds to Ubuntu, and that should take care of this bug. If you need something before that, either build from source with a compiler that works, or figure out how to make this work with compilers available for CentOS 7!

mmimica · 2023-06-20T13:54:56Z

The problem perists with 5.4.2-1.5.9 when using maven central artifact.

When I compiled the preset myself it worked fine, so it must have been something the way it was compiled.

import org.bytedeco.hyperscan.hs_database_t;

import static org.bytedeco.hyperscan.global.hyperscan.*;

public class Sample {
    public static void main(String[] args) {
        try (var db = new hs_database_t();
             var err = new hs_compile_error_t()) {
            var code = hs_compile("测", HS_FLAG_UTF8, HS_MODE_BLOCK, null, db, err);
            var errMsg = "";
            if (!err.isNull()) {
                errMsg = err.message().getString();
            }
            System.out.printf("Word: '%s'; retcode '%d'; message '%s' %n", "测", code, errMsg);
        }
    }
}

outputs Word: '测'; retcode '-4'; message 'Expression is not valid UTF-8.'

mmimica · 2023-06-20T14:01:12Z

More explicitly, hs_compile(new BytePointer("测".getBytes(StandardCharsets.UTF_8)), HS_FLAG_UTF8, HS_MODE_BLOCK, null, db, err) fails too. It is a valid 3-byte UTF-8 sequence btw.

saudet · 2023-06-26T13:44:05Z

This should be fixed now. Please give it a try with the snapshots: http://bytedeco.org/builds/

mmimica · 2023-06-27T07:41:06Z

Still fails.

saudet · 2023-06-27T07:55:56Z

I wonder which version of the compiler fixes that...

mmimica · 2023-06-29T09:02:14Z

So far I figured the bug occurs in Parser.cpp which is generated by Ragel. Centos 7 uses Development (!) version 7 of Ragel, while Ragel 6 seems to work fine. I haven't managed to install Ragel 6 on Centos 7.

mmimica · 2023-06-29T14:17:56Z

Ok, I found it. It is Ragel. Version 6 has to be used. I had it compiled manually because no package for Centos 7 exists. Also had to build cmake, bmake, libfsm and kelbt. What a fun day.

EDIT:
Here is ragel 6.10 binary. It's self-contained.
ragel.gz

saudet · 2023-06-29T14:35:31Z

Yes, I see, thanks. I forgot to update the builds to Ubuntu. Now the snapshots work. 👍

saudet added bug help wanted labels Jan 10, 2023

saudet added a commit that referenced this issue Feb 25, 2023

* Upgrade presets for Hyperscan 5.4.1 (issue #1308), SciPy 1.10.1, L…

ba3e979

…LVM 15.0.7

saudet removed the help wanted label Feb 26, 2023

saudet added the help wanted label Jun 27, 2023

saudet removed the help wanted label Jun 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Hyperscan] Broken UTF-8 support #1308

[Hyperscan] Broken UTF-8 support #1308

sanyarnd commented Jan 9, 2023

saudet commented Jan 9, 2023

sanyarnd commented Jan 10, 2023

saudet commented Jan 10, 2023

saudet commented Jan 10, 2023 •

edited

Loading

saudet commented Jan 10, 2023

sanyarnd commented Jan 10, 2023

sanyarnd commented Jan 10, 2023 •

edited

Loading

saudet commented Jan 10, 2023

saudet commented Jan 10, 2023

saudet commented Feb 26, 2023

saudet commented May 1, 2023

mmimica commented Jun 20, 2023

mmimica commented Jun 20, 2023

saudet commented Jun 26, 2023

mmimica commented Jun 27, 2023

saudet commented Jun 27, 2023

mmimica commented Jun 29, 2023

mmimica commented Jun 29, 2023 •

edited

Loading

saudet commented Jun 29, 2023

[Hyperscan] Broken UTF-8 support #1308

[Hyperscan] Broken UTF-8 support #1308

Comments

sanyarnd commented Jan 9, 2023

saudet commented Jan 9, 2023

sanyarnd commented Jan 10, 2023

saudet commented Jan 10, 2023

saudet commented Jan 10, 2023 • edited Loading

saudet commented Jan 10, 2023

sanyarnd commented Jan 10, 2023

sanyarnd commented Jan 10, 2023 • edited Loading

saudet commented Jan 10, 2023

saudet commented Jan 10, 2023

saudet commented Feb 26, 2023

saudet commented May 1, 2023

mmimica commented Jun 20, 2023

mmimica commented Jun 20, 2023

saudet commented Jun 26, 2023

mmimica commented Jun 27, 2023

saudet commented Jun 27, 2023

mmimica commented Jun 29, 2023

mmimica commented Jun 29, 2023 • edited Loading

saudet commented Jun 29, 2023

saudet commented Jan 10, 2023 •

edited

Loading

sanyarnd commented Jan 10, 2023 •

edited

Loading

mmimica commented Jun 29, 2023 •

edited

Loading