Bug: sz_find incorrectly finds the substring with length=5 #154

vasily-v-ryabov · 2024-05-30T08:06:06Z

Describe the bug

StringZilla incorrectly finds this case "Hello, world!".find("world", 0, 11) which must return -1.

Function sz_equal_serial doesn't handle correctly the suffix of substring "world" (suffix == "d"). It goes exactly to last statement return (sz_bool)(a_end == a); where a_end == "!" and a == "!" which is very strange because at the first line of the function there is an assignment sz_ptr_t const a_end = a + length; where length == 1 in debugger.

It looks like a while loop before return statement is written incorrectly for this case. b++ (b was empty "\0" string before entering the loop) remains empty randomly because subsequent memory is also clean (zeroes).

I'm aware about #72 though it is just a wish for optimization.

Steps to reproduce

Simple to test it in Python:

pip install stringzilla

and

>>> print("Hello, world!".find("world", 0, 11)) # original CPython algorithm
-1
>>> from stringzilla import Str
>>> print(Str("Hello, world!").find("world", 0, 11)) # StringZilla algorithm
7

Expected behavior

>>> print("Hello, world!".find("world", 0, 11)) # original CPython algorithm
-1

StringZilla version

3.8.4

Operating System

Ubuntu 22.04 and Windows 10/11 64-bit

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

No response

Are you open to being tagged as a contributor?

I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

I have searched the existing issues

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

vasily-v-ryabov · 2024-05-30T08:39:36Z

OK, maybe the root cause is at higher level, not exactly in this function. I couldn't figure out the root cause yet.

ashvardanian · 2024-05-30T18:46:43Z

Interesting, which CPU model are you running on?

ashvardanian · 2024-05-30T19:41:32Z

Oh, it must be a binding issue. When I avoid passing the extra arguments I get:

>>> print("Hello, world!".find("world")) # original CPython algorithm
7
>>> print(Str("Hello, world!").find("world")) # StringZilla algorithm
7

So should be coming from python/lib.c, @vasily-v-ryabov.

vasily-v-ryabov · 2024-06-05T12:50:04Z

I run on different Intel CPUs: Core i7-8700 or Core i7-12700H.

vasily-v-ryabov · 2024-06-05T12:50:52Z

Also I see similar issue in C code.

ashvardanian · 2024-06-05T17:23:35Z

Wow, that's dangerous! Any chance there is an existing test you can extend in scripts/test.cpp to highlight that? PRs always welcomed!

ashvardanian · 2024-08-04T20:19:47Z

Thank you for spotting this! I've fixed the bug and will merge soon 🤗

hillelme · 2024-08-05T06:36:30Z

Hi,
Thanks for this great lib! seeing a similar(?) issue:
sz_find_neon_too_smart is disagreening with sz_find_neon/sz_find_serial
this happens in C, with n_length=2.

I noticed the first byte is equal where sz_find_neon_too_smart flagged, but the 2nd wasn't.

I used random strings of length 2048 to compare the 2 functions.

n[0]: 0xe0
n[1]: 0x87
n_length: 2
sz_find_neon: 1968.           h[found_offset0]: 0xe0, h[found_offset0 + 1]: 0x87 <-- first and 2nd bytes good
sz_find_neon_too_smart: 780.  h[found_offset1]: 0xe0, h[found_offset1 + 1]: 0x1c <-- first byte good, 2nd not

Thanks!

edit: tested with lengths 1-32, and only "2" has this issue

ashvardanian · 2024-08-05T14:53:55Z

Hi @hillelme! That seems to be a different bug. Any chance you have a minimal string example I can use to reproduce this?

ashvardanian · 2024-08-05T21:54:01Z

I have just released the v3.9.0, which patches the original issue. @hillelme can you please create a new one if your issue persists?

Thanks for reporting those 🤗

vasily-v-ryabov · 2024-10-20T15:26:21Z

The fix is confirmed. Thanks!

vasily-v-ryabov added the bug Something isn't working label May 30, 2024

ashvardanian self-assigned this Aug 4, 2024

ashvardanian closed this as completed in 96e2561 Aug 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: sz_find incorrectly finds the substring with length=5 #154

Bug: sz_find incorrectly finds the substring with length=5 #154

vasily-v-ryabov commented May 30, 2024

vasily-v-ryabov commented May 30, 2024

ashvardanian commented May 30, 2024

ashvardanian commented May 30, 2024

vasily-v-ryabov commented Jun 5, 2024

vasily-v-ryabov commented Jun 5, 2024

ashvardanian commented Jun 5, 2024

ashvardanian commented Aug 4, 2024

hillelme commented Aug 5, 2024 •

edited

Loading

ashvardanian commented Aug 5, 2024

ashvardanian commented Aug 5, 2024

vasily-v-ryabov commented Oct 20, 2024

Bug: sz_find incorrectly finds the substring with length=5 #154

Bug: sz_find incorrectly finds the substring with length=5 #154

Comments

vasily-v-ryabov commented May 30, 2024

Describe the bug

Steps to reproduce

Expected behavior

StringZilla version

Operating System

Hardware architecture

Which interface are you using?

Contact Details

Are you open to being tagged as a contributor?

Is there an existing issue for this?

Code of Conduct

vasily-v-ryabov commented May 30, 2024

ashvardanian commented May 30, 2024

ashvardanian commented May 30, 2024

vasily-v-ryabov commented Jun 5, 2024

vasily-v-ryabov commented Jun 5, 2024

ashvardanian commented Jun 5, 2024

ashvardanian commented Aug 4, 2024

hillelme commented Aug 5, 2024 • edited Loading

ashvardanian commented Aug 5, 2024

ashvardanian commented Aug 5, 2024

vasily-v-ryabov commented Oct 20, 2024

hillelme commented Aug 5, 2024 •

edited

Loading