Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: sz_find incorrectly finds the substring with length=5 #154

Closed
3 tasks done
vasily-v-ryabov opened this issue May 30, 2024 · 11 comments
Closed
3 tasks done

Bug: sz_find incorrectly finds the substring with length=5 #154

vasily-v-ryabov opened this issue May 30, 2024 · 11 comments
Assignees
Labels
bug Something isn't working

Comments

@vasily-v-ryabov
Copy link
Contributor

Describe the bug

StringZilla incorrectly finds this case "Hello, world!".find("world", 0, 11) which must return -1.

Function sz_equal_serial doesn't handle correctly the suffix of substring "world" (suffix == "d"). It goes exactly to last statement return (sz_bool)(a_end == a); where a_end == "!" and a == "!" which is very strange because at the first line of the function there is an assignment sz_ptr_t const a_end = a + length; where length == 1 in debugger.

It looks like a while loop before return statement is written incorrectly for this case. b++ (b was empty "\0" string before entering the loop) remains empty randomly because subsequent memory is also clean (zeroes).

I'm aware about #72 though it is just a wish for optimization.

Steps to reproduce

Simple to test it in Python:

pip install stringzilla

and

>>> print("Hello, world!".find("world", 0, 11)) # original CPython algorithm
-1
>>> from stringzilla import Str
>>> print(Str("Hello, world!").find("world", 0, 11)) # StringZilla algorithm
7

Expected behavior

>>> print("Hello, world!".find("world", 0, 11)) # original CPython algorithm
-1

StringZilla version

3.8.4

Operating System

Ubuntu 22.04 and Windows 10/11 64-bit

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

No response

Are you open to being tagged as a contributor?

  • I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
@vasily-v-ryabov vasily-v-ryabov added the bug Something isn't working label May 30, 2024
@vasily-v-ryabov
Copy link
Contributor Author

OK, maybe the root cause is at higher level, not exactly in this function. I couldn't figure out the root cause yet.

@ashvardanian
Copy link
Owner

Interesting, which CPU model are you running on?

@ashvardanian
Copy link
Owner

Oh, it must be a binding issue. When I avoid passing the extra arguments I get:

>>> print("Hello, world!".find("world")) # original CPython algorithm
7
>>> print(Str("Hello, world!").find("world")) # StringZilla algorithm
7

So should be coming from python/lib.c, @vasily-v-ryabov.

@vasily-v-ryabov
Copy link
Contributor Author

I run on different Intel CPUs: Core i7-8700 or Core i7-12700H.

@vasily-v-ryabov
Copy link
Contributor Author

Also I see similar issue in C code.

@ashvardanian
Copy link
Owner

Wow, that's dangerous! Any chance there is an existing test you can extend in scripts/test.cpp to highlight that? PRs always welcomed!

@ashvardanian ashvardanian self-assigned this Aug 4, 2024
@ashvardanian
Copy link
Owner

Thank you for spotting this! I've fixed the bug and will merge soon 🤗

@hillelme
Copy link

hillelme commented Aug 5, 2024

Hi,
Thanks for this great lib! seeing a similar(?) issue:
sz_find_neon_too_smart is disagreening with sz_find_neon/sz_find_serial
this happens in C, with n_length=2.

I noticed the first byte is equal where sz_find_neon_too_smart flagged, but the 2nd wasn't.

I used random strings of length 2048 to compare the 2 functions.

n[0]: 0xe0
n[1]: 0x87
n_length: 2
sz_find_neon: 1968.           h[found_offset0]: 0xe0, h[found_offset0 + 1]: 0x87 <-- first and 2nd bytes good
sz_find_neon_too_smart: 780.  h[found_offset1]: 0xe0, h[found_offset1 + 1]: 0x1c <-- first byte good, 2nd not

Thanks!

edit: tested with lengths 1-32, and only "2" has this issue

@ashvardanian
Copy link
Owner

Hi @hillelme! That seems to be a different bug. Any chance you have a minimal string example I can use to reproduce this?

@ashvardanian
Copy link
Owner

I have just released the v3.9.0, which patches the original issue. @hillelme can you please create a new one if your issue persists?

Thanks for reporting those 🤗

@vasily-v-ryabov
Copy link
Contributor Author

The fix is confirmed. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants