Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

text: string_part smushes repeated separator instances #19

Closed
Hurricos opened this issue Jun 29, 2021 · 4 comments
Closed

text: string_part smushes repeated separator instances #19

Hurricos opened this issue Jun 29, 2021 · 4 comments

Comments

@Hurricos
Copy link

# uname -a
Linux .... 3.10.0-1160.21.1.el7.x86_64 #1 SMP Tue Mar 16 18:28:22 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
# sqlite3 --version
3.35.5 2021-04-19 18:32:05 1b256d97b553a9611efca188a3d995a2fff712759044ba480f9a0c9e98fae886
# echo -e ".load text\nselect split_part('one;;;three', ';', 2);" | sqlite3
three
@Hurricos
Copy link
Author

After reading the documentation for text I understand this is ... possibly expected behavior:

If sep is composed of multiple characters, each character is treated as separator. E.g.:

sqlite> select split_part('one/two\three', '/\', 2);
two

By "possibly expected", I mean that I expect this:

sqlite> select split_part('one/\two/\three', '/\', 2);
two

Still, it's really surprising behavior.

If it is expected behavior, would it be possible to add a cut function which acts similarly to Unix 'cut' instead, only accepting a single character?

If it is not expected behavior, ... well, I'm trying to trace src/sqlite3-text.c#L162 to find out why.

@Hurricos
Copy link
Author

As best I can tell, str_sep operates more like strtok than strsep; strtok cannot deal with empty fields and responds by simply squishing together repeated instances of the separator.

The following patch works for me upon testing, but I haven't tested for any bugs:

diff --git a/src/sqlite3-text.c b/src/sqlite3-text.c
index 25bacfb..25f5538 100644
--- a/src/sqlite3-text.c
+++ b/src/sqlite3-text.c
@@ -163,9 +163,6 @@ static char *split_part(char *source, const char *sep, int64_t part) {
     char *token;
     int64_t index = 1;
     while ((token = str_sep(&source, sep)) != NULL) {
-        if (strcmp(token, "") == 0) {
-            continue;
-        }
         if (index == part) {
             break;
         }

@Hurricos
Copy link
Author

After some more use, still no bugs with the above patch.

I'm comfortable with the new behavior of split_part, but I don't want to assume everyone is. For the purpose of consistency I won't submit a PR for the above patch; instead I'd rather separate the new behavior into a new SQLite function, strsep_part, explicitly different in that it does not condense delimiters.

PR forthcoming eventually.

nalgeon added a commit that referenced this issue Sep 28, 2021
@nalgeon
Copy link
Owner

nalgeon commented Sep 28, 2021

Thank you very much for this investigation! I agree - squashing multiple separators into one is unexpected behavior. I've applied your patch, don't see a real need for the separate function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants