Replies: 1 comment 1 reply
@andreped you're the first person to report this specific issue. My initial hunch would be similar to yours, that it has to do with tokenization. I thought perhaps the lowercase names weren't being tokenized as two words. However, for the specific instance of `TaskDescription`, it's two tokens regardless of whether it's camel case or lower case:

```python
>>> import tiktoken
>>> enc = tiktoken.encoding_for_model("gpt-4o")
>>> enc.encode("TaskDescription")
[5927, 6135]
>>> enc.encode("taskdescription")
[15921, 9186]
```

Could there be a disconnect between the terminology in the questions vs. the terminology in the SQL?
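As a follow-up, a quick sketch along the same lines could check whether other column names split into a different number of tokens when lowercased (the extra column names below are made up for illustration; the encoding is the same one used above):

```python
import tiktoken

# Compare token counts for a few hypothetical column names in their
# CamelCase form vs. their all-lowercase form.
enc = tiktoken.encoding_for_model("gpt-4o")

for name in ["TaskDescription", "ProjectManagerId", "EstimatedHours"]:
    camel_tokens = enc.encode(name)          # tokens for the CamelCase name
    lower_tokens = enc.encode(name.lower())  # tokens for the lowercase name
    print(f"{name}: {len(camel_tokens)} tokens camel case, "
          f"{len(lower_tokens)} tokens lower case")
```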
---
We recently made the move from MS-SQL to PostgreSQL, as it allows us to use PGVector and reduces our technical overhead. Naturally, as part of this change, we had to convert all SQL queries in both our training and test sets to the new dialect, which we have done.

Surprisingly, we observed a degradation in SQL completion quality (other tasks are likely affected as well). We are wondering whether this has to do with the fact that all column names have ended up being lower case only. Thus, instead of `TaskDescription`, it will be `taskdescription`.

Now, we know we can address this in the database itself, either by writing all column names as `"TaskDescription"` (quoted, to preserve the casing) or by updating the queries themselves, but it was surprising to us that this could be the cause. Perhaps whether CamelCaps is used or not impacts the tokenization? Has anyone observed something similar? Perhaps you have, @zainhoda, as I know you used to have a similar configuration to ours.
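As a side note on the quoting workaround mentioned above, here is a minimal sketch (the helper function and column names are made up for illustration) of how mixed-case identifiers could be double-quoted when generating queries, since PostgreSQL folds unquoted identifiers to lower case but preserves the casing of quoted ones:

```python
# Hypothetical helper: double-quote identifiers that contain uppercase
# letters so PostgreSQL keeps their original casing instead of folding
# them to lowercase.
def quote_if_mixed_case(identifier: str) -> str:
    return f'"{identifier}"' if identifier != identifier.lower() else identifier

# Hypothetical column names, just for illustration.
columns = ["TaskDescription", "status", "CreatedAt"]
select_list = ", ".join(quote_if_mixed_case(col) for col in columns)
print(f"SELECT {select_list} FROM tasks;")
# -> SELECT "TaskDescription", status, "CreatedAt" FROM tasks;
```

One thing to weigh: once mixed-case identifiers are quoted in the schema, the quotes become mandatory in every future query, which may or may not be preferable to simply standardizing on lowercase names.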