Replies: 1 comment 1 reply
@andreped you're the first person to report this specific issue. My initial hunch would be similar to yours, that it has to do with tokenization. I thought perhaps the lowercase names weren't being tokenized as two words. However, for the specific instance of `TaskDescription`, it's two tokens regardless of whether it's camel case or lower case:

```python
>>> import tiktoken
>>> enc = tiktoken.encoding_for_model("gpt-4o")
>>> enc.encode("TaskDescription")
[5927, 6135]
>>> enc.encode("taskdescription")
[15921, 9186]
```

Could there be a disconnect between the terminology in the questions vs. the terminology in the SQL?
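As a follow-up, a quick sketch along the same lines could check whether other column names split into a different number of tokens when lowercased (the extra column names below are made up for illustration; the encoding is the same one used above):

```python
import tiktoken

# Compare token counts for a few hypothetical column names in their
# CamelCase form vs. their all-lowercase form.
enc = tiktoken.encoding_for_model("gpt-4o")

for name in ["TaskDescription", "ProjectManagerId", "EstimatedHours"]:
    camel_tokens = enc.encode(name)          # tokens for the CamelCase name
    lower_tokens = enc.encode(name.lower())  # tokens for the lowercase name
    print(f"{name}: {len(camel_tokens)} tokens camel case, "
          f"{len(lower_tokens)} tokens lower case")
```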
---
We recently made the move from MS-SQL to PostgreSQL, as it allows us to use PGVector and reduces our technical overhead. Naturally, as part of this change, we had to convert all SQL queries in both our training and test sets to the new dialect, which we have done.

Surprisingly, we observed a degradation in SQL completion quality (other tasks are likely affected as well). We are wondering whether this has to do with the fact that all column names have ended up being lower case only. Thus, instead of `TaskDescription`, it will be `taskdescription`.

Now, we know we can address this in the database itself, either by writing all column names as `"TaskDescription"` (quoted, to preserve the casing) or by updating the queries themselves, but it was surprising to us that this could be the cause. Perhaps whether CamelCaps is used or not impacts the tokenization? Has anyone observed something similar? Perhaps you have, @zainhoda, as I know you used to have a similar configuration to ours.
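As a side note on the quoting workaround mentioned above, here is a minimal sketch (the helper function and column names are made up for illustration) of how mixed-case identifiers could be double-quoted when generating queries, since PostgreSQL folds unquoted identifiers to lower case but preserves the casing of quoted ones:

```python
# Hypothetical helper: double-quote identifiers that contain uppercase
# letters so PostgreSQL keeps their original casing instead of folding
# them to lowercase.
def quote_if_mixed_case(identifier: str) -> str:
    return f'"{identifier}"' if identifier != identifier.lower() else identifier

# Hypothetical column names, just for illustration.
columns = ["TaskDescription", "status", "CreatedAt"]
select_list = ", ".join(quote_if_mixed_case(col) for col in columns)
print(f"SELECT {select_list} FROM tasks;")
# -> SELECT "TaskDescription", status, "CreatedAt" FROM tasks;
```

One thing to weigh: once mixed-case identifiers are quoted in the schema, the quotes become mandatory in every future query, which may or may not be preferable to simply standardizing on lowercase names.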