different number of queries in Academic, IMDB, etc. (compared to the original distribution) #1
Which original release are you referring to? We took the data from https://github.com/jkkummerfeld/text2sql-data . Note that that repo groups all the SQLs that share the same template, so there may be fewer SQLs than in the original release. (We then created a separate data point for each SQL, but in retrospect it might have been a better decision to create a separate data point for each natural language question rather than for each SQL query.)
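As a rough illustration of why the totals can disagree, a minimal sketch along these lines (not the repo's actual conversion code; the local file path is a placeholder) could tally templates, SQL variants, and natural-language sentences in one of the original files — each entry groups several SQLs and several sentences under one template, so "one data point per SQL" and "one data point per question" give different counts:

```python
# Sketch only: count templates, SQL variants, and natural-language sentences
# in one file from jkkummerfeld/text2sql-data. The path is a placeholder for
# a local copy; field names follow that repo's JSON layout.
import json

with open("imdb.json") as f:
    data = json.load(f)

n_templates = len(data)                                       # one entry per SQL template
n_sqls = sum(len(entry["sql"]) for entry in data)             # all SQL variants
n_sentences = sum(len(entry["sentences"]) for entry in data)  # all natural-language questions
print(n_templates, n_sqls, n_sentences)
```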
Thanks for your response! The original release I'm referring to is indeed https://github.com/jkkummerfeld/text2sql-data. Having read your response, my understanding now is that you only took one query per template when you created the distilled test suite. Does this imply that I can't use the test-suite databases to validate the correctness of the other queries? This would significantly reduce the dataset size. For example, there are 131 examples in the original IMDB dataset for 89 templates; losing 42 out of 131 examples is not great. Do you plan to create a test suite for the complete classical test sets? It would be a very valuable resource for the community.

I have also found other anomalies in your dataset release:
and
... that corresponds to "Find the movie which is classified in the most number of genres", and
... that corresponds to "Which producer has worked with the most number of directors ?" |
Thanks for carefully examining the dataset! I think these recommendations are very helpful. Here are my responses to each of your points:
Another teammate rewrote the IMDB queries into SPIDER style; I will check with him. The unsubstituted variable name in the SQL query is not intentional. Our code assumed that variable examples were provided for all SQLs, and hence produced this artifact. Below is the original data dictionary:

```python
{'sentences': [{'question-split': '8',
                'text': 'When was " writer_name0 " born ?',
                'variables': {'writer_name0': 'Kevin Spacey'}},
               {'question-split': '9',
                'text': 'In what year was " writer_name0 " born ?',
                'variables': {'writer_name0': 'Kevin Spacey'}}],
 'sql': ['SELECT WRITERalias0.BIRTH_YEAR FROM WRITER AS WRITERalias0 WHERE WRITERalias0.NAME = "writer_name0" ;',
         'SELECT ACTORalias0.BIRTH_YEAR FROM ACTOR AS ACTORalias0 WHERE ACTORalias0.NAME = "actor_name0" ;',
         'SELECT DIRECTORalias0.BIRTH_YEAR FROM DIRECTOR AS DIRECTORalias0 WHERE DIRECTORalias0.NAME = "director_name0" ;',
         'SELECT PRODUCERalias0.BIRTH_YEAR FROM PRODUCER AS PRODUCERalias0 WHERE PRODUCERalias0.NAME = "producer_name0" ;'],
 'variables': [{'example': 'Kevin Spacey', 'location': 'both', 'name': 'writer_name0', 'type': 'writer_name'}]}
```

Here, the second SQL template has an "actor_name0" variable, but no example is provided.
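To make the artifact concrete, a minimal sketch (not the project's actual conversion code) of substitution that only uses the listed variable examples reproduces the leftover placeholder: the first SQL gets "Kevin Spacey" substituted, while the second keeps "actor_name0" because no example was provided for it.

```python
# Sketch only (not the project's conversion code): substitute variable examples
# into each SQL variant. Variables without an example, like "actor_name0" in
# the second variant above, survive as raw placeholders.
def substitute(sql, variables):
    # variables: list of {'name': ..., 'example': ...} dicts, as in the entry above
    for var in variables:
        sql = sql.replace(var["name"], var["example"])
    return sql

variables = [{"example": "Kevin Spacey", "location": "both",
              "name": "writer_name0", "type": "writer_name"}]
first = 'SELECT WRITERalias0.BIRTH_YEAR FROM WRITER AS WRITERalias0 WHERE WRITERalias0.NAME = "writer_name0" ;'
second = 'SELECT ACTORalias0.BIRTH_YEAR FROM ACTOR AS ACTORalias0 WHERE ACTORalias0.NAME = "actor_name0" ;'
print(substitute(first, variables))   # placeholder replaced by Kevin Spacey
print(substitute(second, variables))  # "actor_name0" remains: no example listed
```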
Again, thanks for all the comments. I will add a note in the README indicating that the current version has a couple of issues and is useful only for preliminary exploration. Additionally, in our next update, we will release the complete provenance of the data transformation process such that it becomes transparent and maintainable.
I really appreciate that you responded so quickly, @ruiqi-zhong. I am also happy to hear that you want to release a better version of the dataset. My team and I are really looking forward to it! Meanwhile, please find more comments below.

**Regarding the missing questions**
Hmm, I'm afraid the pkl file does not contain all the information that you'd need to fix everything. Consider the query template from the IMDB dataset. Your pkl file contains the following:
Meanwhile, the original entry in the data release by University of Michigan researchers is
It appears that your pkl file does not include the variable assignments that would be required to reconstruct the second query ("What year was the movie The Imitation Game produced"). I also imagine that the test-suite databases you constructed probably do not contain the content that would be required to properly evaluate the corresponding query against model predictions. Is my understanding correct? It would imply that the databases need to be regenerated to fix the issue once and for all.

**Regarding the queries that were not deanonymized**

Thank you for your clarification; I think I understand now what happened. Note that the documentation for text2sql-data says that they "only use the first query, but retain the variants for completeness (e.g. using joins vs. conditions)." Queries other than the first one often do not even have the variable assignments provided, as you correctly noted in your analysis. I think it would make sense for you to also use only the first query when you regenerate the data (see the sketch after this comment). This also explains why you have different queries associated with different question texts (i.e. the review count example in my previous post).

**Regarding the missing queries**

Thanks for making clear what happened here.
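The suggested regeneration could look roughly like the sketch below (an assumption about how it might be done, not the repo's code): keep only the first SQL of each entry, and substitute each sentence's own variable assignments into both the question text and the query. Quote handling around substituted values is ignored for brevity.

```python
# Sketch of the suggested regeneration (an assumption, not the repo's code):
# use only the first SQL per entry, and substitute each sentence's own
# variable values into both the question text and the SQL.
def expand_entry(entry):
    canonical_sql = entry["sql"][0]            # text2sql-data documents the first query as canonical
    for sentence in entry["sentences"]:
        question, sql = sentence["text"], canonical_sql
        for name, value in sentence["variables"].items():
            question = question.replace(name, value)
            sql = sql.replace(name, value)
        yield {"question": question, "query": sql}

# usage (with a local copy of the original data):
# points = [p for entry in json.load(open("imdb.json")) for p in expand_entry(entry)]
```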
I think I have found another issue. While I can see that a lot of effort was put into rewriting the original SQL queries from text2sql-data into Spider SQL, there are still some issues:
It appears that it should be
Thanks for all the feedback! I just pushed an update.
I created one data point for each natural language query. The data conversion process is now in classical_provenance.ipynb. Now, for each data point in classical_test.pkl, we can trace back to the original data point in Jonathan's repo using the
I fixed it and the details can be seen in
In this version, I decided to back off and directly use the SQLs from Jonathan's repo. I also changed some details of how neighbor queries are created:

1) Instead of dropping every sub-span to create neighbor queries, I now only drop sub-spans that correspond to a derivation of an AST intermediate symbol (i.e., a subtree). This is because SQLs from Jonathan's repo are usually very long, and dropping every possible sub-span would create a quadratic number of neighbor queries, which is too many (a toy sketch of this idea follows this comment).
2) I forbid dropping any predicate of a join statement (e.g. …).

I also added two features: evaluating on a subset of the classical datasets, and caching results for SQLs that have already been evaluated. This is because queries on ATIS can sometimes take a while to run.

Again, I really appreciate your time examining the queries closely. I am closing this issue. Please feel free to open a new issue if you have any comments on the updated version.
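The change to neighbor-query generation in point 1 can be pictured with a toy sketch like the one below. It is a drastic simplification: the actual implementation works on the SQL grammar's AST and also forbids dropping join predicates, whereas this sketch only drops one top-level WHERE conjunct at a time instead of every possible token sub-span.

```python
# Toy sketch of "drop whole subtrees, not arbitrary sub-spans": generate
# neighbor queries by removing one top-level WHERE conjunct at a time.
import re

def neighbor_queries(sql):
    parts = re.split(r"(?i)\bWHERE\b", sql, maxsplit=1)
    if len(parts) != 2:
        return []
    head, where = parts[0].strip(), parts[1]
    # naive conjunct split: ignores nesting, OR, and subqueries
    conjuncts = [c.strip() for c in re.split(r"(?i)\bAND\b", where)]
    neighbors = []
    for i in range(len(conjuncts)):
        kept = conjuncts[:i] + conjuncts[i + 1:]
        neighbors.append(head if not kept else head + " WHERE " + " AND ".join(kept))
    return neighbors

print(neighbor_queries(
    'SELECT name FROM city WHERE state = "Texas" AND population > 150000'
))
```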
Hi Ruiqi, thanks a lot for your continued effort to make this test suite work correctly for the classical datasets. Unfortunately, I think I have found another issue, this time probably the last one. In Jonathan's repo, examples are grouped by SQL query. Each query can have an arbitrary number of associated sentences, and every sentence has its own set of values that should be substituted for the variables. E.g., in geography.json, in the first query the city is Arizona, in the second it is Texas, etc. Your code in
As a result, many queries in …

The consequence of this mismatch is that the generated test suite cannot be used to evaluate the correctness of the original queries from the classical datasets. I noticed this when the evaluation returned some really surprising false positives. I would be extremely grateful if you could fix this issue and regenerate the test suite. This resource makes a big difference in making meaningful evaluation reproducible, and it would be really great to get it right.
(I'm re-opening the issue) Thanks for your feedback! Would you mind clarifying which one of the following cases is happening? Suppose the dictionary associated with each text-sql pair is d, then
I ran the following code
and did not find "dead poet" in the output. It looks like it is case 3 rather than case 1? This will help me pin down the problem with the current evaluation, and I will definitely fix it if there is an issue. Thanks!
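The snippet that was run is not preserved in this thread; a check in that spirit might look like the sketch below. The field names "question" and "query" are assumptions about the classical_test.pkl layout, not confirmed by the repo.

```python
# Sketch of the kind of check described above (the original snippet is not
# shown in this thread). The 'question' and 'query' field names are guesses
# about the classical_test.pkl layout.
import pickle

with open("classical_test.pkl", "rb") as f:
    data = pickle.load(f)

hits = [d for d in data
        if "dead poet" in str(d.get("question", "")).lower()
        or "dead poet" in str(d.get("query", "")).lower()]
print(len(hits))  # reportedly no occurrence of "dead poet" was found
```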
It is case 3 that happens, namely:
I would argue this is a significant change compared to the original data, because linking literals to the right column in the schema is in general a non-trivial issue. For example, in "How big is New Mexico?" New Mexico is clearly a state, because there is no city called New Mexico. Whereas in "How big is New York?" New York can refer to either a city or a state. You can also check out the preprocessing code for this data that was released by Google: https://github.com/google-research/language/blob/89e5b53a0e7f9f3e2da25a5da71ce5bd466acabd/language/xsp/data_preprocessing/michigan_preprocessing.py#L30 . It is very well documented, and does variable substitution in what I think is the right way.
I understand the computational effort considerations, but technically this simplification changes the original question texts in the datasets.
It would greatly help me in my work, thanks! Here is the "dead poets" line in the original data: https://github.com/jkkummerfeld/text2sql-data/blob/master/data/imdb.json#L10
Thanks for the clarification. I will probably make this update by the end of this month. Thanks!
Great, thank you so much! I am looking forward to the new release.
Thank you for releasing this, great work!
I have noticed that the "classical datasets" in your release always have fewer examples than the original release. For example:
Academic: 185 vs 189
Yelp: 122 vs 128
IMDB: 97 vs 131
Can you comment on which specific examples were left out? And maybe add a few words to README.md?