Unable to process German, Chinese and Hebrew in case inSensitive mode #6
Fixed issue with every column being dropped when doing case-insensitive queries in the German, Chinese, and Hebrew corpuses. This is my solution to [Issue 6](#6).
Thanks for the information about the problem. The exact query you posted was missing the value for startYear, but after adding one I was able to replicate the problem, which caused all the data columns to be dropped.
I first pushed a commit that made your supplied query work, but I quickly realized that it wasn't a complete solution. If I specified a Hebrew, Chinese, or German corpus using the --corpus argument (instead of appending it to the query string manually, as in your supplied query), all the columns were still being dropped. So I just pushed a second commit, which now appears to have solved the problem. I tested a handful of individual queries and they all seem to work:
python getngrams.py פמיניזם --startYear=1900 --endYear=2000 --corpus=heb_2012 -caseInsensitive
python getngrams.py 女性主义 --startYear=1900 --endYear=2000 --corpus=chi_sim_2012 -caseInsensitive
python getngrams.py Füße --startYear=1900 --endYear=2000 --corpus=ger_2012 -caseInsensitive
However, when I combine all the queries into a single query, every case variant of each term is returned separately. For example, here is a query where both Füße and füße are returned:
python getngrams.py פמיניזם:heb_2012, 女性主义:chi_sim_2012,Füße:ger_2012 -caseInsensitive
I suppose I could manually sum the cases, since a pre-summed "(All)"-like column isn't provided.
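Manually summing the case variants could look roughly like the sketch below. It is not taken from getngrams.py: the function name `sum_case_variants` and the plain dict-of-lists layout are assumptions made for illustration, and the numbers in the usage note are made up. Labels that differ only by case are grouped with `str.lower()` and their yearly values are added element-wise.

```python
def sum_case_variants(series):
    """Sum series whose labels differ only by case.

    `series` is a hypothetical mapping of column label -> list of yearly
    values (all lists the same length). Returns a new dict keyed by the
    lower-cased label.
    """
    totals = {}
    for label, values in series.items():
        key = label.lower()  # Hebrew and Chinese labels are unaffected
        if key in totals:
            # Add this variant to the running total, year by year.
            totals[key] = [a + b for a, b in zip(totals[key], values)]
        else:
            totals[key] = list(values)
    return totals


# Illustrative values only, not real n-gram frequencies.
example = {
    "Füße": [1.0, 2.0],
    "füße": [0.5, 0.5],
    "女性主义": [3.0, 4.0],
}
summed = sum_case_variants(example)
```

Here `summed` holds one combined series for the two German variants and leaves the Chinese series unchanged, since it has only one case.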
Oh great, thanks for the quick fix. On Thu, Jan 2, 2014 at 11:48 AM, econpy notifications@github.com wrote:
This can be reproduced with this query:
python getngrams.py פמיניזם:heb_2012, 女性主义:chi_sim_2012 --startYear= --endYear=2008 -caseInsensitive -smoothing=1
The problem is that these languages return only one case, so there is no "(All)" column, and the data is thrown away in the -AllData routine.
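One way to guard against this is sketched below. It is only an illustration of the idea, not the fix that was committed: the function name `columns_to_keep` and the assumption that summed columns are labelled with a trailing " (All)" are both hypothetical. The filter keeps every "(All)" column and also keeps any column that has no "(All)" counterpart, so single-case terms are not dropped.

```python
ALL_SUFFIX = " (All)"  # assumed label format for the summed column


def columns_to_keep(columns):
    """Return the columns to retain when per-case columns are filtered out.

    Keeps every "(All)" column, plus any column whose term has no "(All)"
    counterpart (for example Hebrew or Chinese terms, which have one case).
    """
    # Lower-cased base terms that already have a summed "(All)" column.
    covered = {
        c[: -len(ALL_SUFFIX)].lower() for c in columns if c.endswith(ALL_SUFFIX)
    }
    keep = []
    for c in columns:
        if c.endswith(ALL_SUFFIX) or c.lower() not in covered:
            keep.append(c)
    return keep


kept = columns_to_keep(["Füße (All)", "Füße", "füße", "פמיניזם"])
```

With these inputs the two per-case German columns are filtered out, while the Hebrew column survives because nothing summarizes it.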