Unable to process German, Chinese and Hebrew in case-insensitive mode #6

Open
DanangCode opened this issue Jan 2, 2014 · 2 comments

Comments

@DanangCode

This can be reproduced with this query:

python getngrams.py פמיניזם:heb_2012, 女性主义:chi_sim_2012 --startYear= --endYear=2008 -caseInsensitive -smoothing=1

The problem is that these languages return only one case, so there is no "(All)" column, and the data is thrown away in the -AllData routine.
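
To illustrate the failure mode, here is a minimal sketch. It is not the actual getngrams.py code. It assumes the post-processing step keeps only columns whose name ends in "(All)", so a corpus that returns a single case leaves nothing to keep. The function name `keep_all_columns` and the dict-of-lists layout are hypothetical, and the fallback branch is just one possible way to avoid discarding the data.

```python
def keep_all_columns(series_by_name):
    """Keep only the pre-summed "(All)" series; fall back to every series if none exist.

    series_by_name is a hypothetical mapping of column name -> list of yearly values.
    """
    all_cols = {name: vals for name, vals in series_by_name.items()
                if name.endswith("(All)")}
    # Without this fallback, a corpus that returns a single case would end up
    # with an empty result, because no "(All)" column is present.
    return all_cols if all_cols else dict(series_by_name)


# Illustrative values only, not real n-gram frequencies.
# English-style result: case variants plus a pre-summed column.
english = {"Feminism (All)": [3.0, 4.0], "feminism": [2.0, 3.0], "Feminism": [1.0, 1.0]}
# Single-case result, as reported for Hebrew and Chinese.
hebrew = {"פמיניזם": [0.5, 0.7]}

print(keep_all_columns(english))
print(keep_all_columns(hebrew))
```

With the fallback, the Hebrew series survives unchanged instead of being dropped.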

econpy added a commit that referenced this issue Jan 2, 2014
Fixed issue with every column being dropped when doing case-insensitive queries in the German, Chinese, and Hebrew corpuses.

This is my solution to [Issue 6](#6).
@econpy
Owner

econpy commented Jan 2, 2014

Thanks for the information about the problem. The exact query you posted was missing the value for startYear, but after adding one I was able to replicate the problem, which caused all the data columns to be dropped.

I first pushed a commit (https://github.com/econpy/google-ngrams/commit/571c784e79a28341ef02843dccc174d03bb7e3a6) which made your supplied query work, but I quickly realized that it wasn't a complete solution. If I specified a Hebrew, Chinese, or German corpus using the --corpus argument (instead of appending it to the query string manually, as in your supplied query), all the columns were still being dropped.

So I just pushed a second commit (https://github.com/econpy/google-ngrams/commit/42281106b2d0360dd4c5d4227fef450d921cdfea) which now appears to have solved the problem. I tested a handful of individual queries and they all seem to work:

python getngrams.py פמיניזם --startYear=1900 --endYear=2000 --corpus=heb_2012 -caseInsensitive
python getngrams.py 女性主义 --startYear=1900 --endYear=2000 --corpus=chi_sim_2012 -caseInsensitive
python getngrams.py Füße --startYear=1900 --endYear=2000 --corpus=ger_2012 -caseInsensitive

However, when I combine all the queries into a single query, all cases for each term are returned. For example, here is a series where both Füße and füße are returned:

python getngrams.py פמיניזם:heb_2012, 女性主义:chi_sim_2012,Füße:ger_2012 -caseInsensitive

I suppose I could manually sum the cases, since a pre-summed "(All)"-like column isn't provided.
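
A rough sketch of what that manual summing could look like, assuming the returned series are available as a mapping of column name to yearly values. The helper `sum_case_variants` and the data layout are hypothetical and not part of getngrams.py. Names are grouped with `str.casefold()`, and each group is labelled with the first variant seen.

```python
from collections import defaultdict


def sum_case_variants(series_by_name):
    """Add up series whose names differ only by case.

    Returns a mapping like {"Füße (All)": [...]}; the label uses the first
    variant seen in each group. Hypothetical helper for illustration.
    """
    groups = defaultdict(list)
    for name, values in series_by_name.items():
        groups[name.casefold()].append((name, values))

    summed = {}
    for variants in groups.values():
        label = variants[0][0] + " (All)"
        # zip(*...) walks the grouped series year by year.
        summed[label] = [sum(year_vals)
                         for year_vals in zip(*(vals for _, vals in variants))]
    return summed


# Illustrative values only, not real n-gram frequencies.
data = {"Füße": [1.0, 2.0], "füße": [0.5, 0.25], "女性主义": [3.0, 4.0]}
print(sum_case_variants(data))
```

Terms that come back with only one case, such as the Chinese query, pass through with their values unchanged.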

@DanangCode
Author

Oh, great, thanks for the quick fix.

