Some characters in stopwords_tr do not appear Turkish character #15

Closed
erkanozhan opened this issue Jul 17, 2019 · 5 comments · Fixed by #16

Comments


erkanozhan commented Jul 17, 2019

stopwords_tr <- data.frame(word = stopwords::stopwords("tr",source="stopwords-iso"), stringsAsFactors = FALSE)
stopwords_tr

Some characters in stopwords_tr are not Turkish characters. For example:

1   acaba
2   acep
3   adamakıllı
4   adeta
5   ait
6   altmýþ       <- should be: altmış
7   altmış
8   altý         <- should be: altı

I'm looking for a way to fix them.

stopwords_tr$word<-gsub("ý","ı",stopwords_tr$word)

The result did not change. I also tried the following, but none of them worked.

Encoding(stopwords_tr$word) <- "WINDOWS-1254"
Encoding(stopwords_tr$word) <- "LATIN-5"
Encoding(stopwords_tr$word) <- "UTF-8"

Another interesting thing.

When you double-click stopwords_tr in RStudio to display it, the character appears as "ý". In the Console, it looks like "y".
Is there a parameter to set the encoding?

Thanks to everyone.
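
Before trying to repair anything, it may help to check which code points are actually stored in a word, since the viewer and the console can render the same string differently. The following is a minimal sketch using only base R; the sample word is typed with a Unicode escape and stands in for an entry from stopwords_tr, and the variable names are illustrative.

```r
# Sample word as it shows up in the viewer: "alt" followed by y-acute.
# A Unicode escape keeps the example independent of the file encoding.
x <- "alt\u00fd"

# Declared encoding mark of the string
mark <- Encoding(x)

# Code points actually stored (after making sure the string is UTF-8)
codes <- utf8ToInt(enc2utf8(x))
labels <- sprintf("U+%04X", codes)

print(mark)
print(labels)

# U+00FD is y-acute; the Turkish dotless i would be U+0131
has_dotless_i <- 305L %in% codes
print(has_dotless_i)
```

If the last code point is U+00FD rather than U+0131, the stored data itself is wrong, and changing how the string is displayed or marked will not repair it.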

Collaborator

kbenoit commented Jul 17, 2019

Thanks for finding this! The "encoding bit" of a character object in R can only be one of "unknown", "UTF-8", "latin1", or "bytes", and all of the stopwords values should be UTF-8, so we just need to correct the mis-encoded stop words. I will fix this if you send me the corrections for the words that need them.

Thanks!
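
The pattern in the report (ý where ı belongs, þ where ş belongs) is what one would expect if ISO-8859-9 (Latin-5) bytes had been decoded as Latin-1, because ı and ş sit at 0xFD and 0xFE in Latin-5 while ý and þ occupy those positions in Latin-1. That is an assumption about how the list was damaged, not something confirmed in this thread. Under that assumption, a word could be repaired by reversing the wrong decoding with base R's iconv(), roughly as sketched below; whether the encoding name "ISO-8859-9" is accepted depends on the platform's iconv.

```r
# Mis-encoded word from the report, written with Unicode escapes
wrong <- "altm\u00fd\u00fe"

# Step 1: recover the single bytes, assuming they were read as Latin-1
bytes <- iconv(enc2utf8(wrong), from = "UTF-8", to = "latin1")

# Step 2: decode those same bytes as ISO-8859-9 (Turkish)
fixed <- iconv(bytes, from = "ISO-8859-9", to = "UTF-8")

print(Encoding(fixed))
print(utf8ToInt(fixed))

# Caution: step 1 returns NA for words that contain characters outside
# Latin-1, so this should only be applied to words known to be damaged.
```

Correctly encoded words that contain ý or þ for legitimate reasons would be altered by this as well, so an explicit list of wrong words and their corrections is the safer route.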

@kbenoit kbenoit mentioned this issue Jul 17, 2019
Collaborator

kbenoit commented Jul 17, 2019

@erkanozhan please see #16. Does that solve it?

Author

erkanozhan commented Jul 17, 2019

Thank you very much for your answer. I'm making the corrections and will send them shortly. I have prepared an Excel table. Where can I send the file?

Collaborator

kbenoit commented Jul 17, 2019

kbenoit@lse.ac.uk, but it would be much better to use this tool, https://www.tablesgenerator.com/markdown_tables, so you can paste the Markdown here. I really just need the wrong words and their corrections, e.g.

| Wrong  | Corrected |
|--------|-----------|
| altý   | altı      |
| altmýþ | altmış    |
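
A table of wrong and corrected words like the one above could be applied to a stop word vector with a simple lookup. The sketch below is only an illustration: the data frame corrections and the vector words are hypothetical stand-ins, and it does not claim to show how the fix in #16 was implemented.

```r
# Hypothetical corrections table with the two example rows
corrections <- data.frame(
  wrong     = c("alt\u00fd", "altm\u00fd\u00fe"),
  corrected = c("alt\u0131", "altm\u0131\u015f"),
  stringsAsFactors = FALSE
)

# Small stand-in for the stop word list, including one word that is
# present in both its wrong and its correct form
words <- c("acaba", "alt\u00fd", "altm\u00fd\u00fe", "altm\u0131\u015f")

# Find each word in the "wrong" column; NA means no correction needed
hit <- match(words, corrections$wrong)
words[!is.na(hit)] <- corrections$corrected[hit[!is.na(hit)]]

# Corrections can create duplicates of entries that were already right
words <- unique(words)
print(words)
```

Exact matching on whole words leaves every other entry untouched, which avoids the risk of a character-level substitution changing words that were already correct.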

@erkanozhan
Author

I've added the fixes to #16.
