combination encoding = "latin1" + method = "CWB" #8

Closed · ChristophLeonhardt opened this issue Feb 28, 2019 · 5 comments
ChristophLeonhardt commented Feb 28, 2019

Encoding a corpus this way breaks German umlauts:

CD$encode(
  registry_dir = registry,
  data_dir = stateparl_data_dir,
  corpus = toupper(corpus),
  encoding = "latin1",
  method = "CWB",
  p_attributes = p_attrs,
  s_attributes = s_attrs,
  compress = TRUE
)

Using method = "R", by contrast, apparently works:

CD$encode(
  registry_dir = registry,
  data_dir = stateparl_data_dir,
  corpus = toupper(corpus),
  encoding = "latin1",
  method = "R",
  p_attributes = p_attrs,
  s_attributes = s_attrs,
  compress = TRUE
)

ChristophLeonhardt changed the title from combination encoding = "latin1" to combination encoding = "latin1" + method = "CWB" on Feb 28, 2019
ablaette (Collaborator) commented Oct 4, 2019:

I am not sure it is a good idea to rely on the cwb-encode command line tool, which the "CWB" method uses, in the long run. Still, the "CWB" method should produce the same result as the "R" method, if only for testing purposes.

My first idea is that something goes wrong when the token stream is written to disk. I write it with fwrite() from the data.table package because it is really fast. Maybe it outputs UTF-8?

data.table::fwrite(
  list(token_stream = token_stream), file = vrt_tmp_file,
  col.names = FALSE, quote = FALSE, showProgress = interactive()
)
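
One way to check this hypothesis (a quick diagnostic sketch, not part of cwbtools) is to write a latin1 string with fwrite() and inspect the bytes that actually land on disk:

# Diagnostic sketch: write a latin1-encoded string with data.table::fwrite()
# and look at the raw bytes of the resulting file.
library(data.table)

x <- iconv("hören", from = "UTF-8", to = "latin1")
Encoding(x)    # "latin1"
charToRaw(x)   # "ö" is a single byte (0xf6) in latin1

tmp <- tempfile()
fwrite(list(x), file = tmp, col.names = FALSE, quote = FALSE)
readBin(tmp, what = "raw", n = file.info(tmp)$size)
# If 0xf6 comes back as the two bytes 0xc3 0xb6, the string was
# re-encoded to UTF-8 on the way to disk.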

If you could turn your initial example into a minimal reproducible example, it would be easier for us to follow up on this.

ablaette (Collaborator) commented:

Here is a reproducible example, so the issue is still valid.

library(cwbtools)
library(magrittr)
library(tokenizers)
library(dplyr)
library(polmineR)

regdir <- fs::path(tempdir(), "regdir")
datadir <- fs::path(tempdir(), "corpusdata", "tmpcorpus")

dir.create(regdir)
dir.create(datadir, recursive = TRUE)

CD <- CorpusData$new()

CD$tokenstream <- c("Das müssen Sie einfach hören!") %>% 
  iconv(from = "UTF-8", to = "latin1") %>%
  tokenize_words(lowercase = FALSE, strip_punct = FALSE) %>%
  as.data.frame() %>%
  set_colnames("word")


CD$encode(
  registry_dir = regdir,
  data_dir = datadir,
  corpus = "LATIN1",
  encoding = "latin1",
  method = "CWB",
  p_attributes = "word",
  s_attributes = character(),
  compress = FALSE
)

corpus("LATIN1", registry_dir = regdir) %>%
  get_token_stream(p_attribute = "word")

You see:
"Das" "müssen" "Sie" "einfach" "hören" "!"

ablaette (Collaborator) commented:

Here is a further minimized example, which shows that the issue sits in p_attribute_encode().

library(cwbtools)
library(magrittr)
library(tokenizers)
library(dplyr)
library(polmineR)

regdir <- fs::path(tempdir(), "regdir")
datadir <- fs::path(tempdir(), "corpusdata", "tmpcorpus")

dir.create(regdir)
dir.create(datadir, recursive = TRUE)

tokenstream <- c("Das müssen Sie einfach hören!") %>% 
  iconv(from = "UTF-8", to = "latin1") %>%
  tokenize_words(lowercase = FALSE, strip_punct = FALSE)

p_attribute_encode(
  token_stream = tokenstream[[1]],
  p_attribute = "word",
  registry_dir = regdir,
  data_dir = datadir,
  corpus = "LATIN1",
  encoding = "latin1",
  method = "CWB",
  compress = FALSE,
  quietly = TRUE
)

RcppCWB::cqp_reset_registry(registry = regdir)

corpus("LATIN1", registry_dir = regdir) %>%
  get_token_stream(p_attribute = "word")

ablaette (Collaborator) commented:

So here is the solution: as far as the handling of encodings by p_attribute_encode() is concerned, the issue is a false alarm. tokenize_words() implicitly converts the latin1 input to UTF-8, which causes the errors we see later. If we run iconv() on the result of tokenize_words() (again), everything is fine.

library(cwbtools)
library(magrittr)
library(tokenizers)
library(dplyr)
library(polmineR)

regdir <- fs::path(tempdir(), "regdir")
datadir <- fs::path(tempdir(), "corpusdata", "tmpcorpus")

dir.create(regdir)
dir.create(datadir, recursive = TRUE)

tokenstream <- c("Das müssen Sie einfach hören!") %>% 
  iconv(from = "UTF-8", to = "latin1") %>%
  tokenize_words(lowercase = FALSE, strip_punct = FALSE) %>%
  .[[1]] %>%
  iconv(from = "UTF-8", to = "latin1") # required again!

p_attribute_encode(
  token_stream = tokenstream,
  p_attribute = "word",
  registry_dir = regdir,
  data_dir = datadir,
  corpus = "LATIN1",
  encoding = "latin1",
  method = "CWB",
  compress = FALSE,
  quietly = TRUE
)

RcppCWB::cqp_reset_registry(registry = regdir)

corpus("LATIN1", registry_dir = regdir)@encoding # returns 'latin1'
RcppCWB::corpus_property(corpus = "LATIN1", registry = regdir, property = "charset") # is 'latin1'

lexfile <- fs::path(datadir, "word.lexicon")
lexicon <- readBin(con = lexfile, what = character(), n = file.info(lexfile)$size)
Encoding(lexicon) <- "latin1"
lexicon

y <- RcppCWB::cl_id2str(corpus = "LATIN1", registry = regdir, id = 0:5, p_attribute = "word")
Encoding(y) <- "latin1"
y

y <- corpus("LATIN1", registry_dir = regdir) %>%
  get_token_stream(p_attribute = "word")

ablaette (Collaborator) commented:

I explored the documentation of the tokenizers package, but as far as I can see, it is silent on encoding. See the issue I filed: ropensci/tokenizers#87 (comment)
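
For the record, the behaviour can be reproduced with stringi, which tokenizers builds on: stringi converts its output to UTF-8 by design. A minimal sketch:

# Minimal sketch: stringi, the backend of tokenizers, returns UTF-8
# regardless of the input encoding.
library(stringi)

s <- iconv("hören", from = "UTF-8", to = "latin1")
Encoding(s)    # "latin1"

tokens <- stri_split_boundaries(s, type = "word")[[1]]
Encoding(tokens)    # "UTF-8"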
