combination encoding = "latin1" + method = "CWB" #8

Closed · ChristophLeonhardt opened this issue Feb 28, 2019 · 5 comments
ChristophLeonhardt commented Feb 28, 2019

Encoding a corpus this way breaks German umlauts:

CD$encode(
  registry_dir = registry,
  data_dir = stateparl_data_dir,
  corpus = toupper(corpus),
  encoding = "latin1",
  method = "CWB",
  p_attributes = p_attrs,
  s_attributes = s_attrs,
  compress = TRUE
)

Using method = "R", by contrast, apparently works:

CD$encode(
  registry_dir = registry,
  data_dir = stateparl_data_dir,
  corpus = toupper(corpus),
  encoding = "latin1",
  method = "R",
  p_attributes = p_attrs,
  s_attributes = s_attrs,
  compress = TRUE
)

ChristophLeonhardt changed the title from combination encoding = "latin1" to combination encoding = "latin1" + method = "CWB" on Feb 28, 2019
ablaette (Collaborator) commented Oct 4, 2019:

I am not sure it is a good idea to rely on the cwb-encode command line tool, which the "CWB" method uses, in the long run. Still, the "CWB" method should produce the same result as the "R" method, if only for testing purposes.

My first idea is that something goes wrong when the token stream is written to disk. I write it with fwrite() from the data.table package because it is really fast. Maybe it outputs UTF-8?

data.table::fwrite(
  list(token_stream = token_stream), file = vrt_tmp_file,
  col.names = FALSE, quote = FALSE, showProgress = interactive()
)
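
One way to check this hypothesis (a quick diagnostic sketch, not part of cwbtools) is to write a latin1 string with fwrite() and inspect the bytes that actually land on disk:

# Diagnostic sketch: write a latin1-encoded string with data.table::fwrite()
# and look at the raw bytes of the resulting file.
library(data.table)

x <- iconv("hören", from = "UTF-8", to = "latin1")
Encoding(x)    # "latin1"
charToRaw(x)   # "ö" is a single byte (0xf6) in latin1

tmp <- tempfile()
fwrite(list(x), file = tmp, col.names = FALSE, quote = FALSE)
readBin(tmp, what = "raw", n = file.info(tmp)$size)
# If 0xf6 comes back as the two bytes 0xc3 0xb6, the string was
# re-encoded to UTF-8 on the way to disk.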

If you could turn your initial example into a minimal reproducible example, it would be easier for us to follow up on this.

ablaette (Collaborator) commented:

Here is a reproducible example, so the issue is still valid.

library(cwbtools)
library(magrittr)
library(tokenizers)
library(dplyr)
library(polmineR)

regdir <- fs::path(tempdir(), "regdir")
datadir <- fs::path(tempdir(), "corpusdata", "tmpcorpus")

dir.create(regdir)
dir.create(datadir, recursive = TRUE)

CD <- CorpusData$new()

CD$tokenstream <- c("Das müssen Sie einfach hören!") %>% 
  iconv(from = "UTF-8", to = "latin1") %>%
  tokenize_words(lowercase = FALSE, strip_punct = FALSE) %>%
  as.data.frame() %>%
  set_colnames("word")


CD$encode(
  registry_dir = regdir,
  data_dir = datadir,
  corpus = "LATIN1",
  encoding = "latin1",
  method = "CWB",
  p_attributes = "word",
  s_attributes = character(),
  compress = FALSE
)

corpus("LATIN1", registry_dir = regdir) %>%
  get_token_stream(p_attribute = "word")

You see:
"Das" "müssen" "Sie" "einfach" "hören" "!"

ablaette (Collaborator) commented:

Here is a further minimized example, which shows that the issue sits in p_attribute_encode().

library(cwbtools)
library(magrittr)
library(tokenizers)
library(dplyr)
library(polmineR)

regdir <- fs::path(tempdir(), "regdir")
datadir <- fs::path(tempdir(), "corpusdata", "tmpcorpus")

dir.create(regdir)
dir.create(datadir, recursive = TRUE)

tokenstream <- c("Das müssen Sie einfach hören!") %>% 
  iconv(from = "UTF-8", to = "latin1") %>%
  tokenize_words(lowercase = FALSE, strip_punct = FALSE)

p_attribute_encode(
  token_stream = tokenstream[[1]],
  p_attribute = "word",
  registry_dir = regdir,
  data_dir = datadir,
  corpus = "LATIN1",
  encoding = "latin1",
  method = "CWB",
  compress = FALSE,
  quietly = TRUE
)

RcppCWB::cqp_reset_registry(registry = regdir)

corpus("LATIN1", registry_dir = regdir) %>%
  get_token_stream(p_attribute = "word")

ablaette (Collaborator) commented:

So here is the solution: as far as the handling of encodings by p_attribute_encode() is concerned, the issue is a false alarm. tokenize_words() implicitly converts the latin1 input to UTF-8, which causes the errors we see later. If we run iconv() on the result of tokenize_words() (again), everything is fine.

library(cwbtools)
library(magrittr)
library(tokenizers)
library(dplyr)
library(polmineR)

regdir <- fs::path(tempdir(), "regdir")
datadir <- fs::path(tempdir(), "corpusdata", "tmpcorpus")

dir.create(regdir)
dir.create(datadir, recursive = TRUE)

tokenstream <- c("Das müssen Sie einfach hören!") %>% 
  iconv(from = "UTF-8", to = "latin1") %>%
  tokenize_words(lowercase = FALSE, strip_punct = FALSE) %>%
  .[[1]] %>%
  iconv(from = "UTF-8", to = "latin1") # required again!

p_attribute_encode(
  token_stream = tokenstream,
  p_attribute = "word",
  registry_dir = regdir,
  data_dir = datadir,
  corpus = "LATIN1",
  encoding = "latin1",
  method = "CWB",
  compress = FALSE,
  quietly = TRUE
)

RcppCWB::cqp_reset_registry(registry = regdir)

corpus("LATIN1", registry_dir = regdir)@encoding # returns 'latin1'
RcppCWB::corpus_property(corpus = "LATIN1", registry = regdir, property = "charset") # is 'latin1'

lexfile <- fs::path(datadir, "word.lexicon")
lexicon <- readBin(con = lexfile, what = character(), n = file.info(lexfile)$size)
Encoding(lexicon) <- "latin1"
lexicon

y <- RcppCWB::cl_id2str(corpus = "LATIN1", registry = regdir, id = 0:5, p_attribute = "word")
Encoding(y) <- "latin1"
y

y <- corpus("LATIN1", registry_dir = regdir) %>%
  get_token_stream(p_attribute = "word")

ablaette (Collaborator) commented:

I explored the documentation of the tokenizers package, but as far as I can see, it is silent on encoding. See the issue I filed: ropensci/tokenizers#87 (comment)
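
For the record, the behaviour can be reproduced with stringi, which tokenizers builds on: stringi converts its output to UTF-8 by design. A minimal sketch:

# Minimal sketch: stringi, the backend of tokenizers, returns UTF-8
# regardless of the input encoding.
library(stringi)

s <- iconv("hören", from = "UTF-8", to = "latin1")
Encoding(s)    # "latin1"

tokens <- stri_split_boundaries(s, type = "word")[[1]]
Encoding(tokens)    # "UTF-8"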
