combination encoding = "latin1" + method = "CWB" #8
I am not sure that relying on the cwb-encode command line tool used by the "CWB" method is a good idea in the long run. But the "CWB" method should do the same as the "R" method, if only for testing purposes. My first idea is that something goes wrong when the token stream is written to disk. I use:

data.table::fwrite(
  list(token_stream = token_stream), file = vrt_tmp_file,
  col.names = FALSE, quote = FALSE, showProgress = interactive()
)

If you could turn your initial example into a minimal reproducible example, it would be easier for us to follow up on this.
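For what it's worth, here is a minimal sketch (my addition, not part of the original comment) of how one could check whether the bytes written by fwrite() are still latin1 or have been re-encoded to UTF-8; the temporary file name is just for illustration:

library(data.table)

x <- iconv("hören", from = "UTF-8", to = "latin1")
charToRaw(x)  # latin1 keeps "ö" as a single byte (f6)

vrt_tmp_file <- tempfile(fileext = ".vrt")
fwrite(
  list(token_stream = x), file = vrt_tmp_file,
  col.names = FALSE, quote = FALSE
)
readBin(vrt_tmp_file, what = "raw", n = file.size(vrt_tmp_file))
# f6 in the output means the latin1 bytes survived; c3 b6 would mean UTF-8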
This is a reproducible example, so the issue is still valid:

library(cwbtools)
library(magrittr)
library(tokenizers)
library(dplyr)
library(polmineR)
regdir <- fs::path(tempdir(), "regdir")
datadir <- fs::path(tempdir(), "corpusdata", "tmpcorpus")
dir.create(regdir)
dir.create(datadir, recursive = TRUE)
CD <- CorpusData$new()
CD$tokenstream <- c("Das müssen Sie einfach hören!") %>%
iconv(from = "UTF-8", to = "latin1") %>%
tokenize_words(lowercase = FALSE, strip_punct = FALSE) %>%
as.data.frame() %>%
set_colnames("word")
CD$encode(
registry_dir = regdir,
data_dir = datadir,
corpus = "LATIN1",
encoding = "latin1",
method = "CWB",
p_attributes = "word",
s_attributes = character(),
compress = FALSE
)
corpus("LATIN1", registry_dir = regdir) %>%
  get_token_stream(p_attribute = "word")

You see: the umlauts in the returned token stream are garbled.
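To make the problem visible independently of how the console renders the strings, one can inspect the declared encoding and the raw bytes of the decoded tokens (a small sketch of my own, reusing the objects from the example above):

ts <- corpus("LATIN1", registry_dir = regdir) %>%
  get_token_stream(p_attribute = "word")
Encoding(ts)           # declared encoding of each token
lapply(ts, charToRaw)  # f6 ("ö") / fc ("ü") are latin1 bytes; c3 b6 / c3 bc would be UTF-8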
This is a minified example, so there is an issue with p_attribute_encode():

library(cwbtools)
library(magrittr)
library(tokenizers)
library(dplyr)
library(polmineR)
regdir <- fs::path(tempdir(), "regdir")
datadir <- fs::path(tempdir(), "corpusdata", "tmpcorpus")
dir.create(regdir)
dir.create(datadir, recursive = TRUE)
tokenstream <- c("Das müssen Sie einfach hören!") %>%
iconv(from = "UTF-8", to = "latin1") %>%
tokenize_words(lowercase = FALSE, strip_punct = FALSE)
p_attribute_encode(
token_stream = tokenstream[[1]],
p_attribute = "word",
registry_dir = regdir,
data_dir = datadir,
corpus = "LATIN1",
encoding = "latin1",
method = "CWB",
compress = FALSE,
quietly = TRUE
)
RcppCWB::cqp_reset_registry(registry = regdir)
corpus("LATIN1", registry_dir = regdir) %>%
  get_token_stream(p_attribute = "word")
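A hedged guess at where the encoding gets lost (my addition, not in the original comment): tokenizers is built on stringi, which always returns UTF-8 strings, so the latin1 declaration does not survive tokenization:

x <- iconv("hören", from = "UTF-8", to = "latin1")
Encoding(x)      # "latin1"
tokens <- tokenize_words(x, lowercase = FALSE, strip_punct = FALSE)[[1]]
Encoding(tokens) # presumably "UTF-8" again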
So here is the solution: The issue is a false alarm as far as the handling of encodings by cwbtools is concerned. The tokenizer converts the input back to UTF-8, so it has to be re-encoded to latin1 once more before encoding the corpus:

library(cwbtools)
library(magrittr)
library(tokenizers)
library(dplyr)
library(polmineR)
regdir <- fs::path(tempdir(), "regdir")
datadir <- fs::path(tempdir(), "corpusdata", "tmpcorpus")
dir.create(regdir)
dir.create(datadir, recursive = TRUE)
tokenstream <- c("Das müssen Sie einfach hören!") %>%
iconv(from = "UTF-8", to = "latin1") %>%
tokenize_words(lowercase = FALSE, strip_punct = FALSE) %>%
.[[1]] %>%
iconv(from = "UTF-8", to = "latin1") # required again!
p_attribute_encode(
token_stream = tokenstream,
p_attribute = "word",
registry_dir = regdir,
data_dir = datadir,
corpus = "LATIN1",
encoding = "latin1",
method = "CWB",
compress = FALSE,
quietly = TRUE
)
RcppCWB::cqp_reset_registry(registry = regdir)
corpus("LATIN1", registry_dir = regdir)@encoding # returns 'latin1'
RcppCWB::corpus_property(corpus = "LATIN1", registry = regdir, property = "charset") # is 'latin1'
lexfile <- fs::path(datadir, "word.lexicon")
lexicon <- readBin(con = lexfile, what = character(), n = file.info(lexfile)$size)
Encoding(lexicon) <- "latin1"
lexicon
y <- RcppCWB::cl_id2str(corpus = "LATIN1", registry = regdir, id = 0:5, p_attribute = "word")
Encoding(y) <- "latin1"
y
y <- corpus("LATIN1", registry_dir = regdir) %>%
  get_token_stream(p_attribute = "word")
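To avoid re-applying iconv() by hand after every tokenization, one could use a small wrapper; tokenize_latin1() is a hypothetical helper of my own, not part of cwbtools or tokenizers:

tokenize_latin1 <- function(x) {
  tokenize_words(x, lowercase = FALSE, strip_punct = FALSE)[[1]] %>%
    iconv(from = "UTF-8", to = "latin1")  # restore the latin1 bytes lost in tokenization
}

tokenstream <- c("Das müssen Sie einfach hören!") %>%
  iconv(from = "UTF-8", to = "latin1") %>%
  tokenize_latin1()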
I explored the documentation of the tokenizers package, but as far as I can see, it is silent on encoding. See the issue I filed: ropensci/tokenizers#87 (comment)
This way of encoding breaks German umlauts:

CD$encode(
  registry_dir = registry,
  data_dir = stateparl_data_dir,
  corpus = toupper(corpus),
  encoding = "latin1",
  method = "CWB",
  p_attributes = p_attrs,
  s_attributes = s_attrs,
  compress = TRUE
)

While using method = "R" apparently works:

CD$encode(
  registry_dir = registry,
  data_dir = stateparl_data_dir,
  corpus = toupper(corpus),
  encoding = "latin1",
  method = "R",
  p_attributes = p_attrs,
  s_attributes = s_attrs,
  compress = TRUE
)