implicit conversion of character input to UTF-8 #87

ablaette · 2024-02-15T08:20:18Z

tokenize_words() implicitly converts non-UTF-8 input to UTF-8. See the following example (latin1 in, UTF-8 out). As I was not aware of this behavior, it caused me some headaches (see PolMine/cwbtools#8 (comment)).

library(tokenizers)
library(magrittr) # provides the %>% pipe used below

c("Smørrebrød tastes great!") %>% 
  iconv(from = "UTF-8", to = "latin1") %>%
  tokenize_words(lowercase = FALSE, strip_punct = FALSE) %>%
  .[[1]] %>%
  Encoding()

[1] "UTF-8" "unknown" "unknown" "unknown"
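To check that this is an actual re-encoding and not just a changed encoding mark, the raw bytes can be compared. This is a minimal sketch under the assumption that the behavior is as shown above; the variable names are only illustrative.

```r
library(tokenizers)

# Build the latin1 input from an escaped UTF-8 literal so the script itself
# does not depend on the encoding of the source file.
x_utf8   <- "Sm\u00f8rrebr\u00f8d tastes great!"
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

tokens <- tokenize_words(x_latin1, lowercase = FALSE, strip_punct = FALSE)[[1]]

# First word as it went in (latin1) and as it came out of the tokenizer.
word_in  <- iconv("Sm\u00f8rrebr\u00f8d", from = "UTF-8", to = "latin1")
word_out <- tokens[1]

# latin1 stores each character in one byte; UTF-8 needs two bytes per "ø".
charToRaw(word_in)
charToRaw(word_out)
length(charToRaw(word_in)) < length(charToRaw(word_out))
```

If the last line is TRUE, the token was converted byte for byte, not merely relabelled.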

Admittedly, the days of 'latin1' are almost entirely over. But the package documentation is silent on this; the only reference to encoding is in the 'Description' field of the DESCRIPTION file: "The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'."

Maybe include a sentence like this in the 'basic-tokenizers' documentation object? "Non-UTF-8 input is converted to UTF-8."
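For anyone who needs the tokens back in the original encoding, a possible workaround is to convert them explicitly after tokenization. The helper below is hypothetical (tokenize_words_in() is not part of tokenizers) and assumes the tokens come back as UTF-8 or plain ASCII, as observed above.

```r
library(tokenizers)

# Hypothetical helper, not part of the tokenizers package: tokenize, then
# convert every token vector from UTF-8 back to the requested encoding.
tokenize_words_in <- function(x, encoding = "latin1", ...) {
  tokens <- tokenize_words(x, ...)
  # iconv() returns NA for tokens that cannot be represented in `encoding`.
  lapply(tokens, iconv, from = "UTF-8", to = encoding)
}

x_latin1 <- iconv("Sm\u00f8rrebr\u00f8d tastes great!", from = "UTF-8", to = "latin1")

res <- tokenize_words_in(x_latin1, encoding = "latin1",
                         lowercase = FALSE, strip_punct = FALSE)
Encoding(res[[1]])
```

Pure ASCII tokens are still reported as "unknown", because R only marks strings that contain non-ASCII characters.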

ablaette mentioned this issue Feb 15, 2024

combination encoding = "latin1" + method = "CWB" PolMine/cwbtools#8

Closed
