`tokenize_words()` implicitly converts non-UTF-8 input to UTF-8. See the following example (latin1 in, UTF-8 out). I was not aware of this behavior, and it caused me some headaches (see PolMine/cwbtools#8 (comment)).
Obviously, the days of 'latin1' are almost entirely over. But the package documentation is silent on this. The only reference to encoding is in the 'Description' field of the DESCRIPTION file: "The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'."
Maybe include a sentence like this in the 'basic-tokenizers' documentation object? "Non-UTF-8 input is converted to UTF-8."
Reported output of the example:
[1] "UTF-8" "unknown" "unknown" "unknown"
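For readers who want to check the behavior themselves, here is a minimal sketch of the "latin1 in, UTF-8 out" conversion described above. The input string is a made-up illustration, not the example from the original report, and the final `iconv()` step is only one possible workaround for callers that need the input encoding back; it is not something the package does or recommends.

```r
library(tokenizers)

# Hypothetical input: build a string and convert it to latin1.
# Unicode escapes keep this script independent of the file encoding.
x <- iconv("Sch\u00f6ne Gr\u00fc\u00dfe aus K\u00f6ln", from = "UTF-8", to = "latin1")
Encoding(x)  # the input is marked as latin1

# Tokenize; the issue reports that the tokens come back as UTF-8.
tokens <- tokenize_words(x)[[1]]
Encoding(tokens)  # pure-ASCII tokens are reported as "unknown" by R

# Possible workaround (an assumption, not package behavior):
# convert the tokens back to the encoding of the input.
tokens_latin1 <- iconv(tokens, from = "UTF-8", to = "latin1")
Encoding(tokens_latin1)
```

Note that R marks pure-ASCII strings as "unknown" regardless of the declared encoding, which would explain a mix of "UTF-8" and "unknown" in the output of `Encoding()`.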