Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fread() throws correct non-ASCII messages #4751

Merged
merged 11 commits into from
Jan 4, 2021

Conversation

shrektan
Copy link
Member

@shrektan shrektan commented Oct 11, 2020

Fixes #4747

Before the PR the last line throw warnings with garbage letters. This PR fixes it.

Example code

x <- "fa\xE7ile"
Encoding(x) <- "latin1"
txt <- sprintf("A,B\n%s,%s\n%s", x, x, x)
Encoding(txt) <- "UTF-8"
data.table::fread(text = txt, encoding = 'UTF-8')
txt2 <- iconv(txt, "UTF-8", "latin1")
data.table::fread(text = txt2, encoding = 'Latin-1')

Before

Warning message:
In data.table::fread(text = txt2, encoding = "Latin-1") : Discarded single-line footer: <<fa�ile>>

After

Warning message:
In data.table::fread(text = txt, encoding = "UTF-8") :
  Discarded single-line footer: <<façile>>

@codecov
Copy link

codecov bot commented Oct 11, 2020

Codecov Report

Merging #4751 (7f3ae4f) into master (789dfe4) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #4751   +/-   ##
=======================================
  Coverage   99.47%   99.47%           
=======================================
  Files          73       73           
  Lines       14560    14560           
=======================================
  Hits        14483    14483           
  Misses         77       77           
Impacted Files Coverage Δ
src/freadR.c 100.00% <ø> (ø)
src/fread.c 99.47% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 789dfe4...7f3ae4f. Read the comment docs.

@@ -7,8 +7,13 @@
#include "myomp.h"
#ifdef DTPY
#include "py_fread.h"
#define ENC2NATIVE(s) (s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @st-pasha , not sure if there's something to do here on the Python side

Merge remote-tracking branch 'origin/master' into shrektan/fread-nonascii-msg

# Conflicts:
#	inst/tests/tests.Rraw
@shrektan shrektan added this to the 1.13.5 milestone Nov 18, 2020
@@ -2,6 +2,7 @@
#define dt_FREAD_R_H
#define STRICT_R_HEADERS // https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Error-handling
#include <R.h>
#include <Rinternals.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand we need this for translateChar(mkCharCE(s, ienc)), right?
It may be good to ask R core for public API for this functionality.

Copy link
Member

@mattdowle mattdowle Jan 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe but I have the impression from WRE that Rinternals.h is already a public API, at least partially. For example, the string "#include <Rinternals.h>" is exhibited 13 times in WRE. We need it even for just SEXP (which is stated too in WRE). I did find one use of the word 'public' in the context of Rinternals.h:
image
Note 144 in full is "see The R API: note that these are not all part of the API."
I remember seeing some debate on R-devel mailing list on this over the years. Some R-core members believe that the R API is anything mentioned in WRE. But that would exclude many of the exported (i.e. non-hidden) functions that many packages use. I believe that #define USE_RINTERNALS before #include <Rinternals.h> (as data.table used to do in the past) does access truly internal non-API functions/structures. But including Rinternals.h is not necessarily accessing non-public API, in fact it's necessary to include it to access all the public API. Note that data.table.h includes Rinternals.h but that doesn't necessarily mean we are using non-public API.

With that said, translateChar and mkCharCE are both highlighted by name in WRE (section 5.15) and for that reason there would be no disagreement by any R-core member that those are already public API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that seems... needlessly confusing!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. All those that have thought so have engaged. Then what you discover is that R-core members disagree between themselves. Tangent issues (such as data.table using SET_TRUELENGTH) are quickly raised. I assume what you would like is for R.h to be public and Rinternals.h to be not-public? Underlying that seemingly simple task is i) change to R over time, and ii) in our case of data.table, achievements by using SET_TRUELENGTH that some core members wish was never accessible.
Careful what you wish for.

Copy link
Member

@jangorecki jangorecki Jan 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry for fuss. When I read Rinternals I had in mind USE_RINTERNALS switch. So definitely confusing.

This PR addresses the footer warning, not all messages. So making that clear in the news item. Maybe future PRs will address other messages which also include exhibits from the file.
to avoid any interpretation that the non-ASCII characters might cause the warning message
@mattdowle mattdowle merged commit 354cf84 into master Jan 4, 2021
@mattdowle mattdowle deleted the shrektan/fread-nonascii-msg branch January 4, 2021 20:47
@mattdowle mattdowle modified the milestones: 1.13.7, 1.14.0 Feb 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The msg from fread should mark txt as declared encoding
4 participants