Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fread() throws correct non-ASCII messages #4751

Merged
merged 11 commits into from
Jan 4, 2021
13 changes: 13 additions & 0 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -17150,3 +17150,16 @@ test(2154.1, fread("0.0\n", colClasses="integer"), data.table(V1=0.0),
test(2154.2, fread("A\n0.0\n", colClasses="integer"), data.table(A=0.0),
warning="Attempt to override column 1 <<A>> of inherent type 'float64' down to 'int32' ignored.*please")

# fread message display non-ASCII messages correctly, #4747
x = "fa\xE7ile"; Encoding(x) = "latin1"
# should only run this test if the native encoding can represent latin1 correctly
if (identical(x, enc2native(x))) {
txt = sprintf("A,B\n%s,%s\n%s", x, x, x)
Encoding(txt) = "UTF-8"
txt2 = iconv(txt, "UTF-8", "latin1")
out = data.table(A = x, B = x)
test(2155.1, fread(text = txt, encoding = 'UTF-8'), out,
warning="Discarded single-line footer: <<fa\u00e7ile>>")
test(2155.2, fread(text = txt2, encoding = 'Latin-1'), out,
warning="Discarded single-line footer: <<fa\u00e7ile>>")
}
2 changes: 1 addition & 1 deletion src/fread.c
Original file line number Diff line number Diff line change
Expand Up @@ -2575,7 +2575,7 @@ int freadMain(freadMainArgs _args) {
if (ch==eof) {
// whitespace at the end of the file is always skipped ok
} else {
const char *skippedFooter = ch;
const char *skippedFooter = ENC2NATIVE(ch);
// detect if it's a single line footer. Commonly the row count from SQL queries.
while (ch<eof && *ch!='\n' && *ch!='\r') ch++;
while (ch<eof && isspace(*ch)) ch++;
Expand Down
5 changes: 5 additions & 0 deletions src/fread.h
Original file line number Diff line number Diff line change
Expand Up @@ -7,8 +7,13 @@
#include "myomp.h"
#ifdef DTPY
#include "py_fread.h"
#define ENC2NATIVE(s) (s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @st-pasha , not sure if there's something to do here on the Python side

#else
#include "freadR.h"
extern cetype_t ienc;
// R's message functions only take C's char pointer not SEXP, where encoding info can't be stored
// so must convert the error message char to native encoding first in order to correctly display in R
#define ENC2NATIVE(s) translateChar(mkCharCE(s, ienc))
#endif

// Ordered hierarchy of types
Expand Down
2 changes: 1 addition & 1 deletion src/freadR.c
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ static SEXP selectSxp;
static SEXP dropSxp;
static SEXP colClassesSxp;
static bool selectColClasses = false;
static cetype_t ienc = CE_NATIVE;
cetype_t ienc = CE_NATIVE;
static SEXP RCHK;
static SEXP DT;
static SEXP colNamesSxp;
Expand Down
1 change: 1 addition & 0 deletions src/freadR.h
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
#define dt_FREAD_R_H
#define STRICT_R_HEADERS // https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Error-handling
#include <R.h>
#include <Rinternals.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand we need this for translateChar(mkCharCE(s, ienc)), right?
It may be good to ask R core for public API for this functionality.

Copy link
Member

@mattdowle mattdowle Jan 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe but I have the impression from WRE that Rinternals.h is already a public API, at least partially. For example, the string "#include <Rinternals.h>" is exhibited 13 times in WRE. We need it even for just SEXP (which is stated too in WRE). I did find one use of the word 'public' in the context of Rinternals.h:
image
Note 144 in full is "see The R API: note that these are not all part of the API."
I remember seeing some debate on R-devel mailing list on this over the years. Some R-core members believe that the R API is anything mentioned in WRE. But that would exclude many of the exported (i.e. non-hidden) functions that many packages use. I believe that #define USE_RINTERNALS before #include <Rinternals.h> (as data.table used to do in the past) does access truly internal non-API functions/structures. But including Rinternals.h is not necessarily accessing non-public API, in fact it's necessary to include it to access all the public API. Note that data.table.h includes Rinternals.h but that doesn't necessarily mean we are using non-public API.

With that said, translateChar and mkCharCE are both highlighted by name in WRE (section 5.15) and for that reason there would be no disagreement by any R-core member that those are already public API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that seems... needlessly confusing!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. All those that have thought so have engaged. Then what you discover is that R-core members disagree between themselves. Tangent issues (such as data.table using SET_TRUELENGTH) are quickly raised. I assume what you would like is for R.h to be public and Rinternals.h to be not-public? Underlying that seemingly simple task is i) change to R over time, and ii) in our case of data.table, achievements by using SET_TRUELENGTH that some core members wish was never accessible.
Careful what you wish for.

Copy link
Member

@jangorecki jangorecki Jan 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry for fuss. When I read Rinternals I had in mind USE_RINTERNALS switch. So definitely confusing.

#include "po.h"

#define FREAD_MAIN_ARGS_EXTRA_FIELDS \
Expand Down