[R] open_dataset() behavior with incorrectly quoted input data #37908
Comments
Hi @angela-li, thanks for reporting this! Yeah, that looks pretty gnarly in terms of the paths you had to go down to find that information; we should definitely improve our docs around this. Would you mind telling me a bit more about where you looked and how you figured it out? In terms of question 1, I'm not sure there is a better strategy here. In terms of question 2, it sounds like a really helpful feature request, though it'd be a pretty substantial change, most likely in the C++ code, and I know that a lot of those devs are focussed on other areas of the codebase. Arrow's CSV reader is optimized for very fast parsing of valid CSVs (unlike parsers such as readr and data.table, which offer more flexible options for handling invalid data, occasionally at the expense of speed), so this might end up being a problem that is better solved by combining multiple libraries.
…$create() docs (#37909) ### Rationale for this change Add more function documentation for folks who want to manually change `quote_char` or `escape_char` options in `CsvParseOptions$create()`, as I did in #37908. I had to go through the source code to arrow/r/R/csv.R - [line 587](https://github.com/apache/arrow/blob/7dc9f69a8a77345d0ec7920af9224ef96d7f5f78/r/R/csv.R#L587) to find the default argument, which was a pain. ### What changes are included in this PR? Documentation changes ### Are these changes tested? No - not changing underlying code. Maintainers might need to rebuild package to include these changes in man/ folder ### Are there any user-facing changes? Yes, on documentation of methods. Lead-authored-by: Angela Li <angela-li@users.noreply.github.com> Co-authored-by: Nic Crane <thisisnic@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com>
Sure! Here's what I did to try to debug:
Phew!
It makes sense that Arrow is optimized for fast parsing of valid CSVs. That's what I started to suspect after seeing other examples of what Arrow is used for (oftentimes machine-output data, not messy human-collected data). I'll think about what to do in this case.
Thanks, that's really helpful! I'm wondering if a nice solution here might be to create wrapper functions, e.g. The error message:
I think the problem here is that this error could be caused by multiple problems, so it's not definitive. We technically could search for the quote character in the problematic line and, if there is an uneven number, mention that, but I'm unsure whether that's a bit of a fringe case. Perhaps it's not, though, given someone else had the same issue!
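To make the "uneven number of quote characters" idea concrete, here is a minimal base R sketch. It is not arrow functionality: `find_unbalanced_quotes()` is a hypothetical helper name, and the small pipe-delimited file is made up for illustration. It also assumes no quoted field spans multiple lines, since such a field would be flagged as a false positive.

```r
# Hypothetical helper (not part of arrow): report the line numbers whose
# count of quote characters is odd, which suggests an unclosed quote.
find_unbalanced_quotes <- function(path, quote_char = "\"") {
  lines <- readLines(path, warn = FALSE)
  # Count occurrences of the quote character on each line
  matches <- regmatches(lines, gregexpr(quote_char, lines, fixed = TRUE))
  n_quotes <- lengths(matches)
  which(n_quotes %% 2 == 1)
}

# Made-up pipe-delimited example; line 3 contains a stray quote
tmp <- tempfile(fileext = ".txt")
writeLines(
  c("id|name|note", "1|\"Ann\"|ok", "2|5\" pipe|oops", "3|Bob|ok"),
  tmp
)
suspect_lines <- find_unbalanced_quotes(tmp)
print(suspect_lines)
```

A check like this only hints at the cause: an odd count is consistent with improper quoting but does not prove it, which matches the concern that the parse error is not definitive.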
Hi @angela-li, when I've run into situations like yours in the past, I've resorted to adding a cleanup step in between the raw data and the less flexible system (in this case, arrow) in order to get the raw data in a form that can be read without issues. I can imagine this might not be practical for your use case. This comment got me thinking,
One other thing you might try that arrow can do right now is to make use of arrow's UnionDataset functionality. As described above, you essentially need to parse some files with one set of rules and other files with another.
my_ds <- open_dataset(
  list(
    open_dataset("good_file.txt", format = "text"),
    open_dataset("bad_file.txt", format = "text", parse_options = CsvParseOptions$create(...))
  )
) # <- this is a UnionDataset
From here you can work with
This problem also reminds me of lubridate and its
flexible_open_dataset.R
library(arrow)
# First create a set of CsvParseOptions to try. Order matters.
default_parse_options <- CsvParseOptions$create(delimiter = "|")
quirk_parse_options <- CsvParseOptions$create(delimiter = "|", quote_char = '')
# Keep the options in a list so they can be tried one at a time
my_parse_options <- list(default_parse_options, quirk_parse_options)
# Then we define two helper functions that attempt to call open_dataset until one succeeds
flexible_open_dataset_single <- function(file, parse_options) {
  ds <- NULL
  for (parse_option in parse_options) {
    ds <- tryCatch(
      {
        open_dataset(file, format = "text", parse_options = parse_option)
      },
      error = function(e) {
        warning(
          "Failed to parse ", file,
          " with provided ParseOption. Trying any remaining options..."
        )
        NULL
      }
    )
    # Stop at the first set of options that works
    if (!is.null(ds)) {
      break
    }
  }
  ds
}
flexible_open_dataset <- function(files, parse_options) {
  # Open each file on its own, then combine the results into one dataset
  open_dataset(lapply(files, function(f) {
    flexible_open_dataset_single(f, parse_options)
  }))
}
# Then, finally, we use our new helper and this should print a warning but otherwise work
my_ds <- flexible_open_dataset(c("test_data.txt", "test_data_good.txt"), my_parse_options)
If we wanted to provide something like this in arrow, one way would be to allow
Describe the usage question you have. Please include as many useful details as possible.
I've started using {arrow} to read in data in R, and I noticed that it handles messy (aka human-collected!) tabular data slightly worse than {data.table}'s fread function. (But I want to work with arrow rather than data.table, as the partitioning element is going to be useful for me in the future!)
One issue I came across was how open_dataset handles incorrectly quoted data, or data where the default quote_char of " is included accidentally in a column.
Here's how the behavior is different between data.table and arrow. (Here's the test_data.txt file for the below code. It's a .txt file because the original humongous data file is delivered as a .txt.)
The data.table() documentation describes how they handle this data situation reasonably well, NEWS file here.
For now, I think I can change parse_options in the open_dataset() function to handle this, but it was quite fiddly, and it was hard to track down in the docs how to do it. Changing this option is also not good for the rest of the data, where I do want the quote_char to be ".
I don't know if improper quoting happens elsewhere in the data, so ideally there would be some way to detect and fix this type of improper quoting systematically (as opposed to skipping rows manually, or changing the quote_char to blank, which could cause issues for other columns).
Two qs:
open_dataset()?
Thanks for your help!
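One rough way to "detect and fix" improper quoting before the file reaches arrow is a cleanup pass in base R. The sketch below is only an illustration under stated assumptions: `clean_unbalanced_quotes()` is a hypothetical helper (not part of arrow or data.table), the sample data is made up, and it assumes no quoted field spans multiple lines. It is also lossy, because it drops the quote characters on any line where they are unbalanced.

```r
# Hypothetical helper (not part of arrow or data.table): write a cleaned copy
# of a delimited text file, dropping quote characters only on lines where
# their count is odd. Lines with balanced quotes are left untouched.
clean_unbalanced_quotes <- function(path,
                                    out = tempfile(fileext = ".txt"),
                                    quote_char = "\"") {
  lines <- readLines(path, warn = FALSE)
  matches <- regmatches(lines, gregexpr(quote_char, lines, fixed = TRUE))
  bad <- lengths(matches) %% 2 == 1
  # Lossy fix: remove the quote characters from the suspect lines
  lines[bad] <- gsub(quote_char, "", lines[bad], fixed = TRUE)
  writeLines(lines, out)
  # Return the cleaned path plus the line numbers that were changed
  list(path = out, fixed_lines = which(bad))
}

# Made-up pipe-delimited example; line 3 contains a stray quote
raw <- tempfile(fileext = ".txt")
writeLines(c("id|name|note", "1|\"Ann\"|ok", "2|5\" pipe|oops"), raw)
cleaned <- clean_unbalanced_quotes(raw)
print(readLines(cleaned$path))
print(cleaned$fixed_lines)
```

The cleaned file could then be passed to open_dataset() with the default quote_char left in place, so properly quoted columns keep their quoting. Keeping `fixed_lines` around makes it possible to audit which rows were altered.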
Component(s)
R