Show error if selected CSV file contains null character #982
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
In getodk/central#589, we briefly discussed what would happen if the user uses the entity upload modal to select a binary file. This PR tries to detect that case and show a better error message.
The "choose one" button to select a file limits the selection to .csv and .tsv files. However, a user could rename an Excel file to .csv to try to get it to pass. Also, the drop zone doesn't have a similar restriction, so a user could drag-and-drop an Excel file. This is what it looks like if an Excel file is dropped in:
You can see that Papa Parse is so confused that it concludes that the file is tab-delimited. 😅 But you can see that Papa Parse isn't throwing an error here: it's returning binary data parsed as text. Here's an example of how Papa Parse is perfectly happy to accept a null character:
I also tried what happened if the header was valid, but the data after it contained a null character. It looked like the null character was inserted into the DOM, but wasn't displayed. The null character was sent to Backend: it looks like JSON is perfectly capable of encoding a null character. However, Backend returned a 500, and I saw a Postgres error in the Backend logs. The Postgres error was different depending on whether the null character was in the label or in a data property.
I found this article helpful in understanding how Postgres handles null characters: https://www.commandprompt.com/blog/null-characters-workarounds-arent-good-enough/. It sounds like the "unsupported Unicode escape sequence" error is associated with JSON data types.
All that said, I think it'd be nice to avoid showing binary data in Frontend as in the screenshot above. It'd be nice to show a better error instead. It doesn't seem completely trivial to determine whether a file is binary. However, I figure that most binary files will contain a null character. In this PR, I check for a null character when parsing the header, then check for a null character when parsing the rest of the data. Some corners of the Internet say that a null character is valid in a text file, but I can't think of a real use case.
If the header contains a null character, I show a simple error alert. I'm not rendering
EntityUploadHeaderErrors
because I don't want it to show the binary data from the header. Here's an example of selecting an Excel file:Note that it's possible for a binary file not to contain a null character in its first line. That was true of a .pdf file I tried. In that case,
EntityUploadHeaderErrors
will be rendered (assuming the first line doesn't match the expected header).If the file matches the expected header, but there's a null character in the data, then we will show the error in the "Data error" panel. It would've been more consistent to show the error at the top of the modal in an alert, but that's not easy to do with the way things are correctly plumbed. I also think this case is very unlikely.
What has been done to verify that this works as intended?
I wrote tests and tried it locally.
Why is this the best possible solution? Were any other approaches considered?
One thing I'm wondering is whether we should check for control characters other than a null character. It sounds like Postgres is uniquely displeased by a null character, which is maybe a reason to check for that specifically. All the binary files I tried contained a null character, and my understanding is that that's typical.
Instead of checking for a null character, we could take a totally different approach and check the MIME type of the file. However, apparently browsers determine the MIME type just by looking at the file extension (MDN). It wouldn't help in the case where a user renames an Excel file to .csv.
In a similar vein, we could use an npm package like
file-type
that actually reads a portion of the file to determine whether the file is a common binary format. However, I'd prefer to avoid an additional package if we don't really need it.How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?
The change to src/ is pretty small. I implemented the check as just an additional validation step. It shouldn't break existing code.
Before submitting this PR, please make sure you have:
npm run test
andnpm run lint
and confirmed all checks still pass OR confirm CircleCI build passes