Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Show error if selected CSV file contains null character #982

Merged
merged 2 commits into from
Apr 27, 2024

Conversation

matthew-white
Copy link
Member

In getodk/central#589, we briefly discussed what would happen if the user uses the entity upload modal to select a binary file. This PR tries to detect that case and show a better error message.

The "choose one" button to select a file limits the selection to .csv and .tsv files. However, a user could rename an Excel file to .csv to try to get it to pass. Also, the drop zone doesn't have a similar restriction, so a user could drag-and-drop an Excel file. This is what it looks like if an Excel file is dropped in:

You can see that Papa Parse is so confused that it concludes that the file is tab-delimited. 😅 But you can see that Papa Parse isn't throwing an error here: it's returning binary data parsed as text. Here's an example of how Papa Parse is perfectly happy to accept a null character:

> Papa.parse('f\0o,bar')
{
  data: [ [ 'f\x00o', 'bar' ] ],
  errors: [],
  meta: {
    ...
  }
}

I also tried what happened if the header was valid, but the data after it contained a null character. It looked like the null character was inserted into the DOM, but wasn't displayed. The null character was sent to Backend: it looks like JSON is perfectly capable of encoding a null character. However, Backend returned a 500, and I saw a Postgres error in the Backend logs. The Postgres error was different depending on whether the null character was in the label or in a data property.

  • For the label, the error was:
    • error: invalid byte sequence for encoding "UTF8": 0x00
  • For a data property, the error was:
    • error: unsupported Unicode escape sequence

I found this article helpful in understanding how Postgres handles null characters: https://www.commandprompt.com/blog/null-characters-workarounds-arent-good-enough/. It sounds like the "unsupported Unicode escape sequence" error is associated with JSON data types.

All that said, I think it'd be nice to avoid showing binary data in Frontend as in the screenshot above. It'd be nice to show a better error instead. It doesn't seem completely trivial to determine whether a file is binary. However, I figure that most binary files will contain a null character. In this PR, I check for a null character when parsing the header, then check for a null character when parsing the rest of the data. Some corners of the Internet say that a null character is valid in a text file, but I can't think of a real use case.

If the header contains a null character, I show a simple error alert. I'm not rendering EntityUploadHeaderErrors because I don't want it to show the binary data from the header. Here's an example of selecting an Excel file:

Note that it's possible for a binary file not to contain a null character in its first line. That was true of a .pdf file I tried. In that case, EntityUploadHeaderErrors will be rendered (assuming the first line doesn't match the expected header).

If the file matches the expected header, but there's a null character in the data, then we will show the error in the "Data error" panel. It would've been more consistent to show the error at the top of the modal in an alert, but that's not easy to do with the way things are correctly plumbed. I also think this case is very unlikely.

What has been done to verify that this works as intended?

I wrote tests and tried it locally.

Why is this the best possible solution? Were any other approaches considered?

One thing I'm wondering is whether we should check for control characters other than a null character. It sounds like Postgres is uniquely displeased by a null character, which is maybe a reason to check for that specifically. All the binary files I tried contained a null character, and my understanding is that that's typical.

Instead of checking for a null character, we could take a totally different approach and check the MIME type of the file. However, apparently browsers determine the MIME type just by looking at the file extension (MDN). It wouldn't help in the case where a user renames an Excel file to .csv.

In a similar vein, we could use an npm package like file-type that actually reads a portion of the file to determine whether the file is a common binary format. However, I'd prefer to avoid an additional package if we don't really need it.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

The change to src/ is pretty small. I implemented the check as just an additional validation step. It shouldn't break existing code.

Before submitting this PR, please make sure you have:

  • run npm run test and npm run lint and confirmed all checks still pass OR confirm CircleCI build passes
  • verified that any code or assets from external sources are properly credited in comments or that everything is internally sourced

Copy link
Member

@lognaturel lognaturel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing out all your thinking! I would not have been able to evaluate this otherwise. Given your reasoning, this looks like a good approach. If this approach mistakenly flags csv files that should be ok, I think it will be an easy/natural thing for users to report.

@matthew-white matthew-white merged commit 29cebcc into master Apr 27, 2024
1 check passed
@matthew-white matthew-white deleted the csv-null-char branch April 27, 2024 21:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants