fix #2211 #2240


Merged
merged 6 commits into from
Aug 21, 2017

Conversation

antgonza
Member

This "fix" basically removes all non-UTF-8 chars from the input file by going char by char and checking whether each is printable. Note that this will make loading info files a little slower, but it shouldn't be significant compared to everything else going on to add them. BTW, remember that this is a non-blocking process.
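The char-by-char cleanup described here can be sketched as follows; the function name and the `string.printable` check are assumptions for illustration, not the actual Qiita implementation. Note that a printability check like this also drops valid non-ASCII chars such as é, which is why the approach was revisited later in this thread.

```python
from string import printable

def remove_non_printable(text):
    """Drop every character that is not ASCII-printable."""
    return ''.join(c for c in text if c in printable)

# An embedded NUL byte is stripped; plain ASCII survives.
print(remove_non_printable('M\x00ico'))  # Mico
# Valid accented chars are stripped too, a side effect of this check.
print(remove_non_printable('México'))    # Mxico
```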

@antgonza
Member Author

Ha! This solution works so well (ignoring all non-UTF-8 chars) that the current failures are because the tests we had to check for (limited) non-UTF-8 chars are failing (because the errors are no longer raised). The quickest solution is to simply remove those tests. @josenavas, could you add this topic to today's discussion and add a resolution? I'll modify accordingly.

@ackermag

ackermag commented Aug 16, 2017 via email

@antgonza
Member Author

Note that é is a valid UTF-8 char but \x00 or \x41, AFAIK, are not, which is the source of the problem when Excel does its magic. In other words, if the template has México, it will be fine, but if it has M\x00ico, it will remove the \x00. The other option is raising an error, but these chars are also hard for users to find/remove.

@josenavas
Contributor

Discussed as a group:

  • Do not modify the user's metadata.
  • Show the exact cell (row/col) or (sample/column name) where the issues are.
  • Provide information on how to solve it. @tanaes found that uploading to Google Spreadsheets and downloading as TSV should work (that is the workaround people on SO point to).
  • Provide a checkbox with a warning so the user has to actively enable the removal of the non-UTF-8 characters. It should be clear to the user that the metadata will be modified, and hence they should double-check it later.

@antgonza
Member Author

Note that the error comes from the line where pandas tries to read the file, so we can check in the previous block (where the changes are right now). Thus, adding the col/row numbers should be possible; however, sample/column names will be pretty difficult.

@josenavas josenavas changed the base branch from master to dev August 16, 2017 17:16
@josenavas
Contributor

@antgonza that sounds good - we wanted at least one of them.

@antgonza
Member Author

Just for clarity, I decided to add spaces to the tests so we know that spaces are not the issue. Note that we had tests checking that we corrected some but not all non-UTF-8 chars, so now we raise an error for all of them. The error raised will look like: ValueError: There are non valid UTF-8 characters in the following positions (row, col): (0, 0), (0, 4)
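The position-collecting step could look roughly like this (a hypothetical helper, not the PR's actual diff): decode each raw line with `errors='replace'` so undecodable bytes become U+FFFD, then report where the replacement characters land. Unlike a per-byte check, this correctly accepts multi-byte UTF-8 sequences such as é.

```python
def find_invalid_positions(raw_lines):
    """Return (row, col) pairs where UTF-8 decoding failed.

    raw_lines is a list of bytes objects, one per file line; bad
    bytes become U+FFFD under errors='replace', and we record the
    positions of those replacement characters.
    """
    errors = []
    for row, line in enumerate(raw_lines):
        decoded = line.decode('utf-8', errors='replace')
        for col, char in enumerate(decoded):
            if char == '\ufffd':
                errors.append((row, col))
    return errors

print(find_invalid_positions([b'M\xffico']))      # [(0, 1)]
print(find_invalid_positions([b'M\xc3\xa9xico']))  # [] -- é is valid UTF-8
```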

@josenavas
Contributor

Is it possible to actually show the offending value? I think there is a way of encoding the string so it actually shows a ? sign where it is not UTF-8.

@antgonza
Member Author

Thought about it, but also thought it would get super messy. I think, in general, when this happens it's going to happen lots of times ... but if others think this is a nice addition, I can do it.

@antgonza
Member Author

OK, I think the best solution is to do something like:

There are non valid UTF-8 characters. The errors are shown as ♥: ♥collection_timestamp = (0, 13)

@ElDeveloper
Contributor

Is it possible to actually show the offending value? I think there is a way of encoding the string so it actually shows a ? sign where it is not UTF-8.

I think this might not be possible at all, since the problem is that we don't know the encoding, trying to encode it as UTF-8 might result in malformed text.

There are non valid UTF-8 characters. The errors are shown as ♥: ♥collection_timestamp = (0, 13)

I don't think I understand this error message.

@antgonza
Member Author

So I decided to show a ❤️ vs. a ? char, so it shows that there is a wrong char in the first position of collection_timestamp ... suggestions to make it clearer?

@ElDeveloper
Contributor

@antgonza haha, got it. I see what I wasn't able to understand. What about something more verbose:

There are invalid (non UTF-8) characters in your sample information file. The offending fields and their location (row, column) are listed below, invalid characters are represented using an X: "Xcollection_timestamp" = (0, 13)

And I don't really have strong preferences about the character. The heart is fun but confused me for a bit. Maybe a qiita footprint :P 🐾 ... though we may want something more neutral.

@antgonza
Member Author

ok, going with:
There are invalid (non UTF-8) characters in your information file. The offending fields and their location (row, column) are listed below, invalid characters are represented using 🐾: "🐾collection_timestamp" = (0, 13)
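For illustration, the agreed format could be produced by something like this hypothetical helper (not necessarily the PR's actual code); `errors` maps each offending field, with bad chars already replaced by the paw marker, to its list of positions:

```python
PAW = '\U0001F43E'  # the 🐾 marker agreed on above

def format_utf8_errors(errors):
    """Render the agreed error message from {field: [positions]}."""
    header = ('There are invalid (non UTF-8) characters in your '
              'information file. The offending fields and their '
              'location (row, column) are listed below, invalid '
              'characters are represented using %s: ' % PAW)
    body = '; '.join('"%s" = %s' % (field, ', '.join(pos))
                     for field, pos in sorted(errors.items()))
    return header + body

msg = format_utf8_errors({PAW + 'collection_timestamp': ['(0, 13)']})
print(msg)
```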

@codecov-io

codecov-io commented Aug 17, 2017

Codecov Report

Merging #2240 into dev will decrease coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##              dev    #2240      +/-   ##
==========================================
- Coverage   92.84%   92.81%   -0.04%     
==========================================
  Files         171      171              
  Lines       18877    18904      +27     
==========================================
+ Hits        17527    17545      +18     
- Misses       1350     1359       +9
Impacted Files Coverage Δ
qiita_db/metadata_template/test/test_util.py 99.48% <100%> (ø) ⬆️
qiita_db/metadata_template/util.py 90.97% <100%> (-0.07%) ⬇️
qiita_db/processing_job.py 91.75% <0%> (-0.82%) ⬇️
qiita_db/private.py 24.32% <0%> (-0.68%) ⬇️
qiita_db/test/test_artifact.py 99.84% <0%> (ø) ⬆️
qiita_db/test/test_processing_job.py 99.8% <0%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6b59f37...19ac949.

Contributor

@josenavas josenavas left a comment


Just one question, otherwise looks good.

if tblock not in errors:
    errors[tblock] = []
errors[tblock].append('(%d, %d)' % (row, col))
if bool(errors):
Contributor

Just out of curiosity: why this specific call to bool? AFAIK, this is not pythonic, and an empty list/dict evaluates to false.

Member Author

Contributor

I don't understand your link. If you check the actual example, it is not using the call to bool. In the first part of the post it uses it to demonstrate the behavior, but in the actual code the call to bool is not used.

Member Author

Right, I guess it shows that with or without the bool it works as expected. Want me to change it?
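For reference, the truthiness behavior the review relies on:

```python
# An empty dict is already falsy, so `if errors:` and
# `if bool(errors):` behave identically; the bool() call is redundant.
errors = {}
assert not errors             # empty dict -> False in a boolean context

errors['tblock'] = ['(0, 0)']
assert errors                 # non-empty dict -> True
assert bool(errors) is True   # bool() adds nothing here
```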

@adswafford
Contributor

I think this will make sense, but just to be clear before I merge: if in row 2, column 8 "geo_loc", the user has "México:El Niño", and in row 3 they have "México:La Niña", then the error message will be:

There are invalid (non UTF-8) characters in your information file. The offending fields and their location (row, column) are listed below, invalid characters are represented using 🐾:
"M🐾xico:El Ni🐾o" = (1, 7)
"M🐾xico:La Ni🐾a" = (2, 7)

Is that right?

@antgonza
Member Author

Yup, except that the cases will be separated by ; instead of new lines. Obviously, easy to change.

@josenavas
Contributor

@adswafford Can you confirm that you are ok with having a semi-colon (;) separated list or do you prefer new lines?

Thanks!

@adswafford
Contributor

On the one hand, for rare errors, I think new lines make it easier to find; but on the other hand, the error banner could get huge for a file with "México" repeated 1000 times, so semicolon is okay for now. How hard will it be to change if user feedback in the coming months suggests new lines are better? Just a ";" to "\n"?
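The separator swap discussed here really is a one-token change (the values below are illustrative, mirroring the example earlier in the thread):

```python
cases = ['"M\U0001F43Exico:El Ni\U0001F43Eo" = (1, 7)',
         '"M\U0001F43Exico:La Ni\U0001F43Ea" = (2, 7)']

print('; '.join(cases))  # current behavior: one semicolon-separated line
print('\n'.join(cases))  # possible future tweak: one case per line
```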

@josenavas
Contributor

Yeah, it will be super-easy to change. Thanks for your input!

@josenavas
Contributor

@ElDeveloper can you approve the PR if your comments have been addressed?

@ElDeveloper ElDeveloper merged commit acd27f5 into qiita-spots:dev Aug 21, 2017
@josenavas
Contributor

Thanks @ElDeveloper !!!

@ElDeveloper
Contributor

ElDeveloper commented Aug 21, 2017 via email
