fix #2211 #2240


Merged
merged 6 commits into from
Aug 21, 2017

Conversation

antgonza
Member

This "fix" basically removes all non-UTF-8 chars from the input file by going char by char and checking whether each is printable. Note that this will make loading info files a little slower, but it shouldn't be significant compared to everything else going on to add them. BTW, remember that this is a non-blocking process.
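The char-by-char cleanup described here can be sketched as follows; the function name and the `string.printable` check are assumptions for illustration, not the actual Qiita implementation. Note that a printability check like this also drops valid non-ASCII chars such as é, which is why the approach was revisited later in this thread.

```python
from string import printable

def remove_non_printable(text):
    """Drop every character that is not ASCII-printable."""
    return ''.join(c for c in text if c in printable)

# An embedded NUL byte is stripped; plain ASCII survives.
print(remove_non_printable('M\x00ico'))  # Mico
# Valid accented chars are stripped too, a side effect of this check.
print(remove_non_printable('México'))    # Mxico
```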

@antgonza
Member Author

Ha! This solution works so well (ignoring all non-UTF-8 chars) that the current failures are because the tests we had to check for (limited) non-UTF-8 chars are failing (because the errors are no longer raised). The quickest solution is to simply remove those tests. @josenavas, could you add this topic to today's discussion and add a resolution? I'll modify accordingly.

@ackermag

ackermag commented Aug 16, 2017 via email

@antgonza
Member Author

Note that é is a valid UTF-8 char but \x00 or \x41, AFAIK, are not, which is the source of the problem when Excel does its magic. In other words, if the template has México, it will be fine, but if it has M\x00ico, it will remove the \x00. The other option is raising an error, but these chars are also hard for users to find/remove.

@josenavas
Contributor

Discussed as a group:

  • Do not modify the user's metadata.
  • Show the exact cell (row/col) or (sample/column name) where the issues are.
  • Provide information on how to solve it. @tanaes found that uploading to Google Spreadsheets and downloading as TSV should work (that is the workaround people on SO point to).
  • Provide a checkbox with a warning so the user has to actively enable the removal of the non-UTF-8 characters. It should be clear to the user that the metadata will be modified, and hence they should double-check it later.

@antgonza
Member Author

Note that the error comes from the line where pandas tries to read the file, so we can check in the previous block (where the changes are right now). Thus, adding the col/row numbers should be possible; however, sample/column names will be pretty difficult.

@josenavas josenavas changed the base branch from master to dev August 16, 2017 17:16
@josenavas
Contributor

@antgonza that sounds good - we wanted at least one of them.

@antgonza
Member Author

Just for clarity, I decided to add spaces to the tests so we know that spaces are not the issue. Note that we had tests checking that we corrected some but not all non-UTF-8 chars, so now we raise an error for all of them. The error raised will look like: ValueError: There are non valid UTF-8 characters in the following positions (row, col): (0, 0), (0, 4)
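The position-collecting step could look roughly like this (a hypothetical helper, not the PR's actual diff): decode each raw line with `errors='replace'` so undecodable bytes become U+FFFD, then report where the replacement characters land. Unlike a per-byte check, this correctly accepts multi-byte UTF-8 sequences such as é.

```python
def find_invalid_positions(raw_lines):
    """Return (row, col) pairs where UTF-8 decoding failed.

    raw_lines is a list of bytes objects, one per file line; bad
    bytes become U+FFFD under errors='replace', and we record the
    positions of those replacement characters.
    """
    errors = []
    for row, line in enumerate(raw_lines):
        decoded = line.decode('utf-8', errors='replace')
        for col, char in enumerate(decoded):
            if char == '\ufffd':
                errors.append((row, col))
    return errors

print(find_invalid_positions([b'M\xffico']))      # [(0, 1)]
print(find_invalid_positions([b'M\xc3\xa9xico']))  # [] -- é is valid UTF-8
```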

@josenavas
Contributor

Is it possible to actually show the offending value? I think there is a way of encoding the string so it actually shows a ? sign where it is not UTF-8.

@antgonza
Member Author

Thought about it, but also thought it would get super messy. I think, in general, when this happens it's going to happen lots of times ... but if others think this is a nice addition, I can do it.

@antgonza
Member Author

OK, I think the best solution is to do something like:

There are non valid UTF-8 characters. The errors are shown as ♥: ♥collection_timestamp = (0, 13)

@ElDeveloper
Contributor

Is it possible to actually show the offending value? I think there is a way of encoding the string so it actually shows a ? sign where it is not UTF-8.

I think this might not be possible at all, since the problem is that we don't know the encoding, trying to encode it as UTF-8 might result in malformed text.

There are non valid UTF-8 characters. The errors are shown as ♥: ♥collection_timestamp = (0, 13)

I don't think I understand this error message.

@antgonza
Member Author

So I decided to show a ❤️ vs. a ? char, so it shows that there is a wrong char in the first position of collection_timestamp ... suggestions to make it clearer?

@ElDeveloper
Contributor

@antgonza haha, got it. I see what I wasn't able to understand. What about something more verbose:

There are invalid (non UTF-8) characters in your sample information file. The offending fields and their location (row, column) are listed below, invalid characters are represented using an X: "Xcollection_timestamp" = (0, 13)

And I don't really have strong preferences about the character. The heart is fun but confused me for a bit. Maybe a qiita footprint :P 🐾 ... though we may want something more neutral.

@antgonza
Member Author

ok, going with:
There are invalid (non UTF-8) characters in your information file. The offending fields and their location (row, column) are listed below, invalid characters are represented using 🐾: "🐾collection_timestamp" = (0, 13)
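For illustration, the agreed format could be produced by something like this hypothetical helper (not necessarily the PR's actual code); `errors` maps each offending field, with bad chars already replaced by the paw marker, to its list of positions:

```python
PAW = '\U0001F43E'  # the 🐾 marker agreed on above

def format_utf8_errors(errors):
    """Render the agreed error message from {field: [positions]}."""
    header = ('There are invalid (non UTF-8) characters in your '
              'information file. The offending fields and their '
              'location (row, column) are listed below, invalid '
              'characters are represented using %s: ' % PAW)
    body = '; '.join('"%s" = %s' % (field, ', '.join(pos))
                     for field, pos in sorted(errors.items()))
    return header + body

msg = format_utf8_errors({PAW + 'collection_timestamp': ['(0, 13)']})
print(msg)
```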

@codecov-io

codecov-io commented Aug 17, 2017

Codecov Report

Merging #2240 into dev will decrease coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##              dev    #2240      +/-   ##
==========================================
- Coverage   92.84%   92.81%   -0.04%     
==========================================
  Files         171      171              
  Lines       18877    18904      +27     
==========================================
+ Hits        17527    17545      +18     
- Misses       1350     1359       +9
Impacted Files Coverage Δ
qiita_db/metadata_template/test/test_util.py 99.48% <100%> (ø) ⬆️
qiita_db/metadata_template/util.py 90.97% <100%> (-0.07%) ⬇️
qiita_db/processing_job.py 91.75% <0%> (-0.82%) ⬇️
qiita_db/private.py 24.32% <0%> (-0.68%) ⬇️
qiita_db/test/test_artifact.py 99.84% <0%> (ø) ⬆️
qiita_db/test/test_processing_job.py 99.8% <0%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6b59f37...19ac949.

Contributor

@josenavas josenavas left a comment


Just one question, otherwise looks good.

if tblock not in errors:
    errors[tblock] = []
errors[tblock].append('(%d, %d)' % (row, col))
if bool(errors):
Contributor

Just out of curiosity: why this specific call to bool? AFAIK, this is not pythonic, and an empty list/dict evaluates to false.

Member Author

Contributor

I don't understand your link. If you check the actual example, it is not using the call to bool. In the first part of the post it uses it to demonstrate the behavior, but in the actual code the call to bool is not used.

Member Author

Right, I guess it shows that with or without the bool it works as expected. Want me to change it?
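For reference, the truthiness behavior the review relies on:

```python
# An empty dict is already falsy, so `if errors:` and
# `if bool(errors):` behave identically; the bool() call is redundant.
errors = {}
assert not errors             # empty dict -> False in a boolean context

errors['tblock'] = ['(0, 0)']
assert errors                 # non-empty dict -> True
assert bool(errors) is True   # bool() adds nothing here
```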

@adswafford
Contributor

I think this will make sense, but just to be clear before I merge: if in row 2, column 8 "geo_loc", the user has "México:El Niño", and in row 3 they have "México:La Niña", then the error message will be:

There are invalid (non UTF-8) characters in your information file. The offending fields and their location (row, column) are listed below, invalid characters are represented using 🐾:
"M🐾xico:El Ni🐾o" = (1, 7)
"M🐾xico:La Ni🐾a" = (2, 7)

Is that right?

@antgonza
Member Author

Yup, except that the cases will be separated by ; instead of new lines. Obviously, easy to change.

@josenavas
Contributor

@adswafford Can you confirm that you are ok with having a semi-colon (;) separated list or do you prefer new lines?

Thanks!

@adswafford
Contributor

On the one hand, for rare errors, I think new lines make it easier to find; but on the other hand, the error banner could get huge for a file with "México" repeated 1000 times, so semicolon is okay for now. How hard will it be to change if user feedback in the coming months suggests new lines are better? Just a ";" to "\n"?
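The separator swap discussed here really is a one-token change (the values below are illustrative, mirroring the example earlier in the thread):

```python
cases = ['"M\U0001F43Exico:El Ni\U0001F43Eo" = (1, 7)',
         '"M\U0001F43Exico:La Ni\U0001F43Ea" = (2, 7)']

print('; '.join(cases))  # current behavior: one semicolon-separated line
print('\n'.join(cases))  # possible future tweak: one case per line
```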

@josenavas
Contributor

Yeah, it will be super-easy to change. Thanks for your input!

@josenavas
Contributor

@ElDeveloper can you approve the PR if your comments have been addressed?

@ElDeveloper ElDeveloper merged commit acd27f5 into qiita-spots:dev Aug 21, 2017
@josenavas
Contributor

Thanks @ElDeveloper !!!

@ElDeveloper
Contributor

ElDeveloper commented Aug 21, 2017 via email
