fix #2211 #2240
Ha! This solution works so well (ignoring all non UTF8 chars) that the tests we had to check for (limited) non UTF8 chars are now failing, because the errors are no longer being raised. The quickest solution is to simply remove those tests. @josenavas, could you add this topic to today's discussion and add a resolution? I'll modify accordingly. |
I don't think we should ignore all non UTF8 characters, just raise an error
- a lot of the problems are because of 'translation' of e.g. accents, by
Excel when converting file types that produce the wrong character - knowing
that the character is wrong on upload is good because we don't want
"(square root)@" instead of "è". Not sure we can fix this programmatically
because the Excel problem is not transparent - in Excel it will look like
"è".
--
Gail Ackermann
Knight Lab
UCSD
glackermann@ucsd.edu <ackermag@ucsd.edu>
|
Note that é is a valid UTF8 char but |
Discussed as a group:
|
Note that the error comes from the line where pandas tries to read the file so we can check in the previous block (where the changes are right now). Thus, adding the col/row numbers should be possible; however, sample/column names will be pretty difficult |
@antgonza that sounds good - we wanted at least one of them. |
Just for clarity, I decided to add spaces to the tests so we know that spaces are not the issue. Note that we had a test checking that we corrected some, but not all, UTF8 chars, so we are now raising an error for all of them. Now the error raised will look like: |
Is it possible to actually show the offending value? I think there is a way of encoding the string so it actually shows a |
Thought about it but also thought it would get super messy. I think, in general, when this happens it's going to happen lots of times ... but if others think this is a nice addition, I can do it. |
OK, I think the best solution is to do something like: There are non valid UTF-8 characters. The errors are shown as ♥: ♥collection_timestamp = (0, 13) |
I think this might not be possible at all: the problem is that we don't know the encoding, so trying to encode it as UTF-8 might result in malformed text.
I don't think I understand this error message. |
So I decided to show a ❤️ vs. a ? char so it shows that there is a char that is wrong in the first position of collection_timestamp ... suggestions to make it clearer? |
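The approach described in this thread (scan the file cell by cell, show each undecodable byte as a visible marker, and report the row/column) can be sketched roughly as follows. This is an illustration only, not the code in the PR: the function name `find_invalid_cells`, the tab-separated bytes input, and the default `♥` marker are assumptions made for the example.

```python
from collections import defaultdict

MARKER = '\u2665'  # assumed marker (a heart); any visible character works


def find_invalid_cells(raw, marker=MARKER):
    """Map each cell that is not valid UTF-8 to its '(row, col)' locations.

    `raw` is the tab-separated file content as bytes (an assumption of this
    sketch). Undecodable bytes are shown as `marker` in the reported value.
    """
    errors = defaultdict(list)
    for row, line in enumerate(raw.splitlines()):
        for col, cell in enumerate(line.split(b'\t')):
            try:
                cell.decode('utf-8')
            except UnicodeDecodeError:
                # Decode leniently, then swap U+FFFD for the visible marker
                shown = cell.decode('utf-8', errors='replace')
                shown = shown.replace('\ufffd', marker)
                errors[shown].append('(%d, %d)' % (row, col))
    return dict(errors)
```

Because the real encoding of the file is unknown, the sketch does not try to recover the intended character; it only marks where decoding failed.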
@antgonza haha, got it. I see what I wasn't able to understand. What about something more verbose:
And I don't really have strong preferences about the character. The heart is fun but confused me for a bit. Maybe a qiita footprint :P 🐾 ... though we may want something more neutral. |
ok, going with: |
Codecov Report
@@            Coverage Diff             @@
##              dev    #2240      +/-   ##
==========================================
- Coverage   92.84%   92.81%   -0.04%
==========================================
  Files         171      171
  Lines       18877    18904      +27
==========================================
+ Hits        17527    17545      +18
- Misses       1350     1359       +9
|
Just one question, otherwise looks good
if tblock not in errors:
    errors[tblock] = []
errors[tblock].append('(%d, %d)' % (row, col))
if bool(errors):
Just out of curiosity, why this specific call to bool? AFAIK, this is not pythonic, and an empty list/dict evaluates to false.
Cause of: https://stackoverflow.com/a/23177452
I don't understand your link? If you check the actual example it is not using the call to bool. In the first part of the post it uses it to demonstrate the behavior, but in the actual code the call to bool is not used.
Right, I guess it shows that with or without the bool it works as expected, want me to change it?
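For reference, a small sketch of the point made in this review thread: an empty dict is falsy, so `if errors:` and `if bool(errors):` take the same branch. The helper names and sample values below are hypothetical and only serve the illustration.

```python
def add_error(errors, tblock, row, col):
    """Record a '(row, col)' location under the offending block."""
    # setdefault replaces the explicit "if tblock not in errors" check
    errors.setdefault(tblock, []).append('(%d, %d)' % (row, col))


def has_errors(errors):
    """Return True when at least one error was recorded."""
    # Plain truthiness; gives the same answer as bool(errors)
    if errors:
        return True
    return False


errors = {}
empty_result = has_errors(errors)      # nothing recorded yet
add_error(errors, 'geo_loc', 2, 8)     # hypothetical block and position
filled_result = has_errors(errors)
```

The explicit `bool()` call is harmless; it is just redundant inside an `if` condition.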
I think this will make sense, but just to be clear before I merge, if in row 2, column 8 "geo_loc", the user has "México:El Niño", and in row 3 they have "México:La Niña:" then the error message will be: "There are invalid (non UTF-8) characters in your information file. The offending fields and their location (row, column) are listed below, invalid characters are represented using 🐾": Is that right? |
Yup, except that the cases will be separated by ; vs new lines. Obviously, easy to change. |
@adswafford Can you confirm that you are ok with having a semi-colon (;) separated list or do you prefer new lines? Thanks! |
On the one hand, for rare errors, I think new lines make it easier to find them, but on the other the error banner could get huge for a file with "México" repeated 1000 times, so semicolon is okay for now. How hard will it be to change it if user feedback suggests new lines are better in the coming months? Just a "," to "\n"? |
Yeah, it will be super-easy to change. Thanks for your input! |
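To illustrate why switching separators is cheap, here is a rough sketch of how such a message could be assembled. The function name `format_error_message` and the `errors` mapping (offending value to a list of location strings) are assumptions for this example; the header text is the one quoted above, and the `value = (row, col)` layout follows the example earlier in the thread.

```python
PAW = '\U0001F43E'  # the paw-prints character used to mark invalid bytes


def format_error_message(errors, sep='; '):
    """Join 'value = locations' cases with `sep` after a fixed header."""
    header = ('There are invalid (non UTF-8) characters in your information '
              'file. The offending fields and their location (row, column) '
              'are listed below, invalid characters are represented using '
              '%s: ' % PAW)
    cases = ['%s = %s' % (value, ', '.join(locations))
             for value, locations in sorted(errors.items())]
    return header + sep.join(cases)
```

With this shape, moving from a semicolon-separated list to one case per line is a matter of passing `sep='\n'`.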
@ElDeveloper can you approve the PR if your comments have been addressed? |
Thanks @ElDeveloper !!! |
Thanks @antgonza for addressing the comments, and you for reminding me
:P!
|
This "fix" basically removes all non UTF8 chars from the input file by going char by char and checking whether each one is printable. Note that this will make loading info files a little slower, but that shouldn't be significant compared to everything else that happens when adding them. BTW, remember that this is a non-blocking process.
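A minimal sketch of the kind of char-by-char filtering described here, assuming "printable" means membership in Python's `string.printable` (the PR text does not say which check is used, and `strip_non_printable` is a hypothetical name):

```python
import string

# ASCII letters, digits, punctuation and whitespace
PRINTABLE = set(string.printable)


def strip_non_printable(text):
    """Drop every character that is not in string.printable."""
    return ''.join(ch for ch in text if ch in PRINTABLE)
```

Under that assumption the filter also drops valid non-ASCII characters such as "é", which is one reason a silent removal can hide real data problems.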