Check for encoding on resources from KN to allow for more dynamic snippets #65

Open
sarahrichmond opened this issue Oct 23, 2018 · 2 comments

sarahrichmond commented Oct 23, 2018

Knowing the encoding will enable us to write more dynamic snippets that visualise the data better.

jyucsiro commented Oct 24, 2018

Found this Python library, which can detect the CSV encoding:
https://github.com/chardet/chardet

The CSV code in a Jupyter environment then gets a bit messy, though, trying to figure out which encoding to use... there could be 30+ types.
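One way to keep that guesswork contained is sketched below: ask chardet if it happens to be installed, otherwise try a short list of likely encodings in order. `chardet.detect` is the library's documented entry point; the helper names and the candidate list are assumptions for illustration, not code from this project.

```python
def detect_with_chardet(raw: bytes):
    """Return chardet's best guess, or None if chardet is not installed."""
    try:
        import chardet  # third-party: https://github.com/chardet/chardet
    except ImportError:
        return None
    # detect() returns a dict with 'encoding' and 'confidence' keys.
    return chardet.detect(raw)["encoding"]


def first_decodable(raw: bytes, candidates=("utf-8-sig", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes raw without error.

    The candidate list is an assumption; latin-1 accepts any byte sequence,
    so it acts as a last resort and must stay at the end.
    """
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


def guess_encoding(raw: bytes):
    """Prefer chardet's guess; fall back to trial decoding."""
    return detect_with_chardet(raw) or first_decodable(raw)
```

A snippet could then call `raw.decode(guess_encoding(raw))` once, instead of branching on every possible encoding inline.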

Contributor

hoylen commented Oct 24, 2018

And it is not just the character encoding that might be different. I've seen many variations of CSV around (e.g. how they treat commas, new lines and escaping characters in values). There is no real standard... and even if there were, not everyone would implement it properly.
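For the dialect side, Python's standard `csv` module can guess some of these variations. The sketch below uses `csv.Sniffer` to guess the delimiter and falls back to the default Excel dialect when sniffing fails. The function names, the sample size and the delimiter list are assumptions for illustration, and sniffing is a heuristic that can guess wrong.

```python
import csv
import io


def parse_csv_text(text: str, delimiters: str = ",;\t|"):
    """Parse CSV text, guessing the dialect from the first few KB."""
    sample = text[:4096]
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters=delimiters)
    except csv.Error:
        # Sniffing failed (e.g. a single-column file): assume plain CSV.
        dialect = csv.excel
    return list(csv.reader(io.StringIO(text, newline=""), dialect))


def parse_backslash_escaped(text: str):
    """Parse CSV that escapes commas with a backslash instead of quoting."""
    return list(csv.reader(io.StringIO(text, newline=""), escapechar="\\"))
```

Variants that the sniffer cannot recognise, such as backslash escaping, still need an explicit hint from somewhere, which is where per-dataset metadata would come in.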

It feels like we need a general framework where different snippets can be assigned to different data sets, based on an expandable set of rules and metadata.

Currently, we (want to) have two snippets: download CSV and download anything. But the further we go, the more variants we'll have to deal with (e.g. download UTF-8 CSV, download CSV that puts values with commas in double quotes, download CSV that uses backslashes to escape commas).

At one extreme, the rules need only find one snippet for a type of file. At the other extreme, there might need to be a custom snippet that is only used for one particular dataset. In between, a single snippet is used with all CSV from a particular publisher, but a different snippet is used for other publishers. That is, the metadata for the rules might already be available, or in the worst case there needs to be a "use this particular snippet" metadata property.
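A minimal sketch of such a rule set, assuming dataset metadata is a flat dictionary: an explicit `snippet` property wins, otherwise the matching rule with the most conditions wins, otherwise the generic fallback is used. All names here (`Rule`, `select_snippet`, the metadata keys and the snippet identifiers) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    snippet: str       # identifier of the snippet to use
    conditions: dict   # metadata key/value pairs that must all match


def select_snippet(metadata: dict, rules: list, fallback: str = "download_anything") -> str:
    """Pick the most specific snippet for a dataset's metadata."""
    # Worst case: the dataset names its snippet directly.
    explicit = metadata.get("snippet")
    if explicit:
        return explicit
    matching = [
        rule for rule in rules
        if all(metadata.get(key) == value for key, value in rule.conditions.items())
    ]
    if not matching:
        return fallback
    # More conditions means a more specific rule (publisher beats format).
    return max(matching, key=lambda rule: len(rule.conditions)).snippet


RULES = [
    Rule("download_csv", {"format": "csv"}),
    Rule("download_csv_backslash", {"format": "csv", "publisher": "example-agency"}),
]
```

Counting conditions is only one possible notion of specificity; an explicit priority field on each rule would work as well and may be easier to maintain as the rule set grows.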

Maintaining this will be a lot of work, so maybe we should let users contribute. Or at least let them tell us when a snippet no longer works for a particular dataset and/or to vote it down. Maybe they can be given a pop-up menu of possible snippets they can use, with a default already chosen, but with other options that might work -- with the "download anything" snippet as the option of last resort. Sounds like a code sharing project/feature in its own right!
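The menu idea could look roughly like the sketch below: candidate snippets are ordered by user votes for a given dataset, and the generic snippet is always listed last. The class and function names, the vote model and the snippet identifiers are all assumptions for illustration.

```python
from collections import Counter

FALLBACK = "download_anything"  # hypothetical identifier for the generic snippet


class SnippetFeedback:
    """Keeps a running vote total per (dataset, snippet) pair."""

    def __init__(self):
        self._votes = Counter()

    def vote(self, dataset_id: str, snippet: str, delta: int) -> None:
        # delta is +1 for "works for me", -1 for "no longer works".
        self._votes[(dataset_id, snippet)] += delta

    def score(self, dataset_id: str, snippet: str) -> int:
        return self._votes[(dataset_id, snippet)]


def snippet_menu(dataset_id: str, candidates: list, feedback: SnippetFeedback) -> list:
    """Return candidates ordered by score, with the fallback always last.

    The first entry is the default choice. sorted() is stable, so snippets
    with equal scores keep the order in which they were supplied.
    """
    ranked = sorted(
        (c for c in candidates if c != FALLBACK),
        key=lambda c: -feedback.score(dataset_id, c),
    )
    return ranked + [FALLBACK]
```

A snippet that users vote down for one dataset drops in that dataset's menu only, so it can remain the default elsewhere.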

3 participants