Revamped data type sniffing for CSV/TSV files #260

Open
wants to merge 3 commits into
base: master

hjhornbeck

Based on: https://gitlab.com/hjhornbeck/datapusher/

This patch changes push_to_datastore() to use pandas instead of messytables for data type sniffing of CSV/TSV files. The last release of messytables was about five years ago, and its Git repository has been idle for three. In practice, I've found that it has difficulty detecting dates correctly, hence the motivation to substitute another library.
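To illustrate the general idea of pandas-based sniffing, here is a minimal sketch. It is not the code in this patch: the function name `sniff_column_types`, the coarse type labels, and the "every non-missing value must parse" rule for dates are all assumptions made for the example.

```python
import io
import warnings

import pandas as pd


def sniff_column_types(raw_text, delimiter=","):
    """Read delimited text with pandas and guess a coarse type per column.

    Hypothetical helper for illustration; the labels are not CKAN types.
    """
    frame = pd.read_csv(io.StringIO(raw_text), sep=delimiter)
    guesses = {}
    for name in frame.columns:
        column = frame[name]
        if pd.api.types.is_bool_dtype(column):
            guesses[name] = "bool"
        elif pd.api.types.is_integer_dtype(column):
            guesses[name] = "integer"
        elif pd.api.types.is_float_dtype(column):
            guesses[name] = "float"
        else:
            # String-like column: attempt a date parse and accept it only
            # if every non-missing value converts (an assumed rule).
            values = column.dropna()
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                parsed = pd.to_datetime(values, errors="coerce")
            is_date = len(values) > 0 and bool(parsed.notna().all())
            guesses[name] = "timestamp" if is_date else "text"
    return guesses


# A TSV sample; pass delimiter="," for CSV input.
sample = (
    "id\tprice\tsold_on\tlabel\n"
    "1\t2.5\t2022-12-07\tapple\n"
    "2\t3.0\t2022-12-08\tpear\n"
)
print(sniff_column_types(sample, delimiter="\t"))
```

The only difference between the CSV and TSV paths in this sketch is the separator handed to `pd.read_csv`.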

While adding that functionality, I've also made a number of small tweaks. Type sniffing is much more focused, taking more advantage of the int4 and int8 formats to save space relative to numeric. CKAN and PostgreSQL reject a column name that contains a percent sign; a workaround to permit this has been implemented. If the dataset contains any columns with no name, pandas gives them the name "Unnamed: X". It is common for CSV files to have an unnamed index column as their first column, which is redundant when CKAN adds its own index column; this code automatically deletes that column. The automatic data dictionary has been improved: the "description" field is now filled with a summary of the column's data. Text fields get special treatment: if a full list of the individual categories is less than a user-selectable limit, then all of them are explicitly named in a format that's both human- and machine-readable.
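As a rough sketch of how an integer column could be assigned the narrowest PostgreSQL type, the range check below picks int4, then int8, then falls back to numeric. The function name and the handling of an empty column are assumptions for illustration; the patch itself may decide differently.

```python
# Value ranges of PostgreSQL's 4-byte and 8-byte signed integer types.
INT4_MIN, INT4_MAX = -2**31, 2**31 - 1
INT8_MIN, INT8_MAX = -2**63, 2**63 - 1


def narrowest_integer_type(values):
    """Return 'int4', 'int8', or 'numeric' for an iterable of Python ints.

    Hypothetical helper; an empty column defaults to 'int4' by assumption.
    """
    values = list(values)
    if not values:
        return "int4"
    low, high = min(values), max(values)
    if INT4_MIN <= low and high <= INT4_MAX:
        return "int4"
    if INT8_MIN <= low and high <= INT8_MAX:
        return "int8"
    # Too wide for any fixed-size integer; numeric has arbitrary precision.
    return "numeric"


print(narrowest_integer_type([1, 2, 3]))
print(narrowest_integer_type([2**31]))
print(narrowest_integer_type([2**63]))
```

Checking only the minimum and maximum is enough here, because every other value in the column lies between them.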

hjhornbeck and others added 3 commits December 7, 2022 17:56
https://gitlab.com/hjhornbeck/datapusher/-/commits/master

Added pandas to the package requirements.
Allowed PostgreSQL types to pass through to the underlying database.
Wrote a routine to sanitize column names for CKAN/PostgreSQL.
Added support for TSV files.
Added code to drop a column pandas tends to add if there's a column with no name.
Added automatically-generated descriptions for all columns in pandas_sniff_algorithm().
Added a global variable to control what sort of description happens for text fields; this should allow machine-readable storage of category information while still being human-readable.
Added dummy descriptions to make up for the lack of them in old_sniff_algorithm().
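Two of the items in this list, column-name sanitizing and text-field descriptions, could look roughly like the sketch below. Everything named here is hypothetical: `sanitize_column_name`, `describe_text_column`, and `MAX_LISTED_CATEGORIES` are invented for the example, replacing "%" with "pct" is only one possible workaround, the limit is assumed to count distinct values, and a JSON list is assumed as the human- and machine-readable format.

```python
import json

# Hypothetical stand-in for the global variable that controls how text
# fields are described.
MAX_LISTED_CATEGORIES = 5


def sanitize_column_name(name):
    """Replace the percent sign, which CKAN/PostgreSQL reject.

    Assumption: the real patch may use a different substitution.
    """
    return name.strip().replace("%", "pct")


def describe_text_column(values, limit=MAX_LISTED_CATEGORIES):
    """Summarise a text column, listing every category when there are few."""
    non_empty = [v for v in values if v not in (None, "")]
    categories = sorted(set(non_empty))
    if 0 < len(categories) < limit:
        # A JSON list reads naturally and can be parsed back by software.
        return "Text; categories: " + json.dumps(categories)
    return "Text; {} distinct values in {} non-empty rows".format(
        len(categories), len(non_empty)
    )


print(sanitize_column_name("growth_%"))
print(describe_text_column(["b", "a", "b", ""]))
print(describe_text_column(list("abcdefg"), limit=3))
```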

Plus, all that has been synced up with some more recent commits.
Added a minor tweak: an unnamed initial column is almost certainly an index column, but an unnamed column in any other position may be an artifact of a poorly made header and not worth deleting automatically. The code now differentiates between these two scenarios.
Updating datapusher to 0.0.20.
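The distinction between a leading unnamed column and an unnamed column elsewhere can be sketched as follows. This is a simplified illustration with a hypothetical function name, relying only on the "Unnamed: N" placeholder names that pandas assigns to headerless columns.

```python
import re

# pandas labels a column with an empty header "Unnamed: <position>".
UNNAMED = re.compile(r"^Unnamed: \d+$")


def columns_to_drop(column_names):
    """Return the placeholder column to delete, if it is the first column.

    Hypothetical helper: unnamed columns in later positions are kept,
    since they may come from a malformed header rather than an index.
    """
    if column_names and UNNAMED.match(column_names[0]):
        return [column_names[0]]
    return []


print(columns_to_drop(["Unnamed: 0", "city", "population"]))
print(columns_to_drop(["city", "Unnamed: 1", "population"]))
```

With a pandas DataFrame, the returned list could be passed to `DataFrame.drop(columns=...)`.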