Revamped data type sniffing for CSV/TSV files #260
Based on: https://gitlab.com/hjhornbeck/datapusher/
This patch changes `push_to_datastore()` to use pandas instead of messytables for type sniffing of CSV/TSV files. The last release of messytables was about five years ago, and its Git repository has been idle for three. In practice, I've found it has difficulty detecting dates correctly, hence the motivation to substitute another library.
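To illustrate the kind of date detection pandas makes possible, here is a minimal sketch. It is not the code from this patch: `sniff_dates` is a hypothetical helper, and the rule "convert a column only if every value parses" is an assumption chosen for the example.

```python
import io
import warnings

import pandas as pd


def sniff_dates(df):
    """Hypothetical helper: turn non-numeric columns into datetimes
    when every value in the column parses as a date."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        # Numeric and boolean columns are left alone.
        if pd.api.types.is_numeric_dtype(col):
            continue
        try:
            with warnings.catch_warnings():
                # Silence format-inference warnings for non-date text.
                warnings.simplefilter("ignore")
                out[name] = pd.to_datetime(col, errors="raise")
        except (ValueError, TypeError, OverflowError):
            # At least one value is not a date; keep the column as text.
            pass
    return out


sample = io.StringIO("when,who,count\n2021-01-05,alice,3\n2021-02-10,bob,4\n")
frame = sniff_dates(pd.read_csv(sample))
print(frame.dtypes)
```

In this sketch the `when` column becomes a datetime column, while `who` stays text and `count` stays an integer column.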
In the process of adding that functionality, I've also made a number of small tweaks. Type sniffing is much more focused, taking more advantage of the `int4` and `int8` formats to save space relative to `numeric`. CKAN and PostgreSQL reject a column name that contains a percent sign; a workaround to permit this has been implemented. If the dataset contains any columns with no name, pandas gives them the name "Unnamed: X". It is common for CSV files to have an unnamed index column as their first column, which is redundant when CKAN adds its own index column; this code automatically deletes that column. The automatic data dictionary has been improved: the "description" field is now filled with a summary of the column's data. Text fields get special treatment: if a full list of the individual categories is less than a user-selectable limit, then all of them are explicitly named in a format that's both human- and machine-readable.
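A rough sketch of three of these tweaks follows. The helper names (`postgres_type`, `drop_unnamed_index`, `describe_text`), the default limit of 10, and the JSON list format for categories are all assumptions made for illustration; the patch itself may do each of these differently.

```python
import io
import json

import pandas as pd

INT4_MIN, INT4_MAX = -2**31, 2**31 - 1
INT8_MIN, INT8_MAX = -2**63, 2**63 - 1


def postgres_type(series):
    """Hypothetical helper: pick the narrowest PostgreSQL type for a column."""
    if pd.api.types.is_bool_dtype(series):
        return "bool"
    if pd.api.types.is_integer_dtype(series):
        if series.empty:
            return "int4"
        lo, hi = int(series.min()), int(series.max())
        if INT4_MIN <= lo and hi <= INT4_MAX:
            return "int4"
        if INT8_MIN <= lo and hi <= INT8_MAX:
            return "int8"
        return "numeric"  # e.g. uint64 values beyond the int8 range
    if pd.api.types.is_float_dtype(series):
        return "numeric"
    if pd.api.types.is_datetime64_any_dtype(series):
        return "timestamp"
    return "text"


def drop_unnamed_index(df):
    """Drop the first column if pandas had to invent its name.
    Assumption: only the name is checked, not the column's values."""
    if len(df.columns) and str(df.columns[0]).startswith("Unnamed: "):
        return df.drop(columns=df.columns[0])
    return df


def describe_text(series, limit=10):
    """Hypothetical description: list every category as JSON when there
    are at most `limit` of them, otherwise report only the count."""
    categories = sorted(map(str, series.dropna().unique()))
    if len(categories) <= limit:
        return "categories: " + json.dumps(categories)
    return f"{len(categories)} distinct values"


csv_text = ",id,big,price,city\n0,1,5000000000,1.5,Oslo\n1,2,6000000000,2.25,Rome\n"
df = drop_unnamed_index(pd.read_csv(io.StringIO(csv_text)))
types = {name: postgres_type(df[name]) for name in df.columns}
print(types)
print(describe_text(df["city"]))
```

Here the leading unnamed column is removed, `id` fits in `int4`, `big` needs `int8`, and `price` falls back to `numeric`.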