Be explicit about the datatypes of each column in csv files #68

Open
wants to merge 3 commits into base: develop

Conversation

ablack3
Collaborator

@ablack3 ablack3 commented Sep 18, 2024

We store the Eunomia CDM datasets in CSV files. Currently, the datatype of each column is not explicitly specified when the data is read from CSV, which is causing #65.

In this PR I use the specification in the CommonDataModel package to be explicit about the datatypes when we read the CSV files, which should fix the issue. However, this does mean that the column order matters.

I'm not sure whether we consider column order (first, second, etc.) part of the CDM specification, but I noticed that in the GiBleed dataset the column order does not match the order in the CommonDataModel specification CSV. We can work around it and/or fix the file. Allowing columns to be in any order is a bit trickier, but possible.
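
To illustrate the "columns in any order" option, here is a minimal sketch that applies types by column name instead of by position. It is written in Python with only the standard library, purely as an illustration and not as the code in this PR. The `SPEC` mapping and the `read_typed` helper are hypothetical names, and the four columns are an illustrative excerpt rather than the full CommonDataModel specification.

```python
import csv
import io
from datetime import date

# Hypothetical excerpt of a CDM-style specification: column name -> converter.
# The real specification lives in the CommonDataModel package.
SPEC = {
    "person_id": int,
    "gender_concept_id": int,
    "birth_datetime": date.fromisoformat,
    "person_source_value": str,
}

def read_typed(text, spec):
    """Parse CSV text, converting each field by column name, not by position."""
    reader = csv.DictReader(io.StringIO(text))
    unknown = set(reader.fieldnames or []) - set(spec)
    if unknown:
        raise ValueError(f"columns missing from spec: {sorted(unknown)}")
    rows = []
    for raw in reader:
        # Empty strings are treated as missing values (None).
        rows.append({k: (spec[k](v) if v != "" else None) for k, v in raw.items()})
    return rows

# The header is deliberately in a different order than SPEC.
sample = (
    "person_source_value,person_id,birth_datetime,gender_concept_id\n"
    "00123,1,1980-05-17,8507\n"
)
rows = read_typed(sample, SPEC)
print(rows[0])
```

Because the lookup is keyed on the header name, the file's column order no longer matters, and a value such as `00123` stays a string instead of being guessed as a number.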

@ablack3 ablack3 changed the base branch from main to develop September 18, 2024 08:12
@ablack3 ablack3 marked this pull request as draft September 18, 2024 08:12
@ablack3 ablack3 marked this pull request as ready for review September 18, 2024 08:28
@ablack3
Collaborator Author

ablack3 commented Sep 18, 2024

I need to investigate and fix the failing tests.

@fdefalco
Collaborator

Thanks for looking into this. It's another reason the duckdb-based data examples are a nice direction to go in.

@fdefalco
Collaborator

For the column order, I would suggest that the data files should match the order of the columns defined by the CDM specification. So would we rather update the data files to follow that column order as a fix?

@ablack3
Collaborator Author

ablack3 commented Sep 18, 2024

> For the column order, I would suggest that the data files should match the order of the columns defined by the CDM specification. So would we rather update the data files to follow that column order as a fix?

That would be my preference as well. So we require CSV files to have their columns in the same order as specified by the CommonDataModel specification.
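
If we go with that requirement, a check along these lines could flag files whose header drifts from the specification. This is only a sketch in Python using the standard library. `column_order_mismatches` is a hypothetical helper, and the `expected` list is an illustrative excerpt, not the full column list from the CommonDataModel specification.

```python
from itertools import zip_longest

def column_order_mismatches(header, expected):
    """Return (position, found, expected) for every position where the header differs.

    A missing column on either side shows up as None, so extra or
    absent columns are reported as well as reordered ones.
    """
    return [
        (i, found, want)
        for i, (found, want) in enumerate(zip_longest(header, expected))
        if found != want
    ]

# Illustrative excerpt of an expected column order.
expected = ["person_id", "gender_concept_id", "year_of_birth"]

ok = column_order_mismatches(
    ["person_id", "gender_concept_id", "year_of_birth"], expected
)
swapped = column_order_mismatches(
    ["gender_concept_id", "person_id", "year_of_birth"], expected
)
print(ok)
print(swapped)
```

An empty list means the header matches the expected order. Anything else pinpoints which positions need fixing in the data file.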
