Discrepancies between ddf_utils.create_datapackage (Python) and validate-ddf -i (Node) #548

Open
lapidus opened this issue Feb 28, 2019 · 3 comments

lapidus commented Feb 28, 2019

We are using both Python (https://github.com/semio/ddf_utils) and Javascript tooling to generate the datapackage.json with its ddfSchema property.

When running on the very same dataset, the Python-based generator results in a 50% larger datapackage.json file.

It would be interesting to hear your thoughts (@buchslava, @semio) about harmonising the two libraries. So far we have identified 4 differences in outcome:

1. Resource.name is encoded differently:

validate-ddf
"path": "ddf--entities--jurisdiction.csv",
"name": "jurisdiction"

ddf_utils
"path": "ddf--entities--jurisdiction.csv",
"name": "ddf--entities--jurisdiction"

2. The default datapackage.json properties differ

The JavaScript version typically adds more placeholders (such as title, license, author, version), whereas ddf_utils generates a bare minimum (name).

3. Python ddf_utils does not seem to work with multiple measures in one file?

ddf--datapoints--measure--measure--by--country--year.csv

4. Different files are excluded

The Python tools seem to do a better job of excluding files from DDF creation.
With validate-ddf -i, .DS_Store and .ipynb files were accidentally encoded into the datapackage.json file, whereas ddf_utils skipped them.
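
To pin down differences like the ones listed above, the two generated files can be compared resource by resource. The following is only a minimal sketch: `load_descriptor` and `diff_resource_names` are hypothetical helper names, not part of ddf_utils or ddf-validation, and the two sample descriptors contain nothing beyond the single resource quoted in difference 1.

```python
import json


def load_descriptor(filename):
    """Read a datapackage.json file into a dict (hypothetical helper)."""
    with open(filename, encoding="utf-8") as f:
        return json.load(f)


def diff_resource_names(pkg_a, pkg_b):
    """Return {path: (name_a, name_b)} for resources whose names differ.

    Only resources whose "path" appears in both descriptors are compared.
    """
    names_a = {r["path"]: r.get("name") for r in pkg_a.get("resources", [])}
    names_b = {r["path"]: r.get("name") for r in pkg_b.get("resources", [])}
    return {
        path: (names_a[path], names_b[path])
        for path in sorted(names_a.keys() & names_b.keys())
        if names_a[path] != names_b[path]
    }


# Sample descriptors built from the snippet quoted in difference 1.
node_pkg = {
    "resources": [
        {"path": "ddf--entities--jurisdiction.csv", "name": "jurisdiction"}
    ]
}
python_pkg = {
    "resources": [
        {
            "path": "ddf--entities--jurisdiction.csv",
            "name": "ddf--entities--jurisdiction",
        }
    ]
}

print(diff_resource_names(node_pkg, python_pkg))
```

With real output, the two dicts would come from `load_descriptor` called on each tool's datapackage.json.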

Thanks for any pointers and ideas!

semio commented Mar 3, 2019

Hi @lapidus, thanks for reporting the issue to us! I agree that it's better to unify the behavior for both libraries. Here are my suggestions:

  1. The resource names are probably the reason for the size difference between the two libraries. According to the DDFcsv spec:

name MUST be a string which MAY be the file name or file path of the resource, minus the extension.

So both ddf_utils and ddf-validation are correct, but using the file name minus the extension is the recommended way. I suggest that ddf-validation change the resource names to follow this convention.

  2. Again in the spec:

The fields title, description, author and license SHOULD be fields in datapackage.json.

So I will add more default fields to create_datapackage.

P.S. @lapidus, if you use ddf new to create a new dataset folder, you will be prompted to enter those fields, so the generated datapackage.json will contain them. After you add your CSV files, you can use ddf_utils.package.get_datapackage(update=True) to update datapackage.json while keeping the metadata from the old datapackage.json.

  3. It's a bug in ddf_utils; I will fix that.

  4. Currently ddf_utils only considers CSV files (those with a .csv extension). I thought the tool was for DDFcsv datapackages, so we only need to process CSV files. I guess we could add include / exclude parameters to the functions so that we have better control over which files are processed.
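
The include / exclude idea from the last point could look roughly like the sketch below. This is an assumption about one possible shape, not the ddf_utils API: `select_files` and its `include` / `exclude` parameters are hypothetical names, and the file list is made up for illustration.

```python
from fnmatch import fnmatchcase


def select_files(filenames, include=("*.csv",), exclude=()):
    """Keep names that match any include pattern and no exclude pattern.

    Hypothetical helper: patterns are shell-style globs, matched
    case-sensitively against the bare file name.
    """
    selected = []
    for name in filenames:
        if not any(fnmatchcase(name, pattern) for pattern in include):
            continue  # not covered by any include pattern
        if any(fnmatchcase(name, pattern) for pattern in exclude):
            continue  # explicitly excluded
        selected.append(name)
    return selected


# Made-up directory listing for illustration.
files = [
    ".DS_Store",
    "notes.ipynb",
    "ddf--concepts.csv",
    "ddf--entities--jurisdiction.csv",
    "scratch.csv",
]

print(select_files(files))
print(select_files(files, exclude=("scratch*",)))
```

Defaulting `include` to `*.csv` keeps the current CSV-only behaviour, while `exclude` lets a dataset author drop stray CSV files that are not part of the dataset.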

jheeffer (Member) commented Mar 3, 2019

Thanks for answering Semio :)

  1. Indeed, as Semio says, both are valid. ddf-validation is a bit more concise and thereby saves some disk space, while ddf_utils follows the option the spec gives (which, by the way, is not a recommendation; SHOULD would be a recommendation). We can align the two scripts to do the same. But, @lapidus, what is the actual problem you're having with the difference between the outputs?

  2. Yeah, Semio, good to add it to create_datapackage

  3. Thanks for the bug report @lapidus

  4. I think we should only include *.csv files, so if ddf-validation is reading other files that's wrong. Although I don't have experience with those specific files being taken along. I'll forward to @buchslava to check it.
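
For reference, the two naming styles discussed in point 1 can be sketched as follows. `name_from_filename` implements the option quoted from the spec (file name minus extension). `short_name` is purely an assumption: it takes the last `--`-separated segment, which reproduces the single example in this thread but is not confirmed to be what ddf-validation actually does.

```python
from pathlib import PurePosixPath


def name_from_filename(path):
    """Spec option quoted above: the file name minus the extension."""
    return PurePosixPath(path).stem


def short_name(path):
    """HYPOTHETICAL rule: last '--'-separated segment of the file stem.

    Matches the one example reported in this issue; the real
    ddf-validation logic may differ.
    """
    return PurePosixPath(path).stem.split("--")[-1]


path = "ddf--entities--jurisdiction.csv"
print(name_from_filename(path))
print(short_name(path))
```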

jheeffer (Member) commented Mar 4, 2019

@lapidus can you give steps for reproduction for issue 4 (version of ddf-validation, actual file, steps you take etc)? We can't really reproduce it locally (all three of us tried).
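
When putting together the reproduction, a quick check like the one below can confirm whether non-CSV files really ended up in the generated descriptor. It is a sketch under assumptions: `non_csv_resources` is a hypothetical helper, it assumes stray files would show up as entries under `resources`, and the sample descriptor is made up rather than taken from the affected dataset.

```python
def non_csv_resources(descriptor):
    """List resource paths in a datapackage dict that do not end in .csv."""
    stray = []
    for resource in descriptor.get("resources", []):
        path = resource.get("path", "")
        if not path.lower().endswith(".csv"):
            stray.append(path)
    return stray


# Made-up descriptor; the real one would be loaded from datapackage.json.
sample = {
    "name": "example-dataset",
    "resources": [
        {"path": "ddf--concepts.csv", "name": "ddf--concepts"},
        {"path": ".DS_Store", "name": ".DS_Store"},
        {"path": "analysis.ipynb", "name": "analysis"},
    ],
}

print(non_csv_resources(sample))
```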
