May I contribute the complete Gapminder data set to your repo? #20

Closed
jabowery opened this issue Dec 4, 2017 · 11 comments

Comments

@jabowery

jabowery commented Dec 4, 2017

Jenny,

I've written an "R" script that joins the entire Gapminder database into a single csv file, given the Gapminder github repository of said data, which is organized one file per variable.
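For readers who want a feel for the join described above, here is a minimal sketch in Python rather than R. It assumes each per-variable file is a CSV keyed by `geo` and `time` columns with one value column; those column names, the function name `merge_indicator_tables`, and the toy values are all illustrative assumptions, not taken from the actual script or repository.

```python
import csv
import io


def merge_indicator_tables(tables, key=("geo", "time")):
    """Outer-join per-indicator CSV tables into one wide table.

    `tables` is an iterable of file-like objects. Each CSV is assumed
    to contain the key columns plus one or more value columns.
    """
    merged = {}   # key tuple -> {value column: value}
    columns = {}  # value columns, kept in first-seen order
    for handle in tables:
        for row in csv.DictReader(handle):
            k = tuple(row[c] for c in key)
            target = merged.setdefault(k, {})
            for col, value in row.items():
                if col in key:
                    continue
                columns.setdefault(col, None)
                target[col] = value
    header = list(key) + list(columns)
    # Missing combinations become empty cells, which is why the
    # joined file ends up sparse.
    rows = [list(k) + [merged[k].get(c, "") for c in columns]
            for k in sorted(merged)]
    return header, rows


# Toy inputs with made-up values, standing in for two indicator files.
indicator_a = io.StringIO("geo,time,indicator_a\nswe,2000,1.0\nusa,2000,2.0\n")
indicator_b = io.StringIO("geo,time,indicator_b\nswe,2000,10\nbra,2000,30\n")
header, rows = merge_indicator_tables([indicator_a, indicator_b])
```

Because this is an outer join, every key that appears in any file gets a row, and indicators without a value for that key are left blank.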

Since you already have the name "gapminder" in CRAN, would you mind terribly taking ownership of this, the complete, set of Gapminder data?

Thanks!

-- Jim

PS: Let me know how I can be of any further assistance.

@jennybc
Owner

jennybc commented Dec 4, 2017

If it came in, I'd want to have clean R scripts that show how it is produced from its inputs, as I've done with the small excerpt that's already here. Look in the data-raw directory of this repo for that.

Based on the three variables I worked with, there was lots of fussing with inconsistent country names, missing and inconsistent continents, etc. I assume all this is much worse when trying to unify multiple datasets. What's the situation on that front?

What is the final file size of this csv?

Thanks!

@jabowery
Author

jabowery commented Dec 5, 2017

Although the file's geography is almost entirely ISO 3166-1 alpha-3 compliant, I'll need to do some "fussing" of my own to get it into full compliance. From there it can be converted to whatever standard you like. The scripts to do this can be included to build the compliant file directly from Gapminder's GitHub repository. If necessary, it can be an "R" script that calls out to do the git clone of that repository.

The problem isn't as bad as it may, at first, appear. Click on the link I gave to the "Gapminder github repository" above.

The bzip2 -9 size of the data is just above 8M. The decompressed size is probably misleadingly large because it is very sparse -- rendered even more so by the lack of total compliance with a geo standard. But you can click on the link I gave for the CSV, download it and bunzip it to see that it takes nearly 160M.

@jennybc
Owner

jennybc commented Dec 5, 2017

I will look into the feasibility of including such a compressed file. I haven't done so in the past ...

@jennybc
Owner

jennybc commented Dec 5, 2017

It looks like the dataset might be an acceptable size, if included as a compressed R data file. Apparently it's borderline, with 5MB being the current ceiling.

So I can't guarantee anything. If you want to pursue it, though, the next step would be a PR with one or more scripts that enact dataset construction and that verify or enforce the consistency of countries/continents, as is done with the current excerpt. Then I'd need to think about how to make it most useful to others, since a data frame with >500 variables is a bit unwieldy.

@jabowery
Author

jabowery commented Dec 6, 2017

I'll fork/clone your repo and make the mods for your requested Pull Request.

I have filtered out all rows except those that are alpha-3 compliant, using the R script included below.

merge_datapoints.R.gz

The 8M bz2 of the resulting tsv is at this link.

I note your iso file has 188 country codes. The source I used to filter for compliance had 248 in this json file. I can construct a country code tsv file congruent with yours from the aforelinked json file using the common.name and ccn3 fields to populate your country and iso_num columns.
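As a rough illustration of the filtering and code-table steps described above, here is a minimal Python sketch (not the R script attached to this comment). It assumes the JSON file is a list of objects carrying an alpha-3 code under `cca3`, a numeric code under `ccn3`, and a common name under `name.common`; that layout, the `geo` field, and the function names are assumptions made for the example.

```python
import json


def load_country_codes(json_text):
    """Return {alpha3: (common_name, numeric_code)} from a JSON country list.

    The field names cca3, ccn3 and name.common are assumed, not confirmed.
    """
    codes = {}
    for entry in json.loads(json_text):
        codes[entry["cca3"].upper()] = (entry["name"]["common"], entry["ccn3"])
    return codes


def keep_compliant(rows, codes, geo_field="geo"):
    """Keep only rows whose geo value is a known alpha-3 code (case-insensitive)."""
    return [row for row in rows if row[geo_field].upper() in codes]


def country_code_table(codes):
    """Build (country, iso_alpha, iso_num) rows, sorted by alpha-3 code."""
    return [(name, alpha3, int(num)) for alpha3, (name, num) in sorted(codes.items())]


# Two-entry stand-in for the real JSON file.
sample_json = """[
  {"name": {"common": "Sweden"}, "cca3": "SWE", "ccn3": "752"},
  {"name": {"common": "North Korea"}, "cca3": "PRK", "ccn3": "408"}
]"""
codes = load_country_codes(sample_json)
rows = [
    {"geo": "swe", "time": "2000"},
    {"geo": "world", "time": "2000"},  # an aggregate, not an alpha-3 code
    {"geo": "prk", "time": "2000"},
]
kept = keep_compliant(rows, codes)
table = country_code_table(codes)
```

Rows for aggregates such as `world` are dropped because their key is not in the reference set.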

@jabowery
Author

I've submitted pull request 21. Let me know if there is anything I can do to help you evaluate it and/or improve the request.

@jennybc
Owner

jennybc commented Dec 14, 2017

Thanks @jabowery. I'm about to enter a period of travel. If this falls off my radar, please don't hesitate to ping me here sometime next week.

@jabowery
Author

While you're gone, I'll be putting it through some paces and expect to do some more cleaning. For instance, I discovered one of the indicators has a name starting with a numeral.

@dmi3kno

dmi3kno commented Sep 23, 2018

One pretty urgent inconsistency is that North Korea (Korea, Dem. Rep.) has an ISO code of KOR, when in fact it should be PRK.
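A check along the following lines could catch this kind of mistake automatically. It is a minimal Python sketch, assuming the country table can be read as (country, ISO code) pairs; the function name and the sample pairs are illustrative, and the package's own checks may look different.

```python
from collections import defaultdict


def find_shared_codes(pairs):
    """Return {code: [countries]} for every code assigned to more than one country."""
    by_code = defaultdict(set)
    for country, code in pairs:
        by_code[code].add(country)
    return {code: sorted(names) for code, names in by_code.items() if len(names) > 1}


# Illustrative pairs reproducing the problem: two countries share KOR.
pairs = [
    ("Korea, Dem. Rep.", "KOR"),
    ("Korea, Rep.", "KOR"),
    ("Sweden", "SWE"),
]
shared = find_shared_codes(pairs)
```

An empty result means every code maps to exactly one country. With the pairs above, `shared` flags `KOR`.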

@jennybc
Owner

jennybc commented Mar 9, 2023

(The country code for North Korea has been fixed, in 003f98f. And I plan to make a release.)

I just touched this repo for the first time in many years, so realistically, I'm going to close remaining issues and PRs, just because it's quite clear that I need to treat this package as "finished". I think it's just going to be a little time capsule.

@dmi3kno

dmi3kno commented Feb 1, 2024

I came here to say that including the all_indicators data as an optional dataset is a brilliant idea.

May I suggest that you, @jabowery, spin it off into a separate GitHub-only package, which could be named gapminderplus or something along those lines? We would probably need searchable documentation for the indicators, but this work is badly needed and would be of enormous pedagogical value. I teach an entire basic stats class on gapminder, and going beyond 3 variables is extremely important. Right now we download and join the indicators one by one, but if such a GitHub package were available, I would just direct students to remotes::install_github() and we would have a ball for the rest of the semester!
