
Allow field filters for data dumps #2902

Closed
BrittanyBunk opened this issue Jan 22, 2020 · 10 comments
Labels
Lead: @hornc (Issues overseen by Charles, Staff: Data Engineering Lead) [managed]
Module: Data dumps
Priority: 3 (Issues that we can consider at our leisure.) [managed]
Type: Feature Request (Issue describes a feature or enhancement we'd like to implement.) [managed]

Comments

@BrittanyBunk (Contributor) commented Jan 22, 2020

https://openlibrary.org/developers/dumps
Right now there are only three dump categories (editions, works, authors), which makes downloading difficult: the files are too big, and not all of the data in them is necessary. Downloading, however, is really important.
So I suggest also letting people filter by fields (publication date, publisher, series name, 'about the __', contributors, etc.) by making a selection before the dump is downloaded; it'd be much appreciated.

Then the download would contain only what a person needs, so they could get started on it right away.

The reason is accessibility: it should be easy for anyone to access and work with the data in the dump, whether they have coding experience or not.

@BrittanyBunk added the Needs: Triage, State: Backlogged, and Type: Feature Request labels Jan 22, 2020
@tfmorris (Contributor)

> they're too big and not all of the data there necessary. However, downloading is really important.

Can you expand on what use cases you are trying to enable? I usually download the full authors+works+editions version of the dump, and it takes me less than an hour on my residential connection to both download it and load it into a database. On a well-connected cloud machine, this would take much, much less time.

I know that CommonCrawl publishes some datasets in Apache Parquet format. Would that meet your needs? What does your desired toolchain look like?

@BrittanyBunk (Contributor, Author)

@tfmorris it's not just for me, but for anyone. Extremely large files are difficult to open and take a long time to work with. I can't open these in normal programs on my computer, and it will only get more difficult as Open Library's data grows.

So downloading isn't the problem; the time is about the same for me too. I don't know if Parquet will meet my needs. I only need some of the data fields, like the OL ID, author ID, etc. If you could give me a link for that, it may help.

I just wish there were a way to pick what data to download (like checkboxes for which info I want downloaded) and download just that. I think it would benefit others too, or at least give people ideas on what software can be used to open such large files. I had to download one program to open the .gz file and another to open the extremely large text file.
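
(For what it's worth, on most Unix-like systems the compressed dump can be inspected without installing extra software. A minimal sketch; the filename assumes the 2020-03-31 complete dump referenced later in this thread:

# Page through the compressed dump without extracting it first.
zcat ol_dump_2020-03-31.txt.gz | less
# Or count records to get a sense of the size.
zcat ol_dump_2020-03-31.txt.gz | wc -l
)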

@BrittanyBunk changed the title from "Have more bulk data dump categories" to "Have more bulk data dump categories for downloading" Jan 22, 2020
@xayhewalo added the Module: Data dumps, Lead: @hornc, and Priority: 3 labels and removed the Needs: Triage label Jan 25, 2020
@xayhewalo changed the title from "Have more bulk data dump categories for downloading" to "Allow field filters for data dumps" Jan 25, 2020
@xayhewalo (Collaborator)

@BrittanyBunk I've changed the title to better reflect my interpretation of your issue. Let me know if I'm mistaken.

@BrittanyBunk (Contributor, Author)

Now that the title has changed a little, I'd like to expand on this some more. @cdrini taught me how to remove lines that don't contain a given word. That's great for rows, but there's also a need for column (field) filters. A way to have checkboxes for both rows and columns would be awesome.

@BrittanyBunk (Contributor, Author)

I can't personally create the filtered files myself, because the dumps are automatically updated, but I could make a list of the row/column filters if that would help.

@BrittanyBunk (Contributor, Author)

Row filters would isolate desired words or phrases.
Column filters would isolate the desired fields.
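
(To illustrate what that combination looks like with existing command-line tools: a minimal sketch, assuming the usual dump layout of tab-separated columns with the record JSON in the fifth column. The search term, dump filename, and output fields here are hypothetical examples, not a proposed implementation:

# Row filter: keep only lines mentioning "Tolkien".
# Column filter: keep only the record key and title from the JSON in column 5.
zgrep 'Tolkien' ol_dump_editions_2020-03-31.txt.gz | cut -f 5 | jq -r '"\(.key)\t\(.title)"'
)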

@tfmorris (Contributor) commented Apr 3, 2020

jq can extract any column you desire. zgrep piped to jq can quickly create pretty much any subset that you need.

As an example, the following command will give you the key and redirect key (i.e., which record it was merged with) for the first 10 redirects in the complete dump, in a couple of seconds:

time curl -L https://archive.org/download/ol_dump_2020-03-31/ol_dump_2020-03-31.txt.gz | zgrep ^/type/redirect | cut -f 5 | jq -r '"\(.key),\(.location)"' | head

The example can be extended/modified to get any data you want and can be done either streaming from the Internet, as above, or from a locally downloaded copy of the file, depending on what's most efficient for your use case.
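
(As a further illustration, a variation of that pipeline which pulls a few edition fields instead; the field names title and publish_date come from Open Library's edition records, but treat the exact selection as a sketch:

# Same streaming pipeline, but select edition rows and print key, title, and publish date.
time curl -L https://archive.org/download/ol_dump_2020-03-31/ol_dump_2020-03-31.txt.gz | zgrep ^/type/edition | cut -f 5 | jq -r '"\(.key)\t\(.title)\t\(.publish_date)"' | head
)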

@BrittanyBunk (Contributor, Author)

@tfmorris I apologize for miscommunicating. This is supposed to apply before downloading, not after. I edited my original comment to match the title.

@mekarpeles (Member)

I agree with @tfmorris's comment that there are a lot of strategies for processing the data dumps after download.

I haven't heard demand from partners for more specific data dumps, or complaints about their size.

Marking this as Will Not Fix; however, the community is more than welcome to create derivative dumps from what we offer!

@BrittanyBunk (Contributor, Author)

Who are these 'partners'?
