
Allow field filters for data dumps #2902

Closed
BrittanyBunk opened this issue Jan 22, 2020 · 10 comments
Labels
Lead: @hornc (Issues overseen by Charles, Staff: Data Engineering Lead) [managed]
Module: Data dumps
Priority: 3 (Issues that we can consider at our leisure.) [managed]
Type: Feature Request (Issue describes a feature or enhancement we'd like to implement.) [managed]

Comments

@BrittanyBunk (Contributor) commented Jan 22, 2020

https://openlibrary.org/developers/dumps
Right now there are only three dump categories (editions, works, authors), which makes downloading difficult: the files are too big, and not all of the data in them is necessary. Downloading, however, is really important.
So I suggest also letting people filter by fields (publication date, publisher, series name, 'about the __', contributors, etc.) by making a selection before the dump is downloaded; it'd be much appreciated.

Then the download would contain only what a person needs, so they could get started on it right away.

The reason is accessibility: it should be easy for anyone to access and work with the data in the dump, whether they have coding experience or not.

@BrittanyBunk added the Needs: Triage, State: Backlogged, and Type: Feature Request labels Jan 22, 2020
@tfmorris (Contributor)

> they're too big and not all of the data there necessary. However, downloading is really important.

Can you expand on what use cases you are trying to enable? I usually download the full authors+works+editions version of the dump, and it takes me less than an hour on my residential connection to both download it and load it into a database. On a well-connected cloud machine, this would take much, much less time.

I know that CommonCrawl publishes some datasets in Apache Parquet format. Would that meet your needs? What does your desired toolchain look like?

@BrittanyBunk (Contributor, Author)

@tfmorris it's not just for me, but for anyone. Extremely large files are difficult to open and take a long time to work with. I can't open these in normal programs on my computer, and it will only get more difficult as Open Library's data grows.

So downloading isn't the problem; the time is about the same for me too. I don't know if Parquet will meet my needs. I only need some of the data fields, like the OL ID, author ID, etc. If you could give me a link for that, it may help.

I just wish there were a way to pick what data to download (like checkboxes for which info I want downloaded) and download just that. I think it would benefit others too, or at least give people ideas on what software can be used to open such large files. I had to download one program to open the .gz file and another to open the extremely large text file.
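
(For what it's worth, on most Unix-like systems the compressed dump can be inspected without installing extra software. A minimal sketch; the filename assumes the 2020-03-31 complete dump referenced later in this thread:

# Page through the compressed dump without extracting it first.
zcat ol_dump_2020-03-31.txt.gz | less
# Or count records to get a sense of the size.
zcat ol_dump_2020-03-31.txt.gz | wc -l
)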

@BrittanyBunk changed the title from "Have more bulk data dump categories" to "Have more bulk data dump categories for downloading" Jan 22, 2020
@xayhewalo added the Module: Data dumps, Lead: @hornc, and Priority: 3 labels and removed the Needs: Triage label Jan 25, 2020
@xayhewalo changed the title from "Have more bulk data dump categories for downloading" to "Allow field filters for data dumps" Jan 25, 2020
@xayhewalo (Collaborator)

@BrittanyBunk I've changed the title to better reflect my interpretation of your issue. Let me know if I'm mistaken.

@BrittanyBunk (Contributor, Author)

Now that the title has changed a little, I'd like to expand on this some more. @cdrini taught me how to remove lines that don't contain a given word. That's great for rows, but there's also a need for column (field) filters. A way to have checkboxes for both rows and columns would be awesome.

@BrittanyBunk (Contributor, Author)

I can't personally create the filtered files myself, because the dumps are automatically updated, but I could make a list of the row/column filters if that would help.

@BrittanyBunk (Contributor, Author)

Row filters would isolate desired words or phrases.
Column filters would isolate the desired fields.
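
(To illustrate what that combination looks like with existing command-line tools: a minimal sketch, assuming the usual dump layout of tab-separated columns with the record JSON in the fifth column. The search term, dump filename, and output fields here are hypothetical examples, not a proposed implementation:

# Row filter: keep only lines mentioning "Tolkien".
# Column filter: keep only the record key and title from the JSON in column 5.
zgrep 'Tolkien' ol_dump_editions_2020-03-31.txt.gz | cut -f 5 | jq -r '"\(.key)\t\(.title)"'
)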

@tfmorris (Contributor) commented Apr 3, 2020

jq can extract any column you desire. zgrep piped to jq can quickly create pretty much any subset that you need.

As an example, the following command will give you the key and redirect key (i.e., which record it was merged with) for the first 10 redirects in the complete dump, in a couple of seconds:

time curl -L https://archive.org/download/ol_dump_2020-03-31/ol_dump_2020-03-31.txt.gz | zgrep ^/type/redirect | cut -f 5 | jq -r '"\(.key),\(.location)"' | head

The example can be extended/modified to get any data you want and can be done either streaming from the Internet, as above, or from a locally downloaded copy of the file, depending on what's most efficient for your use case.
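
(As a further illustration, a variation of that pipeline which pulls a few edition fields instead; the field names title and publish_date come from Open Library's edition records, but treat the exact selection as a sketch:

# Same streaming pipeline, but select edition rows and print key, title, and publish date.
time curl -L https://archive.org/download/ol_dump_2020-03-31/ol_dump_2020-03-31.txt.gz | zgrep ^/type/edition | cut -f 5 | jq -r '"\(.key)\t\(.title)\t\(.publish_date)"' | head
)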

@BrittanyBunk (Contributor, Author)

@tfmorris I apologize for miscommunicating. This is supposed to apply before downloading, not after. I edited my original comment to match the title.

@mekarpeles (Member)

I agree with @tfmorris's comment that there are a lot of strategies for processing the data dumps after download.

I haven't heard demand from partners for more specific data dumps, or complaints about their size.

Marking this as Will Not Fix; however, the community is more than welcome to create derivative dumps from what we offer!

@BrittanyBunk (Contributor, Author)

Who are these 'partners'?
