Allow field filters for data dumps #2902
Can you expand on what use cases you are trying to enable? I usually download the full authors+works+editions version of the dump, and it takes me less than an hour on my residential connection to both download it and load it into a database. On a well-connected cloud machine, this would take much, much less time. I know that CommonCrawl publishes some datasets in Apache Parquet format. Would that meet your needs? What does your desired toolchain look like?
@tfmorris It's not just for me, but for anyone. Extremely large files are difficult to open and take a long time to work with. I just can't open them in normal programs on my computer, and it'll only get more difficult as Open Library's data gets bigger. Downloading isn't the problem; the download time is the same for me. I don't know if that would meet my needs. I just need some of the data fields, like the OL ID, author ID, etc. If you could give me a link for that, it may help. I just wish there were a way to pick what data to download (like checkboxes for which info I want) and download only that. I think it would benefit others too, or at least give people ideas on what software can be used to open such large files. I had to download one program to open the .gz file and another to open the extremely large text file.
@BrittanyBunk I've changed the title to better reflect my interpretation of your issue. Let me know if I'm mistaken.
Now that the title's changed a little, I would like to expand on this some more. @cdrini taught me about removing lines that don't contain a given word. That's great for the rows, but there's also a need for column (field) filters. So a way to have checkboxes for both rows and columns would be awesome.
I can't create the categories as files myself, because they're updated automatically, but I could make a list of the row/column filters if that would help.
Row filters would isolate desired words or phrases (see the sketch below).
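As a rough illustration of the kind of row filter being described (a sketch only, not an official Open Library tool; the dump filename and the search phrase are placeholders):

```sh
# Keep only the rows (lines) of the works dump that mention a phrase.
gunzip -c ol_dump_works_latest.txt.gz | grep 'Harry Potter' > filtered_works.txt

# The inverse: drop rows containing the phrase, keeping everything else.
gunzip -c ol_dump_works_latest.txt.gz | grep -v 'Harry Potter' > other_works.txt
```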
jq can extract any column you desire. As an example, the following command will give you the key and redirect key (i.e. which record it was merged with) for the first 10 redirections in the complete dump in a couple of seconds:
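A minimal sketch of such a command, assuming the complete dump's tab-separated layout (type, key, revision, last_modified, JSON), the latest-dump URL pattern from the dumps page, and that redirect records store their target in a `location` field:

```sh
# Stream the complete dump, keep only redirect records, take the first 10,
# pull out the JSON column (column 5), and print each record's key together
# with the key it now redirects to.
curl -sL https://openlibrary.org/data/ol_dump_latest.txt.gz \
  | gunzip -c \
  | grep '^/type/redirect' \
  | head -10 \
  | cut -f5 \
  | jq -r '[.key, .location] | @tsv'
```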
The example can be extended/modified to get any data you want and can be done either streaming from the Internet, as above, or from a locally downloaded copy of the file, depending on what's most efficient for your use case.
@tfmorris I apologize for miscommunicating. This is supposed to apply before downloading. I edited my original comment that goes with the title.
I agree with @tfmorris's comment that there are a lot of strategies for processing the data dumps after download. I haven't heard demand or requests from partners asking for more specific data dumps, or complaints about the size. Marking it as
Who are these 'partners'?
https://openlibrary.org/developers/dumps
Right now there are only 3 categories (editions, works, authors), which are really difficult to download because the files are so big and not all of the data in them is necessary. However, downloading is really important.
So I suggest separating the dumps by fields too (publication date, publisher, series name, 'about the __', contributors, etc.), with a selection made before the file gets downloaded; it'd be much appreciated.
Then the download would contain only what a person needs, so they could get started on it right away.
The reason is accessibility: making it easy for anyone to access and work with the data in the dump, whether they have coding experience or not.