LOC.gov pagination limits make it impossible to get all of big collections #22

Open
lmullen opened this issue Aug 13, 2021 · 0 comments
Labels
external-problem A problem outside this software that can't be worked around

lmullen commented Aug 13, 2021

The LOC.gov pagination limits make it impossible to page past 100,000 items, so the crawler can't get all of Chronicling America.

A sample log entry from the crawler:

cchc-crawler  | time="2021-08-13T03:47:34Z" level=warning msg="HTTP error when fetching from API" http_code=400 http_error="400 Bad Request" url="https://www.loc.gov/collections/chronicling-america/?at%21=aka%2Cbreadcrumbs%2Cbrowse%2Ccategories%2Ccontent%2Ccontent_is_post%2Cexpert_resources%2Cfacet_trail%2Cfacet_views%2Cfacets%2Cfeatured_items%2Cform_facets%2Clegacy-url%2Cnext%2Cnext_sibling%2Coptions%2Coriginal_formats%2Cpages%2Cpartof%2Cprevious%2Cprevious_sibling%2Cresearch-centers%2Cshards%2Csite_type%2Csubjects%2Ctimeline_1852_1880%2Ctimeline_1881_1900%2Ctimeline_1901_1925%2Ctimestamp%2Ctopics%2Cviews&c=1000&fa=online-format%3Aonline+text&fo=json&sp=101&st=list"

Requesting that URL directly (`sp=101` with `c=1000`, i.e. page 101 at 1,000 items per page) does in fact return a 400 error.
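The arithmetic behind the failing request can be sketched roughly as follows. This is only an illustration, not the crawler's code: the function name `exceeds_limit` is hypothetical, the 100,000 figure is the one reported in this issue rather than anything taken from LOC.gov documentation, and the check assumes that `sp` is the page number and `c` the number of items per page, which is consistent with the logged URL failing at `sp=101` with `c=1000`.

```python
from urllib.parse import urlsplit, parse_qs

# Limit as reported in this issue; an observed value, not a documented one.
ITEM_LIMIT = 100_000


def exceeds_limit(url: str, item_limit: int = ITEM_LIMIT) -> bool:
    """Return True if the page requested by `url` would end past `item_limit` items.

    Assumes `sp` is the 1-based page number and `c` is the page size.
    Raises KeyError if either parameter is missing from the query string.
    """
    query = parse_qs(urlsplit(url).query)
    per_page = int(query["c"][0])
    page = int(query["sp"][0])
    return page * per_page > item_limit


# Shortened version of the logged URL, keeping only the paging parameters.
base = "https://www.loc.gov/collections/chronicling-america/?c=1000&fo=json&st=list"
print(exceeds_limit(base + "&sp=100"))  # last page that stays within the limit
print(exceeds_limit(base + "&sp=101"))  # the page that produced the 400 in the log
```

Under those assumptions, page 100 ends exactly at item 100,000 and page 101 would start at item 100,001, which matches where the crawler hit the 400 response.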

Probably need to ask if there is a way around this.

Cf. #18.

@lmullen lmullen added the external-problem A problem outside this software that can't be worked around label Jan 5, 2022