This repository has been archived by the owner on Dec 5, 2022. It is now read-only.
Some places, such as us-dc, report their data in Excel spreadsheets. Add crawl and parse support for that.
Work-in-progress branch add-excel-parsing on master repo
There are some libraries that parse xlsx. It seemed simple to add, but at the moment it breaks on crawl -- the crawled file is not a valid xlsx file. This branch contains a test (marked with only) that demonstrates the problem: the test crawls a local "fake source" .xlsx file, then checks that the file in the crawler cache can be parsed in the same way as the "fake source":
$ git fetch upstream
$ git checkout -b add-excel-parsing upstream/add-excel-parsing
$ npm run test
... etc
crawled Excel file has same parseable content as source
sanity check of src sheets
Sandbox Found Architect project manifest, starting up
Created test cache /Users/jeff/Documents/Projects/li/zz-testing-fake-cache
Created test report dir /Users/jeff/Documents/Projects/li/zz-reports-dir
Wrote to local cache: /Users/jeff/Documents/Projects/li/zz-testing-fake-cache/excel-source/2020-08-12/2020-08-12t20_27_54.266z-default-59988.xlsx.gz
...
x Error: End of data reached (data length = 10043, asked index = 347979759). Corrupted zip ? (fail at: undefined)
If we can get this crawl method to work, we can get Excel crawls and scrapes in general to work.
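One way to narrow down where the bytes go wrong is to compare the cached file against the source directly, before any xlsx library is involved. The following is a minimal sketch, assuming the cache entry is a gzipped copy of the download (as the .xlsx.gz name in the log suggests). The helper name compareWithSource is hypothetical and not part of the repo; the sample buffers stand in for real files.

```javascript
const zlib = require('zlib')

// An .xlsx file is a ZIP archive, so its first bytes should be the
// ZIP local file header signature "PK\x03\x04".
const ZIP_SIGNATURE = Buffer.from([0x50, 0x4b, 0x03, 0x04])

// Hypothetical helper (not in the repo): gunzip the cached bytes and
// report how they differ from the source bytes.
function compareWithSource (sourceBytes, gzippedCacheBytes) {
  const cached = zlib.gunzipSync(gzippedCacheBytes)
  return {
    startsWithZipSignature:
      cached.length >= 4 && cached.subarray(0, 4).equals(ZIP_SIGNATURE),
    sameLength: cached.length === sourceBytes.length,
    identical: cached.equals(sourceBytes)
  }
}

// In-memory stand-ins for the "fake source" and two possible cache entries.
const source = Buffer.concat([ZIP_SIGNATURE, Buffer.from([0xff, 0xfe, 0x80, 0x00])])
const intactCache = zlib.gzipSync(source)
const mangledCache = zlib.gzipSync(Buffer.from(source.toString('utf8'), 'utf8'))

console.log(compareWithSource(source, intactCache))
console.log(compareWithSource(source, mangledCache))
```

If the cached file still starts with the ZIP signature but differs in length from the source, that would point at the bytes being re-encoded somewhere between the HTTP response and the cache write, rather than at the parser.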
Things tried to get crawl to work
The "crawl" method (src/events/crawler/crawler) actually calls src/http/get-get-normal/index.js to get the file. I've tried:
setting the Content-Type in get-get-normal
setting content type in events/crawler/crawler/index.js (the got call)
a few other hacks!
Some other people ran into this trouble as well -- e.g. see SheetJS/sheetjs#337.
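One possible explanation for why changing the Content-Type alone does not help: if the binary body is converted to a UTF-8 string anywhere along the way, bytes that are not valid UTF-8 are replaced and the file cannot be recovered. The sketch below illustrates that effect with plain Node Buffers. The binaryResponse function is hypothetical; it assumes the HTTP function can return a base64 body flagged with isBase64Encoded (the AWS Lambda proxy response shape), which has not been verified against get-get-normal or the sandbox here.

```javascript
// Stand-in for spreadsheet bytes: a ZIP signature followed by bytes
// that are not valid UTF-8.
const original = Buffer.from([0x50, 0x4b, 0x03, 0x04, 0xff, 0xfe, 0x80, 0x00])

// Lossy: decoding as UTF-8 replaces each invalid byte with U+FFFD,
// so re-encoding yields different (and more) bytes.
const viaUtf8 = Buffer.from(original.toString('utf8'), 'utf8')

// Lossless: base64 represents every byte exactly.
const viaBase64 = Buffer.from(original.toString('base64'), 'base64')

console.log(viaUtf8.equals(original))   // false
console.log(viaBase64.equals(original)) // true

// Hypothetical response builder (an assumption, not the repo's code):
// send binary content as base64 and mark it as such.
function binaryResponse (bytes, contentType) {
  return {
    statusCode: 200,
    headers: { 'content-type': contentType },
    isBase64Encoded: true,
    body: bytes.toString('base64')
  }
}

// The receiving side would then decode the body back into bytes.
const res = binaryResponse(original, 'application/octet-stream')
const received = Buffer.from(res.body, 'base64')
console.log(received.equals(original))
```

This is only a sketch of one candidate cause; the same check (does the round trip preserve bytes?) can be applied at each hop between the got call and the cache write.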
A minimal repo demonstrating this is at https://github.com/covidatlas/arc-excel-downloading-trouble.