This repository has been archived by the owner on Dec 5, 2022. It is now read-only.

Add Excel parsing (have broken work-in-progress branch) #564

Closed
jzohrab opened this issue Aug 12, 2020 · 1 comment
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@jzohrab
Contributor

jzohrab commented Aug 12, 2020

Description.

Some places, such as us-dc, report their data in Excel spreadsheets. Add crawl and parse support for that.

Work-in-progress branch add-excel-parsing on master repo

There are some libraries that parse xlsx. It seemed simple to add, but at the moment it breaks on crawl: the crawled file is not a valid xlsx file. This branch contains a test (marked with only) that demonstrates the problem. All the test does is crawl a local "fake source" .xlsx file and then check that the file in the crawler-cache can be parsed in the same way as the "fake source":

$ git fetch upstream
$ git checkout -b add-excel-parsing upstream/add-excel-parsing
$ npm run test

... etc
  crawled Excel file has same parseable content as source

    sanity check of src sheets
    Sandbox Found Architect project manifest, starting up
    Created test cache /Users/jeff/Documents/Projects/li/zz-testing-fake-cache
    Created test report dir /Users/jeff/Documents/Projects/li/zz-reports-dir
    Wrote to local cache: /Users/jeff/Documents/Projects/li/zz-testing-fake-cache/excel-source/2020-08-12/2020-08-12t20_27_54.266z-default-59988.xlsx.gz
...

    x Error: End of data reached (data length = 10043, asked index = 347979759). Corrupted zip ? (fail at: undefined)

If we can get this crawl method to work, we can get Excel crawls and scrapes in general to work.
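
One way to narrow this down is to compare the cached bytes with the source bytes directly, before any xlsx parser is involved. The following is a minimal sketch, not code from the branch: the helper names (sha256, compareWithCache) and the stand-in data are hypothetical, and it assumes only that the cache entry is a gzipped copy of the crawled file, as the .xlsx.gz name in the log suggests.

```javascript
const crypto = require('crypto')
const zlib = require('zlib')

// Hex SHA-256 digest of a Buffer.
function sha256 (buffer) {
  return crypto.createHash('sha256').update(buffer).digest('hex')
}

// Hypothetical helper: gunzip a cache entry and compare it with the source bytes.
function compareWithCache (sourceBuffer, gzippedCacheBuffer) {
  const cached = zlib.gunzipSync(gzippedCacheBuffer)
  return {
    sourceLength: sourceBuffer.length,
    cachedLength: cached.length,
    identical: sha256(sourceBuffer) === sha256(cached)
  }
}

// Stand-in bytes, not a real spreadsheet: the zip signature "PK\x03\x04"
// followed by a few arbitrary binary bytes.
const fakeSource = Buffer.from([0x50, 0x4b, 0x03, 0x04, 0xff, 0x00, 0x80, 0x7f])

// A cache entry holding the exact source bytes, and one with an extra byte
// appended to simulate damage in transit.
const intactEntry = zlib.gzipSync(fakeSource)
const damagedEntry = zlib.gzipSync(Buffer.concat([fakeSource, Buffer.from([0x0a])]))

console.log(compareWithCache(fakeSource, intactEntry))
console.log(compareWithCache(fakeSource, damagedEntry))
```

With real files, both buffers would come from fs.readFileSync on the "fake source" and on the cache entry. If the lengths differ, the bytes were changed somewhere between the HTTP response and the cache write, which points at the transfer rather than at the parsing library.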

Things tried to get crawl to work

The "crawl" method (src/events/crawler/crawler) actually calls src/http/get-get-normal/index.js to get the file. I've tried:

  • setting the Content-Type in get-get-normal
  • setting content type in events/crawler/crawler/index.js (the got call)
  • a few other hacks!

Some other people ran into this trouble as well -- e.g. see SheetJS/sheetjs#337.
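
One possible cause worth ruling out (an assumption, not something confirmed in this issue) is that the binary body gets treated as text somewhere along the way. Changing the Content-Type header does not help in that case, because the bytes are already altered. The sketch below uses stand-in bytes rather than a real spreadsheet and shows that a UTF-8 round trip is lossy while a base64 round trip is not. The binaryResponse helper is hypothetical; its shape follows the AWS API Gateway proxy response format (isBase64Encoded plus a base64 body), and whether get-get-normal should return that shape is also an assumption.

```javascript
// Stand-in bytes: the zip signature followed by bytes that are not valid UTF-8.
const original = Buffer.from([0x50, 0x4b, 0x03, 0x04, 0xff, 0xfe, 0x80, 0x00, 0xc3, 0x28])

// Lossy: invalid UTF-8 sequences are replaced with U+FFFD when decoding,
// so encoding the string again does not give the original bytes back.
const viaUtf8 = Buffer.from(original.toString('utf8'), 'utf8')

// Lossless: base64 maps any byte sequence to text and back unchanged.
const viaBase64 = Buffer.from(original.toString('base64'), 'base64')

// Hypothetical response builder for a binary payload, in the API Gateway
// proxy response shape. Not taken from the project's code.
function binaryResponse (buffer, contentType) {
  return {
    statusCode: 200,
    headers: { 'content-type': contentType },
    isBase64Encoded: true,
    body: buffer.toString('base64')
  }
}

const response = binaryResponse(
  original,
  'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
)

console.log('utf8 round trip intact:  ', viaUtf8.equals(original))
console.log('base64 round trip intact:', viaBase64.equals(original))
console.log('response body decodes to original:',
  Buffer.from(response.body, 'base64').equals(original))
```

The first line reports false and the other two report true. The zip signature itself is plain ASCII and survives the UTF-8 round trip, so a damaged file can still look like a zip at the start while its internal offsets no longer match, which would be consistent with a "Corrupted zip ?" style of error.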

A minimal repo

... demonstrating this is at https://github.com/covidatlas/arc-excel-downloading-trouble.

@jzohrab added the enhancement and help wanted labels Aug 12, 2020
@jzohrab
Contributor Author

jzohrab commented Aug 13, 2020

Done and merged!

@jzohrab closed this as completed Aug 13, 2020