Make the S3Aggregator into a ParquetAggregator #618

Closed
vringar opened this issue Apr 20, 2020 · 5 comments

Comments

@vringar (Contributor) commented Apr 20, 2020

S3 is hard to mock, and PyFilesystem in conjunction with s3fs would let us set everything up so that our tests write to tmpfs instead of S3.
See the discussion in #613.
@birdsarah, is this more what you had in mind?
Or should we do this using dask, as suggested in #427?
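A minimal sketch of the test setup described above, assuming the aggregator is refactored to accept a PyFilesystem URL; the bucket name and paths are made up, and the `s3://` opener comes from the fs-s3fs plugin:

```python
# Sketch only: assumes the aggregator takes a PyFilesystem URL, so tests can
# swap S3 for an in-memory or temporary filesystem. Bucket name and paths are
# illustrative.
from fs import open_fs
from fs.path import dirname


def write_blob(fs_url: str, path: str, data: bytes) -> None:
    """Write aggregated output to whatever filesystem the URL points at."""
    with open_fs(fs_url) as dest:
        parent = dirname(path)
        if parent:
            dest.makedirs(parent, recreate=True)
        dest.writebytes(path, data)  # older PyFilesystem2 releases call this setbytes


# Production: "s3://my-crawl-bucket/" (opener provided by the fs-s3fs plugin).
# Tests: "mem://" or "temp://" -- no network access or S3 credentials needed.
write_blob("mem://", "visits/part-0000.parquet", b"...parquet bytes...")
```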

@birdsarah (Contributor) commented Apr 20, 2020

dask is backed by s3fs and provides a pandas-like API.

There could be some advantages to using it, but they seem minor; I don't see any reason to use it given the way our current distributed architecture works.

I did just have an idea though. We already have the local aggregator. What if the parquet aggregator is just pandas.read_sql followed by pandas.to_parquet, with the content data parsed appropriately as well?

Then we just s3.put the content pieces and the parquet pieces.
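A hedged sketch of that flow, assuming the crawl data lives in the existing SQLite database; the database filename, table name, bucket, and key prefix below are placeholders:

```python
# Sketch only: database filename, table name, bucket, and key prefix are placeholders.
import sqlite3

import boto3
import pandas as pd


def export_table(db_path: str, table: str, bucket: str, key_prefix: str) -> None:
    """Read one crawl table out of SQLite and push it to S3 as Parquet."""
    with sqlite3.connect(db_path) as conn:
        df = pd.read_sql(f"SELECT * FROM {table}", conn)

    local_path = f"/tmp/{table}.parquet"
    df.to_parquet(local_path, index=False)  # needs pyarrow or fastparquet installed

    boto3.client("s3").upload_file(local_path, bucket, f"{key_prefix}/{table}.parquet")


export_table("crawl-data.sqlite3", "site_visits", "my-crawl-bucket", "my-crawl/visits")
```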

@birdsarah (Contributor)

I did just have an idea though. We already have the sqlite aggregator. What if the parquet aggregator is just pandas.read_sql followed by pandas.to_parquet?

Aside from simplicity, one of the advantages I can see to this is that we only need to maintain a SQL schema; the parquet schema would just follow from it.

The only time I can then imagine wanting to break from the sql aggregator is when we want to implement streaming collection and push to BigQuery.

That said, the local aggregator is essentially a record processor. I can imagine generalizing it into a record processor rather than a local aggregator, and then pushing records to external services as appropriate.
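One way to picture that generalization, purely as a hypothetical interface; none of these class or method names exist in the codebase:

```python
# Hypothetical sketch of a record-processor interface; names are made up.
from abc import ABC, abstractmethod
from typing import Any, Dict


class RecordProcessor(ABC):
    """Receives structured records from the crawl and decides where they go."""

    @abstractmethod
    def process_record(self, table: str, record: Dict[str, Any]) -> None:
        """Handle a single record destined for the given table."""

    @abstractmethod
    def flush(self) -> None:
        """Push anything buffered to the backing store (SQLite, Parquet/S3, BigQuery, ...)."""


class SqliteProcessor(RecordProcessor):
    """Roughly what the local aggregator does today: write straight to SQLite."""

    def process_record(self, table: str, record: Dict[str, Any]) -> None:
        ...  # INSERT INTO <table> (...) VALUES (...)

    def flush(self) -> None:
        pass  # nothing buffered
```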

@birdsarah (Contributor)

One final thought: given that we're already heavily multi-process, if we don't want to process all the data at the end of the crawl, we can run the read-and-push step in a separate process that periodically checks for n records in the sqlite database, gathers them, pushes them, and then removes them from the sqlite database, so the sqlite database is left with only the records that haven't been sent yet.

n could be a configuration option, allowing users to tune their batch size depending on their crawl setup.
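A rough sketch of that drain process, with a placeholder push function and made-up database and table names; in practice it would run in its own multiprocessing.Process alongside the crawl:

```python
# Sketch only: push_to_s3, the database filename, and the table name are
# placeholders; `n` corresponds to the proposed batch-size configuration option.
import sqlite3
import time


def drain_loop(db_path: str, table: str, n: int, push_to_s3, poll_seconds: int = 30) -> None:
    """Periodically move batches of n rows out of SQLite once they have accumulated."""
    conn = sqlite3.connect(db_path)
    while True:
        rows = conn.execute(
            f"SELECT rowid, * FROM {table} ORDER BY rowid LIMIT ?", (n,)
        ).fetchall()
        if len(rows) == n:
            push_to_s3(rows)  # delete only after the push succeeds
            conn.executemany(
                f"DELETE FROM {table} WHERE rowid = ?", [(r[0],) for r in rows]
            )
            conn.commit()
        time.sleep(poll_seconds)


# e.g. multiprocessing.Process(
#          target=drain_loop,
#          args=("crawl-data.sqlite3", "site_visits", 1000, push_to_s3)).start()
```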

@birdsarah (Contributor)

Given that, by my logic, we don't need to test pandas. All we'd need to test is the logic of grabbing data from the sql database and deleting records after a successful push. The worst bit of this will be testing the unexpected shutdown conditions.

As I'm writing this, I'm seeing that the challenge of this idea is marking the "done" states. @vringar, if you wanted to brainstorm this a little, I'd be happy to chat.
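As a rough illustration of that testing concern: drain_batch below is a stand-in for whatever helper ends up grabbing and deleting rows, not existing code, and the table name is made up. The point is that records must survive a failed push so they can be retried:

```python
# Sketch only: drain_batch is a placeholder for the real grab-and-delete helper.
import sqlite3

import pytest


def drain_batch(conn, table, n, push):
    rows = conn.execute(f"SELECT rowid, * FROM {table} LIMIT ?", (n,)).fetchall()
    push(rows)  # if this raises, we never reach the DELETE below
    conn.executemany(f"DELETE FROM {table} WHERE rowid = ?", [(r[0],) for r in rows])
    conn.commit()


def test_records_survive_failed_push():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE site_visits (url TEXT)")
    conn.executemany("INSERT INTO site_visits VALUES (?)", [("a",), ("b",), ("c",)])

    def failing_push(rows):
        raise RuntimeError("simulated S3 outage")

    with pytest.raises(RuntimeError):
        drain_batch(conn, "site_visits", 2, failing_push)

    # Nothing was deleted, so the crawl can retry the push later.
    assert conn.execute("SELECT COUNT(*) FROM site_visits").fetchone()[0] == 3
```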

@birdsarah (Contributor)

Closing as a duplicate of #652.
