Make the S3Aggregator into a ParquetAggregator #618

Closed
vringar opened this issue Apr 20, 2020 · 5 comments

Comments

@vringar (Contributor) commented Apr 20, 2020

S3 is hard to mock, and PyFilesystem in conjunction with s3fs would let us set everything up so that our tests write to tmpfs instead of S3.
See the discussion in #613.
@birdsarah, is this more what you had in mind?
Or should we do this using dask, as suggested in #427?
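A minimal sketch of the test setup described above, assuming the aggregator is refactored to accept a PyFilesystem URL; the bucket name and paths are made up, and the `s3://` opener comes from the fs-s3fs plugin:

```python
# Sketch only: assumes the aggregator takes a PyFilesystem URL, so tests can
# swap S3 for an in-memory or temporary filesystem. Bucket name and paths are
# illustrative.
from fs import open_fs
from fs.path import dirname


def write_blob(fs_url: str, path: str, data: bytes) -> None:
    """Write aggregated output to whatever filesystem the URL points at."""
    with open_fs(fs_url) as dest:
        parent = dirname(path)
        if parent:
            dest.makedirs(parent, recreate=True)
        dest.writebytes(path, data)  # older PyFilesystem2 releases call this setbytes


# Production: "s3://my-crawl-bucket/" (opener provided by the fs-s3fs plugin).
# Tests: "mem://" or "temp://" -- no network access or S3 credentials needed.
write_blob("mem://", "visits/part-0000.parquet", b"...parquet bytes...")
```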

@birdsarah (Contributor) commented Apr 20, 2020

dask is backed by s3fs and provides a pandas-like API.

There could be some advantages to using it, but they seem minor; I don't see any reason to use it given the way our current distributed architecture works.

I did just have an idea though. We already have the local aggregator. What if the parquet aggregator is just pandas.read_sql followed by pandas.to_parquet, with the content data parsed appropriately as well?

Then we just s3.put the content pieces and the parquet pieces.
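A hedged sketch of that flow, assuming the crawl data lives in the existing SQLite database; the database filename, table name, bucket, and key prefix below are placeholders:

```python
# Sketch only: database filename, table name, bucket, and key prefix are placeholders.
import sqlite3

import boto3
import pandas as pd


def export_table(db_path: str, table: str, bucket: str, key_prefix: str) -> None:
    """Read one crawl table out of SQLite and push it to S3 as Parquet."""
    with sqlite3.connect(db_path) as conn:
        df = pd.read_sql(f"SELECT * FROM {table}", conn)

    local_path = f"/tmp/{table}.parquet"
    df.to_parquet(local_path, index=False)  # needs pyarrow or fastparquet installed

    boto3.client("s3").upload_file(local_path, bucket, f"{key_prefix}/{table}.parquet")


export_table("crawl-data.sqlite3", "site_visits", "my-crawl-bucket", "my-crawl/visits")
```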

@birdsarah (Contributor)

I did just have an idea though. We already have the sqlite aggregator. What if the parquet aggregator is just pandas.read_sql followed by pandas.to_parquet?

Aside from simplicity, one of the advantages I can see to this is that we only need to maintain a SQL schema; the parquet schema would just follow from it.

The only time I can then imagine wanting to break from the sql aggregator is when we want to implement streaming collection and push to BigQuery.

That said, the local aggregator is essentially a record processor. I can imagine generalizing it into a record processor rather than a local aggregator, and then pushing records to external services as appropriate.
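One way to picture that generalization, purely as a hypothetical interface; none of these class or method names exist in the codebase:

```python
# Hypothetical sketch of a record-processor interface; names are made up.
from abc import ABC, abstractmethod
from typing import Any, Dict


class RecordProcessor(ABC):
    """Receives structured records from the crawl and decides where they go."""

    @abstractmethod
    def process_record(self, table: str, record: Dict[str, Any]) -> None:
        """Handle a single record destined for the given table."""

    @abstractmethod
    def flush(self) -> None:
        """Push anything buffered to the backing store (SQLite, Parquet/S3, BigQuery, ...)."""


class SqliteProcessor(RecordProcessor):
    """Roughly what the local aggregator does today: write straight to SQLite."""

    def process_record(self, table: str, record: Dict[str, Any]) -> None:
        ...  # INSERT INTO <table> (...) VALUES (...)

    def flush(self) -> None:
        pass  # nothing buffered
```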

@birdsarah (Contributor)

One final thought: given that we're already heavily multi-process, if we don't want to process all the data at the end of the crawl, we can run the read-and-push step in a separate process that periodically checks for n records in the sqlite database, gathers them, pushes them, and then removes them from the sqlite database, so the sqlite database is left with only the records that haven't been sent yet.

n could be a configuration option, allowing users to tune their batch size depending on their crawl setup.
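A rough sketch of that drain process, with a placeholder push function and made-up database and table names; in practice it would run in its own multiprocessing.Process alongside the crawl:

```python
# Sketch only: push_to_s3, the database filename, and the table name are
# placeholders; `n` corresponds to the proposed batch-size configuration option.
import sqlite3
import time


def drain_loop(db_path: str, table: str, n: int, push_to_s3, poll_seconds: int = 30) -> None:
    """Periodically move batches of n rows out of SQLite once they have accumulated."""
    conn = sqlite3.connect(db_path)
    while True:
        rows = conn.execute(
            f"SELECT rowid, * FROM {table} ORDER BY rowid LIMIT ?", (n,)
        ).fetchall()
        if len(rows) == n:
            push_to_s3(rows)  # delete only after the push succeeds
            conn.executemany(
                f"DELETE FROM {table} WHERE rowid = ?", [(r[0],) for r in rows]
            )
            conn.commit()
        time.sleep(poll_seconds)


# e.g. multiprocessing.Process(
#          target=drain_loop,
#          args=("crawl-data.sqlite3", "site_visits", 1000, push_to_s3)).start()
```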

@birdsarah (Contributor)

Given that, by my logic, we don't need to test pandas. All we'd need to test is the logic of grabbing data from the sql database and deleting records after a successful push. The worst bit of this will be testing the unexpected shutdown conditions.

As I'm writing this, I'm seeing that the challenge of this idea is marking the "done" states. @vringar, if you wanted to brainstorm this a little, I'd be happy to chat.
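As a rough illustration of that testing concern: drain_batch below is a stand-in for whatever helper ends up grabbing and deleting rows, not existing code, and the table name is made up. The point is that records must survive a failed push so they can be retried:

```python
# Sketch only: drain_batch is a placeholder for the real grab-and-delete helper.
import sqlite3

import pytest


def drain_batch(conn, table, n, push):
    rows = conn.execute(f"SELECT rowid, * FROM {table} LIMIT ?", (n,)).fetchall()
    push(rows)  # if this raises, we never reach the DELETE below
    conn.executemany(f"DELETE FROM {table} WHERE rowid = ?", [(r[0],) for r in rows])
    conn.commit()


def test_records_survive_failed_push():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE site_visits (url TEXT)")
    conn.executemany("INSERT INTO site_visits VALUES (?)", [("a",), ("b",), ("c",)])

    def failing_push(rows):
        raise RuntimeError("simulated S3 outage")

    with pytest.raises(RuntimeError):
        drain_batch(conn, "site_visits", 2, failing_push)

    # Nothing was deleted, so the crawl can retry the push later.
    assert conn.execute("SELECT COUNT(*) FROM site_visits").fetchone()[0] == 3
```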

@birdsarah (Contributor)

Closing as a duplicate of #652.
