Make the S3Aggregator into a ParquetAggregator #618
dask is backed by s3fs and provides a pandas-like API. There could be some advantages to using it, but they seem minor, and I don't see any reason to use it with the way our current distributed architecture works. I did just have an idea though. We already have the local aggregator. What if the parquet aggregator is just the local aggregator plus a conversion step? Then we just s3.put the content pieces and the parquet pieces.
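To make that concrete, here is a minimal sketch of what such a conversion-and-push step could look like. It assumes the local aggregator has already written a SQLite file; the table, bucket, and key names are hypothetical, and pandas/pyarrow/boto3 are just one possible stack for the conversion and the s3.put call.

```python
# Minimal sketch, not actual aggregator code. Assumes the local aggregator has
# produced a SQLite file; table/bucket/key names are hypothetical.
import io
import sqlite3

import boto3
import pandas as pd  # parquet writing goes through pyarrow under the hood


def push_table_as_parquet(db_path: str, table: str, bucket: str, key: str) -> None:
    """Read one table from the local aggregator's SQLite file and upload it as parquet."""
    with sqlite3.connect(db_path) as con:
        df = pd.read_sql_query(f"SELECT * FROM {table}", con)

    buf = io.BytesIO()
    df.to_parquet(buf, index=False)  # the parquet schema just follows the SQL schema
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf.getvalue())
```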
Aside from simplicity, one of the advantages I can see to this is that we only need to maintain a SQL schema; the parquet schema would just follow from it. The only time I can then imagine wanting to break from the SQL aggregator is when we want to implement streaming collection and push to BigQuery. That said, the local aggregator is a record processor; I can imagine extending it to be a record processor rather than a local aggregator and then pushing records to external services as appropriate.
One final thought: given that we're already heavily multi-process, if we don't want to process all the data at the end of the crawl, we can start the data read-and-push process in a separate process that periodically checks for n records in the SQLite database, gathers them, pushes them, then removes them from the SQLite database, so the SQLite database is left with just what hasn't been sent yet. n could be a configuration option allowing users to tune their batch size depending on their crawl setup.
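A rough sketch of that separate push process, assuming a hypothetical `records` table whose first column is an integer `id` primary key; `push` stands in for whatever parquet/S3 upload we settle on, and `n` is the configurable batch size:

```python
# Sketch only. Pushes happen before the local delete, so an unexpected shutdown
# can at worst re-send a batch rather than lose it.
import sqlite3
import time


def batch_push_loop(db_path: str, n: int, push, poll_interval: float = 30.0) -> None:
    con = sqlite3.connect(db_path)
    while True:
        rows = con.execute(
            "SELECT * FROM records ORDER BY id LIMIT ?", (n,)
        ).fetchall()
        if len(rows) >= n:
            push(rows)  # upload the batch (e.g. as a parquet piece) first
            ids = [(row[0],) for row in rows]  # assumes id is the first column
            with con:  # commit the deletes as one transaction
                con.executemany("DELETE FROM records WHERE id = ?", ids)
        else:
            time.sleep(poll_interval)
```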
Given that, by my logic, we don't need to test pandas. All we'd need to test is the logic of grabbing data from the SQL database and deleting records after the push. The worst bit of this will be testing the unexpected shutdown conditions. As I'm writing this, I'm seeing that the challenge of this idea is marking the "done" states. @vringar if you wanted to brainstorm this a little I'd be happy to chat.
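Purely as a sketch of one way that "done" marking could work: a hypothetical `status` column that is flipped before the upload, so rows interrupted mid-push are retried on restart, at the cost of possible duplicates downstream.

```python
# Illustrative only; assumes a hypothetical status column on the records table
# with values 'pending' -> 'pushing' -> (row deleted once confirmed).
import sqlite3


def push_batch_with_status(con: sqlite3.Connection, n: int, push) -> None:
    with con:  # claim a batch
        con.execute(
            "UPDATE records SET status = 'pushing' "
            "WHERE id IN (SELECT id FROM records WHERE status = 'pending' LIMIT ?)",
            (n,),
        )
    rows = con.execute("SELECT * FROM records WHERE status = 'pushing'").fetchall()
    push(rows)  # a crash here leaves rows in 'pushing'; they get re-pushed on restart
    with con:
        con.execute("DELETE FROM records WHERE status = 'pushing'")
```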
closing as dupe of #652
S3 is hard to mock, and PyFilesystem in conjunction with s3fs allows us to set up everything in a way where we don't need our tests to write out to S3 but instead to tmpfs. See the discussion in #613.
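If it helps, a minimal sketch of how that could look, assuming the aggregator is handed a PyFilesystem URL (the S3 opener comes from the fs-s3fs plugin; the bucket name and paths here are hypothetical):

```python
# Sketch of the filesystem abstraction: production points at S3, tests at a
# temporary local filesystem, and the writing code doesn't change.
from fs import open_fs


def write_result(fs_url: str, path: str, data: bytes) -> None:
    """Write a result blob to whatever filesystem the crawl is configured with."""
    with open_fs(fs_url) as filesystem:
        filesystem.writebytes(path, data)


# production (hypothetical bucket): write_result("s3://my-crawl-bucket", "abc123.gz", blob)
# tests, no network needed:         write_result("temp://", "abc123.gz", blob)
```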
@birdsarah is this more what you had in mind?
Or should we do this using dask as suggested in #427?