Is your feature request related to a problem? Please describe.
Currently all data is in one `cases` collection. This causes issues with operations such as prune, which must take a write lock on the collection and update millions of cases by setting flags. Operations such as export also slow down as the collection grows.
Describe the solution you'd like
Split the `cases` collection by sourceId. Parallel ingestions should then be faster, as no two simultaneous ingestions will operate on the same collection. Operations such as export also become simpler, especially if we change the export unit to be per source rather than per country. Prune need not be a separate, time-consuming operation: we can `.renameCollection()` the current collection to `collection-old` and replace it with the staging collection for that source (see the sketch below). As `renameCollection()` does not copy any data (it only changes metadata), this should be much faster. The benefit is that housekeeping operations related to ingestion (export, prune) can be done as part of the ingestion process, or at the database level via triggers on collections, as suggested by @jim-sheldon. Making collections smaller will make these database operations faster as well.
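To make the swap concrete, here is a minimal sketch of prune-by-rename in Python with pymongo. The connection string, database name, and the `cases-<sourceId>` naming scheme are assumptions for illustration, not the project's actual conventions:

```python
from pymongo import MongoClient

# Hypothetical connection and database name, for illustration only.
client = MongoClient("mongodb://localhost:27017")
db = client["cases_db"]

def swap_in_staging(source_id: str) -> None:
    """Replace the live per-source collection with its staging collection.

    renameCollection only rewrites catalog metadata, so both renames are
    fast regardless of how many cases the collections hold.
    """
    live = f"cases-{source_id}"            # assumed per-source naming
    staging = f"cases-{source_id}-staging" # assumed staging naming
    old = f"{live}-old"

    # Keep the previous data around until the swap has been verified.
    db[live].rename(old, dropTarget=True)
    db[staging].rename(live)
    # After verification/export, the old collection can be dropped cheaply:
    # db[old].drop()
```

Dropping `collection-old` only after the new data is verified gives a cheap rollback path: a second rename restores the previous state.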
Describe alternatives you've considered
Keep the status quo. It mostly works, though we would expect the scaling issues to worsen if the case count grows to 2-3x its current level (~100m).
Important to consider the trade-offs on searching: e.g. if I want all Brazil data, or all cases with a particular symptom, that query has to search across multiple collections. What's the performance impact, both in time and in memory, wherever we end up merging the results?
@iamleeg Most of the performance implications would come from sorting in the UI and/or in the API, as the rest is parallelizable (assuming MongoDB allows parallel reads across collections).
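If we do split, one way to bound that merge cost is to let the server do it: MongoDB's `$unionWith` aggregation stage (4.4+) unions collections inside a single pipeline, so the API process never holds partial result sets. A sketch in Python/pymongo, where the `cases-<sourceId>` naming and the sort field are assumptions:

```python
def find_across_sources(db, source_ids, criteria):
    """Run one filter across several per-source collections and let the
    server merge the results via $unionWith (MongoDB 4.4+)."""
    first, *rest = source_ids
    pipeline = [{"$match": criteria}]
    for sid in rest:
        pipeline.append({
            "$unionWith": {
                "coll": f"cases-{sid}",            # assumed naming scheme
                "pipeline": [{"$match": criteria}],
            }
        })
    # Sorting after the unions keeps the merge server-side; allowDiskUse
    # lets large sorts spill to disk instead of hitting memory limits.
    pipeline.append({"$sort": {"confirmationDate": -1}})  # hypothetical field
    return db[f"cases-{first}"].aggregate(pipeline, allowDiskUse=True)
```

The sort after the unions is where the real cost lands, which matches the point above about sorting dominating: the per-collection `$match` stages can run independently, but the final ordered merge is inherently serial.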