Is your feature request related to a problem? Please describe.
Currently all data is in one `cases` collection. This causes issues with operations such as prune, which must take a write lock on the collection and update millions of cases by setting flags. Operations such as export also slow down as the collection grows.
Describe the solution you'd like
Split the `cases` collection by sourceId. Parallel ingestions should then be faster, as no two simultaneous ingestions will operate on the same collection. Operations such as export also become simpler, especially if we change the export unit to be per source rather than per country. Prune need not be a separate, time-consuming operation: we can `.renameCollection()` the current collection to `collection-old` and replace it with the staging collection for that source (see the sketch below). As `renameCollection()` does not copy any data (it only changes metadata), this should be much faster. The benefit is that housekeeping operations related to ingestion (export, prune) can be done as part of the ingestion process, or at the database level via triggers on collections, as suggested by @jim-sheldon. Making collections smaller will make these database operations faster as well.
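To make the swap concrete, here is a minimal sketch of prune-by-rename in Python with pymongo. The connection string, database name, and the `cases-<sourceId>` naming scheme are assumptions for illustration, not the project's actual conventions:

```python
from pymongo import MongoClient

# Hypothetical connection and database name, for illustration only.
client = MongoClient("mongodb://localhost:27017")
db = client["cases_db"]

def swap_in_staging(source_id: str) -> None:
    """Replace the live per-source collection with its staging collection.

    renameCollection only rewrites catalog metadata, so both renames are
    fast regardless of how many cases the collections hold.
    """
    live = f"cases-{source_id}"            # assumed per-source naming
    staging = f"cases-{source_id}-staging" # assumed staging naming
    old = f"{live}-old"

    # Keep the previous data around until the swap has been verified.
    db[live].rename(old, dropTarget=True)
    db[staging].rename(live)
    # After verification/export, the old collection can be dropped cheaply:
    # db[old].drop()
```

Dropping `collection-old` only after the new data is verified gives a cheap rollback path: a second rename restores the previous state.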
Describe alternatives you've considered
Keep the status quo. It mostly works, though we would expect the scaling issues to worsen if the case count grows to 2-3x its current level (~100m).
Important to consider the trade-offs on searching: e.g. if I want all Brazil data, or all cases with a particular symptom, that query has to search across multiple collections. What's the performance impact, both in time and in memory, wherever we end up merging the results?
@iamleeg Most of the performance implications would come from sorting in the UI and/or in the API, as the rest is parallelizable (assuming MongoDB allows parallel reads across collections).
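If we do split, one way to bound that merge cost is to let the server do it: MongoDB's `$unionWith` aggregation stage (4.4+) unions collections inside a single pipeline, so the API process never holds partial result sets. A sketch in Python/pymongo, where the `cases-<sourceId>` naming and the sort field are assumptions:

```python
def find_across_sources(db, source_ids, criteria):
    """Run one filter across several per-source collections and let the
    server merge the results via $unionWith (MongoDB 4.4+)."""
    first, *rest = source_ids
    pipeline = [{"$match": criteria}]
    for sid in rest:
        pipeline.append({
            "$unionWith": {
                "coll": f"cases-{sid}",            # assumed naming scheme
                "pipeline": [{"$match": criteria}],
            }
        })
    # Sorting after the unions keeps the merge server-side; allowDiskUse
    # lets large sorts spill to disk instead of hitting memory limits.
    pipeline.append({"$sort": {"confirmationDate": -1}})  # hypothetical field
    return db[f"cases-{first}"].aggregate(pipeline, allowDiskUse=True)
```

The sort after the unions is where the real cost lands, which matches the point above about sorting dominating: the per-collection `$match` stages can run independently, but the final ordered merge is inherently serial.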