This issue details problems that result from the Metadata Collector's fundamental design. In short: it's slow, it's expensive, and it's prone to catastrophic failure. These problems will become significant at LSST scale if left unaddressed (tagging #96).
The module's basic design
This is a Python module running on a dedicated VM. Its basic workflow is to run a bulk collection process once each morning. It pulls every message produced by the broker the night before, extracts the metadata, and stores it in BigQuery.
The module "owns" the following resources: a VM, a BigQuery table, and Pub/Sub subscriptions on every topic the broker produces.
Problems and Details
It's slow.
The VM takes about 1.5 hrs to process a 500,000-ZTF-alert night. (Recall that the broker produces N messages for every ZTF alert, where N is the number of modules in the broker).
This does seem to scale better than linearly, but with an alert volume roughly 20 times larger (à la LSST), I think this would basically have to run 24 hours a day to keep up. Even if it ends up being 16 hours or so, that is certainly not efficient.
At a minimum, the collector needs to be parallelized. But I don't think that's "the" solution, because it does nothing to address the other problems that stem from the basic design. It would be better to process the streams in real time, though that would necessitate a redesign; a sketch of that alternative follows.
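For comparison, here is a rough sketch of what real-time collection could look like: a streaming-pull callback that writes each row to BigQuery as it arrives, so nothing is ever held only in memory. The table ID, subscription name, and `extract_metadata` helper are placeholders, and this skips the nightly join entirely (rows could instead be joined downstream in SQL).

```python
from google.cloud import bigquery, pubsub_v1

bq = bigquery.Client()
TABLE_ID = "my-project.my_dataset.metadata"  # placeholder

def extract_metadata(payload: bytes) -> dict:
    """Placeholder: parse the message payload and keep only the metadata fields."""
    raise NotImplementedError

def callback(message) -> None:
    """Handle one Pub/Sub message: extract metadata and stream it to BigQuery."""
    row = extract_metadata(message.data)
    # Streaming insert: the row is durable in BigQuery before we ack,
    # so a crash loses nothing -- Pub/Sub just redelivers unacked messages.
    errors = bq.insert_rows_json(TABLE_ID, [row])
    if errors:
        message.nack()
    else:
        message.ack()

subscriber = pubsub_v1.SubscriberClient()
streaming_pull = subscriber.subscribe(
    "projects/my-project/subscriptions/alerts-metadata",  # placeholder
    callback=callback,
)
streaming_pull.result()  # blocks; messages are handled on background threads
```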
It's expensive.
This is mostly because it requires dedicated subscriptions on every Pub/Sub topic we produce.
It's prone to catastrophic failure.
Once it collects the metadata of a message, it has to hold it in memory while it collects every other message that the broker produced the night before.
After it has them all, it creates a separate table (a pandas DataFrame) for each subscription, then joins the tables on (objectId, sourceId) to create a new table containing one row for every ZTF alert. Then it loads that table to BigQuery.
If anything goes wrong at any point before the data gets to BigQuery, the program crashes and all the data it was holding is lost -- permanently. (Yes, we could write the data to disk occasionally during the collection, but I think that's not really a solution to the underlying problem.)
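To make this concrete, here is roughly what the join-and-load step looks like (a sketch with invented names, dummy rows, and a guessed join type); everything upstream of the final load call exists only in memory on the VM.

```python
import pandas as pd
from google.cloud import bigquery

# Dummy stand-in for the night's pulled metadata: {subscription: [row dicts]}.
nightly = {
    "alerts": [{"objectId": "ZTF21aaaaaaa", "sourceId": 1, "ra": 10.0}],
    "classifier": [{"objectId": "ZTF21aaaaaaa", "sourceId": 1, "score": 0.9}],
}

# One DataFrame per subscription, then a join -> one row per ZTF alert.
frames = [pd.DataFrame(rows) for rows in nightly.values()]
merged = frames[0]
for df in frames[1:]:
    merged = merged.merge(df, on=["objectId", "sourceId"], how="outer")

# Only at this point does anything leave the VM. A crash anywhere above this
# line loses the entire night's metadata.
client = bigquery.Client()
job = client.load_table_from_dataframe(merged, "my-project.my_dataset.metadata")  # placeholder
job.result()
```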
Motivated by:
- #90
- #108
- #128
- #182
#172 proposes a good solution.