Problems with the Metadata Collector #171

Open
troyraen opened this issue Sep 26, 2022 · 0 comments
troyraen commented Sep 26, 2022

Motivated by:
- #90
- #108
- #128
- #182

#172 proposes a good solution.

This issue details problems that result from the Metadata Collector's fundamental design. In short: it's slow, it's expensive, and it's prone to catastrophic failure. These problems will become significant at LSST scale if left unaddressed (tagging #96).

The module's basic design

This is a Python module running on a dedicated VM. Its basic workflow is to run a bulk collection process once each morning: it pulls every message the broker produced the night before, extracts the metadata, and stores it in BigQuery.

The module "owns" the following resources: a VM, a BigQuery table, Pub/Sub subscriptions on every topic the broker produces.

Problems and Details

It's slow.

  • The VM takes about 1.5 hrs to process a 500,000-ZTF-alert night. (Recall that the broker produces N messages for every ZTF alert, where N is the number of modules in the broker.)
  • Runtime does seem to scale better than linearly, but with an alert volume roughly 20 times larger (a la LSST), I think this would essentially have to run 24 hrs a day to keep up. Even if it turns out to be 16 hrs or so, that is clearly not efficient.
  • At a minimum, the collector needs to be parallelized. But I don't think that's "the" solution, because it does nothing to address the other problems that stem from the basic design.
  • It would be better to process the streams in real time, though that would also necessitate a redesign (a minimal sketch of what that could look like follows this list).
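For illustration only, here is what real-time collection could look like using Pub/Sub streaming pull and BigQuery streaming inserts. This is not the proposal in #172, and the project, subscription, and table names are placeholders.

```python
import json

from google.cloud import bigquery, pubsub_v1

PROJECT_ID = "my-broker-project"                   # placeholder
SUBSCRIPTION_ID = "alerts-meta"                    # placeholder
TABLE_ID = "my-broker-project.metadata.streaming"  # placeholder

bq_client = bigquery.Client()
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # Extract the metadata and stream one row straight into BigQuery.
    row = json.loads(message.data)
    errors = bq_client.insert_rows_json(TABLE_ID, [row])
    if errors:
        message.nack()  # let Pub/Sub redeliver rather than drop the row
    else:
        message.ack()

# Streaming pull: messages are processed as the broker publishes them, so
# nothing accumulates in memory and a crash loses at most the in-flight batch.
streaming_pull_future = subscriber.subscribe(sub_path, callback=callback)
with subscriber:
    try:
        streaming_pull_future.result()
    except KeyboardInterrupt:
        streaming_pull_future.cancel()
        streaming_pull_future.result()
```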

It's expensive.

This is mostly because it requires dedicated subscriptions on every Pub/Sub topic we produce.

It's prone to catastrophic failure.

  • Once it has collected the metadata from a message, it must hold that metadata in memory while it collects every other message the broker produced the night before.
  • After it has them all, it creates a separate table (pandas DataFrame) for each subscription, then joins the tables on (objectId, sourceId) to create a new table containing one row for every ZTF alert. Only then does it load that table to BigQuery (see the sketch after this list).
  • If anything goes wrong at any point before the data reaches BigQuery, the program crashes and all the data it was holding is lost, permanently. (Yes, we could periodically write the data to disk during the collection, but I don't think that addresses the underlying problem.)
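To make that single point of failure concrete, here is a hedged sketch of the join-and-load step; the table ID is a placeholder and this is not the module's actual code. The (objectId, sourceId) join keys are the ones described above.

```python
from typing import Dict

import pandas as pd
from google.cloud import bigquery

def join_and_load(frames: Dict[str, pd.DataFrame], table_id: str) -> None:
    """Join the per-subscription tables into one row per ZTF alert, then load it."""
    night_df = None
    for df in frames.values():
        # Everything stays in memory; a crash during the merge loses the whole night.
        night_df = df if night_df is None else night_df.merge(
            df, on=["objectId", "sourceId"]
        )
    # The only point where data leaves memory. If this call (or anything before
    # it) raises, the collected metadata is gone.
    bigquery.Client().load_table_from_dataframe(night_df, table_id).result()
```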