Problems with the Metadata Collector #171

Open
troyraen opened this issue Sep 26, 2022 · 0 comments
troyraen commented Sep 26, 2022

Motivated by:
- #90
- #108
- #128
- #182

#172 proposes a good solution.

This issue details problems that result from the Metadata Collector's fundamental design. In short: it's slow, it's expensive, and it's prone to catastrophic failure. These problems will become significant at LSST scale if left unaddressed (tagging #96).

The module's basic design

This is a Python module running on a dedicated VM. Its basic workflow is to run a bulk collection process once each morning: it pulls every message the broker produced the night before, extracts the metadata, and stores it in BigQuery.

The module "owns" the following resources: a VM, a BigQuery table, Pub/Sub subscriptions on every topic the broker produces.

Problems and Details

It's slow.

  • The VM takes about 1.5 hrs to process a 500,000-ZTF-alert night. (Recall that the broker produces N messages for every ZTF alert, where N is the number of modules in the broker.)
  • Runtime does seem to scale better than linearly, but with an alert volume roughly 20 times larger (a la LSST), I think this would essentially have to run 24 hrs a day to keep up. Even if it turns out to be 16 hrs or so, that is clearly not efficient.
  • At a minimum, the collector needs to be parallelized. But I don't think that's "the" solution, because it does nothing to address the other problems that stem from the basic design.
  • It would be better to process the streams in real time, though that would also necessitate a redesign (a minimal sketch of what that could look like follows this list).
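For illustration only, here is what real-time collection could look like using Pub/Sub streaming pull and BigQuery streaming inserts. This is not the proposal in #172, and the project, subscription, and table names are placeholders.

```python
import json

from google.cloud import bigquery, pubsub_v1

PROJECT_ID = "my-broker-project"                   # placeholder
SUBSCRIPTION_ID = "alerts-meta"                    # placeholder
TABLE_ID = "my-broker-project.metadata.streaming"  # placeholder

bq_client = bigquery.Client()
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # Extract the metadata and stream one row straight into BigQuery.
    row = json.loads(message.data)
    errors = bq_client.insert_rows_json(TABLE_ID, [row])
    if errors:
        message.nack()  # let Pub/Sub redeliver rather than drop the row
    else:
        message.ack()

# Streaming pull: messages are processed as the broker publishes them, so
# nothing accumulates in memory and a crash loses at most the in-flight batch.
streaming_pull_future = subscriber.subscribe(sub_path, callback=callback)
with subscriber:
    try:
        streaming_pull_future.result()
    except KeyboardInterrupt:
        streaming_pull_future.cancel()
        streaming_pull_future.result()
```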

It's expensive.

This is mostly because it requires dedicated subscriptions on every Pub/Sub topic we produce.

It's prone to catastrophic failure.

  • Once it has collected the metadata from a message, it must hold that metadata in memory while it collects every other message the broker produced the night before.
  • After it has them all, it creates a separate table (pandas DataFrame) for each subscription, then joins the tables on (objectId, sourceId) to create a new table containing one row for every ZTF alert. Only then does it load that table to BigQuery (see the sketch after this list).
  • If anything goes wrong at any point before the data reaches BigQuery, the program crashes and all the data it was holding is lost, permanently. (Yes, we could periodically write the data to disk during the collection, but I don't think that addresses the underlying problem.)
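To make that single point of failure concrete, here is a hedged sketch of the join-and-load step; the table ID is a placeholder and this is not the module's actual code. The (objectId, sourceId) join keys are the ones described above.

```python
from typing import Dict

import pandas as pd
from google.cloud import bigquery

def join_and_load(frames: Dict[str, pd.DataFrame], table_id: str) -> None:
    """Join the per-subscription tables into one row per ZTF alert, then load it."""
    night_df = None
    for df in frames.values():
        # Everything stays in memory; a crash during the merge loses the whole night.
        night_df = df if night_df is None else night_df.merge(
            df, on=["objectId", "sourceId"]
        )
    # The only point where data leaves memory. If this call (or anything before
    # it) raises, the collected metadata is gone.
    bigquery.Client().load_table_from_dataframe(night_df, table_id).result()
```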