Replace the Metadata Collector with BigQuery subscriptions #172

Open · troyraen opened this issue Sep 26, 2022 · 2 comments


troyraen commented Sep 26, 2022

Overview

This issue proposes an entirely new workflow for the broker's metadata collection process. It also documents our reasons for collecting specific metadata to begin with. #171 details fundamental problems with the current system design.

The new workflow can potentially solve many problems at once, at least two of which will be significant barriers to processing at LSST scale if left unaddressed. It is based on a new Pub/Sub feature: BigQuery subscriptions.

Here is an architecture diagram for reference; it is especially useful alongside the section below that outlines the new workflow.

Motivation for a metadata collector

Metadata collection enables the following project-level goals:

  • Track data provenance. Specifically, the broker and classifier versions that processed each alert.
  • Track pipeline performance. Specifically: module-level processing times, delay times, dropped alerts, duplicated alerts, etc.

Outline of new workflow

BigQuery subscriptions are a new Pub/Sub feature that essentially lets you point a subscription directly at a BigQuery table. The feature includes two options that make the new workflow possible: 1) write both the message data and the message metadata to the table, and 2) automatically drop fields that are present in the message but not in the table's schema.
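
As a concrete sketch, attaching such a subscription with the google-cloud-pubsub Python client might look roughly like the following. The project, topic, subscription, and table names are placeholders, not our actual resources; use_topic_schema is set because this workflow also attaches an Avro schema to each topic.

```python
from google.cloud import pubsub_v1

# Placeholder resource names -- substitute the broker's actual project/topic/table.
project_id = "my-project"
topic_id = "classifier-output"
subscription_id = "classifier-output-to-bigquery"
table_id = "my-project.my_dataset.classifier_results"

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path(project_id, topic_id)
subscription_path = subscriber.subscription_path(project_id, subscription_id)

bigquery_config = pubsub_v1.types.BigQueryConfig(
    table=table_id,
    use_topic_schema=True,     # map fields from the topic's Avro schema onto table columns
    write_metadata=True,       # option 1: store Pub/Sub message metadata alongside the data
    drop_unknown_fields=True,  # option 2: drop message fields that are not in the table schema
)

with subscriber:
    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "bigquery_config": bigquery_config,
        }
    )
```

With write_metadata enabled, Pub/Sub fills the table's subscription_name, message_id, publish_time, and attributes columns alongside the message data, which is what gives us the per-message metadata for free.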

Basic envisioned workflow:

  1. Each module will publish one output stream (i.e., topic) that contains:
    1. the results of its processing,
    2. relevant metadata, and
    3. (usually) a copy of the incoming data/alert.
  2. Each module's topic will have a BigQuery subscription attached. The subscription will automatically send the message to a BigQuery table, dropping any data/metadata that is not in the table's schema.
    • Things like the original alert data can be included in every message, but only stored in a single table.
    • Similar modules (like classifiers) can all push results to the same table.
    • Modules will no longer load to BigQuery tables directly. (Those API calls can then be removed, which would completely eliminate this problem.)
  3. As a result of the above, a given module's metadata will be stored in the same table as its results. Thus, tracking the journey of a single alert through our pipeline will involve table joins. We may want to create a materialized view to make such queries easier.
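
To illustrate point 3, a join that traces one alert from the alerts table to a classifier's results table might look roughly like this (the dataset, table, and column names are hypothetical, not our actual schema):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical join: compute the delay between the alert landing in the alerts
# table and the classifier publishing its result for the same candidate.
query = """
SELECT
  cls.objectId,
  cls.classification,
  TIMESTAMP_DIFF(cls.publish_time, alerts.publish_time, MILLISECOND) AS pipeline_delay_ms
FROM `my-project.my_dataset.classifier_results` AS cls
JOIN `my-project.my_dataset.alerts` AS alerts
  USING (candid)
WHERE cls.objectId = @object_id
"""

job = client.query(
    query,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("object_id", "STRING", "ZTF22aaaaaaa")
        ]
    ),
)
for row in job.result():
    print(dict(row))
```

A materialized view over this kind of join would let us run the per-alert provenance query without rewriting it every time.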

Specific tasks that would be required

  • Design, create, and attach an Avro schema to every topic, and update every module to format its outgoing messages using the schema attached to its topic (see the sketch after this list).
  • Attach and configure a BigQuery subscription to each topic.
  • Remove the BigQuery API calls from every module that currently sends data there.
  • Review the tables we're currently keeping in BigQuery and decide if we want to restructure.
  • Drop the BigQuery Storage module. The full alert packet (minus cutouts) can go directly from the alerts stream to the alerts table. The DIASource table can either be a view on the alerts table, or the Cloud Storage module can produce the stream that populates it.
  • Drop the Metadata Collector module.
  • Consider creating a materialized view that joins the metadata from each table.
  • Redesign the broker's live-monitoring system (see below).
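
A rough sketch of the first task, creating an Avro schema and attaching it to a (new) topic via the Pub/Sub schema API; the schema fields and resource names here are invented for illustration, not the broker's actual message format:

```python
from google.cloud.pubsub import PublisherClient, SchemaServiceClient
from google.pubsub_v1.types import Encoding, Schema, SchemaSettings

project_id = "my-project"        # placeholder
schema_id = "classifier-output"  # placeholder
topic_id = "classifier-output"   # placeholder

# Made-up Avro schema; the real one would carry the module's results, its metadata,
# and (usually) a copy of the incoming alert.
avsc = """
{
  "type": "record",
  "name": "ClassifierOutput",
  "fields": [
    {"name": "objectId", "type": "string"},
    {"name": "candid", "type": "long"},
    {"name": "classification", "type": "string"},
    {"name": "broker_version", "type": "string"},
    {"name": "processing_time_ms", "type": "double"}
  ]
}
"""

schema_client = SchemaServiceClient()
schema_path = schema_client.schema_path(project_id, schema_id)

# Register the Avro schema with Pub/Sub.
schema = schema_client.create_schema(
    request={
        "parent": f"projects/{project_id}",
        "schema": Schema(name=schema_path, type_=Schema.Type.AVRO, definition=avsc),
        "schema_id": schema_id,
    }
)

# Create the module's output topic with the schema attached.
publisher = PublisherClient()
publisher.create_topic(
    request={
        "name": publisher.topic_path(project_id, topic_id),
        "schema_settings": SchemaSettings(schema=schema.name, encoding=Encoding.BINARY),
    }
)
```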

*Note: Some of the module's resources (its Pub/Sub "counter" subscriptions) also supply the data currently displayed by the broker's live-monitoring system. We should redesign that system to scrape data from logs instead, which is the more standard approach. Tagging #109.
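
If we do go the log-scraping route, one possible building block is a Cloud Logging log-based metric; the metric name and filter below are purely hypothetical and would need to match whatever the modules actually log:

```python
from google.cloud import logging

client = logging.Client()

# Hypothetical counter of "alert processed" log entries from one module,
# which a dashboard could chart in place of the Pub/Sub "counter" subscriptions.
metric = client.metric(
    "classifier-alerts-processed",
    filter_='resource.type="cloud_function" AND jsonPayload.msg="alert processed"',
    description="Counts alerts processed by the classifier, for the live-monitoring dashboard.",
)
if not metric.exists():
    metric.create()
```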


wmwv commented Sep 26, 2022

I'm sold.

troyraen commented

tagging @hernandezc1
