Skip to content
This repository has been archived by the owner on Dec 3, 2021. It is now read-only.

Latest commit

 

History

History
327 lines (278 loc) · 11.1 KB

beginners-guide-to-catalog.md

File metadata and controls

327 lines (278 loc) · 11.1 KB

A Beginner's Guide to the AirbyteCatalog

Overview

The goal of this article is to make the AirbyteCatalog approachable to someone contributing to Airbyte for the first time. If you are looking to get deeper into the details of the catalog, you can read our technical specification on it here.

The goal of the AirbyteCatalog is to describe what data is available in a source. The goal of the ConfiguredAirbyteCatalog is to, based on an AirbyteCatalog, specify how data from the source is replicated.

Contents

This article will illustrate how to use AirbyteCatalog via a series of examples. We recommend reading the Database Example first. The other examples, will refer to knowledge described in that section. After that, jump around to whichever example is most pertinent to your inquiry.

In order to understand in depth how to configure incremental data replication, head over to the incremental replication docs.

Database Example

Let's jump into an example using a relational database. We will assume we have a database with the following schema:

CREATE TABLE "airlines" (
    "id"   INTEGER,
    "name" VARCHAR
);

CREATE TABLE "pilots" (
    "id"   INTEGER,
    "airline_id" INTEGER,
    "name" VARCHAR
);

AirbyteCatalog

We would represent this data in a catalog as follows:

{
  "streams": [
    {
      "name": "airlines",
      "supported_sync_modes": [
        "full_refresh",
        "incremental"
      ],
      "source_defined_cursor": false,
      "json_schema": {
        "type": "object",
        "properties": {
          "id": {
            "type": "number"
          },
          "name": {
            "type": "string"
          }
        }
      }
    },
    {
      "name": "pilots",
      "supported_sync_modes": [
        "full_refresh",
        "incremental"
      ],
      "source_defined_cursor": false,
      "json_schema": {
        "type": "object",
        "properties": {
          "id": {
            "type": "number"
          },
          "airline_id": {
            "type": "number"
          },
          "name": {
            "type": "string"
          }
        }
      }
    }
  ]
}

The catalog is structured as a list of AirbyteStream. In the case of a database a "stream" is analogous to a table. (For APIs the mapping can be a more creative; we will discuss it later in API Examples)

Let's walk through what each field in a stream means.

  • name - The name of the stream.
  • supported_sync_modes - This field lists the type of data replication that this source supports. The possible values in this array include FULL_REFRESH (docs) and INCREMENTAL (docs).
  • source_defined_cursor - If the stream supports INCREMENTAL replication, then this field signals whether the source can figure out how to detect new records on its own or not.
  • json_schema - This field is a JsonSchema object that describes the structure of the data. Notice that each key in the properties object corresponds to a column name in our database table.

Now we understand what data is available from this source. Next we will configure how we want to replicate that data.

ConfiguredAirbyteCatalog

Let's say that we do not care about replicating the pilot data at all. We do want to replicate the airlines data as a FULL_REFRESH. Here's what our ConfiguredAirbyteCatalog would look like.

{
  "streams": [
    {
      "sync_mode": "FULL_REFRESH",
      "stream": {
        "name": "airlines",
        "supported_sync_modes": [
          "full_refresh",
          "incremental"
        ],
        "source_defined_cursor": false,
        "json_schema": {
          "type": "object",
          "properties": {
            "id": {
              "type": "number"
            },
            "name": {
              "type": "string"
            }
          }
        }
      }
    }
  ]
}

Just as with the AirbyteCatalog the ConfiguredAirbyteCatalog contains a list. This time it is a list of ConfiguredAirbyteStream (instead of just AirbyteStream).

Let's walk through each field in the ConfiguredAirbyteStream:

  • sync_mode - This field must be one of the values that was in supported_sync_modes in the AirbyteStream - Configures which sync mode will be used when data is replicated.
  • stream - Hopefully this one looks familiar! This field contains an AirbyteStream. It should be identical to the one we saw in the AirbyteCatalog.
  • cursor_field - When sync_mode is INCREMENTAL and source_defined_cursor = false, this field configures which field in the stream will be used to determine if a record should be replicated or not. Read more about this concept in our documentation of incremental replication.

Summary of the Postgres Example

When thinking about AirbyteCatalog and ConfiguredAirbyteCatalog, remember that the AirbyteCatalog describes what data is present in the source (and metadata around what replication configuration it can support). It is output by the discover method of source. It should be treated as an immutable object; if you are ever manually editing a catalog outside of a source, you've gone off the rails. The ConfiguredAirbyteCatalog is a mutable configuration object that specifies, for each AirbyteStream, how (and if) it should be replicated. The ConfiguredAirbyteCatalog does this by wrapping each AirbyteStream in an AirbyteCatalog inside a ConfiguredAirbyteStream.

API Examples

The AirbyteCatalog offers the flexibility in how to model the data for an API. In the next two examples, we will model data from the same API--a stock ticker--in two different ways. In the first, the source will return a single stream called ticker, and in the second, the source with return a stream for each stock symbol it is configured to retrieve data for. Each stream's name will be a stock symbol.

Static Streams Example

Let's imagine we want to create a basic Stock Ticker source. The goal of this source is to take in a single stock symbol and return a single stream. We will call the stream ticker and will contain the closing price of the stock. We will assume that you already have a rough understanding of the AirbyteCatalog and the ConfiguredAirbyteCatalog from the previous database example.

AirbyteCatalog

Here is what the AirbyteCatalog might look like.

{
  "streams": [
    {
      "name": "ticker",
      "supported_sync_modes": [
        "full_refresh",
        "incremental"
      ],
      "source_defined_cursor": false,
      "json_schema": {
        "type": "object",
        "properties": {
          "symbol": {
            "type": "string"
          },
          "price": {
            "type": "number"
          },
          "date": {
            "type": "string"
          }
        }
      }
    }
  ]
}

This catalog looks pretty similar to the AirbyteCatalog that we created for the Database Example. For the data we've picked here, you can think about ticker as a table and then each field it returns in a record as a column, so it makes sense that these look pretty similar.

ConfiguredAirbyteCatalog

The ConfiguredAirbyteCatalog follows the same rules as we described in the Database Example. It just wraps the AirbyteCatalog described above.

Dynamic Streams Example

Now let's build a stock ticker source that handles returning ticker data for multiple stocks. The name of each stream will be the stock symbol that it represents.

AirbyteCatalog

{
  "streams": [
    {
      "name": "TSLA",
      "supported_sync_modes": [
        "full_refresh",
        "incremental"
      ],
      "source_defined_cursor": false,
      "json_schema": {
        "type": "object",
        "properties": {
          "symbol": {
            "type": "string"
          },
          "price": {
            "type": "number"
          },
          "date": {
            "type": "string"
          }
        }
      }
    },
    {
      "name": "FB",
      "supported_sync_modes": [
        "full_refresh",
        "incremental"
      ],
      "source_defined_cursor": false,
      "json_schema": {
        "type": "object",
        "properties": {
          "symbol": {
            "type": "string"
          },
          "price": {
            "type": "number"
          },
          "date": {
            "type": "string"
          }
        }
      }
    }
  ]
}

This example provides another way of thinking about exposing data in a source. As a developer building a source, you can model the AirbyteCatalog for a source however makes most sense to the use case you are trying to fulfill.

Nested Schema Example

Often, a data source contains "nested" data. In other words this is data where each record contains other objects nested inside it. Cases like this cannot be easily modeled just as tables / columns. This is why Airbyte uses JsonSchema to model the schema of its streams.

Let's imagine we are modeling a flight object. A flight object might look like this:

{
  "airline": "alaska",
  "origin": {
    "airport_code": "SFO",
    "terminal": "2",
    "gate": "G23"
  },
  "destination": {
    "airport_code": "JFK",
    "terminal": "7",
    "gate": "1"
  }
}

The AirbyteCatalog would look like this:

{
  "streams": [
    {
      "name": "flights",
      "supported_sync_modes": [
        "full_refresh"
      ],
      "source_defined_cursor": false,
      "json_schema": {
        "type": "object",
        "properties": {
          "airline": {
            "type": "string"
          },
          "origin": {
            "type": "object",
            "properties": {
              "airport_code": {
                "type": "string"
              },
              "terminal": {
                "type": "string"
              },
              "gate": {
                "type": "string"
              }
            }
          },
          "destination": {
            "type": "object",
            "properties": {
              "airport_code": {
                "type": "string"
              },
              "terminal": {
                "type": "string"
              },
              "gate": {
                "type": "string"
              }
            }
          }
        }
      }
    }
  ]
}

Because Airbyte uses JsonSchema to model the schema of streams, it is able to handle arbitrary nesting of data in a way that a table / column based model cannot.