Proposal: Spec Lineage.NodeID #2949

Open
wslulciuc opened this issue Oct 24, 2024 · 0 comments

wslulciuc commented Oct 24, 2024

Recently, we've seen various bugs reported for NodeID parsing issues:

A NodeID consists of multiple parts (i.e. metadata) delimited by a colon (:). A NodeID has a type (dataset, job, etc.) and is made up of the following parts:

<type>:<namespace>:<name>
or, <type>:<namespace>:<name>#<version>
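
As a rough illustration of the layout above, the sketch below assembles a NodeID string from its parts. The function name `format_node_id`, its parameters, and the version value `v1` are hypothetical and are not part of the project's API.

```python
from typing import Optional


def format_node_id(node_type: str, namespace: str, name: str,
                   version: Optional[str] = None) -> str:
    """Join the parts with ':' and append '#<version>' when one is given."""
    node_id = ":".join([node_type, namespace, name])
    if version is not None:
        node_id = f"{node_id}#{version}"
    return node_id


print(format_node_id("dataset", "food_delivery", "public.delivery_7_days"))
print(format_node_id("dataset", "food_delivery", "public.delivery_7_days", "v1"))
```

Note that nothing here escapes the delimiter, so any part that itself contains a colon produces an ambiguous ID.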

We defined a NodeID in this way in order to encode metadata about the node type and ensure globally unique IDs; the LineageAPI returns graph nodes with all metadata associated with that given node type. For example, below is the NodeID for dataset food_delivery:public.delivery_7_days:

dataset:food_delivery:public.delivery_7_days

where food_delivery is the namespace and public.delivery_7_days is the name of the dataset. A call to the LineageAPI will return the graph node:

{
  "id": "dataset:food_delivery:public.delivery_7_days",
  "type": "DATASET",
  "data": {
    "id": { "namespace": "food_delivery", "name": "public.delivery_7_days" },
    "type": "DB_TABLE",
    "name": "public.delivery_7_days",
    "physicalName": "public.delivery_7_days",
    "createdAt": "2024-10-24T19:27:05Z",
    "updatedAt": "2024-10-24T22:36:06Z",
    "namespace": "food_delivery",
    "sourceName": "food_delivery_db",
    "fields": [
      { "name": "order_id", "type": "INTEGER", "description": "The ID of the order." },
      { "name": "order_placed_on", "type": "TIMESTAMP", "description": "ISO-8601 timestamp for when the order was placed." },
      { "name": "order_dispatched_on", "type": "TIMESTAMP", "description": "ISO-8601 timestamp for dispatch." },
      { "name": "order_delivered_on", "type": "TIMESTAMP", "description": "ISO-8601 timestamp for delivery." },
      { "name": "customer_email", "type": "VARCHAR", "description": "Customer's email address." },
      { "name": "customer_address", "type": "VARCHAR", "description": "Customer's physical address." },
      { "name": "menu_id", "type": "INTEGER", "description": "ID of the related menu." },
      { "name": "restaurant_id", "type": "INTEGER", "description": "ID of the restaurant." },
      { "name": "restaurant_address", "type": "VARCHAR", "description": "Restaurant's address." },
      { "name": "menu_item_id", "type": "INTEGER", "description": "ID of the menu item." },
      { "name": "category_id", "type": "INTEGER", "description": "ID of the category." },
      { "name": "discount_id", "type": "INTEGER", "description": "ID of the discount." },
      { "name": "city_id", "type": "INTEGER", "description": "ID of the city." },
      { "name": "driver_id", "type": "INTEGER", "description": "ID of the driver." }
    ],
    "tags": [],
    "lastModifiedAt": null,
    "description": null,
    "lastLifecycleState": ""
  },
  "inEdges": [
    { "origin": "job:food_delivery:etl_delivery_7_days", "destination": "dataset:food_delivery:public.delivery_7_days" }
  ],
  "outEdges": [
    { "origin": "dataset:food_delivery:public.delivery_7_days", "destination": "job:food_delivery:delivery_times_7_days" }
  ]
}

Error on NodeId.parse()

But what if the namespace contains a colon (:)? Our NodeId.parse() method errors (not fun!). For example, node parsing will error for the namespace:

trino://trino-integration-test:1337
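
To make the failure concrete, here is a minimal sketch of a colon-splitting parser. It is not the actual NodeId.parse() implementation; `naive_parse` is a hypothetical stand-in that assumes exactly three colon-delimited parts, and the dataset name `public.orders` is made up for illustration.

```python
from typing import Tuple


def naive_parse(node_id: str) -> Tuple[str, str, str]:
    """Hypothetical parser: expects exactly <type>:<namespace>:<name>."""
    parts = node_id.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected 3 parts, got {len(parts)}: {node_id!r}")
    node_type, namespace, name = parts
    return node_type, namespace, name


# Works when no part contains a colon.
print(naive_parse("dataset:food_delivery:public.delivery_7_days"))

# The namespace below adds two extra colons, so the split yields 5 parts.
try:
    naive_parse("dataset:trino://trino-integration-test:1337:public.orders")
except ValueError as err:
    print(f"parse failed: {err}")
```

Any delimiter-based scheme has this problem unless the parts are escaped, since the parser cannot tell a delimiter from a colon that belongs to the namespace.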

We need to move away from NodeId with encoded metadata (no longer needed as we move towards a light-weight lineage graph response -- just nodes and edges).

Use UUIDs as NodeIDs

Let's move to using UUIDs for NodeIDs, with the lineage graph returning just nodes and edges and supporting the following lineage graphs:

  • dataset -> dataset
  • dataset -> column/field
  • column/field -> dataset
  • job -> job
  • job -> dataset
  • dataset -> job
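
One way to picture the proposal: nodes are keyed by an opaque UUID, and the type, namespace, and name travel as ordinary fields instead of being encoded in the ID. The sketch below is an assumption about what such a graph could look like, not a defined schema; the `Node` and `Edge` classes, the random `uuid4` values, and the pairing of the trino namespace with this dataset name are illustrative only.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    id: uuid.UUID   # opaque identifier, never parsed
    type: str       # e.g. "DATASET" or "JOB"
    namespace: str  # may contain ':' freely
    name: str


@dataclass(frozen=True)
class Edge:
    origin: uuid.UUID
    destination: uuid.UUID


job = Node(uuid.uuid4(), "JOB", "food_delivery", "etl_delivery_7_days")
dataset = Node(uuid.uuid4(), "DATASET", "trino://trino-integration-test:1337",
               "public.delivery_7_days")

# A light-weight graph: a node lookup table plus a list of edges.
nodes = {n.id: n for n in (job, dataset)}
edges = [Edge(origin=job.id, destination=dataset.id)]  # job -> dataset

for edge in edges:
    print(nodes[edge.origin].name, "->", nodes[edge.destination].name)
```

Because the ID carries no structure, a colon in the namespace no longer matters; the trade-off is that clients need the node record to learn a node's type or name.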

wslulciuc added this to the 0.52.0 milestone on Oct 24, 2024