SPIKE: Make Serialization & De-serialization stricter with versioning with Task SDK #45428

@kaxil

Description

Currently the serialization and de-serialization logic lives in airflow/serialization in the Core. With Airflow 3 and the separation of Task SDK, we will need to make serialization and its versioning much stricter.

We should bump the current DAG serialization version to 2.

Approach

The serialization code should live close to each language-specific Task SDK, since the SDK knows best how to serialize that language's objects to a JSON-formatted string.

The Core/scheduler will contain the de-serialization code, which does not need to be language-specific because it contains only the information needed by the scheduler.

The contract between the two is the schema.json file, which defines the serialized format. Both the client and the server could support multiple versions at a time.
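To make the "multiple versions at a time" idea concrete, here is a minimal sketch of a versioned envelope that the SDK writes and the server checks before reading the body. The envelope keys (`__version`, `dag`), the function names, and the set of accepted versions are assumptions for illustration and are not taken from the actual schema.json.

```python
import json

# Assumption: envelope keys and version numbers are illustrative only.
SDK_EMITS_VERSION = 2
SERVER_ACCEPTS_VERSIONS = frozenset({1, 2})


def sdk_serialize(dag_fields: dict, version: int = SDK_EMITS_VERSION) -> str:
    """SDK side (hypothetical): wrap DAG fields in a versioned envelope."""
    return json.dumps({"__version": version, "dag": dag_fields}, sort_keys=True)


def server_deserialize(payload: str) -> dict:
    """Server side (hypothetical): check the version before reading the body."""
    document = json.loads(payload)
    version = document.get("__version")
    if version not in SERVER_ACCEPTS_VERSIONS:
        raise ValueError(f"unsupported serialization version: {version!r}")
    return document["dag"]
```

Rejecting an unknown version up front, rather than failing on a missing field later, would keep version mismatches between an SDK and a server easy to diagnose.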

Architecture

Task SDK (Serialization)     Schema Contract      Server (Deserialization)
┌─────────────────────┐     ┌─────────────────┐   ┌────────────────────────┐
│ Language-specific   │────▶│   schema.json   │◀──│ Language-agnostic      │
│ DAG → JSON          │     │   (versioned)   │   │ JSON → SerializedDAG   │
│                     │     │                 │   │                        │
│ - Python SDK        │     │ Version 2.0     │   │ - Scheduler            │
│ - Go SDK            │     │ Version 2.1     │   │ - API-Server           │
│ - Future SDKs       │     │ Version 2.2     │   │                        │
└─────────────────────┘     └─────────────────┘   └────────────────────────┘
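The server side of the diagram could be sketched as a registry of per-version loaders that turn JSON from any SDK into one scheduler-side object. Everything named here (`SchedulerDag`, the `loader` decorator, the `tasks` / `task_id` fields, and what differs between 2.0 and 2.1) is hypothetical; only the version labels come from the diagram.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SchedulerDag:
    """Hypothetical scheduler-side view holding only what scheduling needs."""
    dag_id: str
    task_ids: tuple[str, ...]


# Maps a schema version string to the function that understands it.
_LOADERS: dict[str, Callable[[dict], SchedulerDag]] = {}


def loader(version: str):
    """Register a loader function for one schema version."""
    def register(func):
        _LOADERS[version] = func
        return func
    return register


@loader("2.0")
def _load_v2_0(body: dict) -> SchedulerDag:
    return SchedulerDag(body["dag_id"], tuple(t["task_id"] for t in body["tasks"]))


@loader("2.1")
def _load_v2_1(body: dict) -> SchedulerDag:
    # Assumption: 2.1 only adds optional fields that the scheduler ignores.
    return _load_v2_0(body)


def deserialize(payload: str) -> SchedulerDag:
    """Dispatch on the envelope version; the producing language is irrelevant."""
    document = json.loads(payload)
    version = document["__version"]
    if version not in _LOADERS:
        raise ValueError(f"no loader registered for version {version!r}")
    return _LOADERS[version](document["dag"])
```

Because the loaders only see JSON, a payload written by a Go SDK and one written by the Python SDK would go through the same code path, and the server never needs to import either SDK.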

Alternative Options to Compare

Option 2: Shared Serialization in airflow-protocols

  • Approach: Both serialization and deserialization live in airflow-protocols package
  • Pros: Single source of truth, shared implementation, easier maintenance
  • Cons: Both server and SDK depend on same package, potential coupling
  • Package location: airflow-protocols

Option 3: Symmetric Implementation

  • Approach: Both SDK and server can serialize/deserialize
  • Pros: Flexibility, testing capabilities, debugging support
  • Cons: Code duplication, potential drift between implementations

Key Questions to Answer

  • Where should serialization live: in airflow-commons / airflow-protocols, in separate packages, or both?
  • Should serialization and deserialization be separate, or a symmetric implementation?
  • How to handle versioning?
  • Backward compatibility strategy?

Success Criteria

  • SDK can serialize DAGs without importing server components, and can deserialize multiple schema versions
  • Server can deserialize without importing SDK components
  • Multiple schema versions supported simultaneously
  • Existing serialized DAGs can be migrated
  • Clear path for future language SDKs
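One way the "existing serialized DAGs can be migrated" criterion might be met is a chain of single-step upgrade functions applied until a stored document reaches the latest version. This is a sketch under assumptions: the `__version` / `dag` envelope, the `migrate` helper, and the renamed field are all invented for illustration and do not describe the real differences between version 1 and version 2.

```python
import copy
from typing import Callable

LATEST_VERSION = 2


def _upgrade_1_to_2(document: dict) -> dict:
    """Hypothetical step: rename one field and bump the version."""
    upgraded = copy.deepcopy(document)  # never mutate the stored original
    dag = upgraded["dag"]
    if "hypothetical_old_field" in dag:
        dag["hypothetical_new_field"] = dag.pop("hypothetical_old_field")
    upgraded["__version"] = 2
    return upgraded


# Each entry upgrades a document from version N to version N + 1.
_UPGRADES: dict[int, Callable[[dict], dict]] = {1: _upgrade_1_to_2}


def migrate(document: dict) -> dict:
    """Apply upgrade steps in order until the document is at LATEST_VERSION."""
    current = document
    while current["__version"] < LATEST_VERSION:
        current = _UPGRADES[current["__version"]](current)
    return current
```

Keeping each step small means a future version 3 would only need one new function and one new registry entry, while documents already at the latest version pass through unchanged.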

Status: Done
