Migrate retry handler in task SDK API client to use tenacity instead of retryhttp #56762

amoghrajesh · 2025-10-17T11:03:39Z

Motivation

The task SDK uses httpx for HTTP operations but depends on retryhttp for retry logic. This brings in the entire requests library as a transitive dependency, even though it's never actually used.

Memray reports also show memory statistics that can be improved.

For import retryhttp:

This is because retryhttp unconditionally imports both httpx and requests, even though we only use httpx.

The retry handler works fine, but the dependency is unnecessary bloat. We already use tenacity in the SDK client and only use httpx, not requests, so we might as well use tenacity for better memory results and a reduced footprint.

Alternatives Considered

1. Switch to stamina

Modern, opinionated wrapper around tenacity
Better ergonomics and async support

2. Use pure tenacity

Zero new dependencies (tenacity already used)
Maximum memory and size savings
Full control over retry behavior

I went ahead and selected tenacity because it is a battle-tested library that is already present and widely used in the task SDK. There is no need for a new library when the difference in memory footprint (Net savings: ~444KB vs ~544KB with pure tenacity) isn't significant.

Changes of note

Migrated the task SDK to use tenacity instead of retryhttp while maintaining full parity with what retryhttp offered. A lot of the new code is inspired by retryhttp's code!

Tenacity is a generic retry library - it doesn't know anything about HTTP. It just retries when you tell it to, so to maintain parity, some wrappers and helpers had to be written.

_should_retry_api_request(exception)

This function determines which errors should trigger a retry. It replicates retryhttp's behavior of retrying on:

Server errors (5xx status codes)
Network failures (httpx.NetworkError)
Timeouts (httpx.TimeoutException)

But NOT retrying on client errors like 404 or 401, which would be pointless.

NOTE: Handling of the 429 status code (rate limiting) has been removed because the API server doesn't support rate limiting as of now. We can add that support to the client when it is needed or when the server is ready.
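A minimal sketch of the predicate logic described above, not the actual implementation in client.py. To keep it runnable without httpx installed, StatusError and TransportError are hypothetical stand-ins for httpx.HTTPStatusError and for httpx.NetworkError / httpx.TimeoutException respectively.

```python
class StatusError(Exception):
    """Hypothetical stand-in for httpx.HTTPStatusError."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


class TransportError(Exception):
    """Hypothetical stand-in for httpx.NetworkError / httpx.TimeoutException."""


def should_retry(exc: BaseException) -> bool:
    """Return True only for errors that a later attempt might fix."""
    if isinstance(exc, StatusError):
        # Server errors (5xx) are retried; client errors such as 401, 404
        # and (after this PR) 429 are not.
        return exc.status_code >= 500
    # Network failures and timeouts are retried; anything else is not.
    return isinstance(exc, TransportError)


decisions = {
    "503": should_retry(StatusError(503)),
    "404": should_retry(StatusError(404)),
    "429": should_retry(StatusError(429)),
    "network": should_retry(TransportError("connection reset")),
    "other": should_retry(ValueError("bad input")),
}
```

With tenacity, a predicate of this shape is what gets passed to retry_if_exception(...).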

How parity was maintained

The original retryhttp decorator looked like this:

@retry(
    max_attempt_number=API_RETRIES,
    wait_server_errors=_default_wait,
    wait_network_errors=_default_wait,
    wait_timeouts=_default_wait,
    wait_rate_limited=wait_retry_after(fallback=_default_wait),
    before_sleep=before_log(log, logging.WARNING),
)

Each of those parameters mapped to specific HTTP behaviors. Our new decorator with tenacity:

@retry(
    retry=retry_if_exception(_should_retry_api_request),
    stop=stop_after_attempt(API_RETRIES),
    wait=_get_retry_wait_time,
    before_sleep=before_log(log, logging.WARNING),
    reraise=True,
)
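For readers unfamiliar with tenacity, here is a rough, library-free illustration of the control flow that a decorator configured like the one above provides: a retry predicate, a cap on total attempts, a wait between attempts, and re-raising the original exception when giving up. call_with_retries and exponential_wait are hypothetical names for this sketch, and the backoff values are assumptions, not the SDK's actual _get_retry_wait_time.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    should_retry: Callable[[BaseException], bool],
    max_attempts: int,
    wait_for: Callable[[int], float],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying while should_retry(exc) is true and attempts remain."""
    attempt = 1
    while True:
        try:
            return fn()
        except Exception as exc:
            # Give up on non-retryable errors, or once the attempt budget is
            # spent, and re-raise the original exception (like reraise=True).
            if not should_retry(exc) or attempt >= max_attempts:
                raise
            sleep(wait_for(attempt))
            attempt += 1


def exponential_wait(attempt: int, base: float = 0.1, cap: float = 1.0) -> float:
    """Assumed backoff: base * 2**(attempt - 1), capped at cap seconds."""
    return min(cap, base * 2 ** (attempt - 1))


# Usage: fail twice with a retryable error, then succeed on the third attempt.
calls = {"n": 0}
waits: list[float] = []


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server unavailable")
    return "ok"


result = call_with_retries(
    flaky,
    should_retry=lambda exc: isinstance(exc, ConnectionError),
    max_attempts=5,
    wait_for=exponential_wait,
    sleep=waits.append,  # record the waits instead of actually sleeping
)
```

Injecting the sleep function keeps a test of this kind fast without patching time.sleep globally.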

Removed from dependencies:

retryhttp
requests
types-requests

How this was tested

Unit tests

All existing retry tests pass without modification:

Server error recovery (500 errors with retry)
Non-retryable errors (422 validation errors)
Max retry attempts exhaustion

Integration testing

Ran the task SDK against a live API server and killed the server mid request. The client correctly retried with exponential backoff and recovered when the server came back up, confirming network error handling works as expected.

Benefits

Memray results

Results for import tenacity, showing the difference:

Flamegraph (now):

Flamegraph (earlier):

Memory Graph (now):

Memory Graph (before):

Stats (now):

Stats (earlier):

Package size

INSTALLED DEPENDENCIES SIZE:
Removed packages:
• retryhttp ..................... 40K
• requests ..................... 228K
• urllib3 (dep of requests) .... 488K
• charset_normalizer (dep) ..... 808K
• idna (dep) ................... 352K
• certifi (dep) ................ 296K
─────────────────────────────────────
TOTAL DISK SAVINGS: ........... ~2.2 MB

Workers do not need to have these dependencies anymore.
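As a quick check of the total above, the per-package sizes can be summed directly. This sketch assumes the "K" figures are KiB (as a du-style listing would report them); the dictionary simply restates the list.

```python
# Installed sizes of the removed packages, in KiB (values from the list above).
removed_kib = {
    "retryhttp": 40,
    "requests": 228,
    "urllib3": 488,
    "charset_normalizer": 808,
    "idna": 352,
    "certifi": 296,
}

total_kib = sum(removed_kib.values())
total_mib = total_kib / 1024

print(f"{total_kib}K ~= {total_mib:.1f} MB")  # prints "2212K ~= 2.2 MB"
```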


task-sdk/src/airflow/sdk/api/client.py

task-sdk/tests/task_sdk/api/test_client.py

kaxil

It's a bug that retryhttp brings in requests!

Their codebase and documentation are structured so that you can use either httpx or requests. Let me create a PR on their repo.

kaxil · 2025-10-17T12:20:24Z

Created a PR on their repo austind/retryhttp#26 to fix that

cc @austind

amoghrajesh · 2025-10-17T12:56:17Z

Memray for stamina too: from stamina import retry

Turns out it is even better than tenacity

kaxil · 2025-10-17T13:12:53Z

> Turns out it is even better than tenacity

That can't be true though. Something must be wrong in how you are testing it. Stamina uses tenacity under the hood.

amoghrajesh · 2025-10-17T13:24:30Z

@kaxil yes you are right, something was wrong with my setup.

Testing with a simple CLI script now, and tenacity comes out ahead:

cd /Users/amoghdesai/Documents/OSS/repos/airflow
source .venv/bin/activate

cat > /tmp/test_retry_tenacity.py << 'EOF'
from tenacity import retry
print("Imported retry from tenacity")
EOF

cat > /tmp/test_retry_stamina.py << 'EOF'
from stamina import retry
print("Imported retry from stamina")
EOF

memray run -o retry-tenacity.bin /tmp/test_retry_tenacity.py
memray run -o retry-stamina.bin /tmp/test_retry_stamina.py

echo "=== TENACITY ==="
memray stats retry-tenacity.bin | head -20
echo ""
echo "=== STAMINA ==="
memray stats retry-stamina.bin | head -20

Results:

TENACITY: from tenacity import retry
Peak memory: 1.523 MB
Total allocated: 3.989 MB
Allocations: 1,343

STAMINA: from stamina import retry
Peak memory: 2.655 MB
Total allocated: 6.805 MB
Allocations: 2,139
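If memray is not available, a rough stdlib-only approximation of the same import-cost measurement can be sketched with tracemalloc. This is a swapped-in technique, not what was used above: tracemalloc only sees allocations made through Python's allocators, so the numbers will not match memray's. import_peak_bytes is a hypothetical helper name, and "json" is just a placeholder module.

```python
import importlib
import tracemalloc


def import_peak_bytes(module_name: str) -> int:
    """Return the peak Python-level allocation (bytes) while importing a module."""
    tracemalloc.start()
    try:
        importlib.import_module(module_name)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Placeholder module; substitute "tenacity" or "stamina" where installed.
peak = import_peak_bytes("json")
print(f"json: peak {peak / 1024:.1f} KiB")
```

Modules that are already in sys.modules are not re-imported and will report close to zero, so each library should be measured in a fresh interpreter, as the separate scripts above do.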

ashb · 2025-10-17T14:37:40Z

@amoghrajesh It might be worth adding import pydantic, httpx to those benchmarks, since we really only care about the delta on top of those modules, not the standalone numbers.

task-sdk/tests/task_sdk/api/test_client.py

kaxil

few comments but overall lgtm.

cc @jscheffl you might be interested since you added it

jscheffl · 2025-10-17T19:35:24Z

Yeah looking at it! Also have a gap in retryhttp as I would like to switch to aiohttp in edge and thought about how to continue. Might be I follow your path...

amoghrajesh · 2025-10-22T07:15:31Z

> @amoghrajesh It might be worth adding import pydantic, httpx to those benchmarks, since we really only care about the delta on top of those modules, not the standalone numbers.

Good idea. Ran a test for that, posting results below.

amoghrajesh · 2025-10-22T07:28:51Z

Test Setup

Testing baseline

"""Test baseline: pydantic + httpx only"""

import pydantic
import httpx
from pydantic import BaseModel

class TestModel(BaseModel):
    name: str

model = TestModel(name="test")
client = httpx.Client()
client.close()

print("Baseline: pydantic + httpx loaded")

Testing with tenacity

"""Test tenacity with equivalent retry configuration"""

import pydantic
import httpx
import tenacity
from pydantic import BaseModel
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

class TestModel(BaseModel):
    name: str

model = TestModel(name="test")
client = httpx.Client()
client.close()

# Equivalent retry configuration
@retry(
    retry=retry_if_exception(lambda e: isinstance(e, Exception)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=0.1, max=1.0),
    reraise=True,
)
def dummy():
    pass

print("Tenacity with equivalent config loaded")

Testing with stamina

"""Test stamina with equivalent retry configuration"""

import pydantic
import httpx
import stamina
from pydantic import BaseModel

class TestModel(BaseModel):
    name: str

model = TestModel(name="test")
client = httpx.Client()
client.close()

# Equivalent retry configuration
@stamina.retry(
    on=Exception,
    attempts=3,
    wait_initial=0.1,
    wait_max=1.0,
    wait_jitter=1.0,
)
def dummy():
    pass

print("Stamina with equivalent config loaded")

Testing with retryhttp

"""Test retryhttp with equivalent retry configuration"""

import pydantic
import httpx
import retryhttp
from pydantic import BaseModel
from retryhttp import retry

class TestModel(BaseModel):
    name: str

model = TestModel(name="test")
client = httpx.Client()
client.close()

# Equivalent retry configuration (retryhttp uses different API)
@retry(
    max_attempt_number=3,
    # retryhttp uses different wait strategies - using default exponential
)
def dummy():
    pass

print("RetryHTTP with equivalent config loaded")

Run profiling

memray run --output baseline.bin test_baseline.py
memray run --output tenacity.bin test_tenacity.py  
memray run --output stamina.bin test_stamina.py
memray run --output retryhttp.bin test_retryhttp.py


memray summary baseline.bin
memray summary tenacity.bin
memray summary stamina.bin
memray summary retryhttp.bin

Benchmark Results

| Library | Total Memory | Overhead | Savings vs RetryHTTP | % Reduction |
|---|---|---|---|---|
| Baseline (pydantic + httpx) | 13.62 MB | - | - | - |
| Tenacity (equivalent config) | 17.03 MB | +3.41 MB | 4.22 MB | 50% |
| Stamina (equivalent config) | 17.65 MB | +4.03 MB | 3.60 MB | 42% |
| RetryHTTP (equivalent config) | 21.25 MB | +7.63 MB | - | - |
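The Overhead and Savings columns follow directly from the totals; a small sketch of that arithmetic is below. The basis of the % Reduction column is not stated in the thread, so this sketch does not try to reproduce it.

```python
# Total memory per scenario in MB, as reported in the table above.
totals_mb = {
    "baseline": 13.62,
    "tenacity": 17.03,
    "stamina": 17.65,
    "retryhttp": 21.25,
}

baseline = totals_mb["baseline"]

# Overhead: memory added on top of the pydantic + httpx baseline.
overhead = {
    name: round(total - baseline, 2)
    for name, total in totals_mb.items()
    if name != "baseline"
}

# Savings: how much less total memory than the retryhttp scenario.
savings = {
    name: round(totals_mb["retryhttp"] - totals_mb[name], 2)
    for name in ("tenacity", "stamina")
}

for name, extra in overhead.items():
    print(f"{name}: +{extra:.2f} MB over baseline")
```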

amoghrajesh · 2025-10-22T09:35:11Z

@kaxil I have added this to 3.1.2 (we could also consider 3.2), but do you think there is a reason to add it to 3.1.1? I think not.

@jscheffl I see similar patterns for edge executor too, you could adopt a similar pattern there too. LMK if you want me to take a stab at it.

sunank200

LGTM overall — clear improvement and simpler deps.

I did not understand completely why we removed 429 retry logic. Earlier (per the PR description and early commits), the Task SDK’s retry logic used to respect Retry-After headers from 429 responses - pausing before retrying.

In the final version, the _should_retry_api_request predicate only retries on:

httpx.RequestError (network / timeout)
httpx.HTTPStatusError with status ≥ 500

It no longer retries on 429, nor does it read Retry-After. Why?

task-sdk/src/airflow/sdk/api/client.py

amoghrajesh · 2025-10-22T13:28:51Z

> LGTM overall: clear improvement and simpler deps.
>
> I did not understand completely why we removed 429 retry logic. Earlier (per the PR description and early commits), the Task SDK’s retry logic used to respect Retry-After headers from 429 responses, pausing before retrying.
>
> In the final version, the _should_retry_api_request predicate only retries on:
>
> httpx.RequestError (network / timeout)
> httpx.HTTPStatusError with status ≥ 500
>
> It no longer retries on 429, nor does it read Retry-After. Why?

Thanks @sunank200, the reason for that is simple: the API server doesn't support rate limiting! So it doesn't make sense for the client to support it when the server doesn't: #56762 (comment)

jscheffl · 2025-10-22T19:33:03Z

> @kaxil I have added this to 3.1.2 (we could also consider 3.2), but do you think there is a reason to add it to 3.1.1? I think not.
>
> @jscheffl I see similar patterns for edge executor too, you could adopt a similar pattern there too. LMK if you want me to take a stab at it.

Yes, I'd also apply this to edge, but I am still hoping that Kaxil's PR gets merged, followed by austind/retryhttp#28. I'd like to change edge to use asyncio, for which I'd need aiohttp support, and once Kaxil's fix lands there is no dependency on requests anymore, so edge could still keep retryhttp...

task-sdk/src/airflow/sdk/api/client.py

Switch retry handler in SDK API client to use tenacity instead of ret…

9b7d510

…ryhttp

amoghrajesh requested review from ashb and kaxil as code owners October 17, 2025 11:03

boring-cyborg bot added the area:task-sdk label Oct 17, 2025

amoghrajesh requested review from gopidesupavan, jscheffl and potiuk October 17, 2025 11:03

amoghrajesh self-assigned this Oct 17, 2025

ashb reviewed Oct 17, 2025

task-sdk/src/airflow/sdk/api/client.py (outdated, resolved)

ashb reviewed Oct 17, 2025

task-sdk/tests/task_sdk/api/test_client.py (outdated, resolved)

kaxil reviewed Oct 17, 2025

kaxil mentioned this pull request Oct 17, 2025

Avoid forcing installation of both httpx and requests austind/retryhttp#26

Merged

remove hardcoding

1c63157

kaxil reviewed Oct 17, 2025

task-sdk/tests/task_sdk/api/test_client.py (outdated, resolved)

kaxil approved these changes Oct 17, 2025

amoghrajesh added 2 commits October 22, 2025 13:07

dont mock time.sleep

f4d1349

remove 429 behavioyr

91eeaeb

amoghrajesh added this to the Airflow 3.1.2 milestone Oct 22, 2025

sunank200 reviewed Oct 22, 2025

task-sdk/src/airflow/sdk/api/client.py (resolved)

sunank200 approved these changes Oct 22, 2025

jscheffl reviewed Oct 22, 2025

task-sdk/src/airflow/sdk/api/client.py (resolved)

amoghrajesh merged commit fa60a7a into apache:main Oct 23, 2025
92 checks passed

amoghrajesh deleted the switch-to-tenacity-for-client-retries branch October 23, 2025 07:51

kaxil pushed a commit that referenced this pull request Oct 31, 2025

Migrate retry handler in task SDK API client to use tenacity instead …

96ab2f0

…of retryhttp (#56762) (cherry picked from commit fa60a7a)

This was referenced Oct 31, 2025

Sync v3-1-stable with 3.1.2rc1 changes #57640

Merged

Status of testing of Apache Airflow 3.1.2rc2 & Task SDK 1.1.2rc2 #57648

Closed
