@amoghrajesh
Contributor

@amoghrajesh amoghrajesh commented Oct 17, 2025

Motivation

The task SDK uses httpx for HTTP operations but depends on retryhttp for retry logic. This brings in the entire requests library as a transitive dependency, even though it's never actually used.

Memray reports also show some stats which can be improved.

For import retryhttp:

[Images: three Memray report screenshots]

This is because retryhttp unconditionally imports both httpx and requests, even though we only use httpx.

retryhttp's retry handler works fine, but it adds unnecessary bloat. The SDK client already uses tenacity, and we only use httpx, not requests, so we might as well use tenacity directly for lower memory use and a smaller footprint.

Alternatives Considered

1. Switch to stamina

  • Modern, opinionated wrapper around tenacity
  • Better ergonomics and async support

2. Use pure tenacity

  • Zero new dependencies (tenacity already used)
  • Maximum memory and size savings
  • Full control over retry behavior

I selected tenacity because it is a battle-tested library that is already present and widely used in the task SDK. Adding a new library is not justified when the difference in footprint is small (net savings: ~444KB with stamina vs ~544KB with pure tenacity).

Changes of note

Migrated the task SDK to use tenacity instead of retryhttp while maintaining full parity with what retryhttp offered. Much of the new code is inspired by retryhttp's code!

Tenacity is a generic retry library - it doesn't know anything about HTTP. It just retries when you tell it to, so to maintain parity, some wrappers and helpers had to be written.

_should_retry_api_request(exception)

This function determines which errors should trigger a retry. It replicates retryhttp's behavior of retrying on:

  • Server errors (5xx status codes)
  • Network failures (httpx.NetworkError)
  • Timeouts (httpx.TimeoutException)

But NOT retrying on client errors like 404 or 401, which would be pointless.

NOTE: Handling for the 429 status code (rate limit) has been removed because the API server doesn't support rate limiting as of now. We can add that support to the client when it is needed or when the server is ready for it.

How parity was maintained

The original retryhttp decorator looked like this:

@retry(
    max_attempt_number=API_RETRIES,
    wait_server_errors=_default_wait,
    wait_network_errors=_default_wait,
    wait_timeouts=_default_wait,
    wait_rate_limited=wait_retry_after(fallback=_default_wait),
    before_sleep=before_log(log, logging.WARNING),
)

Each of those parameters mapped to specific HTTP behaviors. Our new decorator with tenacity:

@retry(
    retry=retry_if_exception(_should_retry_api_request),
    stop=stop_after_attempt(API_RETRIES),
    wait=_get_retry_wait_time,
    before_sleep=before_log(log, logging.WARNING),
    reraise=True,
)
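The body of `_get_retry_wait_time` is not shown in this description. As a rough illustration only: tenacity accepts any callable for `wait=` that takes the retry state and returns the number of seconds to sleep, so one exponential backoff shared by all retryable errors might look like the sketch below. The function name, the base and the cap are assumptions, and `SimpleNamespace` stands in for tenacity's retry state so the snippet runs without tenacity installed.

```python
from types import SimpleNamespace

BASE_SECONDS = 1.0   # assumed starting delay, not taken from the PR
MAX_SECONDS = 90.0   # assumed upper bound, not taken from the PR


def get_retry_wait_time_sketch(retry_state) -> float:
    """Exponential backoff: 1s, 2s, 4s, ... capped at MAX_SECONDS."""
    # attempt_number is 1 after the first failed attempt.
    attempt = retry_state.attempt_number
    return min(MAX_SECONDS, BASE_SECONDS * 2 ** (attempt - 1))


# Stand-in retry states for attempts 1..8.
waits = [
    get_retry_wait_time_sketch(SimpleNamespace(attempt_number=n))
    for n in range(1, 9)
]
print(waits)
```

A single wait callable replaces the separate `wait_server_errors`, `wait_network_errors` and `wait_timeouts` parameters, which all pointed at the same `_default_wait` in the retryhttp version.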

Removed from dependencies:

  • retryhttp
  • requests
  • types-requests

How this was tested

Unit tests

All existing retry tests pass without modification:

  • Server error recovery (500 errors with retry)
  • Non-retryable errors (422 validation errors)
  • Max retry attempts exhaustion

Integration testing

Ran the task SDK against a live API server and killed the server mid request. The client correctly retried with exponential backoff and recovered when the server came back up, confirming network error handling works as expected.


Benefits

Memray results

Results for import tenacity, compared with the earlier import retryhttp:

[Image: flamegraph (now)]
[Image: flamegraph (earlier)]
[Image: memory graph (now)]
[Image: memory graph (earlier)]
[Image: stats (now)]
[Image: stats (earlier)]

Package size

INSTALLED DEPENDENCIES SIZE:
Removed packages:
• retryhttp ..................... 40K
• requests ..................... 228K
• urllib3 (dep of requests) .... 488K
• charset_normalizer (dep) ..... 808K
• idna (dep) ................... 352K
• certifi (dep) ................ 296K
─────────────────────────────────────
TOTAL DISK SAVINGS: ........... ~2.2 MB

Workers do not need to have these dependencies anymore.



Member

@kaxil kaxil left a comment

It's a bug that retryhttp brings in requests!

Their codebase and documentation are structured so that you can use either httpx or requests. Let me create a PR on their repo.

@kaxil
Member

kaxil commented Oct 17, 2025

Created a PR on their repo austind/retryhttp#26 to fix that

cc @austind

@amoghrajesh
Contributor Author

amoghrajesh commented Oct 17, 2025

Memray for stamina too: from stamina import retry

[Images: three Memray report screenshots]

Turns out it is even better than tenacity

@kaxil
Member

kaxil commented Oct 17, 2025

> Turns out it is even better than tenacity

That can't be true though. Something must be wrong in how you are testing it. Stamina uses tenacity under the hood.

@amoghrajesh
Contributor Author

@kaxil yes you are right, something was wrong with my setup.

Testing with a simple CLI script now, tenacity comes out ahead:

cd /Users/amoghdesai/Documents/OSS/repos/airflow
source .venv/bin/activate

cat > /tmp/test_retry_tenacity.py << 'EOF'
from tenacity import retry
print("Imported retry from tenacity")
EOF

cat > /tmp/test_retry_stamina.py << 'EOF'
from stamina import retry
print("Imported retry from stamina")
EOF

memray run -o retry-tenacity.bin /tmp/test_retry_tenacity.py
memray run -o retry-stamina.bin /tmp/test_retry_stamina.py

echo "=== TENACITY ==="
memray stats retry-tenacity.bin | head -20
echo ""
echo "=== STAMINA ==="
memray stats retry-stamina.bin | head -20

Results:

TENACITY: from tenacity import retry
Peak memory: 1.523 MB
Total allocated: 3.989 MB
Allocations: 1,343

STAMINA: from stamina import retry
Peak memory: 2.655 MB
Total allocated: 6.805 MB
Allocations: 2,139

@ashb
Member

ashb commented Oct 17, 2025

@amoghrajesh It might be worth adding import pydantic, httpx to those benchmarks, since we really only care about the delta on top of those modules, not the standalone cost.
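The delta-over-baseline idea can be sketched with the standard library alone. This swaps in `tracemalloc` for Memray, so it only sees allocations made through Python's allocators and its numbers are not comparable to the Memray figures in this thread. The helper name is hypothetical, and `json` and `decimal` are stand-ins for the baseline (pydantic, httpx) and the candidate library.

```python
import importlib
import tracemalloc


def import_cost_bytes(module_name: str, baseline: tuple[str, ...] = ()) -> int:
    """Bytes still allocated after importing module_name on top of baseline.

    Run in a fresh interpreter per measurement: a module that is already
    imported costs almost nothing the second time.
    """
    # Load the baseline first so shared dependencies are not charged
    # to the module being measured.
    for name in baseline:
        importlib.import_module(name)

    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    importlib.import_module(module_name)
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


cost = import_cost_bytes("decimal", baseline=("json",))
print(f"decimal on top of json: {cost} bytes")
```

Importing the baseline before tracing starts is the whole point: anything the candidate shares with pydantic or httpx is already loaded and does not count against it.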

Member

@kaxil kaxil left a comment

few comments but overall lgtm.

cc @jscheffl you might be interested since you added it

@jscheffl
Contributor

Yeah, looking at it! I also have a gap with retryhttp, as I would like to switch to aiohttp in edge and have been thinking about how to continue. I might follow your path...

@amoghrajesh
Contributor Author

> @amoghrajesh It might be worth adding import pydantic, httpx in to those benchmarks, since we really only care about the delta on top of those modules, not the standalone.

Good idea. Ran a test for that, posting results below.

@amoghrajesh
Contributor Author

Test Setup

  1. Testing baseline
"""Test baseline: pydantic + httpx only"""

import pydantic
import httpx
from pydantic import BaseModel

class TestModel(BaseModel):
    name: str

model = TestModel(name="test")
client = httpx.Client()
client.close()

print("Baseline: pydantic + httpx loaded")
  2. Testing with tenacity
"""Test tenacity with equivalent retry configuration"""

import pydantic
import httpx
import tenacity
from pydantic import BaseModel
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

class TestModel(BaseModel):
    name: str

model = TestModel(name="test")
client = httpx.Client()
client.close()

# Equivalent retry configuration
@retry(
    retry=retry_if_exception(lambda e: isinstance(e, Exception)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=0.1, max=1.0),
    reraise=True,
)
def dummy():
    pass

print("Tenacity with equivalent config loaded")
  3. Testing with stamina
"""Test stamina with equivalent retry configuration"""

import pydantic
import httpx
import stamina
from pydantic import BaseModel

class TestModel(BaseModel):
    name: str

model = TestModel(name="test")
client = httpx.Client()
client.close()

# Equivalent retry configuration
@stamina.retry(
    on=Exception,
    attempts=3,
    wait_initial=0.1,
    wait_max=1.0,
    wait_jitter=1.0,
)
def dummy():
    pass

print("Stamina with equivalent config loaded")
  4. Testing with retryhttp
"""Test retryhttp with equivalent retry configuration"""

import pydantic
import httpx
import retryhttp
from pydantic import BaseModel
from retryhttp import retry

class TestModel(BaseModel):
    name: str

model = TestModel(name="test")
client = httpx.Client()
client.close()

# Equivalent retry configuration (retryhttp uses different API)
@retry(
    max_attempt_number=3,
    # retryhttp uses different wait strategies - using default exponential
)
def dummy():
    pass

print("RetryHTTP with equivalent config loaded")

Run profiling

memray run --output baseline.bin test_baseline.py
memray run --output tenacity.bin test_tenacity.py  
memray run --output stamina.bin test_stamina.py
memray run --output retryhttp.bin test_retryhttp.py


memray summary baseline.bin
memray summary tenacity.bin
memray summary stamina.bin
memray summary retryhttp.bin

Benchmark Results

| Library | Total Memory | Overhead | Savings vs RetryHTTP | % Reduction |
|---|---|---|---|---|
| Baseline (pydantic + httpx) | 13.62 MB | - | - | - |
| Tenacity (equivalent config) | 17.03 MB | +3.41 MB | 4.22 MB | 50% |
| Stamina (equivalent config) | 17.65 MB | +4.03 MB | 3.60 MB | 42% |
| RetryHTTP (equivalent config) | 21.25 MB | +7.63 MB | - | - |
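For reference, the Overhead and Savings columns follow directly from the Total Memory figures: overhead is the total minus the baseline, and savings is retryhttp's overhead minus the library's overhead. A short sketch of that arithmetic, using only the totals reported above (the variable names are illustrative):

```python
BASELINE_MB = 13.62  # pydantic + httpx only
totals_mb = {"tenacity": 17.03, "stamina": 17.65, "retryhttp": 21.25}

# Overhead: what each retry library adds on top of the baseline.
overhead_mb = {
    name: round(total - BASELINE_MB, 2) for name, total in totals_mb.items()
}

# Savings: how much less each alternative adds compared with retryhttp.
savings_mb = {
    name: round(overhead_mb["retryhttp"] - extra, 2)
    for name, extra in overhead_mb.items()
    if name != "retryhttp"
}

for name, saved in savings_mb.items():
    print(f"{name}: +{overhead_mb[name]:.2f} MB overhead, {saved:.2f} MB saved")
```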

@amoghrajesh amoghrajesh added this to the Airflow 3.1.2 milestone Oct 22, 2025
@amoghrajesh
Contributor Author

@kaxil I have added this to 3.1.2 (we could also consider 3.2), but do you think there is a reason to add it to 3.1.1? I think not.

@jscheffl I see similar patterns in the edge executor; you could adopt the same approach there. LMK if you want me to take a stab at it.

Collaborator

@sunank200 sunank200 left a comment

LGTM overall — clear improvement and simpler deps.

I did not understand completely why we removed 429 retry logic. Earlier (per the PR description and early commits), the Task SDK’s retry logic used to respect Retry-After headers from 429 responses - pausing before retrying.

In the final version, the _should_retry_api_request predicate only retries on:

  • httpx.RequestError (network / timeout)

  • httpx.HTTPStatusError with status ≥ 500

It no longer retries on 429, nor does it read Retry-After. Why?

@amoghrajesh
Contributor Author

> LGTM overall - clear improvement and simpler deps.
>
> I did not understand completely why we removed 429 retry logic. Earlier (per the PR description and early commits), the Task SDK’s retry logic used to respect Retry-After headers from 429 responses - pausing before retrying.
>
> In the final version, the _should_retry_api_request predicate only retries on:
>
>   • httpx.RequestError (network / timeout)
>   • httpx.HTTPStatusError with status ≥ 500
>
> It no longer retries on 429, nor does it read Retry-After. Why?

Thanks @sunank200, the reason for that is simple: the API server doesn't support rate limiting! So it doesn't make sense for the client to support it when the server doesn't: #56762 (comment)

@jscheffl
Contributor

> @kaxil I have added this to 3.1.2 (we could also consider 3.2), but do you think there is a reason to add it to 3.1.1? I think not.
>
> @jscheffl I see similar patterns for edge executor too, you could adopt a similar pattern there too. LMK if you want me to take a stab at it.

Yes, I'd also apply this to edge, but I am still hoping that Kaxil's PR gets merged, followed by austind/retryhttp#28, as I'd like to change edge to use asyncio and would then need aiohttp support. After Kaxil's fix there would be no dependency on requests anymore, so I could still keep retryhttp...

@amoghrajesh amoghrajesh merged commit fa60a7a into apache:main Oct 23, 2025
92 checks passed
@amoghrajesh amoghrajesh deleted the switch-to-tenacity-for-client-retries branch October 23, 2025 07:51
kaxil pushed a commit that referenced this pull request Oct 31, 2025