
feature: provide a way to use pagination concurrently to retrieve objects using the SDK #159

Closed
wvandeun opened this issue Nov 29, 2024 · 1 comment · Fixed by #219
Labels
type/feature New feature or request

Comments

@wvandeun
Contributor

wvandeun commented Nov 29, 2024

Component

Python SDK

Describe the Feature Request

When you execute a query that retrieves a large number of nodes from the database, using the filters or all method, the SDK leverages pagination to break the query into smaller pages.

The retrieval of these pages happens serially, which is not ideal. We could make this faster by retrieving the pages concurrently.

Pseudocode of what it could look like:

from infrahub_sdk import InfrahubClient

client = InfrahubClient()
# fetch the total node count first (method name illustrative)
resp = await client.execute_graphql("query { LocationSuite { count } }")

count = int(resp["LocationSuite"]["count"])

batch = await client.create_batch()

# queue one task per page, starting at offset 0
offset = 0
page_size = 50
while offset < count:
    batch.add(task=client.all, kind="nodekind", offset=offset, limit=page_size)
    offset += page_size

# run the queued page queries concurrently
async for _, page in batch.execute():
    ...
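The pseudocode above leans on the SDK's batch helper; the same idea can be sketched with plain asyncio. In this minimal, self-contained sketch, fetch_page is a hypothetical stand-in for one paginated SDK query, and the node ids are fake:

```python
import asyncio


async def fetch_page(offset: int, limit: int) -> list[int]:
    # Stand-in for one paginated query; returns fake node ids.
    await asyncio.sleep(0)  # simulate network latency
    return list(range(offset, min(offset + limit, TOTAL)))


TOTAL = 103  # total node count, e.g. from a { count } query
PAGE_SIZE = 50


async def fetch_all() -> list[int]:
    # Issue all page queries at once instead of one after another.
    offsets = range(0, TOTAL, PAGE_SIZE)
    pages = await asyncio.gather(*(fetch_page(o, PAGE_SIZE) for o in offsets))
    # flatten the pages back into a single list of nodes
    return [node for page in pages for node in page]


nodes = asyncio.run(fetch_all())
print(len(nodes))  # 103
```

With asyncio.gather all page requests are in flight at the same time, so total wall-clock time approaches the latency of a single page rather than the sum of all pages.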

Describe the Use Case

Retrieving a large number of nodes with a GraphQL query takes some time, since we fetch the pages one by one. Being able to fetch the pages concurrently should improve the speed.

Additional Information

No response

@wvandeun wvandeun added the type/feature New feature or request label Nov 29, 2024
@minitriga minitriga self-assigned this Jan 6, 2025
@minitriga
Contributor

minitriga commented Jan 6, 2025

I have a working example of this for both async and sync clients using the batch functionality within the SDK.
Sync:

from rich import print as rprint

from infrahub_sdk import Config, InfrahubClientSync

client = InfrahubClientSync(config=Config(pagination_size=2))

def main():
    branches = client.all(kind="OrganizationGeneric", batch=True)
    rprint(branches)

if __name__ == "__main__":
    main()

Async:

from asyncio import run as aiorun

from rich import print as rprint

from infrahub_sdk import Config, InfrahubClient

client = InfrahubClient(config=Config(pagination_size=2))

async def create_data(number: int):
    data = {
        "name": f"Vendor {number}",
    }
    obj = await client.create(kind="OrganizationGeneric", data=data)
    await obj.save()
    print(f"New OrganizationGeneric created with the Id {obj.id}")


async def main():
    # for i in range(1,1000):
    #     await create_data(i)
    branches = await client.all(kind="OrganizationGeneric", batch=True)
    rprint(len(branches))


if __name__ == "__main__":
    aiorun(main())

I have manually set the pagination size to 2 to slow things down. With batch=True, the query for 1021 locations takes `poetry run python test_sync.py 2.71s user 0.44s system 47% cpu 6.597 total`, while left to process the queries serially it takes `poetry run python test_sync.py 4.38s user 0.66s system 20% cpu 24.167 total`.

@wvandeun mentioned that batch is not the best argument name, so I am open to suggestions.
