
Discussion: Pushing batches of data to online store: Should conn.commit() happen in the for loop or after? #4036

Closed
job-almekinders opened this issue Mar 23, 2024 · 4 comments · Fixed by #4303
Labels
kind/question Further information is requested

Comments

@job-almekinders
Contributor

job-almekinders commented Mar 23, 2024

This piece of code in the online_write_batch function of the Postgres online store pushes data to the online store in batches.

I was wondering whether it makes more sense to put the conn.commit() inside the for loop, or after the for loop. I would love to hear the different trade-offs between the two options!

Copy of the code snippet here:

batch_size = 5000
for i in range(0, len(insert_values), batch_size):
    cur_batch = insert_values[i : i + batch_size]
    execute_values(
        cur,
        sql.SQL(
            """
            INSERT INTO {}
            (entity_key, feature_name, value, event_ts, created_ts)
            VALUES %s
            ON CONFLICT (entity_key, feature_name) DO
            UPDATE SET
                value = EXCLUDED.value,
                event_ts = EXCLUDED.event_ts,
                created_ts = EXCLUDED.created_ts;
            """,
        ).format(sql.Identifier(_table_id(project, table))),
        cur_batch,
        page_size=batch_size,
    )
    conn.commit() # << This is the point of interest. Should this be placed in the loop or after?  
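For illustration, here is a minimal, runnable sketch of the trade-off. It uses the stdlib sqlite3 module instead of psycopg2 (so the table and function names are hypothetical, not the Feast code), but the commit-placement question is the same: commit-per-batch makes each batch visible immediately while risking a partial write on failure, whereas a single commit after the loop is all-or-nothing but holds the transaction open for the whole run.

```python
import sqlite3

def write_batches(conn, rows, batch_size, commit_per_batch):
    """Upsert rows in batches, committing either per batch or once at the end."""
    cur = conn.cursor()
    for i in range(0, len(rows), batch_size):
        batch = rows[i : i + batch_size]
        cur.executemany(
            "INSERT INTO kv (k, v) VALUES (?, ?) "
            "ON CONFLICT (k) DO UPDATE SET v = excluded.v",
            batch,
        )
        if commit_per_batch:
            # Each batch becomes durable/visible immediately, but a mid-run
            # failure leaves the table partially written.
            conn.commit()
    if not commit_per_batch:
        # All-or-nothing: one transaction, but it stays open for the whole
        # loop, which can be a long time for large feature views.
        conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v TEXT)")
write_batches(conn, [(i, str(i)) for i in range(10)], batch_size=3, commit_per_batch=True)
print(conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0])  # 10
```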
@jeremyary jeremyary added the kind/question Further information is requested label Mar 23, 2024
@tokoko
Collaborator

tokoko commented Mar 23, 2024

@job-almekinders thanks for raising this. Using multiple commits (while not ideal) is not the end of the world here. Note that, depending on which materialisation engine you're using, you might still end up with multiple commits even if we pull the commit out of the loop (for example, with Spark or Bytewax).

Having said that, I think it's impossible to make that change now anyway. A batch size of 5000 is a really small number for even the most common use cases. I have never measured this in practice, but I suspect that for feature views with several million entities and tens of features, the whole process probably takes ages. Leaving the transaction uncommitted for that long is not ideal either.

We could make the batch size configurable first, but increasing it is also risky. If I'm not mistaken (?), execute_values generates a single SQL string with all the values inlined as tuples. With a high enough batch size, the size of that SQL query could quickly get out of hand... So maybe it's time to give up and try adbc instead? Bulk writes with adbc look a lot more promising. What do you think?

@job-almekinders
Contributor Author

job-almekinders commented Mar 25, 2024

Thank you for your reply and additional information, this helps a lot :)

I think in the short term we could indeed make batch_size configurable, to give a bit more control. We could also add documentation in the docstring and/or a log statement about the risks of setting a very large batch size.
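A rough sketch of what that could look like. The names here (resolve_batch_size, LARGE_BATCH_THRESHOLD, chunks) are illustrative placeholders, not Feast's actual config surface, and the warning threshold is arbitrary:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

DEFAULT_BATCH_SIZE = 5000
LARGE_BATCH_THRESHOLD = 100_000  # arbitrary guard-rail for this sketch

def resolve_batch_size(configured: Optional[int]) -> int:
    """Return the configured batch size, falling back to the default,
    and warn when the value looks risky."""
    batch_size = configured if configured is not None else DEFAULT_BATCH_SIZE
    if batch_size > LARGE_BATCH_THRESHOLD:
        # execute_values inlines every value tuple into a single SQL
        # statement, so very large batches can produce an enormous query.
        logger.warning(
            "batch_size=%d may generate a very large SQL statement", batch_size
        )
    return batch_size

def chunks(values, batch_size):
    """Yield successive batch_size-sized slices of values."""
    for i in range(0, len(values), batch_size):
        yield values[i : i + batch_size]
```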

The bulk writes from adbc look very promising indeed! It would be very interesting to try them out.

@lokeshrangineni
Contributor

@job-almekinders - Would you like to create a PR to make this configurable?

@job-almekinders
Contributor Author

Hey @lokeshrangineni - I'm currently occupied with other work, unfortunately. My focus might shift back to this topic in some time, so I might open a PR then!

Feel free to open one yourself in the meantime if you need this change earlier :)
