Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python DeltaTable does not support writes in multiple threads #2958

Open
PeterKeDer opened this issue Oct 23, 2024 · 4 comments
Open

Python DeltaTable does not support writes in multiple threads #2958

PeterKeDer opened this issue Oct 23, 2024 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@PeterKeDer
Copy link
Contributor

PeterKeDer commented Oct 23, 2024

Environment

Delta-rs version: 0.20.1

Binding: python

Environment:

  • Cloud provider:
  • OS: macOS
  • Other:

Bug

What happened:

Using the same DeltaTable across multiple threads in write_deltalake causes the error

RuntimeError: Already borrowed

This is not an issue if we use the table URI directly.

What you expected to happen:

Writes should work in a multithreaded environment.

How to reproduce it:

from concurrent.futures import ThreadPoolExecutor
from deltalake import DeltaTable, write_deltalake
import polars as pl

path = 'some local path'
table = pl.DataFrame({'a': [1,2,3]}).to_arrow()
write_deltalake(path, table)

dt = DeltaTable(path)

with ThreadPoolExecutor() as exe:
    list(exe.map(lambda _: write_deltalake(dt, table, mode='append'), range(2)))

produces

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/peter.ke/.pyenv/versions/3.10.13/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/Users/peter.ke/.pyenv/versions/3.10.13/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/Users/peter.ke/.pyenv/versions/3.10.13/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/Users/peter.ke/.pyenv/versions/3.10.13/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/Users/peter.ke/.pyenv/versions/3.10.13/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "<stdin>", line 2, in <lambda>
  File "/Users/peter.ke/.pyenv/versions/3.10.13/lib/python3.10/site-packages/deltalake/writer.py", line 302, in write_deltalake
    table.update_incremental()
  File "/Users/peter.ke/.pyenv/versions/3.10.13/lib/python3.10/site-packages/deltalake/table.py", line 1258, in update_incremental
    self._table.update_incremental()
RuntimeError: Already borrowed

More details:

Cause of error:

  • write_deltalake calls write_to_deltalake from rust to write, which borrows the RawDeltaTable immutably while releasing the GIL. The immutable borrow means theoretically writes in multiple threads should be fine
  • However, write_deltalake also calls table.update_incremental(), which borrows the RawDeltaTable mutably
  • Since GIL was released in write_to_deltalake, another thread running write_deltalake will try to mutably borrow the previously immutably borrowed RawDeltaTable and fails

From my brief investigation, I see two potential solutions:

  • The Rust API exposed to python could benefit from being refactored to be more immutable so it fits better in a multi-threaded environment. e.g. on the rust side, update_incremental can return a new DeltaTable rather than mutating the existing one- similarly for other internals like DeltaTableState or Snapshot.

    This should be similar performance/memory wise, since cloning a Snapshot seems relatively cheap given that the bulk files uses arrow. However, this looks like a large-ish refactor because a lot of the internals expose mutable APIs.

  • Remove update_incremental from write_deltalake (or add a flag to disable it). We'd just need to ensure write_to_deltalake returns an updated RawDeltaTable so table._table can kept updated.

@PeterKeDer PeterKeDer added the bug Something isn't working label Oct 23, 2024
@PeterKeDer
Copy link
Contributor Author

Actually the easiest solution might be to just clone the RawDeltaTable before we pass it into write_to_deltalake so it is not borrowed.

@brurucy
Copy link

brurucy commented Oct 23, 2024

I believe you could get around it with multiprocessing (using spawn).

@PeterKeDer
Copy link
Contributor Author

I believe you could get around it with multiprocessing (using spawn).

True, that does work. DeltaTable is not picklable though so it's a bit annoying. Since write_deltalake releases the GIL, I believe like the intention is to support multithreaded writes.

@houqp
Copy link
Member

houqp commented Nov 8, 2024

I think it's best to make the update_incremental call optional, seems unnecessary to always incur this overhead in the circumstances when the caller already know it has an up-to-date table instance.

@rtyler rtyler self-assigned this Nov 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants