Skip to content

Conversation

@bveeramani
Copy link
Member

@bveeramani bveeramani commented Aug 17, 2023

Why are these changes needed?

Writing data back to databases is common for many applications like LLMs. For example, you might want to write vector indices back to a database like https://github.com/pgvector/pgvector. To support this use case, this PR adds an API to write Datasets to SQL databases.

Related issue number

Closes #38242

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
@bveeramani bveeramani marked this pull request as draft August 17, 2023 03:23
Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks solid to me as a starting point. Can you publish and trigger the wheel build, so we can test it out in real workload?

@bveeramani bveeramani marked this pull request as ready for review August 17, 2023 06:34
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
…/ray into bveeramani/write-sql

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
GokuMohandas added a commit to ray-project/llm-applications that referenced this pull request Aug 18, 2023
Use ray-project/ray#38544 to write out the
embeddings. This has some advantages: The user doesn't need to specify
the parallelism, they don't need to trigger the computation with
`.count()`, since a write is an action -- so this is less error prone.
It is also less code.

We should wait merging the PR until Ray 2.7 is released with
ray-project/ray#38544 merged.

Signed-off-by: Goku Mohandas <gokumd@gmail.com>
Co-authored-by: Goku Mohandas <gokumd@gmail.com>
@pcmoritz
Copy link
Contributor

pcmoritz commented Aug 19, 2023

So this is now working for me, thanks a lot for fixing it!

One thing we should improve before merging it is the error handling. I noticed if the table does not exist, it will just silently fail and write nothing which is not good and can lead to data loss. To fix it I think we should do two things: Improve the error handling and catch exceptions better, and also potentially return some info about how many rows have been written (e.g. in the form of a status / info dict that we could add more stuff to in the future). The former we should do for sure, the latter only if it doesn't involve too many changes for now :)

Here is a repro of the silent failure in sqlite3 -- let's also add a test for it:

In [1]: import sqlite3

In [2]: import ray

In [3]: connection = sqlite3.connect("/tmp/db.sqlite")

In [4]: dataset = ray.data.from_items(
   ...:         [{"string": "spam", "number": 0}, {"string": "ham", "number": 1}]
   ...:     )

In [5]: dataset.write_sql(
   ...:         "INSERT INTO test VALUES(?, ?)", lambda: sqlite3.connect("/tmp/db.sqlite"))

Add chunking

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Remove extra file

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
@bveeramani
Copy link
Member Author

@pcmoritz fixed the issue with error messages. Here's what the error looks like now:

import sqlite3
import ray

connection = sqlite3.connect("/tmp/eggs.db")
dataset = ray.data.range(1)

dataset.write_sql(
    "INSERT INTO test VALUES(?)", lambda: sqlite3.connect("/tmp/eggs.db")
)
Traceback (most recent call last):
  ...
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/datasource/sql_datasource.py", line 55, in write
    cursor.executemany(sql, values)
sqlite3.OperationalError: no such table: test

Add docstring

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Address review comments

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Remove max_rows_per_write parameter

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
@bveeramani bveeramani merged commit ef0e80b into master Aug 23, 2023
@bveeramani bveeramani deleted the bveeramani/write-sql branch August 23, 2023 00:24
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Writing data back to databases is common for many applications like LLMs. For example, you might want to write vector indices back to a database like https://github.com/pgvector/pgvector. To support this use case, this PR adds an API to write Datasets to SQL databases.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Writing data back to databases is common for many applications like LLMs. For example, you might want to write vector indices back to a database like https://github.com/pgvector/pgvector. To support this use case, this PR adds an API to write Datasets to SQL databases.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
bveeramani added a commit that referenced this pull request Oct 18, 2023
Dataset.write_sql was added a while ago in #38544, but it wasn't documented in the API reference. This PR adds it.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
PatchouliTIS pushed a commit to PatchouliTIS/RAG-Bot that referenced this pull request Jul 9, 2024
Use ray-project/ray#38544 to write out the
embeddings. This has some advantages: The user doesn't need to specify
the parallelism, they don't need to trigger the computation with
`.count()`, since a write is an action -- so this is less error prone.
It is also less code.

We should wait merging the PR until Ray 2.7 is released with
ray-project/ray#38544 merged.

Signed-off-by: Goku Mohandas <gokumd@gmail.com>
Co-authored-by: Goku Mohandas <gokumd@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Data] Support write_sql()

5 participants