Daft Equivalent Syntax for Pandas Dataframe.apply(axis=1) #1041

dendihandian · 2023-06-13T21:52:42Z

A straightforward example: df = df.apply(lambda row: processing_row(row, sqlite_conn), axis=1)

How can I do this in Daft?


jaychia · 2023-06-13T22:40:41Z · Maintainer

Thanks for the question @dendihandian!

As an aside (unrelated to your question about the API itself), it might be a lot easier to read the entire SQLite table as a Daft DataFrame. Then you would not have to rely on a custom Python function to query SQLite, and what you want is probably closer to a .join between your SQLite table and your dataframe. But I digress!
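To make that aside concrete, here is a minimal sketch of reading a whole SQLite table into per-column lists using only the standard library. The helper name load_table_as_columns and the in-memory sample rows are assumptions for illustration; only the table name and the id and name columns come from this thread.

```python
import sqlite3


def load_table_as_columns(conn: sqlite3.Connection, table: str) -> dict[str, list]:
    """Read every row of `table` and return it as {column_name: [values...]}."""
    # Table names cannot be bound as SQL parameters, so quote the identifier
    # (doubling any embedded quotes) instead of interpolating it raw.
    quoted = '"' + table.replace('"', '""') + '"'
    cur = conn.execute(f"SELECT * FROM {quoted}")
    column_names = [desc[0] for desc in cur.description]
    rows = cur.fetchall()
    # Transpose the list of row tuples into one list per column.
    return {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


# Throwaway in-memory database standing in for "my.db" (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "table" (id INTEGER PRIMARY KEY, name TEXT)')
conn.executemany('INSERT INTO "table" VALUES (?, ?)', [(1, "Ada"), (2, "Grace")])

columns = load_table_as_columns(conn, "table")
```

Assuming daft.from_pydict and DataFrame.join behave as documented, the resulting dict could then be turned into a dataframe with names_df = daft.from_pydict(columns) and combined with the original via df.join(names_df, on="id"), with no per-row Python function involved.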

To answer your question: there is no API for running a function over an entire row at a time. This is because Daft deliberately tracks exactly which columns each operation needs, and uses this information to optimize your work. Here are some ways of adapting your use case to canonical Daft code:

Applying a Python function on a single column

As an example, here is a processing_row function that takes a single value from the id column and fetches the matching item from a SQLite database:

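import sqlite3

import daft

def processing_row(id: int) -> str:
    # Opens a new connection on every call, i.e. once per row
    sqlite_conn = sqlite3.connect("my.db")
    try:
        # "table" is an SQL keyword, so it is quoted; the id is bound as a parameter
        row = sqlite_conn.execute('SELECT name FROM "table" WHERE id = ?', (id,)).fetchone()
        return row[0]
    finally:
        sqlite_conn.close()

df = df.with_column(
    "sqlite_data",
    df["id"].apply(processing_row, return_dtype=daft.DataType.string()),
)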

Further extensions

I would note a couple of problems with the previous approach:

The df["id"].apply(...) pattern only works with a single column. If you need to take multiple columns as input, you will need a UDF.
sqlite_conn is initialized on every row. If you instead want it to be initialized once and shared across multiple invocations, you will need a Stateful UDF.

Here is an example UDF that addresses both points and takes two columns (id and last_name) as input. It fetches each name from SQLite and appends last_name to it before returning the results as strings.

import sqlite3

import daft

@daft.udf(return_dtype=daft.DataType.string())
class GetFullName:
    def __init__(self):
        # Initialize this once, shared across multiple invocations
        self.sqlite_conn = sqlite3.connect("my.db")

    def __call__(self, ids: daft.Series, last_names: daft.Series) -> list[str]:
        cur = self.sqlite_conn.cursor()
        full_names = []
        for id, last_name in zip(ids.to_pylist(), last_names.to_pylist()):
            # Bind the id as a parameter; "table" is quoted because it is an SQL keyword
            name = cur.execute('SELECT name FROM "table" WHERE id = ?', (id,)).fetchone()[0]
            full_names.append(f"{name} {last_name}")
        return full_names

df = df.with_column("full_name", GetFullName(df["id"], df["last_name"]))

Note that this is really nice because:

All your state is now defined inside the UDF instead of being passed in as an argument. In your example code, sqlite_conn is likely a local variable in your notebook, which is very error-prone in a distributed setting.
With a UDF, you can pass in multiple columns as arguments. What's more, each call receives a batch of rows, which opens up optimizations if your code runs more efficiently on batches.
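As one illustration of the batch point, the per-row query could be replaced by a single query per batch. This is only a sketch using the standard library: the helper name fetch_names and the in-memory sample rows are assumptions, not part of Daft or of the original code.

```python
import sqlite3


def fetch_names(conn: sqlite3.Connection, ids: list[int]) -> list:
    """Return the name for each id in `ids`, in order (None if the id is absent)."""
    unique_ids = list(set(ids))
    if not unique_ids:
        return []
    # One "?" placeholder per distinct id keeps the query parameterized.
    placeholders = ", ".join("?" for _ in unique_ids)
    rows = conn.execute(
        f'SELECT id, name FROM "table" WHERE id IN ({placeholders})', unique_ids
    ).fetchall()
    name_by_id = dict(rows)
    # Map the results back onto the original order, including repeated ids.
    return [name_by_id.get(i) for i in ids]


# Throwaway in-memory database standing in for "my.db" (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "table" (id INTEGER PRIMARY KEY, name TEXT)')
conn.executemany('INSERT INTO "table" VALUES (?, ?)', [(1, "Ada"), (2, "Grace")])

names = fetch_names(conn, [2, 1, 2, 99])
```

Inside __call__, something like fetch_names(self.sqlite_conn, ids.to_pylist()) could then stand in for the per-row loop. SQLite caps the number of bound parameters per statement, so very large batches may need to be split into chunks.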

2 replies

dendihandian Jun 14, 2023
Author

Thanks for the answer @jaychia

Since Daft's df[col].apply() only works with a single column, I need to find another workaround. My sqlite_conn was a context manager object that kept the connection open until all the rows were processed, but I think I will initialize it on every row instead, since sharing it across multiple threads won't be possible.

jaychia Jun 14, 2023
Maintainer

Indeed! You wouldn't want to share your sqlite_conn across threads.

Your best option in this case would be to use the Stateful UDF example that I showed earlier! This will:

Ensure that you have one connection per thread instead of accidentally sharing the connection across threads
Allow you to run on multiple columns

Apologies, I also forgot to show you how to actually use that Stateful UDF! After you define your UDF, calling it on columns in your dataframe is really simple:

df = df.with_column("full_name", GetFullName(df["id"], df["last_name"]))
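For intuition about the "one connection per thread" idea, here is a small standard-library sketch that caches a connection in thread-local storage. It does not describe how Daft schedules UDFs internally; the helper get_connection and the worker function are hypothetical names used only for this illustration.

```python
import sqlite3
import threading

# Each thread sees its own independent attributes on this object.
_local = threading.local()


def get_connection(path: str) -> sqlite3.Connection:
    """Return the calling thread's connection to `path`, opening it on first use."""
    if not hasattr(_local, "conns"):
        _local.conns = {}
    if path not in _local.conns:
        _local.conns[path] = sqlite3.connect(path)
    return _local.conns[path]


seen = []


def worker() -> None:
    # Record which connection object this thread was handed.
    seen.append(get_connection(":memory:"))


threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

main_conn = get_connection(":memory:")
```

Repeated calls from the same thread reuse one connection, while each worker thread gets its own, which matters because the sqlite3 module by default refuses to use a connection from a thread other than the one that created it.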
