Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a pydough.from_string API so that PyDough code can be run programmatically #236

Open
knassre-bodo opened this issue Jan 29, 2025 · 0 comments
Labels
documentation Improvements or additions to documentation effort - medium mid-sized issue with average implementation time/difficulty enhancement New feature or request user feature Adding a new user-facing feature/functionality

Comments

@knassre-bodo
Copy link
Contributor

Goal: add a new PyDough api from_string that takes in 1+ lines of Python PyDough code, which can be programmatically generated by an LLM, and runs an equivalent of the pydough cell magic (without to_sql or to_df) to obtain a PyDough unqualified node corresponding to the code. Once this API is called, other apis like to_sql or to_df can then be called on it.

Specifications:

  • The first argument python_code can be a multiline string of Python statements. The last line should store the PyDough object for the final answer being sought in a variable.
  • The second optional argument is answer_variable, a string indicating the name that was used by python_code to store the result. The default value should be "result".
  • The code in python_code has any indentation of lines fixed, and should be verified (e.g. checks that answer_variable gets defined in the last line). The strictness of these checks, and some of the formatting changes, can perhaps be controlled by additional keyword arugments to from_sring.
  • The code in python_code is transformed using the AST visitor, in a manners similar to the transform_cell API used by the %%pydough Jupyter extension. The appropriate global/local namespaces should be used so that PyDough can figure out what is/isn't an undefined variable in the code.
  • The transformed code should then be executed with exec, using the real globals(), and either the real locals() or a new dictionary to mimic it (which could start out empty, or start as a shallow copy of locals()). The behavior with the local namespace argument is a key design consideration:
    • Option 1: If using the real locals(), that means that all local variables are accessible, but that all variables defined in the code will be defined afterward.
    • Option 2: If using an empty locals(), that means that no persistent mutation to the namespace occurs, but also that no variables from the local namespace can be used in the PyDough code.
    • Option 3: If using a shallow copy of locals, that means that no persistent mutation to the namespace occurs, but all variables from the existing local namespace can still be accessed by the PyDough code.
    • Option 4: Same as option 3, but persists the variable created to store the final answer (with name answer_variable) into the real locals(). This means that the final answer can be accessed later in the namespace, but no intermediary variables mutate the namespace.
    • Option 5: Some/all of options 1-4 are allowed, and controlled by an additional namespace keyword argument to from_string (one of the options is chosen as the default). Option 1 = "update", Option 2 = "empty", Option 3 = "snapshot", Option 4 = "". or they can pass in a dictionary directly (which can be one of the ones from options 1-3, or their own which they can selectively re-use between calls -> allows separation between Python variables vs PyDough variables).
  • Whichever local namespace dictionary was used in the previous step should now contain a variable with the name from the answer_variable argument (e.g. result, by default). This variable can be accessed by indexing into the dictionary, and it should be an unqualified node.
  • The unqualified node should be returned by from_string. Afterwards, other operations (explain, to_sql, to_df) can be called on the unqualified node.

This API needs to be clearly & thoroughly documented in the user documentation.

Example of what this would look like:

import pydough

pydough.active_session.load_metadata_graph(...)
pydough.active_session.connect_database(...)

# Example question (using TPC-H graph): which 5 suppliers made the most revenue in 1996?
# (can be generated from an LLM API call)
pydough_code = """
# The revenue generated by a specific lineitem
line_revenue = extended_price * (1 - discount)

# The lineitem purchases for each supplier that were ordered in 1996
lines_96 = lines.WHERE(YEAR(order.order_date) == 1996)

# For each supplier, list their name & revenue from 1996
supplier_info = Suppliers(name, revenue96=SUM(lines_96(rev=line_revenue).rev))

# Pick the 5 suppliers with the highest from 1996
result = supplier_info.TOP_K(5, by=revenue.DESC())
"""

# This is an unqualified node
top_suppliers_96 = pydough.from_string(pydough_code)

print(pydough.explain(top_suppliers_96))
print(pydough.to_sql(top_suppliers_96))
print(pydough.to_df(top_suppliers_96))

And an example where a name other than result is used:

import pydough

pydough.active_session.load_metadata_graph(...)
pydough.active_session.connect_database(...)

# Example question (using TPC-H graph): how many parts contain "almond" in the name?
# (can be generated from an LLM API call)
pydough_code = """
# All parts that contain almond in the name
almond_parts = Parts.WHERE(name == "amond")

# Count how many `almond_parts` exist
my_answer = TPCH(n_almond_parts=COUNT(almond_parts))
"""

# Notice that the argument `"my_answer"` is also the variable name used to store the answer
# in the last line of `pydough_code`, as opposed to `"result"`.
# `almond_part_info` is an unqualified node
almond_part_info = pydough.from_string(pydough_code, answer_variable="my_answer")

print(pydough.explain(almond_part_info))
print(pydough.to_sql(almond_part_info))
print(pydough.to_df(almond_part_info))
@knassre-bodo knassre-bodo added documentation Improvements or additions to documentation effort - medium mid-sized issue with average implementation time/difficulty enhancement New feature or request user feature Adding a new user-facing feature/functionality labels Jan 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation effort - medium mid-sized issue with average implementation time/difficulty enhancement New feature or request user feature Adding a new user-facing feature/functionality
Projects
None yet
Development

No branches or pull requests

1 participant