
Some import/module adjustments #992

Merged: 16 commits, Jul 19, 2021
Conversation

@stinodego (Member) commented on Jul 15, 2021

Ok, I promise this is my last unsolicited refactor 😬

I noticed an issue in the current code base with the star imports in the `__init__.py` files. They would bring up not only the objects (like `DataFrame`) but also the modules (like `frame`). This caused a clash: `polars.frame` could point to either the eager or the lazy version of the module (in practice it pointed to the lazy version, but which one wins is not obvious). This is not transparent, so I fixed it.

I also moved some other stuff around; possibly breaking backwards compatibility. Please read carefully. Maybe I need to put some stuff back / deprecate it instead. Let me know.

Changes:

  • Added `__all__` dunders to the `__init__.py` files to solve the problem mentioned above.
  • This caused some issues with mypy, so I had to explicitly import `Series`, `DataFrame`, etc. in the top `__init__.py` (I filed a mypy issue for this... I believe there's a bug).
  • Removed the `dtype_to_int` function in the `datatypes` module. It was unused internally, and I see no use for this function.
  • Removed the `from_rows` function. Decided this is too soon/rigorous. Let's improve the `DataFrame` constructor first. I did adjust the tests to use the source function (`DataFrame.from_rows`) instead.
  • I tried hiding the wrapping functions (`wrap_s`, `wrap_df`, etc.) from the main scope, as I believe users should never have to use these explicitly. But the Rust backend expects those functions to be there, so I left them for now; something for the future maybe (possibly just rename them with a leading underscore).
  • Renamed `lazy/expr_functions` to `lazy/functions` (now possible thanks to the fix for the first issue mentioned above). This conforms to the syntax people know from pyspark: use `from polars.lazy import functions as F` and then `F.col`, `F.sum`, etc.
  • Split up `functions.py` into `io.py` (for all the read functions like `read_csv`, etc.) and `eager/functions.py` (for `concat`, `get_dummies`, etc.).
  • Moved `StringCache` and `toggle_string_cache` to their own file.
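The star-import leak described above can be reproduced outside polars. The sketch below (package names `leaky_pkg` and `clean_pkg` are invented for the demo, not real polars modules) builds two tiny packages on disk: one whose inner `__init__.py` lacks `__all__`, and one that restricts it. The first leaks its `frame` submodule into the top-level namespace, exactly the clash the PR fixes.

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
sys.path.insert(0, str(root))

def make_pkg(name: str, sub_init: str) -> None:
    """Write a tiny package: <name>/sub/frame.py defines DataFrame."""
    sub = root / name / "sub"
    sub.mkdir(parents=True)
    (root / name / "__init__.py").write_text(f"from {name}.sub import *\n")
    (sub / "__init__.py").write_text(sub_init.format(name=name))
    (sub / "frame.py").write_text("class DataFrame:\n    pass\n")

# Without __all__, `from <name>.sub import *` copies every public name in
# sub's namespace -- including the submodule object `frame` itself, which
# the import machinery bound to sub as a side effect.
make_pkg("leaky_pkg", "from {name}.sub.frame import *\n")
# With __all__, only the listed names are copied by the star import.
make_pkg("clean_pkg", "from {name}.sub.frame import *\n__all__ = ['DataFrame']\n")

importlib.invalidate_caches()
import leaky_pkg
import clean_pkg

print(hasattr(leaky_pkg, "frame"))      # the submodule leaked to top level
print(hasattr(clean_pkg, "frame"))      # __all__ keeps the submodule out
print(hasattr(clean_pkg, "DataFrame"))  # the class still comes through
```

In polars' case, two packages (eager and lazy) each leaked a `frame` submodule into the same top-level namespace, so whichever star import ran last silently won.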

@ritchie46 (Member):

I am bouldering in the woods, so I will take a close look after this weekend. Some initial thoughts:

All DataFrame constructors except `__init__` are invoked from the `pl` namespace. Think `pl.read_csv`, `pl.scan_csv`, `pl.from_arrow`, etc. In that light, `pl.from_rows` is preferred over creation from `DataFrame`.

For functions, I plan to make them all work in eager and lazy alike. So maybe we should create a top-level `functions` module and add the hybrid functions there.

@stinodego (Member, Author):

> I am bouldering in the woods, so I will take a close look after this weekend. Some initial thoughts:
>
> All DataFrame constructors except `__init__` are invoked from the `pl` namespace. Think `pl.read_csv`, `pl.scan_csv`, `pl.from_arrow`, etc. In that light, `pl.from_rows` is preferred over creation from `DataFrame`.
>
> For functions, I plan to make them all work in eager and lazy alike. So maybe we should create a top-level `functions` module and add the hybrid functions there.

Enjoy your trip!

Regarding `from_rows`: I do see a distinction between `read_csv` (filesystem I/O) and `from_rows` (takes a Python object). `from_arrow` would actually be in the same category as `from_rows`.

I do agree that the distinction between the lazy and the eager module feels quite arbitrary.

Maybe park this MR for now and take a moment next week to just think through what the ideal module setup would look like?

@ritchie46 (Member):

> Maybe park this MR for now and take a moment next week to just think through what the ideal module setup would look like?

I think we can still merge this work in if we keep a top-level `functions` module. I also think the `string_cache` module should be top level, as it does not make sense to speak of it in an eager or lazy manner.

Great work, I do think it's an improvement. :)

@stinodego (Member, Author):

> > Maybe park this MR for now and take a moment next week to just think through what the ideal module setup would look like?
>
> I think we can still merge this work in if we keep a top-level `functions` module. I also think the `string_cache` module should be top level, as it does not make sense to speak of it in an eager or lazy manner.
>
> Great work, I do think it's an improvement. :)

All right, I will shuffle things around and make that happen. Probably tonight or tomorrow.

@stinodego (Member, Author):

All right, updated the MR with the points we discussed. I still have all the file I/O functions in a separate `io.py`; I wasn't sure if you wanted those back in `functions.py`. In any case, they're available from the top level as `pl.read_csv` etc.
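Keeping those entry points stable after moving code relies on explicit re-exports in the package `__init__`. A toy sketch of the pattern (the `mini_polars` modules and the `read_csv` body are stand-ins for illustration, not actual polars code):

```python
import types

# Stand-in "io" submodule: functions live here after the split.
io_mod = types.ModuleType("mini_polars.io")

def read_csv(path):
    # Placeholder body; the real reader would parse a CSV file.
    return f"<frame from {path}>"

io_mod.read_csv = read_csv

# Stand-in top-level package: re-export the function explicitly so it
# stays reachable as pl.read_csv, and pin __all__ to the public API so
# star imports don't drag the io submodule along.
pkg = types.ModuleType("mini_polars")
pkg.read_csv = io_mod.read_csv
pkg.__all__ = ["read_csv"]

print(pkg.read_csv("data.csv"))  # <frame from data.csv>
```

The explicit assignment (rather than a star import) is what keeps mypy happy and the top-level namespace predictable.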

```diff
@@ -700,12 +700,14 @@ def test_to_json(df):


 def test_from_rows():
-    df = pl.from_rows([[1, 2, "foo"], [2, 3, "bar"]], column_name_mapping={1: "foo"})
+    df = pl.DataFrame.from_rows(
```
Member:

This will also work as pl.from_rows right?

Member (Author):

Yes it will! I figured the `test_from_rows` function here should test the actual `from_rows` method attached to the `DataFrame`, not the functions version (since it's `test_frame.py`). The other `from_rows` function is tested earlier in the file, in `test_concat`.

Member:

Great! Then I merge it in. Thanks again. :)

@ritchie46 ritchie46 merged commit 6f80cf8 into pola-rs:master Jul 19, 2021
ritchie46 pushed a commit that referenced this pull request on Jul 20, 2021

ritchie46 pushed a commit that referenced this pull request on Jul 20, 2021