Table profile added #168

Merged
merged 35 commits into from
Mar 20, 2023

Conversation

@yafimvo yafimvo commented Feb 27, 2023

Describe your changes

Adds a profile command to the %sqlcmd magic.

How to use it:
%sqlcmd profile -t table_name
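
To give a rough idea of what a table profile reports, the sketch below computes a few per-column summary statistics. It is not the code from this PR: `profile_table`, the `numbers` table, and the chosen statistics are assumptions made for illustration, and the standard-library sqlite3 module is used only to keep the example self-contained.

```python
import sqlite3


def profile_table(conn, table):
    """Return {column: {stat: value}} for every column in `table`.

    Hypothetical helper; the statistics chosen here are illustrative only.
    """
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    profile = {}
    for col in columns:
        count, unique, minimum, maximum, mean = conn.execute(
            f'SELECT COUNT("{col}"), COUNT(DISTINCT "{col}"), '
            f'MIN("{col}"), MAX("{col}"), AVG("{col}") FROM "{table}"'
        ).fetchone()
        profile[col] = {
            "count": count,
            "unique": unique,
            "min": minimum,
            "max": maximum,
            "mean": mean,
        }
    return profile


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (x INTEGER, y REAL)")
conn.executemany(
    "INSERT INTO numbers VALUES (?, ?)", [(1, 2.5), (2, 3.5), (2, 4.5)]
)
result = profile_table(conn, "numbers")
print(result["x"])
```

Each column maps to its own dictionary of statistics, which is the shape a display layer would need in order to render one table column per profiled column.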

Issue ticket number and link

Closes #66

Checklist before requesting a review

  • I have performed a self-review of my code
  • I have added thorough tests (when necessary).
  • I have added the right documentation in the docstring and changelog (when needed)

📚 Documentation preview 📚: https://jupysql--168.org.readthedocs.build/en/168/

@sync-by-unito sync-by-unito bot closed this Feb 27, 2023
@edublancas edublancas reopened this Feb 27, 2023
@edublancas

is this ready for review?

@yafimvo
Author

yafimvo commented Mar 2, 2023

Yes

@edublancas

I also noticed that the numbers in the profiling table are too long. let's implement a custom display format that shortens them using scientific notation. @neelasha23 implemented something like that for sklearn-evaluation's interactive confusion matrix, maybe we can re-use the code?

@yafimvo
Author

yafimvo commented Mar 5, 2023

> I also noticed that the numbers in the profiling table are too long. let's implement a custom display format that shortens them using scientific notation. @neelasha23 implemented something like that for sklearn-evaluation's interactive confusion matrix, maybe we can re-use the code?

I found this code. It has two parts: the first (convert_to_scientific) takes data in a specific key-value format and checks whether each value is a number; the second (_is_long_number) checks the number's length and formats it using NumPy.

We can use _is_long_number as is, but convert_to_scientific is a bit different. In the meantime, I added both of them to util in jupysql (with some modifications to convert_to_scientific).

@edublancas
Do you think we should move _is_long_number to ploomber_core?

@edublancas

> I found this code. It has two parts: the first (convert_to_scientific) takes data in a specific key-value format and checks whether each value is a number; the second (_is_long_number) checks the number's length and formats it using NumPy.
>
> We can use _is_long_number as is, but convert_to_scientific is a bit different. In the meantime, I added both of them to util in jupysql (with some modifications to convert_to_scientific).
>
> @edublancas Do you think we should move _is_long_number to ploomber_core?

the dependency on numpy is a problem here. it sounds like too much to add numpy just to use such a function. looks like we can do it without numpy as well: https://stackoverflow.com/a/69569277/709975

let's keep it here for now, we'll move it to core if we need it elsewhere

@yafimvo yafimvo requested a review from edublancas March 8, 2023 10:53

@edublancas edublancas left a comment

how should I interpret an empty table cell vs a cell with nan?

in the tutorial, I see this:

image

@@ -106,13 +106,18 @@ def __init__(self, sqlaproxy, config):
         self.keys = {}
         if sqlaproxy.returns_rows:
             self.keys = sqlaproxy.keys()
-            if config.autolimit:
+            if isinstance(config.autolimit, int) and config.autolimit > 0:

why do we need this? this would break existing compatibility (setting autolimit to None or to 0 should display all values) - it's a bit counterintuitive but we inherited this behavior from ipython-sql

Author

Since I removed the hardcoded configuration from SqlCmdMagic there are some missing default configurations (autolimit and style). In this case, config.autolimit and config.style return <LazyConfigValue>.

I added the config.autolimit > 0 since according to one of the tests if autolimit is 0 we should return everything.
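
The difference between the two checks can be sketched in isolation. In the example below, `LazyPlaceholder` is a hypothetical stand-in for an unset configuration value (it is not traitlets' actual class), and `effective_limit` is an illustrative helper, not a function from jupysql. A plain object is truthy by default, so a bare `if autolimit:` would treat the placeholder as a limit, while the stricter check falls back to "fetch everything".

```python
class LazyPlaceholder:
    """Hypothetical stand-in for a configuration value that was never set."""


def effective_limit(autolimit):
    """Return a positive row limit, or None to mean 'fetch everything'."""
    # Mirrors the check in the diff: only positive integers count as a limit
    if isinstance(autolimit, int) and autolimit > 0:
        return autolimit
    return None


def fetch(rows, autolimit):
    """Apply the limit to an in-memory list of rows."""
    limit = effective_limit(autolimit)
    return list(rows) if limit is None else list(rows)[:limit]


rows = [1, 2, 3, 4]
print(fetch(rows, 2))                  # [1, 2]
print(fetch(rows, 0))                  # [1, 2, 3, 4]
print(fetch(rows, None))               # [1, 2, 3, 4]
print(fetch(rows, LazyPlaceholder()))  # [1, 2, 3, 4]
```

With this check, None and 0 keep their inherited meaning of "no limit", and an unset placeholder is handled the same way instead of being passed on as a row count.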

ah, I see what you're saying! we have a design problem here.

the configuration is attached to the %sql magic, but ideally we want the config to be accessible to all magics. I'm unsure if this is possible so I think for now the best thing to do is to create another version of run that doesn't take the config argument (we won't be able to provide the autolimit feature when running the profiling but that's fine)

I remember suggesting creating a raw_run function already but I can't remember if it was in the ggplot PR or in a different one that @tonykploomber is working on.
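
A config-free execution helper along these lines might look like the following sketch. The signature and return value of `raw_run` are assumptions (the thread only names the function), and sqlite3 stands in for the project's real connection layer to keep the example self-contained.

```python
import sqlite3


def raw_run(conn, sql, parameters=()):
    """Execute `sql` and return (column_names, rows) without reading any config."""
    cursor = conn.execute(sql, parameters)
    # cursor.description is None for statements that return no rows
    columns = [d[0] for d in cursor.description] if cursor.description else []
    return columns, cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
columns, rows = raw_run(conn, "SELECT a FROM t ORDER BY a")
print(columns, rows)  # ['a'] [(1,), (2,), (3,)]
```

Because the helper never looks at a config object, any magic can call it, at the cost of features such as autolimit that depend on the configuration.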

We should also probably open an issue to research whether magics allow setting global variables.

Comment on lines +115 to +120
+            _style = None
+            if isinstance(config.style, str):
+                _style = prettytable.__dict__[config.style.upper()]
+
             self.pretty = PrettyTable(
-                self.field_names, style=prettytable.__dict__[config.style.upper()]
+                self.field_names, style=_style

why was this changed?

looks like this is fixing a bug (since the _style variable wasn't used?)

@@ -70,3 +72,117 @@ def test_columns_with_schema(ip, tmp_empty):
).result._repr_html_()

assert "some_number" in out


def test_table_profile(ip, tmp_empty):

can you add some tests to the integration testing file? we should check if this works with other databases, we know it doesn't work with sqlite so we can ignore it.

but we should check the other ones. we can mark the tests that don't pass as xfail and fix them later

Author

Added a test for each database, covering only the profile fields relevant to it (DuckDB and PostgreSQL should work with all fields).

resolving.

@yafimvo
Author

yafimvo commented Mar 8, 2023

> how should I interpret an empty table cell vs a cell with nan?
>
> in the tutorial, I see this:
>
> image

This happened because the code tried to run stdev on non-numeric values (DateTime); the resulting exception was vague (sqlalchemy.exc.ProgrammingError), so those cells were skipped and left empty.

Changed it.
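
One way to avoid an ambiguous empty cell is to decide explicitly when a statistic is undefined instead of relying on a database exception. The sketch below does this in plain Python; `column_std` and the use of None as the "not applicable" marker are assumptions for illustration, and the change made in this PR may work differently.

```python
import statistics
from datetime import datetime


def column_std(values):
    """Return the sample standard deviation, or None when it is not defined."""
    present = [v for v in values if v is not None]
    # bool is a subclass of int; exclude it so flags are not treated as numbers
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    # The sample standard deviation needs at least two numeric data points
    if not numeric or len(present) < 2:
        return None
    return statistics.stdev(present)


print(column_std([1, 2, 3]))                                    # 1.0
print(column_std([datetime(2023, 1, 1), datetime(2023, 1, 2)]))  # None
```

Returning one well-defined marker for every "not applicable" case lets the display layer render those cells consistently.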

@idomic

idomic commented Mar 13, 2023

What else is missing here?
Please resolve the discussions/questions you have already answered.

@yafimvo yafimvo requested a review from edublancas March 13, 2023 20:21

@edublancas edublancas left a comment

added some comments

@idomic

idomic commented Mar 18, 2023

@yafimvo please resolve conflicts and pending issues so we can merge.

I think the only one left is this run function alias without the conf arg.

@yafimvo yafimvo requested a review from edublancas March 20, 2023 16:32
@edublancas edublancas merged commit 55ed866 into ploomber:master Mar 20, 2023
Successfully merging this pull request may close these issues.

proposal: %sqlcmd profile