FIX: dist_plot only displaying first numerical column #242

akanz1
Owner

@akanz1 akanz1 commented Jul 28, 2024

Summary by Sourcery

This pull request addresses a bug in the dist_plot function where only the first numerical column was displayed. It also includes several code refactorings for improved readability, updates to pre-commit hooks, enhanced test coverage, and the removal of an obsolete configuration file.

  • Bug Fixes:
    • Fixed an issue in dist_plot where only the first numerical column was displayed. The fix adds a check for numeric columns and returns None if none are found.
  • Enhancements:
    • Refactored multiple functions to improve readability by consolidating multi-line statements into single lines where appropriate.
    • Added asterisks to function parameters to enforce keyword-only arguments in several functions for better clarity and usage.
  • Build:
    • Updated pre-commit hooks to newer versions: pre-commit-hooks to v4.6.0 and ruff-pre-commit to v0.5.5.
  • Tests:
    • Enhanced test coverage for _corr_selector by adding new assertions.
    • Refactored test cases to improve readability and maintainability by consolidating multi-line assertions into single lines.
  • Chores:
    • Removed the readthedocs.yml file as it is no longer needed.
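
As a rough illustration of the keyword-only change listed under Enhancements, here is a minimal sketch. The function clean_names is a hypothetical stand-in and not klib's implementation; only the bare * in the signature mirrors what this PR does to clean_column_names.

```python
from __future__ import annotations


def clean_names(names: list[str], *, hints: bool = True) -> list[str]:
    """Hypothetical stand-in: lower-case names and replace spaces with underscores."""
    cleaned = [name.strip().lower().replace(" ", "_") for name in names]
    if hints:
        # Report how many names were changed, loosely mimicking a "hints" flag.
        changed = sum(old != new for old, new in zip(names, cleaned))
        print(f"{changed} name(s) changed")
    return cleaned


# Passing the flag by keyword works as before.
result = clean_names(["First Name", "age"], hints=False)

# Passing the flag positionally is rejected once the bare * is in place.
try:
    clean_names(["First Name", "age"], False)
except TypeError as err:
    positional_error = str(err)
```

The practical effect is that existing callers passing the flag positionally would need to switch to the keyword form.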

@akanz1 akanz1 added the bug Something isn't working label Jul 28, 2024
@akanz1 akanz1 self-assigned this Jul 28, 2024
Contributor

sourcery-ai bot commented Jul 28, 2024

Reviewer's Guide by Sourcery

This pull request addresses the issue of dist_plot only displaying the first numerical column by adding a check for empty numeric columns. Additionally, it includes several refactorings for better readability, updates to pre-commit hooks, and improvements to test cases.
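
The numeric-column check described above might look roughly like the sketch below. The helper numeric_columns_or_none is hypothetical and is not taken from klib's source; it only shows one way to select every numeric column with pandas and return None when there are none.

```python
from __future__ import annotations

import pandas as pd


def numeric_columns_or_none(data: pd.DataFrame) -> pd.DataFrame | None:
    """Hypothetical helper: return the numeric columns, or None if there are none."""
    numeric = data.select_dtypes(include="number")
    if numeric.empty:
        # Nothing to plot, so signal that to the caller instead of raising.
        return None
    return numeric


df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3], "c": ["x", "y", "z"]})
selected = numeric_columns_or_none(df)
plotted = list(selected.columns)  # every numeric column, not just the first

no_numeric = numeric_columns_or_none(pd.DataFrame({"c": ["x", "y", "z"]}))
```

A plotting function built on such a helper would iterate over all returned columns and exit early when it receives None.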

File-Level Changes

Files | Changes
src/klib/clean.py, src/klib/describe.py, src/klib/utils.py | Refactored code for better readability and added keyword-only argument markers (a bare * in the signature) in several functions.
tests/test_util.py, tests/test_clean.py, tests/test_describe.py | Refactored test assertions for better readability and added new test cases.
.pre-commit-config.yaml | Updated pre-commit hook versions and added new arguments for the ruff hook.

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey @akanz1 - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟡 Testing: 3 issues found
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


-                clean_column_names(self.df_clean_column_names).columns[i]
-                == expected_results[i]
-            )
+            assert clean_column_names(self.df_clean_column_names).columns[i] == expected_results[i]

suggestion (testing): Add tests for edge cases in clean_column_names

Consider adding tests for edge cases such as when the DataFrame has no columns, columns with special characters, or very long column names. This will help ensure that the clean_column_names function handles all possible scenarios.

-            assert (
-                convert_datatypes(self.df_data_convert).dtypes[i] == expected_results[i]
-            )
+            assert convert_datatypes(self.df_data_convert).dtypes[i] == expected_results[i]

suggestion (testing): Add tests for edge cases in convert_datatypes

It would be useful to add tests for edge cases such as when the DataFrame has mixed data types, missing values, or very large numbers. This will ensure that the convert_datatypes function is robust and handles all possible scenarios.

@@ -93,8 +93,7 @@ def test_output_type(self):
     def test_output_shape(self):
         # Test for output dimensions
         assert (
-            corr_mat(self.data_corr_df).data.shape[0]
-            == corr_mat(self.data_corr_df).data.shape[1]
+            corr_mat(self.data_corr_df).data.shape[0] == corr_mat(self.data_corr_df).data.shape[1]

suggestion (testing): Add tests for edge cases in corr_mat

Consider adding tests for edge cases such as when the input data is empty, contains NaNs, or has only one column. This will help ensure that the corr_mat function handles all possible scenarios.

Suggested change
- corr_mat(self.data_corr_df).data.shape[0] == corr_mat(self.data_corr_df).data.shape[1]
+ assert (
+     corr_mat(pd.DataFrame()).data.shape[0] == corr_mat(pd.DataFrame()).data.shape[1]
+ )
+ assert (
+     corr_mat(pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, None]})).data.shape[0] == corr_mat(pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, None]})).data.shape[1]
+ )
+ assert (
+     corr_mat(pd.DataFrame({"A": [1, 2, 3]})).data.shape[0] == corr_mat(pd.DataFrame({"A": [1, 2, 3]})).data.shape[1]
+ )

Comment on lines 56 to +57
         for i, _ in enumerate(expected_results):
-            assert (
-                clean_column_names(self.df_clean_column_names).columns[i]
-                == expected_results[i]
-            )
+            assert clean_column_names(self.df_clean_column_names).columns[i] == expected_results[i]

issue (code-quality): Avoid loops in tests. (no-loop-in-tests)

Explanation: Avoid complex code, like loops, in test functions.

Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To achieve that, avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests
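
Besides the fixes listed above (pytest.mark.parametrize being the usual route for parametrized tests), a loop over expected values can often be replaced by a single assertion on the whole sequence. The sketch below uses the standard unittest module; clean_names is a hypothetical stand-in for the function under test, not klib's clean_column_names.

```python
from __future__ import annotations

import unittest


def clean_names(names: list[str]) -> list[str]:
    """Hypothetical stand-in for the function under test."""
    return [name.strip().lower().replace(" ", "_") for name in names]


class TestCleanNames(unittest.TestCase):
    def test_all_names(self):
        names = ["First Name", " Age ", "zip code"]
        expected = ["first_name", "age", "zip_code"]
        # One assertion over the full list; on failure unittest reports the differing items.
        self.assertListEqual(clean_names(names), expected)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCleanNames)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

For a DataFrame result, the same idea would mean comparing the full list of column names with the expected list in one assertion, so the test body contains no loop or index bookkeeping.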

Comment on lines 219 to +220
         for i, _ in enumerate(expected_results):
-            assert (
-                convert_datatypes(self.df_data_convert).dtypes[i] == expected_results[i]
-            )
+            assert convert_datatypes(self.df_data_convert).dtypes[i] == expected_results[i]

issue (code-quality): Avoid loops in tests. (no-loop-in-tests)

Explanation: Avoid complex code, like loops, in test functions.

Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To achieve that, avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

@@ -41,7 +42,7 @@ def _optimize_floats(data: pd.Series | pd.DataFrame) -> pd.DataFrame:
     return data


-def clean_column_names(data: pd.DataFrame, hints: bool = True) -> pd.DataFrame:
+def clean_column_names(data: pd.DataFrame, *, hints: bool = True) -> pd.DataFrame:

issue (code-quality): Low code quality found in clean_column_names - 24% (low-code-quality)


Explanation: The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.
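
To make the guard-clause suggestion above concrete, here is a small generic sketch. Both functions are hypothetical examples unrelated to klib's code; they behave identically, but the second handles the edge cases first and returns early, which removes the nesting.

```python
from __future__ import annotations


def describe_nested(values: list[float] | None) -> str:
    """Nested version: the main logic sits two levels deep."""
    if values is not None:
        if len(values) > 0:
            return f"mean={sum(values) / len(values):.2f}"
        else:
            return "empty"
    else:
        return "missing"


def describe_guarded(values: list[float] | None) -> str:
    """Same behavior with guard clauses: edge cases exit first."""
    if values is None:
        return "missing"
    if not values:
        return "empty"
    return f"mean={sum(values) / len(values):.2f}"
```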

sonarcloud bot commented Jul 28, 2024

@akanz1 akanz1 merged commit 10c2af5 into main Jul 28, 2024
16 checks passed
@akanz1 akanz1 deleted the 241-bug-the-command-klibdist_plotdf-does-not-plot-the-distribution-for-all-the-numeric-features-of-a-dataframa branch July 28, 2024 13:24