Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create expect_column_set_to_be_unique_case_insensitive.sql #138

Merged
merged 7 commits into from
Feb 24, 2022

Conversation

agusfigueroa-htg
Copy link
Contributor

Compares if the unique values of a given column remain unique from a case insensitive perspective. Useful when dealing with CamelCase and lowercase uniqueness within a column. Example: the test would fail if "MasterCard" and "mastercard" are in the same column.

I don't know the guidelines of the commits, but I can take care of adding the corresponding links and description to the README.

@clausherther
Copy link
Contributor

Hi @agusfigueroa-htg, thanks for your PR! The use case here makes sense to me. However, I wonder if we could instead modify expect_column_values_to_be_unique to take in an optional boolean case_sensitive parameter that would accomplish the same thing?
We don't have strict contribution guidelines (yet), but it'd be great if you could ultimately add to/update the README. Also, if you could, please take a look at the existing macros for formatting guidelines. Thanks!

@agusfigueroa-htg
Copy link
Contributor Author

Hey @clausherther, thanks for reviewing this! Question on the comment, as I gave that some thought but still I don’t fully see it.

In this test the column values are not unique, I just want the set of unique values to remain unique case insensitive.

Some examples below

Example 1:
“iPhone”
“iPhone”
expect_column_values_to_be_unique fails
expect_column_set_to_be_unique_case_insensitive passes

Example 2:
“iPhone”
“iphone”
expect_column_values_to_be_unique passes
expect_column_set_to_be_unique_case_insensitive fails

expect_column_set_to_be_unique_case_insensitive doesn’t make much sense in my eyes as a case-sensitive test - because I am already considering the column set of values, which is unique by definition.

Is there any existing test that comes to your mind to integrate this use case? Otherwise, we can leave it standalone. Let me know what you think.

Regarding the next steps, I can adjust whatever is needed, format it as the other macros and add the documentation to the readme.

Thanks again! :)

@clausherther
Copy link
Contributor

clausherther commented Jan 14, 2022

@agusfigueroa-htg what I meant was, we could add a new optional parameter to expect_column_values_to_be_unique, case_sensitive=True and based on that parameter value, implement the logic needed to make the uniqueness test case insensitive in the expect_column_values_to_be_unique test, without needing a standalone test.
We're trying to avoid too many similar standalone tests. Does that make sense?

@clausherther clausherther added the enhancement New feature or request label Jan 14, 2022
@agusfigueroa-htg
Copy link
Contributor Author

@clausherther
That is feasible but would change the idea of the test in itself as it is not analyzing the uniqueness of the column, but analyzing if the set (unique values of the column) remain unique case insensitive.

Our use case, and what I wonder whether it would come in handy for further users, is to apply this to large tables where the column is not necessarily unique but you wanna ensure that there are no duplicates case insensitive. For instance, when picking the OS column of a sessions table I would expect values to show up there repeatedly (e.g. Windows, Android, iOS, Android, iOS, etc). If, for some data problem, we have sessions with different ways of naming "windows" (e.g. Windows, Android, iOS, Android, windows, iOS, etc) I want the test to fail. Values being repeated is expected (which is why expect_column_values_to_be_unique would fail), what this adds on top is the expectation of that set of values remaining unique from a case insensitive perspective.

I pinged you on slack - maybe we can briefly sync.
Thanks,
Agus

@agusfigueroa-htg
Copy link
Contributor Author

agusfigueroa-htg commented Jan 21, 2022

After discussing with @clausherther, we aligned on the fact that what we were looking for was a test for consistent casing.
This means, that if a column has several repeated values it's fine, as long as the casing is consistent.

--Example of expected values for a given column: 

google
google
google
aws
aws
othervalue
--Example of unexpected values for a given column: 

google
Google
google
aws
aws
othervalue

As in this case the column value "google" is spelled using two different cases ("google" in the first place and "Google" in the second one), the casing is inconsistent. This is the intrinsic purpose of this test and the reason why we decided to make it a standalone test. A flag was added so the test can return the faulty column values if needed.

The final PR is here, including formatting and documentation, feel free to review it and get back to me if needed.
Thanks

@clausherther
Copy link
Contributor

@agusfigueroa-htg we'll need to add an integration test here please.

@clausherther
Copy link
Contributor

Hi @agusfigueroa-htg! Is this PR ready to be reviewed? Let me know if there's anything I can do to help get this over the finish line this week.

@agusfigueroa-htg
Copy link
Contributor Author

Ah yes sorry! When I submitted the last commit I forgot to tag you in a comment. Yes, ready to be reviewed @clausherther . Thanks!

@clausherther clausherther self-requested a review January 31, 2022 16:14
README.md Outdated Show resolved Hide resolved
@agusfigueroa-htg
Copy link
Contributor Author

@clausherther I introduced those changes, may you check? thx!

@clausherther clausherther self-requested a review January 31, 2022 18:50
Copy link
Contributor

@clausherther clausherther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks!

@clausherther clausherther merged commit 854e653 into calogica:main Feb 24, 2022
clausherther added a commit that referenced this pull request Apr 4, 2022
…y_n_datepart` (#141)

* implement exclusion_condition (#1)

* move exclusion_condition to closing statement

* Update README.md

* Create expect_column_set_to_be_unique_case_insensitive.sql (#138)

* Create expect_column_set_to_be_unique_case_insensitive.sql

* adjusting consistent casing test and documentation

* adjusting formatting and adding integration test

* Fix formatting

* Formatting updates

* adjusting Readme and adding test instance

* Fix spelling

Co-authored-by: clausherther <claus@calogica.com>

* Change check for data date out of range to use timestamp (#145)

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update CHANGELOG.md

* Added partition by in lag tests (#146)

* Added partition by parameter

* Added documentation

* Modified the test to accept list instead of string

* created data and implemented tests

* corrected test in yml

* Update formatting, test data and test macro

Co-authored-by: clausherther <claus@calogica.com>

* Update CHANGELOG.md

* Update CHANGELOG.md

* Fix emails.sql (#153)

* Adds test for column presence (#149)

* add test for column not to exist

* Update expect_table_columns_not_to_contain_set.sql

* Update README.md

* name change and inner joined

* update readme

* update test

* Fix expect_row_values_to_have_recent_data issues on bigquery (#147)

* Add datetime and timestamp test columns

* Convert all evals to timestamps

* Add datatypes

* Update schema tests

* Update CHANGELOG.md

* Update README.md

* Add pre-commit config (#156)

* Add github actions workflow (#159)

* Add github workflow

* Remove pre-commit

* Add .circleci/config.yml (#161)

* Add .circleci/config.yml

* Remove github action

* Update postgres test

* Add dbt install

* Add env variables

* Install dbt dependencies

* Empty postgres password

* Upgrade postgres image

* Update connection settings

* Fix postgres profile

* Move profiles

* Add BigQuery run

* Fix job name

* Update key path

* Update config

* Update schema

* Add Snowflake tests

* Update profile

* Add back postgres

* Show group_by columns in validation errors for column increasing test (#158)

* Show group_by columns in validation errors for column increasing test

* Add github actions workflow (#159)

* Add github workflow

* Remove pre-commit

* Show group_by columns in validation errors for column increasing test

* Add .circleci/config.yml (#161)

* Add .circleci/config.yml

* Remove github action

* Update postgres test

* Add dbt install

* Add env variables

* Install dbt dependencies

* Empty postgres password

* Upgrade postgres image

* Update connection settings

* Fix postgres profile

* Move profiles

* Add BigQuery run

* Fix job name

* Update key path

* Update config

* Update schema

* Add Snowflake tests

* Update profile

* Add back postgres

* Show group_by columns in validation errors for column increasing test

* Jinja whitespace update (to force CI)

Co-authored-by: Claus Herther <claus@calogica.com>

* implement exclusion_condition (#1)

* move exclusion_condition to closing statement

* Update README.md

* Add integration test

Co-authored-by: agusfigueroa-htg <77272542+agusfigueroa-htg@users.noreply.github.com>
Co-authored-by: clausherther <claus@calogica.com>
Co-authored-by: Josh Jones <josh.jones@accenture.com>
Co-authored-by: Lucas Larbodiere <62683844+Lucasthenoob@users.noreply.github.com>
Co-authored-by: Randy Caddell <64412616+rcaddell@users.noreply.github.com>
Co-authored-by: dluftspring <daniel.luftspring@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants