Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Variable types #10 #12

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

Conversation

j-grover
Copy link

@CLAassistant
Copy link

CLAassistant commented Oct 18, 2019

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@thehomebrewnerd
Copy link
Contributor

Generally I think this PR looks pretty good, but I have a question about how we want this to behave for the ID columns. If you look at the original entityset (entityset) generated in test_variable_types before normalizing, the customer_id is of type Numeric. However, if you look at this same column of the session_id entity in the normalized entityset, this same customer_id column now is now of type Id.

@rwedge Is this behavior acceptable? I know we will have some differences in variable types where a column in the original non-normalized entityset gets set as an index in a normalized entity, but wasn't sure how we wanted non-index columns to be treated throughout.

We probably should also update the test to test all of the columns we expect to have the same variable types throughout instead of just the single zip_code column.

@rwedge
Copy link
Contributor

rwedge commented Mar 13, 2020

@thehomebrewnerd changing the variable we normalized on to Id type is standard featuretools behavior. I think the logic is that we now know this variable can be treated as a foreign key, so we label it as a special categorical type, Id.

I agree the test should test all of the columns, and I think we should test that make_entityset and auto_entityset use variable_types as expected

@rwedge
Copy link
Contributor

rwedge commented Mar 13, 2020

@j-grover sorry for the delay on the review, are you interested in updating the tests?

@j-grover j-grover force-pushed the variable-types-#10 branch 2 times, most recently from 361a774 to 8cc177c Compare March 15, 2020 05:47
@j-grover
Copy link
Author

@rwedge Updated test to check all columns

@thehomebrewnerd
Copy link
Contributor

@j-grover Thanks for the quick response and updates to the test. Since we have now modified the parameters for the auto_entityset and make_entityset to add the optional variable_types parameter, it would also be good to add specific tests for these two functions to make sure they behave properly and return the expected result when using this optional parameter.

Is that something you would be able to do as well?

@j-grover
Copy link
Author

@j-grover Thanks for the quick response and updates to the test. Since we have now modified the parameters for the auto_entityset and make_entityset to add the optional variable_types parameter, it would also be good to add specific tests for these two functions to make sure they behave properly and return the expected result when using this optional parameter.

Is that something you would be able to do as well?

What sort of things are we looking to test for these two methods. For example one case with default values and one with custom args?

@thehomebrewnerd
Copy link
Contributor

thehomebrewnerd commented Mar 23, 2020

@j-grover Thanks for the quick response and updates to the test. Since we have now modified the parameters for the auto_entityset and make_entityset to add the optional variable_types parameter, it would also be good to add specific tests for these two functions to make sure they behave properly and return the expected result when using this optional parameter.
Is that something you would be able to do as well?

What sort of things are we looking to test for these two methods. For example one case with default values and one with custom args?

Yes, that is what I was thinking...tests very similar to the test you added for normalize_entity, just to make sure those methods return the expected results when passing a variable_types parameter. Since these methods are part of the public API, it would be good for test coverage to have specific tests that cover their possible use cases as well.

@j-grover
Copy link
Author

@thehomebrewnerd
I've added the initial tests for make_entityset and auto_entityset. There is overlap between the tests as auto_entityset makes use of make_entityset internally. When testing auto_entityset, I've found that for entity 0 (refer to line 330 in test_normalize.py) the name changes between 'jersey_num_team' and 'team_jersey_num' for different runs. This causes the following to fail occasionally:
assert normalized_entityset.entities[0].variable_types['jersey_num_team'] == Index
How would you go about testing this?

@thehomebrewnerd
Copy link
Contributor

@thehomebrewnerd
I've added the initial tests for make_entityset and auto_entityset. There is overlap between the tests as auto_entityset makes use of make_entityset internally. When testing auto_entityset, I've found that for entity 0 (refer to line 330 in test_normalize.py) the name changes between 'jersey_num_team' and 'team_jersey_num' for different runs. This causes the following to fail occasionally:
assert normalized_entityset.entities[0].variable_types['jersey_num_team'] == Index
How would you go about testing this?

@j-grover Thanks for creating these tests. If we expect this behavior - where the name is not deterministic - we could set a variable for the name that is actually returned and then use that variable in the tests. One way that comes to mind would be to check if team_jersey_num is in the variable_types dictionary keys and if it is not we would use the other option of jersey_num_team. Something like this:

# Index name is not always the same - checking what was returned
index_vname = "jersey_num_team"
if index_vname not in normalized_entityset.entities[0].variable_types.keys():
    index_vname = "team_jersey_num"
assert normalized_entityset.entities[0].variable_types[index_vname] == Index

I think you would also need to rename one of the dataframe columns for the df.equals test to pass reliably as well.

I don't know enough about the details of this code to know if expect to get the same name back every time, but I can look into that a bit more in the meantime to make sure this isn't highlighting some other issue.



def test_auto_entityset_custom_args():
dic = {'team': ['Red', 'Red', 'Red', 'Orange', 'Orange', 'Yellow',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these dictionaries get re-used in several tests, creating many duplicated lines of code. Could you turn these dictionaries into pytest fixtures so they're only defined once?

@thehomebrewnerd
Copy link
Contributor

@j-grover After reviewing the code with @rwedge , the non-deterministic nature of column names is to be expected as the new names are created by joining an unsorted list of column names. Issue #24 was created to fix this problem, so for this PR I suggest we go ahead and implement the tests as we have described above, and then we can update later after issue #24 is closed.

@j-grover
Copy link
Author

@j-grover After reviewing the code with @rwedge , the non-deterministic nature of column names is to be expected as the new names are created by joining an unsorted list of column names. Issue #24 was created to fix this problem, so for this PR I suggest we go ahead and implement the tests as we have described above, and then we can update later after issue #24 is closed.

Sorting the list sounds good. I have change the index names accordingly to jersey_num_team.
@rwedge I've added a pytest fixture for that particular example

Base automatically changed from master to main February 19, 2021 19:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Variable types not preserved after call to normalize_entity()
4 participants