-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TBOX-332 Address Validation #156
Conversation
I see that a couple of checks are failing already. I'm not sure why the enforce test coverage shows your test coverage as 0% at the moment... Will look into it in more detail. For the core-dependencies, pandas is identified as an optional dependency. I don't remember the exact reasoning for why this is still the case... Maybe worth revisiting. You can see this script as an example of how to import pandas. |
This is ready for review now -- test coverage for the new code is complete:
|
if joined_addr not in addr_mapping.keys(): | ||
addr_to_validate.append(joined_addr) | ||
count_new_addr += 1 | ||
elif addr_mapping[joined_addr].expiration < str(datetime.now() + expiration_date_buffer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was confused by this but I think I get it now... Please confirm if this is the case - basically if the validation expires then it will get deleted. This elif clause catches those that are soon to expire to also be validated again. Is this correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you differentiate between those records that have been validated but have expired versus those that have not been validated before? Does it matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is all to make it possible to comply with the google terms of service, which prohibit caching most results for more than 30 days. Regarding the first comment, validation won't be deleted if it expires currently, but if it's a validation that's still in use (e.g. it's an address that is still in the list of addresses of interest), it will be validated again.
I don't differentiate between the two cases currently -- I'm not sure there's a reason to do so.
|
||
|
||
def get_maps_client() -> "googlemaps.Client": | ||
"""Get GoogleMaps client using the environment variable 'GOOGLEMAPS_API_KEY' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this show up in the docs as the name that is required to be set when running address validation?
@@ -0,0 +1,5 @@ | |||
{"input_address": "66 Church St Cambridge Mass", "validated_formatted_address": "66 Church Street, Cambridge, MA 02138-3733, USA", "expiration": "2023-08-09 11:46:40.408657", "region_code": "US", "postal_code": "02138-3733", "admin_area": "MA", "locality": "Cambridge", "address_lines": ["66 Church St"], "usps_first_address_line": "66 CHURCH ST", "usps_city_state_zip_line": "CAMBRIDGE MA 02138-3733", "usps_city": "CAMBRIDGE", "usps_state": "MA", "usps_zip_code": "02138-3733", "latitude": 42.3739503, "longitude": -71.1211445, "place_id": "ChIJNR2ZIGh344kRNQAj-dh6d00", "input_granularity": "PREMISE", "validation_granularity": "PREMISE", "geocode_granularity": "PREMISE", "has_inferred": true, "has_unconfirmed": false, "has_replaced": false, "address_complete": false} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need these files as part of the PR and testing? They will be added to the package if included in the present form. I know the addresses are not that critical but still, I am wondering if we can test without adding these files to the PR. I don't see equivalent files for translation either...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may have missed it but where do you use these 2 json files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are used to test the from_json functions and from_json error handling for module. The translation doesn't have them because it has data in the test file, which it saves to json and then loads, and it doesn't test the same error handling. We could get rid of them by doing a save + load in the test file and mocking the bad data somehow, but the same data would still be stored somewhere.
path_to_csv_to_validate: str, | ||
path_to_validated_csv: str, | ||
) -> None: | ||
"""Validate data located on disk and save results to disk. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example is a good place to specify how the API key needs to be stored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also important to specify since your example allows for a config file but your get_maps_client function assumes an environment variable. Worth converting to a config variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, perhaps that would be better. I can see about doing that.
from tamr_toolbox.enrichment.api_client.google_address_validate import get_maps_client | ||
from tamr_toolbox.utils.testing import mock_api | ||
|
||
ADDR_VAL_MAPPING_0 = tamr_toolbox.enrichment.address_mapping.AddressValidationMapping( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be much happier if we could find a way to test this without putting all these details into the package... But if that does not exist, the tests themselves look good...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is your concern with storing the data? Depending on the issue, there may be a way around it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Depending on your conversation with Ravi, if you need to make any changes, I can take another look. But otherwise, feel free to merge when ready.
↪️ Pull Request
This adds functionality to use Google Address Validation API with Tamr. Tests are incomplete, but putting this up for early feedback since the plan is to use this with a customer within a couple weeks.
Some notes:
tamr-toolbox/tamr_toolbox/enrichment/dictionary.py
Line 35 in 721fd5a
examples
folder to set up a SM project/ validation mapping on the toolbox test instance. You can see what it looks like there / how I used transformations to pull in the validation. It would be nice to automate set-up of the additionalvalidation
columns in the unified dataset, as well as the transformations to populate them. I would welcome suggestions on how to do that / where it belongs (e.g. just an example script, or more enrichment module functionality).✔️ PR Todo
main()
function in your scripts