TBOX-332 Address Validation #156

Merged: 22 commits, Jul 17, 2023

@ErinCompaan ErinCompaan commented Jul 11, 2023

↪️ Pull Request

This adds functionality to use the Google Address Validation API with Tamr. Tests are incomplete, but I'm putting this up for early feedback since the plan is to use it with a customer within a couple of weeks.

Some notes:

  1. The set-up is very similar to the existing translation enrichment -- the main difference is that the address validation doesn't store sets of similar addresses that rules/cleaning can format to the same result. The translation dictionary stores sets of phrases like this:
    original_phrases: Set[str] = field(default_factory=lambda: set())
    I didn't see much value in adding that for addresses, but I'm open to other thoughts.
  2. There is a class here (!), but it was okay for translation apparently, so I'm hoping it's okay for validation.
  3. I used the scripts I added in the examples folder to set up an SM project and validation mapping on the toolbox test instance. You can see what it looks like there and how I used transformations to pull in the validation. It would be nice to automate set-up of the additional validation columns in the unified dataset, as well as the transformations to populate them. I would welcome suggestions on how to do that and where it belongs (e.g. just an example script, or more enrichment module functionality).
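
To illustrate the difference described in note 1, here is a minimal sketch of the two storage shapes. The class names `PhraseEntry` and `AddressEntry` and their reduced field lists are hypothetical simplifications for illustration, not the toolbox's actual classes; only the `original_phrases` field is taken from the note above.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PhraseEntry:
    """Hypothetical, simplified translation entry: many originals share one result."""

    standardized_phrase: str
    translated_phrase: str
    # Every raw phrase that cleaning rules reduced to `standardized_phrase`.
    original_phrases: Set[str] = field(default_factory=set)


@dataclass
class AddressEntry:
    """Hypothetical, simplified address entry: one input maps to one result."""

    input_address: str
    validated_formatted_address: str
    expiration: str


# Two raw phrases collapse into a single translation entry...
phrase = PhraseEntry("hello", "bonjour")
phrase.original_phrases.update({"Hello", "hello!"})

# ...whereas each input address gets its own entry, with no set of variants.
address = AddressEntry(
    "66 Church St Cambridge Mass",
    "66 Church Street, Cambridge, MA 02138-3733, USA",
    "2023-08-09 11:46:40.408657",
)
```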

✔️ PR Todo

@zbpvarun-tamr

I see that a couple of checks are failing already. I'm not sure why the enforce-test-coverage check shows your test coverage as 0% at the moment... I will look into it in more detail. For the core-dependencies check, pandas is identified as an optional dependency. I don't remember the exact reasoning for why this is still the case... Maybe worth revisiting. You can see this script as an example of how to import pandas.
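
One common way to keep a package like pandas optional is to import it only when a feature needs it and to fail with a clear message otherwise. The sketch below is a generic pattern, not necessarily what the referenced script does; the helper name `require_optional` is hypothetical.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency at call time, with a helpful error if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional dependency '{module_name}'. "
            f"Install it to use this feature."
        ) from exc


def load_csv(path: str):
    """Example use: pandas is imported only when this function is called."""
    pandas = require_optional("pandas", "CSV-based address validation")
    return pandas.read_csv(path)
```

With this pattern, importing the module itself never fails when pandas is absent; only the functions that need it do.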

@ErinCompaan

This is ready for review now -- test coverage for the new code is complete:

pytest --cov=tamr_toolbox/enrichment --cov-report=term-missing tests/enrichment
---------- coverage: platform darwin, python 3.9.13-final-0 ----------
Name                                                            Stmts   Miss  Cover   Missing
---------------------------------------------------------------------------------------------
tamr_toolbox/enrichment/__init__.py                                 7      0   100%
tamr_toolbox/enrichment/address_mapping.py                        121      0   100%
tamr_toolbox/enrichment/address_validation.py                      54      0   100%
tamr_toolbox/enrichment/api_client/__init__.py                      3      0   100%
tamr_toolbox/enrichment/api_client/google_address_validate.py      43      0   100%
tamr_toolbox/enrichment/api_client/google_translate.py             63      0   100%
tamr_toolbox/enrichment/dictionary.py                             123      7    94%   281-285, 380-385
tamr_toolbox/enrichment/enrichment_utils.py                        26      0   100%
tamr_toolbox/enrichment/translate.py                               66      4    94%   154-156, 197
---------------------------------------------------------------------------------------------
TOTAL                                                             506     11    98%

@ErinCompaan ErinCompaan marked this pull request as ready for review July 13, 2023 15:00
@ErinCompaan ErinCompaan requested a review from a user July 13, 2023 15:00
@ErinCompaan ErinCompaan requested a review from Adam-Tamr as a code owner July 13, 2023 15:00
optional_requirements.txt (outdated, resolved)
if joined_addr not in addr_mapping.keys():
    addr_to_validate.append(joined_addr)
    count_new_addr += 1
elif addr_mapping[joined_addr].expiration < str(datetime.now() + expiration_date_buffer):
Contributor

I was confused by this, but I think I get it now... Please confirm: if a validation expires, it gets deleted, and this elif clause catches the entries that are about to expire so that they are also validated again. Is this correct?

Contributor

How do you differentiate between those records that have been validated but have expired versus those that have not been validated before? Does it matter?

@ErinCompaan (Contributor Author), Jul 17, 2023

Yes, this is all to make it possible to comply with the Google terms of service, which prohibit caching most results for more than 30 days. Regarding the first comment: currently, a validation is not deleted when it expires, but if it is still in use (e.g. the address is still in the list of addresses of interest), it will be validated again.

I don't differentiate between the two cases currently -- I'm not sure there's a reason to do so.
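
The re-validation rule discussed in this thread can be sketched as follows. This is a simplified illustration, not the merged code: the function name `addresses_to_validate`, the plain `expirations` dictionary, and the default two-day buffer are assumptions, and the sketch parses expiration strings into datetimes rather than comparing strings as the excerpt above does.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Optional


def addresses_to_validate(
    addresses: Iterable[str],
    expirations: Dict[str, str],
    buffer: timedelta = timedelta(days=2),  # assumed default, for illustration
    now: Optional[datetime] = None,
) -> List[str]:
    """Return addresses that are new, expired, or due to expire within `buffer`."""
    cutoff = (now or datetime.now()) + buffer
    to_validate = []
    for addr in addresses:
        expiration = expirations.get(addr)
        if expiration is None:
            # Never validated before.
            to_validate.append(addr)
        elif datetime.fromisoformat(expiration) < cutoff:
            # Validated before, but the cached result has expired or will soon.
            to_validate.append(addr)
    return to_validate
```

Both branches lead to the same action, which matches the reply above: new and expired addresses are not distinguished, and addresses no longer of interest are simply never re-validated.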



def get_maps_client() -> "googlemaps.Client":
    """Get GoogleMaps client using the environment variable 'GOOGLEMAPS_API_KEY'
Contributor

Does this show up in the docs as the name that is required to be set when running address validation?

@@ -0,0 +1,5 @@
{"input_address": "66 Church St Cambridge Mass", "validated_formatted_address": "66 Church Street, Cambridge, MA 02138-3733, USA", "expiration": "2023-08-09 11:46:40.408657", "region_code": "US", "postal_code": "02138-3733", "admin_area": "MA", "locality": "Cambridge", "address_lines": ["66 Church St"], "usps_first_address_line": "66 CHURCH ST", "usps_city_state_zip_line": "CAMBRIDGE MA 02138-3733", "usps_city": "CAMBRIDGE", "usps_state": "MA", "usps_zip_code": "02138-3733", "latitude": 42.3739503, "longitude": -71.1211445, "place_id": "ChIJNR2ZIGh344kRNQAj-dh6d00", "input_granularity": "PREMISE", "validation_granularity": "PREMISE", "geocode_granularity": "PREMISE", "has_inferred": true, "has_unconfirmed": false, "has_replaced": false, "address_complete": false}
Contributor

Do you need these files as part of the PR and testing? They will be added to the package if included in the present form. I know the addresses are not that critical but still, I am wondering if we can test without adding these files to the PR. I don't see equivalent files for translation either...

Contributor

I may have missed it but where do you use these 2 json files?

Contributor Author

These are used to test the from_json functions and the from_json error handling for the module. The translation module doesn't have them because it keeps its data in the test file, saves it to JSON, and then loads it, and it doesn't test the same error handling. We could get rid of the files by doing a save + load in the test file and mocking the bad data somehow, but the same data would still be stored somewhere.
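
The alternative mentioned above (save + load inside the test, with bad data generated on the fly) might look roughly like this. It is a sketch under assumptions: `CachedAddress`, `save_json`, and `load_json` are hypothetical stand-ins with a reduced field list, not the module's real `from_json` functions.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class CachedAddress:
    """Hypothetical stand-in for one stored validation result."""

    input_address: str
    validated_formatted_address: str
    expiration: str


def save_json(mapping: Dict[str, CachedAddress], path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w") as f:
        for entry in mapping.values():
            f.write(json.dumps(asdict(entry)) + "\n")


def load_json(path: str) -> Dict[str, CachedAddress]:
    """Read entries back, raising ValueError on malformed or incomplete lines."""
    mapping = {}
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            try:
                entry = CachedAddress(**json.loads(line))
            except (json.JSONDecodeError, TypeError) as exc:
                raise ValueError(f"Bad entry on line {line_number} of {path}") from exc
            mapping[entry.input_address] = entry
    return mapping


def test_round_trip() -> None:
    entry = CachedAddress(
        "66 Church St Cambridge Mass",
        "66 Church Street, Cambridge, MA 02138-3733, USA",
        "2023-08-09 11:46:40.408657",
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mapping.json")
        save_json({entry.input_address: entry}, path)
        assert load_json(path) == {entry.input_address: entry}


def test_bad_data() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "bad.json")
        with open(path, "w") as f:
            # Missing required fields, so loading should fail.
            f.write('{"input_address": "no other fields"}\n')
        try:
            load_json(path)
        except ValueError:
            return
        raise AssertionError("expected ValueError for incomplete entry")
```

As noted above, the test data still lives somewhere (here, in the test file itself); the change is only that no fixture files are shipped alongside it.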

    path_to_csv_to_validate: str,
    path_to_validated_csv: str,
) -> None:
    """Validate data located on disk and save results to disk.
Contributor

This example is a good place to specify how the API key needs to be stored.

Contributor

Also important to specify since your example allows for a config file but your get_maps_client function assumes an environment variable. Worth converting to a config variable?

Contributor Author

Yes, perhaps that would be better. I can see about doing that.
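
One possible way to reconcile the two approaches is to look for the key in the config first and fall back to the environment variable. The sketch below is only an illustration of that idea: the function name `resolve_api_key` and the config key `googlemaps_api_key` are hypothetical, and only the environment variable name `GOOGLEMAPS_API_KEY` comes from the code under review.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    config: Optional[Mapping[str, str]] = None,
    config_key: str = "googlemaps_api_key",  # hypothetical config entry
    env_var: str = "GOOGLEMAPS_API_KEY",
) -> str:
    """Return the API key from `config` if present, else from the environment."""
    if config and config.get(config_key):
        return config[config_key]
    key = os.environ.get(env_var)
    if key:
        return key
    raise RuntimeError(
        f"No Google Maps API key found: set '{config_key}' in the config "
        f"or the environment variable '{env_var}'."
    )
```

The resolved key could then be passed to `googlemaps.Client(key=...)`, and the error message doubles as documentation of where the key is expected to live.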

from tamr_toolbox.enrichment.api_client.google_address_validate import get_maps_client
from tamr_toolbox.utils.testing import mock_api

ADDR_VAL_MAPPING_0 = tamr_toolbox.enrichment.address_mapping.AddressValidationMapping(
Contributor

I would be much happier if we could find a way to test this without putting all these details into the package... But if that does not exist, the tests themselves look good...

Contributor Author

What is your concern with storing the data? Depending on the issue, there may be a way around it.

@zbpvarun-tamr left a comment

Looks good to me. Depending on your conversation with Ravi, if you need to make any changes, I can take another look. But otherwise, feel free to merge when ready.

@ErinCompaan ErinCompaan merged commit 15b19b7 into Datatamer:develop Jul 17, 2023