Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove token #528

Merged
merged 13 commits into from
Apr 29, 2024
Merged

Remove token #528

merged 13 commits into from
Apr 29, 2024

Conversation

doug-newman-nasa
Copy link
Contributor

@doug-newman-nasa doug-newman-nasa commented Apr 19, 2024

The new cassette for the test_data_links has some information in it pertaining to the user who executed the test. This PR redacts it.
BTW - at the hackathon I stated that I was going to add a VCR cassette for test_data_links which I attempted to PR today only to find the same work had been merged three days ago. We need to coordinate better.


📚 Documentation preview 📚: https://earthaccess--528.org.readthedocs.build/en/528/

@chuckwondo
Copy link
Collaborator

Thanks for catching and redacting the token!

@chuckwondo chuckwondo self-requested a review April 19, 2024 13:26
Copy link
Collaborator

@chuckwondo chuckwondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Doug! My bad on the token issue, and for failing to coordinate. It slipped my mind that you had mentioned this, but now I recall discussing with @mfisher87 the need to handle this very situation.

However, instead of simply hand-modifying the cassette, I'd like to see proper request and response filtering. Without using the vcrpy capabilities for scrubbing sensitive information, if someone were to re-record this cassette, they would run the risk of committing sensitive information again, if they didn't realize (like me) that sensitive information was being recorded.

See:

@doug-newman-nasa
Copy link
Contributor Author

Yup. I looked into using
with my_vcr.use_cassette('test.yml', filter_post_data_parameters=['api_key']): requests.post('http://api.com/postdata', data={'api_key': 'secretstring'})
but I couldn't get it working properly.

@chuckwondo
Copy link
Collaborator

Yup. I looked into using with my_vcr.use_cassette('test.yml', filter_post_data_parameters=['api_key']): requests.post('http://api.com/postdata', data={'api_key': 'secretstring'}) but I couldn't get it working properly.

Ok. I'll take a look. If I can't sort it out in short order, I'll let you know and we can approve this redaction in the meantime. If that's the case, then I'll open a new issue to later address this via vcrpy functionality.

@doug-newman-nasa
Copy link
Contributor Author

Sorry, I should have been more specific. I can keep looking at it. Just not immediately.

@mfisher87
Copy link
Collaborator

Great catch, Doug!

@jhkennedy
Copy link
Collaborator

BTW - at the hackathon I stated that I was going to add a VCR cassette for test_data_links which I attempted to PR today only to find the same work had been merged three days ago. We need to coordinate better.

One way that might help here is to put PRs into, or back into, draft status if there's planned future work but the PR appears to be in ready-for-review or ready-to-merge state.

@doug-newman-nasa
Copy link
Contributor Author

This is a 'preliminary pull request' because I wanted to get your eyes on the redact_key_values code. Not being a python coder I'm sure it requires some love. I'm going to work on redacting uri: https://urs.earthdata.nasa.gov/api/users/dschuck?client_id=

@chuckwondo
Copy link
Collaborator

This is a 'preliminary pull request' because I wanted to get your eyes on the redact_key_values code. Not being a python coder I'm sure it requires some love. I'm going to work on redacting uri: https://urs.earthdata.nasa.gov/api/users/dschuck?client_id=

Thanks Doug! I'll take an initial pass over your updates.

tests/unit/test_results.py Outdated Show resolved Hide resolved
("client_id", "foo"),
]

def redact_key_values(keys_to_redact):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice use of a higher-order function here! I'd suggest just making this a top level function. No need to nest this within the _get_vcr method, nor within the TestResults class at all.

tests/unit/test_results.py Outdated Show resolved Hide resolved
tests/unit/test_results.py Outdated Show resolved Hide resolved
Comment on lines 53 to 80
def before_record_response(response):
# Only do this if the response has not been recorded
string_body = response["body"]["string"].decode("utf8")
if REDACTED_STRING not in string_body:
# Only do this is if the body contains one or more
# of the keys to redact.
if any(key in string_body for key in keys_to_redact):
try:
is_list = False
# Marshall into json object, if it is a JSON object.
payload = json.loads(string_body)
if isinstance(payload, list):
payload = payload[0]
is_list = True
for key in keys_to_redact:
if key in payload:
# Redact the key value
payload[key] = REDACTED_STRING
# Write out the updated json object to the response
# body string.
if is_list:
payload = [payload]
response["body"]["string"] = json.dumps(payload).encode()
except ValueError:
# If it is not a json object, return
return response

return response
Copy link
Collaborator

@chuckwondo chuckwondo Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To simplify this logic a bit, I suggest not bothering to avoid redacting an already redacted payload. This redaction is idempotent, so redacting an already redacted payload yields the same result, and in the context of running unit tests, the performance impact is likely unnoticeable.

Suggested change
def before_record_response(response):
# Only do this if the response has not been recorded
string_body = response["body"]["string"].decode("utf8")
if REDACTED_STRING not in string_body:
# Only do this is if the body contains one or more
# of the keys to redact.
if any(key in string_body for key in keys_to_redact):
try:
is_list = False
# Marshall into json object, if it is a JSON object.
payload = json.loads(string_body)
if isinstance(payload, list):
payload = payload[0]
is_list = True
for key in keys_to_redact:
if key in payload:
# Redact the key value
payload[key] = REDACTED_STRING
# Write out the updated json object to the response
# body string.
if is_list:
payload = [payload]
response["body"]["string"] = json.dumps(payload).encode()
except ValueError:
# If it is not a json object, return
return response
return response
def redact(payload):
for key in keys_to_redact:
if key in payload:
payload[key] = "REDACTED"
return payload
def before_record_response(response):
body = response["body"]["string"].decode("utf8")
with contextlib.suppress(json.JSONDecodeError):
payload = json.loads(body)
redacted_payload = (
list(map(redact, payload))
if isinstance(payload, list)
else redact(payload)
)
response["body"]["string"] = json.dumps(redacted_payload).encode()
return response

Also, please move the entirety of redact_key_values to the top level. It does not need to be nested with the _get_vcr method. The nesting just adds unnecessary clutter.

Same goes for the nested redact_login_request below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oddly, I move this method out of TestCase it no longer gets called. Investigating.

@doug-newman-nasa
Copy link
Contributor Author

I've made all the suggestions changes with the exception of moving redact_key_values out of TestCase which prevents it from being called. Looking into that next.

@chuckwondo
Copy link
Collaborator

@doug-newman-nasa, I put these at the top level in tests/unit/test_results.py (preceding class TestResults), and everything works fine for me:

def redact_login_request(request):
    if "/api/users/" in request.path and "/api/users/tokens" not in request.path:
        _, user_name = os.path.split(request.path)
        request.uri = request.uri.replace(user_name, REDACTED_STRING)

    return request


def redact_key_values(keys_to_redact):
    def redact(payload):
        for key in keys_to_redact:
            if key in payload:
                payload[key] = REDACTED_STRING
        return payload

    def before_record_response(response):
        body = response["body"]["string"].decode("utf8")

        with contextlib.suppress(json.JSONDecodeError):
            payload = json.loads(body)
            redacted_payload = (
                list(map(redact, payload))
                if isinstance(payload, list)
                else redact(payload)
            )
            response["body"]["string"] = json.dumps(redacted_payload).encode()

        return response

    return before_record_response

Give it a shot and let me know what you see.

Also, I'd like to request a minor tweak to an assertion in the function test_get_all_more_than_2k. Because the number of granules is changing, if someone decide to regenerate all of the cassettes, this test will fail due to a difference in the length of the granules list.

So instead of this assertion:

        self.assertEqual(len(granules), 2520)

let's do this, which will be less strict, but still validate the results correctly:

        self.assertIn(len(granules), range(2001, 3001))

@doug-newman-nasa
Copy link
Contributor Author

doug-newman-nasa commented Apr 26, 2024

Regarding test_get_all_more_than_2k what I am trying to test here is that the correct number of granules were marshaled from the result. Using a range comparison will limit the effectiveness of that test. However, I'm not actually testing against the value in the result rather the result I fished out manually. So I'm going to counter by asserting equality of
len(granules)
against the value of 'hits' in the response
string: '{"hits":2520,"took":138,"items":[]}'
Does that work?

@chuckwondo
Copy link
Collaborator

Regarding test_get_all_more_than_2k what I am trying to test here is that the correct number of granules were marshaled from the result. Using a range comparison will limit the effectiveness of that test. However, I'm not actually testing against the value in the result rather the result I fished out manually. So I'm going to counter by asserting equality of len(granules) against the value of 'hits' in the response string: '{"hits":2520,"took":138,"items":[]}' Does that work?

Perfect! Even better!

Copy link
Collaborator

@chuckwondo chuckwondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic! Thanks Doug!

@chuckwondo
Copy link
Collaborator

I've approved this, but does anybody else want to review this?

BTW, I'd like to suggest that we squash merge PRs to have a tidier, more readable change history.

Copy link
Collaborator

@chuckwondo chuckwondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry Doug. One last tweak, please.

tests/unit/test_results.py Show resolved Hide resolved
Copy link
Collaborator

@chuckwondo chuckwondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you sir! Nice work.

@chuckwondo
Copy link
Collaborator

Last call for someone else to review. If I don't get any takers today, I'll proceed with a squash merge near COB US Eastern.

@chuckwondo chuckwondo merged commit 73e7959 into nsidc:main Apr 29, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants