Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(pacer): Link pacer sessions to proxies #4192

Merged
merged 20 commits into from
Jul 30, 2024

Conversation

ERosendo
Copy link
Contributor

@ERosendo ERosendo commented Jul 10, 2024

This PR introduces a new approach to caching PACER sessions and updates the code of the ProxyPacerSession class to pick one of the proxy connection strings from the settings.

Key Changes:

  • Introduces a new environment variable (EGRESS_PROXY_HOSTS) to manage a list of proxies.

  • Implements new logic in the ProxyPacerSession to choose a proxy connection string from the settings, enabling dynamic proxy selection.

  • Updates the log_into_pacer method to return a tuple containing both the user's cookie and the used proxy address used for login

  • Tweaks the get_or_cache_pacer_cookies to handle both the new return format and potentially existing cached data in the old format, ensuring backward compatibility.

  • Updates tasks from corpus_importer and recap modules to accommodate the new PACER user cookie format.

  • Adapts commands in the corpus_importer module that use cookies to handle the new format.

  • Introduces new tests for pacer session utils.

This PR closes #4087

@freelawproject freelawproject deleted a comment from semgrep-app bot Jul 10, 2024
@ERosendo ERosendo force-pushed the 4087-feat-link-sessions-to-proxy branch from 6cebd2b to 4579385 Compare July 10, 2024 21:07
@ERosendo ERosendo marked this pull request as ready for review July 10, 2024 22:43
This commit updates the `ProxyPacerSession` class to enable selection of a proxy connection string from the application settings.
This commit updates the `log_into_pacer` method to return a tuple containing the user's cookie and the proxy address used for login (if applicable). This improvement provides more context about the login session, facilitating further actions requiring both cookies and potential proxy information.
This commit tweaks the `get_or_cache_pacer_cookies` function to handle the new return format of `log_into_pacer`. It ensures compatibility with both the updated function and any existing cached data that might be in the old format.
… cookie handling

- Updates tasks logic to accommodate the new format for PACER user cookies.
- Updates function signature to reflect the change in cookie data format.
- Adapts update_docket_info_iquery task to handle the updated output format of get_or_cache_pacer_cookies.
- Enhances task functionality by passing the proxy parameter to the ProxyPacerSession class
@ERosendo ERosendo force-pushed the 4087-feat-link-sessions-to-proxy branch from a1c1746 to 693552d Compare July 10, 2024 22:43
@ERosendo ERosendo requested a review from mlissner July 10, 2024 22:43
Copy link

semgrep-app bot commented Jul 10, 2024

Semgrep found 4 avoid-pickle findings:

Avoid using pickle, which is known to lead to code execution vulnerabilities. When unpickling, the serialized data could be manipulated to run arbitrary code. Instead, consider serializing the relevant data as JSON or a similar text-based serialization format.

Ignore this finding from avoid-pickle.

Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you've got an effective solution here, and that you've thought carefully about how to roll it out without queued tasks failing. Nice.

I think I would change something though. Having to work with tuples and pass them around is never much fun. They're kind of ugly, it's hard to remember what order the tuple is in ("Was it cookies then proxy or proxy then cookies?"), and so forth.

Since I think the values we need are getting added to the PACER session object anyway, did you consider just passing that around? Even log_into_pacer could just return s, I think?

cl/lib/pacer_session.py Outdated Show resolved Hide resolved
return cookies
if isinstance(cookies_data, tuple):
return cookies_data
return cookies_data, settings.EGRESS_PROXY_HOST
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's drop this variable in the code now so we don't have to think about it again. I see you're doing backwards compat here, but it'd be better to just hardcode EGRESS_PROXY_HOSTS[0], say. Let's also make a comment about how this code can be removed the next time anybody is in this area and that it's only here for transitioning.

cl/recap/tasks.py Outdated Show resolved Hide resolved
cl/corpus_importer/tasks.py Outdated Show resolved Hide resolved
@ERosendo ERosendo force-pushed the 4087-feat-link-sessions-to-proxy branch from 2954d38 to b4cc6df Compare July 12, 2024 16:23
@ERosendo ERosendo force-pushed the 4087-feat-link-sessions-to-proxy branch from b4cc6df to 5a47599 Compare July 12, 2024 16:37
@@ -7,7 +7,7 @@
from cl.corpus_importer.tasks import add_tags, get_pacer_doc_by_rd
from cl.lib.celery_utils import CeleryThrottle
from cl.lib.command_utils import VerboseCommand, logger
from cl.lib.pacer_session import ProxyPacerSession
from cl.lib.pacer_session import ProxyPacerSession, SessionData
from cl.lib.scorched_utils import ExtraSolrInterface
from cl.lib.search_utils import build_main_query_from_query_string
from cl.scrapers.tasks import extract_recap_pdf
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are using environment variables inside django app. Use django-environ as it a better alternative for deployment.

Ignore this finding from use-django-environ.

@@ -14,7 +14,7 @@
)
from cl.lib.celery_utils import CeleryThrottle
from cl.lib.command_utils import VerboseCommand, logger
from cl.lib.pacer_session import ProxyPacerSession
from cl.lib.pacer_session import ProxyPacerSession, SessionData
from cl.lib.scorched_utils import ExtraSolrInterface
from cl.lib.search_utils import build_main_query_from_query_string
from cl.recap.tasks import process_recap_attachment
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are using environment variables inside django app. Use django-environ as it a better alternative for deployment.

Ignore this finding from use-django-environ.

Copy link

semgrep-app bot commented Jul 30, 2024

Semgrep found 5 avoid-pickle findings:

Avoid using pickle, which is known to lead to code execution vulnerabilities. When unpickling, the serialized data could be manipulated to run arbitrary code. Instead, consider serializing the relevant data as JSON or a similar text-based serialization format.

Ignore this finding from avoid-pickle.

@ERosendo ERosendo force-pushed the 4087-feat-link-sessions-to-proxy branch from 81cd87d to e34f3d5 Compare July 30, 2024 02:52
@ERosendo ERosendo requested a review from mlissner July 30, 2024 13:23
@ERosendo
Copy link
Contributor Author

@mlissner I've implemented your suggestions. I've removed the EGRESS_PROXY_HOST environment variable from settings and replaced tuples with a dedicated dataclass named SessionData for better structure and readability.

Let me know what you think.

Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! This is a lot easier to reason about. Thanks for making the little helper class.

Can you make an issue with details about deployment and the follow up tasks? E.g.:

  • Need the variables set before deployment
  • Need to remove the old proxies after a few hours or maybe a day
  • Need to remove shim code after a few hours or maybe a day

Anything else?

@mlissner
Copy link
Member

This is approved, but can you work with Ramiro to make sure that we've got the variables and k8s deployments in place before we merge and deploy?

@ERosendo
Copy link
Contributor Author

Sure!

@ERosendo
Copy link
Contributor Author

@mlissner Ramiro confirmed the environment variable was set up correctly and that CL can successfully reach the newly deployed proxies.

Let's merge this PR. :shipit:

@mlissner mlissner enabled auto-merge July 30, 2024 16:10
@mlissner
Copy link
Member

Auto-merge is set. Thanks all!

@mlissner mlissner merged commit 723b7ec into main Jul 30, 2024
13 checks passed
@mlissner mlissner deleted the 4087-feat-link-sessions-to-proxy branch July 30, 2024 16:20
@mlissner
Copy link
Member

@ERosendo Now that this has been live for a couple weeks, is there code in here we want to refactor to simplify things?

@ERosendo
Copy link
Contributor Author

@mlissner Yes, We used several ternary if statements to manage legacy cached data. I believe we can safely remove this code. Here are examples:

session_data = (
cookies if isinstance(cookies, SessionData) else SessionData(cookies)
)

# Ensures session data is a `SessionData` instance for consistent handling.
# This approach prevents disruptions during processing of enqueued data
# after deployment.
session_data = (
cookies if isinstance(cookies, SessionData) else SessionData(cookies)
)

if cookies_data and ttl_seconds >= 300 and not refresh:
# cookies were found in cache and ttl >= 5 minutes, return them
if isinstance(cookies_data, SessionData):
return cookies_data
return SessionData(cookies_data)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

Update catching strategy to link session data to proxy connections
2 participants