Skip to content

refactor: improve experimental source code pattern analysis of pypi packages #965

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

art1f1c3R
Copy link
Member

@art1f1c3R art1f1c3R commented Jan 17, 2025

Summary

This pull request builds upon the work completed in #851, refactoring and refining the source code analysis of PyPI packages by identifying suspicious patterns and dataflows. This PR introduces the open source components of the tool Semgrep, using Semgrep rules written in .yaml files to specify suspicious code patterns, and suspicious dataflows using its taint tracking features. Included in the changes are default Semgrep rules, the option to include user-created Semgrep rules, disable rulesets (whole .yaml files), and unit tests for each default Semgrep ruleset. This new feature is encompassed in a refactored version of the PyPISourcecodeAnalyzer, which analyzes the source code downloaded from the package tarball.

Usage

This feature is only intended to be used when the user explicitly includes the argument --analyze-source along with the macaron analyze command. This analyzer has a dependency on SOURCE_CODE_REPO, but can be forced to run by also supplying the --force-analyze-source command. The user may provide their own Semgrep rules by including a directory with one or many semgrep .yaml rule files in the defaults.ini file under custom_semgrep_rules. By default, this is blank. If rules are provided, they are run in addition to the default rules provided by Macaron. The user may disable individual rule IDs from both the default and custom rulset using disabled_rules in defaults.ini, or disable whole ruleset files using disabled_default_rulesets and disabled_custom_rulesets. Currently, due to a high false positive rate, the exfiltration default ruleset is disabled by default.

For post-run analysis, provided in the information returned by this analyzer is the rule identifier, along with the file path, start, and end line number of the file where the suspicious pattern was detected. The analyzer information includes the field enabled_sourcecode_rule_findings for enabled ruleset detections, and the field disabled_sourcecode_rule_findings for disabled ruleset detections. Only enabled rules affect the heuristic result.

Function

If this feature is turned on, the PyPISourcecodeAnalyzer is run after all other current heuristics have been run. If this analyzer fails and determines there were suspicious patterns in the source code, the overall package analysis is failed. If it had previously passed, this is done with low confidence (Confidence.LOW). The reason this analyzer may override the previous decision by the metadata heuristics is that it is only designed to be run when certain metadata heuristics fail, so if there is enough confidence in the results of the metadata heuristics, the source code analysis will not be run. Currently, this is done by including a dependency on the SourceCodeRepo heuristic failing, meaning there is no source code repository available. In this instance, the package is trusted less as the source code is not published, so the tarball is downloaded and analyzed. As heuristics change, this dependency may be updated.

Testing

A new unit test was added for this analyzer. It tests error pathways through the use and instantiation of the analyzer, as well as its ability to detect the patterns outlined in the Semgrep rules. To do this, sample files with suspicious patterns were included for each suspicious pattern category under tests/malware_analyzer/pypi/resources/sourcecode_samples. Each directory has an expected_results.json file with the expected file names and line numbers relative to that directory that should be detected, categorized in rule IDs. None of the code within these sample files is malicious in any way, the samples are all variants of printing Hello world!. However, to maintain their code patterns and behaviors, they must be excluded from many of the pre-commit hooks. To take precautions to ensure excluding them is safe, the following measures are put in place:

  • Each file calls sys.exit() immediately at every possible entry point.
  • The exported symbols are set to empty (using __all__) to ensure nothing can be imported from these files.
  • All patterns are isolated to a single function, test_function.
  • A pre-commit hook has been added which uses the script samples_permissions_checker.sh to ensure no execute permissions are available on any of the sample files.

Further Development

The following is recommended to be implemented and/or considered for further development of this feature:

  • Updates to obfuscation rules (obfuscation.yaml): improvements to this rule may include more decoding and execution patterns, and coverage of more obfuscation tool behaviors.
  • Updates to exfiltration rules (exfiltration.yaml): further common sensitive information behaviors and accesses should be included (e.g. OS-specific accesses of things like the Windows registry or Linux /etc files, keystroke reading for keylogging, clipboard access), along with further methods of creating remote connections and sending data (e.g. FTP).
  • Network operations (new rule): detect the flow of network operations to sensitive calls and behaviors, such as remote connections downloading files to then write then to disk or running malicious scripts downloaded from a remote connection, or spawning processes and shells after a remote connection. Operations like keylogging may required some more special treatment due to their typical flow.
  • Suspicious constants (new rule): detect hard-coded constant values to suspicious URL links for exfiltrating data or downloading malicious payloads, or sensitive constants such as files or directories commonly containing sensitive information. Constants included in this should also be included in exfiltration and network operation rules. This rule would need to be carefully crafted to avoid false positives.
  • The current suspicious_setup.py heuristic may benefit from using a custom Semgrep rule as opposed to its current method.
  • Source code analysis should only be performed when necessary, given it is an expensive operation. This analyzer's dependency on metadata heuristics and other heuristics should be refined to ensure this is the case.
  • Function Arguments: changes to all heuristics pertaining to suspicious function calls may assist in reducing false positives by attempting to discern what is passed as those arguments (e.g. is a hard-coded string passed to a base64 decode call, or something pulled from elsewhere that may be benign)

Development Notes

  • A new top-level file, .semgrepignore, has been added but is blank except for a single comment; this is intentional. This file is used by Semgrep with the same syntax as a .gitignore file to exclude items from being scanned by Semgrep. As our intention is to scan only specific directories, but ignore nothing (given we are scanning for malicious behavior in any part of the package), this file has been created so nothing is ignored. If Semgrep does not detect a .semgrepignore, it uses its own default one, which is not empty, hence this file's presence. For a similar reason, the --disable-nosem argument has been added to the main scan, so packages may not bypass this analyzer's scan by adding a # nosemgrep comment in the source code.
  • Currently, exfiltration.yaml Semgrep rules are disabled by default. These rules need further development to reduce false positives.
  • In the final docker build in Dockerfile.final, an additional line exists to uninstall the version of Semgrep installed by pip. This is done before then replacing it with the trusted built-from-source wheel.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Jan 17, 2025
@art1f1c3R art1f1c3R self-assigned this Jan 22, 2025
@art1f1c3R art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch from f599234 to 38cc36b Compare January 28, 2025 02:23
@art1f1c3R art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch from 8bc6532 to 6dfa198 Compare February 5, 2025 06:42
@art1f1c3R art1f1c3R marked this pull request as ready for review February 6, 2025 05:31
@art1f1c3R art1f1c3R requested a review from tromai as a code owner February 6, 2025 05:31
@art1f1c3R art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch 3 times, most recently from f449d13 to 0d8e735 Compare February 14, 2025 07:49
@art1f1c3R art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch from 0d8e735 to c162b40 Compare February 24, 2025 05:50
@art1f1c3R
Copy link
Member Author

Semgrep rules have been refined and updated based on the results of scanning the ICSE25 dataset as was provided to Hercule. A summary of what was observed and what was changed as a result:

  • The obfuscation_default-assigning rule triggered many false positives, as packages often perform these operations for custom typing purposes, as well as other unknown but benign behaviors. The rule did not, by itself, detect many malware within the dataset, and was often detected in conjunction with some other rule. For this reason, it has been removed.
  • The rules obfuscation_decode-and-execute and exfiltration_remote-exfiltration have been rewritten to make them more context aware. What is meant by this is, many false positives were being generated for patterns such as $BYTES.decode(...), as since $BYTES is not type checked to actually be a bytes object, anything with a .decode(...) method was triggered. This had a heavy impact on the socket patterns and file read/write patterns. These have been rewritten to ensure that any file read/write, or any socket action, is taken when a relevant object is available and detected. This has significantly reduced false positives.
  • The ast.literal_eval() function has been removed, as it does not, by itself, offer any real malicious functionality (only parsing types and other strings, not executing code).
  • The obfuscation_inline-imports rule has been rewritten to detect hexadecimal and octal import after observing malicious behaviors similar in some of the malware datasets, as well as only detecting certain, hardcoded suspicious imports. This is to avoid the many false positives it triggered for trusted packages which use __import__(...) for configuration purposes, and use this function as opposed to the suggested importlib.import_module(...).

@art1f1c3R art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch 3 times, most recently from 8bee8eb to f06c632 Compare March 12, 2025 06:04
Copy link
Member

@behnazh-w behnazh-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the source code analysis has become more mature, we can now run it by default.

@art1f1c3R art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch 3 times, most recently from ae2983b to 73d3fa8 Compare April 3, 2025 06:10
@art1f1c3R art1f1c3R changed the base branch from staging to main April 8, 2025 23:46
@art1f1c3R art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch 3 times, most recently from 3ef07a0 to d19911b Compare April 14, 2025 03:51
@art1f1c3R art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch 2 times, most recently from 1360312 to 55a9dcb Compare April 17, 2025 06:29
@art1f1c3R art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch from 55a9dcb to 2afcd59 Compare May 15, 2025 05:57
behnazh-w
behnazh-w previously approved these changes Jun 1, 2025
@art1f1c3R art1f1c3R dismissed behnazh-w’s stale review June 1, 2025 22:51

The merge-base changed after approval.

…ackages

Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
@art1f1c3R art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch from 987c2c2 to bfe204a Compare June 2, 2025 00:38
Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
art1f1c3R added 2 commits June 2, 2025 14:12
…ture in defaults

Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
…y, and handle json error in analysis

Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
# temporary directory to unzip and read all source files
temp_dir = tempfile.mkdtemp(prefix=f"{package_name}_")
response = send_get_http_raw(url, stream=True)
if response is None:
Copy link
Member

@tromai tromai Jun 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryDirectory if you are looking for a way to create a temporary directory and have it automatically cleaned up afterward. I see that this was used in the old implementation. Could you let me know the reason we are not using it anymore 🤔 ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a while ago, but I believe the rationale was that the TemporaryDirectory object made it difficult to store the extracted path persistent across the entire run. I want this capability as I would like other heuristics and potentially other components of Macaron to use this feature once this is merged, to prevent us downloading the tarball repeatedly. Currently, since this PR is isolated to my feature, I implement and use a context manager to download, extract, and clean it up in DetectMaliciousMetadataCheck.analyze_source.

Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCA Verified All contributors have signed the Oracle Contributor Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants