refactor: improve experimental source code pattern analysis of pypi packages #965

art1f1c3R · 2025-01-17T05:36:34Z

Summary

This pull request builds upon the work completed in #851, refactoring and refining the source code analysis of PyPI packages by identifying suspicious patterns and dataflows. This PR introduces the open source components of the tool Semgrep, using Semgrep rules written in .yaml files to specify suspicious code patterns, and suspicious dataflows using its taint tracking features. Included in the changes are default Semgrep rules, the option to include user-created Semgrep rules, disable rulesets (whole .yaml files), and unit tests for each default Semgrep ruleset. This new feature is encompassed in a refactored version of the PyPISourcecodeAnalyzer, which analyzes the source code downloaded from the package tarball.

Usage

This feature is only intended to be used when the user explicitly includes the argument --analyze-source along with the macaron analyze command. This analyzer has a dependency on SOURCE_CODE_REPO, but can be forced to run by also supplying the --force-analyze-source command. The user may provide their own Semgrep rules by including a directory with one or many semgrep .yaml rule files in the defaults.ini file under custom_semgrep_rules. By default, this is blank. If rules are provided, they are run in addition to the default rules provided by Macaron. The user may disable individual rule IDs from both the default and custom rulset using disabled_rules in defaults.ini, or disable whole ruleset files using disabled_default_rulesets and disabled_custom_rulesets. Currently, due to a high false positive rate, the exfiltration default ruleset is disabled by default.

For post-run analysis, provided in the information returned by this analyzer is the rule identifier, along with the file path, start, and end line number of the file where the suspicious pattern was detected. The analyzer information includes the field enabled_sourcecode_rule_findings for enabled ruleset detections, and the field disabled_sourcecode_rule_findings for disabled ruleset detections. Only enabled rules affect the heuristic result.

Function

If this feature is turned on, the PyPISourcecodeAnalyzer is run after all other current heuristics have been run. If this analyzer fails and determines there were suspicious patterns in the source code, the overall package analysis is failed. If it had previously passed, this is done with low confidence (Confidence.LOW). The reason this analyzer may override the previous decision by the metadata heuristics is that it is only designed to be run when certain metadata heuristics fail, so if there is enough confidence in the results of the metadata heuristics, the source code analysis will not be run. Currently, this is done by including a dependency on the SourceCodeRepo heuristic failing, meaning there is no source code repository available. In this instance, the package is trusted less as the source code is not published, so the tarball is downloaded and analyzed. As heuristics change, this dependency may be updated.

Testing

A new unit test was added for this analyzer. It tests error pathways through the use and instantiation of the analyzer, as well as its ability to detect the patterns outlined in the Semgrep rules. To do this, sample files with suspicious patterns were included for each suspicious pattern category under tests/malware_analyzer/pypi/resources/sourcecode_samples. Each directory has an expected_results.json file with the expected file names and line numbers relative to that directory that should be detected, categorized in rule IDs. None of the code within these sample files is malicious in any way, the samples are all variants of printing Hello world!. However, to maintain their code patterns and behaviors, they must be excluded from many of the pre-commit hooks. To take precautions to ensure excluding them is safe, the following measures are put in place:

Each file calls sys.exit() immediately at every possible entry point.
The exported symbols are set to empty (using __all__) to ensure nothing can be imported from these files.
All patterns are isolated to a single function, test_function.
A pre-commit hook has been added which uses the script samples_permissions_checker.sh to ensure no execute permissions are available on any of the sample files.

Further Development

The following is recommended to be implemented and/or considered for further development of this feature:

Updates to obfuscation rules (obfuscation.yaml): improvements to this rule may include more decoding and execution patterns, and coverage of more obfuscation tool behaviors.
Updates to exfiltration rules (exfiltration.yaml): further common sensitive information behaviors and accesses should be included (e.g. OS-specific accesses of things like the Windows registry or Linux /etc files, keystroke reading for keylogging, clipboard access), along with further methods of creating remote connections and sending data (e.g. FTP).
Network operations (new rule): detect the flow of network operations to sensitive calls and behaviors, such as remote connections downloading files to then write then to disk or running malicious scripts downloaded from a remote connection, or spawning processes and shells after a remote connection. Operations like keylogging may required some more special treatment due to their typical flow.
Suspicious constants (new rule): detect hard-coded constant values to suspicious URL links for exfiltrating data or downloading malicious payloads, or sensitive constants such as files or directories commonly containing sensitive information. Constants included in this should also be included in exfiltration and network operation rules. This rule would need to be carefully crafted to avoid false positives.
The current suspicious_setup.py heuristic may benefit from using a custom Semgrep rule as opposed to its current method.
Source code analysis should only be performed when necessary, given it is an expensive operation. This analyzer's dependency on metadata heuristics and other heuristics should be refined to ensure this is the case.
Function Arguments: changes to all heuristics pertaining to suspicious function calls may assist in reducing false positives by attempting to discern what is passed as those arguments (e.g. is a hard-coded string passed to a base64 decode call, or something pulled from elsewhere that may be benign)

Development Notes

A new top-level file, .semgrepignore, has been added but is blank except for a single comment; this is intentional. This file is used by Semgrep with the same syntax as a .gitignore file to exclude items from being scanned by Semgrep. As our intention is to scan only specific directories, but ignore nothing (given we are scanning for malicious behavior in any part of the package), this file has been created so nothing is ignored. If Semgrep does not detect a .semgrepignore, it uses its own default one, which is not empty, hence this file's presence. For a similar reason, the --disable-nosem argument has been added to the main scan, so packages may not bypass this analyzer's scan by adding a # nosemgrep comment in the source code.
Currently, exfiltration.yaml Semgrep rules are disabled by default. These rules need further development to reduce false positives.
In the final docker build in Dockerfile.final, an additional line exists to uninstall the version of Semgrep installed by pip. This is done before then replacing it with the trusted built-from-source wheel.

src/macaron/config/defaults.ini

.semgrepignore

scripts/dev_scripts/samples_permissions_checker.sh

.pre-commit-config.yaml

art1f1c3R · 2025-02-27T05:56:29Z

Semgrep rules have been refined and updated based on the results of scanning the ICSE25 dataset as was provided to Hercule. A summary of what was observed and what was changed as a result:

The obfuscation_default-assigning rule triggered many false positives, as packages often perform these operations for custom typing purposes, as well as other unknown but benign behaviors. The rule did not, by itself, detect many malware within the dataset, and was often detected in conjunction with some other rule. For this reason, it has been removed.
The rules obfuscation_decode-and-execute and exfiltration_remote-exfiltration have been rewritten to make them more context aware. What is meant by this is, many false positives were being generated for patterns such as $BYTES.decode(...), as since $BYTES is not type checked to actually be a bytes object, anything with a .decode(...) method was triggered. This had a heavy impact on the socket patterns and file read/write patterns. These have been rewritten to ensure that any file read/write, or any socket action, is taken when a relevant object is available and detected. This has significantly reduced false positives.
The ast.literal_eval() function has been removed, as it does not, by itself, offer any real malicious functionality (only parsing types and other strings, not executing code).
The obfuscation_inline-imports rule has been rewritten to detect hexadecimal and octal import after observing malicious behaviors similar in some of the malware datasets, as well as only detecting certain, hardcoded suspicious imports. This is to avoid the many false positives it triggered for trusted packages which use __import__(...) for configuration purposes, and use this function as opposed to the suggested importlib.import_module(...).

scripts/dev_scripts/samples_permissions_checker.sh

src/macaron/malware_analyzer/pypi_heuristics/sourcecode/pypi_sourcecode_analyzer.py

behnazh-w

Since the source code analysis has become more mature, we can now run it by default.

src/macaron/slsa_analyzer/checks/detect_malicious_metadata_check.py

src/macaron/malware_analyzer/pypi_heuristics/sourcecode/pypi_sourcecode_analyzer.py

tests/malware_analyzer/pypi/test_pypi_sourcecode_analyzer.py

The merge-base changed after approval.

src/macaron/config/defaults.ini

src/macaron/malware_analyzer/pypi_heuristics/sourcecode/pypi_sourcecode_analyzer.py

src/macaron/config/defaults.ini

src/macaron/malware_analyzer/pypi_heuristics/sourcecode/pypi_sourcecode_analyzer.py

src/macaron/slsa_analyzer/package_registry/pypi_registry.py

…ackages Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>

Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>

…ture in defaults Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>

…y, and handle json error in analysis Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>

Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>

tromai

LGTM. Thanks!

…ackages (oracle#965) Include support for using Semgrep for analysis of source code to detect malicious code patterns, specified using Semgrep's YAML files. Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>

oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Jan 17, 2025

behnazh-w reviewed Jan 21, 2025

View reviewed changes

src/macaron/config/defaults.ini Outdated Show resolved Hide resolved

art1f1c3R self-assigned this Jan 22, 2025

art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch from f599234 to 38cc36b Compare January 28, 2025 02:23

art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch from 8bc6532 to 6dfa198 Compare February 5, 2025 06:42

art1f1c3R marked this pull request as ready for review February 6, 2025 05:31

art1f1c3R requested a review from tromai as a code owner February 6, 2025 05:31

art1f1c3R commented Feb 11, 2025

View reviewed changes

.semgrepignore Show resolved Hide resolved

art1f1c3R commented Feb 11, 2025

View reviewed changes

scripts/dev_scripts/samples_permissions_checker.sh Show resolved Hide resolved

tromai reviewed Feb 11, 2025

View reviewed changes

.pre-commit-config.yaml Show resolved Hide resolved

art1f1c3R mentioned this pull request Feb 11, 2025

docs: include new heuristics in malware analyzer readme #987

Merged

art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch 3 times, most recently from f449d13 to 0d8e735 Compare February 14, 2025 07:49

art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch from 0d8e735 to c162b40 Compare February 24, 2025 05:50

tromai reviewed Mar 7, 2025

View reviewed changes

scripts/dev_scripts/samples_permissions_checker.sh Show resolved Hide resolved

art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch 3 times, most recently from 8bee8eb to f06c632 Compare March 12, 2025 06:04

behnazh-w reviewed Mar 22, 2025

View reviewed changes

src/macaron/malware_analyzer/pypi_heuristics/sourcecode/pypi_sourcecode_analyzer.py Outdated Show resolved Hide resolved

behnazh-w reviewed Mar 26, 2025

View reviewed changes

src/macaron/slsa_analyzer/checks/detect_malicious_metadata_check.py Show resolved Hide resolved

behnazh-w reviewed Mar 28, 2025

View reviewed changes

src/macaron/malware_analyzer/pypi_heuristics/sourcecode/pypi_sourcecode_analyzer.py Outdated Show resolved Hide resolved

art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch 3 times, most recently from ae2983b to 73d3fa8 Compare April 3, 2025 06:10

art1f1c3R changed the base branch from staging to main April 8, 2025 23:46

art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch 3 times, most recently from 3ef07a0 to d19911b Compare April 14, 2025 03:51

art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch from 1360312 to 55a9dcb Compare April 17, 2025 06:29

art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch from 55a9dcb to 2afcd59 Compare May 15, 2025 05:57

tromai reviewed May 29, 2025

View reviewed changes

tests/malware_analyzer/pypi/test_pypi_sourcecode_analyzer.py Outdated Show resolved Hide resolved

behnazh-w previously approved these changes Jun 1, 2025

View reviewed changes