scancode-fingerprint plugin does not fingerprint UTF-8 encoded files correctly #1690

Closed
steven-esser opened this issue Aug 22, 2019 · 1 comment

@steven-esser
Contributor

Description

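When running a scan with the -f (--fingerprint) option provided by the plugins/scancode-fingerprint/ plugin, it fails to fingerprint UTF-8 encoded files. In the scan output below, "fingerprint" is null and the fingerprint scanner reports a UnicodeDecodeError on byte 0xc3.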

{
  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "3.1.0.post27.38d1017eb",
      "options": {
        "input": [
          "src/packagedcode/pyrpm.py"
        ],
        "--fingerprint": true,
        "--info": true,
        "--json-pp": "/home/sesser/out.json"
      },
      "notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
      "start_timestamp": "2019-08-22T144239.760365",
      "end_timestamp": "2019-08-22T144239.868490",
      "message": null,
      "errors": [
        "Path: pyrpm.py"
      ],
      "extra_data": {
        "files_count": 1
      }
    }
  ],
  "files": [
    {
      "path": "pyrpm.py",
      "type": "file",
      "name": "pyrpm.py",
      "base_name": "pyrpm",
      "extension": ".py",
      "size": 15711,
      "date": "2019-07-08",
      "sha1": "a83f39b3c15f8382942a5b6a78bd5c3d99c018e0",
      "md5": "ef098ae1b056c8f91b41d3c3fc9a5d53",
      "mime_type": "text/x-python",
      "file_type": "Python script, UTF-8 Unicode text executable",
      "programming_language": "Python",
      "is_binary": false,
      "is_text": true,
      "is_archive": false,
      "is_media": false,
      "is_source": true,
      "is_script": true,
      "fingerprint": null,
      "files_count": 0,
      "dirs_count": 0,
      "size_count": 0,
      "scan_errors": [
        "ERROR: for scanner: fingerprint:\nERROR: Unknown error:\nTraceback (most recent call last):\n  File \"/home/sesser/Code/scancode-toolkit/src/scancode/interrupt.py\", line 91, in interruptible\n    return NO_ERROR, func(*(args or ()), **(kwargs or {}))\n  File \"/home/sesser/Code/scancode-toolkit/plugins/scancode-fingerprint/src/plugin_fingerprint/plugin_fingerprint.py\", line 71, in get_fingerprint\n    result = simhash.hex_digest()\n  File \"/home/sesser/Code/scancode-toolkit/plugins/scancode-fingerprint/src/plugin_fingerprint/fingerprint.py\", line 57, in hex_digest\n    fingerprint_binary = self.generate_fingerprint()\n  File \"/home/sesser/Code/scancode-toolkit/plugins/scancode-fingerprint/src/plugin_fingerprint/fingerprint.py\", line 45, in generate_fingerprint\n    weighted_hash = self.get_weighted_hash()\n  File \"/home/sesser/Code/scancode-toolkit/plugins/scancode-fingerprint/src/plugin_fingerprint/fingerprint.py\", line 72, in get_weighted_hash\n    self.process_shingles(shingle, result)\n  File \"/home/sesser/Code/scancode-toolkit/plugins/scancode-fingerprint/src/plugin_fingerprint/fingerprint.py\", line 106, in process_shingles\n    hash = hashlib.md5(shingle.encode()).digest()\nUnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13: ordinal not in range(128)\n"
      ]
    }
  ]
}
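
For triage, a report like the one above can be checked programmatically. The following is only a minimal sketch: it assumes the "files", "path" and "scan_errors" keys shown in the output, the helper name `files_with_scan_errors` is hypothetical, and the sample data is a trimmed stand-in (the second entry and the shortened error text are made up for illustration).

```python
import json


def files_with_scan_errors(scan):
    """Map each file path to its scan errors, skipping files without errors."""
    return {
        entry["path"]: entry["scan_errors"]
        for entry in scan.get("files", [])
        if entry.get("scan_errors")
    }


# Trimmed stand-in for the report above. "other.py" and its fingerprint value
# are hypothetical placeholders; the error text is shortened.
sample = json.loads("""
{"files": [
  {"path": "pyrpm.py", "fingerprint": null,
   "scan_errors": ["ERROR: for scanner: fingerprint: (shortened)"]},
  {"path": "other.py", "fingerprint": "placeholder", "scan_errors": []}
]}
""")

failed = files_with_scan_errors(sample)
for path, errors in failed.items():
    print(path, len(errors))
```

To run it against a real report, load the file written by --json-pp with `json.load` instead of the inline sample.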

The full CLI command (the failure is reproducible):

$ scancode -f -i src/packagedcode/pyrpm.py --json-pp ~/out.json
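
The last frame of the traceback above, `hashlib.md5(shingle.encode()).digest()`, raising a UnicodeDecodeError is consistent with Python 2 behavior: calling `.encode()` on a byte string first decodes it implicitly as ASCII, which fails on a byte such as 0xc3. The sketch below illustrates one way to avoid that step. It is not the change merged for this issue; the helper names `shingle_to_bytes` and `md5_digest` are hypothetical, and UTF-8 as the text encoding is an assumption.

```python
import hashlib


def shingle_to_bytes(shingle, encoding="utf-8"):
    """Return the shingle as bytes, encoding it only when it is text."""
    if isinstance(shingle, bytes):
        # Already raw bytes: hash them as-is so no implicit decode happens.
        return shingle
    # Text: encode explicitly (UTF-8 is an assumption, not taken from the plugin).
    return shingle.encode(encoding)


def md5_digest(shingle):
    """MD5 digest of a shingle given either as text or as bytes."""
    return hashlib.md5(shingle_to_bytes(shingle)).digest()


# The same content hashes identically whether passed as text or as UTF-8 bytes.
text_digest = md5_digest(u"caf\u00e9")
bytes_digest = md5_digest(b"caf\xc3\xa9")
print(text_digest == bytes_digest)
```

Checking the input type before encoding keeps the digest stable regardless of whether the caller reads the file as text or as bytes.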

System configuration

For bug reports, it really helps us to know:

  • What OS are you running on? Linux
  • What version of scancode-toolkit was used to generate the scan file? Latest Develop
  • What installation method was used to install/run scancode? (pip/source download/other) git/source download
@steven-esser steven-esser self-assigned this Sep 4, 2019
steven-esser added a commit that referenced this issue Nov 9, 2019
Signed-off-by: Steven Esser <sesser@nexb.com>
steven-esser added a commit that referenced this issue Nov 9, 2019
Signed-off-by: Steven Esser <sesser@nexb.com>
pombredanne added a commit that referenced this issue Nov 11, 2019
Handle non-ascii characters properly in scancode-fingerprint #1690
@steven-esser
Contributor Author

#1823 merged. Closing this.

viragumathe5 pushed a commit to viragumathe5/scancode-toolkit that referenced this issue Mar 13, 2020