import_logs.py - 'utf-8' codec can't encode character '\udcbf' in position 0: surrogates not allowed #334

Open
mctunes opened this issue May 5, 2022 · 0 comments

While importing standard IIS log files using import_logs.py, the following exception was thrown when processing one particular file:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/var/www/html/piwik/misc/log-analytics/import_logs.py", line 1864, in _run_bulk
    self._record_hits(hits)
  File "/var/www/html/piwik/misc/log-analytics/import_logs.py", line 2010, in _record_hits
    'requests': [self._get_hit_args(hit) for hit in hits]
  File "/var/www/html/piwik/misc/log-analytics/import_logs.py", line 2010, in <listcomp>
    'requests': [self._get_hit_args(hit) for hit in hits]
  File "/var/www/html/piwik/misc/log-analytics/import_logs.py", line 1971, in _get_hit_args
    urllib.parse.quote(args['urlref'], '')
  File "/usr/lib/python3.8/urllib/parse.py", line 853, in quote
    string = string.encode(encoding, errors)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcbf' in position 0: surrogates not allowed
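
The last two frames can be reproduced outside the importer. The sketch below is a guess at how such a value might arise, not a confirmed diagnosis: it assumes the referrer field in the log contained a byte that is not valid UTF-8 (0xBF here) and that the line was decoded with the `surrogateescape` error handler, which maps that byte to the lone surrogate '\udcbf'. The sample bytes and the helper name `quote_urlref` are hypothetical; only the `urllib.parse.quote(..., '')` call is taken from the traceback.

```python
import urllib.parse


def quote_urlref(urlref: str) -> str:
    # Same call as in _get_hit_args according to the traceback.
    return urllib.parse.quote(urlref, '')


# Hypothetical referrer bytes with one byte that is not valid UTF-8.
raw = b'\xbf/page'

# Assumption: the log line was decoded with errors='surrogateescape',
# which turns the undecodable byte 0xBF into the lone surrogate U+DCBF.
decoded = raw.decode('utf-8', errors='surrogateescape')

try:
    quote_urlref(decoded)
except UnicodeEncodeError as exc:
    # quote() encodes with the strict handler by default, so the lone
    # surrogate cannot be written back out as UTF-8.
    print(exc)
```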

Expected Behavior

The log file should be imported successfully.

Current Behavior

The exception above is thrown and the import of this file fails.

Possible Solution
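
One possible direction, offered as an untested sketch rather than a patch against import_logs.py: pass an explicit `errors` argument to `urllib.parse.quote`, which accepts `encoding` and `errors` parameters. The helper names `quote_lenient` and `quote_lossy` are hypothetical. Whether the tracking endpoint accepts a percent-encoded byte that is not valid UTF-8 is an assumption that would need checking.

```python
import urllib.parse


def quote_lenient(value: str) -> str:
    # 'surrogateescape' turns a lone surrogate such as '\udcbf' back into
    # the original byte (0xBF), which quote() then percent-encodes.
    return urllib.parse.quote(value, '', errors='surrogateescape')


def quote_lossy(value: str) -> str:
    # Alternative: 'replace' substitutes '?' for anything that cannot be
    # encoded, so the original byte is lost but the output is plain ASCII.
    return urllib.parse.quote(value, '', errors='replace')


bad_urlref = '\udcbf/page'  # hypothetical value, modelled on the traceback

print(quote_lenient(bad_urlref))
print(quote_lossy(bad_urlref))
```

The first variant preserves the original bytes of the referrer, while the second guarantees that only valid characters are sent; which is preferable depends on how the receiving side handles invalid UTF-8.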

Steps to Reproduce (for Bugs)

  1. Execute import_logs.py for one particular file:
/var/www/html/piwik/misc/log-analytics/import_logs.py /logs/u_ex220127.log \
	--url=https://analytics.example.com \
	--idsite=1 \
	--recorders=4 \
	--accept-invalid-ssl-certificate \
	--enable-http-errors \
	--enable-bots \
	--exclude-path="/cf_scripts/*" \
	--exclude-path="/tz_json/*" \
	--exclude-path="/*/assets/*" \
	--exclude-path="/*/cache/*" \
	--exclude-path="/*/css/*" \
	--exclude-path="/*/images/*"

Context

This has only happened once, on one particular file. We worked around it by removing the file from the batch, and processing then continued as normal.

Please let me know if there is any other information you need.

Your Environment

  • Matomo Version: 4.9.1
  • PHP Version: 7.4.3
  • Server Operating System: Ubuntu 20.04.4 LTS
  • Additionally installed plugins:
    CustomVariables (v4.0.1)
    HidePasswordReset (v4.3.3)
    IntranetGeoIP (v4.0.0)
    LogViewer (v4.0.1)
    LoginLdap (v4.4.0)
    Provider (v4.0.3)
    SecurityInfo (v4.0.2)
  • Browser: N/A
  • Operating System: N/A
@sgiehl transferred this issue from matomo-org/matomo May 6, 2022