UTF8 encoding problems in minimal Ubuntu for CI #11

Closed
palmskog opened this issue Dec 14, 2020 · 5 comments · Fixed by #13

@palmskog
Collaborator

palmskog commented Dec 14, 2020

I set up a custom Docker container with Ubuntu (Dockerfile) to be able to run Alectryon with coqdoc on every master branch push for a Coq project. However, I quickly ran into UTF8 encoding issues like this:

'ascii' codec can't encode character '\u2191' in position 6443: ordinal not in range(128)

Note that \u2191 is the "uparrow" Unicode symbol, so the problem came from the use of HEADER in alectryon/html.py.

Even after reading up on Python 3 encoding issues, I couldn't figure out exactly where a .encode("utf-8") might be missing, so I opted to simply remove all UTF8 from all output by Alectryon and coqdoc. However, since the --utf8 option to coqdoc is hardcoded, I had to use a fork of Alectryon (commit). Also, I believe this means the build will break anytime anyone uses a UTF8 character in a Coq file.

Is there a better way to solve this issue? I theorize that a more complete workaround would be to set up a locale (e.g., en_US.UTF8) in the Docker container, but this seems like a cumbersome thing to do in every Docker image where one wants to run Alectryon.

@cpitclaudel
Owner

Thanks a lot for the report. Do you have a complete backtrace? (You can get one by passing --traceback to Alectryon.) The reason I'm asking is that Alectryon doesn't really print much to stdout, so this error seems to mean that programs in that Docker container can't even write files that contain non-ASCII characters.

I think the solution is here: https://stackoverflow.com/questions/52065842/python-docker-ascii-codec-cant-encode-character (ignore the incorrect duplicate banner)

I wonder if this is the same problem as the one that forced @jfehrle to catch encoding exceptions in https://github.com/coq/coq/pull/13564/files#diff-99858e5d76716d34bcaf9ad38b8d67f05a7a8849e7969faa8b2318805d94f223R219 .

> Also, I believe this means the build will break anytime anyone uses a UTF8 character in a Coq file. […]
> I theorize that a more complete workaround would be to set up a locale (e.g., en_US.UTF8) in the Docker container, but this seems like a cumbersome thing to do

I think that's the right solution, precisely because of your point on non-ascii characters in Coq files. Fortunately it looks easy (ENV LANG en_US.utf8); once we confirm that this works, I'll add a note in the readme.
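
To check what a given container actually hands to Python, a small sketch along these lines may help. It is not part of Alectryon, and the function name is made up for illustration. It reports the encoding that open() falls back to in text mode when no encoding= argument is passed, which Python derives from the locale settings (LANG and friends). The result can also depend on the Python version, so treat it as a diagnostic rather than a guarantee.

```python
import locale
import sys


def default_text_encoding() -> str:
    """Return the encoding open() uses in text mode when encoding= is omitted.

    Hypothetical helper for illustration; it only wraps a standard-library call.
    """
    return locale.getpreferredencoding(False)


if __name__ == "__main__":
    # Run this inside the container, before and after setting LANG,
    # and compare the two results.
    print("open() default encoding:", default_text_encoding())
    print("stdout encoding:", sys.stdout.encoding)
```

If the first line reports an ASCII-only encoding (Python sometimes names it ANSI_X3.4-1968), any text-mode write of a non-ASCII character without an explicit encoding can be expected to fail.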

@palmskog
Collaborator Author

Complete command and backtrace from inside the container:

user@eaac613822d7:~/casper-cbc-proofs$ ~/alectryon/alectryon.py --frontend coqdoc --webpage-style windowed --traceback -Q . CasperCBC --output-directory tmp Lib/Classes.v
Traceback (most recent call last):
  File "/home/user/alectryon/alectryon.py", line 26, in <module>
    main()
  File "/home/user/alectryon/alectryon/cli.py", line 631, in main
    process_pipelines(args)
  File "/home/user/alectryon/alectryon/cli.py", line 623, in process_pipelines
    raise e
  File "/home/user/alectryon/alectryon/cli.py", line 620, in process_pipelines
    state = call_pipeline_step(step, state, ctx)
  File "/home/user/alectryon/alectryon/cli.py", line 589, in call_pipeline_step
    return step(state, **{p: ctx[p] for p in params})
  File "/home/user/alectryon/alectryon/cli.py", line 326, in <lambda>
    write_output(ext, contents, fname, output, output_directory)
  File "/home/user/alectryon/alectryon/cli.py", line 322, in write_output
    f.write(contents)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2191' in position 6441: ordinal not in range(128)
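
The failing f.write(contents) call can be imitated without Docker or Alectryon. The sketch below is an illustration only: it forces an ASCII text stream on top of an in-memory buffer, which is assumed to behave like a file opened with mode="w" when the default encoding is ASCII. The function name is hypothetical.

```python
import io


def write_through_ascii_stream(contents: str) -> str:
    """Write text through an ASCII-only text stream and report what happens.

    Stands in for open(path, mode="w") when the default encoding is ASCII.
    """
    stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
    try:
        stream.write(contents)
        stream.flush()
    except UnicodeEncodeError as err:
        # Same exception type as in the traceback above.
        return f"{type(err).__name__}: {err}"
    return "ok"


if __name__ == "__main__":
    print(write_through_ascii_stream("plain ASCII text"))
    print(write_through_ascii_stream("arrow: \u2191"))
```

The second call reports a UnicodeEncodeError for '\u2191' with the same "ordinal not in range(128)" reason, which suggests the problem is in how the output file is opened rather than in a missing .encode() call.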

@palmskog
Collaborator Author

@cpitclaudel it actually seems as though the following diff for cli.py solves the issue completely, even with LANG=C:

@@ -318,7 +318,7 @@ def write_output(ext, contents, fname, output, output_directory):
     else:
         if not output:
             output = os.path.join(output_directory, strip_extension(fname) + ext)
-        with open(output, mode="w") as f:
+        with open(output, mode="w", encoding="utf-8") as f:
             f.write(contents)
 
 def write_file(ext):

Since the whole project is supposed to be UTF8 anyway, would a PR with this change be welcome? To me, this would be a better fix than remembering to change LANG everywhere.
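
As a standalone illustration of why the one-line change helps, here is a minimal sketch that writes the offending character with an explicit encoding and reads the raw bytes back. The helper names and the out.html file name are made up for this example; only the open(..., encoding="utf-8") call mirrors the diff.

```python
import os
import tempfile


def write_utf8(path: str, contents: str) -> None:
    """Write text with an explicit encoding, independent of LANG/LC_ALL."""
    with open(path, mode="w", encoding="utf-8") as f:
        f.write(contents)


def demo() -> bytes:
    """Write U+2191 to a temporary file and return the bytes stored on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.html")
        write_utf8(path, "\u2191")
        with open(path, "rb") as f:
            return f.read()


if __name__ == "__main__":
    print(demo())
```

Because the encoding is passed explicitly, the locale-derived default is never consulted for this write, so the result should be the same under LANG=C and under a UTF-8 locale.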

@jfehrle

jfehrle commented Dec 14, 2020 via email

@jfehrle

jfehrle commented Dec 14, 2020 via email
