Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUILD file parsing is not symmetric w.r.t. actions.write when processing UTF-8 encoded BUILD files. #10174

Closed
aiuto opened this issue Nov 6, 2019 · 7 comments
Assignees
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) type: bug

Comments

@aiuto
Copy link
Contributor

aiuto commented Nov 6, 2019

Description of the problem / feature request:

Multi-byte UTF-8 sequences in BUILD files get parsed as multiple distinct characters instead of a single code point

Feature requests: what underlying problem are you trying to solve with this feature?

It is useful to have full Unicode support for attributes.

Bugs: what's the simplest, easiest way to reproduce this bug?

  • a rule with one string attribute
  • the implementation of the rule should ctx.actions.write(..., content={that attribute})
  • observe an instance where the value of the attribute contains a character like the copyright symbol

Repo here: https://github.com/tonyaiuto/bazel/tree/master/utf8_encode

What operating system are you running Bazel on?

Linux

What's the output of bazel info release?

bazel 1.1.0

Have you found anything relevant by searching the web?

#4551 is similar.

Analysis from the sample repo.

The iso8859-Latin-1 copyright symbol is 0xA9. The UTF-8 encoding of that
is the two bytes [c2, a9]. What we see in the output file is clearly
those two bytes again encoded as utf-8.

The BUILD file is in UTF-8 format:

grep copyright_notice BUILD | od -c -t x1
0000000                   c   o   p   y   r   i   g   h   t   _   n   o
         20  20  20  20  63  6f  70  79  72  69  67  68  74  5f  6e  6f
0000020   t   i   c   e       =       "   C   o   p   y   r   i   g   h
         74  69  63  65  20  3d  20  22  43  6f  70  79  72  69  67  68
0000040   t     302 251       2   0   1   9       T   o   n   y       A
         74  20  c2  a9  20  32  30  31  39  20  54  6f  6e  79  20  41
0000060   i   u   t   o   "   ,  \n
         69  75  74  6f  22  2c  0a
0000067

We can see the signature of the encoded copyright symbol as 302 251.

It would seem that

  • the BUILD file is parsed as a stream of octets, each one becoming a
    distinct character [0xc2, 0xa9].
  • write() presumes the file should be UTF-8 encoded and converts the 2
    characters into the 4 need for their UTF-8 representation.

Potential fixes

BUILD files are UTF-8

  • Pro: Easy to implement and understand.
  • Con: A breaking change, but probably little used. (Reasoning: encodings are
    broken anyway, so who could be using anything beyond ASCII successfully)

Allow BUILD files to specify an encoding

We could borrow from (PEP-263)[https://www.python.org/dev/peps/pep-0263/]
and use:

# -*- coding: <encoding name> -*-
  • Pro: Not a breaking change.
  • Con: Feature bloat.

Byte are bytes

ctx.actions.write() should not assume an encoding for the output. It
would emit exactly what it is given.

  • Pro: Easy to implement.
  • Con: Breaking change.
  • Con: Hard to defend as the most useful option.

More switches and bells.

ctx.actions.write() could have an encoding attribute. Use 'none' for this case.

  • Con: Feature bloat.
  • Con: Dubious value.
  • Con: Still does not fix the bug.
@laurentlb
Copy link
Contributor

cc @alandonovan

Without looking into details, I'd assume that ctx.actions.write should be fixed to emit exactly what was in input.

@laurentlb laurentlb added P3 We're not considering working on this, but happy to review a PR. (No assignee) type: bug and removed untriaged labels Dec 16, 2019
@aiuto
Copy link
Contributor Author

aiuto commented May 18, 2020

What also works is if bazel just reads octets, with no interpretation. File write actions should also not have an encoding. This make bazel an efficient passthrough.

@alandonovan
Copy link
Contributor

alandonovan commented May 18, 2020

This is essentially a duplicate of #374: Bazel treats BUILD and .bzl files, and file names returned by readdir, as if they are encoded as Latin 1, and then uses many weird hacks to try to correct for that mistake.

(When we made this decision, a long time ago, it was in fear that if we chose UTF-8, then it would be possible to create directories that couldn't be deleted because their file names (byte strings) were not valid UTF-8 strings. Of course, the solution to that is: make the DeleteTree operation more robust. I also blame my youthful ignorance of string encoding.)

I tried a fix a few months back, but it's a complex and potentially breaking change. For example, it changes the result of len(s) for some strings---though I suspect extremely few real-world Starlark functions would care.

@aiuto
Copy link
Contributor Author

aiuto commented May 26, 2020

Going back to the comment of 2019-11-29, yes. I think file write is broken, in that it converts to UTF-8, whereas it should just write what was parsed in.

aiuto added a commit to aiuto/bazel that referenced this issue May 26, 2020
Since Bazel reads BUILD and .bzl files as (essentially) raw bytes, any UTF-8
encoded input is essentially pre-UTF-8 encoded on the way out.

Closes bazelbuild#10174

DO NOT SUBMIT: This is a breaking change, so it needs a migration path.
Current idea:
- introduce a new attribute to write 'encode_as_utf_8'. Default is gated by incompatible_write_action_encodes_as_utf_8.
- for bazel we do the multi-release cycle flip
- for google we agressively try to change the default
  - add encode_as_utf_8 to the few broken rules
  - flip the default
  - fix the user.
@aiuto
Copy link
Contributor Author

aiuto commented May 26, 2020

And yes. Removing any mention of UTF8 in that file does fix this issue. The sequence of octets from my BUILD file are perfectly preserved in the file created by actions.write.
Just for yucks, I'm trying to see the impact of this within google.

@alandonovan
Copy link
Contributor

alandonovan commented May 26, 2020

Tony: it is correct that the file actually written by the action contain UTF-8 encoded text. The question is whether getFileContents() returns (a) a regular Java UTF-16 string, in which case getBytes(UTF_8) is appropriate, or (b) a Starlark string value, a monstrous thing that puts the bytes of a UTF-8 encoded text string in the chars of a java.lang.String. If the latter, then you need to use getBytes(ISO8859_1).

Either way, please document in FileWriteAction whatever you discover.

@aiuto
Copy link
Contributor Author

aiuto commented May 26, 2020

Good question. It is probably a starlark string value, since the code creating the content is taking a string attribute from a rule. You can see it in the repo repository. I expect a 27 character string of text, but len(x) gives me 28, hinting that the copyright symbol is taking 2 chars instead of 1. When we write that out, we get 4 bytes, since each of the 2 chars is getting the UTF-8 externalization.

After I do the global presubmit with no encodings, I'll try ISO8859_1.

@aiuto aiuto changed the title BUILD files are parsed as raw octets instead of UTF-8 BUILD file parsing is not symmetric w.r.t. actions.write when processing UTF-8 encoded BUILD files. May 26, 2020
@aiuto aiuto self-assigned this Jun 15, 2020
bazel-io pushed a commit that referenced this issue Jul 13, 2020
Baseline: 7404d17

Cherry picks:

   + a4334be:
     fixup! Gracefully handle the lack of subreaper support in Linux.

Incompatible changes:

  - This removes the short-lived --process_wrapper_extra_flags
    flag, which was introduced primarily to roll out a bug fix.
    Unfortunately,
    this made us inadvertently expose all of the process-wrapper's
    command line
    interface to the public, which should not have happened.  Given
    the corner
    case of the utility of this flag, the lack of documentation for
    it, and the
    fact that it only appeared in a single release, we are treating
    this as a
    bug instead of a backwards compatibility breakage.

New features:

  - bazel info: Allow to specify multiple keys.
  - Support code coverage with GCC 9.

Important changes:

  - Allow InstrumentedFilesInfo fields to be read from Starlark.
  - The --starlark_cpu_profile=<file> flag writes a profile in
    pprof format containing a statistical summary of CPU usage
    by all Starlark execution during the bazel command. Use it
    to identify slow Starlark functions in loading and analysis.
  - The --debug_depset_flag has been removed as it is in effect
    always on at no cost.
  - Rule authors should use the
    incompatible_use_toolchain_transition rule attribute to migrate
    to using
    the toolchain transition. jcater to udpate notes further.
  - `apple_binary` rules now accept the `stamp` attribute with the
    same
    semantics that it has in `cc_binary` rules.
  - --incompatible_objc_provider_remove_compile_info turns off
    the compile info/mege_zip Starlark APIs in ObjcProvider.  See
    #11359.
  - The --debug_depset_flag has been removed as it is in effect
    always on at no cost.
  - Fix behavior of ctx.actions.write so content is written without
    an incorrect encoding to UTF-8.
    See #10174 for details.
  - Collect more performance metrics for worker execution.
  - Add flag --incompatible_force_strict_header_check_from_starlark
  - Configure coverage and runfiles for sh_library.
  - Adds --incompatible_blacklisted_protos_requires_proto_info to
    indicate whether proto_lang_toolchain.blacklisted_protos requires
    ProtoInfo.

This release contains contributions from many people at Google, as well as Andrzej Guszak, Benjamin Peterson, Benjamin Romano, Carlos Eduardo Seo, Claudio Bley, dannysullivan, David Ostrovsky, George Gensure, Graham Jenson, Grzegorz Lukasik, Gunnar Wagenknecht, Henk van der Laan, Jin, John Millikin, Marin Baron, Nikhil Marathe, Robin Nabel, Ryan Beasley, Samuel Giddins, Sergey Balabanov, utsav-dbx, Vo Van Nghia, Yannic Bonenberger.
meteorcloudy pushed a commit that referenced this issue Jul 13, 2020
Baseline: 7404d17

Cherry picks:

   + a4334be:
     fixup! Gracefully handle the lack of subreaper support in Linux.

Incompatible changes:

  - This removes the short-lived --process_wrapper_extra_flags
    flag, which was introduced primarily to roll out a bug fix.
    Unfortunately,
    this made us inadvertently expose all of the process-wrapper's
    command line
    interface to the public, which should not have happened.  Given
    the corner
    case of the utility of this flag, the lack of documentation for
    it, and the
    fact that it only appeared in a single release, we are treating
    this as a
    bug instead of a backwards compatibility breakage.

New features:

  - bazel info: Allow to specify multiple keys.
  - Support code coverage with GCC 9.

Important changes:

  - Allow InstrumentedFilesInfo fields to be read from Starlark.
  - The --starlark_cpu_profile=<file> flag writes a profile in
    pprof format containing a statistical summary of CPU usage
    by all Starlark execution during the bazel command. Use it
    to identify slow Starlark functions in loading and analysis.
  - The --debug_depset_flag has been removed as it is in effect
    always on at no cost.
  - Rule authors should use the
    incompatible_use_toolchain_transition rule attribute to migrate
    to using
    the toolchain transition. jcater to udpate notes further.
  - `apple_binary` rules now accept the `stamp` attribute with the
    same
    semantics that it has in `cc_binary` rules.
  - --incompatible_objc_provider_remove_compile_info turns off
    the compile info/mege_zip Starlark APIs in ObjcProvider.  See
    #11359.
  - The --debug_depset_flag has been removed as it is in effect
    always on at no cost.
  - Fix behavior of ctx.actions.write so content is written without
    an incorrect encoding to UTF-8.
    See #10174 for details.
  - Collect more performance metrics for worker execution.
  - Add flag --incompatible_force_strict_header_check_from_starlark
  - Configure coverage and runfiles for sh_library.
  - Adds --incompatible_blacklisted_protos_requires_proto_info to
    indicate whether proto_lang_toolchain.blacklisted_protos requires
    ProtoInfo.

This release contains contributions from many people at Google, as well as Andrzej Guszak, Benjamin Peterson, Benjamin Romano, Carlos Eduardo Seo, Claudio Bley, dannysullivan, David Ostrovsky, George Gensure, Graham Jenson, Grzegorz Lukasik, Gunnar Wagenknecht, Henk van der Laan, Jin, John Millikin, Marin Baron, Nikhil Marathe, Robin Nabel, Ryan Beasley, Samuel Giddins, Sergey Balabanov, utsav-dbx, Vo Van Nghia, Yannic Bonenberger.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants