BUILD file parsing is not symmetric w.r.t. actions.write when processing UTF-8 encoded BUILD files. #10174

aiuto · 2019-11-06T04:21:41Z

Description of the problem / feature request:

Multi-byte UTF-8 sequences in BUILD files get parsed as multiple distinct characters instead of a single code point

Feature requests: what underlying problem are you trying to solve with this feature?

It is useful to have full Unicode support for attributes.

Bugs: what's the simplest, easiest way to reproduce this bug?

a rule with one string attribute
the implementation of the rule should ctx.actions.write(..., content={that attribute})
observe an instance where the value of the attribute contains a character like the copyright symbol

Repo here: https://github.com/tonyaiuto/bazel/tree/master/utf8_encode

What operating system are you running Bazel on?

Linux

What's the output of `bazel info release`?

bazel 1.1.0

Have you found anything relevant by searching the web?

#4551 is similar.

Analysis from the sample repo.

The iso8859-Latin-1 copyright symbol is 0xA9. The UTF-8 encoding of that
is the two bytes [c2, a9]. What we see in the output file is clearly
those two bytes again encoded as utf-8.

The BUILD file is in UTF-8 format:

grep copyright_notice BUILD | od -c -t x1
0000000                   c   o   p   y   r   i   g   h   t   _   n   o
         20  20  20  20  63  6f  70  79  72  69  67  68  74  5f  6e  6f
0000020   t   i   c   e       =       "   C   o   p   y   r   i   g   h
         74  69  63  65  20  3d  20  22  43  6f  70  79  72  69  67  68
0000040   t     302 251       2   0   1   9       T   o   n   y       A
         74  20  c2  a9  20  32  30  31  39  20  54  6f  6e  79  20  41
0000060   i   u   t   o   "   ,  \n
         69  75  74  6f  22  2c  0a
0000067

We can see the signature of the encoded copyright symbol as 302 251.

It would seem that

the BUILD file is parsed as a stream of octets, each one becoming a
distinct character [0xc2, 0xa9].
write() presumes the file should be UTF-8 encoded and converts the 2
characters into the 4 need for their UTF-8 representation.

Potential fixes

BUILD files are UTF-8

Pro: Easy to implement and understand.
Con: A breaking change, but probably little used. (Reasoning: encodings are
broken anyway, so who could be using anything beyond ASCII successfully)

Allow BUILD files to specify an encoding

We could borrow from (PEP-263)[https://www.python.org/dev/peps/pep-0263/]
and use:

# -*- coding: <encoding name> -*-

Pro: Not a breaking change.
Con: Feature bloat.

Byte are bytes

ctx.actions.write() should not assume an encoding for the output. It
would emit exactly what it is given.

Pro: Easy to implement.
Con: Breaking change.
Con: Hard to defend as the most useful option.

More switches and bells.

ctx.actions.write() could have an encoding attribute. Use 'none' for this case.

Con: Feature bloat.
Con: Dubious value.
Con: Still does not fix the bug.

The text was updated successfully, but these errors were encountered:

laurentlb · 2019-11-29T16:55:15Z

cc @alandonovan

Without looking into details, I'd assume that ctx.actions.write should be fixed to emit exactly what was in input.

aiuto · 2020-05-18T21:33:21Z

What also works is if bazel just reads octets, with no interpretation. File write actions should also not have an encoding. This make bazel an efficient passthrough.

alandonovan · 2020-05-18T23:22:31Z

This is essentially a duplicate of #374: Bazel treats BUILD and .bzl files, and file names returned by readdir, as if they are encoded as Latin 1, and then uses many weird hacks to try to correct for that mistake.

(When we made this decision, a long time ago, it was in fear that if we chose UTF-8, then it would be possible to create directories that couldn't be deleted because their file names (byte strings) were not valid UTF-8 strings. Of course, the solution to that is: make the DeleteTree operation more robust. I also blame my youthful ignorance of string encoding.)

I tried a fix a few months back, but it's a complex and potentially breaking change. For example, it changes the result of len(s) for some strings---though I suspect extremely few real-world Starlark functions would care.

aiuto · 2020-05-26T18:05:52Z

Going back to the comment of 2019-11-29, yes. I think file write is broken, in that it converts to UTF-8, whereas it should just write what was parsed in.

bazel/src/main/java/com/google/devtools/build/lib/analysis/actions/FileWriteAction.java

Line 238 in 1557ac8

out.write(getFileContents().getBytes(UTF_8));

Since Bazel reads BUILD and .bzl files as (essentially) raw bytes, any UTF-8 encoded input is essentially pre-UTF-8 encoded on the way out. Closes bazelbuild#10174 DO NOT SUBMIT: This is a breaking change, so it needs a migration path. Current idea: - introduce a new attribute to write 'encode_as_utf_8'. Default is gated by incompatible_write_action_encodes_as_utf_8. - for bazel we do the multi-release cycle flip - for google we agressively try to change the default - add encode_as_utf_8 to the few broken rules - flip the default - fix the user.

aiuto · 2020-05-26T19:01:43Z

And yes. Removing any mention of UTF8 in that file does fix this issue. The sequence of octets from my BUILD file are perfectly preserved in the file created by actions.write.
Just for yucks, I'm trying to see the impact of this within google.

alandonovan · 2020-05-26T19:09:52Z

Tony: it is correct that the file actually written by the action contain UTF-8 encoded text. The question is whether getFileContents() returns (a) a regular Java UTF-16 string, in which case getBytes(UTF_8) is appropriate, or (b) a Starlark string value, a monstrous thing that puts the bytes of a UTF-8 encoded text string in the chars of a java.lang.String. If the latter, then you need to use getBytes(ISO8859_1).

Either way, please document in FileWriteAction whatever you discover.

aiuto · 2020-05-26T20:05:41Z

Good question. It is probably a starlark string value, since the code creating the content is taking a string attribute from a rule. You can see it in the repo repository. I expect a 27 character string of text, but len(x) gives me 28, hinting that the copyright symbol is taking 2 chars instead of 1. When we write that out, we get 4 bytes, since each of the 2 chars is getting the UTF-8 externalization.

After I do the global presubmit with no encodings, I'll try ISO8859_1.

Baseline: 7404d17 Cherry picks: + a4334be: fixup! Gracefully handle the lack of subreaper support in Linux. Incompatible changes: - This removes the short-lived --process_wrapper_extra_flags flag, which was introduced primarily to roll out a bug fix. Unfortunately, this made us inadvertently expose all of the process-wrapper's command line interface to the public, which should not have happened. Given the corner case of the utility of this flag, the lack of documentation for it, and the fact that it only appeared in a single release, we are treating this as a bug instead of a backwards compatibility breakage. New features: - bazel info: Allow to specify multiple keys. - Support code coverage with GCC 9. Important changes: - Allow InstrumentedFilesInfo fields to be read from Starlark. - The --starlark_cpu_profile=<file> flag writes a profile in pprof format containing a statistical summary of CPU usage by all Starlark execution during the bazel command. Use it to identify slow Starlark functions in loading and analysis. - The --debug_depset_flag has been removed as it is in effect always on at no cost. - Rule authors should use the incompatible_use_toolchain_transition rule attribute to migrate to using the toolchain transition. jcater to udpate notes further. - `apple_binary` rules now accept the `stamp` attribute with the same semantics that it has in `cc_binary` rules. - --incompatible_objc_provider_remove_compile_info turns off the compile info/mege_zip Starlark APIs in ObjcProvider. See #11359. - The --debug_depset_flag has been removed as it is in effect always on at no cost. - Fix behavior of ctx.actions.write so content is written without an incorrect encoding to UTF-8. See #10174 for details. - Collect more performance metrics for worker execution. - Add flag --incompatible_force_strict_header_check_from_starlark - Configure coverage and runfiles for sh_library. - Adds --incompatible_blacklisted_protos_requires_proto_info to indicate whether proto_lang_toolchain.blacklisted_protos requires ProtoInfo. This release contains contributions from many people at Google, as well as Andrzej Guszak, Benjamin Peterson, Benjamin Romano, Carlos Eduardo Seo, Claudio Bley, dannysullivan, David Ostrovsky, George Gensure, Graham Jenson, Grzegorz Lukasik, Gunnar Wagenknecht, Henk van der Laan, Jin, John Millikin, Marin Baron, Nikhil Marathe, Robin Nabel, Ryan Beasley, Samuel Giddins, Sergey Balabanov, utsav-dbx, Vo Van Nghia, Yannic Bonenberger.

dslomov added team-Starlark untriaged labels Nov 25, 2019

laurentlb added P3 We're not considering working on this, but happy to review a PR. (No assignee) type: bug and removed untriaged labels Dec 16, 2019

aiuto mentioned this issue May 26, 2020

Stop doing actions.write UTF-8. This was causing a double encode. #11496

Closed

aiuto changed the title ~~BUILD files are parsed as raw octets instead of UTF-8~~ BUILD file parsing is not symmetric w.r.t. actions.write when processing UTF-8 encoded BUILD files. May 26, 2020

aiuto self-assigned this Jun 15, 2020

aiuto mentioned this issue Jun 16, 2020

target label paths with non-latin1 characters have inconsistent starlark semantics. #11602

Closed

bazel-io closed this as completed in 27e1103 Jun 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUILD file parsing is not symmetric w.r.t. actions.write when processing UTF-8 encoded BUILD files. #10174

BUILD file parsing is not symmetric w.r.t. actions.write when processing UTF-8 encoded BUILD files. #10174

aiuto commented Nov 6, 2019

laurentlb commented Nov 29, 2019

aiuto commented May 18, 2020

alandonovan commented May 18, 2020 •

edited

Loading

aiuto commented May 26, 2020

aiuto commented May 26, 2020

alandonovan commented May 26, 2020 •

edited

Loading

aiuto commented May 26, 2020

BUILD file parsing is not symmetric w.r.t. actions.write when processing UTF-8 encoded BUILD files. #10174

BUILD file parsing is not symmetric w.r.t. actions.write when processing UTF-8 encoded BUILD files. #10174

Comments

aiuto commented Nov 6, 2019

Description of the problem / feature request:

Feature requests: what underlying problem are you trying to solve with this feature?

Bugs: what's the simplest, easiest way to reproduce this bug?

What operating system are you running Bazel on?

What's the output of bazel info release?

Have you found anything relevant by searching the web?

Analysis from the sample repo.

Potential fixes

BUILD files are UTF-8

Allow BUILD files to specify an encoding

Byte are bytes

More switches and bells.

laurentlb commented Nov 29, 2019

aiuto commented May 18, 2020

alandonovan commented May 18, 2020 • edited Loading

aiuto commented May 26, 2020

aiuto commented May 26, 2020

alandonovan commented May 26, 2020 • edited Loading

aiuto commented May 26, 2020

What's the output of `bazel info release`?

alandonovan commented May 18, 2020 •

edited

Loading

alandonovan commented May 26, 2020 •

edited

Loading