Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

genrules with py_binary in exec_tools broken by TF python toolchain #4862

Open
nfelt opened this issue Apr 13, 2021 · 1 comment
Open

genrules with py_binary in exec_tools broken by TF python toolchain #4862

nfelt opened this issue Apr 13, 2021 · 1 comment

Comments

@nfelt
Copy link
Contributor

nfelt commented Apr 13, 2021

The problem:

  1. We have genrules that use py_binary targets listed in exec_tools.
  2. Because they are listed in exec_tools these py_binary targets get built in exec mode, using a special exec build configuration (rather than the host mode, which is what happens if you pass them in tools, and rather than the target mode, which is what happens if you just build the py_binary directly).
    • Quoting @wchargin: "host" = the machine on which Bazel runs, "execution" = the machine on which Bazel actions execute, "target" = end-user architecture
    • See background note at bottom.
  3. Due to our use of the TF Bazel workspace, we inherit a bunch of complex configs that affect the exec build configuration, and in particular, we inherit a special TF python toolchain that overrides the default Bazel autodetecting python toolchain.
  4. The special TF python toolchain finds the full path to the python binary on the build machine, and then "bakes" this into the Python toolchain configuration, which results in the exec-mode build of the py_binary rule's wrapper script hardcoding the path to whatever python was in use when it was first built.
  5. This build artifact has an environment dependency - most obviously on PATH - but unlike other build artifacts that rebuild whenever PATH changes (Bazel build recompiles protoc every time PATH changes #3213), this build artifact doesn't get marked as stale if PATH changes (possibly because it's assumed to be an exec mode artifact and thus "independent" of the host machine, even though it's certainly not), and consequently seems to persist until you do bazel clean --expunge or possibly some other scenarios (in practice it seemed to occasionally get cleared for no apparent reason at all).
  6. When PATH changes - for example, switching to a new virtualenv - everything else in our build will reflect the new python interpreter path (since it uses the Bazel auto-detecting python toolchain), except for these py_binary targets, which continue to be hardcoded to point to the old path.
  7. At this point, not only is the build not hermetic, it's invisibly dependent on the prior build environment, and as soon as something breaks there (for example: numpy gets uninstalled) you have a breakage that is not reliably reproducible and generally maddening to debug.
  8. (╯°□°)╯︵ ┻━┻

Background on exec_tools: Bazel genrules have historically always run in host mode, and as such, the tools attribute to genrules always builds its binaries in host mode. This was a problem for python 2 -> 3 since Bazel's host mode can only run either python 2 or python 3, but can't do both. So genrules had an exec_tools attribute introduced in bazelbuild/bazel#6443 to allow tool deps for genrules to be build in exec mode and vary whether they used python 2 or python 3 on a per-target basis. In preparation for switching over the Bazel host mode default from 2 to 3, we were asked to migrate all our genrules from tools to exec_tools, which we did in #4031. Now that they've cut over, nothing really prevents us from switching exec_tools back to tools to simplify our lives, except the note from Bazel team here.

How do we actually inherit the TF toolchain?

@nfelt nfelt changed the title build: genrules that use a py_binary in exec_tools are broken by TF python toolchain build: genrules with py_binary in exec_tools broken by TF python toolchain Apr 13, 2021
@nfelt nfelt changed the title build: genrules with py_binary in exec_tools broken by TF python toolchain genrules with py_binary in exec_tools broken by TF python toolchain Apr 13, 2021
@nfelt
Copy link
Contributor Author

nfelt commented Apr 13, 2021

I've written up a repro script that consistently demonstrates the issue here: repro.sh.txt

It creates 2 virtualenvs, builds a genrule once in the first venv, then switches venvs, builds the genrule again, and shows whether the genrule still uses the python binary from the first venv or not. Finally it switches the bazel output base (equivalent to bazel clean --expunge) without changing the venv, to show that a rebuild picks up the correct python binary from the second venv.

Sample executions:

# Run in an empty WORKSPACE - not affected
$ repro.sh

# Run in a WORKSPACE using `tf_workspace()` - reproduces the problem
$ repro.sh --tf-workspace

# Same as above but with patch to make Python toolchain reconfigure when `PATH` is changed - fixes the problem
$ repro.sh --tf-workspace-patched

Full contents is included below as well.

Full `repro.sh` script
#!/bin/bash

# Repro script for TensorFlow Python toolchain resolution problems.

set -e
set -o pipefail

# Whether to use the TF workspace rules. If not, uses an empty WORKSPACE.
tf_workspace=0
while [ "$#" -gt 0 ]; do
  case "$1" in
    --tf-workspace)
      tf_workspace=1
      shift
      ;;
    --tf-workspace-patched)
      tf_workspace_patched=1
      shift
      ;;
    *)
      printf >&2 'fatal: unknown argument "%s"' "$1"
      exit 1
      ;;
  esac
done

tmpdir="$(mktemp -d)"
workspace="${tmpdir}/src"
output_base_1="${tmpdir}/bazel_output_base_1"
output_base_2="${tmpdir}/bazel_output_base_2"
venv1="${tmpdir}/venv1"
venv2="${tmpdir}/venv2"
results="${tmpdir}/results"

# Helper to make a virtualenv and install numpy (needed for TF build).
function make-venv() {
  name="$1"
  (
    virtualenv -q --python python3 "${name}"
    source "${name}/bin/activate"
    pip -q install numpy
  )
}

# Helper to run bazel with a bunch of debugging flags.
function bazel-with-args() {
  output_base="$1"
  shift
  command="$1"
  shift
  (
  set -x
  bazel --output_base="${output_base}" "${command}" \
      --noshow_progress \
      --toolchain_resolution_debug=true \
      --experimental_repo_remote_exec \
      "$@"
  )
}

# Helper to run a single trial build.
function run_trial() {
  trial="$1"
  shift
  output_base="$1"
  shift
  venv="$1"
  shift
  (
    source "${venv}/bin/activate"
    bazel-with-args "${output_base}" build //:foo.txt
    bazel_bin="$(bazel-with-args "${output_base}" info bazel-bin)"
    cat "${bazel_bin}"/foo.txt | tee "${results}/${trial}_foo.txt"
  )
}

# Set up the example Bazel workspace.
mkdir "${workspace}"
cd "${workspace}"
touch WORKSPACE
touch tensorflow.patch

cat >optional_tensorflow.patch <<'EOF'
--- third_party/py/python_configure.bzl	1970-01-01 00:00:00.000000000 +0000
+++ third_party/py/python_configure.bzl	1970-01-01 00:00:00.000000000 +0000
@@ -268,6 +268,7 @@

 _ENVIRONS = [
     BAZEL_SH,
+    "PATH",
     PYTHON_BIN_PATH,
     PYTHON_LIB_PATH,
 ]
EOF

cat >TF_WORKSPACE <<'EOF'
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")
http_archive(
    name = "io_bazel_rules_closure",
    sha256 = "6a900831c1eb8dbfc9d6879b5820fd614d4ea1db180eb5ff8aedcb75ee747c1f",
    strip_prefix = "rules_closure-db4683a2a1836ac8e265804ca5fa31852395185b",
    urls = [
        "http://mirror.tensorflow.org/github.com/bazelbuild/rules_closure/archive/db4683a2a1836ac8e265804ca5fa31852395185b.tar.gz",
        "https://github.com/bazelbuild/rules_closure/archive/db4683a2a1836ac8e265804ca5fa31852395185b.tar.gz",  # 2020-01-15
    ],
)
load("@io_bazel_rules_closure//closure:repositories.bzl", "rules_closure_dependencies")
rules_closure_dependencies(
    omit_bazel_skylib = True,
    omit_com_google_protobuf = True,
    omit_com_google_protobuf_js = True,
)
http_archive(
    name = "org_tensorflow",
    sha256 = "2595a5c401521f20a2734c4e5d54120996f8391f00bb62a57267d930bce95350",
    strip_prefix = "tensorflow-2.3.0",
    urls = [
        "http://mirror.tensorflow.org/github.com/tensorflow/tensorflow/archive/v2.3.0.tar.gz",  # 2020-07-23
        "https://github.com/tensorflow/tensorflow/archive/v2.3.0.tar.gz",
    ],
    patches = ["//:tensorflow.patch"],
)
load("@org_tensorflow//tensorflow:workspace.bzl", "tf_workspace")
tf_workspace()
EOF

cat >BUILD <<'EOF'
py_binary(
    name = "foo",
    srcs = ["foo.py"],
    python_version = "PY3",
    srcs_version = "PY3",
)

genrule(
    name = "gen_foo",
    outs = ["foo.txt"],
    cmd = "$(execpath :foo) $(execpath :foo) >$@",
    exec_tools = [":foo"],
)
EOF

cat >foo.py <<'EOF'
from __future__ import print_function
import sys
print(sys.version)
print(sys.executable)
print(sys.argv)
if len(sys.argv) >= 2:
  path_to_exec_py_binary = sys.argv[1]
  with open(path_to_exec_py_binary) as f:
    for line in f:
      if line.startswith("PYTHON_BINARY"):
        print(line.rstrip())
EOF

if [ "${tf_workspace}" -eq 1 ]; then
  # Run trials with TF workspace.
  mv TF_WORKSPACE WORKSPACE
fi
if [ "${tf_workspace_patched}" -eq 1 ]; then
  # Run trials with patched TF workspace.
  mv TF_WORKSPACE WORKSPACE
  mv optional_tensorflow.patch tensorflow.patch
fi

mkdir "${results}"

# Create the virtualenvs.
make-venv "${venv1}"
make-venv "${venv2}"

# Run trials. Switch output base without switching venv to simulate expunge;
# it's better than expunge since then we can still inspect sandbox contents.
run_trial "1-venv1" "${output_base_1}" "${venv1}"
run_trial "2-venv2" "${output_base_1}" "${venv2}"
run_trial "3-venv2-expunged" "${output_base_2}" "${venv2}"

# Print out results in consolidated format.
printf "\n\n\n"
printf "=================== RESULTS =====================\n"
for file in $(ls -1 "${results}"); do
  printf "\n===== %s =====\n" "${file}"
  cat "${results}/${file}"
done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant