Conversation

@caisq (Contributor) commented Dec 23, 2019

  • Motivation for features / changes

    • Unblock TensorBoard's Travis CI so it can use the latest (floating) tf-nightly
  • Technical description of changes

    • Update .travis.yml to use Ubuntu 16.04 (Xenial)
    • Update .travis.yml to float the tf-nightly version (see the sketch below)
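
A minimal sketch of the shape of these changes, assuming the usual Travis keys (the real .travis.yml has many more entries, and the exact install step may differ):

```yaml
# Illustrative sketch only -- the actual .travis.yml is much larger.
dist: xenial        # Ubuntu 16.04 instead of the previous Trusty (14.04) image
language: python

install:
  # Float tf-nightly instead of pinning a specific dev build, so CI always
  # exercises the newest nightly wheel.
  - pip install -U tf-nightly
```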

@wchargin (Contributor)

We usually keep this floating, but have pinned it (in #2947) due to
low-level errors in TensorFlow. I can still reproduce the SIGABRT error
on today’s nightly (20191222), and this is one of the cases where Travis
caching creates Heisenbugs, so it’s plausible that we’ll have to revert
this PR soon.

@caisq (Contributor, Author) commented Dec 23, 2019

@wchargin Thanks for the background info. I can see the exact double-free and size-corruption errors in the test logs of this PR: https://api.travis-ci.org/v3/job/628578238/log.txt
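
For anyone triaging this later, one quick (illustrative) way to pull that log and search for the glibc memory-corruption messages:

```sh
# Fetch the Travis job log and grep for the glibc corruption messages;
# the exact wording varies between glibc versions.
curl -sL https://api.travis-ci.org/v3/job/628578238/log.txt \
  | grep -Ei 'double free|corrupted size|free\(\): invalid'
```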

I've pinged the owner of the tensorflow issue again. This needs to be fixed soon, or it'll block the debugger work and the next release.

@caisq (Contributor, Author) commented Dec 23, 2019

@mihaimaruseac Trying out dist: xenial (Ubuntu 16.04). See 4aa3035

@caisq (Contributor, Author) commented Dec 23, 2019

For the motivation behind the upgrade to Xenial, see tensorflow/tensorflow#34427 (comment)

@caisq changed the title from "Upgrade tf-nightly version to dev20191222" to "Upgrade docker image to xenial; tf-nightly version to dev20191222" on Dec 23, 2019
@caisq (Contributor, Author) commented Dec 23, 2019

@mihaimaruseac It worked! Thanks!!

@caisq requested review from nfelt and wchargin on December 23, 2019 at 19:27
@caisq marked this pull request as ready for review on December 23, 2019 at 19:28
@wchargin changed the title from "Upgrade docker image to xenial; tf-nightly version to dev20191222" to "Upgrade Travis to Xenial and floating tf-nightly" on Dec 23, 2019
@wchargin (Contributor) left a comment

Interesting. The last times that we tried to upgrade to Xenial (#2293,
#2654), we gave up because we were hitting Heisenbugs that we couldn’t
reproduce locally. The test experiencing those Heisenbugs is currently
disabled due to #2987, which isn’t obviously related to the Xenial-only
errors (SIGSEGVs vs. non-responsive HTTP routes). So I suppose that this
patch is fine with me because it’s not a regression; it merely changes
the root cause of the already-broken test.

@wchargin (Contributor)

(Also, the platform-specific segfaults aren’t exactly encouraging… it’s
good that this unblocks us in the short term, but it doesn’t sound like
the underlying problem has really been fixed.)

@caisq (Contributor, Author) commented Dec 23, 2019

@wchargin I'm not sure I agree with #3070 (comment), given that the official System Requirements doc states that Ubuntu 16.04+ is required:

https://www.tensorflow.org/install/pip

cc @mihaimaruseac for context and any opinions on whether that related tensorflow issue should be closed or not.

@wchargin (Contributor)

Hmm, fair. I guess it could be a legitimate fix for an ABI change, then.

@mihaimaruseac

We switched to being manylinux2010 compliant, and that required a modern toolchain which is not provided by Ubuntu 14.04.

Furthermore, Ubuntu 14.04 reached end of life this summer, so we can no longer support it: no patches would come to it or its toolchains.
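
For context, a quick (hypothetical) check of which manylinux tag the nightly wheel carries, without installing it:

```sh
# Download only the tf-nightly wheel (no dependencies) and inspect its filename;
# the platform tag shows whether it is a manylinux2010 build.
pip download tf-nightly --no-deps -d /tmp/tf-wheel
ls /tmp/tf-wheel    # expect a ...-manylinux2010_x86_64.whl filename
```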

@mihaimaruseac commented Dec 23, 2019

If Ubuntu 16.04 fixes this, it could be that the change that exposed the heisenbug behind the segfaults is the commit cherry-picked by tensorflow/tensorflow#34764. Unfortunately, we cannot roll that back, as doing so would break all manylinux2010 builds.

So I think it's better to close the TF issue if it's caused by the old toolchain.

@caisq (Contributor, Author) commented Dec 23, 2019

@mihaimaruseac @wchargin Thanks for the info. Given TensorFlow's documented system requirements, I think it's better to run TensorBoard's CI on Xenial. So I'm merging this PR now; there is another PR (#3051) whose CI won't pass without it.

@nickfelt If you have further comments, please let me know and I'll be happy to follow up.
