Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make timeout configurable from e2e tests #1787

Merged
merged 1 commit into from
Mar 24, 2023

Conversation

nagar-ajay
Copy link
Contributor

What this PR does / why we need it:

  • Due to the resources (CPU) crunch in the GitHub actions VM, we have limited CPUs in e2e tests. As a result of this, the execution time of jobs has increased.
  • Currently PyTorch test job is taking around 10 mins to complete. Because of this PyTorch test job fails sometimes. We have a timeout period of 10 mins in the wait_for_job_conditions functions.
  • One approach is if we somehow reduce the running time of the PyTorch job. But for this, we will need to build a new image and push it on docker. This can create issues in CI because of the docker pull rate limit.
  • In this PR, I have made timeout configurable from the e2e test and increased the timeout to 15 mins for PyTorch job.

Checklist:

  • Docs included if any changes are user facing

@coveralls
Copy link

coveralls commented Mar 24, 2023

Pull Request Test Coverage Report for Build 4511331735

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+27.2%) to 39.541%

Files with Coverage Reduction New Missed Lines %
pkg/controller.v1/mpi/mpijob_controller.go 2 77.16%
Totals Coverage Status
Change from base Build 4480672508: 27.2%
Covered Lines: 2739
Relevant Lines: 6927

💛 - Coveralls

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One approach is if we somehow reduce the running time of the PyTorch job. But for this, we will need to build a new image and push it on docker. This can create issues in CI because of the docker pull rate limit.

We can avoid the docker pull rate limit.

For publicly accessible containers we are working with docker hub to make sure you will not be impacted by the new rate limits.

actions/runner-images#1445 (comment)

However, it is enough to increase the timeout since we already use a quite lightweight test, using mnist.

/lgtm
/assign @johnugeorge

@johnugeorge
Copy link
Member

Thanks @tenzen-y for the pointers. For now, we can work with higher timeouts in case tests run longer.

/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge, nagar-ajay

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants