Investigate/fix test_cancel_running_job #1019

Closed
kt474 opened this issue Aug 15, 2023 · 4 comments
@kt474
Member

kt474 commented Aug 15, 2023

This test has been failing for a while now - https://github.com/Qiskit/qiskit-ibm-runtime/actions/runs/5863436233/job/15896900505

It appears that when attempting to cancel a running job, calling job.cancel() can return a 204 and raise no errors even though the job has actually completed.

The snippet below shows that the status of job is CANCELLED, but retrieving the job again and checking its status shows DONE, which is the correct status.

I'm guessing this is some sort of race condition where the job runs too quickly. If that is the case, it should be handled appropriately on the client side.

def test_cancel_job_running(self, service):
    """Test canceling a running job."""
    job = self._run_program(service, iterations=5)
    if not cancel_job_safe(job, self.log):
        return
    time.sleep(10)  # Wait a bit for DB to update.
    rjob = service.job(job.job_id())
    print("job==rjob is " + str(job == rjob))
    print("job.job_id() = " + str(job.job_id()))
    print("rjob.job_id() = " + str(rjob.job_id()))
    print("job.status() = " + str(job.status()))
    print("rjob.status() = " + str(rjob.status()))
    self.assertEqual(job.status(), JobStatus.CANCELLED)

kt474 added the bug label Aug 15, 2023
@kt474
Member Author

kt474 commented Aug 16, 2023

For more context: in the cancel method, we manually set the job status to CANCELLED on line 252, even though in some cases the job has actually completed.

def cancel(self) -> None:
    """Cancel the job.

    Raises:
        RuntimeInvalidStateError: If the job is in a state that cannot be cancelled.
        IBMRuntimeError: If unable to cancel job.
    """
    try:
        self._api_client.job_cancel(self.job_id())
    except RequestsApiError as ex:
        if ex.status_code == 409:
            raise RuntimeInvalidStateError(f"Job cannot be cancelled: {ex}") from None
        raise IBMRuntimeError(f"Failed to cancel job: {ex}") from None
    self.cancel_result_streaming()
    self._status = JobStatus.CANCELLED
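
To make the effect concrete, here is a minimal, self-contained sketch of how a locally set status can diverge from the server's. FakeServer and FakeJob are hypothetical stand-ins, not qiskit-ibm-runtime classes. Two assumptions are baked in: the cancel endpoint succeeds silently on a job that has already finished (the 204 described above), and status() stops querying the server once the cached status is final.

```python
from enum import Enum


class JobStatus(Enum):
    RUNNING = "RUNNING"
    CANCELLED = "CANCELLED"
    DONE = "DONE"


FINAL_STATES = {JobStatus.CANCELLED, JobStatus.DONE}


class FakeServer:
    """Hypothetical server holding the authoritative status of each job."""

    def __init__(self):
        self.statuses = {}

    def job_cancel(self, job_id):
        # Assumption: cancelling an already finished job succeeds silently
        # and leaves the server-side status unchanged.
        if self.statuses[job_id] is JobStatus.RUNNING:
            self.statuses[job_id] = JobStatus.CANCELLED

    def job_status(self, job_id):
        return self.statuses[job_id]


class FakeJob:
    """Hypothetical client-side job object with a cached status."""

    def __init__(self, server, job_id):
        self._server = server
        self._job_id = job_id
        self._status = JobStatus.RUNNING

    def cancel(self):
        self._server.job_cancel(self._job_id)
        # Mirrors the quoted cancel(): the status is set locally without
        # asking the server what actually happened.
        self._status = JobStatus.CANCELLED

    def status(self):
        # Assumption: a final cached status is never refreshed.
        if self._status not in FINAL_STATES:
            self._status = self._server.job_status(self._job_id)
        return self._status


server = FakeServer()
server.statuses["job-1"] = JobStatus.RUNNING
job = FakeJob(server, "job-1")

server.statuses["job-1"] = JobStatus.DONE  # job finishes before the cancel arrives
job.cancel()

rjob = FakeJob(server, "job-1")  # freshly retrieved handle to the same job
print("job.status()  =", job.status().name)
print("rjob.status() =", rjob.status().name)
```

Under these assumptions the original handle reports CANCELLED while the freshly retrieved one reports DONE, which matches the mismatch seen in the test output.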

kt474 added this to the 0.12.0 milestone Aug 18, 2023
@robotAstray
Contributor

robotAstray commented Aug 18, 2023

@kt474 I believe there might be an issue both in the cancel() method (a race condition) and in test_cancel_job_running.
It appears that the job status is not updated correctly, which could be causing the problem: the job completes before the cancellation request goes through.

I think the following might resolve the problem:

1. Modify the test function: check the server-side status after calling cancel() and wait for the state to be updated before making any assertions in test_cancel_job_running.
2. Modify the cancel() method: make sure cancel() waits until the server-side status is updated before marking the job as CANCELLED on the client side.
Let me know what you think and perhaps I can try to implement this.
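
For illustration only, a rough sketch of what the second suggestion could look like: send the cancel request, then poll until the server reports a final state and use that state instead of assuming CANCELLED. The function cancel_and_confirm, its parameters, and the state names are hypothetical and not part of qiskit-ibm-runtime; the two callables stand in for whatever the client uses to cancel a job and fetch its status.

```python
import time

# Illustrative state names; the real client may define more.
FINAL_STATES = {"CANCELLED", "DONE", "ERROR"}


def cancel_and_confirm(request_cancel, fetch_status, timeout=30.0, interval=1.0):
    """Send a cancel request, then return the final status the server reports."""
    request_cancel()
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in FINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout}s")
        time.sleep(interval)


# Stand-in server state and cancel call, used only for this demonstration.
server = {"status": "DONE"}  # the job finished before the cancel arrived


def fake_cancel():
    # Assumption: cancelling a finished job is a silent no-op on the server.
    if server["status"] == "RUNNING":
        server["status"] = "CANCELLED"


def fake_status():
    return server["status"]


final_when_done = cancel_and_confirm(fake_cancel, fake_status, interval=0.0)

server["status"] = "RUNNING"  # now the job is still running when cancelled
final_when_running = cancel_and_confirm(fake_cancel, fake_status, interval=0.0)

print(final_when_done, final_when_running)
```

With this approach the client-side status would be DONE in the racy case and CANCELLED only when the server confirms it, at the cost of cancel() blocking until the server settles or the timeout expires.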

@merav-aharoni
Contributor

I think the test can be fixed quite simply by replacing the lines:

if not cancel_job_safe(job, self.log):
    return
time.sleep(10)  # Wait a bit for DB to update.
rjob = service.job(job.job_id())

with the lines:

rjob = service.job(job.job_id())
if not cancel_job_safe(rjob, self.log):
    return
time.sleep(10)

However, I think there is a deeper issue here: why job and rjob are not identical. That has nothing to do with the cancel itself.

kt474 modified the milestones: 0.12.0, 0.12.1 Aug 29, 2023
kt474 removed this from the 0.12.1 milestone Sep 12, 2023
@kt474
Member Author

kt474 commented Sep 20, 2023

Closing this for now, will reopen if it becomes an issue again

kt474 closed this as completed Sep 20, 2023

3 participants