Improve the stability and reliability of automatic grading #988

Closed
markkuriekkinen opened this issue Feb 16, 2022 · 0 comments · Fixed by #1035
Assignees
Labels
area: admin Issues related to server administration and service upkeep area: grading interface Stuff between A+ and grading tools area: performance Related to the performance of the system area: points/grade Includes points and grader per user and for all area: UX student User experience and usability for students area: UX teacher User experience and usability for teachers effort: weeks Estimated to take less than one month, from the creation of a new branch to the merging experience: moderate required knowledge estimate requester: CS The issue is raised internally by a CS teacher requires: discussion Requires discussion before it is possible to proceed to implementation service: mooc-grader This issue concerns about a service MOOC-Grader type: bug This is a bug
Milestone

Comments

@markkuriekkinen
Contributor

From #470:
The grader may fail to send the grading results back to A+ because, for example,

  • its disk becomes full during grading
  • there is a network error when the grader tries to send HTTP POST to A+ frontend
  • the grading container fails to send the results to the MOOC-grader
  • MOOC-grader fails to receive the grading results from the container (the container sends an HTTP POST request to the MOOC-grader at the end of the grading)
  • In addition, if the grader doesn't resolve or its HTTP queue is full, the submission is left in the initialized state.
    • To elaborate, when A+ receives the submission, the submission is at first in the initialized state. If A+ fails to send the submission to the MOOC-grader (i.e., MOOC-grader does not respond or responds with an error), then the submission in A+ is stuck in the initialized state.
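
As a rough illustration of the last point, here is a minimal sketch of how a failed send leaves a submission in the initialized state. The `Status` names and the `submit` function are hypothetical and are not taken from the A+ code base; the only assumption is that the grader call either raises a network error, returns a rejection, or accepts the job.

```python
from enum import Enum
from typing import Callable


class Status(Enum):
    # Hypothetical state names, chosen to mirror the wording in this issue.
    INITIALIZED = "initialized"  # received by A+, not yet accepted by the grader
    WAITING = "waiting"          # accepted by the grader, result not yet returned
    READY = "ready"              # grading result received


def submit(send_to_grader: Callable[[], bool]) -> Status:
    """Return the state a submission ends up in after one send attempt."""
    status = Status.INITIALIZED
    try:
        accepted = send_to_grader()
    except OSError:
        # Network error, unresolvable host, refused connection:
        # nothing moves the submission forward, so it stays initialized.
        return status
    if accepted:
        status = Status.WAITING
    # An error response (accepted is False) also leaves it initialized.
    return status


def unreachable_grader() -> bool:
    """Stand-in for a grader that cannot be reached."""
    raise ConnectionError("grader does not respond")
```

Without a later retry, nothing in this flow ever revisits a submission that stayed in `Status.INITIALIZED`.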

Submissions shouldn't get stuck so easily in the grading pipeline.

Related to #793.

#470 (comment)

Indeed, I was paying more attention to the proposed solution than to the problem description when closing the issue. I'm not sure how to make the recovery fully automatic (assuming that we can never fully get rid of network/server failures). Simple timer-based retries might just congest the already failing system unnecessarily -- and what should the length of the timeout be in that case, anyway? #793 might be good enough if failures are not very common (e.g., an alert on the course front page in the teacher's view, with some sort of "retry all" button).

#470 (comment)

Yeah, it is not a trivial question. However, from a teacher's perspective, the student has submitted and the system loses it. It makes the platform look fragile and unreliable. Retrying the grading wouldn't always have to be immediate if there is a risk of overloading the system. One simple idea is that the system could look for lost/stuck submissions every hour and use the new mass regrade feature on them.
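
A minimal sketch of the hourly check, assuming a submission record carries a status string and a submission time. The `Submission` dataclass, the status names, and the one-hour threshold are all hypothetical, and the actual regrade call is left out:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class Submission:
    id: int
    status: str            # assumed values: "initialized", "waiting", "ready"
    submitted_at: datetime


def find_stuck(
    submissions: Iterable[Submission],
    now: datetime,
    max_age: timedelta = timedelta(hours=1),
) -> List[Submission]:
    """Return unfinished submissions that are older than max_age."""
    unfinished = {"initialized", "waiting"}
    return [
        s for s in submissions
        if s.status in unfinished and now - s.submitted_at > max_age
    ]
```

A periodic job could pass the result of `find_stuck` to the mass regrade feature once per hour.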

#470 (comment)

Yes. There could also be some sort of intelligence observing whether the grading initiated by the mass regrade actually completes for those submissions (and if not, maybe cancel the mass regrade and try again after another hour). Which reminds me of a couple of additional features that could be added:

  • Cancel-button for an ongoing mass regrade
  • I guess submitters would appreciate email notification that the earlier postponed grading has completed

#470 (comment)

I'm not sure if I documented it anywhere, but I think I envisioned solving this in the following way:

  • If an assessment request fails for an exercise, the exercise is put into maintenance mode or error mode. While in this mode, no assessments are attempted.
  • A background task polls the backend of the exercise, and if the backend comes back online, the mode of the exercise is restored so that background assessment requests can continue (assuming workers query the DB for submissions waiting to be assessed).
  • Assuming a single backend/domain handles multiple exercises, there could be one more level of maintenance mode, where a backend service is put into maintenance mode, which implies that all exercises provided by it are in maintenance mode. Polling is then done at the domain level rather than at the exercise level.
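
The domain-level variant of this idea could look roughly like the sketch below. `BackendMonitor` and its methods are hypothetical names, and the health check is passed in as a plain callable so that no particular HTTP client is assumed:

```python
from typing import Callable, Set


class BackendMonitor:
    """Tracks which backend domains are in maintenance mode."""

    def __init__(self, is_online: Callable[[str], bool]) -> None:
        self._is_online = is_online          # health check supplied by the caller
        self._in_maintenance: Set[str] = set()

    def report_failure(self, domain: str) -> None:
        # A failed assessment request puts the whole backend into maintenance mode.
        self._in_maintenance.add(domain)

    def can_assess(self, domain: str) -> bool:
        # No assessments are attempted while the backend is in maintenance mode.
        return domain not in self._in_maintenance

    def poll(self) -> None:
        # Meant to be run by a background task; restores backends that respond again.
        for domain in list(self._in_maintenance):
            if self._is_online(domain):
                self._in_maintenance.discard(domain)
```

Workers would check `can_assess` before sending a queued submission, so a backend that is down receives only the periodic health check instead of a stream of failing requests.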

That said, if the number of error emails is not a problem, then running that mass regrade hourly might be an acceptable solution for now.

@markkuriekkinen markkuriekkinen added area: admin Issues related to server administration and service upkeep area: grading interface Stuff between A+ and grading tools area: performance Related to the performance of the system area: points/grade Includes points and grader per user and for all area: UX student User experience and usability for students area: UX teacher User experience and usability for teachers requires: discussion Requires discussion before it is possible to proceed to implementation effort: weeks Estimated to take less than one month, from the creation of a new branch to the merging experience: moderate required knowledge estimate requester: CS The issue is raised internally by a CS teacher service: mooc-grader This issue concerns about a service MOOC-Grader requires: priority Currently using this label to flag issues that need EDIT decision ASAP (even if there was priority) type: bug This is a bug labels Feb 16, 2022
@markkuriekkinen markkuriekkinen added this to the v1.15 milestone Apr 7, 2022
@annirytkonen annirytkonen removed the requires: priority Currently using this label to flag issues that need EDIT decision ASAP (even if there was priority) label Apr 7, 2022
@markkuriekkinen markkuriekkinen moved this to Todo in A+ sprints Apr 7, 2022
@PasiSa PasiSa self-assigned this Apr 25, 2022
@PasiSa PasiSa moved this from Todo to In Progress in A+ sprints Apr 26, 2022
PasiSa added a commit to PasiSa/a-plus that referenced this issue May 9, 2022
Sometimes it happens that a grading job never completes because of
some technical issue, and it remains in "In Grading" state until
user submits it for regrading. This commit implements automatic
regrading of grading jobs that have not completed within configured
timeout. There is also a simple heuristic to limit the
(likely unsuccessful) automatic retries when it seems that the
grader is more persistently unavailable.

Closes apluslms#988.
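
The commit message does not spell out the heuristic, so the following is only one possible shape of it and not the code from the commit. It assumes a hypothetical `Pending` record per unfinished grading job, a configured timeout, a per-job retry cap, and a batch threshold above which the grader is treated as persistently unavailable:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Pending:
    submission_id: int
    sent_at: datetime      # when the job was last sent to the grader
    retries: int = 0       # automatic retries made so far


def select_for_retry(
    pending: List[Pending],
    now: datetime,
    timeout: timedelta = timedelta(minutes=5),   # assumed default
    max_retries: int = 3,                        # assumed default
    max_batch: int = 10,                         # assumed default
) -> List[Pending]:
    """Pick the timed-out jobs that should be sent for regrading now."""
    timed_out = [
        p for p in pending
        if now - p.sent_at > timeout and p.retries < max_retries
    ]
    if len(timed_out) > max_batch:
        # Many simultaneous timeouts suggest the grader itself is down,
        # so probe with a single job instead of resending everything.
        return timed_out[:1]
    return timed_out
```

Probing with a single job keeps the load low while the grader is unavailable, and the full batch is picked up again once the number of timed-out jobs drops below the threshold.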
PasiSa added a commit to PasiSa/a-plus that referenced this issue May 10, 2022
@PasiSa PasiSa moved this from In Progress to Under review in A+ sprints May 10, 2022
@markkuriekkinen markkuriekkinen modified the milestones: v1.15, v1.16 Jun 10, 2022
PasiSa added a commit to PasiSa/a-plus that referenced this issue Jul 24, 2022
PasiSa added a commit to PasiSa/a-plus that referenced this issue Aug 12, 2022
PasiSa added a commit to PasiSa/a-plus that referenced this issue Aug 17, 2022
This will be used for automatic regrading (issue apluslms#988) that will be
finished in a separate commit.
markkuriekkinen pushed a commit that referenced this issue Aug 17, 2022
This will be used for automatic regrading (issue #988) that will be
finished in a separate commit.
PasiSa added a commit to PasiSa/a-plus that referenced this issue Aug 17, 2022
PasiSa added a commit to PasiSa/a-plus that referenced this issue Aug 20, 2022
PasiSa added a commit to PasiSa/a-plus that referenced this issue Sep 14, 2022
Sometimes it happens that a grading job never completes because of
some technical issue, and it remains in "In Grading" state until
user submits it for regrading. This commit implements automatic
regrading of grading jobs that have not completed within configured
timeout. There is also a simple heuristic to limit the
(likely unsuccessful) automatic retries when it seems that the
grader is more persistently unavailable. This commit uses the
PendingSubmission model that was added in commit 62150ce.

Closes apluslms#988.
markkuriekkinen pushed a commit that referenced this issue Sep 14, 2022
Sometimes it happens that a grading job never completes because of
some technical issue, and it remains in "In Grading" state until
user submits it for regrading. This commit implements automatic
regrading of grading jobs that have not completed within configured
timeout. There is also a simple heuristic to limit the
(likely unsuccessful) automatic retries when it seems that the
grader is more persistently unavailable. This commit uses the
PendingSubmission model that was added in commit 62150ce.

Closes #988.
Repository owner moved this from Planned for April release to Done in A+ Backlog (Sprint 7/22 onward) Sep 14, 2022
Repository owner moved this from Under review to Done in A+ sprints Sep 14, 2022
@markkuriekkinen markkuriekkinen modified the milestones: v1.16, v1.17 Nov 25, 2022
Projects
Status: Done
3 participants