fix: use non_atomic_requests decorator in handle_block view #34020

Merged

mariajgrimaldi
Member

@mariajgrimaldi mariajgrimaldi commented Jan 8, 2024

Description

This PR disables atomic requests for the handle_block view so that database writes made during the request commit immediately instead of waiting for the request-level transaction to end. This is necessary so that async tasks launched by the current process can see the changes made during the request. One example is the tasks launched when a course is published: they start before the request ends and end up reading an outdated database state.
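
To make the visibility problem concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the real database. It is not edx-platform or Django code: the `course_index` table, the `version_seen_by_task` helper, and the two connections ("request" and "task") are hypothetical names chosen for illustration. The autocommit connection plays the role of a view without a request-level transaction.

```python
import os
import sqlite3
import tempfile


def version_seen_by_task(autocommit):
    """Return the version a second connection reads right after a write."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "demo.sqlite3")
        # isolation_level=None puts the sqlite3 module in autocommit mode.
        # "DEFERRED" makes it open an implicit transaction before a write,
        # which stands in for a request-level transaction here.
        request_conn = sqlite3.connect(
            path, isolation_level=None if autocommit else "DEFERRED"
        )
        task_conn = sqlite3.connect(path)
        try:
            request_conn.execute(
                "CREATE TABLE course_index (course TEXT PRIMARY KEY, version TEXT)"
            )
            request_conn.execute("INSERT INTO course_index VALUES ('demo', 'v1')")
            request_conn.commit()

            # The "request" publishes a new version ...
            request_conn.execute(
                "UPDATE course_index SET version = 'v2' WHERE course = 'demo'"
            )
            # ... and the "task" reads before the request has finished.
            rows = task_conn.execute(
                "SELECT version FROM course_index WHERE course = 'demo'"
            ).fetchall()

            # End of the "request": only now does a pending transaction commit.
            request_conn.commit()
            return rows[0][0]
        finally:
            request_conn.close()
            task_conn.close()


stale = version_seen_by_task(autocommit=False)
fresh = version_seen_by_task(autocommit=True)
```

With the implicit transaction, the reader should still get the old `v1` value; in autocommit mode the write is visible as soon as it is made, which is the behavior this PR relies on.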

Our first approach was to add a countdown to the task that pushes the new course structure into the learning sequences, but after the discussion in this PR's thread, we opted for this one.

More context:

We found the out-of-date read behavior after adding the hide_from_toc field to the XblockSerializer (among other changes) in this PR. We did this to use the current course section visibility settings to configure the OLX-only field Hide from TOC; for more context on why we're doing this, please refer to the PR cover letter.

So, after adding this new field, we expected that saving the new visibility setting for the course section would update the course learning sequences (the courses' representation in the LMS). Instead, this was happening:

After deploying these changes to a tutor nightly remote installation, we found that the changes weren't reflected in the Course Outline after saving the new visibility settings for the first time. The second time we saved, the changes did take effect. We don't have an explanation for it yet, but we're actively working on it. Here's a video of the behavior:

Screencast.from.19-12-23.12.08.56.mp4

After publishing the new course structure, the course_published signal is sent. This triggers a few receivers, among them the listen_for_course_publish receiver, which pushes the course outline to learning sequences asynchronously. This part of the implementation is crucial for keeping the course outline in the LMS up to date, which was exactly what was failing. The update itself is performed by the update_outline_from_modulestore_task task, which reads from the modulestore and then updates the learning sequences accordingly.

Having this in mind, we did some digging to find:

  1. The error disappeared by setting: CELERY_ALWAYS_EAGER = True. This confirmed that there's something wrong when using asynchronous processing.
  2. The course section visibility was being updated in the published branch (the one used by the LMS) -- we specifically checked the value of the new field, so we ruled out any issues saving the blocks with the Hide From TOC changes.

Our hypothesis was that if Mongo was up to date after the first save with the Hide from TOC changes but the LMS didn't reflect them, then the issue was an out-of-date read. To prove this, we saved the Hide from TOC changes once, waited some time, and then read again from the course modulestore. For this, we ran ./manage.py lms simulate_publish --courses <COURSE ID>, which simulates sending the course publish signal, triggering the listen_for_course_publish receiver and, therefore, update_outline_from_modulestore_task. And it worked! The course outline in the LMS had the TOC changes.

So the out-of-date reads were caused by pushing the course outline to the learning sequences immediately, which is why we added the countdown.

Supporting information

PR where we found the issue: #33952
More info on the modulestore use in the LMS:

Writing to our App Models
``cms.djangoapps.contentstore.outlines.update_outline_from_modulestore``
The ``update_outline_from_modulestore`` is a short function that calls ``get_outline_from_modulestore`` to create a representation of the data that the ``learning_sequences`` app understands (``CourseOutlineData``), and then pushes that data into ``learning_sequences`` via an API method that ``learning_sequences`` exposes (``replace_course_outline``).
This function also sets custom attributes so that we can better monitor for performance issues and errors.
Note: One of the things we write is the *version* of the course. This is going to be important for diagnosing what's going on if these writes ever start failing. We get this information from the ``course_version`` attribute on the root ``CourseBlock``, and convert that to a string for convenient storage (it's a BSON object).
Celery Task
``cms.djangoapps.contentstore.outlines.tasks.update_outline_from_modulestore_task``
This is a simple celery ``@shared_task`` that wraps the call to ``update_outline_from_modulestore``. It's critical to use celery to do this work asynchronously. Even if your code seems to work quickly enough to run in-process, courses can often use obscure features that can drastically increase the time it takes to get data out, and you will almost certainly not be able to comprehensively test for all those situations.
*You must be aggressive about alerting on task failures*. Publishes are infrequent enough to make it so that certain content-dependent errors will not trigger most error rate alerts. You have to be extremely sensitive to outright failures in your task because you may be blocking the publish for a course.
Signal Handler
``cms.djangoapps.contentstore.outlines.signals.handlers.listen_for_course_publish``
This is a centralized location where Studio does its post-publish data pushes, but you can also make a separate handler that listens for the same ``course_published`` signal. Its main task is to do some logging and queue the celery task.

Testing instructions

  1. Move to this branch
  2. Try configuring visibility settings for a section/subsection (or make any other update); for example, you can use hide from learners.
  3. Publish the course
  4. After a few seconds, check the LMS. The changes must be reflected in the course outline.

Deadline

Needs to be merged before: #33952

@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Jan 8, 2024
@openedx-webhooks

openedx-webhooks commented Jan 8, 2024

Thanks for the pull request, @mariajgrimaldi! Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket as you can:

  • supporting documentation
  • Open edX discussion forum threads
  • timeline information ("this must be merged by XX date", and why that is)
  • partner information ("this is a course on edx.org")
  • any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

@mariajgrimaldi mariajgrimaldi changed the title from "refactor: delay updating outline from modulestore after publishing course" to "refactor!: delay updating outline from modulestore after publishing course" Jan 9, 2024
@mariajgrimaldi mariajgrimaldi marked this pull request as ready for review January 9, 2024 16:51
@mariajgrimaldi mariajgrimaldi force-pushed the MJG/update-outline-after-countdown branch from 68ec410 to 4114e36 Compare January 9, 2024 16:52
@mariajgrimaldi
Member Author

Hi there @ormsbee, I'm tagging you here since you authored these changes. Please let me know if we should tag other folks here. Thanks!

@ormsbee
Contributor

ormsbee commented Jan 16, 2024

Can this be fixed by removing the implicit view-level transaction on whatever view is triggering the publish? I'm guessing that it's being caused because the async process doesn't see the updated split modulestore index that's stored in the Django ORM active versions table now, because the task launches before that transaction commits.

The reason I ask is that "delay by X seconds" inevitably causes operational race conditions at some point (e.g. when a piece of middleware starts hanging because it's contacting some service that has failed). If we can make sure the change is committed and visible before the celery queuing task executes, that would be ideal.
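
The race described above can be put into a tiny timing model. This is only an illustrative sketch: the `task_sees_update` helper, the scenario names, and the durations are made up, and real Celery scheduling is less precise than this arithmetic suggests.

```python
def task_sees_update(countdown_s: float, commit_delay_s: float) -> bool:
    """Return True if a task delayed by countdown_s starts after the commit.

    Both durations are measured from the moment the task is queued.
    """
    return countdown_s > commit_delay_s


# Hypothetical time between queuing the task and the request committing.
scenarios = {
    "healthy request": 0.2,   # commit lands well within the countdown
    "hung middleware": 12.0,  # commit lands long after the countdown expires
}

COUNTDOWN_S = 3.0  # made-up fixed delay for the task
results = {
    name: task_sees_update(COUNTDOWN_S, commit_delay)
    for name, commit_delay in scenarios.items()
}
```

Any fixed countdown only works while the request finishes faster than the delay, which is why making the write visible before the task is queued is the more robust option.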

@mariajgrimaldi mariajgrimaldi force-pushed the MJG/update-outline-after-countdown branch 3 times, most recently from 7a68265 to 5b64322 Compare February 6, 2024 15:31
@mariajgrimaldi
Member Author

Thank you @ormsbee for the suggestion! I tested it out here, and it seems to work fine when publishing new configurations like hide from TOC and existing ones like hide from learners, as well as any other change. I also created, deleted, duplicated, and edited blocks just to watch the views' behavior. I haven't tested more than that, e.g. generating exceptions, so I can't tell what would happen with the DB consistency if the view fails mid-execution. But it seems like all the write operations are made in a block (_delete_item, modify_xblock, duplicate_block, _create_block, _move_item), and the rest seem to be read-only operations.
I'll also be deploying these changes to a nightly remote installation to test it out there. Thanks!

@ormsbee
Contributor

ormsbee commented Feb 6, 2024

> I haven't tested more than that, e.g. generating exceptions, so I can't tell what would happen with the DB consistency if the view fails mid-execution.

The way Modulestore works now is that it will create definition documents and new structure documents that point to them. But nothing actually changes until the pointer to the structure document changes in the SplitModulestoreCourseIndex model, which is done from update_course_index. There is a history table attached to that model that should be updated in the same transaction (I don't remember the details of whether that's done in its own atomic block or not).

As far as the rest of the Modulestore is concerned, if something dies part-way through the process of altering a course, it will leave some unreferenced junk in MongoDB, but it shouldn't corrupt the content. That was an explicit goal of the SplitMongo Modulestore. MongoDB didn't have transactions back then, and a recurring issue with Old Mongo was that errors during the import process would lead to a corrupted, half-published state. Split changed things by creating all the pieces and then updating the pointer at the end so that imports were atomic.
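
The "create all the pieces, then update the pointer" idea can be shown with a toy in-memory model. This is only a sketch of the pattern, not the Split Modulestore implementation: `ToyStore`, its attributes, and the `fail_before_swap` flag are hypothetical.

```python
class ToyStore:
    """Toy model: immutable structure documents plus one active pointer."""

    def __init__(self):
        self.structures = {"s1": {"sections": ["Intro"]}}
        self.active_version = "s1"  # the pointer readers follow

    def publish(self, new_id, new_structure, fail_before_swap=False):
        # Step 1: write the new structure next to the old one.
        self.structures[new_id] = new_structure
        if fail_before_swap:
            raise RuntimeError("simulated crash before the pointer update")
        # Step 2: a single pointer update makes the new structure live.
        self.active_version = new_id

    def read_active(self):
        return self.structures[self.active_version]


store = ToyStore()
try:
    store.publish("s2", {"sections": ["Intro", "Week 1"]}, fail_before_swap=True)
except RuntimeError:
    pass

# The failed publish left an unreferenced "s2" document behind,
# but readers still follow the pointer to the intact "s1" structure.
after_crash = store.read_active()

store.publish("s3", {"sections": ["Intro", "Week 1"]})
after_publish = store.read_active()
```

A crash between the two steps leaves junk behind but never a half-published course, because readers only ever follow the pointer.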

@mariajgrimaldi mariajgrimaldi changed the title from "refactor!: delay updating outline from modulestore after publishing course" to "refactor!: use non_atomic_requests decorator in handle_block view" Feb 6, 2024
@mariajgrimaldi
Member Author

mariajgrimaldi commented Feb 6, 2024

Thanks @ormsbee; I see why this is safer than my previous approach. Is it okay if I tag you as a reviewer? Let me know, we could also involve someone else. Thank you!

@ormsbee
Contributor

ormsbee commented Feb 6, 2024

Sure. Thank you.

Contributor

@ormsbee ormsbee left a comment

Just a request for more comments, so nobody casually disables it later.

cms/djangoapps/contentstore/views/block.py
Contributor

@ormsbee ormsbee left a comment

Looks good, please just squash and merge.

I think this qualifies as a fix: since the old behavior was wrong.

@mariajgrimaldi mariajgrimaldi changed the title from "refactor!: use non_atomic_requests decorator in handle_block view" to "fix: use non_atomic_requests decorator in handle_block view" Feb 16, 2024
@mariajgrimaldi mariajgrimaldi force-pushed the MJG/update-outline-after-countdown branch 2 times, most recently from b3eb6ad to 13f7a9b Compare February 16, 2024 12:31
@itsjeyd itsjeyd added product review PR requires product review before merging core contributor PR author is a Core Contributor (who may or may not have write access to this repo). labels Feb 16, 2024
@mariajgrimaldi
Member Author

mariajgrimaldi commented Feb 20, 2024

Hi there @itsjeyd: although this fixes an error raised while implementing the feature enhancement proposal: Hide sections from course outline, this is not strictly related to the implementation. It's a fix for the platform found as a result of it, so I'll remove the product review label since this is backend-facing only, which will not change how the platform currently behaves.

@mariajgrimaldi mariajgrimaldi removed the product review PR requires product review before merging label Feb 20, 2024
@mariajgrimaldi mariajgrimaldi force-pushed the MJG/update-outline-after-countdown branch from 13f7a9b to 415703b Compare February 20, 2024 13:10
@mariajgrimaldi mariajgrimaldi merged commit f3dab82 into openedx:master Feb 21, 2024
46 checks passed
@openedx-webhooks

@mariajgrimaldi 🎉 Your pull request was merged! Please take a moment to answer a two question survey so we can improve your experience in the future.

@edx-pipeline-bot
Contributor

2U Release Notice: This PR has been deployed to the edX staging environment in preparation for a release to production.

@edx-pipeline-bot
Contributor

2U Release Notice: This PR has been deployed to the edX production environment.

@itsjeyd
Contributor

itsjeyd commented Feb 23, 2024

Sounds good @mariajgrimaldi, thanks for the details 👍

Labels
core contributor PR author is a Core Contributor (who may or may not have write access to this repo). open-source-contribution PR author is not from Axim or 2U