Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: ADR for an edx-platform representation of Catalog Courses. #34156

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
Course Models
#############
Comment on lines +1 to +2
Copy link
Member

@kdmccormick kdmccormick Jul 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ormsbee @feanil @robrap I've been mulling this ADR over for a while and I find myself wondering about the base assumption that edx-platform should have a Catalog Course model or that the LMS should care about Catalog Courses at all.

Hypothetically, if all LMS knew about were Course Runs, and we forbade it from ever looking how different runs could be grouped into Courses and Programs, what features would we lose? Would they be Core Product features (like credentialing and roles) or would they be things we've been discussing jettisoning out of the Core anyway (like masters degrees and upsell messaging)?

I know we want to enable seamless integration between Open edX and student-information/ecommerce/enterprise/marketing systems. But does having CatalogCourse and Program models in LMS help that, or might it actually make it more complex, since those systems might disagree about what a CatalogCourse means? Could it be better to let the external systems decide what "Course" and "Program" mean to them, and then use APIs to push metadata/enrollments/roles/whatever into edx-platform Course Runs as they need?

Getting more concrete: I'm not against having a CatalogCourse model, but before approving this ADR, I'd like to see a list of specific LMS/CMS features that would actually use such a model, and a bit in the Rejected Alternatives on why it's not sufficient for LMS/CMS to just think in Course Runs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick note to say I like your line of thought and questioning. The rest of this is very high-level thoughts.

I feel like the transition from edX to 2U has aided my transition from a worldview of the LMS being the center of the solar system, to its role as a planet orbiting with other planets and making up a different solar system for each operating organization.

At 2U, our catalog is much larger than the set of LMS course runs. What does that mean at the level of an LMS, and not necessarily the LMS, and how does this relate to this ADR? I know this may be more of a 2U issue, but it might inform how we think about the different pieces.

That said, if the Open edX platform always tries to provide an all-inclusive experience with as few parts as possible to make it simple for operators, we may always end up with very fuzzy boundaries. Or, if someone is able to dig into Kyle's request for specifics, maybe we can get to a cleaner solution that works for all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @pdpinch hit the nail on the head with:

I think this could be really valuable, because it materializes something that users assume to be true.

Course runs are already connected to each other in a bunch of ways. The Studio course listing page is eventually going to have the runs grouped by course. They often re-use 90+% of the same content, which is part of why I'd like to eventually see multiple runs of the same course stored in the same LearningPackage. That relationship is real, even if it's implicit today because it's only really held together by the CourseKey.

I think it's also valuable to be able to group Courses together into larger collections. There, we have use cases both for role/permissions (e.g. "this staff has broad admin or analytics reporting access to this set of courses"), as well as for catalog customization.

The global source of truth for this data doesn't need to be in edx-platform. For a large company, that source of truth might be Salesforce or any number of other things that eventually push to these primitives. But I think it's still valuable to have the base primitives defining these relationships in the platform.

I feel like the balancing act will be seeing just how thin we can make these core concepts, and how disciplined we can be about keeping other kinds of data separated (e.g. pricing).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To echo what Dave said, I think we have had some implicit version of this for a long time. We're gonna start using it more in the presentation. Right now this concept lives in course-discovery but that comes with a lot of baggage that we don't want. Being able to house a core concept in edx-platform and then push a lot of the rest into things that either plugin or otherwise interface with the core bits here seems like it would make the platform more generically usable.

Another thought, would we be better served by having a concept of a CourseRun that is independent of course content? Like have a course run model that we can populate without having to create any course content. Then anyone could make plugins that allow end-users to group the runs however they want with whatever metadata they want for advertising/marketing/selling. Can we achieve this by inverting the relationship between courseoverviews and modulestore?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought, would we be better served by having a concept of a CourseRun that is independent of course content? Like have a course run model that we can populate without having to create any course content. Then anyone could make plugins that allow end-users to group the runs however they want with whatever metadata they want for advertising/marketing/selling. Can we achieve this by inverting the relationship between courseoverviews and modulestore?

I don't think CourseOveviews is necessarily the right mechanism to do this, but I do broadly support the idea that provisioning the course/run and roles associated with them is upstream of content creation. I sketched out a primitive proposal for that here a while back.

The tricky part in that scenario would be figuring out the migration path and what happens to import/export of those fields that already exist in Modulestore. I think it's totally doable, but there are a lot of details to work out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding defining a new CourseRun vs using the existing CourseOverview model...

Pros of using CourseOverview:

  • A lot of models already hang off of it.
  • A lot of views/serializers/etc are already written to expect CourseOverview objects.
  • It's already there. We'd have one model for this concept rather than two.

Cons of using CourseOverview:

  • The .id field (the PK) is the course key string, which is weird and probably not performant (?)
  • Has a huge mish mash of fields -- they were selected "by need" rather than "by design".
  • Bunch of weird footgun properties which invoke modulestore.get_course() anyway.
  • Data flow -- currently, the model is a persisted cache of modulestore, populated asynchronously upon course publish. We'd have to flip that without breaking anyone.
  • A lot of edx-platform duck-typedly tolerates both CourseBlock and CourseOverview instances, which can be confusing. It could be valuable to be forced to rewrite some of this code to explicitly handle only CourseRun objects.

If we make a new CourseRun field, it could be valuable to backfill course_run = OneToOneField(CourseRun) onto CourseOverview.

After listing that all out, I think I've come around to the side of defining a new CourseRun model, assuming we are very protective about what goes into it.


Status
******

Accepted
feanil marked this conversation as resolved.
Show resolved Hide resolved

Implementation pending

Decision
********

edx-platform will have a first-class model to represent the idea of a *catalog course*.
If we think of multiple runs of the same course to all derive from one
conceptual "course", the catalog course is what will represent that conceptual course.

Context
*******

The term "course" in Open edX is ambiguous, referring to one of two things:

1. An individual *run* of a course which is authored in Studio and published to LMS. Here, we call this the **CourseRun**. In edx-platform, it is modeled by the **CourseOverview**.
2. A logical course *catalog* entry that may be marketed, sold, awarded as a credential, and bundled into programs. Here, we call this the **CatalogCourse**.

Every one CatalogCourse maps to N CourseRuns.

Historically, CatalogCourses have been inferred from the names of courses with the
runs removed (eg. `course-v1:AximX/Eng101` could be a catalog course with multiple runs
(`course-v1:AximX/Eng101:2024H1`, `course-v1:AximX/Eng1012024H2`, etc)). Having the catalog course explicitly defined as a new model has multiple benefits.
runs removed (eg. "AximX/Eng101" could be a catalog course with multiple runs
(2024H1, 2024H2, etc)). Having this explicitly defined as a new model has
multiple benefits.

* Relationships between course runs can be defined explicitly instead of implicitly.

* Currently this information is either implied (via course key) or requested from
course-discovery. In the case of edx-platform, it loads the entire course-discovery
database into a cache.

* A list of all courses can be easily queried both internally and via the rest API.

* Catalog courses can be explicitly mapped to organizations rather than implicitly.

* The door is opened to also persist program or course bundle data in the future.
This will stop `programs-related features from breaking whenever memcached is reset <https://github.com/openedx/edx-platform/blob/master/openedx/core/djangoapps/catalog/docs/decisions/001-programs-cache.rst>`_.

* Org → Course → Course Runs mappings don't have to follow implicit naming
Conventions and associations can be changed more easily.

Rejected Alternatives
*********************

Rely on course-discovery for CatalogCourse
==========================================

The course-discovery service already has some models to represent this and we
could rely on it directly but it also contains a lot of representations that
are specific to the edx.org process and it's not valuable to spin up the
course-discovery service just to get this information for most operators.
Comment on lines +57 to +60
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Is there is an existing toggle, or a new toggle, which will be used to choose between different behaviors?
  2. Will the implementation be fully backward-compatible? Will the implementation fully support both using and not using course-discovery? For example, if discovery code is needed to call the new APIs, is that intended to be part of this implementation?
  3. Is there any longer-term deprecation of discovery in mind that this is a part of? Or possibly just making the external catalog more pluggable?
  4. Separate, but related, there are a number of places where the LMS makes synchronous calls to Discovery, which causes reliability issues for the LMS when Discovery is having issues. Do you know if a benefit of this design would be to replace any of these calls, even in cases where Discovery is being used as a separate service?

Copy link
Member

@kdmccormick kdmccormick Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robrap Can you elaborate on 4?

I know of the the programs cache, which this will definitely help.

My search just now also introduced me to the learner_skills_levels api, which this will not directly help since it relies on course-discovery's taxonomy API. In the long-term, though, taking course-discovery-querying utils out of edx-platform will have the benefit of discouraging new instances of the edx-platform-queries-course-discovery anti-pattern.

Do you know of other instances?

Copy link
Contributor

@robrap robrap Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding 4, I haven't actually looked into the problem specifically. I've just seen its affect when we've had temporary issues on discovery that back up into the LMS. I probably need to look more closely when we have an actual incident. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I'm mostly not going to address this until we've arrived at a general agreement on what idea it would be useful to implement which we're still kind of working through. Just making a note that I haven't missed it but seems premature.


As long as any local data can be updated from course-discovery if
course-discovery is setup in an environment, we should not have any conflicts
between the two locations.
Comment on lines +62 to +64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we specify what will be the source of truth for both cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DawoudSheraz my thought would be that this would be up to the operator. If the operator wants to use an external service as the source of truth, they would hook into the API and push changes into models in the LMS and could setup permissions so that data can't change from the LMS side. If the operator wants the LMS to be the source of truth, they can setup events and listen for changes to push data into their other systems. You could also do both to allow editing on either side and syncing in both directions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also do both to allow editing on either side and syncing in both directions.

A similar type of model currently exists between Discovery-Studio. It has caused sync issues on edX side in the past. As for the source of truth, if it is up to the operator, we should be mentioning that explicitly in ADR.


Essentially, if an operator has course-discovery enabled, the models here would
be a persisted cache of the relevant information. However, if
course-discovery is not enabled, the models can stand-alone and be populated
either directly via APIs or automatically based on course naming conventions so
that any features that rely on them can work in either system.


Implementation Details
**********************

* Create a new CatalogCourse model

* A 1:N mapping between 1 Catalog Course and N Course Run objects.

* Open question: do we re-use CourseOverview to represent Course Run, or do we make a new CourseRun run model?

* A 1:N mapping between 1 Organization and N Catalog Course objects.

* A django admin that allows you to View and Manipulate Catalog Course objects and their associations.

* A data migration to make all current implicit associations explicit.

* Eventing hooks to auto-associate new runs to their implicit Catalog Course
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought you had mentioned a toggle around whether Discovery is being relied upon? In that case, is the expectation that events would instead be sent from Discovery to keep this data up-to-date?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure. Course discovery is not currently required or enabled by default in Tutor. My expectation would be that this would remain true in the future but that the new models would get auto populated when course runs are created.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good question. I don't have a strong opinion on whether course-discovery should push, pull, or both. Mainly, I want to make sure that we treat course-discovery like any other external catalog system. So:

  • If course-discovery wants to push data to edx-platform, it should use the CRUD API mentioned below.
  • If course-discovery wants to pull data from edx-platform, then it should hook into these events.

So, either way, we'll want both the API and the event hooks.


* A REST api to CRUD Course Overview. Catalog Course, and Org data, along with their associations.
Loading