File uploads + Experimental Media Server #31
Conversation
@ormsbee Super interesting. I'm still collecting my thoughts about this, but here goes:
I think the usual/naive way to do this would be to serve this URL from Django, and have it issue a redirect to the hashed URL from the object store's CDN, right? This would ensure perfect caching of objects in the user's browser across versions (because the browser sees that the new version redirects to the same hashed URL, and both the original redirects and the hashed object store URLs are immutable responses that can be cached forever). But the downside would be that the redirects would break the relative URLs functionality you liked ("If there's a piece of HTML that's referencing tracks/en-US.vtt, we don't have to worry that the file is really odpzXni_YLt76s7e-loBTWq5LSQ in S3.")
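For illustration, a minimal sketch of that redirect view, assuming a hypothetical `look_up_hash_digest` helper and a made-up CDN domain:

```python
from django.shortcuts import redirect
from django.views.decorators.cache import cache_control

# A versioned URL always maps to the same hashed object, so the redirect
# itself can be marked immutable and cached forever.
@cache_control(public=True, max_age=31536000, immutable=True)
def serve_versioned_asset(request, learning_package_uuid, version, path):
    # look_up_hash_digest is hypothetical: it maps (package, version, path)
    # to the content hash used as the object store key.
    hash_digest = look_up_hash_digest(learning_package_uuid, version, path)
    return redirect(f"https://cdn.example.com/{learning_package_uuid}/{hash_digest}")
```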
Yeah. As far as I know, the only way to achieve this is using redirects as described above, or not including the version number in the URL. Speaking of which, that does suggest an alternative approach: Don't include a version number in the URL, and rely on ETag to indicate to browsers whether or not there is a new version. If the browser already has the latest version, just return a 304 Not Modified. So an example would be:
If the browser last saw v3, it sends a request to this URL where the ETag contains the hash of the actual object data from v3. If v4 has since been published but the hash is unchanged, our Django app simply returns a 304. Now, I don't think this is the best approach, but I'm just mentioning it in case it's helpful or inspires any other ideas.
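Concretely, the conditional-request flow might look something like this in Django (the `get_latest_content` helper and the content attributes are hypothetical):

```python
from django.http import HttpResponse, HttpResponseNotModified

def serve_unversioned_asset(request, learning_package_uuid, path):
    # get_latest_content is hypothetical: it returns the currently published
    # Content row for this path, whose data hash doubles as the ETag value.
    content = get_latest_content(learning_package_uuid, path)
    etag = f'"{content.hash_digest}"'
    if request.headers.get("If-None-Match") == etag:
        # The browser already has these exact bytes; v3 -> v4 with an
        # unchanged hash lands in this branch.
        return HttpResponseNotModified()
    response = HttpResponse(content.data, content_type=content.mime_type)
    response["ETag"] = etag
    return response
```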
I love to see crazy ideas, and I love async code, so this is very cool and I like the approach. But I will say that the idea of a new required app just to make the LMS function, which has to be separately deployed and which uses a somewhat different stack, gives me pause.

Further, I believe that the best practice these days is to run this kind of microservice at the edge as serverless functions, i.e. distributed globally on a CDN - using e.g. Lambda@Edge, Deno Deploy, Vercel Edge Functions, Fastly Compute@Edge, etc. And while Lambda@Edge can run python, most other CDNs cannot, so it could be better to implement in TypeScript or WebAssembly, which work on any edge CDN and which have even stronger async primitives than python. With these sorts of setups, you typically have a "shield" within the CDN that provides an intermediate cache: the edge nodes send their requests to the shield node within the CDN, which may then send a cached result back to the edge node, or make a request to the LMS on a cache miss. That way there is essentially only ever one request to the LMS per file, and everything from that point forward gets cached within the CDN and served to users with lightning speed.
Edit - revised version of the above: I'm only familiar with Deno. If we want to deploy on the edge, it's probably best to not include any database logic, but rather use a tiny+fast learning core REST API to determine the current version and file hash for any request that's not found in the edge cache. So I guess my suggestion/question would be this: what if we implement something simple and robust within learning core that works out of the box with no additional microservice (but with suboptimal caching), and we include a couple of examples of highly optimized edge functions that can be deployed onto a CDN to provide near-perfect caching at the edge? Then the deployment story stays simple for small instances and developers, and for big instances that are deploying CDNs anyway, it's likely easier to deploy a tiny edge function than to spin up a separate microservice.
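As a rough sketch of what one of those example edge functions might look like, here is a Python Lambda@Edge origin-request handler; the learning core endpoint, its response shape, and the URL layout are all assumptions for illustration:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical learning core REST endpoint that maps a logical asset path
# to the current version's immutable object key.
LOOKUP_API = "https://lms.example.com/api/learning_core/v1/asset_lookup"

def handler(event, context):
    # Lambda@Edge origin-request events carry the CloudFront request here.
    request = event["Records"][0]["cf"]["request"]
    query = urllib.parse.urlencode({"path": request["uri"]})
    # On an edge cache miss, ask learning core for the current file hash.
    # A shield cache in front of the LMS keeps this to roughly one request
    # to the LMS per file.
    with urllib.request.urlopen(f"{LOOKUP_API}?{query}") as resp:
        info = json.load(resp)
    # Rewrite the URI so the origin (object store) serves the hashed object.
    request["uri"] = f"/{info['learning_package_uuid']}/{info['hash_digest']}"
    return request
```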
```python
content.file.save(
    f"{content.learning_package.uuid}/{hash_digest}",
    ContentFile(data_bytes),
)
```
Is this saving every Content's data to both MySQL and S3?
No. This code is really hacky. The only content that gets here are the static assets referenced from XBlocks, and they are stored only in storages/s3 (the `data=data_bytes` line was removed, so no db-level storage).
There's a separate part where the XBlock's own OLX data is read in, and that part remains db-only for storage.
Ah ok, I was reading this quickly and didn't catch that. Makes sense.
I think that works if there's an implicit
I really like this idea! If the default setup uses MinIO instead of S3 and the latency is lower and more predictable, it might even be good enough for moderate-sized sites. I think I'm sold on this as an overall direction.

I'm guessing we'll still want to have a separate site for security reasons, even if it hits the same processes in our out-of-the-box option. I've seen a guide on how to make the root urls.py swappable via middleware. I'm guessing we'd do something like that, possibly with a separate middleware stack if that's easy to arrange?

Also, I wonder how we do assets that have auth requirements to view them. The content serving middleware in edx-platform today has session info, so it can apply the permissions check there. I'm not sure what the right way to do this is when the server URLs are split. I guess we could generate URLs that have some token in them that the server knows how to translate for those types of assets... and we just never cache them, I guess?
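For reference, a minimal sketch of that urlconf-swapping middleware pattern, relying on Django's per-request `request.urlconf` support; the setting and module names here are made up:

```python
from django.conf import settings

class MediaSiteURLConfMiddleware:
    """Serve a separate root urls.py when requests arrive on the media host."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # MEDIA_SERVER_HOST is a hypothetical setting naming the asset domain.
        if request.get_host() == settings.MEDIA_SERVER_HOST:
            # Django honors request.urlconf as a per-request root URLconf,
            # so asset URLs can hit the same processes under a different site.
            request.urlconf = "openedx_learning.media_server_urls"  # hypothetical
        return self.get_response(request)
```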
So for most assets in a CDN-worker-powered world:

Does that sound right?
Glad you like it; I like it too :)
I hadn't thought about that but it's a good point - that absolutely makes sense to me. 👍🏻
In part it depends on what kind of permissions scheme we want to support. (Have we sketched that out anywhere?) But generally I don't think we want to be dealing with session tokens and complex permission logic at this level, so a token-based approach makes sense to me - users with a valid token can view the asset and those without it cannot. Issuing tokens is the responsibility of something higher up the stack.

To improve caching, we could actually put the tokens in a cookie rather than the URL. Assuming that the tokens are set at the learning package level (which would be much more efficient than at the component level or content level), a token cookie could be set with its domain and path configured such that the browser only sends the token when requesting asset files for that specific learning package. Whether the request is handled by Django or a CDN edge function, it could check whether the asset is private and, if so, validate the token before returning the result.

The advantage of this approach is that the assets are still immutable and cacheable forever with unchanging URLs, but only accessible by authorized users. The disadvantage is that it can be tricky to ensure that a valid token is always configured with the right domain and path for each learning context that the author needs: either the frontend has to be on a common root domain with the CDN and explicitly manage the cookies, or the edge function needs to do some complex redirects to authenticate the user if they don't have a valid token at the moment (this works and is e.g. how GitLab serves authenticated private static sites on arbitrary domains, but is definitely complex). So in practice, including the token in the URL may be much easier.
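A rough sketch of issuing such a scoped token cookie in Django, assuming assets live under a per-package URL prefix and the frontend shares a root domain with the CDN; the `make_signed_token` helper and URL layout are hypothetical:

```python
from django.http import HttpResponse

def grant_asset_access(request, learning_package_uuid):
    # make_signed_token is hypothetical; something higher up the stack
    # decides whether this user may view the package and mints the token.
    token = make_signed_token(request.user, learning_package_uuid)
    response = HttpResponse(status=204)
    response.set_cookie(
        "asset_token",
        token,
        # Domain/path scoping means the browser only attaches this cookie
        # to asset requests for this specific learning package. The shared
        # root domain assumption from above is what makes this workable.
        domain=".example.com",
        path=f"/assets/{learning_package_uuid}/",
        secure=True,
        httponly=True,
        samesite="Lax",
    )
    return response
```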
Closing this PR. I'd like to take a stab at @bradenmacdonald's suggested approach, but it might be a while since I'm out on PTO for all of next week.
Superseded by #33 |
This is an experimental stab at: Content data model should use File Storage. (#29)
Status
This is not ready for real code review, but I'm putting it up more for a directional sanity check and feedback. I think it's unlikely to merge with this exact approach, but I wanted to talk about the ideas.
The Normal Part
I've added a FileField to the Content model. @bradenmacdonald and I had a discussion on when to have things in the BinaryField and when to offload them to the externally stored file. It's not enforced anywhere here, but the rule of thumb I was thinking about was:
After having messed with this a little, I actually think Content should allow for having both simultaneously. So if part of your code is going to parse the `srt` file (because you need to do a crazy conversion to a hacky custom format on the fly), but that same `srt` file is also going to be served to browsers directly, then it should have both the BinaryField for internal use and the FileField for browsers to reference. Since it's an access optimization and not really a content change, copying data from one of those fields to the other could be a data migration that happens without creating new ContentVersions.
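As a sketch of that dual-representation idea; the `data` and `file` fields match the discussion above, but the other fields are simplified stand-ins for the real model:

```python
from django.db import models

class Content(models.Model):
    # Simplified sketch; the real model has more fields and constraints.
    learning_package = models.ForeignKey("LearningPackage", on_delete=models.CASCADE)
    hash_digest = models.CharField(max_length=40)
    # In-database bytes, for code that needs to parse or transform the
    # content on the fly (e.g. converting an srt file to a custom format).
    data = models.BinaryField(null=True, blank=True)
    # Externally stored copy of the same bytes, for direct browser serving.
    file = models.FileField(null=True, blank=True)
```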
The Media Asset Naming Problem

While I was making this, I got into the dilemma of serving versioned media assets. In particular, I wrote:
The Crazy Part: FilePony
There's a new top-level package called `filepony`, which would be run using FastAPI + uvicorn for read-only async file serving, but use Django for what I hope are quick, synchronous database lookups.

If any intrepid souls want to try this out, you have to install requirements again and start the media server like:
So why not Django Async Views? There were a few reasons:
FastAPI/Starlette is built with async in mind, up and down the stack. So I wanted to try an approach where we run async FastAPI code for almost everything and then jump into (synchronous ORM) Django just for the few milliseconds we need to look up the model information. I'm hoping that this, combined with generous caching and CDN usage, will be enough to make the performance acceptable for most uses (though there's a lot more to implement before we get to that).
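A concrete sketch of that split, with the blocking ORM call pushed into a worker thread; the settings module, import path, and model fields are illustrative, not the PR's actual code:

```python
import os

import django
from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool
from fastapi.responses import RedirectResponse

# Wire up Django before importing any models; the settings module is a guess.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "filepony.settings")
django.setup()

from openedx_learning.core.contents.models import Content  # hypothetical path

app = FastAPI()

def lookup_file_url(package_uuid: str, path: str) -> str:
    # The only synchronous Django bit: a quick indexed ORM lookup.
    # Field names here are illustrative.
    content = Content.objects.get(learning_package__uuid=package_uuid, path=path)
    return content.file.url

@app.get("/media/{package_uuid}/{path:path}")
async def serve_asset(package_uuid: str, path: str):
    # Run the blocking ORM call in Starlette's threadpool so the event loop
    # stays free; everything else in the request path stays async.
    file_url = await run_in_threadpool(lookup_file_url, package_uuid, path)
    return RedirectResponse(file_url)
```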
I realize that I should also do it using Django async views, and measure what kind of performance difference we see between the two (though I still worry more about regressions on the Django side).
Also, this entire thing might be obviated by a different design approach, or by a mature server out there that already does this.