General outline of proposed updates to publishing logic #1522

Closed
jtgeibel opened this issue Oct 16, 2018 · 13 comments
Labels
A-backend ⚙️ A-publish C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear

Comments

@jtgeibel
Member

jtgeibel commented Oct 16, 2018

I've recently gone through our publishing logic and would like to document my findings and propose some changes. This was originally raised in the context of a cargo publish --dry-run option in #1517, but I think some refactoring here would help with proposed enhancements such as background jobs (#1466) and direct client uploads to S3.

Currently (edit: updated 2019-11-13)

Our publishing logic currently proceeds in the following sequence:

  • Check metadata length against the global max crate size
  • Decode metadata
  • Check for non-empty: description, license, authors
  • Verify user is authenticated
  • Obtain database connection and enter transaction
  • Ensure user has a verified email address
  • Validate URLs if present: homepage, documentation, repository (NewCrate::validate)
  • Ensure name is not reserved (NewCrate::ensure_name_not_reserved)
  • If crate is not present, insert it and add the user as an owner (NewCrate::save_new_crate)
  • If this is a brand new crate, check the rate limit (NewCrate::create_or_update)
  • If crate already existed, update it (NewCrate::create_or_update)
  • Check that the user has publish rights on the crate
  • Check that the new name is identical to the existing name (sans-canonicalization)
  • Check that Content-Length header exists and doesn't exceed the crate specific max
  • Validate license if specified (NewVersion::validate_license via NewVersion::new)
  • Check if the version already exists (NewVersion::save)
  • Insert version and add authors to the version (NewVersion::save)
  • Iterate over deps (models::dependency::add_dependencies)
    • Check that the dependency is not from an alternate registry
    • Check that crate exists in the database
    • Enforce "no wildcard" constraint
    • Handle package renames
    • Insert deps into database
    • Return vec of git::Dependency
  • Update keywords (Keyword::update_crate)
  • Update categories, returning a list of ignored categories for warning (Category::update_crate)
  • Update badges, returning a list for warnings (Badge::update_crate)
    • Validate deserialization to our enum, collecting invalid ones
    • Update database
  • Use database to obtain max_version of the crate (for response)
  • If readme provided, enqueue rendering and upload as a background job
  • Proposed --dry-run check
  • Upload crate (uploaders::upload_crate)
    • Read remaining request body
    • Verify tarball
    • Upload crate
    • Calculate crate tarball hash
  • Enqueue index update
  • Encode response
  • Commit database transaction

Background job: Render and upload README

Defined in render::render_and_upload_readme

  • Render README (render::readme_to_html)
  • Obtain connection
  • Record README rendered_at for version (Version::record_readme_rendering)
  • Upload the rendered README (uploaders::upload_readme)
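
For reference, the rendering step itself is a Markdown-to-HTML conversion. Here is a minimal sketch using the comrak crate, which the render module builds on; the real render::readme_to_html does considerably more (e.g. sanitizing the generated HTML before it is uploaded), so this is illustrative only:

```rust
use comrak::{markdown_to_html, ComrakOptions};

/// Minimal sketch of the "Render README" step. The actual
/// render::readme_to_html also sanitizes the resulting HTML and handles
/// other details before the upload step runs.
fn render_readme(source: &str) -> String {
    markdown_to_html(source, &ComrakOptions::default())
}
```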

Background job: Update Index

Defined in git::add_crate

  • Determine file path from crate name
  • Append line of JSON data to file in registry
  • Commit and push
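
For context, the file path derivation follows the standard registry index layout. A simplified sketch (ASCII names assumed, case normalization omitted):

```rust
/// Simplified sketch of the index path layout: 1- to 3-character names get
/// their own top-level directories, longer names are bucketed by their
/// first four characters. Assumes ASCII names; case normalization omitted.
fn index_path(name: &str) -> String {
    match name.len() {
        0 => unreachable!("crate names are validated to be non-empty"),
        1 => format!("1/{}", name),
        2 => format!("2/{}", name),
        3 => format!("3/{}/{}", &name[..1], name),
        _ => format!("{}/{}/{}", &name[..2], &name[2..4], name),
    }
}
```

For example, `index_path("serde")` returns `"se/rd/serde"`.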

Proposed

Notes

  • We enforce a 50MB max in nginx
  • We should add a configuration entry for the global max size of the metadata (we currently use the max tarball size in several places)
  • A few guidelines I tried to follow:
    • Identify and reject invalid requests as quickly as possible.
    • Minimize the work done while holding a database connection, especially after entering the main transaction.
    • The final main transaction may need to repeat some queries to ensure it doesn't rely on data obtained outside of the transaction.

Verify Headers

  • Verify user is authenticated
  • Check that Content-Length header exists and doesn't exceed global max tarball + global max metadata + 2 size fields
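
Since the two length prefixes in the publish body are 32-bit integers, the proposed bound works out to max tarball + max metadata + 8 bytes. A sketch of the check; the limits themselves would come from configuration:

```rust
/// Sketch of the proposed Content-Length bound. The two length prefixes in
/// the publish body are 4 bytes each; the limits are hypothetical
/// configuration values.
fn content_length_ok(content_length: u64, max_tarball: u64, max_metadata: u64) -> bool {
    const SIZE_FIELD_BYTES: u64 = 4;
    content_length <= max_tarball + max_metadata + 2 * SIZE_FIELD_BYTES
}
```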

Verify Request Body

  • Check metadata length against the global max metadata size
  • Read in metadata
  • Read in tarball size, verify tarball size + metadata size + 2 size fields == Content-Length
  • Decode metadata
  • Check for non-empty: description, license, authors
  • Validate URLs if present: homepage, documentation, repository
  • Validate license if specified
  • Iterate over deps
    • Enforce "no wildcard" constraint on deps
    • Check that the dependency is not from an alternate registry
  • Validate deserialization of badges into enum, collect invalid ones
  • Read remaining request body
  • Verify tarball
  • Calculate crate tarball hash (for registry)
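
For reference, the publish request body is framed as a 32-bit little-endian length, the JSON metadata, another 32-bit little-endian length, and then the .crate tarball. A sketch of the read-and-verify steps above (buffering instead of streaming, and with error handling reduced to io::Error):

```rust
use std::io::{self, Read};

/// Sketch of reading the framed publish body. Real code would enforce the
/// metadata/tarball size limits before allocating, and would stream the
/// tarball rather than buffering it.
fn read_publish_body(mut body: impl Read, content_length: u64) -> io::Result<(Vec<u8>, Vec<u8>)> {
    let metadata_len = read_u32_le(&mut body)? as u64;
    let mut metadata = vec![0u8; metadata_len as usize];
    body.read_exact(&mut metadata)?;

    let tarball_len = read_u32_le(&mut body)? as u64;
    // Proposed sanity check: the two payloads plus their two 4-byte length
    // prefixes should account for the entire request body.
    if metadata_len + tarball_len + 8 != content_length {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "Content-Length mismatch"));
    }

    let mut tarball = vec![0u8; tarball_len as usize];
    body.read_exact(&mut tarball)?;
    Ok((metadata, tarball))
}

fn read_u32_le(reader: &mut impl Read) -> io::Result<u32> {
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf)?;
    Ok(u32::from_le_bytes(buf))
}
```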

With database, outside of main transaction

  • Obtain database connection
  • Ensure user has a verified email address
  • Ensure name is not reserved
  • Obtain a list of valid and invalid categories
  • Ensure that all deps exist
  • If crate exists
    • Check that the new name is identical to the existing name (sans-canonicalization)
    • Verify tarball doesn't exceed the crate specific max
    • Check that the user has publish rights on the crate
  • If crate didn't exist
    • Check the rate limit
    • Verify tarball doesn't exceed default max
  • Check if the version already exists
  • --dry-run check
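
If a --dry-run flag were accepted here (how cargo would transmit it is out of scope for this outline), the endpoint could respond at this point with the warnings collected so far, reusing the warnings shape the publish response already has. Illustrative sketch only, with made-up names:

```rust
use serde::Serialize;

/// Illustrative only: the warnings portion of the publish response, which a
/// --dry-run request could return without entering the write transaction.
#[derive(Serialize)]
struct PublishWarnings {
    invalid_categories: Vec<String>,
    invalid_badges: Vec<String>,
    other: Vec<String>,
}

fn dry_run_warnings(invalid_categories: Vec<String>, invalid_badges: Vec<String>) -> PublishWarnings {
    PublishWarnings {
        invalid_categories,
        invalid_badges,
        other: vec![],
    }
}
```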

Start writing within the transaction

  • Enter database transaction
  • If crate didn't exist then insert and add the user as an owner
  • If crate was present, update it (TODO: review what fields on the crate we update under which circumstances. How do we deal with prereleases (Incorrect metadata coming from last published version #1389) and backports?)
  • Insert version (abort if exists) and add authors to the version
  • Record README rendered_at for version
  • Iterate over deps
    • Handle package renames
    • Insert deps into database
  • Update keywords
  • Update categories
  • Update badges
  • Use database to obtain max_version of the crate (for response)
  • Iterate over deps to get a vec of git::Dependency
  • Upload crate
  • Background jobs
    • If readme provided, enqueue rendering and upload as a background job
    • Enqueue index update
  • Commit database transaction
  • Encode response
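
At the time of writing crates.io is on diesel 1.x, so the whole write phase would sit inside a single Connection::transaction call, roughly like the sketch below. The step bodies are elided; only the ordering relative to the commit is the point:

```rust
use diesel::prelude::*;
use diesel::result::Error;

/// Structural sketch only (diesel 1.x style): everything in the list above
/// runs inside one transaction, so a failed insert or S3 upload rolls the
/// whole publish back, and the background-job rows only become visible to
/// the job runner once the transaction commits.
fn write_phase(conn: &PgConnection) -> Result<(), Error> {
    conn.transaction::<_, Error, _>(|| {
        // insert/update the crate, version, dependencies, keywords,
        // categories, and badges
        // upload the .crate file to S3
        // enqueue the README render and index update background jobs
        Ok(())
    })
}
```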
@sgrif
Contributor

sgrif commented Oct 16, 2018

Just for posterity, max crate size will eventually need to be checked asynchronously, so this probably should be primarily enforced in cargo.

@carols10cents
Member

carols10cents commented Nov 11, 2019

I've been thinking of how this task could be split up into smaller tasks to make it easier to review and lower the risk of breaking something (or at least making smaller changes so that we can tell which change broke something).

I tried putting the current and proposed lists as revisions of a gist to enable viewing them in diff format, it helps a little I think: https://gist.github.com/carols10cents/4f32c43855fdfd77a8a5b48f53ab06b5/revisions#diff-b160a194db1110d5914710115e64d429

I also think there's opportunity to refactor the publish function and the parse_new_headers function by extracting smaller functions that name the checks they're doing, so that the publish function reads more like the bulleted list here and each function encapsulates the exact implementation of these checks.
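
To make that concrete, here's a purely hypothetical skeleton of what the controller could look like after such an extraction (none of these function names exist in crates.io today; they just mirror the four phases proposed above):

```rust
// Hypothetical skeleton only; the names mirror the phases in the proposal.
struct PublishRequest;
struct ValidatedPublish;
struct PublishResponse;
struct PublishError;

fn verify_headers(_req: &PublishRequest) -> Result<(), PublishError> {
    Ok(())
}

fn verify_request_body(_req: &PublishRequest) -> Result<ValidatedPublish, PublishError> {
    Ok(ValidatedPublish)
}

fn pre_transaction_checks(_publish: &ValidatedPublish) -> Result<(), PublishError> {
    Ok(())
}

fn write_within_transaction(_publish: ValidatedPublish) -> Result<PublishResponse, PublishError> {
    Ok(PublishResponse)
}

fn publish(req: PublishRequest) -> Result<PublishResponse, PublishError> {
    verify_headers(&req)?;
    let validated = verify_request_body(&req)?;
    pre_transaction_checks(&validated)?;
    write_within_transaction(validated)
}
```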

So I'd like to see these smaller changes to the code in the publish controller made in separate PRs in approximately this order (multiple items in this list might be accomplished in one PR depending on how much reordering is or is not necessary):

Verify Headers

Verify Request Body

With database, outside of main transaction

Start writing within the transaction

  • TODO: I need to go right now but I will edit this with lots more items later

@carols10cents
Member

@jtgeibel can you clarify a bit more what you mean by this in the proposed section:

  • Read in tarball size, verify tarball size + metadata size + 2 size fields == Content-Length

As far as I can tell, this isn't a check we're doing directly right now. Are you thinking this is a quick way we can reject invalid requests rather than waiting until verify_tarball gets the data? Or would this new check prevent problems we could potentially be open to right now?

@carols10cents
Member

carols10cents commented Nov 13, 2019

* We should add a configuration entry for the global max size of the metadata (we currently use the max tarball size in several places)

Do we really need a separate setting for this? Isn't max content length an effective maximum on metadata, because you could theoretically have a tarball that's 0 bytes and metadata that takes up the rest of the space? Just thinking in terms of what we would set this configuration option to if we had it!

I suppose that gets into how we want to resolve this comment and how we want to communicate this limit to someone whose crate is getting rejected who has a .crate file below our stated limits.

@carols10cents
Member

Since the original issue was created, we've added the requirement that the publishing user have a verified email address. It's currently pretty early in the process, and it doesn't depend on the crate content at all, but it does need a database connection. If I'm following the logic you've laid out here, I think it should get inserted here? Do you agree @jtgeibel ?

  ### With database, outside of main transaction
  
  * Obtain database connection
+ * Ensure user has a verified email address
  * Ensure name is not reserved

(also the reason that I'm all of a sudden all over this issue is that I think this can be split into a bunch of smaller contributions we can get new people to swarm on, and also it'll help us enable cargo publish --dry-run that I think would be a nice feature to do soon)

@jtgeibel
Member Author

Thanks for taking a fresh look at this @carols10cents! I definitely agree with the approach of taking small, reviewable steps towards this general outline.

I also think there's opportunity to refactor the publish function and the parse_new_headers function by extracting smaller functions that name the checks they're doing, so that the publish function reads more like the bulleted list here and each function encapsulates the exact implementation of these checks.

👍 This is what I have in mind as well. In general, I think we should review logic that is currently in models like NewCrate, NewVersion, and dependency::add_dependencies, as some of this logic may be clearer if moved into the controller. It seems like the models could contain just the minimum needed to orchestrate tests, with nearly everything else being private to this one endpoint.

I've gone through the code again and updated the "Currently" section above, splitting out the background job work and adding notes showing where other steps are currently located. I've also added several new steps (to both sections):

  • Verified email
  • Rate limiting for new crate names
  • For declared dependencies
    • Reject alternate registries
    • Handle package renames

Read in tarball size, verify tarball size + metadata size + 2 size fields == Content-Length

As far as I can tell, this isn't a check we're doing directly right now.

No, we don't do that currently. I was considering it mainly as a quick sanity check on the request that can be done very early in the request processing. We might want to land it in an atomic deploy, if we decide to add such a check.

We should add a configuration entry for the global max size of the metadata (we currently use the max tarball size in several places)

Do we really need a separate setting for this? Isn't max content length an effective maximum on metadata, because you could theoretically have a tarball that's 0 bytes and metadata that takes up the rest of the space? Just thinking in terms of what we would set this configuration option to if we had it!

I think it makes sense to add a separate config item here, but I don't know what we should set it to. Maybe we should add some logging to get a feel for some typical metadata sizes. The publish endpoint is fairly attractive from a DoS perspective, and the metadata could kick off a lot of database activity. I could potentially see a scenario where it would be nice to drop this limit quickly via an environment variable, although I haven't really put much serious thought into what such an attack might look like and whether this would be an effective mitigation.

(also the reason that I'm all of a sudden all over this issue is that I think this can be split into a bunch of smaller contributions we can get new people to swarm on, and also it'll help us enable cargo publish --dry-run that I think would be a nice feature to do soon)

Both of these sound great to me!

@carols10cents
Member

I've gone through the code again and updated the "Currently" section above

Thanks!!! I've updated the diff view and I'm going to update the checklist next :)

@carols10cents
Member

I decided to split off the max metadata length check to a separate issue: #1896

@carols10cents
Member

### Verify Request Body

...
* Upload crate and rendered README (if not `--dry-run`)

### With database, outside of main transaction

...

### Start writing within the transaction

...

Hm, shouldn't we wait until we've validated everything before uploading to S3? If the crate doesn't make it into the database or the index, then I don't think there's a way for cargo to download the .crate thinking it's legitimate, but I wonder about invalid stuff (say, by someone who doesn't own a crate) being uploaded and accessible from the direct static.crates.io URL. Am I missing something?

@jtgeibel
Member Author

jtgeibel commented Nov 14, 2019 via email

Those uploads are now done in the background jobs.

@carols10cents
Member

Those uploads are now done in the background jobs.

The readme rendering and uploading is done in a background job, but I don't think uploading the .crate file is as far as I can tell? (See where upload_crate is called in the publish endpoint, and the definitions of upload_crate and upload.)

@jtgeibel
Member Author

The readme rendering and uploading is done in a background job, but I don't think uploading the .crate file is as far as I can tell?

You're right. I've updated the issue above to clarify. (I thought I had already done so, but maybe I didn't hit save on that edit.)

In the proposed section above, I've added the crate upload step to be the last step before enqueueing the background jobs.

@Turbo87
Member

Turbo87 commented Jun 19, 2022

Since there hasn't been any activity here for 2.5 years, I guess we can close this issue. Feel free to reopen if it becomes relevant again :)

@Turbo87 Turbo87 closed this as completed Jun 19, 2022