Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve upload and download API docs #2955

Merged
merged 3 commits into from
Nov 1, 2024
Merged

Improve upload and download API docs #2955

merged 3 commits into from
Nov 1, 2024

Conversation

willkg
Copy link
Contributor

@willkg willkg commented Jul 3, 2024

This improves the HTTP status codes, adds response headers where applicable, and adds a first-pass section on improving symbol upload success rates.

This should help users who are using the download and upload APIs know what to do in certain status code situations that were previously undocumented.

This specifies the response headers so we can write systemtests for them.

This will help people writing symbol upload jobs with a set of things they can look at to improve the likelihood those jobs finish successfully. We can hone this section over time.

For whoever reviews this, feel free to merge it if it's approved. Thank you!

@willkg willkg requested a review from a team as a code owner July 3, 2024 17:23
:resheader Content-Length: length in bytes for the file
:resheader Content-Encoding: (optional) content encoding for the file; note
that ``.sym`` files are compressed even thought he file extension doesn't
indicate that
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are the critical response headers. I'm a little fuzzy on HTTP 302s, though. Does this sound right? Are there other headers we want to include here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the Location header is in the response from Tecken. The other headers are part of the response from S3. We should make that clear in this list.

The 302 response from Tecken will have these headers:

  • Location – the redirect URL
  • Debug-Time – the time if Debug:true was in the request. (I'm unsure we really want this to be part of the documented interface)
  • Content-Length: 0 – a redirect response doesn't have a body. We don't need to document this.
  • Content-Type: text/html – Django seems to be putting this there by default. It doesn't really make sense.
  • There's no Content-Encoding header in the Tecken response. There's no content, so specifying an encoding would be pointless.

If we follow the redirect, we talk directly to S3. The response include these headers:

  • Content-Encoding: gzip – If the file is compressed. In general, the file should only be returned with this content encoding if the client indicated that it can deal with it with Accept-Encoding: gzip. However, S3 will always return it if the file was uploaded with this encoding. (GCS actually decompresses the file on the server side if the client doesn't indicate Accept-Encoding: gzip.) A content encoding is only used for transport and should be transparent to the end user. You get the "actual" content after decoding. For example you can navigate to https://s3.us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/xpcshell/4C4C44B655553144A1B2DAA18738D74B0/xpcshell.sym in Firefox, and you will see the text content. The file is still stored and transmitted compressed, but the Content-Encoding header indicates to Firefox that the file should be decompressed first. In curl you can use curl --compressed to get the same behaviour; curl will list all the content encodings it supports in the Accept-Encoding header, and if the response indicates that one of these encodings was used the content will be decoded accordingly.
  • Content-Length: The length of the response body. If the body is compressed, it's the size of the compressed body.
  • Content-Type: The content type of the response after decompressing it. Will be text/plain for symbol files.

I'm not actually sure whether this response makes things clearer or more confusing, but I'll send it anyway. The important point is that we should make clear what headers are in Tecken's response, and what headers are in S3's response, if we want to list the latter as well.

Copy link
Contributor Author

@willkg willkg Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the fact that it's coming from AWS is an implementation detail that isn't relevant to the docs.

However, I do see your point about which headers come in which response. I'll have to think about how to document that clearly. I'll think about this more when I get back from PTO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to think about the subtle difference in behaviour between S3 and GCS. Maybe we can add the Accept-Encoding: gzip header at the load balancer level in GCS to make sure the behviour is the same, but I'll have to look into it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:statuscode 500: sleep for a bit and retry; if retrying doesn't work, then please
file a bug report
:statuscode 503: sleep for a bit and retry
:statuscode 502: sleep for a bit and retry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 502 instead of 503?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the logs and we don't really emit 503s as far as I can tell. We do emit 502s, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 503s are emitted by the ELB. We don't have load balancer logs for Tecken, so we can't see 503s in the logs. I expect 503s to be far more common than 502s, but I don't think we have any way to verify that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you expect 503s to be more common than 502s for Tecken?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got these status codes mixed up. 504s are probably more common than 502s. The ELB will emit a 504 if the backends are overloaded, or when requests take too long. 503s can happen as well if all instances are unhealthy, which sometimes happens for Tecken.

We actually do have some numbers to esitmate how common 500s from the ELB are in general, compared to 500s from the backend, since the ELB stores aggregate numbers for these two classes of 500s. I would have expected that the vast majority of 500s come from the ELB, but it turns out it's only a slight majority – about 60% of all 500s come from the ELB and 40% from the backend.

:resheader Content-Length: length in bytes for the file
:resheader Content-Encoding: (optional) content encoding for the file; note
that ``.sym`` files are compressed even thought he file extension doesn't
indicate that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the Location header is in the response from Tecken. The other headers are part of the response from S3. We should make that clear in this list.

The 302 response from Tecken will have these headers:

  • Location – the redirect URL
  • Debug-Time – the time if Debug:true was in the request. (I'm unsure we really want this to be part of the documented interface)
  • Content-Length: 0 – a redirect response doesn't have a body. We don't need to document this.
  • Content-Type: text/html – Django seems to be putting this there by default. It doesn't really make sense.
  • There's no Content-Encoding header in the Tecken response. There's no content, so specifying an encoding would be pointless.

If we follow the redirect, we talk directly to S3. The response include these headers:

  • Content-Encoding: gzip – If the file is compressed. In general, the file should only be returned with this content encoding if the client indicated that it can deal with it with Accept-Encoding: gzip. However, S3 will always return it if the file was uploaded with this encoding. (GCS actually decompresses the file on the server side if the client doesn't indicate Accept-Encoding: gzip.) A content encoding is only used for transport and should be transparent to the end user. You get the "actual" content after decoding. For example you can navigate to https://s3.us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/xpcshell/4C4C44B655553144A1B2DAA18738D74B0/xpcshell.sym in Firefox, and you will see the text content. The file is still stored and transmitted compressed, but the Content-Encoding header indicates to Firefox that the file should be decompressed first. In curl you can use curl --compressed to get the same behaviour; curl will list all the content encodings it supports in the Accept-Encoding header, and if the response indicates that one of these encodings was used the content will be decoded accordingly.
  • Content-Length: The length of the response body. If the body is compressed, it's the size of the compressed body.
  • Content-Type: The content type of the response after decompressing it. Will be text/plain for symbol files.

I'm not actually sure whether this response makes things clearer or more confusing, but I'll send it anyway. The important point is that we should make clear what headers are in Tecken's response, and what headers are in S3's response, if we want to list the latter as well.

Comment on lines +187 to +188
:statuscode 429: your request has been rate-limited; sleep for a bit and retry
:statuscode 500: there's an error with the server; sleep for a bit and
retry; if retrying doesn't work, then please file a bug report
:statuscode 502: sleep for a bit and retry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the last week of logs for Tecken prod for download requests, and this list looks good to me (i.e. I couldn't find any other status codes to include).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logs are from nginx, so they only include requests that make it to nginx. If the ELB can't reach the backend, it will reply with a 503 or 504, and these requests won't be in the logs. It's possible to enable ELB logs as well, but we never did that for a variety of reasons. In GCP, we'll also have load balancer logs, so we'll be able to see all the requests.

retry; if retrying doesn't work, then please file a bug report
:statuscode 502: sleep for a bit and retry
:statuscode 504: the request is taking too long to complete; sleep for a bit
and retry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also found in the Tecken prod logs over the last week 29 occurrences (0.3% of all response codes) of a 499 response code. Interestingly, this is a non-standard code (not in MDN), but it appears to be a code used by nginx to indicate the client closed the connection before the server could send a response. I couldn't find this in nginx's docs, but I saw a bunch of articles about it with an organic web search.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not a response code clients will ever see, though. It's never actually sent by nginx, since the client already closed the connection. It's only recorded in the logs. We don't need to document this since it's not visible from the outside.

@willkg
Copy link
Contributor Author

willkg commented Jul 18, 2024

Update: I really appreciate all the comments and thoughts here. Thank you!

I want to address the comments, but probably won't have time to do that until next week. I also think that while updating docs is important, the discussion this kicked off was invaluable and several issues were created to fix tests and other things. Actually updating the docs is less important than that work was.

@willkg
Copy link
Contributor Author

willkg commented Aug 14, 2024

^^^ That updates the PR to factor in the comments from everyone. Does this match what we've got in AWS and what we're building in GCP now? Is it clear for users?

This improves the HTTP status codes, adds response headers where
applicable, and adds a first-pass section on improving symbol upload
success rates.

This should help users who are using the download and upload APIs know
what to do in certain status code situations that were previously
undocumented.

This specifies the response headers so we can write systemtests for
them.

This will help people writing symbol upload jobs with a set of things
they can look at to improve the likelihood those jobs finish
successfully. We can hone this section over time.
This fixes the location redirect download API docs by moving the
separate request into a new section. This adds status code items for
502, 503, and 504 where missing.
@willkg
Copy link
Contributor Author

willkg commented Oct 30, 2024

@smarnach , @biancadanforth: ^^^ does that look ok?

Copy link
Contributor

@biancadanforth biancadanforth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Had one question about one of the statements.

docs/upload.rst Outdated
Comment on lines 313 to 314
If you find your job is getting HTTP 429s or 504s frequently or it doesn't seem
like symbol uploads are being completed, try these tips:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should 504s be 502s, 503s and 504s per the previous discussion and additions to the HTTP codes? IIUC the reasons for these codes could affect both the download and upload APIs.

@willkg
Copy link
Contributor Author

willkg commented Nov 1, 2024

^^^ That clarifies the situations where the tips are helpful.

@willkg willkg added this pull request to the merge queue Nov 1, 2024
@willkg
Copy link
Contributor Author

willkg commented Nov 1, 2024

Thank you!

Merged via the queue into main with commit 99a1898 Nov 1, 2024
2 checks passed
@willkg willkg deleted the improve-docs branch November 1, 2024 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants