Improve upload and download API docs #2955
Conversation
docs/download.rst
Outdated
:resheader Content-Length: length in bytes for the file
:resheader Content-Encoding: (optional) content encoding for the file; note
    that ``.sym`` files are compressed even though the file extension doesn't
    indicate that
I think these are the critical response headers. I'm a little fuzzy on HTTP 302s, though. Does this sound right? Are there other headers we want to include here?
Only the Location header is in the response from Tecken. The other headers are part of the response from S3. We should make that clear in this list.
The 302 response from Tecken will have these headers:
- Location – the redirect URL
- Debug-Time – the request processing time, included if Debug: true was in the request. (I'm unsure we really want this to be part of the documented interface.)
- Content-Length: 0 – a redirect response doesn't have a body. We don't need to document this.
- Content-Type: text/html – Django seems to put this there by default. It doesn't really make sense for a bodyless response.
- There's no Content-Encoding header in the Tecken response. There's no content, so specifying an encoding would be pointless.
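Since only the Location header carries real information in Tecken's 302, a client that wants to handle the redirect itself just needs to pull that header out. A minimal sketch (the function name and the header-dict shape are illustrative, not part of Tecken's API):

```python
def redirect_target(status, headers):
    """Return the redirect URL from a 302 response, or None otherwise.

    `headers` is a plain dict of response header names to values;
    names are matched case-insensitively, as HTTP requires.
    """
    if status != 302:
        return None
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered.get("location")

# Example: a Tecken-style 302 response (hypothetical URL).
url = redirect_target(
    302, {"Location": "https://example.com/v1/xpcshell.sym", "Content-Length": "0"}
)
```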
If we follow the redirect, we talk directly to S3. That response includes these headers:
- Content-Encoding: gzip – present if the file is compressed. In general, the file should only be returned with this content encoding if the client indicated support via Accept-Encoding: gzip. However, S3 will always return it if the file was uploaded with this encoding. (GCS actually decompresses the file on the server side if the client doesn't send Accept-Encoding: gzip.) A content encoding is only used for transport and should be transparent to the end user: you get the "actual" content after decoding. For example, you can navigate to https://s3.us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/xpcshell/4C4C44B655553144A1B2DAA18738D74B0/xpcshell.sym in Firefox and see the text content. The file is still stored and transmitted compressed, but the Content-Encoding header tells Firefox that the file should be decompressed first. In curl you can use `curl --compressed` to get the same behaviour; curl will list all the content encodings it supports in the Accept-Encoding header, and if the response indicates that one of these encodings was used, the content will be decoded accordingly.
- Content-Length: the length of the response body. If the body is compressed, this is the size of the compressed body.
- Content-Type: the content type of the response after decompressing it. Will be text/plain for symbol files.
I'm not actually sure whether this response makes things clearer or more confusing, but I'll send it anyway. The important point is that we should make clear what headers are in Tecken's response, and what headers are in S3's response, if we want to list the latter as well.
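To make the transport-encoding point concrete, here's a sketch of what a client (or `curl --compressed`) effectively does: if Content-Encoding: gzip is present, the body is decompressed before it's handed to the user. This is illustrative only; real HTTP clients do this for you.

```python
import gzip

def decode_body(headers, body):
    """Undo a gzip content encoding, mirroring what `curl --compressed` does."""
    if headers.get("Content-Encoding", "").lower() == "gzip":
        return gzip.decompress(body)
    return body

# A .sym file as S3 would serve it: stored and transmitted gzipped,
# but transparently decoded by the client. (Example content is made up.)
raw = b"MODULE Linux x86_64 4C4C44B655553144A1B2DAA18738D74B0 xpcshell\n"
compressed = gzip.compress(raw)
assert decode_body({"Content-Encoding": "gzip"}, compressed) == raw
```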
I think the fact that it's coming from AWS is an implementation detail that isn't relevant to the docs.
However, I do see your point about which headers come in which response. I'll have to think about how to document that clearly. I'll think about this more when I get back from PTO.
We also need to think about the subtle difference in behaviour between S3 and GCS. Maybe we can add the Accept-Encoding: gzip header at the load balancer level in GCS to make sure the behaviour is the same, but I'll have to look into it.
I commented on the GCS implementation bug about this.
:statuscode 500: sleep for a bit and retry; if retrying doesn't work, then please
    file a bug report
:statuscode 503: sleep for a bit and retry
:statuscode 502: sleep for a bit and retry
Why 502 instead of 503?
I checked the logs and we don't really emit 503s as far as I can tell. We do emit 502s, though.
The 503s are emitted by the ELB. We don't have load balancer logs for Tecken, so we can't see 503s in the logs. I expect 503s to be far more common than 502s, but I don't think we have any way to verify that.
Why do you expect 503s to be more common than 502s for Tecken?
I got these status codes mixed up. 504s are probably more common than 502s. The ELB will emit a 504 if the backends are overloaded, or when requests take too long. 503s can happen as well if all instances are unhealthy, which sometimes happens for Tecken.
We actually do have some numbers to estimate how common 500s from the ELB are in general, compared to 500s from the backend, since the ELB stores aggregate numbers for these two classes of 500s. I would have expected the vast majority of 500s to come from the ELB, but it turns out it's only a slight majority – about 60% of all 500s come from the ELB and 40% from the backend.
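Since the guidance for 429, 500, 502, 503, and 504 all boils down to "sleep for a bit and retry", a client-side retry loop might look like this sketch. The `fetch` callable and the exponential backoff schedule are assumptions for illustration, not part of Tecken's API:

```python
import time

# Status codes the docs recommend retrying on.
RETRYABLE = {429, 500, 502, 503, 504}

def fetch_with_retries(fetch, max_attempts=5, base_delay=1.0):
    """Call `fetch()` until it returns a non-retryable status.

    `fetch` returns a (status, body) tuple. Delays between attempts grow
    exponentially: base_delay, 2*base_delay, 4*base_delay, ...
    Returns the last (status, body) seen if all attempts are retryable.
    """
    for attempt in range(max_attempts):
        status, body = fetch()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            time.sleep(base_delay * 2 ** attempt)
    return status, body
```

For a real download client, `fetch` would wrap the HTTP request to Tecken; a job that gives up after the last attempt should log the final status so failures can be reported.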
docs/download.rst
Outdated
:statuscode 429: your request has been rate-limited; sleep for a bit and retry
:statuscode 500: there's an error with the server; sleep for a bit and
    retry; if retrying doesn't work, then please file a bug report
:statuscode 502: sleep for a bit and retry
I checked the last week of logs for Tecken prod for download requests, and this list looks good to me (i.e. I couldn't find any other status codes to include).
The logs are from nginx, so they only include requests that make it to nginx. If the ELB can't reach the backend, it will reply with a 503 or 504, and these requests won't be in the logs. It's possible to enable ELB logs as well, but we never did that for a variety of reasons. In GCP, we'll also have load balancer logs, so we'll be able to see all the requests.
    retry; if retrying doesn't work, then please file a bug report
:statuscode 502: sleep for a bit and retry
:statuscode 504: the request is taking too long to complete; sleep for a bit
    and retry
I also found in the Tecken prod logs over the last week 29 occurrences (0.3% of all response codes) of a 499 response code. Interestingly, this is a non-standard code (not in MDN), but it appears to be a code used by nginx to indicate the client closed the connection before the server could send a response. I couldn't find this in nginx's docs, but I saw a bunch of articles about it with an organic web search.
That's not a response code clients will ever see, though. It's never actually sent by nginx, since the client already closed the connection. It's only recorded in the logs. We don't need to document this since it's not visible from the outside.
Update: I really appreciate all the comments and thoughts here. Thank you! I want to address the comments, but probably won't have time to do that until next week. I also think that while updating docs is important, the discussion this kicked off was invaluable and several issues were created to fix tests and other things. Actually updating the docs is less important than that work was. |
^^^ That updates the PR to factor in the comments from everyone. Does this match what we've got in AWS and what we're building in GCP now? Is it clear for users?
This improves the HTTP status code documentation, adds response headers where applicable, and adds a first-pass section on improving symbol upload success rates. This should help users of the download and upload APIs know what to do when they hit status codes that were previously undocumented. Specifying the response headers also lets us write system tests for them. The new section gives people writing symbol upload jobs a set of things they can look at to improve the likelihood those jobs finish successfully. We can hone it over time.
This fixes the location redirect download API docs by moving the separate request into a new section. This adds status code items for 502, 503, and 504 where missing.
@smarnach, @biancadanforth: ^^^ does that look ok?
LGTM! Had one question about one of the statements.
docs/upload.rst
Outdated
If you find your job is getting HTTP 429s or 504s frequently or it doesn't seem
like symbol uploads are being completed, try these tips:
Should 504s be 502s, 503s and 504s per the previous discussion and additions to the HTTP codes? IIUC the reasons for these codes could affect both the download and upload APIs.
^^^ That clarifies the situations where the tips are helpful.
Thank you!
For whoever reviews this, feel free to merge it if it's approved. Thank you!