Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix JWK expiry time parsing from header on startup behaviour (fix #3655) #3779

Merged

Conversation

ecthiender
Copy link
Member

@ecthiender ecthiender commented Jan 27, 2020

Description

Whenever there is an error on fetching or parsing the JWK, on startup, graphql-engine exits with an error.
This is not the correct behaviour when Cache-Control or Expires header value cannot be parsed on startup. This PR fixes this.

So with this PR -

  1. if it cannot fetch and parse the JWK, graphql-engine will exit on startup
  2. if it is unable to parse the header values, it will not exit and start, but will not start the refresh thread in background

Affected components

  • Server

Related Issues

#3655

Solution and Design

The possible errors while fetching and parsing JWKs are more explicit now. And the mkJwtCtx function now explicitly pattern matches over the errors and ignores if it is only expiry parsing error (JFEExpiryParseError). But throws error if it is any other error (consequently, that results in exit during startup).

Due to the above error ADT changes, it changes the log schema (internal log) of jwk-refresh-log (more details are noted below). Should we make sure to not change the schema? This can be done, but not sure if the extra effort is required.

Side note: This was the (incorrect) behaviour on graphql-engine previously as well. But after we introduced parsing cache-control header, this issue became apparent because Cache-Control header can have other directives than just specifying the expiry (or the caching) time. Previously, for the Expires header, the only valid value it could have was expiry of the response, hence parse errors on it was never encountered (in my hypothesis).

Steps to test and verify

  1. Run graphql-engine with JWK URL which returns a Cache-Control header which does not have max-age or s-maxage directive.
  2. Observe, graphql-engine should start (and not exit) and log that it could not parse expiry time from header.

Limitations, known bugs & workarounds

Server checklist

Catalog upgrade

Does this PR change Hasura Catalog version?

  • No

Metadata

Does this PR add a new Metadata feature?

  • No

GraphQL

  • No new GraphQL schema is generated

Breaking changes

  • There are breaking changes:

    1. Logging

      • Log JSON schema has changed
        Of the internal log type jwk-refresh-log. The old and new schemas are noted below. Do we need to ensure compatibility for this?
      • Log type names have changed

The breaking change on jwk-refresh-log schema

On http exception

Old
{
  "type": "jwk-refresh-log",
  "timestamp": "2020-01-27T12:04:25.878+0530",
  "level": "critical",
  "detail": {
    "error": "Error fetching JWK: ConnectionFailure Network.Socket.connect: <socket: 28>: does not exist (Connection refused)",
    "http_error": {
      "response": null,
      "url": "http://localhost:5001/jwk-cache-control-header",
      "http_exception": {
        "message": "ConnectionFailure Network.Socket.connect: <socket: 28>: does not exist (Connection refused)",
        "request": {
          "proxy": null,
          "secure": false,
          "path": "/jwk-cache-control-header",
          "responseTimeout": "ResponseTimeoutDefault",
          "method": "GET",
          "host": "localhost",
          "requestVersion": "HTTP/1.1",
          "redirectCount": "10",
          "port": "5001"
        }
      },
      "status_code": null
    }
  }
}
New
{
  "type": "jwk-refresh-log",
  "timestamp": "2020-01-27T11:39:06.322+0530",
  "level": "critical",
  "detail": {
    "error": {
      "http_exception": {
        "type": "http_exception",
        "message": "ConnectionFailure Network.Socket.connect: <socket: 27>: does not exist (Connection refused)"
      }
    },
    "message": null
  }
}

On HTTP error (non-200 response)

Old
{
  "type": "jwk-refresh-log",
  "timestamp": "2020-01-27T12:06:27.183+0530",
  "level": "critical",
  "detail": {
    "error": "Non-200 response on fetching JWK from: http://localhost:5001/jwk-cache-control-header",
    "http_error": {
      "response": "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">
<title>404 Not Found</title>
<h1>Not Found</h1>
<p>The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.</p>",
      "url": "http://localhost:5001/jwk-cache-control-header",
      "http_exception": null,
      "status_code": 404
    }
  }
}
New
{
  "type": "jwk-refresh-log",
  "timestamp": "2020-01-27T11:41:06.327+0530",
  "level": "critical",
  "detail": {
    "error": {
      "response": "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">
<title>404 Not Found</title>
<h1>Not Found</h1>
<p>The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.</p>",
      "url": "http://localhost:5001/jwk-expires-header",
      "status_code": 400
    },
    "message": null
  }
}

On JWK parsing error

Old
{
  "type": "jwk-refresh-log",
  "timestamp": "2020-01-27T12:08:27.549+0530",
  "level": "critical",
  "detail": {
    "error": "Error parsing JWK from url (http://localhost:5001/jwk-cache-control-header): Error in $: key \"keys\" not present",
    "http_error": null
  }
}
New
{
  "type": "jwk-refresh-log",
  "timestamp": "2020-01-27T11:43:07.550+0530",
  "level": "critical",
  "detail": {
    "error": {
      "error": "Error in $: key \"keys\" not present",
      "message": "Error parsing JWK from url: http://localhost:5001/jwk-expires-header"
    },
    "message": null
  }
}

On header parsing errors

Old
{
  "type": "jwk-refresh-log",
  "timestamp": "2020-01-27T12:10:04.194+0530",
  "level": "critical",
  "detail": {
    "error": "Failed parsing Cache-Control header from JWK response. Could not find max-age or s-maxage",
    "http_error": null
  }
}
New
{
  "type": "jwk-refresh-log",
  "timestamp": "2020-01-27T11:44:17.388+0530",
  "level": "info",
  "detail": {
    "error": {
      "error": "could not find max-age/s-maxage",
      "message": "Failed parsing Cache-Control header from JWK response. Could not find max-age or s-maxage"
    },
    "message": null
  }
}

Just info logs:

Old
{
  "type": "jwk-refresh-log",
  "timestamp": "2020-01-27T12:10:04.194+0530",
  "level": "info",
  "detail": {
    "http_error": null,
    "message": "retrying again in 60 secs"
  }
}
New
{
  "type": "jwk-refresh-log",
  "timestamp": "2020-01-27T16:09:50.943+0530",
  "level": "info",
  "detail": {
    "error": null,
    "message": "refreshing JWK from endpoint: http://localhost:5001/jwk-cache-control-header"
  }
}

- fix when header parsing fails on startup, do not exit
- change qualified import of aeson in Auth.JWT from A to J to be consistent with the rest of the
  codebase
  - not sure why this was added, but I think it is not required
@hasura-bot
Copy link
Contributor

Review app for commit 18b1778 deployed to Heroku: https://hge-ci-pull-3779.herokuapp.com
Docker image for server: hasura/graphql-engine:pull3779-18b1778b

@ecthiender ecthiender added c/server Related to server s/wip Status: This issue is a work in progress labels Jan 27, 2020
@ecthiender ecthiender changed the title fix jwk header parsing startup behaviour fix jwk expiry time parsing from header on startup behaviour (fix #3655) Jan 27, 2020
@hasura-bot
Copy link
Contributor

Review app for commit 1dfd928 deployed to Heroku: https://hge-ci-pull-3779.herokuapp.com
Docker image for server: hasura/graphql-engine:pull3779-1dfd9282

@ecthiender ecthiender changed the title fix jwk expiry time parsing from header on startup behaviour (fix #3655) fix JWK expiry time parsing from header on startup behaviour (fix #3655) Jan 28, 2020
@ecthiender ecthiender force-pushed the fix-3655-jwk-cache-control-regression-2 branch from b6c2070 to 32539bf Compare January 28, 2020 14:49
@hasura-bot
Copy link
Contributor

Review app for commit 32539bf deployed to Heroku: https://hge-ci-pull-3779.herokuapp.com
Docker image for server: hasura/graphql-engine:pull3779-32539bfe

@ecthiender ecthiender marked this pull request as ready for review January 28, 2020 15:26
@0x777
Copy link
Member

0x777 commented Jan 30, 2020

@ecthiender

This is not the correct behaviour when Cache-Control or Expires header value cannot be parsed on startup.

When you say 'cannot be parsed', does it mean that the provided values for Cache-Control/Expires are not as per the HTTP spec? Or that they don't have values for some keys that we expect?

@ecthiender
Copy link
Member Author

ecthiender commented Jan 30, 2020

When you say 'cannot be parsed', does it mean that the provided values for Cache-Control/Expires are not as per the HTTP spec? Or that they don't have values for some keys that we expect?

@0x777 the headers can have other directives, and nothing related to max-age for example. So our parse won't be successful.

Or that they don't have values for some keys that we expect?

Yes, basically this. For example Cache-Control header might just have no-cache and nothing related to max-age.

@0x777
Copy link
Member

0x777 commented Jan 30, 2020

What is the behaviour when the JWK server responds with cache-control: no-cache during 1. startup and 2. during some refresh cycle?

@ecthiender
Copy link
Member Author

ecthiender commented Jan 30, 2020

What is the behaviour when the JWK server responds with cache-control: no-cache during 1. startup and 2. during some refresh cycle?

During startup it will log that it could not parse any max age from the header and will not start any background refresh thread.

From the above PR description:

if it is unable to parse the header values, it will not exit and start, but will not start the refresh thread in background

In the case, when graphql-engine has started the background refresh thread, and in one of those cycles the JWK server responded with a header where we could not parse the expiry time, it will log and default to 1 min of refresh cycle.

@0x777
Copy link
Member

0x777 commented Jan 30, 2020

In the case, when graphql-engine has started the background refresh thread, and in one of those cycles the JWK server responded with a header where we could not parse the expiry time, it will log and default to 1 min of refresh cycle.

It feels a little inconsistent (continuing to refresh in 1min intervals when the server's caching strategy clearly changed), but I guess it is quite uncommon for a server to change its caching strategy for us to worry about these edge cases. Can you update the docs clearly documenting the behaviour when a cache-control header is present?

@ecthiender
Copy link
Member Author

ecthiender commented Jan 30, 2020

It feels a little inconsistent (continuing to refresh in 1min intervals when the server's caching strategy clearly changed), but I guess it is quite uncommon for a server to change its caching strategy for us to worry about these edge cases.

Yeah, I was not sure what our strategy should be in this scenario. It did not make sense to cancel the refresh thread either, as it could be because of temporary errors from the server? (e.g. the actual JWK server is down and temp responses come from gateway/reverse proxy)

Can you update the docs clearly documenting the behaviour when a cache-control header is present?

👍

@ecthiender
Copy link
Member Author

ecthiender commented Jan 30, 2020

continuing to refresh in 1min intervals when the server's caching strategy clearly changed

Just to clarify, this will happen when we cannot parse any expiry time from the response headers [1]. If any subsequent response returns a valid expiry time, it will use that to refresh accordingly.

[1] and presumably we could parse a expiry time before, hence the refresh thread is running. If we could not parse any expiry time in the beginning, the refresh thread is not started.

I will document these scenarios in detail.

@0x777 0x777 requested a review from jberryman January 30, 2020 17:09
@jberryman
Copy link
Collaborator

It did not make sense to cancel the refresh thread either, as it could be because of temporary errors from the server? (e.g. the actual JWK server is down and temp responses come from gateway/reverse proxy)

Shouldn't this same logic apply during startup (some transient error causing the parse to fail)? Would it not be better to have the refresh thread running in all cases?

^ This is my one concern. If it's super easy to undo the Aeson import alias change that would be good to do in my opinion, but I'll leave that up to you!

@netlify
Copy link

netlify bot commented Feb 5, 2020

Deploy preview for hasura-docs ready!

Built with commit 5da4021

https://deploy-preview-3779--hasura-docs.netlify.com

@hasura-bot
Copy link
Contributor

Review app for commit d360920 deployed to Heroku: https://hge-ci-pull-3779.herokuapp.com
Docker image for server: hasura/graphql-engine:pull3779-d360920f

@ecthiender ecthiender requested a review from rikinsk as a code owner February 5, 2020 06:01
@hasura-bot
Copy link
Contributor

Review app for commit 5da4021 deployed to Heroku: https://hge-ci-pull-3779.herokuapp.com
Docker image for server: hasura/graphql-engine:pull3779-5da40215

@0x777
Copy link
Member

0x777 commented Feb 5, 2020

Shouldn't this same logic apply during startup (some transient error causing the parse to fail)? Would it not be better to have the refresh thread running in all cases?

I was thinking the same, but it also does seem wrong to keep refreshing a JWK which never changes. Our current 'trying to infer' the key's behaviour from the server's response headers can only go so far, which I think is not bad, covers typical cases but as you pointed out there are some edge cases. I think a better option would be to have a flag like --jwk-refresh-interval <interval> which will force us to create a refresh thread regardless of the server's response headers. We can add this when people ask for it!

@ecthiender
Copy link
Member Author

@rikinsk Oops I didn't realize the console files changed. Probably because of the pre-commit hook.

@0x777 0x777 merged commit c6e34ba into hasura:master Feb 5, 2020
@hasura-bot
Copy link
Contributor

Review app https://hge-ci-pull-3779.herokuapp.com is deleted

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/server Related to server s/wip Status: This issue is a work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants