Include Signed-urls in the External Tools Framework #7715

Closed
raprasad opened this issue Mar 23, 2021 · 18 comments · Fixed by #9001
Assignees
Labels
NIH OTA: 1.6.1 6 | 1.6.1 | Integrate with OpenDP tools to support differentially private statistical releases ... pm.GREI-d-1.6.1 NIH, yr1, aim6, task1: Integrate OpenDP tools on private statistical releases
Milestone

Comments

@raprasad
Contributor

raprasad commented Mar 23, 2021

In a March 16th meeting with the Dataverse team, improvements to the external tools framework were discussed. Specifically, the team described an external tools improvement to provide signed urls instead of passing a Dataverse general API token. In transitioning from Dataverse to an external tool, Dataverse would:

  • Dynamically generate signed URLs that grant limited-time access to specific API endpoints. The characteristics of these signed URLs will be defined in an updated external tools manifest/specification.
  • Pass these signed urls to the external tool via a POST request to:
    • Avoid having any form of authorization token appear in the browser history.
    • Minimize potential exposure and malicious use of the signed urls, in places including server logs. (Server log settings for the DPcreator should not expose POST data.)

These signed urls would have the following characteristics:

(a) Limited in scope and linked to a particular user

  • The signed urls would be limited in scope. Examples include:
    • A url to retrieve a Dataverse Datafile would be limited to a particular file. In addition, Dataverse would check that the user connected to the url still has permission for the operation.
    • Similarly, a url to retrieve Schema.org JSON-LD dataset info would be limited to a specific dataset and user permissions would be checked.
    • If it’s important to also support rich clients, the signed URL generated by Dataverse could use a registered mime-type. (@joshua-oss note)

(b) Limited in time

  • As specified in a prior manifest or equivalent, each signed url passed to the external tool would have to be used within a certain time window. Examples include:
    • Retrieving a file within 30 minutes or
    • Depositing DP release files within 48 hours.

(c) Encoded with a Dataverse signature

  • URL creation note: The URLs are always created in the context of an authenticated user session and the signature links to the users. (@joshua-oss note)
  • Each signed url would contain a cryptographic signing token that allows Dataverse to confirm that the URL hasn't been changed. Additional tokens (signed as part of the URL) identify:
    • The time the signing was done and
    • The user it was done for, allowing Dataverse to securely verify that the URL is being used within its proper (a) scope and (b) time window. Here, scope includes confirming that the user connected to the signed url still has permission to make the request specified by the URL.

(d) Involve Usage tracking

  • Each time the signed url is used, the Dataverse would log usage information including:

    • Timestamp. Datetime, timezone.
    • Identifying information. Identifying information related to the request source (such as IP address).
    • Success. Whether the request was successful or not. An example of an unsuccessful request is one where a retrieval failed due to a timeout or missing/corrupt/deleted data.
    • Validity. Whether the request was valid or invalid. Examples of invalid requests:
      • The signed-url timeframe has expired;
      • The user for whom the signed url was created no longer has permissions for the operation
      • etc.

This was not specifically addressed in the meeting, but in addition to the signed urls, the external tools framework should still define and pass along the data currently specified as “queryParameters” and described by the Reserved Words in the external tools documentation. Note: The exception is the “apiGeneralToken”, which would be replaced by the specialized urls.


@djbrooke
Contributor

djbrooke commented Mar 24, 2021

  • Relevant doc: https://docs.google.com/document/d/1rDhL2QBY2NhVqan3Mwp11BJ0O9naiFEhvnlbKVDQkBc/edit
  • The SP team has requested some changes to external tools (re: Globus). Both are part of the framework but this is related to how information is exchanged, and Globus changes are likely related to entry point and tool scope
  • In scope - tools will take a POST vs a GET
  • Note the logging section - we expect this will be covered by current logging, but we should check
  • We should generate a test scenario in case the OpenDP tool is not available. We should discuss with @raprasad to see if there's a way to test that we're missing. We can see the POST in the browser to see if it failed and this may be enough to test.

@djbrooke djbrooke added the Large label Mar 24, 2021
@qqmyers
Member

qqmyers commented Mar 24, 2021

Could potentially be approached as ~3 sub-tasks:

  • enable tools to be called via POST
  • enable sending signed URLs to tools
  • enable authentication/validation of signed API calls

Relevant code snippets: UrlSignerUtil.java has sign/validate methods, #6957 is an old PR where Dataverse calls javascript to do a POST in response to a button push, and changes in #7568 (which add a workflow invocationId as a short-lived, workflow-specific alternative to an APIKey) highlight one option for where you could validate the URLs (i.e. you'd fail to findAuthenticatedUser(), which is used across the API, if the URL was invalid).

@raprasad
Contributor Author

Condensed security notes from @joshua-oss:

Since this is authorizing access to resources, on a time-bound basis, it would be good to examine:

The authorization token generation process

  • The description says there is a private key, so that mitigates many potential attacks. The focus then is on the protection of the private key, including the signing process that needs to load the private key into memory. The degree of hardening here will depend heavily on the threat model (e.g. “adversary exploits an XSS attack to trick the signer into signing a fraudulent URI” might be out of scope).

How the system identifies resources

  • Based on the description, I’m also assuming that resource identification is done via URIs. It’s always good to make sure that the system is carefully handling name resolution, escaping, internationalization in a way that doesn’t allow minting of URIs that refer to something that’s already been minted.

How the system defines time to ensure that none of these can be spoofed or otherwise controverted

  • I’m guessing you already have an expiring token library in mind, so item (b) may be adopting best practices. But it’s always good to think through how the technique handles things like people messing with clocks.

@raprasad
Contributor Author

Desired functionality:

  • Will there be a way to test if a signed url is still valid w/o using it?

Use case from the DPcreator perspective:

  • User resumes an in-progress workflow.
    • The user originally logged in via Dataverse which generated new signed urls
    • However, this time the user logs in directly to the DPcreator and has access to the original signed urls held in dpCreator
  • Can we test that the signed urls are still valid w/o actually using them?
    • e.g. Can we see if the download file signed url is still valid w/o actually downloading the file?
    • Note: This is a proxy for checking whether the user still has permissions for working with the file
  • Is there a better way to check if a signed url is valid w/o exercising it or an existing alternative?
    • Some type of expiration date? (Although perms may have changed on the DV side)

{
   "signedUrls":[
      {
         "type":"fileDownload",
         "signedUrl":"https://some-signed-url-343234-234324",
         "expirationDate":"2021-05-12T13:40:06Z"
      },
      {
         "type":"metadataRetrievalJSONLD",
         "signedUrl":"https://some-signed-url-343234-234324",
         "expirationDate":"2021-05-12T13:40:06Z"
      }
   ]
}

@qqmyers
Member

qqmyers commented Apr 13, 2021

Here's a description of how the signedURLs I proposed work, which I think addresses the questions that have been asked (if I missed some, let me know).

Unless I made a typo, here's an example of a signedURL that should be validate-able against the apikey d4ef4fd5-891e-4ef5-9a1c-e05e3aff1d60 and a never-shared secret key that would be stored within Dataverse (79cc5072-a157-4111-b84e-2fe4a0c93638 in this case). I made this URL manually but followed the logic in the URLSignerUtil class I've mentioned:

https://demo.dataverse.org/api/access/datafile/1827764?format=original&gbrecs=true&until=2021-05-13T17:09:42.411&user=method=GET&token=52e7d0b918ebfd64a12ff03728b9788f664f994d38ca0b6f3b3a14c2f97c2af4db409ee1cc942992019abfae5b6611b28633bd5f72877649e323ea9252032f83
Looking at the params: format=original and gbrecs=true are part of the base URL I chose to sign. (This is an unrestricted file in a public dataset on the demo server, so the URL will work. That is not because demo is validating the signature (it doesn't), but because the extra params added for signing are ignored.)

  • The until= param lets you see the date the URL is valid until
  • The method param shows you this is a GET call
  • There is an optional user= param (not shown here) that would give you the user id. This would be used in the OpenDP case.

Given that these are all visible, the receiver (OpenDP) can inspect them and, for example, verify that the url is still valid and/or get the user id. (The validation mechanism, as discussed below, doesn't require keeping the date/time or user id secret and having them visible in the URL doesn't help anyone trying to create a malicious URL.)
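As a rough illustration of the receiver-side inspection described above, a tool could read the visible params before deciding whether to use a stored URL. This is only a sketch: the class and method names are hypothetical, it does no percent-decoding, and it assumes the tool compares the until= value against a clock in the same timezone as the Dataverse server. The authoritative check always happens in Dataverse when the URL is used.

```java
import java.net.URI;
import java.time.LocalDateTime;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper an external tool might use to inspect a signed URL. */
class SignedUrlInspector {

    /** Splits the raw query string into name/value pairs (sketch only, no decoding). */
    static Map<String, String> queryParams(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        String query = URI.create(url).getRawQuery();
        if (query == null) {
            return params;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    /** True if the until= timestamp is missing or earlier than the supplied time. */
    static boolean isExpired(String url, LocalDateTime now) {
        String until = queryParams(url).get("until");
        return until == null || now.isAfter(LocalDateTime.parse(until));
    }
}
```

A check like this only tells the tool that the time window has not passed. It says nothing about whether the user's permissions have changed on the Dataverse side.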

When the URL is used, the server (Dataverse in the OpenDP case) looks at the user= param, gets the apikey for that user and then creates a sha512 hash of the string
https://demo.dataverse.org/api/access/datafile/1827764?format=original&gbrecs=true&until=2021-05-13T17:09:42.411&user=method=GET&token=79cc5072-a157-4111-b84e-2fe4a0c93638d4ef4fd5-891e-4ef5-9a1c-e05e3aff1d60

where the last param is the concatenation of the never-shared secret Dataverse key and the api key of the listed user (also not shared with OpenDP via the proposed external tool api changes, but potentially available from other sources).

If, and only if, that sha512 hash matches the original token= param sent on the URL (as it should here if I manually created the URL correctly - you can try it at https://emn178.github.io/online-tools/sha512.html for example), the server (Dataverse) can be assured that the URL has not been changed since it was signed. If that's true, Dataverse can then check:

  • the date/time. Dataverse created the date/time initially and Dataverse is the one checking whether the URL has expired, so the check is done against the same (Dataverse machine) clock and deliberately shifting the receiver's clock doesn't help anyone cheat. If the use is after the time listed, it fails.
  • the method - if the param doesn't match the HTTP method actually used it fails.

Dataverse can then use the user= param (which has to match with the api key used to originally sign because that user id was also used by Dataverse to choose the right key for validation) to assign the user's privileges to this api call.

Any change to the URL results in the hash failing and Dataverse would then reject the call as not authorized.
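The sign-then-validate flow described above can be sketched roughly as follows. This is not the actual URLSignerUtil code: the class name, method names, and the apiKeyForUser lookup are assumptions made for illustration, and the parameter order simply follows the example URL in this comment.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.LocalDateTime;
import java.util.function.Function;

/** Illustrative sketch only; not the real URLSignerUtil implementation. */
class SignedUrlSketch {

    /** Hex-encoded SHA-512 digest of a UTF-8 string. */
    static String sha512Hex(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-512")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-512 not available", e);
        }
    }

    /** Appends until/user/method, then a token hashed over the URL, the secret, and the API key. */
    static String sign(String baseUrl, LocalDateTime until, String user, String method,
                       String secret, String apiKey) {
        String separator = baseUrl.contains("?") ? "&" : "?";
        String unsigned = baseUrl + separator + "until=" + until + "&user=" + user + "&method=" + method;
        return unsigned + "&token=" + sha512Hex(unsigned + "&token=" + secret + apiKey);
    }

    /** Returns the last value of a query parameter, or null if it is absent. */
    static String param(String url, String name) {
        String value = null;
        int q = url.indexOf('?');
        if (q >= 0) {
            for (String pair : url.substring(q + 1).split("&")) {
                if (pair.startsWith(name + "=")) {
                    value = pair.substring(name.length() + 1);
                }
            }
        }
        return value;
    }

    /** Recomputes the hash, then checks expiry against the server clock and the HTTP method. */
    static boolean isValid(String signedUrl, String actualMethod, String secret,
                           Function<String, String> apiKeyForUser, LocalDateTime serverNow) {
        int i = signedUrl.lastIndexOf("&token=");
        if (i < 0) {
            return false;
        }
        String unsigned = signedUrl.substring(0, i);
        String token = signedUrl.substring(i + "&token=".length());
        String user = param(unsigned, "user");
        String until = param(unsigned, "until");
        String apiKey = (user == null) ? null : apiKeyForUser.apply(user);
        if (apiKey == null || until == null) {
            return false;
        }
        String expected = sha512Hex(unsigned + "&token=" + secret + apiKey);
        // MessageDigest.isEqual avoids an early-exit string comparison.
        if (!MessageDigest.isEqual(expected.getBytes(StandardCharsets.UTF_8),
                token.getBytes(StandardCharsets.UTF_8))) {
            return false;
        }
        if (serverNow.isAfter(LocalDateTime.parse(until))) {
            return false;
        }
        return actualMethod.equals(param(unsigned, "method"));
    }
}
```

Note that nothing here needs a per-URL database record. Validation only needs the secret, the user's API key, and the server clock.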

If the user's permissions in Dataverse have been revoked since the signedURL was created, the call would still fail. (There isn't a way from the URL to know if that has happened, but one could potentially use a Dataverse admin api (with a credential supplied out-of-band) to query Dataverse about that.)

To create a valid signedURL outside Dataverse would require knowing the appropriate user's APIkey (possibly available from, for example, the user's browser unless we change Dataverse to not ever show/distribute API keys once signed URLs are implemented) AND the never-shared secret Dataverse key.

The latter (which doesn't exist yet since URLSigning has so far only been used for Dataverse requesting data from other services) could be stored on disk with access limited to the account running Dataverse, or, if there's concern that someone could still read it, this key could be stored in a service such as Vault (not recommending this - just an example if storing the key with restricted access on the Dataverse server is not secure enough. Given that a user who could get to a key stored that way could probably do other harmful things to Dataverse, just protecting the Dataverse server as well as possible might be the best option.)

To prevent someone from requesting many URLs to derive the secret Dataverse key, it would also make sense to regenerate that key periodically. That could be manual or automated.

(One aspect of this model with a time-limited signed URL, that wouldn't be true with, for example, a one-time use URL, is that there is no database record in Dataverse related to the signed URL. There could be logging of the creation of the signedURL, but validation of the URL doesn't require looking up any per-signed-URL database entry. This helps with scalability and makes things simpler, as Dataverse doesn't need an extra table or to worry about removing entries from that table if the URLs are never used, etc.)

@raprasad
Contributor Author

@qqmyers, thank you for the detailed explanation ^

@djbrooke
Contributor

  • enable tools to be called via POST could be rolled out separately, mainly for integrators (enable sending signed URLs to tools and enable authentication/validation of signed API calls will likely have to go together)
  • we should reach out to integrators and external tool makers about this and encourage them to update, but we should keep the old way for some period of time (we should revisit the autogeneration of tokens for users, but note that the tokens are generated but only a hash is sent)
  • to use this new mechanism we'd have to update the previewers (for ex.) to receive POST
  • consider the implications for installations that want to serve sensitive data - the old way may not be useful for sensitive data situations

@mreekie

mreekie commented May 11, 2022

sprint

  • (Len) Waiting on a conversation with Raman and the team
  • That conversation may lead to the development of a standalone tool that would mean that we would not have to store L3 or higher data on dataverse
  • Leave this in WIP for now.
  • Post meeting determine what's next for this.

@mreekie

mreekie commented May 25, 2022

We met today to talk about the big picture of integrating DP Creator with dataverse.

An important thing that came out of that discussion was that the initial integration is only at a demo level.

Bob explained today why the integration of the support for signed URLs into the External Tools framework in dataverse is not a small task. When we talked further with Raman it became clear that what is required in the short term is only for an MVP that will not be ingesting truly secure data from dataverse. The MVP demonstrates the user experience, operation of the DP Creator libraries, and interaction with dataverse for retrieving data and publishing results. It is not going into production at this stage.

Raman and Bob and some others will get together next Thursday to figure out a way forward in the short term to satisfy this initial MVP.

@qqmyers
Member

qqmyers commented Jun 2, 2022

Summary from discussions of the proposed addition of support for sending signed URLs within the external tools framework:

Add support for a new json object in the external tools registration mechanism (as described in the guides at https://guides.dataverse.org/en/latest/admin/external-tools.html and https://guides.dataverse.org/en/latest/api/external-tools.html#creating-an-external-tool-manifest) that would list the set of signed Urls for calls in the Dataverse API that the tool should receive. For each signed Url, the entry should include:

  • name: a descriptive name that the external tool will understand/use to retrieve the right Url for the right purpose
  • method: the Http method (i.e. GET, PUT, POST, DELETE) required for the given API call
  • url template: a parameterized url template, as discussed below, that provides Dataverse enough info to construct a specific API call for a given dataset/dataset version/datafile as appropriate
  • timeout: the number of minutes a Url should be valid for (up to a Dataverse defined maximum (~3000 minutes?))

Url Templates would be relative to the base site url and take advantage of the existing reserved words (defined at https://guides.dataverse.org/en/latest/api/external-tools.html#reserved-words) to indicate where Dataverse should replace a token with the value for the specific dataset/file for which the tool is being launched. For example:

"url template":"/api/datasets/{datasetId}/metadata"

would be signed with the specific dataset id and specific site, e.g.

"https://demo.dataverse.harvard.edu/api/datasets/1234/metadata"

The ExternalTool class would be extended with a new text column which would store the JSON, serialized as a string, for this new registration parameter. The parameter and column will have a name that indicates that these are requested/allowed uris/api calls.
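To make the URL template substitution described above more concrete, here is a minimal sketch of replacing reserved words in a template and prefixing the site URL before signing. The class name, the fill method, and the map of values are hypothetical and only stand in for whatever Dataverse would actually use.

```java
import java.util.Map;

/** Hypothetical sketch of expanding a URL template before it is signed. */
class UrlTemplateSketch {

    /**
     * Replaces each "{word}" in the template with its value and prefixes the site URL.
     * Fails loudly if a reserved word in the template has no value.
     */
    static String fill(String siteUrl, String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            result = result.replace("{" + entry.getKey() + "}", entry.getValue());
        }
        if (result.contains("{")) {
            throw new IllegalArgumentException("Unresolved reserved word in: " + result);
        }
        return siteUrl + result;
    }
}
```

For example, fill("https://demo.dataverse.harvard.edu", "/api/datasets/{datasetId}/metadata", Map.of("datasetId", "1234")) yields the URL shown in the example above, which would then be signed.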

In various PRs there are utility classes to read/write Json to/from a string. (Nominally postgres can store Json directly but we have not done this anywhere yet (we have several other columns that store json)).

For backward compatibility, if this new param does not exist, Dataverse can continue to send the apikey as it does now.

The ExternalToolHandler class has an existing method to retrieve/format the values for each of the reserved words (see

private String getQueryParam(String key, String value) {
    ReservedWord reservedWord = ReservedWord.fromString(value);
    switch (reservedWord) {
        case FILE_ID:
            // getDataFile is never null for file tools because of the constructor
            return key + "=" + getDataFile().getId();
        case FILE_PID:
            GlobalId filePid = getDataFile().getGlobalId();
            if (filePid != null) {
                return key + "=" + getDataFile().getGlobalId();
            }
            break;
        case SITE_URL:
            return key + "=" + SystemConfig.getDataverseSiteUrlStatic();
        case API_TOKEN:
            String apiTokenString = null;
            ApiToken theApiToken = getApiToken();
            if (theApiToken != null) {
                apiTokenString = theApiToken.getTokenString();
                return key + "=" + apiTokenString;
            }
            break;
        case DATASET_ID:
            return key + "=" + dataset.getId();
        case DATASET_PID:
            return key + "=" + dataset.getGlobalId().asString();
        case DATASET_VERSION:
            String versionString = null;
            if (fileMetadata != null) { // true for file case
                versionString = fileMetadata.getDatasetVersion().getFriendlyVersionNumber();
            } else { // Dataset case - return the latest visible version (unless/until the dataset case allows specifying a version)
                if (getApiToken() != null) {
                    versionString = dataset.getLatestVersion().getFriendlyVersionNumber();
                } else {
                    versionString = dataset.getLatestVersionForCopy().getFriendlyVersionNumber();
                }
            }
            if (("DRAFT").equals(versionString)) {
                versionString = ":draft"; // send the token needed in api calls that can be substituted for a numeric
                                          // version.
            }
            return key + "=" + versionString;
        case FILE_METADATA_ID:
            if (fileMetadata != null) { // true for file case
                return key + "=" + fileMetadata.getId();
            }
        case LOCALE_CODE:
            return key + "=" + getLocaleCode();
        default:
            break;
    }
    return null;
}
). That could be refactored for the new functionality, i.e. to retrieve the raw value for each reserved word, which could then be used either to generate query params (current functionality) or to do string replace operations that substitute the reserved words in the URL templates (for signing).
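One possible shape for that refactoring, as a sketch only: a single raw-value lookup with the existing key=value formatting layered on top. The class below is hypothetical; the map stands in for the dataset/file lookups that ExternalToolHandler performs, and the real method switches on the ReservedWord enum rather than on strings.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of separating value lookup from query-param formatting. */
class ReservedWordSketch {

    // Stand-in for the values the real handler reads from the dataset, file, and config.
    private final Map<String, String> rawValues;

    ReservedWordSketch(Map<String, String> rawValues) {
        this.rawValues = rawValues;
    }

    /** Raw value for a reserved word such as "{datasetId}", or empty if unavailable. */
    Optional<String> getRawValue(String reservedWord) {
        return Optional.ofNullable(rawValues.get(reservedWord));
    }

    /** Current behaviour: "key=value", or null when the reserved word has no value. */
    String getQueryParam(String key, String reservedWord) {
        return getRawValue(reservedWord).map(value -> key + "=" + value).orElse(null);
    }
}
```

The same getRawValue call could then feed the string replacement in URL templates, so both paths would share one source of values.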

Another task to complete this would be to add a flyway sql script that would add the newly defined column to existing databases, e.g.
ALTER TABLE externaltool ADD COLUMN allowedapicalls TEXT;

@qqmyers
Member

qqmyers commented Jun 2, 2022

W.r.t. #7715 (comment), here is where the three tasks stand:

  • Task 1 (sending a POST) is working (in https://github.com/IQSS/dataverse/tree/7715-signed-urls-for-external-tools).
  • Task 2 uses the design above and is tbd.
  • Task 3 was discussed ~May 6th. The plan is to extend the AbstractAPIBean.findAuthenticatedUserOrDie() method to call a getAuthenticatedUserFromSignedUrl() method that would look for the signing token and other signed URL params and, if the signature is valid according to the existing URLSignerUtil.isValidUrl() method, return the AuthenticatedUser for whom the URL was signed. (The individual API methods that call findAuthenticatedUserOrDie() then check that that specific user currently has the permissions required to perform the action.)

Nominally 3 could be checked by completing 2, but it should also be possible to a) use the URLSignerUtil manually to sign URLs for testing or b) add a (superuser-only?) createSignedUrl API call that would allow dynamic generation of signed URLs. The latter is probably useful in general and would be a good way to create an integration test (create a signed URL, then try to use it and verify that the signed URL API call worked).
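A rough sketch of how the lookup in task 3 might fall back from an API key to a signed URL is below. Everything in it is an assumption for illustration: the class name, returning a user identifier string instead of an AuthenticatedUser, and the two injected functions, where signatureIsValid merely stands in for a call such as URLSignerUtil.isValidUrl().

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical sketch of API-key auth with a signed-URL fallback. */
class SignedUrlAuthSketch {

    private final Function<String, Optional<String>> userForApiKey; // assumed API key lookup
    private final Predicate<String> signatureIsValid;               // stand-in for signature validation

    SignedUrlAuthSketch(Function<String, Optional<String>> userForApiKey,
                        Predicate<String> signatureIsValid) {
        this.userForApiKey = userForApiKey;
        this.signatureIsValid = signatureIsValid;
    }

    /** Uses the API key when one is supplied, otherwise tries the signed URL. */
    Optional<String> findUser(String apiKey, String requestUrl) {
        if (apiKey != null) {
            return userForApiKey.apply(apiKey);
        }
        return getUserFromSignedUrl(requestUrl);
    }

    /** Returns the user= value only when a token is present and the signature checks out. */
    Optional<String> getUserFromSignedUrl(String requestUrl) {
        if (!requestUrl.contains("token=") || !signatureIsValid.test(requestUrl)) {
            return Optional.empty();
        }
        int q = requestUrl.indexOf('?');
        for (String pair : requestUrl.substring(q + 1).split("&")) {
            if (pair.startsWith("user=")) {
                return Optional.of(pair.substring("user=".length()));
            }
        }
        return Optional.empty();
    }
}
```

As noted above, the individual API methods would still check that the returned user currently has the permissions needed for the action.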

@mreekie

mreekie commented Jun 3, 2022

Caught up with Bob today. He explained in a way that I could understand the concepts around how things work now with the token exchange and how they are going to work.

We agreed that it would be good to get the results in front of the other folks involved to make sure we all stay on the same page in terms of how this demo implementation is going to work.

I laid out the flow as I understand it in Miro here

  • Bob has laid out a path for the first few steps:
    • Get the exchange working correctly in the case of a POST and also a GET, no signed URL.
    • Get the signing part worked out.
    • Work with Raman and DP team on testing the result.

Next: Touch base again and go a level deeper.

  • The requirements as they are now are captured from the issue notes into: Requirements DV Support for DP Creator Demo
  • The high level flows are in Miro here
  • Let's review that document to confirm we're all aiming at the same thing.
  • Are there any places where there's room for misunderstanding? Let's identify them

@mreekie

mreekie commented Jun 8, 2022

Sprint:

  • pm.sprint.2022_05_25 ended WIP

@mreekie

mreekie commented Jun 28, 2022

Bob and Jim agreed to a quick meeting today.
They reviewed the "RFC - DV Support for DP Creator Demo".
Bob is making use of the doc as a reference.

There were some questions raised.
Sent an email to Prasad and Ellen.

If we need a meeting we can do one, but it sounds like we may be close enough that updating the doc is good. As Jim pointed out, once we get a bit deeper into the code we can iterate on anything further.

@mreekie

mreekie commented Sep 22, 2022

At today's meeting

  • Gustavo would like to see if we can resolve this without waiting on Raman.
    • Sounds like Raman's team is having trouble since a backend dev is gone.
  • There are a few small things for Bob to take care of.
  • Next step as soon as those are resolved is to create a PR
  • Jim suggested that we have the tools that will allow us to QA this.

@qqmyers qqmyers self-assigned this Sep 26, 2022
pdurbin added a commit to GlobalDataverseCommunityConsortium/dataverse that referenced this issue Nov 10, 2022
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Nov 16, 2022
qqmyers added a commit to GlobalDataverseCommunityConsortium/dataverse that referenced this issue Nov 17, 2022
feat: make API signing secret a JvmSetting IQSS#7715
@pdurbin pdurbin added this to the 5.13 milestone Dec 1, 2022
@mreekie mreekie added the NIH OTA: 1.6.1 6 | 1.6.1 | Integrate with OpenDP tools to support differentially private statistical releases ... label Dec 6, 2022
@mreekie mreekie added the pm.GREI-d-1.6.1 NIH, yr1, aim6, task1: Integrate OpenDP tools on private statistical releases label Mar 20, 2023

6 participants