Plumb HTTPS backend through CLI and adjust it to use full storage prefixes #1597

Merged
merged 6 commits into from
Oct 22, 2024

Conversation

jhiemstrawisc
Member

@jhiemstrawisc jhiemstrawisc commented Sep 24, 2024

This PR accomplishes two things -- first, it plumbs a few of the https backend components through the CLI, allowing users to serve minimal https origins without writing a configuration yaml.

Second, it adjusts the https backend to use storage prefixes. Consider the following configuration:

Origin:
  StorageType: "https"
  HttpServiceUrl: "https://data.lhncbc.nlm.nih.gov/public"
  Exports:
    - StoragePrefix: "/Tuberculosis-Chest-X-ray-Datasets/Montgomery-County-CXR-Set/MontgomerySet/CXR_png"
      FederationPrefix: "/my-prefix"

The adjustments cause a request for object /my-prefix/MCUCXR_0005_0.png to be converted to a libCurl request in the Origin for https://data.lhncbc.nlm.nih.gov/public/Tuberculosis-Chest-X-ray-Datasets/Montgomery-County-CXR-Set/MontgomerySet/CXR_png/MCUCXR_0005_0.png

While we don't yet support multiple exports for the https backend, I think this configuration gives us the most flexibility for when we do. It lets us set three things: the service URL (the base URL for every request to the origin), a storage prefix (so each namespace can carve out some section of the data hosted at the service URL), and the usual federation prefix mapping (i.e. strip the federation prefix from the object path and append the remainder to the service URL + storage prefix).
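
For anyone reasoning about the mapping, here is a minimal Python sketch of the path arithmetic described above. It is only an illustration and not the origin's actual implementation; the function name `backend_url` and its parameters are hypothetical, and it assumes prefixes start with a leading slash.

```python
def backend_url(service_url, storage_prefix, federation_prefix, object_path):
    """Illustrative only: map a federation object path to the upstream HTTPS URL.

    Hypothetical helper, not part of Pelican. Assumes prefixes begin with "/".
    """
    # "/my-prefix/" -> "/my-prefix"; a bare "/" becomes the empty string.
    fed = federation_prefix.rstrip("/")
    if not (object_path == fed or object_path.startswith(fed + "/")):
        raise ValueError(f"{object_path!r} is not under {federation_prefix!r}")

    # Whatever follows the federation prefix, e.g. "/MCUCXR_0005_0.png".
    remainder = object_path[len(fed):]

    # Join service URL and storage prefix without doubling slashes.
    base = service_url.rstrip("/")
    storage = storage_prefix.strip("/")
    if storage:
        base = base + "/" + storage
    return base + remainder


url = backend_url(
    "https://data.lhncbc.nlm.nih.gov/public",
    "/Tuberculosis-Chest-X-ray-Datasets/Montgomery-County-CXR-Set/MontgomerySet/CXR_png",
    "/my-prefix",
    "/my-prefix/MCUCXR_0005_0.png",
)
print(url)
```

With the configuration from this description, the printed URL should match the libCurl request URL quoted above.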

While this is a slight change in how an http origin would be configured, I'm not worried about backwards compat because a) we never documented the limitations of this, even within our own codebase and b) this makes the https backend conformant with the way we use storage prefixes in every other backend.

Closes #1279

@jhiemstrawisc jhiemstrawisc added enhancement New feature or request origin Issue relating to the origin component labels Sep 24, 2024
@jhiemstrawisc jhiemstrawisc added this to the v7.11.0 milestone Sep 24, 2024
@jhiemstrawisc jhiemstrawisc added the critical High priority for next release label Sep 24, 2024
@turetske
Collaborator

@jhiemstrawisc How do I test this?

@jhiemstrawisc
Member Author

The config in the PR description should be functional. You can use it for the basis of a test if you're setting things up with yaml-based configuration. Otherwise, you should be able to start the origin based on the new CLI args alone.

@turetske
Collaborator

> The config in the PR description should be functional. You can use it for the basis of a test if you're setting things up with yaml-based configuration. Otherwise, you should be able to start the origin based on the new CLI args alone.

So, I should create an https based origin and see if it works? Just want to make sure that's the goal.

@jhiemstrawisc
Member Author

The two things to check are:

  1. The HTTPS backend can be started from the CLI (without needing to touch the Origin block of your yaml config)
  2. The new storage prefix mechanism behaves as described above.

…fixes

This PR accomplishes two things -- first, it plumbs a few of the https backend
components through the CLI, allowing users to serve minimal https origins without
writing a configuration yaml.

Second, it adjusts the https backend to use storage prefixes. Consider the following
configuration:
```
Origin:
  StorageType: "https"
  HttpServiceUrl: "https://data.lhncbc.nlm.nih.gov/public"
  Exports:
    - StoragePrefix: "/Tuberculosis-Chest-X-ray-Datasets/Montgomery-County-CXR-Set/MontgomerySet/CXR_png"
      FederationPrefix: "/my-prefix"
```

The adjustments cause a request for object `/my-prefix/MCUCXR_0005_0.png` to be converted to
a libCurl request in the Origin for `https://data.lhncbc.nlm.nih.gov/public/Tuberculosis-Chest-X-ray-Datasets/Montgomery-County-CXR-Set/MontgomerySet/CXR_png/MCUCXR_0005_0.png`

While we don't yet support multiple exports for the https backend, I think this configuration
affords us maximum flexibility for the future where we do by allowing us to set the service URL
(every request to the origin uses this as the base URL), a storage prefix (each namespace can then
carve out some section of data hosted by the service url), and the typical federation prefix mapping
(i.e. strip the fed prefix from the object and tack that value to the end of the service url + storage prefix).

While this is a slight change in how an http origin would be configured, I'm not worried about backwards compat
because a) we never documented the limitations of this, even within our own codebase and b) this makes the
https backend conformant with the way we use storage prefixes in every other backend.

These started failing after changes to the https backend. After inspection,
I'm surprised they'd been passing at all, and am not convinced they were
actually testing what we wanted them to. At the very least, they were grabbing
config from my system installation of Pelican (NAUGHTY), and this correctly
isolates them.
@jhiemstrawisc
Member Author

With the latest commit, the three modes of configuration to check are:

  1. Configuration via CLI (with a bit of config in pelican.yaml).
Set Origin.EnablePublicReads in your config yaml. Then, at the command line you can configure the rest of the origin with:
pelican origin serve --mode https --http-service-url "https://data.lhncbc.nlm.nih.gov/public/Tuberculosis-Chest-X-ray-Datasets" -v "/Montgomery-County-CXR-Set/MontgomerySet/CXR_png:/my-prefix"
  2. Configuration via Origin exports block:
Origin:
  StorageType: "https"
  HttpServiceUrl: "https://data.lhncbc.nlm.nih.gov/public/Tuberculosis-Chest-X-ray-Datasets"
  Exports:
    - StoragePrefix: "/Montgomery-County-CXR-Set/MontgomerySet/CXR_png"
      FederationPrefix: "/my-prefix"
      Capabilities: ["PublicReads", "DirectReads", "Listings"]
  3. Configuration via top-level origin config:
Origin:
  StorageType: "https"
  HttpServiceUrl: "https://data.lhncbc.nlm.nih.gov/public/Tuberculosis-Chest-X-ray-Datasets"
  StoragePrefix: "/Montgomery-County-CXR-Set/MontgomerySet/CXR_png"
  FederationPrefix: "/my-prefix"
  EnablePublicReads: true
  <any other caps if you actually care to toggle them>

Each of these configurations should allow you to get the public object /my-prefix/MCUCXR_0005_0.png.
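
To make the equivalence of the three modes concrete, here is a small Python sketch that reduces each one to the same (service URL, storage prefix, federation prefix) triple. It is an illustration for testers, not how Pelican itself parses configuration: `parse_volume` and `effective_export` are hypothetical names, the dictionaries stand in for already-parsed YAML, and the `storage:federation` reading of `-v` is inferred from the examples above.

```python
def parse_volume(spec):
    """Split a -v style "storage_prefix:federation_prefix" string (assumed format)."""
    storage, sep, federation = spec.partition(":")
    if not sep or not storage or not federation:
        raise ValueError(f"expected 'storage:federation', got {spec!r}")
    return storage, federation


def effective_export(origin):
    """Return (service_url, storage_prefix, federation_prefix) for either YAML style."""
    exports = origin.get("Exports")
    if exports:
        # The https backend does not yet support multiple exports.
        if len(exports) != 1:
            raise ValueError("expected exactly one export for the https backend")
        entry = exports[0]
    else:
        # Top-level style: the prefixes live directly on the Origin block.
        entry = origin
    return origin["HttpServiceUrl"], entry["StoragePrefix"], entry["FederationPrefix"]


SERVICE = "https://data.lhncbc.nlm.nih.gov/public/Tuberculosis-Chest-X-ray-Datasets"

cli = (SERVICE,) + parse_volume("/Montgomery-County-CXR-Set/MontgomerySet/CXR_png:/my-prefix")

exports_block = effective_export({
    "StorageType": "https",
    "HttpServiceUrl": SERVICE,
    "Exports": [{
        "StoragePrefix": "/Montgomery-County-CXR-Set/MontgomerySet/CXR_png",
        "FederationPrefix": "/my-prefix",
        "Capabilities": ["PublicReads", "DirectReads", "Listings"],
    }],
})

top_level = effective_export({
    "StorageType": "https",
    "HttpServiceUrl": SERVICE,
    "StoragePrefix": "/Montgomery-County-CXR-Set/MontgomerySet/CXR_png",
    "FederationPrefix": "/my-prefix",
    "EnablePublicReads": True,
})

print(cli == exports_block == top_level)
```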

@jhiemstrawisc
Member Author

One thing to note is that I'm punting on the general cleanup of server_utils.GetOriginExports(). Issue #1286 already requests an overhaul of the function, but I don't want to touch all of the other backend configurations in this unrelated PR.

Collaborator

@turetske turetske left a comment


@jhiemstrawisc When I test this with export PELICAN_ORIGIN_STORAGEPREFIX=/Tuberculosis-Chest-X-ray-Datasets/Montgomery-County-CXR-Set/MontgomerySet/CXR_png/ it is still failing due to the trailing /.
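
As a rough illustration of the trailing-slash problem, the Python sketch below shows one way a storage prefix could be normalized before it is joined to the service URL. This is an assumption about a possible approach, not a description of what this PR or Pelican does; `normalize_prefix` is a hypothetical helper.

```python
def normalize_prefix(prefix):
    """Illustrative only: collapse repeated slashes and drop any trailing slash.

    Hypothetical helper, not part of Pelican. Always returns a path that
    starts with "/" and, apart from the root itself, never ends with one.
    """
    parts = [p for p in prefix.split("/") if p]
    return "/" + "/".join(parts)


raw = "/Tuberculosis-Chest-X-ray-Datasets/Montgomery-County-CXR-Set/MontgomerySet/CXR_png/"
print(normalize_prefix(raw))
```

With a normalization along these lines, the prefix with and without the trailing `/` would compare equal.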

@turetske turetske merged commit 9a38446 into PelicanPlatform:main Oct 22, 2024
19 of 20 checks passed
Labels
critical High priority for next release enhancement New feature or request origin Issue relating to the origin component
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTTP backend not plumbed through to CLI
2 participants