Document best-practices for IP address stability and accessing remote datasets #63

Closed
JILPulvino opened this issue Mar 12, 2021 · 14 comments

Comments

@JILPulvino

We use Azure Blob Storage and allow access only from approved IP addresses. Is there an IP range that we should approve? I've approved the IP address I see for the hub when I log in, but it appears that other users get a different IP address.

@yuvipanda
Member

Aah, each node has its own external IP address, and they rotate fairly frequently. Can you create a new service principal that can be used to access blob storage instead? Even if we did have hub IP stability, allowlisting by IP would mean folks on other hubs (since they all share a cluster) could access your blob storage!

@choldgraf
Member

@yuvipanda this seems like a useful use case to think about for documentation purposes. I think it is more "power-user" than most people in 2i2c.org/pilot, but it also seems like it'd be pretty common. Where do we imagine the Q/A for this kind of question would go? Maybe this is the kind of thing that would be useful to have a Discourse forum for?

@choldgraf changed the title from "IP Address stability" to "Document best-practices for IP address stability and accessing remote datasets" on Mar 13, 2021
@JILPulvino
Author

This sounds like a good suggestion; I just need to figure out how to 'create a new service principal that can be used to access blob storage'. Any suggestions for that?

@yuvipanda
Member

For Azure, I'm guessing we'll need to create a service principal that has access to blob storage. https://docs.microsoft.com/en-us/azure/storage/common/storage-auth seems to be the article that has the overview.

Can you tell us how you're allowing access to a particular IP? Maybe we can dig in from there.

@JILPulvino
Author

Within the container, I'm just granting specific IPs access in the networking blade. This was the easiest thing for me to manage with a high level of certainty, without having to deal with a virtual network or Azure Active Directory. But I'm happy to change it around.

@yuvipanda
Member

Yeah, unfortunately you might need to do the Azure Active Directory thing now.

I guess longer term this won't be as much of a problem since we can run the hub on Azure.

@JILPulvino
Author

JILPulvino commented Mar 23, 2021

I've started playing around with this, and Microsoft seems to recommend granting users access to the blob using AAD rather than creating a broader service principal with access. In our case, this would mean creating guest user AAD accounts in Azure and then using the azure-storage-blob Python client library (https://azuresdkdocs.blob.core.windows.net/$web/python/azure-storage-blob/12.8.0/index.html) to provide users access.

What I've tried so far is:

  1. Registered an app in Azure and assigned it a role as a storage blob contributor on the storage container I'd like it to access
  2. Created a client secret for the app in Azure
  3. Stored AZURE_TENANT_ID, AZURE_CLIENT_ID, and AZURE_CLIENT_SECRET as environment variables on the hub using %set_env, following the Azure SDK docs (https://github.com/Azure/azure-sdk-for-python/tree/azure-storage-blob_12.8.0/sdk/identity/azure-identity#async-credentials)
  4. Accessed the blob storage using DefaultAzureCredential()
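
The steps above can be sketched roughly as follows. This is a minimal sketch, not the exact code used in this thread: the helper names (`missing_credential_vars`, `account_url`, `list_blob_names`) are hypothetical, and the URL pattern assumes a storage account in the public Azure cloud. The Azure SDK imports are deferred so the checks can run even where `azure-identity` and `azure-storage-blob` are not installed.

```python
import os

# Variables that EnvironmentCredential reads for a service principal with a secret.
REQUIRED_VARS = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")


def missing_credential_vars(environ=os.environ):
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not environ.get(name, "").strip()]


def account_url(account_name):
    """Build the blob endpoint URL (assumes the public Azure cloud)."""
    return f"https://{account_name}.blob.core.windows.net"


def list_blob_names(account_name, container_name):
    """List blob names in one container using DefaultAzureCredential."""
    # Imported lazily: requires azure-identity and azure-storage-blob.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    missing = missing_credential_vars()
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")

    credential = DefaultAzureCredential()
    service = BlobServiceClient(account_url(account_name), credential=credential)
    container = service.get_container_client(container_name)
    return [blob.name for blob in container.list_blobs()]
```

A call such as `list_blob_names("mystorageaccount", "partner-a")` would then list the container's blobs; both names here are placeholders.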

But I still get an authentication error:
ClientAuthenticationError: DefaultAzureCredential failed to retrieve a token from the included credentials. Attempted credentials: EnvironmentCredential: Authentication failed: AADSTS7000215: Invalid client secret is provided.
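
One possible way to narrow down an "Invalid client secret" error is to inspect the stored value before handing it to the SDK. The sketch below is a guess at likely culprits, not a diagnosis of this particular case: it assumes the value may have picked up stray quotes or whitespace when it was set, or that the secret's ID (a GUID) was copied instead of the secret's value. The function name `secret_warnings` is hypothetical.

```python
import uuid


def secret_warnings(secret):
    """Return human-readable warnings about a suspicious client secret string."""
    if not secret:
        return ["secret is empty or unset"]

    warnings = []
    if secret != secret.strip():
        warnings.append("secret has leading or trailing whitespace")

    stripped = secret.strip()
    # Quotes typed around the value may have been stored as part of it.
    if len(stripped) >= 2 and stripped[0] == stripped[-1] and stripped[0] in "'\"":
        warnings.append("secret is wrapped in quote characters")

    # A value that parses as a GUID may be the secret's ID rather than its value.
    try:
        uuid.UUID(stripped)
    except ValueError:
        pass
    else:
        warnings.append("secret looks like a GUID; this may be the secret ID, not its value")
    return warnings
```

In a notebook, `secret_warnings(os.environ.get("AZURE_CLIENT_SECRET", ""))` (after `import os`) returns an empty list when none of these patterns match, which does not by itself prove the secret is valid.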

I have also tried setting the AZURE_USERNAME and AZURE_PASSWORD variables to authenticate as myself to the app, but that also produced an authentication error. This method is a little more confusing, as I'm not quite sure what I would need to set up in Azure for it to work properly (e.g. assigning users to the app) or how it could be set up on the JupyterHub for multiple users.

@JILPulvino
Author

As of right now, I'm just going to store a copy of the data in the shared folder on the hub and wait until we move to Azure itself to figure out how to integrate with the blob storage.

@yuvipanda I'd love to get your thoughts on how best to store our data. Currently, we have partners that provide us with confidential data. Each partner's dataset should be accessible only to certain hub users, which is why I set up separate Azure blob storage spaces for each partner and constrain access to only certain users. Our hub users need to be able to read and write to those storage containers. We also need at least one storage container for data that can be made public, again with hub users able to read and write to it.

Perhaps it would be better to set up the storage all on the hub and in some way set up different folders for different partners, with differential access for hub users?

@yuvipanda
Member

@JILPulvino ah, I see re: confidentiality. How big is the data? Managing it in the hub in the medium run is probably the best option, and we should figure out how to do that securely.

In the meantime, putting each dataset in an Azure blob container and granting access is a great way to go. Can you give me access to your Azure cloud project so I can try to debug per-user permissions?

@JILPulvino
Author

Nope, not a problem at all.

@JILPulvino
Author

@yuvipanda I've added you as an admin to our Azure portal and you should be receiving an invitation.

@choldgraf
Member

Hey all - I'm not sure if there's more to work on here or not. Let's scope this issue to resolving @JILPulvino's immediate need, and I've opened up 2i2c-org/infrastructure#372 to track us documenting best-practices for object data storage access in general. @JILPulvino - what's left to do here? Is this actionable on 2i2c's end?

@JILPulvino
Author

In our use case, we have a number of different confidential datasets, and our hub users should have differing access to them. Because of this, we can't just store all of them in the current shared folder system on the hub, as then all users would have access to all data. Ideally we could either (1) isolate data on the hub to specific users or (2) store the data on our Azure services in isolated containers and then give users access to specific containers. Prior to the hub, I had been using (2) and granting access based on an individual's IP address. With users on the hub, that option no longer works, so I need to provision access via Azure Active Directory instead. I think that is where we landed as the best solution: I just need to set up Azure Active Directory access rather than IP-based access.
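
Option (2) with Azure Active Directory might look something like the sketch below from a notebook on the hub. Everything named here is an assumption for illustration: the partner and container names, the helper functions, and the idea that each user has credentials that `DefaultAzureCredential` can find. Whether a given user may actually read or write a container is decided by role assignments in Azure, not by this code.

```python
# Hypothetical mapping from partner name to container name; not from this thread.
PARTNER_CONTAINERS = {
    "partner-a": "partner-a-data",
    "partner-b": "partner-b-data",
    "public": "public-data",
}


def container_for(partner):
    """Look up the container for a partner, failing loudly on unknown names."""
    try:
        return PARTNER_CONTAINERS[partner]
    except KeyError:
        raise ValueError(f"No container configured for partner {partner!r}") from None


def open_partner_container(account_name, partner):
    """Return a ContainerClient for one partner's container.

    Azure decides whether the signed-in identity may read or write;
    this function only builds the client.
    """
    # Imported lazily: requires azure-identity and azure-storage-blob.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import ContainerClient

    return ContainerClient(
        account_url=f"https://{account_name}.blob.core.windows.net",
        container_name=container_for(partner),
        credential=DefaultAzureCredential(),
    )


def write_then_read(container, blob_name, data):
    """Upload bytes to a blob, then download them again as a round-trip check."""
    container.upload_blob(blob_name, data, overwrite=True)
    return container.download_blob(blob_name).readall()
```

A user without a role on a partner's container would be expected to get an authorization error from the service when calling `write_then_read`, which is the isolation behaviour described above.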

As for what 2i2c can do, I think it'd be just laying out what you think best practices are.

@JILPulvino
Author

We've ultimately switched to authenticating to our Azure containers and storage accounts using AAD, so this is no longer an issue.
