Merge pull request #2809 from yuvipanda/anonymize-username
Enable username anonymization feature for cosmicds & disable persistent storage
yuvipanda authored Jul 14, 2023
2 parents 8f6fe06 + eb9e289 commit f6c70bd
Showing 7 changed files with 267 additions and 8 deletions.
27 changes: 27 additions & 0 deletions config/clusters/2i2c/cosmicds.values.yaml
@@ -7,6 +7,11 @@ jupyterhub:
hosts:
- cosmicds.2i2c.cloud
custom:
singleuserAdmin:
# Disable shared-readwrite mount, as we don't have any mounts enabled
extraVolumeMounts: []
auth:
anonymizeUsername: true
2i2c:
add_staff_user_ids_to_admin_users: true
add_staff_user_ids_of_type: "github"
@@ -25,6 +30,15 @@ jupyterhub:
funded_by:
name: Cosmic DS, Harvard
url: https://www.cosmicds.cfa.harvard.edu/
singleuser:
# No persistent storage should be kept to reduce any potential data
# retention & privacy issues.
# Ref https://github.com/2i2c-org/infrastructure/issues/2128#issuecomment-1635107926
initContainers: []
storage:
# Must set jupyterhub.custom.singleuserAdmin.extraVolumeMounts to [] as well
type: none
extraVolumeMounts: []
hub:
config:
Authenticator:
@@ -40,7 +54,20 @@ jupyterhub:
oauth_callback_url: https://cosmicds.2i2c.cloud/hub/oauth_callback
shown_idps:
- http://github.com/login/oauth/authorize
# Temporarily enable google & microsoft auth, to be reverted
# after July 21 2023
# Ref https://github.com/2i2c-org/infrastructure/issues/2128#issuecomment-1633128941
- http://google.com/accounts/o8/id
- http://login.microsoftonline.com/common/oauth2/v2.0/authorize
allowed_idps:
# The username claim here is used to do *authorization*, for both
# admin use and any allow listing we want to do.
http://github.com/login/oauth/authorize:
username_derivation:
username_claim: "preferred_username"
http://google.com/accounts/o8/id:
username_derivation:
username_claim: "email"
http://login.microsoftonline.com/common/oauth2/v2.0/authorize:
username_derivation:
username_claim: "email"
14 changes: 8 additions & 6 deletions config/clusters/2i2c/enc-cosmicds.secret.values.yaml
@@ -1,20 +1,22 @@
jupyterhub:
hub:
extraEnv:
USERNAME_DERIVATION_PEPPER: ENC[AES256_GCM,data:AXMgK5+Gzojb2j65OA87X0BEs4JxjZr1jkemLRNhMp5pxdvt40YyMEO2fyhx+nfNwrvMf9DV6z9Hl7l2XEsbTQ==,iv:B9EBaac4VFOkU+nzxcm7LUzqJRJ4N38o4BbsZqxW69Q=,tag:cERvKEh9TfxyoDyzzVrb1Q==,type:str]
config:
CILogonOAuthenticator:
client_id: ENC[AES256_GCM,data:MraS1vlTRjh01IaYE2VzzhPxe/MrfBrcqwir/+rXhF4pUVHTQy70SmIqD8D05BEguWMI,iv:xQ2MX5HMSZ7XwwqFVXk0i4IsUsSh9k7sCJcVZsJTmFo=,tag:b9mmtDTC7hHQTGQqcWtavw==,type:str]
client_secret: ENC[AES256_GCM,data:9i3q29/DbnNxDtdvB3QU4q0zPM1gtMoavMtUKHiMYeBXt8WqnkbMCnM49QjUpKgibxOHPcF1dQdT+/+2iAtJt/+xH+jrkLPR0bHCpESxo7bV6EziysY=,iv:JpWhPM178K2i2hF/91WBNBOxQ5oRiDcP+oufLcSOlrM=,tag:IL7kGXdCL5OZD9+OQy+3IA==,type:str]
client_id: ENC[AES256_GCM,data:bG3o+fgg9Un2YzPxgBisMSpZ7mu0NnF3u7fbHFf3TErMRSNZdbKYne9InZfntOSt9CFP,iv:K+L30QknUdByGTTgs/Xo7xdWAl3ceUjyRi09PFFq0Us=,tag:GRtn+QCLaBjToj4Wk42kEg==,type:str]
client_secret: ENC[AES256_GCM,data:G/KW5Lha8iIEbK5nslLGyoM5wT/dokcGZN7cHd15QVdAExghviq1AtvqOIBDGH8O9QPU2YjJOfN+qCE3AGOAtuNFXHWevp5cwhrhD9aOsNqR1Qo4RrQ=,iv:losa6BtAz9dT4m7E3ANejNAJQt3ttKUdu21A00iErHU=,tag:QzBd1z3Qdc0DRDFsWTSZ5g==,type:str]
sops:
kms: []
gcp_kms:
- resource_id: projects/two-eye-two-see/locations/global/keyRings/sops-keys/cryptoKeys/similar-hubs
created_at: "2023-03-30T12:24:48Z"
enc: CiUA4OM7eB49Qcys54jXfx2nwZ4rVwmHk2cfFmaglB8liibktmyOEkkALQgViM91gM0fb6p0q8SAHnE9f+aTaeKNWOuTrlD+aHtK5WdpBEj3JY78DqWxw6QjYNjuyhSdmJPFCrBmeXCj01HmpQSrEn3d
created_at: "2023-07-12T23:25:29Z"
enc: CiUA4OM7eBXY4SwY7PqtCXIN7/imKKYLzyV95f+5/DHHbo5HCTWcEkkAyiwFHIvAzIjhSO3eQzb0EL6A6KW3ZEu2ZUp7s3PN8gOAy6HIcPbTLQrVnFlbMSAxDT8WShiikQDXHbyjFAKVzqo/KKMuEt9o
azure_kv: []
hc_vault: []
age: []
lastmodified: "2023-03-30T12:24:48Z"
mac: ENC[AES256_GCM,data:3Tu+97pphFNHaMdATYeP6fVsZ3Wywrj/EkxuEAe7VLDjedIwfdWju6hvVu6cP71//qZP2f1zouBDN5T5N7HpXN8+lA5Pd1AIUdGb/W9xGLyo4z0oAP1refamKkgdrr9wMVF/8X8bEzVi9jV37cipU3pwRO43QZrkrTZUpQoCWW8=,iv:6GWaVZJaPDhL4pgUHPmCntYGHrFe4dFtnU91b6yCN8E=,tag:I07UWDmqNTpm7tnkPFscOg==,type:str]
lastmodified: "2023-07-12T23:25:29Z"
mac: ENC[AES256_GCM,data:JOXoIfDhqmOeSCml+lTi56Sd5/8R36scNKTNP5Z3l1CqUJlm9Z3SBOitFWJ3uG78kNKnrYmtqc9ygbeKv3odBfx+IBRWwcp3kg+IGTxYcVzB1Ys+J0j2S0GzI7kPuEBYPuIuXxD1aAuJsolhyAjbS1S7ZqknFiyz+JCiqMMCLAY=,iv:C69JScxiP/XKWiUlu7AtMkf+s/EGnXKwmS8PrptDzZs=,tag:zxIWjO3Alas/uj5cJZqkbg==,type:str]
pgp: []
unencrypted_suffix: _unencrypted
version: 3.7.3
142 changes: 142 additions & 0 deletions docs/howto/features/anonymized-usernames.md
@@ -0,0 +1,142 @@
# Anonymize usernames with CILogon

By default, we use a human-readable identifier, like an email or username,
when logging a user in. This is familiar to most users, makes support easier,
and is usually not a problem. However, in some specific cases, we might need to
reduce the amount of Personally Identifiable Information ([PII](https://en.wikipedia.org/wiki/Personal_data))
stored in the system, and so emails or usernames may not be viable. We offer
a way to anonymize usernames if absolutely needed.

## Do you really need this?

Usernames will now look like gibberish (`j77eci3dubsngx76m4ssd5sh6z3uiomgdrkvjymoxruigcystuva`, not `yuvipanda`), which confuses many users. It almost looks like
a password! When users need support, they *must* log in to
the hub, go to the Control Panel, and find their username. If they can't
do this, there is no real way to identify them for support. This is
a **major** disadvantage, so seriously consider whether you really need this.

Moving away from this after the fact is also close to impossible, and changing
authentication providers (away from CILogon) is also impossible, without losing
all existing users' home directories.

So, very seriously consider if you need this before you enable it!

## How does it work?

Given that anonymizing usernames comes at a cost, it **must** provide us some
useful privacy guarantees to be worth it. Those are:

1. We must not possess, *stored at rest*, a user identifier that is Personally
   Identifiable. This includes usernames, emails, as well as opaque integer
   user ids from external services. For example, if we were to use the numerical
   user id from GitHub (via the `oidc` attribute from CILogon), it could
   be trivially mapped back to the username [via
   BigQuery](https://www.gharchive.org/#bigquery) or any number of
   public data sources. The numerical id is also shared with any other
   website using GitHub (or Google, etc) for login, so any data
   breaches in those websites can also be used to de-anonymize our
   users.

2. We live in a world where user data leaks are a fact of life, and you can buy
   tons of user identifiers for pretty cheap. This may also happen to *us*, and
   we may unintentionally leak data too! So users should still be hard to
   de-anonymize when the attacker has in their possession the following:

   1. List of user identifiers (emails, usernames, numeric user ids,
      etc) from *other data breaches*.
   2. List of user identifiers *from us*.
   3. Any secret keys we use to hash these identifiers.

   (1) is out of our control, and we must be prepared for (2) and (3), so
   we truly do not store any personal information, rather than just make it
   slightly more complicated for our users to be deanonymized.

To provide these guarantees, we create the anonymized username in the following
way:

1. Take a combination of user attributes returned to us by CILogon. Right now,
   we pick the following:

   1. A CILogon specific opaque identifier (`sub`)
   2. The identifier for the 3rd party OAuth provider chosen by the user (Google,
      GitHub, Microsoft, etc) (`idp`)
   3. The internal opaque identifier used by *that* third party (`oidc`)

2. Combine it with a per-hub *secret salt* (or [pepper](https://en.wikipedia.org/wiki/Pepper_(cryptography)))

3. Using the pepper as the key, hash the user attributes with the
   [blake2b](https://en.wikipedia.org/wiki/BLAKE_(hash_function)) keyed
   hashing algorithm, to produce a 32-byte digest. The base32 encoding of this
   digest (lowercased, with padding stripped) is used as the username.
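
The three steps above can be sketched as a standalone function. This is a minimal sketch, not the hub's configuration: the name `derive_username`, the sample attribute values, and the sample pepper are made up for illustration. It assumes the attributes are serialized as sorted-key JSON and the digest is base32-encoded, lowercased, and stripped of padding, which is what the `06-salted-username` config added in this commit does.

```python
import base64
import hashlib
import json


def derive_username(sub: str, idp: str, oidc: str, pepper: bytes) -> str:
    """Derive an anonymized username from CILogon attributes and a secret pepper."""
    # Serialize the attributes as JSON with sorted keys, so the input is stable
    # over time and no string delimiter has to be chosen.
    user_key = json.dumps(
        {"sub": sub, "idp": idp, "oidc": oidc}, sort_keys=True
    ).encode("utf-8")
    # Keyed blake2b with a 32-byte digest; the pepper is the key.
    digest = hashlib.blake2b(user_key, key=pepper, digest_size=32).digest()
    # base32, lowercased, with the '=' padding stripped.
    return base64.b32encode(digest).decode("utf-8").lower().rstrip("=")


# Hypothetical values, for illustration only.
example_pepper = bytes.fromhex("00" * 32)
example_username = derive_username(
    sub="http://cilogon.org/serverA/users/12345",
    idp="http://github.com/login/oauth/authorize",
    oidc="1234567",
    pepper=example_pepper,
)
print(len(example_username))  # 52
```

The same inputs and pepper always give the same username, which is what lets a returning user land in the same account without the hub storing who they are.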

To now deanonymize these usernames, an attacker must have:

1. Breached user information from CILogon (for the CILogon identifier)
2. Breached user information from the third party auth provider
3. Access to the secret value we used as pepper for the hub in question
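
To see why all three are required, consider a hypothetical weaker scheme that hashes only the third-party numeric id. The sketch below assumes the attacker has obtained the pepper and that ids are small integers that can be enumerated; `weak_username` and `brute_force` are invented for illustration and are not something the hub uses.

```python
import base64
import hashlib
from typing import Optional


def weak_username(oidc: str, pepper: bytes) -> str:
    """Hypothetical weaker scheme: keyed hash of the third-party id only."""
    digest = hashlib.blake2b(oidc.encode("utf-8"), key=pepper, digest_size=32).digest()
    return base64.b32encode(digest).decode("utf-8").lower().rstrip("=")


def brute_force(target: str, pepper: bytes, max_id: int) -> Optional[str]:
    """Recover the id behind `target` by trying every id from 1 to max_id."""
    for candidate in range(1, max_id + 1):
        if weak_username(str(candidate), pepper) == target:
            return str(candidate)
    return None


leaked_pepper = bytes.fromhex("11" * 32)  # made-up pepper the attacker obtained
stored_name = weak_username("4242", leaked_pepper)
recovered_id = brute_force(stored_name, leaked_pepper, max_id=10_000)
print(recovered_id)  # 4242
```

With the scheme described above, the attacker would additionally need the matching CILogon `sub` for each candidate before a guess could be checked.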

This provides a reasonable level of protection. And given that we (2i2c)
don't have access to (1) or (2) at rest, even we can't deanonymize these
usernames in the future if we turn evil.

Note, however, that we still *receive* personally identifiable information
*when the user logs in*, and we might use this for authorization purposes too.
All this *only* removes our liability for storing this data *at rest*, not
while in transit.

## Limitations

Currently, only hubs with the following configuration are supported:

1. Must use CILogon for authentication
2. Only non-institutional CILogon authentication providers are supported. This
means Google, GitHub and Microsoft. Institutional authentication providers may
be supported in the future.
3. All existing user accounts will become invalid.

## Authorization

We still want to be able to do **authorization** based on user attributes,
such as domain of email or explicit list of allowed emails. This is still
possible, as the anonymization step is done in `post_auth_hook` which runs
*after* the initial authorization steps are done. So `admin_users` and
`allowed_users` can be used in the same manner as used in CILogon without
anonymization turned on.
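
The ordering can be illustrated with a toy model. This is not JupyterHub's implementation: `toy_login`, `anonymize`, and the sample names are invented for illustration, and the real login flow has more steps. It only shows that the allow-list check sees the original name, while the name that ends up stored is whatever the hook returns.

```python
import hashlib


def anonymize(authenticator, handler, auth_model):
    """Stand-in hook: replaces the name with a hash (not the real derivation)."""
    auth_model = dict(auth_model)
    auth_model["name"] = hashlib.blake2b(
        auth_model["name"].encode("utf-8"), digest_size=16
    ).hexdigest()
    return auth_model


def toy_login(auth_model, allowed_users, post_auth_hook):
    """Toy login flow: authorize on the original name, then run the hook."""
    if auth_model["name"] not in allowed_users:
        return None  # rejected before the hook ever runs
    return post_auth_hook(None, None, auth_model)


allowed = {"teacher@example.org"}
accepted = toy_login({"name": "teacher@example.org"}, allowed, anonymize)
rejected = toy_login({"name": "stranger@example.org"}, allowed, anonymize)
print(accepted["name"] != "teacher@example.org", rejected)  # True None
```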

## Enabling anonymization

1. In the unencrypted config yaml file for the hub, add the following:

   ```yaml
   jupyterhub:
     custom:
       auth:
         anonymizeUsername: true
   ```

   Nest this inside a `basehub` key if this is for a daskhub.

2. Generate a secret key to be used for deriving the username, by running
   `openssl rand -hex 32` on your commandline.

3. In the corresponding encrypted values file for the hub, add the following
   config:

   ```yaml
   jupyterhub:
     hub:
       extraEnv:
         USERNAME_DERIVATION_PEPPER: <value-generated-in-step-2>
   ```

   Nest this inside a `basehub` key if this is for a daskhub.

This should be enough configuration changes for this to work.
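
If `openssl` is not available, a pepper of the same shape can be generated with Python's standard library. This is a sketch: `check_pepper` is a hypothetical helper, not part of the repository. It only verifies that a value is valid hex and fits blake2b's key size limit before you paste it into the encrypted file.

```python
import hashlib
import secrets


def check_pepper(hex_value: str) -> bytes:
    """Decode a hex pepper and verify that blake2b can accept it as a key."""
    pepper = bytes.fromhex(hex_value)  # raises ValueError on non-hex input
    if not 0 < len(pepper) <= hashlib.blake2b.MAX_KEY_SIZE:
        raise ValueError(
            f"pepper must be 1 to {hashlib.blake2b.MAX_KEY_SIZE} bytes, got {len(pepper)}"
        )
    return pepper


# Same shape as `openssl rand -hex 32`: 32 random bytes as 64 hex characters.
new_pepper_hex = secrets.token_hex(32)
print(len(new_pepper_hex), len(check_pepper(new_pepper_hex)))  # 64 32
```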

## Longer term solution

This is a common problem, and the longer term solution is to get CILogon to
implement [Pairwise Pseudonymous Identifiers](https://curity.io/resources/learn/ppid-intro/).

## Credit

Thanks to the `#infosec` channel on the [hangops slack](https://signup.hangops.com/)
for their help in thinking this through.
8 changes: 6 additions & 2 deletions docs/howto/features/ephemeral.md
@@ -93,13 +93,17 @@ provide persistent storage. So we turn it all off.

```yaml
jupyterhub:
custom:
singleuserAdmin:
# Turn off trying to mount shared-readwrite folder for admins
extraVolumeMounts: []
singleuser:
initContainers: null
initContainers: []
storage:
# No persistent storage should be kept to reduce any potential data
# retention & privacy issues.
type: none
extraVolumeMounts: null
extraVolumeMounts: []
```
1 change: 1 addition & 0 deletions docs/howto/features/index.md
@@ -11,6 +11,7 @@ See the sections below for more details.
cloud-access.md
gpu.md
github.md
anonymized-usernames.md
```

```{toctree}
12 changes: 12 additions & 0 deletions helm-charts/basehub/values.schema.yaml
@@ -250,7 +250,19 @@ properties:
- singleuser
- cloudResources
- 2i2c
- auth
properties:
auth:
type: object
additionalProperties: false
required:
- anonymizeUsername
properties:
anonymizeUsername:
type: boolean
description: |
Anonymize the username passed in, so no PII is stored at
rest as part of the system.
singleuser:
type: object
additionalProperties: false
71 changes: 71 additions & 0 deletions helm-charts/basehub/values.yaml
@@ -53,6 +53,8 @@ global: {}

jupyterhub:
custom:
auth:
anonymizeUsername: false
singleuser:
extraPVCs: []
singleuserAdmin:
@@ -757,3 +759,72 @@ jupyterhub:
# Customize list of profiles dynamically, rather than override options form.
# This is more secure, as users can't override the options available to them via the hub API
c.KubeSpawner.profile_list = custom_profile_list
06-salted-username: |
  # Allow anonymizing username to not store *any* PII
  import json
  import os
  import base64
  import hashlib
  from z2jh import get_config

  USERNAME_DERIVATION_PEPPER = bytes.fromhex(os.environ['USERNAME_DERIVATION_PEPPER'])

  def salt_username(authenticator, handler, auth_model):
      # Combine parts of user info with different provenances to eliminate
      # possible deanonym attacks when things get leaked.
      # FIXME: Provide useful error message when using an auth provider that
      # doesn't give us 'oidc'
      # FIXME: Raise error if this is attempted to be used with anything other than CILogon
      cilogon_user = auth_model['auth_state']['cilogon_user']
      user_key_parts = {
          # Opaque ID from CILogon
          "sub": cilogon_user['sub'],
          # Combined together, opaque ID from upstream IDP (GitHub, Google, etc)
          "idp": cilogon_user['idp'],
          "oidc": cilogon_user['oidc']
      }
      # Use JSON here, so we don't have to deal with picking a string
      # delimiter that will not appear in any of the parts.
      # keys are sorted to ensure stable output over time
      user_key = json.dumps(user_key_parts, sort_keys=True).encode('utf-8')
      # The cryptographic choices made here are:
      # - Use blake2, because it's fairly modern
      # - Set blake2 to output 32 bytes as output, which is good enough for our use case
      # - Use base32 encoding, as it will produce maximum of 56 characters
      #   for 32 bytes output by blake2. We have 63 character username
      #   limits in many parts of our code (particularly, in usernames
      #   being part of labels in kubernetes pods), so this helps
      # - Convert everything to lowercase, as base64.b32encode produces
      #   all uppercase characters by default. Our usernames are preferably
      #   lowercase, as uppercase characters must be encoded for kubernetes'
      #   sake
      # - strip the = padding provided by base64.b32encode. This is present
      #   primarily to be able to determine length of the original byte
      #   sequence accurately. We don't care about that here. Also = is
      #   encoded in kubernetes and puts us over the 63 char limit.
      # - Use blake2 here explicitly as a keyed hash, rather than use
      #   hmac. This is the canonical way to do this, and helps make it
      #   clearer that we want it to output 32byte hashes. We could have
      #   used a 16byte hash here for shorter usernames, but it is unclear
      #   what that does to the security properties. So better safe than
      #   sorry, and stick to 32bytes (rather than the default 64)
      digested_user_key = base64.b32encode(hashlib.blake2b(
          user_key,
          key=USERNAME_DERIVATION_PEPPER,
          digest_size=32
      ).digest()).decode('utf-8').lower().rstrip("=")
      # Replace the default name with our digested name, thus
      # discarding the default name
      auth_model["name"] = digested_user_key
      return auth_model

  if get_config('custom.auth.anonymizeUsername', False):
      # https://jupyterhub.readthedocs.io/en/stable/reference/api/auth.html#jupyterhub.auth.Authenticator.post_auth_hook
      c.Authenticator.post_auth_hook = salt_username
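
The length reasoning in the comments above can be checked directly. This is a small sketch that uses random bytes in place of a real digest. The variable names are illustrative, and the 63-character figure is the limit the comments refer to.

```python
import base64
import os

digest = os.urandom(32)  # stands in for a 32-byte blake2b digest

padded = base64.b32encode(digest).decode("utf-8")
stripped = padded.lower().rstrip("=")
hex_form = digest.hex()

print(len(padded), len(stripped), len(hex_form))  # 56 52 64
```

Base32 with padding stripped comes to 52 characters and fits under the 63-character limit, while a hex encoding of the same digest would need 64 characters and would not.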
