Pkg protocol: basic anonymous opt-out telemetry #1544

Merged (3 commits) on Dec 12, 2019

StefanKarpinski
Member

@StefanKarpinski StefanKarpinski commented Dec 12, 2019

The only information this provides over what we can get just by clients connecting is:

  • Client UUID: randomly generated, totally anonymous UUID which allows correlating requests from the same Julia install over time.
  • Project path hash: secure hash of the project path, which allows correlating requests from this client that are associated with the same project.

Regarding privacy: as long as the secret salt value isn't leaked, it's impossible to work out what the project path was, even by brute force, since you don't know what the salt value is. Even if the salt value is leaked, you'd still have to reverse a secure hash. The other purpose of the secret salt is so that if the same project path is used by different Julia installs, they won't have the same hash.

Telemetry files are per-server, which prevents a bad actor from correlating information across servers to learn something additional about the user (e.g. correlating a client UUID from an authenticated server to an anonymous one to de-anonymize the user). In order to opt out of telemetry entirely, just put telemetry = false in the appropriate telemetry file. It's also possible to just opt out of sending project path hashes by putting project_hash = false in the telemetry file.
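
As a rough illustration of the salt idea described above, here is a Python sketch. It is not the code in this PR: the choice of SHA-256, the way salt and path are joined, the function names, and the server names are all assumptions made for the example.

```python
import hashlib
import secrets
import uuid


def new_telemetry_identity():
    """Create the values kept in one server's telemetry file (hypothetical layout)."""
    return {
        "client_uuid": str(uuid.uuid4()),      # random, derived from no user data
        "secret_salt": secrets.token_hex(16),  # stays on the client, never sent
    }


def project_hash(secret_salt, project_path):
    """Hash the project path together with the secret salt.

    SHA-256 and the NUL separator are assumptions of this sketch.
    """
    data = (secret_salt + "\0" + project_path).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


# One independent identity per server, so values cannot be correlated across servers.
identities = {
    server: new_telemetry_identity()
    for server in ("pkg.example.org", "internal.example.com")  # hypothetical servers
}

path = "/home/alice/projects/demo"
hash_public = project_hash(identities["pkg.example.org"]["secret_salt"], path)
hash_internal = project_hash(identities["internal.example.com"]["secret_salt"], path)

# Same path, different salts: the two servers see unrelated hashes.
print(hash_public != hash_internal)
```

Without the salt, anyone could hash a list of likely paths and compare; with it, a guess cannot be checked unless the salt is known.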

@codecov

codecov bot commented Dec 12, 2019

Codecov Report

Merging #1544 into master will decrease coverage by 0.81%.
The diff coverage is 13.84%.

@@            Coverage Diff            @@
##           master   #1544      +/-   ##
=========================================
- Coverage   85.82%     85%   -0.82%     
=========================================
  Files          24      24              
  Lines        5277    5336      +59     
=========================================
+ Hits         4529    4536       +7     
- Misses        748     800      +52
Impacted Files Coverage Δ
src/PlatformEngines.jl 56.47% <13.84%> (-7.2%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e66a75f...90b8482.

@00vareladavid
Contributor

Can you add some docs about this as well?

@StefanKarpinski
Member Author

Will do.

@DilumAluthge
Member

DilumAluthge commented Dec 12, 2019

Will it be possible to use these data to tell package authors some statistics about their packages? For example, it would be great if we could tell package authors how many times per month their package was downloaded.

@StefanKarpinski
Member Author

Yes, that's exactly why we want these stats. It will also allow us to tell how many Julia installs there are and how many of them are long-lived (presumably actual people) versus ephemeral (presumably VMs). I'm also going to add some telemetry about the OS and Julia version.
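
To make the long-lived versus ephemeral distinction concrete, here is a small Python sketch of a server-side summary. The record shape (client UUID plus request date) and the 7-day cutoff are assumptions chosen for illustration; nothing in this PR fixes either.

```python
from datetime import date, timedelta


def summarize_installs(requests, min_lifetime=timedelta(days=7)):
    """Count distinct installs and split them by how long each one was seen.

    `requests` is an iterable of (client_uuid, day) pairs (hypothetical shape).
    An install counts as long-lived if its first and last requests are at
    least `min_lifetime` apart; the 7-day default is an arbitrary assumption.
    """
    first_seen, last_seen = {}, {}
    for client_uuid, day in requests:
        first_seen[client_uuid] = min(day, first_seen.get(client_uuid, day))
        last_seen[client_uuid] = max(day, last_seen.get(client_uuid, day))
    long_lived = sum(
        1 for u in first_seen if last_seen[u] - first_seen[u] >= min_lifetime
    )
    total = len(first_seen)
    return {"installs": total, "long_lived": long_lived, "ephemeral": total - long_lived}


log = [
    ("uuid-a", date(2019, 12, 1)),
    ("uuid-a", date(2019, 12, 20)),  # seen over 19 days: long-lived
    ("uuid-b", date(2019, 12, 5)),   # seen once: ephemeral
]
print(summarize_installs(log))
```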

@StefanKarpinski
Member Author

To give a little documentation by example, here are the telemetry headers that are currently sent:

Julia-Version: 1.4.0-DEV.577
Julia-Commit: 1432c5a08531aab153953848e5e8015faed2bb8d
Julia-System: x86_64-apple-darwin14-libgfortran5
Julia-Client-UUID: d955f7eb-e27e-4c85-8eb3-31c9bbb82270
Julia-Salt-Hash: 45bf87cb1abbdf8dd14c163e4209700459b879b1
Julia-Project-Hash: b0d97ef509ef57fa6582989d7a8cc1beb5c467d3

Details and why we want this info:

  • Version: for obvious reasons
  • Commit: somewhat redundant, but can help identify problems with specific commits
  • System: crucial to help us know how popular Julia is across OSes
  • Client UUID: allow correlating the same system across time
  • Salt hash: allows telling if project hashes are comparable or not
  • Project hash: allows correlating operations on the same project

Opting out (per ~/.julia/servers/telemetry.toml file):

  • To opt out of telemetry entirely, set telemetry = false
  • To opt out of client-specific telemetry, set client_uuid = false
  • To opt out of project hash telemetry, set secret_salt = false.
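
A Python sketch of how those three switches could gate the headers listed above. The `settings` dict stands in for the parsed telemetry file, and the function name and the exact gating rules are assumptions (in particular, that both hash headers are dropped when either client_uuid or secret_salt is false); the real logic lives in src/PlatformEngines.jl.

```python
def telemetry_headers(settings, system_info, salt_hash, project_hash):
    """Return the headers to send, honoring the opt-out switches.

    `settings` mimics a parsed telemetry file; a key set to False disables
    the corresponding telemetry (assumed semantics for this sketch).
    """
    if settings.get("telemetry") is False:
        return {}  # opted out entirely: send nothing
    headers = dict(system_info)  # version and system info only
    client_uuid = settings.get("client_uuid", False)
    if client_uuid is False:
        return headers  # no client-specific values at all
    headers["Julia-Client-UUID"] = client_uuid
    if settings.get("secret_salt", False) is not False:
        headers["Julia-Salt-Hash"] = salt_hash
        headers["Julia-Project-Hash"] = project_hash
    return headers


system_info = {
    "Julia-Version": "1.4.0-DEV.577",
    "Julia-System": "x86_64-apple-darwin14-libgfortran5",
}
settings = {
    "client_uuid": "d955f7eb-e27e-4c85-8eb3-31c9bbb82270",
    "secret_salt": False,  # opt out of project hash telemetry only
}
headers = telemetry_headers(
    settings,
    system_info,
    salt_hash="45bf87cb1abbdf8dd14c163e4209700459b879b1",
    project_hash="b0d97ef509ef57fa6582989d7a8cc1beb5c467d3",
)
print(sorted(headers))
```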

@Nosferican
Contributor

Is there a way to pass some kind of user agent, such as CI, user, or something like that?

@StefanKarpinski
Member Author

Yes, we should standardize a way to do that. Probably an environment variable to set like export JULIA_CI=true or something like that. Needs a bit of design. Maybe a generic way of passing something through to telemetry. Or, if there are common variables we know are set on common CI systems, we could potentially look for those and record their presence in telemetry.

@KristofferC
Member

I think these

  • Client UUID: allow correlating the same system across time
  • Salt hash: allows telling if project hashes are comparable or not
  • Project hash: allows correlating operations on the same project

could use a bit more justification for why they are needed and how that information will be presented back to users. Presumably, the client UUID is to identify e.g. CI services, but can't that be done by the Pkg Server detecting a huge number of package downloads to the same place? And what is the project hash for?

@StefanKarpinski
Member Author

StefanKarpinski commented Dec 12, 2019

The client UUID is to answer by far the most common question that funders, potential adopters, skeptical bosses, etc. pose about Julia: how many users are there? This doesn't give the number of users since some people will install Julia on multiple systems, but it does give a concrete number for number of active installs, which I think most people asking that question will consider a good enough proxy. How would you suggest estimating the number of Julia users/installs otherwise?

The project path hash is less important (everything is, pretty much), but it allows answering questions of the form "how often are package A and package B used together?" That is, in turn, useful for understanding how important compatibility between A and B is. If they're rarely used together, then if A introduces an incompatibility with B: meh, who cares? If they're used together a lot, then we might have a serious problem that needs to be addressed asap. If we only know that they were installed on the same system (based on the client UUID), that doesn't really tell us much given the way Pkg works: they may never be in the same project. If the requests to install A and B have the same project hash value, then they're being used together (unless one of them has since been removed from the project). The project hash computation is carefully designed not to reveal anything about the actual path; it's only useful for telling if two operations are in the same project or not. Again, if you have other ideas for how to estimate that information, let's hear it.
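
The "used together" question can be sketched as a simple server-side count. This is a Python illustration only: the record shape (project hash plus package name) and the function name are hypothetical, and it ignores the caveat about packages that were later removed from a project.

```python
from collections import defaultdict
from itertools import combinations


def co_usage_counts(requests):
    """Count, per pair of packages, how many projects requested both.

    `requests` is an iterable of (project_hash, package_name) pairs
    (hypothetical shape). The hash is only ever compared for equality;
    the path behind it is never needed.
    """
    packages_by_project = defaultdict(set)
    for project_hash, package in requests:
        packages_by_project[project_hash].add(package)

    counts = defaultdict(int)
    for packages in packages_by_project.values():
        for pair in combinations(sorted(packages), 2):
            counts[pair] += 1
    return dict(counts)


log = [
    ("b0d97e", "A"), ("b0d97e", "B"),  # A and B in the same project
    ("77c1aa", "A"), ("77c1aa", "B"),  # and again in another project
    ("e4f210", "A"), ("e4f210", "C"),
]
print(co_usage_counts(log))
```

Grouping by client UUID instead would lump all of a user's projects together, which is exactly the weaker signal described above.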

@StefanKarpinski
Member Author

StefanKarpinski commented Dec 12, 2019

Regarding determining whether something is a CI process or not, it seems like most of the CI services automatically set the CI environment variable. This comment has a good summary:

https://github.community/t5/GitHub-Actions/Have-the-CI-environment-variable-set-by-default/m-p/32358/highlight/true#M1097

So it looks like checking for the CI environment variable gets us 90% of the way there (AppVeyor, CircleCI, GitLab, Travis) and checking for a couple of other vendor-specific environment variables gets us the rest of the way: TF_BUILD for Azure Pipelines, and maybe GITHUB_ACTION for GitHub? Unfortunately, none of the GitHub variables are boolean and we don't want to leak sensitive information by accident, so we need to look for their presence only.

So I think what we can probably do is check for a set of CI-indicator variables and indicate whether they are present and, if present, whether they have a value that we recognize as true, false or other. Since it's a fixed set of well-known variables that are unlikely to be used for private purposes and we are only sending present/absent information about them, that seems acceptable for privacy.

@StefanKarpinski
Member Author

StefanKarpinski commented Dec 12, 2019

I've pushed a commit that looks through a predefined list of environment variables (APPVEYOR, CI, CIRCLECI, CONTINUOUS_INTEGRATION, GITHUB_ACTION, GITLAB_CI, JULIA_CI, TF_BUILD, TRAVIS) that are common indicators of being in a CI process and indicates one of four states for each one:

  • n for "none" or "not set"
  • t for "true" if it has one of the (case insensitive) values: true, t, 1, yes, y
  • f for "false" if it has one of the (case insensitive) values: false, f, 0, no, n
  • o if it is set with any other value.

This should let us understand whether a connection is coming from a CI process or not by looking for the presence and/or truthiness of these variables, while not leaking sensitive information if someone happens to put their password in one of these environment variables. For example, this is what the header looks like after running export CI=true:

Julia-CI-Indicators: APPVEYOR=n;CI=t;CIRCLECI=n;CONTINUOUS_INTEGRATION=n;GITHUB_ACTION=n;GITLAB_CI=n;JULIA_CI=n;TF_BUILD=n;TRAVIS=n

The reason to send the names of all the variables, even those that aren't present, is that whether they were checked for or not is also information. Over time we may change the set of variables that we check for and when doing an analysis, you don't want to have to worry about which Julia version may have sent this info when deciding whether a variable was checked for and not found or whether it wasn't checked for at all. Sending the full vector of checked indicators makes that clear.
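
The four-state encoding is easy to sketch. The Python below follows the variable list and value sets given in this comment; the function names are made up for the example, and it is an illustration rather than the Julia code in the commit.

```python
import os

# The fixed list of CI-indicator variables named above, in header order.
CI_VARIABLES = [
    "APPVEYOR", "CI", "CIRCLECI", "CONTINUOUS_INTEGRATION", "GITHUB_ACTION",
    "GITLAB_CI", "JULIA_CI", "TF_BUILD", "TRAVIS",
]
TRUE_VALUES = {"true", "t", "1", "yes", "y"}
FALSE_VALUES = {"false", "f", "0", "no", "n"}


def classify(value):
    """Map a variable's value to n (not set), t (true), f (false) or o (other)."""
    if value is None:
        return "n"
    lowered = value.lower()  # the comparison is case insensitive
    if lowered in TRUE_VALUES:
        return "t"
    if lowered in FALSE_VALUES:
        return "f"
    return "o"  # set, but the actual value is never sent


def ci_indicators(environ=os.environ):
    """Build the Julia-CI-Indicators header value from an environment mapping."""
    return ";".join(
        name + "=" + classify(environ.get(name)) for name in CI_VARIABLES
    )


print("Julia-CI-Indicators: " + ci_indicators({"CI": "true"}))
```

Passing `{"CI": "true"}` as the environment reproduces the header shown above. A value such as a password stored in one of these variables would show up only as `o`.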

@StefanKarpinski
Member Author

StefanKarpinski commented Dec 12, 2019

Note that JULIA_CI isn’t a thing yet, but if someone wants to set a variable specifically to indicate to Julia that they are doing CI, this is the variable they could use.

@StefanKarpinski
Member Author

In a future PR, I will add a docs section on telemetry, including what it sends, how to opt out, and the reasons we collect each value. I'd like to get this merged now so that it's ready for the 1.4 feature freeze (tentatively on Sunday).

@StefanKarpinski StefanKarpinski merged commit 8e236a7 into master Dec 12, 2019
@StefanKarpinski StefanKarpinski deleted the sk/telemetry branch December 12, 2019 21:27
@tkf
Member

tkf commented Dec 12, 2019

Julia-Version: 1.4.0-DEV.577
Julia-Commit: 1432c5a08531aab153953848e5e8015faed2bb8d

Not sure if this is relevant, but can't you guess who the client is if you send Julia-Commit and if it happens to be from a non-master branch (e.g., PR)?

@00vareladavid
Contributor

It does seem to be very specific information. Perhaps it should be changed to just the Julia version (only major.minor.patch).

@StefanKarpinski
Member Author

Yeah, maybe too specific. Of course if it’s not a public commit then how would we know? And if it is a public commit, it could be anyone.

@00vareladavid
Contributor

I could be mistaken, but it seems it could be a private commit someone is working on: they use Pkg while working on it, which sends the commit through the protocol, and then they make a PR. Now anyone with the info could correlate the commit in the PR (and thus the client UUID) with their identity.

@StefanKarpinski
Member Author

That’s true. We, the Julia devs, can probably identify Julia devs this way :)

I’d be happy to take the Julia commit out. I think that keeping the version number with how far ahead of master it is seems innocuous. We should at least keep some indicator of whether you are exactly on a tagged version or not.

@davidanthoff

Client UUID: randomly generated, totally anonymous UUID which allows correlating requests from the same Julia install over time.

Has this been run by a lawyer specialised in this stuff? From my (admittedly amateur) understanding of the European (and maybe even CA) legal situation, this requires explicit opt-in.

@StefanKarpinski
Member Author

It currently does require an opt-in, so it's going out in 1.4 as-is. We'll review for 1.5 and may put a nag prompt in for interactive usage, along with some kind of HyperLogLog-based stats for counting the number of unique clients without revealing anything identifiable.

@chrisvwx

chrisvwx commented Feb 22, 2020

To be clear, would the HyperLogLog calculation be done on the server? If so, the user-specific, pseudonymous UUID has to be there, and so the HyperLogLog calculation doesn't help for privacy.

Client UUID: randomly generated, totally anonymous UUID

I guess you mean "pseudonymous", not "anonymous"

"CI",
"CIRCLECI",
"CONTINUOUS_INTEGRATION",
"GITHUB_ACTION",
Member

This comment comes rather late... Actually, IIUC it should be GITHUB_ACTIONS (note the trailing S here) according to the default GitHub Actions environment variables

And GITHUB_ACTION is the job id

Member Author

Thanks, will fix this.

Member Author

PR: #1693

@StefanKarpinski
Member Author

No, the HyperLogLog sample would be generated on the client. What would be sent would be a pair of values with a total number of distinct values on the order of 2^16, which is not enough unique values to cover the number of unique clients expected.
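
To show what "generated on the client" could mean, here is a Python sketch of the standard HyperLogLog scheme split between client and server. It is one possible reading of the idea, not a design from this thread: the 10-bit register index (1024 registers, up to 55 ranks, so roughly 2^16 possible pairs), the use of SHA-256, and the function names are all assumptions.

```python
import hashlib
import math
import uuid

P = 10               # index bits (assumption): 2**10 = 1024 registers
M = 1 << P
VALUE_BITS = 64 - P  # remaining bits of a 64-bit hash, used for the rank


def client_sample(client_uuid):
    """Runs on the client: reduce the UUID to a (register, rank) pair.

    Only this pair would be sent. Many different clients map to the same
    pair, so it cannot serve as a per-client identifier.
    """
    digest = hashlib.sha256(client_uuid.encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")
    register = h >> VALUE_BITS
    rest = h & ((1 << VALUE_BITS) - 1)
    rank = VALUE_BITS - rest.bit_length() + 1  # position of the leftmost 1 bit
    return register, rank


def estimate(samples):
    """Runs on the server: estimate the number of distinct clients."""
    registers = [0] * M
    for register, rank in samples:
        registers[register] = max(registers[register], rank)
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)  # small-range (linear counting) correction
    return raw


samples = [client_sample(str(uuid.uuid4())) for _ in range(5000)]
print(round(estimate(samples)))  # should land near 5000; the exact value varies
```

Repeated requests from the same client send the same pair and leave the registers unchanged, which is why the server can count distinct clients without ever seeing a UUID.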

@chrisvwx

Thanks. Will you send the project path hash to the server or do the HyperLogLog business on the client? For long-lived installs, this hash will be just as good of a user identifier as a UUID.

@StefanKarpinski
Member Author

If no UUID is sent then nothing that depends on the UUID is sent.

9 participants