Pkg protocol: basic anonymous opt-out telemetry #1544

StefanKarpinski · 2019-12-12T00:27:28Z

The only information this provides over what we can get just by clients connecting is:

Client UUID: randomly generated, totally anonymous UUID which allows correlating requests from the same Julia install over time.
Project path hash: secure hash of the project path, which allows correlating requests from this client that are associated with the same project.

Regarding privacy: as long as the secret salt value isn't leaked, it's impossible to work out what the project path was, even by brute force, since you don't know what the salt value is. Even if the salt value is leaked, you'd still have to reverse a secure hash. The other purpose of the secret salt is so that if the same project path is used by different Julia installs, they won't have the same hash.
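
To make the salting idea concrete, here is a minimal Python sketch. It is not Pkg's actual code: the choice of SHA-256, the NUL separator, the salt format, and the function names are all assumptions made for illustration.

```python
import hashlib
import secrets


def new_secret_salt() -> str:
    """Generate a random per-install salt (hypothetical format: 32 hex chars)."""
    return secrets.token_hex(16)


def project_hash(secret_salt: str, project_path: str) -> str:
    """Hash the project path together with the secret salt.

    Assumption: SHA-256 over salt + NUL + path. The PR description does not
    say which hash function or input layout Pkg uses.
    """
    data = secret_salt.encode("utf-8") + b"\0" + project_path.encode("utf-8")
    return hashlib.sha256(data).hexdigest()


salt_a = new_secret_salt()  # salt of one Julia install
salt_b = new_secret_salt()  # salt of a different install
path = "/home/user/projects/demo"  # made-up example path

# Same install, same path: the hashes match, so requests can be correlated.
same = project_hash(salt_a, path) == project_hash(salt_a, path)
# Different installs, same path: the hashes differ.
differ = project_hash(salt_a, path) != project_hash(salt_b, path)
print(same, differ)
```

Without the salt, an observer would have to guess both the salt and the path to confirm a candidate project path, which is the property the paragraph above relies on.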

Telemetry files are per-server, which prevents a bad actor from correlating information across servers to learn something additional about the user (e.g. correlating a client UUID from an authenticated server to an anonymous one to de-anonymize the user). In order to opt out of telemetry entirely, just put telemetry = false in the appropriate telemetry file. It's also possible to just opt out of sending project path hashes by putting project_hash = false in the telemetry file.

codecov · 2019-12-12T00:48:50Z

Codecov Report

Merging #1544 into master will decrease coverage by 0.81%.
The diff coverage is 13.84%.

@@            Coverage Diff            @@
##           master   #1544      +/-   ##
=========================================
- Coverage   85.82%     85%   -0.82%     
=========================================
  Files          24      24              
  Lines        5277    5336      +59     
=========================================
+ Hits         4529    4536       +7     
- Misses        748     800      +52

Impacted Files	Coverage Δ
src/PlatformEngines.jl	`56.47% <13.84%> (-7.2%)`	⬇️


00vareladavid · 2019-12-12T01:22:24Z

Can you add some docs about this as well?

StefanKarpinski · 2019-12-12T02:51:29Z

Will do.

DilumAluthge · 2019-12-12T02:59:40Z

Will it be possible to use these data to tell package authors some statistics about their packages? For example, it would be great if we could tell package authors how many times per month their package was downloaded.

StefanKarpinski · 2019-12-12T03:30:17Z

Yes, that's exactly why we want these stats. It will also allow us to tell how many Julia installs there are and how many of them are long-lived (presumably actual people) versus ephemeral (presumably VMs). I'm also going to add some telemetry about the OS and Julia version.
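
As a rough illustration of the long-lived versus ephemeral distinction, a server-side analysis could look at how long each client UUID keeps showing up. This is a sketch only: the record layout, the function name, and the 7-day threshold are assumptions, not anything specified in this PR.

```python
from datetime import datetime, timedelta


def classify_installs(requests, min_lifetime=timedelta(days=7)):
    """Split client UUIDs into long-lived and ephemeral sets.

    `requests` is an iterable of (client_uuid, timestamp) pairs. The 7-day
    threshold is an arbitrary assumption for illustration.
    """
    first_seen, last_seen = {}, {}
    for client_uuid, ts in requests:
        first_seen[client_uuid] = min(ts, first_seen.get(client_uuid, ts))
        last_seen[client_uuid] = max(ts, last_seen.get(client_uuid, ts))
    long_lived = {
        u for u in first_seen if last_seen[u] - first_seen[u] >= min_lifetime
    }
    ephemeral = set(first_seen) - long_lived
    return long_lived, ephemeral


log = [
    ("uuid-a", datetime(2019, 12, 1)),
    ("uuid-a", datetime(2019, 12, 20)),  # seen again weeks later
    ("uuid-b", datetime(2019, 12, 5)),   # seen once, e.g. a throwaway VM
]
long_lived, ephemeral = classify_installs(log)
print(len(long_lived) + len(ephemeral), "installs,", len(long_lived), "long-lived")
```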

StefanKarpinski · 2019-12-12T04:38:26Z

To give a little documentation by example, here are the telemetry headers that are currently sent:

Julia-Version: 1.4.0-DEV.577
Julia-Commit: 1432c5a08531aab153953848e5e8015faed2bb8d
Julia-System: x86_64-apple-darwin14-libgfortran5
Julia-Client-UUID: d955f7eb-e27e-4c85-8eb3-31c9bbb82270
Julia-Salt-Hash: 45bf87cb1abbdf8dd14c163e4209700459b879b1
Julia-Project-Hash: b0d97ef509ef57fa6582989d7a8cc1beb5c467d3

Details and why we want this info:

Version: for obvious reasons
Commit: somewhat redundant, but can help identify problems with specific commits
System: crucial to help us know how popular Julia is across OSes
Client UUID: allow correlating the same system across time
Salt hash: allows telling if project hashes are comparable or not
Project hash: allows correlating operations on the same project

Opting out (per ~/.julia/servers/telemetry.toml file):

To opt out of telemetry entirely, set telemetry = false
To opt out of client-specific telemetry, set client_uuid = false
To opt out of project hash telemetry, set secret_salt = false.
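
The opt-out rules above can be sketched as a small decision function. This Python version is illustrative only and covers just the client-specific headers: the `settings` dictionary stands in for the parsed telemetry file, and the assumption that `client_uuid` and `secret_salt` hold either a string value or `false` is inferred from the list above rather than taken from Pkg's code. `project_hash_fn` is a hypothetical callback.

```python
def telemetry_headers(settings: dict, project_hash_fn) -> dict:
    """Build the client-specific telemetry headers from a telemetry file.

    Assumption: `settings` is the parsed TOML table, where `client_uuid` and
    `secret_salt` hold either a string value or False (opted out).
    `project_hash_fn(salt)` is a hypothetical callback returning the hash.
    """
    headers = {}
    if settings.get("telemetry", True) is False:
        return headers  # opted out of telemetry entirely
    client_uuid = settings.get("client_uuid")
    if not client_uuid:
        return headers  # opted out of client-specific telemetry
    headers["Julia-Client-UUID"] = client_uuid
    secret_salt = settings.get("secret_salt")
    if secret_salt:  # False means: no project hash telemetry
        headers["Julia-Project-Hash"] = project_hash_fn(secret_salt)
    return headers


def fake_hash(salt: str) -> str:
    """Stand-in for the real salted hash, for demonstration only."""
    return "hash-using-" + salt


full = {"client_uuid": "d955f7eb-e27e-4c85-8eb3-31c9bbb82270", "secret_salt": "s"}
no_project = dict(full, secret_salt=False)
print(sorted(telemetry_headers(full, fake_hash)))
print(sorted(telemetry_headers(no_project, fake_hash)))
```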

Nosferican · 2019-12-12T05:36:42Z

Is there a way to pass some kind of user agent, such as CI, user, or something similar?

StefanKarpinski · 2019-12-12T05:39:52Z

Yes, we should standardize a way to do that. Probably an environment variable to set like export JULIA_CI=true or something like that. Needs a bit of design. Maybe a generic way of passing something through to telemetry. Or, if there are common variables we know are set on common CI systems, we could potentially look for those and record their presence in telemetry.

KristofferC · 2019-12-12T08:13:10Z

I think these

Client UUID: allow correlating the same system across time

Salt hash: allows telling if project hashes are comparable or not

Project hash: allows correlating operations on the same project

could use a bit more justification for why they are needed and how that information will be presented back to users. Presumably, the client UUID is to identify e.g. CI services, but can't that be done by the Pkg Server detecting a huge number of downloads of packages to the same place? And what is the project hash for?

StefanKarpinski · 2019-12-12T15:42:51Z

The client UUID is to answer by far the most common question that funders, potential adopters, skeptical bosses, etc. pose about Julia: how many users are there? This doesn't give the number of users since some people will install Julia on multiple systems, but it does give a concrete number for number of active installs, which I think most people asking that question will consider a good enough proxy. How would you suggest estimating the number of Julia users/installs otherwise?

The project path hash is less important (everything is, pretty much), but it allows answering questions of the form "how often are package A and package B used together?" That is, in turn, useful for understanding how important compatibility between A and B is. If they're rarely used together, then if A introduces an incompatibility with B, meh, who cares? If they're used together a lot, then we might have a serious problem that needs to be addressed ASAP. If we only know that they were installed on the same system (based on the client UUID), that doesn't really tell us much given the way Pkg works: they may never be in the same project. If the requests to install A and B have the same project hash value, then they're being used together (unless one of them has since been removed from the project). The project hash computation is carefully designed not to reveal anything about the actual path; it's only useful for telling if two operations are in the same project or not. Again, if you have other ideas for how to estimate that information, let's hear them.
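
A sketch of how the "used together" question might be answered from request logs, in Python. The record layout of (project hash, package name) pairs and the function name are assumptions for illustration; the PR does not describe the server-side analysis.

```python
from collections import defaultdict
from itertools import combinations


def co_usage_counts(requests):
    """Count, per package pair, how many projects requested both packages.

    `requests` is an iterable of (project_hash, package_name) pairs.
    """
    packages_by_project = defaultdict(set)
    for project_hash, package in requests:
        packages_by_project[project_hash].add(package)
    counts = defaultdict(int)
    for packages in packages_by_project.values():
        # Sort so each pair has one canonical key, e.g. ("A", "B").
        for pair in combinations(sorted(packages), 2):
            counts[pair] += 1
    return dict(counts)


log = [
    ("hash1", "A"), ("hash1", "B"),  # A and B in the same project
    ("hash2", "A"), ("hash2", "B"), ("hash2", "C"),
    ("hash3", "C"),                  # C alone in another project
]
counts = co_usage_counts(log)
print(counts)
```

Grouping by client UUID instead of project hash would also count packages that merely live on the same machine, which is the distinction the paragraph above draws.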

StefanKarpinski · 2019-12-12T17:12:00Z

Regarding determining whether something is a CI process or not, it seems like most of the CI services automatically set the CI environment variable. This comment has a good summary:

https://github.community/t5/GitHub-Actions/Have-the-CI-environment-variable-set-by-default/m-p/32358/highlight/true#M1097

So it looks like checking for the CI environment variable gets us 90% of the way there (AppVeyor, CircleCI, GitLab, Travis) and checking for a couple of other vendor-specific environment variables gets us the rest of the way: TF_BUILD for Azure Pipelines, and maybe GITHUB_ACTION for GitHub? Unfortunately, none of the GitHub variables are boolean and we don't want to leak sensitive information by accident, so we need to look for their presence only.

So I think what we can probably do is check for a set of CI-indicator variables and indicate whether they are present and, if present, whether they have a value that we recognize as true, false or other. Since it's a fixed set of well-known variables that are unlikely to be used for private purposes and we are only sending present/absent information about them, that seems acceptable for privacy.

StefanKarpinski · 2019-12-12T17:49:15Z

I've pushed a commit that looks through a predefined list of environment variables (APPVEYOR, CI, CIRCLECI, CONTINUOUS_INTEGRATION, GITHUB_ACTION, GITLAB_CI, JULIA_CI, TF_BUILD, TRAVIS) that are common indicators of being in a CI process and indicates one of four states for each one:

n for "none" or "not set"
t for "true" if it has one of the (case insensitive) values: true, t, 1, yes, y
f for "false" if it has one of the (case insensitive) values: false, f, 0, no, n
o if it is set with any other value.

This should let us understand whether a connection is coming from a CI process or not by looking for the presence and/or truthiness of these variables, while not leaking sensitive information if someone happens to put their password in one of these environment variables. For example, this is what the header looks like when I have done export CI=true:

Julia-CI-Indicators: APPVEYOR=n;CI=t;CIRCLECI=n;CONTINUOUS_INTEGRATION=n;GITHUB_ACTION=n;GITLAB_CI=n;JULIA_CI=n;TF_BUILD=n;TRAVIS=n
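
The encoding described above can be sketched in a few lines of Python. This mirrors the rules and the variable list given in this comment but is not the Julia code from the commit; the function names are made up for illustration.

```python
import os

# Variable list copied from the comment above, in the same order.
CI_VARIABLES = [
    "APPVEYOR", "CI", "CIRCLECI", "CONTINUOUS_INTEGRATION", "GITHUB_ACTION",
    "GITLAB_CI", "JULIA_CI", "TF_BUILD", "TRAVIS",
]
TRUE_VALUES = {"true", "t", "1", "yes", "y"}
FALSE_VALUES = {"false", "f", "0", "no", "n"}


def ci_indicator(value):
    """Map an environment value (or None if unset) to n, t, f, or o."""
    if value is None:
        return "n"
    lowered = value.lower()
    if lowered in TRUE_VALUES:
        return "t"
    if lowered in FALSE_VALUES:
        return "f"
    return "o"  # set, but to something we deliberately do not transmit


def ci_indicators_header(env):
    """Build the Julia-CI-Indicators header value from an environment mapping."""
    return ";".join(f"{name}={ci_indicator(env.get(name))}" for name in CI_VARIABLES)


print(ci_indicators_header({"CI": "true"}))  # simulated environment
print(ci_indicators_header(os.environ))      # the real environment
```

Only the single-letter state ever leaves the machine, so a secret stored in one of these variables shows up as `o` and nothing more.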

The reason to send the names of all the variables, even those that aren't present, is that whether they were checked for or not is also information. Over time we may change the set of variables that we check for and when doing an analysis, you don't want to have to worry about which Julia version may have sent this info when deciding whether a variable was checked for and not found or whether it wasn't checked for at all. Sending the full vector of checked indicators makes that clear.

StefanKarpinski · 2019-12-12T18:08:52Z

Note that JULIA_CI isn’t a thing yet, but if someone wants to set a variable specifically to indicate to Julia that they are doing CI, this is the variable they could use.

StefanKarpinski · 2019-12-12T19:56:09Z

In a future PR, I will add a docs section on telemetry, including what it sends, how to opt out, and the reasons we collect each value. I'd like to get this merged now so that it's ready for the 1.4 feature freeze (tentatively on Sunday).

tkf · 2019-12-12T23:49:49Z

Julia-Version: 1.4.0-DEV.577
Julia-Commit: 1432c5a08531aab153953848e5e8015faed2bb8d

Not sure if this is relevant, but can't you guess who the client is if you send Julia-Commit and if it happens to be from a non-master branch (e.g., PR)?

00vareladavid · 2019-12-13T00:06:07Z

It does seem to be very specific information. Perhaps it should be changed to just the Julia version (only major.minor.patch).

StefanKarpinski · 2019-12-13T04:21:14Z

Yeah, maybe too specific. Of course if it’s not a public commit then how would we know? And if it is a public commit, it could be anyone.

00vareladavid · 2019-12-13T04:37:07Z

I could be mistaken, but it seems it could be a private commit someone is working on, they use Pkg while working on it, send over the commit through the protocol, then they make a PR. Now anyone with the info could correlate the commit in the PR (and thus the client UUID) with their identity.

StefanKarpinski · 2019-12-13T04:55:42Z

That’s true. We, the Julia devs, can probably identify Julia devs this way :)

I’d be happy to take the Julia commit out. I think that keeping the version number with how far ahead of master it is seems innocuous. We should at least keep some indicator of whether you are exactly on a tagged version or not.
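
For illustration, here is one way a consumer of the header could split a version string into its base version and whatever suffix follows, so that the commit hash is not needed. This is a Python sketch under the assumption that version strings look like the `1.4.0-DEV.577` example earlier in the thread; the function name and regular expression are hypothetical.

```python
import re

# major.minor.patch, optionally followed by "-<suffix>" (e.g. "-DEV.577").
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?$")


def split_version(version: str):
    """Return (base_version, suffix), where suffix is None if there is none."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch, suffix = match.groups()
    return f"{major}.{minor}.{patch}", suffix


print(split_version("1.4.0-DEV.577"))
print(split_version("1.3.1"))
```

A missing suffix is only a rough signal of being on a release; whether that matches "exactly on a tagged version" depends on how the version string is produced.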

davidanthoff · 2020-02-13T23:02:04Z

Client UUID: randomly generated, totally anonymous UUID which allows correlating requests from the same Julia install over time.

Has this been run by a lawyer specialised in this stuff? From my (admittedly amateur) understanding of the European (and maybe even CA) legal situation, this requires explicit opt-in.

StefanKarpinski · 2020-02-14T15:24:04Z

It currently does require an opt-in, so it's going out in 1.4 as-is. We'll review for 1.5 and may put a nag prompt in for interactive usage, along with some kind of HyperLogLog-based stats for counting the number of unique clients without revealing anything identifiable.

chrisvwx · 2020-02-22T17:42:02Z

To be clear, would the HyperLogLog calculation be done on the server? If so, the user-specific, pseudonymous UUID has to be there, and so the HyperLogLog calculation doesn't help for privacy.

Client UUID: randomly generated, totally anonymous UUID

I guess you mean "pseudonymous", not "anonymous"

johnnychen94 · 2020-02-22T17:53:01Z

src/PlatformEngines.jl

+    "CI",
+    "CIRCLECI",
+    "CONTINUOUS_INTEGRATION",
+    "GITHUB_ACTION",


This comment comes rather late... Actually, IIUC it should be GITHUB_ACTIONS (note the trailing S here) according to the default GitHub Actions environment variables

And GITHUB_ACTION is the job id

Thanks, will fix this.

StefanKarpinski · 2020-02-22T20:06:49Z

No, the HyperLogLog sample would be generated on the client. What would be sent would be a pair of values with a total number of distinct values on the order of 2^16, which is not enough unique values to cover the number of unique clients expected.
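
A sketch of what a client-generated HyperLogLog sample could look like, in Python. The parameters are assumptions chosen to land near 2^16 possible pairs (1024 buckets times 55 possible ranks); the actual parameters, hash function, and wire format are not specified in this thread.

```python
import hashlib
import math
import uuid

P = 10              # bucket bits, giving 2**10 = 1024 buckets (assumed)
M = 1 << P
REST_BITS = 64 - P  # 54 bits left over for computing the rank


def hll_sample(client_uuid: str):
    """Client side: reduce a UUID to a (bucket, rank) pair."""
    digest = hashlib.sha256(client_uuid.encode("utf-8")).digest()
    x = int.from_bytes(digest[:8], "big")     # 64-bit hash value
    bucket = x >> REST_BITS                   # top P bits
    rest = x & ((1 << REST_BITS) - 1)         # low 54 bits
    rank = REST_BITS - rest.bit_length() + 1  # leading zeros + 1, in 1..55
    return bucket, rank


def hll_estimate(samples):
    """Server side: estimate the number of distinct clients from the pairs."""
    registers = [0] * M
    for bucket, rank in samples:
        registers[bucket] = max(registers[bucket], rank)
    alpha = 0.7213 / (1 + 1.079 / M)          # standard constant for M >= 128
    estimate = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if estimate <= 2.5 * M and zeros:
        estimate = M * math.log(M / zeros)    # small-range correction
    return estimate


# Simulate 10,000 installs with deterministic UUIDs.
clients = [str(uuid.UUID(int=i)) for i in range(10_000)]
samples = [hll_sample(c) for c in clients]
print(round(hll_estimate(samples)))
```

Because many different UUIDs map to the same pair, the server can estimate how many distinct clients it has seen without receiving a value that is unique to any one of them.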

chrisvwx · 2020-02-22T21:34:41Z

Thanks. Will you send the project path hash to the server or do the HyperLogLog business on the client? For long-lived installs, this hash will be just as good of a user identifier as a UUID.

StefanKarpinski · 2020-02-22T21:35:54Z

If no UUID is sent then nothing that depends on the UUID is sent.

StefanKarpinski force-pushed the sk/telemetry branch from d76a9e9 to 734dcb8 Compare December 12, 2019 00:48

Pkg protocol: basic anonymous opt-out telemetry

246dbd0

StefanKarpinski force-pushed the sk/telemetry branch from 734dcb8 to 246dbd0 Compare December 12, 2019 04:27

StefanKarpinski mentioned this pull request Dec 12, 2019

Proposal: Pkg & Storage Protocols #1377

Closed

CI telemetry: send indicators for common CI env vars

228fb97

telemetry: factor out telemetry file loading

90b8482

StefanKarpinski merged commit 8e236a7 into master Dec 12, 2019

StefanKarpinski deleted the sk/telemetry branch December 12, 2019 21:27

maleadt mentioned this pull request Dec 28, 2019

Set the environment variable "JULIA_PKGEVAL" to "true" JuliaCI/PkgEval.jl#51

Merged

johnnychen94 reviewed Feb 22, 2020
