Pkg protocol: basic anonymous opt-out telemetry #1544
Codecov Report
@@ Coverage Diff @@
## master #1544 +/- ##
=========================================
- Coverage 85.82% 85% -0.82%
=========================================
Files 24 24
Lines 5277 5336 +59
=========================================
+ Hits 4529 4536 +7
- Misses 748 800 +52
Can you add some docs about this as well? |
Will do. |
Will it be possible to use these data to tell package authors some statistics about their packages? For example, it would be great if we could tell package authors how many times per month their package was downloaded. |
Yes, that's exactly why we want these stats. It will also allow us to tell how many Julia installs there are and how many of them are long-lived (presumably actual people) versus ephemeral (presumably VMs). I'm also going to add some telemetry about the OS and Julia version. |
To give a little documentation by example, here are the telemetry headers that are currently sent:
Details and why we want this info:
Opting out (per
|
Is there a way to pass some kind of user-agent, e.g. CI, user, or something?
Yes, we should standardize a way to do that. Probably an environment variable to set like |
I think these
could use a bit more justification of why they are needed and how that information will be presented back to users. Presumably, the client UUID is there to identify e.g. CI services, but can't that be done by the Pkg Server detecting a huge number of package downloads to the same place? And what is the project hash for?
The client UUID is there to answer by far the most common question that funders, potential adopters, skeptical bosses, etc. pose about Julia: how many users are there? This doesn't give the number of users, since some people will install Julia on multiple systems, but it does give a concrete number of active installs, which I think most people asking that question will consider a good enough proxy. How would you suggest estimating the number of Julia users/installs otherwise?

The project path hash is less important (everything is, pretty much), but it allows answering questions of the form "how often are package A and package B used together?", which is in turn useful for understanding how important compatibility between A and B is. If they're rarely used together, then if A introduces an incompatibility with B: meh, who cares? If they're used together a lot, then we might have a serious problem that needs to be addressed asap. If we only know that they were installed on the same system (based on the client UUID), that doesn't really tell us much given the way Pkg works: they may never be in the same project. If the requests to install A and B have the same project hash value, then they're being used together (unless one of them has since been removed from the project). The project hash computation is carefully designed not to reveal anything about the actual path; it's only useful for telling whether two operations are in the same project or not. Again, if you have other ideas for how to estimate that information, let's hear them.
Regarding determining whether something is a CI process or not, it seems like most of the CI services automatically set the So it looks like checking for the So I think what we can probably do is check for a set of CI-indicator variables and indicate whether each one is present and, if present, whether it has a value that we recognize as true, false, or other. Since it's a fixed set of well-known variables that are unlikely to be used for private purposes, and we are only sending present/absent information about them, that seems acceptable for privacy.
I've pushed a commit that looks through a predefined list of environment variables (
This should let us understand whether a connection is coming from a CI process or not by looking at the presence and/or truthiness of these variables, while not leaking sensitive information if someone happens to put their password in one of these environment variables. For example, this is what the header looks like when I have done
The reason to send the names of all the variables, even those that aren't present, is that whether they were checked for or not is also information. Over time we may change the set of variables that we check for and when doing an analysis, you don't want to have to worry about which Julia version may have sent this info when deciding whether a variable was checked for and not found or whether it wasn't checked for at all. Sending the full vector of checked indicators makes that clear. |
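As a rough illustration of the scheme described above (not the code in this PR), the check could look something like the sketch below. The helper names `ci_indicator` and `ci_header`, the single-letter encoding, the `;` separator, and the sets of strings treated as true or false are all assumptions made for the example; the variable list uses names mentioned in this thread.

```julia
# Hypothetical sketch: names, encoding and separator are illustrative assumptions.
const CI_VARIABLES = ["CI", "CIRCLECI", "CONTINUOUS_INTEGRATION", "GITHUB_ACTIONS"]

# Classify a variable without revealing its value:
#   "n" = not set, "t" = recognized as true, "f" = recognized as false, "o" = other
function ci_indicator(env, name::AbstractString)
    haskey(env, name) || return "n"
    value = lowercase(strip(env[name]))
    value in ("t", "true", "y", "yes", "1") && return "t"
    value in ("f", "false", "n", "no", "0") && return "f"
    return "o"
end

# Report every checked variable, including the absent ones, so the receiver
# can tell "checked and not found" apart from "never checked".
ci_header(env = ENV) =
    join((string(name, "=", ci_indicator(env, name)) for name in CI_VARIABLES), ";")
```

Only the classification ever leaves the machine, so a secret stored in one of these variables would show up as `o` rather than as its value.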
Note that |
In a future PR, I will add a docs section on telemetry, including what it sends, how to opt out, and the reasons we collect each value. I'd like to get this merged now so that it's ready for the 1.4 feature freeze (tentatively on Sunday). |
Not sure if this is relevant, but can't you guess who the client is if you send |
It does seem to be very specific information. Perhaps it should be changed to just the Julia version (only major.minor.patch). |
Yeah, maybe too specific. Of course if it’s not a public commit then how would we know? And if it is a public commit, it could be anyone. |
I could be mistaken, but it seems it could be a private commit someone is working on, they use |
That’s true. We, the Julia devs, can probably identify Julia devs this way :) I’d be happy to take the Julia commit out. I think that keeping the version number with how far ahead of master it is seems innocuous. We should at least keep some indicator of whether you are exactly on a tagged version or not. |
Has this been run by a lawyer specialised in this stuff? From my (admittedly amateur) understanding of the European (and maybe even CA) legal situation, this requires explicit opt-in. |
It currently does require an opt-in, so it's going out in 1.4 as-is. We'll review for 1.5 and may put a nag prompt in for interactive usage, along with some kind of HyperLogLog-based stats for counting the number of unique clients without revealing anything identifiable. |
To be clear, would the HyperLogLog calculation be done on the server? If so, the user-specific, pseudonymous UUID has to be there, and so the HyperLogLog calculation doesn't help for privacy.
I guess you mean "pseudonymous", not "anonymous" |
"CI",
"CIRCLECI",
"CONTINUOUS_INTEGRATION",
"GITHUB_ACTION",
This comment comes rather late, but IIUC it should be GITHUB_ACTIONS (note the trailing S) according to the default GitHub Actions environment variables, and GITHUB_ACTION is the job id.
Thanks, will fix this.
PR: #1693
No, the HyperLogLog sample would be generated on the client. What would be sent would be a pair of values with a total number of distinct values on the order of 2^16, which is not enough unique values to cover the number of unique clients expected. |
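To make the idea concrete, here is one way a client could derive such a pair. This is only a sketch under stated assumptions, not a description of what Pkg does or will do: the function name `hll_sample`, the choice of SHA-256, and the split into 2^10 buckets are all hypothetical. With 1024 buckets and a rank between 1 and 55 there are 56,320 possible pairs, which is on the order of 2^16.

```julia
using SHA, UUIDs

# Assumption for illustration: 2^10 buckets.
const HLL_BUCKET_BITS = 10

# Hypothetical helper: reduce a client UUID to a (bucket, rank) pair on the client.
function hll_sample(id::UUID)
    digest = sha256(Vector{UInt8}(string(id)))
    h = reinterpret(UInt64, digest[1:8])[1]       # first 64 bits of the digest
    bucket = Int(h & ((UInt64(1) << HLL_BUCKET_BITS) - 1))
    rest = h >> HLL_BUCKET_BITS                   # remaining 54 bits
    # Position of the first set bit among the remaining bits (1..55).
    rank = leading_zeros(rest) - HLL_BUCKET_BITS + 1
    return (bucket, rank)
end
```

A server would keep the maximum rank seen per bucket and estimate the number of distinct clients from those maxima, without ever receiving the UUID itself.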
Thanks. Will you send the project path hash to the server or do the HyperLogLog business on the client? For long-lived installs, this hash will be just as good a user identifier as a UUID.
If no UUID is sent then nothing that depends on the UUID is sent. |
The only information this provides over what we can get just by clients connecting is:
Regarding privacy: as long as the secret salt value isn't leaked, it's impossible to work out what the project path was, even by brute force, since you don't know what the salt value is. Even if the salt value is leaked, you'd still have to reverse a secure hash. The other purpose of the secret salt is so that if the same project path is used by different Julia installs, they won't have the same hash.
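A minimal sketch of the salted-hash idea, assuming SHA-256 and a string salt. The helper name `project_hash` and the exact way salt and path are combined are illustrative assumptions, not the computation used in this PR.

```julia
using SHA

# Hypothetical helper: hash a project path together with a per-install secret salt.
# The NUL separator keeps (salt, path) pairs from colliding by concatenation.
project_hash(salt::AbstractString, path::AbstractString) =
    bytes2hex(sha256(Vector{UInt8}(string(salt, '\0', path))))
```

Two operations in the same project on the same install produce the same value, while the same path on a different install (different salt) does not, so the value is only useful for telling whether two requests came from the same project.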
Telemetry files are per-server, which prevents a bad actor from correlating information across servers to learn something additional about the user (e.g. correlating a client UUID from an authenticated server to an anonymous one to de-anonymize the user). In order to opt out of telemetry entirely, just put `telemetry = false` in the appropriate telemetry file. It's also possible to opt out of sending only project path hashes by putting `project_hash = false` in the telemetry file.
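For illustration, a client could interpret those two settings roughly as follows. This is a sketch, not the actual Pkg code: `telemetry_settings` is a hypothetical helper, it treats missing keys as "enabled", and it assumes Julia 1.6 or later, where the `TOML` standard library is available.

```julia
using TOML

# Hypothetical helper: decide what may be sent, given a telemetry file's contents.
function telemetry_settings(contents::AbstractString)
    info = TOML.parse(contents)
    # Anything other than an explicit `false` leaves the setting enabled.
    enabled = get(info, "telemetry", true) !== false
    send_project_hash = enabled && get(info, "project_hash", true) !== false
    return (telemetry = enabled, project_hash = send_project_hash)
end
```

With this reading, `telemetry = false` switches everything off, while `project_hash = false` alone drops only the project path hash.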