What data should be included in a user download feature #1
Speaking as somebody who works in law, and who has assisted in the creation of a number of GDPR policies and extensions, I will add my two cents here and expand on this when I have more time after work.
Honestly, I would list this out in a checklist under an issue - allowing developers and contributors for this extension to tick items off as the feature is implemented.
Here, this should be linked to the data that Flarum collects at core. However, it becomes trickier when one considers extended use cases. For instance, ignoring extensions entirely, do we classify a username as PII? Particularly when it can (or perhaps previously did) contain a name - I know, for example, that Phenomlab on the forums used to use his full name.
Would argue a point on this one - Data should be personal - therefore the directory would arguably not come into play at all, as this is a list of users. Instead, the data for that specific user should be exported, such as (but not limited to) their topics, posts, avatars, etc.
On other implementations that I have supported, data was exported as a JSON file, which (even with our largest exports) was quite minimal in size. We'd have to consider the format in which we export the data. Whilst we have to provide the data, the format in which it must be provided is not prescribed.
Partially agree. The only complaint I would have here is that we should have a hook for extensions. The reason being, if a user were to request a name change, for example, then this data should also be exported (once supported by the extension), as it contains a potential identifier in the user's name.
Administrative actions such as deletion of posts still retain an entry in the database. This data should also be provided to the end user, as it is still stored.
Agreed, the full administrative action against a user should not be included in the data export report.
This is the correct way to establish a link between the user and the data being requested. The purpose of the data extraction is to provide the requesting owner with an exact "dump" of all data that relates to them from the system it is extracted from. The format is open in the sense that there are no physical restrictions on how this should be presented. I agree with @katosdev that JSON is clearly the way to go here, as its regular, structured format makes it relatively easy to ingest into an Excel spreadsheet, for example.
This is also correct. The language used is intentionally broad so that it provides headroom in terms of the format it can be exported in. The main point to be careful of there is that we do not include information relating to another user within the extract as that in itself would be a violation of GDPR.
This isn't relevant in the sense that you are required to provide ALL records that relate to that particular owner, ensuring that you redact any information within that which does not "belong" to the owner.
I'd agree that, provided there was no attribution to the owner in the sense of name, physical address, email address, SSN etc., then ok. However, this would still have to be extracted and manually parsed to confirm that there is indeed no such attribution. The issue that arises here is one of time and commitment - particularly for large data sets.
Any data relating to the user that references a system outside of Flarum is not in scope. The owner of such information would need to request this from that particular source. It is not the responsibility of the forum owner to provide this, as they aren't even custodians of that data - it does not reside on this particular system - even if it is a link elsewhere.
Now for @katosdev's points
100% agree. This is the only way to ensure any level of consistency with each check
Data should be personal, yes, but if it does not relate in any way to the original owner, it should not (and cannot legally) be provided without the real owner's consent. And taking this route introduces a legal minefield.
Hmm - not necessarily. The key point here is attribution. If the name change request comes from a username where there is no correlation between that and the original owner in the sense of a real name or email address, then there is no legal requirement to provide this. However, I do agree in terms of the hook for other extensions to use.
Good point - unless they have been permanently deleted, they would need to be provided.
@phenomlab small remark, can you push that information on "right to be forgotten" to a new issue to keep this one as minimal as possible?
To @katosdev, @angusmcleod and @phenomlab: this is great! I started work on this feature where:
content:
Is there anything from core that I'm missing? Content in this extension implements an interface, meaning any extension can implement its own data type and register it on this extension. When the queue job is executed, each data type receives the user model and the open zip stream and appends its information to it.
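To make the mechanics above a little more concrete, here is a minimal sketch in Python (not the extension's actual PHP code). It assumes a hypothetical `DataType` contract, two made-up data types, and an `Exporter` standing in for the queue job; none of these names come from Flarum or the extension.

```python
import io
import json
import zipfile
from abc import ABC, abstractmethod


class DataType(ABC):
    """Hypothetical contract that every exportable data type implements."""

    @abstractmethod
    def export(self, user: dict, archive: zipfile.ZipFile) -> None:
        """Append this data type's records for `user` to the open archive."""


class ProfileData(DataType):
    def export(self, user, archive):
        profile = {"username": user["username"], "email": user["email"]}
        archive.writestr("profile.json", json.dumps(profile, indent=2))


class PostsData(DataType):
    def __init__(self, posts):
        self.posts = posts  # stand-in for a database query

    def export(self, user, archive):
        # Only posts authored by the requesting user are written.
        for post in self.posts:
            if post["user_id"] == user["id"]:
                archive.writestr(f"posts/post-{post['id']}.json", json.dumps(post))


class Exporter:
    """Stand-in for the queue job: hands the user and open zip to each type."""

    def __init__(self):
        self._types = []

    def register(self, data_type: DataType) -> None:
        self._types.append(data_type)

    def run(self, user: dict) -> bytes:
        buffer = io.BytesIO()
        with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
            for data_type in self._types:
                data_type.export(user, archive)
        return buffer.getvalue()
```

A third-party extension would then only need to ship its own `DataType` subclass and call `register`, which is the hook other commenters ask for in this thread.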
@luceos Good news re the work. The mechanics of the feature sound like the right approach to me. I think we'll definitely get there in terms of the content, but it will help to hash out some of these legal issues too so we're confident in that approach.

@phenomlab @katosdev Thanks guys, some great points. Perhaps we should attempt to formulate practical heuristic(s), framed in language relevant to software development (as opposed to legal principles), that largely reflect the legal position, so these can be applied both to this initial implementation and to any extensions of it. The heuristic(s) won't be perfect of course, but if the three of us, with our different backgrounds / approaches, could agree on some, it could be quite a useful framework to branch off of for the edge cases.

A significant element there, I think, is sustaining this over time as Flarum grows and this is applied to extensions. If it's possible to look back in a few years' time and say "ok, these guys decided to include / exclude this data based on these heuristics", that would help, as opposed to folks having to parse a lot of legal debate. Looking back on the debate that led up to the Discourse legal tools plugin, I see this issue now, insofar as the thinking that led up to that and the heuristics that were applied in the end are not easy to parse if you're an outsider / looking at it later.

Scope

On the question of scope, i.e. what is "personal" data, whether usernames qualify, whether directory items qualify: @katosdev I think your position is definitely arguable, however the way I see the utility of this feature is to guard against a number of possible interpretations. This is a young law and there are a lot of different authorities that will be applying it. I feel it's too early to say with confidence what the character and scope of application of specific articles and principles will be. This is what I meant by the "maximalist" approach.

I would err on the side of including more than necessary, with the limiting principle of countervailing rights and responsibilities, i.e. as @phenomlab pointed out re records that actually belong to someone else and would require their consent to provide. In terms of trying to formulate a scope heuristic, I would kick that off by formulating it as follows
I would note the IP address in particular here. That turned out to be a bit of an issue in Discourse, which logs IPs in quite a few different places.

@phenomlab @katosdev Please feel free to propose a different or alternate scope heuristic. Generally, I think we should try and formulate them in practical software terms. It doesn't have to be 100% infallible, and there will be edge cases.

Format

On the format front, I've copied what I said about that when we discussed that re Discourse. I agree JSON is a good format. In fact I think it's arguable that a JSON API (as opposed to a download) qualifies. CSV is probably also fine. I'm not sure much will turn on JSON v CSV tbh, but I would also favour JSON.

Copied from https://meta.discourse.org/t/providing-data-for-gdpr/83595/23?u=angus

Concerning the Article 29 Working Party’s Guidelines on the Right to Data Portability I note:

p.s. I've started a separate issue re the right to erasure: #2
The simple answer to this is EVERYTHING that relates to the owner of such information, with information concerning others outside the scope of the collection redacted.
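As a rough illustration of the "everything, with others redacted" rule, the sketch below masks any user reference in an exported record that does not point at the requesting owner. The function name, the `[redacted]` marker and the field names in the usage note are all assumptions made for the example; which fields actually reference other users would have to be decided per table.

```python
REDACTED = "[redacted]"


def redact_other_users(record: dict, owner_id: int, user_ref_fields: set) -> dict:
    """Return a copy of `record` with references to other users masked.

    `user_ref_fields` names the keys that hold a user id (hypothetical;
    supplied by whoever knows the table's schema).
    """
    cleaned = {}
    for key, value in record.items():
        points_elsewhere = value not in (None, owner_id)
        if key in user_ref_fields and points_elsewhere:
            cleaned[key] = REDACTED
        else:
            cleaned[key] = value
    return cleaned
```

For example, a post written by user 7 and later edited by user 9 would keep `user_id` but have `edited_user_id` masked when exported for user 7. Free-text fields that mention other people are not covered by a field-level rule like this and would still need a separate review.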
I need to read over this after I've had a coffee and time to read through it properly.
Based on a quick think:
I don't believe we ask for an address in core?
This is why the extension should be extendable at core, insofar as it should have hooks to allow developers and web masters to build on top and hook in their own extensions for export, for example.
I agree with the maximalist approach to an extent, but as has been mentioned we must be very careful to not include data that may also incorporate that of another user, which as @phenomlab so rightly said, would in itself constitute a data breach.
I would agree, but as I mention above I am unsure (can't honestly remember) whether we ask for an address in core?
IP addresses are a difficult subject, and actually one of very interesting debate in my day job (I am a helpdesk manager for a law firm). We see IP addresses as a 'gray area' in the sense that they are often shared by ISPs. As such, are these really personally identifying information? However, the counter-argument is that they are used within the core for the purpose of identifying and tracking a user. I'd be interested to hear what @phenomlab has to say on this one :)
As @phenomlab mentioned, JSON is widely accepted as it can be easily converted and imported into a wide range of solutions.
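To show what "easily converted" could look like in practice, here is a small sketch that turns a JSON export (a list of records) into CSV that a spreadsheet can open. The helper names and the dotted-column convention for nested values are assumptions for the example, not part of any existing export format.

```python
import csv
import io
import json


def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted column names, e.g. meta.ip."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def json_to_csv(json_text: str) -> str:
    """Convert a JSON array of records into CSV text."""
    rows = [flatten(record) for record in json.loads(json_text)]
    # Collect columns in first-seen order so the header is stable.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing keys are written as empty cells
    return out.getvalue()
```

The conversion is lossy for lists and deeper structures, which is one practical argument for keeping JSON as the primary export and treating CSV as a convenience view.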
Yup, sounds good. We should be thinking long term here.
While it's an interesting question, under the 'maximalist' approach I don't think we have to resolve it, as a user's own IP does not incorporate information about another user.

Ok, heuristic-wise, so far we have:

Scope

Include any record that contains that user's, and only that user's:

Format

JSON
Agreed completely :)
A few comments:
I like the idea of automatically downloading everything with a user id relation, as that will scale better to extensions. However, there should be a way of opting out for certain "administrative record" models.
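A minimal sketch of that idea, using SQLite from Python's standard library purely for illustration (Flarum itself uses PHP models and a different database layer). It assumes the relation is a column literally named `user_id` and that opt-outs are given as a set of table names; both are assumptions for the example.

```python
import sqlite3

# Hypothetical opt-out list for "administrative record" tables.
EXCLUDED_TABLES = {"flags"}


def tables_with_user_id(conn: sqlite3.Connection) -> list:
    """Return the names of all tables that have a user_id column."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    found = []
    for name in names:
        # PRAGMA table_info yields one row per column; index 1 is the name.
        columns = [col[1] for col in conn.execute(f'PRAGMA table_info("{name}")')]
        if "user_id" in columns:
            found.append(name)
    return found


def collect_user_records(conn, user_id, excluded=EXCLUDED_TABLES) -> dict:
    """Gather every row related to user_id, skipping opted-out tables."""
    export = {}
    for table in tables_with_user_id(conn):
        if table in excluded:
            continue
        cursor = conn.execute(
            f'SELECT * FROM "{table}" WHERE user_id = ?', (user_id,))
        columns = [d[0] for d in cursor.description]
        export[table] = [dict(zip(columns, row)) for row in cursor.fetchall()]
    return export
```

Because new tables are picked up automatically, an extension author only has to act when a table should be excluded, which matches the opt-out direction suggested above. Columns that reference a user under a different name would be missed by this naive scan.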
For the Discourse Legal Tools Plugin we took a "maximalist" approach to what data should be included in a download of user data for the purposes of data portability and access.
I created a class that looped through lists of columns for each table, which were then used to cumulatively add data to an object that was then bundled as a CSV file. This piggybacked off of, and expanded, the existing Discourse user data download feature. The lists of columns included in the download are here. Those lists serve as a decent reference list for the same in Flarum.
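For readers unfamiliar with that pattern, the following Python sketch shows the general shape: a per-table list of columns drives what is copied into the output, and anything not listed is left out. The table and column names below are illustrative placeholders, not the plugin's actual lists, and the long `table,column,value` layout is just one possible way to fit several tables into a single CSV.

```python
import csv
import io

# Illustrative column lists only; the real lists live in the plugin's source.
EXPORT_COLUMNS = {
    "posts": ["id", "created_at", "raw"],
    "user_visits": ["visited_at", "posts_read"],
}


def build_csv(records_by_table: dict) -> str:
    """Accumulate the listed columns of each table into one CSV document."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["table", "column", "value"])
    for table, columns in EXPORT_COLUMNS.items():
        for record in records_by_table.get(table, []):
            for column in columns:
                # Columns missing from a record are written as empty values.
                writer.writerow([table, column, record.get(column, "")])
    return out.getvalue()
```

An explicit column list is the opposite trade-off to scanning for every `user_id` relation: it needs maintenance as the schema grows, but nothing is exported unless someone has decided it should be.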
In terms of the thinking behind the approach taken here, I would highlight the reasoning I laid out here, which I've copied below so we can discuss them further in the Flarum context.
I would emphasise “for the purposes of this feature”, as the purpose of this feature is to take a ‘maximalist’ approach to possible interpretations of the GDPR. It does not attempt to parse ‘likely’ approaches. I’ve laid out some of my own views on the ‘likely’ approaches in this topic (which remain unchanged).
The specifics of the reasoning behind this ‘maximalist’ approach are:
The broadest interpretation of A.4.1 (the definition of ‘Personal Data’ in the GDPR) as it applies to Discourse is any record in the db that contains the user’s user_id, i.e.:
Read literally, this definition doesn’t care about how the data is produced (e.g. whether the user is acting or not). It merely requires the data to be related to the user’s identifier in some way.
However, applying that literally would produce a fair amount of duplication (e.g. the records in the directory_items table are duplicative of various other entries).
The point of the extended download is to guard against even the small risk that Article 4.1 could receive a very broad interpretation by some authority or court in Europe.
The factors against including it - size of download, potentially security (?) - do not outweigh the possible benefit of including it.
We also considered whether to include ‘administrative’ records with the user’s user_id such as flags, complaints and staff whispers. We decided against this, reasoning as follows:
They’re already in the territory of information associated with the user purely by their identifier. They are not information about the user per se (i.e. name, email, age, location etc). This is already assuming a wide interpretation of A.4.1.
Whether administrative records intrude on the privacy of other parties, or other relevant concerns (i.e. R. 63.5 & A.15.4) must be determined on a case-by-case basis.
Other parties, such as Facebook, do not include such data in their user download functionality.