Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prototype loading privacy declarations directly from source code #209

Closed
ThomasLaPiana opened this issue Nov 3, 2021 · 20 comments
Closed
Labels
discussion Please consider using Discussions instead; label use for implementation discussion only

Comments

@ThomasLaPiana
Copy link
Contributor

ThomasLaPiana commented Nov 3, 2021

The separate system declarations are a potential burden for users. A good middle-ground between code analysis and what we have now is to co-locate the declarations and the code.

The two implementation methods I can think of for the POC are as follows:

  1. A very python-specific implementation where we ingest the python code, extract the docstrings and then extract the system declarations from there
    • A major issue here is that this is not generalizable to other languages
  2. We go for a more general approach, and treat each source code file as a txt file. We then use regex to look for matching cases and attempt to load it into a system declaration
    • Because we would still expect it to be yaml-like, this would only work in languages with multi-line comments

Option 1: Declaration inside of the docstring

def some_func(some_parameter: str) -> None:
    """
    Do something important with user data.

    system:
      - fides_key: demo_analytics_system
        name: Demo Analytics System
        description: A system used for analyzing customer behaviour.
        system_type: Service
        privacy_declarations:
          - name: Analyze customer behaviour for improvements.
            data_categories:
              - user.provided.identifiable.contact
              - user.derived.identifiable.device.cookie_id
            data_use: improve.system
            data_subjects:
              - customer
            data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
            dataset_references:
              - demo_users_dataset
    """"

    user_data = get_user_data(some_parameter)
    advertise_to(user_data)

Option 2: Declaration as a multi-line comment:

"""
system:
  - fides_key: demo_analytics_system
    name: Demo Analytics System
    description: A system used for analyzing customer behaviour.
    system_type: Service
    privacy_declarations:
      - name: Analyze customer behaviour for improvements.
        data_categories:
          - user.provided.identifiable.contact
          - user.derived.identifiable.device.cookie_id
        data_use: improve.system
        data_subjects:
          - customer
        data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
        dataset_references:
          - demo_users_dataset
"""

def some_func(some_parameter: str) -> None:
    """
    Do something important with user data.
    """

    user_data = get_user_data(some_parameter)
    advertise_to(user_data)

An additional caveat here is that it would be extremely difficult if not impossible for a plugin to help with these annotations, as they're embedded in other source code.

Additional questions to think about:

  • Do we have the user define a system in a system.yaml file, and then attribute all of the nearby code declarations to that?
  • Do they need to define a system-per-declaration? that seems weird, so this ^ option seems better
  • How should this be handled during evaluations? Should it be done at apply/evaluate time, or should there be a separate command that generates a full system.yaml file from the source code declarations?
@ThomasLaPiana ThomasLaPiana self-assigned this Nov 3, 2021
@ThomasLaPiana ThomasLaPiana removed their assignment Nov 9, 2021
@iamkelllly iamkelllly added this to the Fidesctl 1.1 release milestone Nov 17, 2021
@PSalant726
Copy link
Contributor

PSalant726 commented Nov 29, 2021

I think a generalized option is the most valuable path forward. Fidesctl isn't meant to be a Python-specific tool, and so it should support all languages. @ThomasLaPiana your second suggestion is a good first step, and we can decide if it's worth working through an implementation of the feature using that format, but the drawbacks you mentioned are significant.

What if we instead borrowed from other static analysis tools? Users could create the same detailed markup using a single-line comment syntax that we define. Your example above could look like:

# fidesctl-system-declaration start
# 
# fides_key: demo_analytics_system
# name: Demo Analytics System
# description: A system used for analyzing customer behavior.
# system_type: Service
# privacy_declaration_1_name: Analyze customer behavior for improvements.
# privacy_declaration_1_data_categories: [user.provided.identifiable.contact, user.derived.identifiable.device.cookie_id]
# privacy_declaration_1_data_use: improve.system
# privacy_declaration_1_data_subjects: [customer]
# privacy_declaration_1_data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
# privacy_declaration_1_dataset_references: [demo_users_dataset]
# 
# fidesctl-system-declaration end

def some_func(some_parameter: str) -> None:
    """
    Do something important with user data.
    """

    user_data = get_user_data(some_parameter)
    advertise_to(user_data)

This is just a first pass. It's fairly verbose, and tbh I'm not sure either solution really improves the user experience vs. authoring YAML files.

Another idea could be to enforce file-scoped markup. This has the drawback of prescribing how users structure their projects' files, but improves things by allowing annotations to live closer to the code that they annotate. I'm thinking in the context of dataset annotations describing a Postgres DB. The annotations could live in a schema file, like you would see in a Rails app, or in a DB setup SQL script like you might see elsewhere. For example, in a Rails schema.rb file, you could have:

# fidesctl-dataset todos_db "ToDo Database" "Stores ToDo items created by users"
ActiveRecord::Schema.define(version: 2015_05_19_181601) do

  # fidesctl-dataset-collection todos "ToDo items"
  create_table "todos", force: :cascade do |t|
	# fidesctl-dataset-field title "ToDo item's title"
    t.string "title"
	# fidesctl-dataset-field is_completed "ToDo item completion state"
    t.boolean "is_completed", default: false
	# fidesctl-dataset-field created_at "ToDo item creation timestamp" [system.operations]
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
    t.string "session_user_id"
  end

end

This is an incomplete example, but it's already verbose in its own right. In general, a first pass could be to create a few single-line comment "functions" that take "arguments" at specific indices. In a more robust implementation, we could infer a lot of the argument values from the code's structure. This has the huge benefit of being "compilable" (the comments could be used to create YAML files like we expect users to author today). It enables us to simultaneously support inline annotations and manifest file annotations, which I think is critical and would be tricky to do otherwise.

Thinking even further down the line, we could treat this like an entire language. It would further improve the user experience and further reduce hurdles for adoption if we had a supporting language server that could integrate with IDE's. We could recognize single-line comments beginning with fidesctl- and recommend that users install our language server/IDE extension, which would include documentation and code completion for inline annotation comments.

@earmenda
Copy link
Contributor

  • Do we have the user define a system in a system.yaml file, and then attribute all of the nearby code declarations to that?
  • Do they need to define a system-per-declaration? that seems weird, so this ^ option seems better

First thought here is that it makes the most sense for the system to already exist outside of each code declaration. What about giving the option of referencing an existing system within the declaration?

privacy_declarations:
  - name: Analyze customer behaviour for improvements.
    data_categories:
      - user.provided.identifiable.contact
      - user.derived.identifiable.device.cookie_id
    data_use: improve.system
    data_subjects:
      - customer
    data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    dataset_references:
      - demo_users_dataset
    *system_reference: system1*

For simplicity it would be nice to keep the same exact model, so it could just be a field outside of the declarations also:

*system_reference: system1*
privacy_declarations:
  - name: Analyze customer behaviour for improvements.
    data_categories:
      - user.provided.identifiable.contact
      - user.derived.identifiable.device.cookie_id
    data_use: improve.system
    data_subjects:
      - customer
    data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    dataset_references:
      - demo_users_dataset

The project having a base system.yaml file which declarations are added to seems okay to me also. Not sure if there will be complexities around finding this file though(will a config be needed to locate system.yaml?).

@earmenda
Copy link
Contributor

This is just a first pass. It's fairly verbose, and tbh I'm not sure either solution really improves the user experience vs. authoring YAML files.

This is along the lines of what I am thinking. Based on the feedback we've gotten, user's want to do as little as possible but still be able to track their data usages. Even learning the fides model can be a turn off and asking them to create these blocks in their code is just shifting the pain. @PSalant726 mentioned single line comments and that makes the most sense to me as a way to make this feel better. A function could just have these three lines and we could create a declaration out of it:

""" or #
@fides.data_categories: [ "user.provided.identifiable.contact", "user.derived.identifiable.device.cookie_id"]
@fides.data_use: "improve.system"
@fides.data_subjects: ["customer"]

Maybe this defeats the purpose of this issue since we are looking for a middle ground, but I am on the fence about how useful the yaml definitions in code will be.

@PSalant726
Copy link
Contributor

The first decision we make should answer the question "how do we want to address the tester feedback?" We can either:

  1. Take the smallest possible step forward and double-down on YAML declarations, simply move the location of that YAML from dedicated manifest files to inline comments in source code, and continue to iterate on that design over time until we reach a reasonable compromise
  2. Treat this like a fundamental redesign of the developer interface with fidesctl, design a system from scratch that abstracts away the most cumbersome parts of annotating your systems and datasets, and implement/maintain robust tooling to support both first-time and veteran users that want to integrate fidesctl into large and small projects

Both approaches have their merits and their drawbacks. Personally, I'm in favor of the latter, but I fully acknowledge that it represents a tremendous effort.

FWIW - I never felt like this feedback was worth addressing in the first place. IMO, implementing fidesctl is like implementing anything else; you have to learn a new system, make a large one-time investment to get it implemented, and then make small changes over time to maintain that implementation. Having dedicated YAML files feels like the cleanest compromise between allowing fidesctl to maintain a strong hold on the project it annotates/monitors, while staying out of the way and not cluttering the project's code. If testers are complaining about writing YAML, then maybe they aren't thinking deeply about the ramifications of a version of fidesctl without writing YAML. It might be enough for us to consider how we can cause fidesctl evaluate to fail when a new system has been added, or a dataset modified, but changes haven't been made to any manifests.

@ThomasLaPiana
Copy link
Contributor Author

I want to hone back in on one of the potential problems this is solving. While it isn't going to "solve" the problem of them writing the declarations, having the declaration co-located with the code itself has enormous benefits and by itself does reduce friction.

If I went to a random project that was using fidesctl, and I opened their system manifest file I could see all of the declarations, but I would have no idea where any of that functionality actually lives within the codebase. Getting the declarations as close to the code as we can is valuable.

By having it live with the code itself, I can easily see exactly what is going on in that function from a privacy perspective. It is also much easier to remember to update a declaration if it is right next to the function it describes, as opposed to having to go to a system manifest file, figure out which declaration describes that function, and then update it.

I think the solution here is to support both approaches. People should be able to choose whether they want to write the yaml files or if they want to embed the definitions in code. I am completely ok with having the code definitions look different and not follow the exact same syntax as the yaml. I think the following format suggested by @earmenda & @PSalant726 makes perfect sense:

Multi-Line Comment

"""
@start_privacy_declaration
@parent_system: system1*
@name: Analyze Customer Behaviour
@data_categories: [user.provided.identifiable.contact, user.derived.identifiable.decide]
@<other fields>
@end_privacy_declaration
"""
def some_func():
    pass

Multiple Single-Line Comments

#@start_privacy_declaration
#@parent_system: system1*
#@name: Analyze Customer Behaviour
#@data_categories: [user.provided.identifiable.contact, user.derived.identifiable.decide]
#@<other fields>
#@end_privacy_declaration
def some_func():
    pass

YAML in code comments feels pretty wrong, so the simplified format above makes more sense

We can then also support generating yaml files from these definitions. (once they are loaded into the right model, we already have the code to create yaml from that)

@PSalant726 I push back on the idea that this feedback isn't worth addressing. We can't completely remove the overhead of fidesctl for our users, but there are still plenty of places where we can make a better user experience. This definitely falls under that umbrella

@PSalant726
Copy link
Contributor

I see the value in colocated annotations, but it's still predicated on the assumption that humans should be writing them vs. simply reviewing and maintaining them. Maybe it's just my opinion that they shouldn't be handled that way? Nevertheless, I still believe that the real value that we can provide as the core team maintaining fidesctl is to better enable programmatic creation and maintenance of annotations through enhancements to the CLI. We can improve the discoverability and functionality of the generate-dataset and annotate-dataset commands, add new commands to handle guided creation of system & policy annotations + updates to existing dataset annotations, and introduce static analysis tooling that can call attention to outdated annotations.

There's no reason why the colocation features can't/shouldn't exist, but to me it feels like a red herring.

@ThomasLaPiana
Copy link
Contributor Author

@PSalant726 I agree there are other places we can also do work to ease users into using fidesctl, and we can/will open issues for those pain points. But to refocus on the question at hand, we do want the ability to put the declarations in the code, so this POC is still valid and worth doing. This is a low-risk first iteration to get a feel for how this looks/feels, and we can go from there. It's a prototype, and if it turns out that it doesn't feel right or we want to go in a different direction altogether, we can do that. But we need to have something to play around with first to make that decision.

That all being said, are we in agreement on how those annotations should look? It seems like we're all on the same page there and now its down to implementing it

@PSalant726
Copy link
Contributor

Should this work include an implementation of the multi-line comment syntax? Given that we know single-line comments are supported by all languages, I think it would be simplest to start with the single-line syntax only, and determine if support for multi-line comments is worth the additional complexity later on.

Otherwise, I'm in agreement.

@ThomasLaPiana
Copy link
Contributor Author

Let's do the smallest possible iteration and stick with single-line comments in Python as our POC

@iamkelllly iamkelllly modified the milestones: Backlog, fidesctl 1.2 Nov 30, 2021
@NevilleS
Copy link
Contributor

NevilleS commented Dec 1, 2021

Using Python as an example- is there typically really a very clear place in the code where a system declaration would come from? Remember the system declaration specifies:

  • what the system is (e.g. some kind of web app, or database)
  • what the data uses are (e.g. provide the product, perform analytics)
  • what data categories are needed for those data uses (e.g. emails, device IDs)

For this project, for example, is there a single source file where it'd be clear where those comments would live? Would it be...

Alternatively, we might find it a lot more obvious to switch gears a bit and do inline dataset declarations instead of systems? Feels like that might be this file: https://github.com/ethyca/fides/blob/main/fidesctl/src/fidesapi/sql_models.py

@NevilleS
Copy link
Contributor

NevilleS commented Dec 1, 2021

One other potential option where the system declaration might feel "natural" is the package setup for pip/etc.: https://github.com/ethyca/fides/blob/main/fidesctl/setup.py?

In other words, what I'm saying is that I'm currently not optimistic that "inline" system declarations will feel any better than their YAML equivalent. This is because I believe the general "work" of a system declaration is modeling the reason why the system exists, which is something that source code fundamentally doesn't do. I can't think of many examples where you'd be writing source code that says "this is an ecommerce app, so we need to collect someone's address so we can deliver their order to them". Since we can't infer this, there's no way to make that magical.

However, I do think that we could (in theory) reasonably infer what data is being used by the source code by analyzing the source. That's why it might feel more natural to do a dataset declaration inline (which makes the system declaration easier to write, as you can just reference the dataset there).

Regardless, I do think we should just "try something" and see how it feels; treat it as experimental, try a couple options out in practice, then rework it for a final implementation that we want to live with. Whether we do inline system declarations or inline dataset declarations we'll still need to decide some kind of syntax for how to parse those.

@ThomasLaPiana
Copy link
Contributor Author

The reason for trying to do privacy declarations (note the system declaration would still happen elsewhere, and the privacy declarations in the source code would be attributed to a system defined elsewhere) is my thinking that you could write a privacy declaration for each function that is making calls out to external systems (data leaving or coming into the system).

Dataset declarations within the source code is also a cool idea, and very feasible since most projects will have these explicit models written out

@ThomasLaPiana
Copy link
Contributor Author

ThomasLaPiana commented Dec 1, 2021

When trying to get to the heart of the issue of maintainability vs. "getting started", we came up with some of the following ideas:

  1. Implementing a "generate system" and "generate policy" CLI command
  2. More robust documentation around maintaining fidesctl, best practices, identify data uses. Include example use-cases
  3. Adding a "pointer" in source code that describes which privacy declaration describes it
  4. Inline dataset declarations through various frameworks (rails, django, sqlalchemy)
  5. Auto-discover systems through infra
  6. Automated validation (such as Add a database-coverage command #161)

@iamkelllly
Copy link
Contributor

Most specifically relevant to item (2) above, opened issues for additional documentation:

@ThomasLaPiana ThomasLaPiana added the discussion Please consider using Discussions instead; label use for implementation discussion only label Dec 9, 2021
@ThomasLaPiana
Copy link
Contributor Author

When trying to get to the heart of the issue of maintainability vs. "getting started", we came up with some of the following ideas:

  1. Implementing a "generate system" and "generate policy" CLI command
  2. More robust documentation around maintaining fidesctl, best practices, identify data uses. Include example use-cases
  3. Adding a "pointer" in source code that describes which privacy declaration describes it
  4. Inline dataset declarations through various frameworks (rails, django, sqlalchemy)
  5. Auto-discover systems through infra
  6. Automated validation (such as Add a validate-datasets command #161)

We have issues now for everything except for 3 and 4

What is accomplished with 3 that isn't accomplished with a code comment? What is the user benefit?

For 4, this sounds like a nice-to-have but would require its own entire design doc, doesn't seem ready for an issue yet

@NevilleS
Copy link
Contributor

NevilleS commented Dec 9, 2021

Re 3, I can see the utility of that kind of "backpointer" comment reference...

Imagine a system declaration like:

system:
  - fides_key: user_mgmt_service
    name: User Management Service
    privacy_declarations:
      - name: Account Login
         ...
      - name: Data Sharing
         ... 
      - name: Support Chat
         ...

Imagine that this service implements a couple different data uses: the ability to login, the option to share data with some 3rd party integration, and the option to initiate a user support chat, etc. Those are all different types of data uses that would have involve different data categories.

It might be helpful to be able to "tag" in the source code the parts of the code that implement each of those declarations, e.g.

# user_login.py

def login(email, password):
  '''Logs in the user by validating their credentials. See @fides "Account Login" for privacy policy'''

In theory, the tagging can make it easier to be reminded to update the login declaration when changes are made to that file.

@ThomasLaPiana
Copy link
Contributor Author

@NevilleS for the sake of simplicity, wouldn't it actually be easier to put that in the system declaration itself? Then we could instantly parse it out of the system file and check the path for any changes.

If we go the route of having to parse source code for comments, that's a lot more effort for functionally the same thing, since I don't think the subtle nuance of having them in the source vs. the declaration makes a huge difference in functionality.

If we have the source code pointer in the system declaration file, it is also very easy to see which declarations are missing source code pointers

@edthedev
Copy link

edthedev commented Dec 10, 2021

Re the discussion on 3: I think the ideal outcome is being able to report (ideally during a scheduled automated CI/CD step) when new data handling code arrives that isn't covered. If that can be automated (sort of like how the Python coverage command handles unit test coverage) then that's the best case.

An 80% as good alternative is to require a standard comment or decorator in the source code so that when developers copy/paste everything from one module into others, it at least becomes apparent that the comment or decorator is being copy/pasted and re-training can happen (and the changes reviewed) when that is discovered.

@ThomasLaPiana
Copy link
Contributor Author

@edthedev With the feature I proposed, the coverage report would show how many of your system declarations had associated code files. The new system declaration would look like this

system:
  - fides_key: fidesctl_system
    name: Fidesctl System
    code_paths:
      - src/some_code_file.py
      - src/another_code_file.py

This method has the benefit of being language agnostic, and we can then throw errors for when the code_paths section is empty. We could also move it down into the declarations section

@iamkelllly iamkelllly modified the milestones: fidesctl 1.2, Backlog Feb 10, 2022
@ThomasLaPiana
Copy link
Contributor Author

closing this as the discussion is stale and we've determined some other ways forward here

starting with the #341 and the general infra scanning project, will revisit as needed!

@iamkelllly iamkelllly removed this from the Backlog milestone May 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Please consider using Discussions instead; label use for implementation discussion only
Projects
None yet
Development

No branches or pull requests

6 participants