
Epic: Embedded Python VM, Plugins and UDFs #25537

Open
pauldix opened this issue Nov 12, 2024 · 6 comments

Comments

@pauldix
Member

pauldix commented Nov 12, 2024

This is an umbrella for many issues related to adding a Python VM to the database. This will require work in the API, the CLI, and the internals. There should be an easy way for users to define Python-based plugins that run inside the database and are able to receive data, process it, interact with third-party APIs and services, and send data back into the database. Ideally, the runtime would be able to import libraries from the broader Python ecosystem and work with them.

This issue is by no means exhaustive, but it can serve as a jumping-off point for further refinement and detail.

Here are the contexts under which we'd want to run (a sketch of possible entry points follows the list):

  1. On write (or rather on WAL flush: send data to the VM)
  2. On Parquet persist (when persistence is triggered, we'll want to either persist and then run the persisted data through the plugin, or run the data through the plugin and persist the output)
  3. On a schedule (like cron)
  4. On request (bind to /api/v3/plugins/<name> and send the request body and headers to the plugin)
  5. Ad hoc (submitted like a query)
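
For illustration, those entry points might look something like the sketch below; every name and signature is a placeholder, not a settled API:

```python
# Hypothetical plugin entry points -- names and signatures are illustrative only.

def process_wal_flush(ctx, table_batches):
    """Context 1: receives the data from each WAL flush."""

def process_parquet_persist(ctx, parquet_path):
    """Context 2: runs when a Parquet file is persisted."""

def process_schedule(ctx, call_time):
    """Context 3: runs on a cron-like schedule."""

def process_request(ctx, headers, body):
    """Context 4: bound to /api/v3/plugins/<name>."""
```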

Within the plugin context, the Python script should have some API automatically imported that allows it to query the database or write data back to it. We'll also want to have an in-memory key/value store accessible by the script. Each script should have its own sandboxed store.
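
A rough sketch of what that auto-imported surface could feel like from inside an entry point (all of these names are assumptions):

```python
def process_schedule(ctx, call_time):
    # Query back into the database (hypothetical API; row shape assumed to be dicts).
    rows = ctx.query("SELECT host, usage FROM cpu WHERE time > now() - INTERVAL '1 minute'")

    # Write derived data back to the database (hypothetical API).
    ctx.write(table="cpu_rollup", tags={"source": "plugin"}, fields={"row_count": len(rows)})

    # Per-script sandboxed in-memory key/value store (hypothetical API).
    ctx.store["last_run"] = str(call_time)
```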

We'll need an API and CLI for submitting new Python scripts to the DB. We'll also want to collect logs from the scripts and make them accessible via system tables in the query API.
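
As a sketch of that round trip, assuming an HTTP endpoint for uploads and a system table for logs (the endpoint, table name, and port below are all invented):

```python
import requests

TOKEN = "..."  # placeholder auth token

# Upload a plugin (endpoint name is hypothetical).
with open("kafka_consumer.py") as f:
    requests.post(
        "http://localhost:8181/api/v3/plugins",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"name": "kafka_consumer", "code": f.read()},
    )

# Read its logs back through the query API (table name is hypothetical).
resp = requests.get(
    "http://localhost:8181/api/v3/query_sql",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"db": "_internal", "q": "SELECT * FROM system.plugin_logs"},
)
print(resp.json())
```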

We'll need a method for storing secrets and accessing them from plugins when connecting to external services.

We ultimately want these scripts to be user-defined plugins or functions. We'll run a service (like crates.io) for hosting these and will want a method to quickly and easily bring plugins or functions from that service into the database.

Some plugin ideas:

  1. Consume data from Kafka to write into the DB
  2. Collect system metrics to write into the DB
  3. Monitor and alert for specific conditions (threshold values, deadman alerts, etc)
  4. Write to an Iceberg Catalog (i.e. on Parquet persist)

We should create these plugins ourselves to test drive the developer experience of plugin creation, operation, and debugging.
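
For example, idea 3 could be a scheduled plugin along these lines (every API name here is hypothetical, and the secrets store is the one described above):

```python
import requests

def process_schedule(ctx, call_time):
    rows = ctx.query("SELECT max(usage) AS peak FROM cpu WHERE time > now() - INTERVAL '5 minutes'")
    webhook = ctx.secrets["slack_webhook"]  # hypothetical secrets API
    if not rows:
        requests.post(webhook, json={"text": "deadman: no cpu data in 5 minutes"})
    elif rows[0]["peak"] > 90.0:
        requests.post(webhook, json={"text": f"cpu peaked at {rows[0]['peak']}% in the last 5 minutes"})
```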

@pauldix pauldix changed the title Epic: Plugins and UDFs Epic: Embedded Pythong VM, Plugins and UDFs Nov 12, 2024
@philjb philjb changed the title Epic: Embedded Pythong VM, Plugins and UDFs Epic: Embedded Python VM, Plugins and UDFs Nov 12, 2024
@jdstrand
Contributor

jdstrand commented Dec 6, 2024

@pauldix and @david (cc @jacksonrnewhouse) - Is there a specification for this? Looking at the list and thinking through security aspects, at first glance these are interesting from a security POV:

  • "We'll also want to have an in-memory key/value store accessible by the script. Each script should have its own sandboxed store."

    • What does "sandboxed" mean in this context? Is it simply something like namespacing within the kv so scripts don't stomp on each other, or is it something more involved?
  • "Within the plugin context, the Python script should have some API automatically imported that allows it to make queries to the database, or write data out to the database."

    • What's the authz model here? Eg, the scripts will need read or write access to the database, but a script shouldn't be able to create/modify/delete other scripts. This is why we want a 'database' resource vs a 'script' resource in the permissions model; authz for a database write should not imply the ability to create/update/delete scripts within the system. How are script permissions declared? Will they be used with an existing token? Note that 2.x 'tasks' had a mostly serviceable but sometimes problematic design where the task would run with the permissions of the user who created it. This is not ideal, since you have to manage what happens when users are removed or the user's permissions are changed; scripts should have the ability to run with least privilege (eg, 'read-only on this database' as opposed to 'read and/or write on everything the user who uploaded the script had'), etc. OTOH, without reading specs, etc, a script should have an associated token assigned to it. This solves most problems with the 2.x tasks implementation (but there are devils in the details that need not be discussed here) and remains flexible, since database token permissions are mutable.
  • "We'll need to have an API and CLI for submitting new Python scripts to the DB"

    • What is the authz model for this? Will it be exposed via a public HTTP API (in which case what is the permissions model and authz for this)? Is it something only for the local command (eg, a UNIX domain socket, etc)?
  • "We'll also want to collect logs from the scripts and make those accessible via system tables in the query API"

    • I wonder if system tables are the right place? What is the permissions model for accessing system tables? Eg, there is a difference between schema info, statistics for partitions, etc and logs from scripts for users. Some (non-exhaustive) considerations are that admins should be able to see everything, non-admin database users probably should only see logs for their scripts that use their databases (and not logs for other users and databases they don't have access to), etc. Perhaps the easiest path forward is to have a separate token permission for accessing script logs than for system tables. It's tempting to treat read access to a database as allowing read access to scripts that use the database, but that might expose details of scripts from other users who have a different token with read access to the same database. This needs more thought...
  • "We'll need to have a method for storing and accessing secrets in plugins for connecting to services"

    • This reminds me of /api/v2/secrets in 2.x. I wonder if we can avoid exposing a public HTTP API (like /api/v2/secrets is) and instead have some mechanism to inject the secret into the env (eg, more like a k8s secret vs an HTTP GET)? When implementing this, like with database tokens, it probably makes sense to attach specific secrets to scripts in some manner. OTOH, it feels like associating a secret with a script and my idea of "a script should have an associated token assigned to it" (see above) are related enough to be done in the same way, with the same code (ie, associating a token with a script is nothing more than adding the token as a secret and associating the secret with the script)
  • "We ultimately want these scripts to be user defined plugins or functions. We'll run a service (like crates.io) for hosting these and will want a method to quickly and easily bring plugins or functions from that service into the database."

    • Is this a locally run service that is a different process/API/etc that OSS/Pro provides? If so, how are scripts added to the system? If via an API, what is the authz permissions model (the resource should be different from database resources; see below)?
    • If this is not a locally run service, is this instead a service that InfluxData hosts that is meant to hold scripts? What is the authz/permissions model for this remote API? If we are hosting a service:
      a. Are these scripts ones that InfluxData itself writes? Ones that people submit via some sort of PR process? Wondering if this is analogous to, say, telegraf plugins that are gatekept by InfluxData via code reviews, etc (though in this case, the scripts are of course hosted remotely)
      b. Are these meant to be user uploadable without InfluxData review but namespaced in some manner to a particular account/org/email/etc (ie, people can upload their scripts and only their installations can download their scripts)? How is this enforced?
      c. Are these meant to be user uploadable as a community repository where everyone can upload anything?
      d. How do we handle scripts that need different python versions? Dependencies? InfluxDB versions?

    A scripts store is a broad, deep, and very security-sensitive topic, since we don't want to be a malware store or make it easy for people to shoot themselves in the foot. A locally run service has the least security exposure (eg, we need to figure out how to get scripts into the system). A remote service is a high-value target for attack, and the store needs to be appropriately hardened. 'a', 'b' and 'c' are ordered from least to most risk in terms of attack surface and impact. My comments on flux modules have some (but not all!) context for this. OTOH, without reading specs, requirements, etc, 'a' seems most prudent (to start at least). If 'a' is all we ever want, then I would propose to instead ship the InfluxData-reviewed scripts within InfluxDB itself and forgo the hosting (local or remote), but continue to have a facility for users to add their own scripts (see below).

Other questions that come to mind:

  • Depending on the answers above, this is potentially a very big lift with lots of security exposure. What are the phases of rollout? Eg, the MVP, the next pieces, etc
  • Do we plan to add the processing engine to other products at some point? Eg, Clustered, Dedicated and/or Serverless? Do we plan to have a consistent experience? Python is a great choice for users but the security contexts of OSS/Pro/Clustered (ie on prem) vs Dedicated (hosted by InfluxData) and Serverless (multi-tenant, hosted by InfluxData) are very different (see my Google doc from earlier this year). If we are planning on rolling out to distributed InfluxDB, then we probably want to at least consider some high level requirements for those so we have a consistent (from a user perspective) design
  • Related to the last point, how are we defining the runtime environment and its security characteristics? Thinking about arbitrary code exec, SSRF (ie, influxdb reaching out to other hosts from scripts that a user wouldn't otherwise have access to), data exfiltration, etc, etc. In some ways we can simply say for on prem (OSS/Pro and to some extent Clustered) "you are responsible for who can upload scripts and what they do. Scripts are not sandboxed and have full access to the system and network as the user influxdb runs scripts as" (with more appropriate language in documentation). If that is our stance, we need to be very explicit in documentation that this is the intended design and that the security controls around scripts are around who can manage them (another reason for having a separate resource in the permissions model). We won't be able to take that stance for Dedicated and Serverless.
    • I phrased "as the user influxdb runs scripts as" very deliberately. If the system can be designed such that influxdb runs as one user on the system but runs scripts as another user, then we get a lot of security benefit: scripts run as non-root, without access to InfluxDB's files, in a way that is cross-platform (Linux and OSX; Windows would need investigating). This is an implementation detail, but one presented for inspiration when designing/implementing this
  • Python is a great choice for users. I imagine the very first question will be how to use 3rd party dependencies. There's pip and its repository, but there are other repositories as well as dependency managers (pip, pipx, poetry, conda, etc, etc). AIUI, the scripts we're talking about here are quite different from the dependencies that the scripts might use, and so there should be a different mechanism for dependencies. I very strongly suggest we not reinvent a python repository, but how can users leverage 3rd party dependencies? Perhaps we need to allow uploading an (architecture-dependent, to accommodate wheels) virtual env as a zip that is associated with the script and unpacked into a specific (read-only) directory, so the script invocation can . /path/to/script/specific/venv/bin/activate just before it is run (see the sketch after this list)? We need to consider zip bombs, malware in the zip, the zip overwriting files on the system, etc (we can probably devise a scheme for unpacking that won't harm the host by using different users (see above))
  • Another question is what version of the interpreter? Will we support different ones? Where is the python coming from (is it cpython? something else?)? Is it coming from the system?
  • If we aren't using a system python and we're shipping it, what is the maintenance story for it? CPython gets regular security updates and we'll need to make sure our python (and any installed python modules) are kept up to date
  • Presumably the system is going to evolve and scripts that work in one version of InfluxDB might break in a future update of InfluxDB. We can try very hard to always be backward compatible, but we should plan for the inevitable flag days. Should scripts have some metadata in them for InfluxDB version? Python version? Something else?
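
On the virtual env idea above: with an embedded interpreter, the activate script isn't strictly necessary. A minimal sketch, assuming the uploaded zip has already been safely unpacked into a read-only, script-specific directory, is to prepend its site-packages before the script runs (paths and layout are assumptions):

```python
import sys
from pathlib import Path

# Assumed layout: the host unpacked the uploaded venv zip here, after the
# zip-bomb/path-traversal checks discussed above.
venv = Path("/var/lib/influxdb3/plugins/my_plugin/venv")
site_packages = venv / "lib" / "python3.11" / "site-packages"

# Make the script's dependencies importable for this invocation.
sys.path.insert(0, str(site_packages))
```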

I'll stop there as I suspect it's already too many questions (and it's not exhaustive).

@pauldix
Member Author

pauldix commented Dec 9, 2024

@jdstrand The security model for all of this is pretty basic in the open source build. We already have a token setup where users make requests with tokens and they either get full access (i.e. they can do anything on the API) or no access. When users submit plugins to the database, that is what will be checked and the resulting plugin will run with full access to the local DB and server (as it's just a Python VM). We can make the plugin system something that can be turned off via configuration so that it won't run.

The commercial Pro version will have finer grained controls, but we'll define that later based on customer requests and needs.

Plugin code can ultimately come from anywhere. We won't be gating what plugins exist. This is by design: we don't want people to have to submit a PR to a repo we own and then wait for review from us to create their own plugins or share code with others. If we end up setting up a service like Crates.io, it will be publicly accessible on the internet and anyone will be able to create an account and upload plugin code, which could be accessed by others. Just as with any similar service online, we provide no guarantees that plugins uploaded by random people don't contain malware. We will likely have a list of vetted plugins (or ones created by us) for our customers.

My expectation is that plugins will mostly be single files, so we may not need to bother with a service. A simple mechanism in the server that is able to pull from, say, a Gist or GH repo would suffice. I was thinking of a service for the added benefit of being able to have approved plugins and to have one place to go to search for user-created plugins (that doesn't require us explicitly updating it).

We're too early in the process to have many of these things answered. The goal is to get an alpha of the functionality released to the community and then iterate based on feedback and use.

@jdstrand
Contributor

jdstrand commented Dec 11, 2024

  • Another question is what version of the interpreter? Will we support different ones? Where is the python coming from (is it cpython? something else?)? Is it coming from the system?
  • If we aren't using a system python and we're shipping it, what is the maintenance story for it? CPython gets regular security updates and we'll need to make sure our python (and any installed python modules) are kept up to date

Note to self, answering my own question: looking at a very early in-progress PR, we are planning on (/exploring) using https://crates.io/crates/pyo3 to embed a Python interpreter. The docs on that site mention using an existing Python shared library from the system.

I plan to watch how the implementation evolves (we don't have to answer now) wrt what we are embedding (do we depend on the system? will we build it ourselves? grab it from somewhere official? etc), as there is a security maintenance angle here.

Also see: https://pyo3.rs/v0.23.3/python-from-rust.html

@crepererum
Contributor

Sharing this because this may save you some headache:

Note that you may have trouble having multiple interpreters running in a single process, because many extension modules (via C or via Rust) have global state (= static variables) that can only exist once (because the underlying Linux lib is only runtime-linked ONCE into the process). See PEP 489 for more details, esp. the "legacy init" section. PyO3, for example, doesn't support multi-phase init yet (ref PyO3/pyo3#2274).

So I think you have the following options:

  • tell people that they cannot use binary extensions (but that would rule out just about anything, including NumPy, Pandas, Arrow, Tensorflow, PyTorch, ...)
  • tell people they have to wait until PEP 489 is adopted by the relevant libs
  • live w/ a single python interpreter
  • isolate the interpreter into a subprocess.
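
A minimal sketch of that last option, assuming a JSON-over-stdio protocol between the database and the plugin process (the protocol is an assumption, not something PyO3 or InfluxDB provides):

```python
import json
import subprocess
import sys

def run_plugin(plugin_code: str, payload: dict, timeout: float = 30.0) -> dict:
    """Run plugin_code in a fresh interpreter; the code is expected to read a
    JSON payload from stdin and print a JSON result to stdout."""
    proc = subprocess.run(
        [sys.executable, "-I", "-c", plugin_code],  # -I: isolated mode
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return json.loads(proc.stdout)
```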

@jdstrand
Contributor

jdstrand commented Dec 17, 2024

isolate the interpreter into a subprocess.

This has a nice property in that it also allows for the opportunity to run the interpreter under another UID, which could be a meaningful security hardening measure (eg, the database runs as one user (eg, influxdb) and the interpreter subprocess as another (eg, influxdb-processor), which could, for example, prevent a buggy process from deleting database files, etc). I don't have fully formed thoughts on this, but thought I'd mention it.
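
For what it's worth, subprocess.Popen has supported user=/group= since Python 3.9 (the parent needs the privilege to switch users, eg, it starts as root or holds CAP_SETUID/CAP_SETGID); the runner script name below is made up:

```python
import subprocess
import sys

proc = subprocess.Popen(
    [sys.executable, "-I", "plugin_runner.py"],  # plugin_runner.py is hypothetical
    user="influxdb-processor",   # run the interpreter as a dedicated user...
    group="influxdb-processor",  # ...so a buggy plugin can't touch influxdb's files
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
```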

@crepererum
Contributor

I forgot one option in #25537 (comment):

Use a WASM VM like wasmtime and Pyodide. That, however, makes package installation more difficult, since micropip (the special pip that is bundled w/ Pyodide) can only install pure Python packages. So it seems that arrow, pandas, and numpy work, but torch, for example, doesn't (I've tried that using the public REPL).
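
For reference, the installation step inside a Pyodide runtime looks like this (micropip.install is real Pyodide API; top-level await works in the Pyodide console):

```python
import micropip

# Works for pure-Python wheels and for packages Pyodide ships prebuilt
# (e.g. numpy, pandas); fails for packages like torch.
await micropip.install("pandas")

import pandas as pd
```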
