Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Vector more scriptable #1041

Closed
ghost opened this issue Oct 17, 2019 · 7 comments
Closed

Make Vector more scriptable #1041

ghost opened this issue Oct 17, 2019 · 7 comments
Labels
domain: processing Anything related to processing Vector's events (parsing, merging, reducing, etc.) meta: idea Anything in the idea phase. Needs further discussion and consensus before work can begin. needs: approval Needs review & approval before work can begin. needs: requirements Needs a a list of requirements before work can be begin

Comments

@ghost
Copy link

ghost commented Oct 17, 2019

In this issue I want to discuss on high level adding scripting APIs to Vector. It might not be the top priority at the moment, but I'm creating this issue now to give us enough time to think about it and discuss it.

Introduction

Goals and scope of this issue

The goal is to define how ideal APIs should look like, even if they cannot be easily implemented at the moment. This would allow us to have the whole picture in advance to ensure that when we implement separate scripting-related features they play together well.

In the text below JavaScript will be used as the scripting language because I'm familiar with it and it can be implemented on top of QuickJS engine (see #721). Most of the ideas described here can be translated to some other scripting language, for example Lua, if we find out that it fits better.

Intended usage of scripting

I want to highlight that scripting is intended to be used for non-standard things and can't replace native sources/transforms/sinks because a scripting language will always be an order of magnitude slower than native Rust code. However, it still could be indispensable to users who need to do something custom.

Additionally, if we actually arrive at the point where scripting is flexible enough, it would be possible to prototype some new features as scripts before actual high-performant and user-friendly implementation in Rust.

Scriptable components

Overview

We can have all three possible types of scriptable components:

  • scriptable source for generating events from user code
  • scriptable transform for changing, augmenting, or multiplying events
  • scriptable sink for storing events in non-standard ways

Below there are examples of how they can be used.

Config structure

Each of the components needs to have some kind of a handler function which implements the logic. The config could either

  • load the source from a file

    [component]
    handler = "function_name"
    path = "script_path.js"

    where the path by default is relative to the path of the config file, and not to the current working directory to ensure that running vector --config <config path> works with any working directory.

  • or contain inlined source as a string:

    [component]
    handler = "function name"
    source = "<actual JS source>"

I also find it reasonable to allow usage of anonymous handlers for simple cases, where entire source contains (and evaluates to) a single function. In that case the handler field should be skipped and the config could look just like

[component]
source = "function (...) {}"

Sources

The general idea of is similar to #992, but instead of a shell command a JavaScript function is periodically executed and generates either actual event or promise that resolves to an event.

  • Just a counter with state, can be used for tests:

    let counter = 0;
    
    function handler() {
        return ++counter;
    }
  • HTTP API reader with fetch API:

    async function handler() {
        const res = await fetch("http://...");
        const data = await res.json();
    
        return {message: data.message};
    }
  • File reader. Here a decision has to be made on filesystem APIs, as they are not standard. This example uses Node-style readFileSync, while QuickJS provides more low-level filesystem functions. In particular, it can be used to read values from /sys or /proc filesystems:

    function handler() {
        const content = fs.readFileSync("/sys/class/gpio/gpio3");
        return {state: content};
    }

It should also be possible to returned multiple events at once by returning an array or no events by returning null.

Transforms

Scripted transforms are a generalization of #721 with support of promises in addition to normal events. The promise support is mostly needed for the fetch API, which can be used as described below. The promise can resolve to a single event, an array or events, or null. Returning actual event (or array or null) instead of promises needs to be supported too.

Examples:

  • Currency rate conversion:

    let currencyRate = 0;
    let currencyRateFetchedAt = new Date(0);
    
    async function getCurrencyRate() {
        const now = new Date();
        if (now - currencyRateFetchedAt > 1000 * 60 * 60) { // cache rate for one hour
            const res = await fetch("http://currency-api/...");
            const data = await res.json();
            currencyRate = data.rate;
            currencyRateFetchedAt = now;
        }
        return currencyRate;
    }
    
    async function handler(event) {
        return {...event, price: event.price * (await getCurrencyRate())};
    }
  • Merging events from different sources. For example, there could be two sources, one of which produces events containing current room temperature and another containing current atmospheric pressure. They can be combined using the following config:

    [transforms.combiner]
    type = "javascript"
    inputs = ["temperature_source", "pressure_source"]
    handler = "handler"
    path = "..."

    with script source looking like

    let temperature = null;
    let pressure = null;
    
    function handler(event) {
        if (event.hasOwnProperty("temperature")) {
            temperature = event.temperature;
        }
        if (event.hasOwnProperty("pressure")) {
            pressure = event.pressure;
        }
        if (temperature !== null && pressure !== null) {
            return {
                timestamp: event.timestamp,
                message: `Temperature: ${temperature}, pressure: ${pressure}`
            }
        }
    }

Sinks

While scriptable sinks seems to be less important to me than sources and transforms, I think the main use case here not covered by native sinks can be making multi-step HTTP requests, writing to the filesystem using complex logic, or invoking command-line programs with arguments combined using complex logic. The handler should take an events array containing a batch of events and process them either synchronously or asynchronously.

Example:

  • Send HTTP requests requiring temporary authentication tokens:

    let token = null
    
    async function handler(events) {
        if (token === null) {
            // send authentication request with login and password from environment variables
            // and receive the token
            token = await authenticate();
        }
        // send actual request using the token
        await makeRequest(events, token);
    }

Proposed API groups to be implemented

From examples and use cases listed above, I think the following list of APIs provided to the user scripts would be useful:

  • HTTP requests (for JavaScript it is fetch API)
  • Access to environment variables
  • Access to the filesystem
  • Invocation of other programs (similar to system/popen)
  • Logging facilities (ability to log to Vector's standard log, could be useful for debugging)

Questions

  • Should there be optional timeouts for handlers, which would make Vector restart the scripting engine if the timeout is exceeded?
  • Are there missing API groups which could be provided too?
@ghost ghost added the needs: approval Needs review & approval before work can begin. label Oct 17, 2019
@ghost ghost added this to the Improve data processing milestone Oct 17, 2019
@LucioFranco
Copy link
Contributor

This all looks fantastic!

One question:

The handler should take an events array containing a batch of events and process them either synchronously or asynchronously.

This means pretty much we have to do all the batching in rust then invoke the handler with the batch. Does this mean it may be harder to do streaming-based sinks with scripting?

@ghost ghost mentioned this issue Oct 18, 2019
@ghost
Copy link
Author

ghost commented Oct 18, 2019

This means pretty much we have to do all the batching in rust then invoke the handler with the batch. Does this mean it may be harder to do streaming-based sinks with scripting?

The batch size can be set to 1 or a special option can be added to send individual events instead of arrays of events.

@ghost ghost mentioned this issue Dec 18, 2019
@ghost ghost added the needs: requirements Needs a a list of requirements before work can be begin label Dec 18, 2019
@ghost
Copy link
Author

ghost commented Dec 18, 2019

I think this might benefit from some coordination with #1328.

Ideally we want as much of the user-written code as possible to be operating on the topology components instead of individual events. Such operations on entire streams of events can be called vector operations, while operations on individual events can be called scalar operations. In most of the cases the users would be recommended to use the vector operations, resorting to the scalar ones only in very specific cases. This would make it possible to keep most of the flexibility, while maintaining native Rust performance, which is not achievable from any scripting transform.

Then most of our components would turn into functions. This means that while we don't have advanced scripting, adding new components only benefits us, as they could be nicely wrapped into highly-performant functions later.

For example, there could be a function representing http transform component (#1385) and then it could be applied on the streams of the events, which would definitely be faster than a similar function written in a scripting language and acting on individual language.

With this approach in mind, I think it might be beneficial to not rush in implementing as much as possible "scalar" APIs, but rather focus on adding high-quality components that can be later represented as "vectorized" functions acting on the streams.

@binarylogic binarylogic assigned ghost Dec 21, 2019
@binarylogic
Copy link
Contributor

Such operations on entire streams of events can be called vector operations, while operations on individual events can be called scalar operations.

I absolutely love this.

In most of the cases the users would be recommended to use the vector operations, resorting to the scalar ones only in very specific cases. This would make it possible to keep most of the flexibility, while maintaining native Rust performance, which is not achievable from any scripting transform.

Big 👍 from me. I wouldn't mind seeing more concrete examples of this to make sure I'm understanding this correctly.

@ghost
Copy link
Author

ghost commented Jan 2, 2020

Big 👍 from me. I wouldn't mind seeing more concrete examples of this to make sure I'm understanding this correctly.

I was thinking about something like this:

  • events streams are considered to be infinite-dimensional collections of events;
  • each stream allows accessing a named field of its events;
  • this named fields are considered infinite-dimensional vectors of numbers, strings, or other events.
stream = SysLog.new # creates a data new stream
stream.events["field"] # <-- this is an infinite-dimensional vector of values

# for numerical fields
stream.events["a"] = stream.events["a"] + stream.events["b"] # addition of vectors
stream.events["a"] = stream.events["b"] * 2 # multiplication by scalar

# for strings multiplication by negative numbers is not defined, so only addition is possuble
stream.events["a"] = stream.events["a"] + stream.events["b"]

# note that each stream corresponds to different vector space, so that the following is not allowed:
stream1.events["x"] = stream2.events["y"]

# for numeric fields some vector functions could be introduced
stream.events["a"] = cumulative_sum(stream.events["a"]) # get cumulative sum of the elements
stream.events["b"] = lag(stream.events["b"], 3)
stream.events["c"] = running_average(stream.events["c"])

# special functions can return vectors which can be assigned
stream.events["ones"] = identity_function() # assign ones to the field
stream.events["count"] = cumulative_sum(identity_function()) # assign count to the field
stream.events["http_field"] = http_get("https://...", {timeout: 3600}) # create a vector of values obtained using an HTTP request performed each hour

@binarylogic binarylogic removed this from the Improve Data Processing milestone Jul 26, 2020
@binarylogic
Copy link
Contributor

Noting, this will be covered via our WASM roadmap to extend Vector's components. The ideas here are good inspiration for the API we expose.

@binarylogic binarylogic added domain: extensions meta: idea Anything in the idea phase. Needs further discussion and consensus before work can begin. domain: processing Anything related to processing Vector's events (parsing, merging, reducing, etc.) labels Aug 7, 2020
@jamtur01 jamtur01 assigned jamtur01 and unassigned jamtur01 and ghost Oct 23, 2020
@jszwedko
Copy link
Member

jszwedko commented Aug 1, 2022

Closing this since we have the remap and lua transforms as well as the exec source and an issue for an exec sink: #935 . We may revisit WASM in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: processing Anything related to processing Vector's events (parsing, merging, reducing, etc.) meta: idea Anything in the idea phase. Needs further discussion and consensus before work can begin. needs: approval Needs review & approval before work can begin. needs: requirements Needs a a list of requirements before work can be begin
Projects
None yet
Development

No branches or pull requests

4 participants