Make Vector more scriptable #1041

ghost · 2019-10-17T11:30:45Z

In this issue I want to discuss on high level adding scripting APIs to Vector. It might not be the top priority at the moment, but I'm creating this issue now to give us enough time to think about it and discuss it.

Introduction

Goals and scope of this issue

The goal is to define how ideal APIs should look like, even if they cannot be easily implemented at the moment. This would allow us to have the whole picture in advance to ensure that when we implement separate scripting-related features they play together well.

In the text below JavaScript will be used as the scripting language because I'm familiar with it and it can be implemented on top of QuickJS engine (see #721). Most of the ideas described here can be translated to some other scripting language, for example Lua, if we find out that it fits better.

Intended usage of scripting

I want to highlight that scripting is intended to be used for non-standard things and can't replace native sources/transforms/sinks because a scripting language will always be an order of magnitude slower than native Rust code. However, it still could be indispensable to users who need to do something custom.

Additionally, if we actually arrive at the point where scripting is flexible enough, it would be possible to prototype some new features as scripts before actual high-performant and user-friendly implementation in Rust.

Scriptable components

Overview

We can have all three possible types of scriptable components:

scriptable source for generating events from user code
scriptable transform for changing, augmenting, or multiplying events
scriptable sink for storing events in non-standard ways

Below there are examples of how they can be used.

Config structure

Each of the components needs to have some kind of a handler function which implements the logic. The config could either

load the source from a file
```
[component]
handler = "function_name"
path = "script_path.js"
```
where the path by default is relative to the path of the config file, and not to the current working directory to ensure that running vector --config <config path> works with any working directory.

or contain inlined source as a string:

[component]
handler = "function name"
source = "<actual JS source>"

I also find it reasonable to allow usage of anonymous handlers for simple cases, where entire source contains (and evaluates to) a single function. In that case the handler field should be skipped and the config could look just like

[component]
source = "function (...) {}"

Sources

The general idea of is similar to #992, but instead of a shell command a JavaScript function is periodically executed and generates either actual event or promise that resolves to an event.

Just a counter with state, can be used for tests:

let counter = 0;

function handler() {
    return ++counter;
}

HTTP API reader with fetch API:

async function handler() {
    const res = await fetch("http://...");
    const data = await res.json();

    return {message: data.message};
}

File reader. Here a decision has to be made on filesystem APIs, as they are not standard. This example uses Node-style readFileSync, while QuickJS provides more low-level filesystem functions. In particular, it can be used to read values from /sys or /proc filesystems:
```
function handler() {
    const content = fs.readFileSync("/sys/class/gpio/gpio3");
    return {state: content};
}
```

It should also be possible to returned multiple events at once by returning an array or no events by returning null.

Transforms

Scripted transforms are a generalization of #721 with support of promises in addition to normal events. The promise support is mostly needed for the fetch API, which can be used as described below. The promise can resolve to a single event, an array or events, or null. Returning actual event (or array or null) instead of promises needs to be supported too.

Examples:

Currency rate conversion:

let currencyRate = 0;
let currencyRateFetchedAt = new Date(0);

async function getCurrencyRate() {
    const now = new Date();
    if (now - currencyRateFetchedAt > 1000 * 60 * 60) { // cache rate for one hour
        const res = await fetch("http://currency-api/...");
        const data = await res.json();
        currencyRate = data.rate;
        currencyRateFetchedAt = now;
    }
    return currencyRate;
}

async function handler(event) {
    return {...event, price: event.price * (await getCurrencyRate())};
}

Merging events from different sources. For example, there could be two sources, one of which produces events containing current room temperature and another containing current atmospheric pressure. They can be combined using the following config:

[transforms.combiner]
type = "javascript"
inputs = ["temperature_source", "pressure_source"]
handler = "handler"
path = "..."

with script source looking like

let temperature = null;
let pressure = null;

function handler(event) {
    if (event.hasOwnProperty("temperature")) {
        temperature = event.temperature;
    }
    if (event.hasOwnProperty("pressure")) {
        pressure = event.pressure;
    }
    if (temperature !== null && pressure !== null) {
        return {
            timestamp: event.timestamp,
            message: `Temperature: ${temperature}, pressure: ${pressure}`
        }
    }
}

Sinks

While scriptable sinks seems to be less important to me than sources and transforms, I think the main use case here not covered by native sinks can be making multi-step HTTP requests, writing to the filesystem using complex logic, or invoking command-line programs with arguments combined using complex logic. The handler should take an events array containing a batch of events and process them either synchronously or asynchronously.

Example:

Send HTTP requests requiring temporary authentication tokens:

let token = null

async function handler(events) {
    if (token === null) {
        // send authentication request with login and password from environment variables
        // and receive the token
        token = await authenticate();
    }
    // send actual request using the token
    await makeRequest(events, token);
}

Proposed API groups to be implemented

From examples and use cases listed above, I think the following list of APIs provided to the user scripts would be useful:

HTTP requests (for JavaScript it is fetch API)
Access to environment variables
Access to the filesystem
Invocation of other programs (similar to system/popen)
Logging facilities (ability to log to Vector's standard log, could be useful for debugging)

Questions

Should there be optional timeouts for handlers, which would make Vector restart the scripting engine if the timeout is exceeded?
Are there missing API groups which could be provided too?

The text was updated successfully, but these errors were encountered:

LucioFranco · 2019-10-18T15:15:53Z

This all looks fantastic!

One question:

The handler should take an events array containing a batch of events and process them either synchronously or asynchronously.

This means pretty much we have to do all the batching in rust then invoke the handler with the batch. Does this mean it may be harder to do streaming-based sinks with scripting?

ghost · 2019-10-18T16:35:20Z

This means pretty much we have to do all the batching in rust then invoke the handler with the batch. Does this mean it may be harder to do streaming-based sinks with scripting?

The batch size can be set to 1 or a special option can be added to send individual events instead of arrays of events.

ghost · 2019-12-18T19:22:37Z

I think this might benefit from some coordination with #1328.

Ideally we want as much of the user-written code as possible to be operating on the topology components instead of individual events. Such operations on entire streams of events can be called vector operations, while operations on individual events can be called scalar operations. In most of the cases the users would be recommended to use the vector operations, resorting to the scalar ones only in very specific cases. This would make it possible to keep most of the flexibility, while maintaining native Rust performance, which is not achievable from any scripting transform.

Then most of our components would turn into functions. This means that while we don't have advanced scripting, adding new components only benefits us, as they could be nicely wrapped into highly-performant functions later.

For example, there could be a function representing http transform component (#1385) and then it could be applied on the streams of the events, which would definitely be faster than a similar function written in a scripting language and acting on individual language.

With this approach in mind, I think it might be beneficial to not rush in implementing as much as possible "scalar" APIs, but rather focus on adding high-quality components that can be later represented as "vectorized" functions acting on the streams.

binarylogic · 2019-12-28T18:50:59Z

Such operations on entire streams of events can be called vector operations, while operations on individual events can be called scalar operations.

I absolutely love this.

In most of the cases the users would be recommended to use the vector operations, resorting to the scalar ones only in very specific cases. This would make it possible to keep most of the flexibility, while maintaining native Rust performance, which is not achievable from any scripting transform.

Big 👍 from me. I wouldn't mind seeing more concrete examples of this to make sure I'm understanding this correctly.

ghost · 2020-01-02T13:54:08Z

Big 👍 from me. I wouldn't mind seeing more concrete examples of this to make sure I'm understanding this correctly.

I was thinking about something like this:

events streams are considered to be infinite-dimensional collections of events;
each stream allows accessing a named field of its events;
this named fields are considered infinite-dimensional vectors of numbers, strings, or other events.

stream = SysLog.new # creates a data new stream
stream.events["field"] # <-- this is an infinite-dimensional vector of values

# for numerical fields
stream.events["a"] = stream.events["a"] + stream.events["b"] # addition of vectors
stream.events["a"] = stream.events["b"] * 2 # multiplication by scalar

# for strings multiplication by negative numbers is not defined, so only addition is possuble
stream.events["a"] = stream.events["a"] + stream.events["b"]

# note that each stream corresponds to different vector space, so that the following is not allowed:
stream1.events["x"] = stream2.events["y"]

# for numeric fields some vector functions could be introduced
stream.events["a"] = cumulative_sum(stream.events["a"]) # get cumulative sum of the elements
stream.events["b"] = lag(stream.events["b"], 3)
stream.events["c"] = running_average(stream.events["c"])

# special functions can return vectors which can be assigned
stream.events["ones"] = identity_function() # assign ones to the field
stream.events["count"] = cumulative_sum(identity_function()) # assign count to the field
stream.events["http_field"] = http_get("https://...", {timeout: 3600}) # create a vector of values obtained using an HTTP request performed each hour

binarylogic · 2020-08-07T21:23:26Z

Noting, this will be covered via our WASM roadmap to extend Vector's components. The ideas here are good inspiration for the API we expose.

jszwedko · 2022-08-01T18:54:20Z

Closing this since we have the remap and lua transforms as well as the exec source and an issue for an exec sink: #935 . We may revisit WASM in the future.

ghost added the needs: approval Needs review & approval before work can begin. label Oct 17, 2019

ghost added this to the Improve data processing milestone Oct 17, 2019

ghost mentioned this issue Oct 18, 2019

New cmd or exec source #992

Closed

ghost mentioned this issue Oct 18, 2019

Add ability to throttle streaming sinks #1051

Closed

ghost mentioned this issue Dec 18, 2019

New http transform #1385

Open

ghost added the needs: requirements Needs a a list of requirements before work can be begin label Dec 18, 2019

binarylogic assigned ghost Dec 21, 2019

binarylogic removed this from the Improve Data Processing milestone Jul 26, 2020

binarylogic added domain: extensions meta: idea Anything in the idea phase. Needs further discussion and consensus before work can begin. domain: processing Anything related to processing Vector's events (parsing, merging, reducing, etc.) labels Aug 7, 2020

binarylogic mentioned this issue Sep 25, 2020

Building a source/transform/sink guide #4110

Open

jamtur01 assigned jamtur01 and unassigned jamtur01 and ghost Oct 23, 2020

jszwedko removed the domain: extensions label Jan 18, 2022

jszwedko closed this as completed Aug 1, 2022

precisionconage mentioned this issue Jan 8, 2024

Add option to resolve relative paths using config file's path instead of current working directory #19542

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make Vector more scriptable #1041

Make Vector more scriptable #1041

ghost commented Oct 17, 2019 •

edited by ghost

Loading

LucioFranco commented Oct 18, 2019

ghost commented Oct 18, 2019

ghost commented Dec 18, 2019

binarylogic commented Dec 28, 2019

ghost commented Jan 2, 2020

binarylogic commented Aug 7, 2020

jszwedko commented Aug 1, 2022

Make Vector more scriptable #1041

Make Vector more scriptable #1041

Comments

ghost commented Oct 17, 2019 • edited by ghost Loading

Introduction

Goals and scope of this issue

Intended usage of scripting

Scriptable components

Overview

Config structure

Sources

Transforms

Sinks

Proposed API groups to be implemented

Questions

LucioFranco commented Oct 18, 2019

ghost commented Oct 18, 2019

ghost commented Dec 18, 2019

binarylogic commented Dec 28, 2019

ghost commented Jan 2, 2020

binarylogic commented Aug 7, 2020

jszwedko commented Aug 1, 2022

ghost commented Oct 17, 2019 •

edited by ghost

Loading