[Feature Request] Be able to merge streams with different groups #144

melck · 2016-01-14T14:11:24Z

I want to merge streams with differents groups.

This issue is related to this post on google groups for @nathanielc

nathanielc · 2016-01-14T16:30:07Z

@melck Thanks for creating the issue.

Simplified use case from the mailing list. This TICKscript wants to be able to calculate the percentage of user cpu each process is consuming on a given host.

// Get the user cpu consumed for the host
var user = stream.from().measurement('cpu_time_user')
    .where(lambda: "cpu" == 'cpu-total')
    .groupBy('host')
      ....

// Get the per process user cpu consumed.
var puser = stream
    .from()
    .measurement('procstat_cpu_user')
    .groupBy('host', 'name')
     ....

// Here the puser stream is grouped by 'host', and process 'name'
// The user stream is only grouped by 'host'
// The intent is to apply the same user value for each process name per host.
// This way we calculate the percentage of user cpu each process consumes by host.
// Right now this doesn't work and no points are joined.
puser.join(user)
      .as('puser', 'user')
    .eval(lambda: 100.0 * ("puser.value" / "user.value"))
      .as('value')
    .influxDBOut()
     //store the per process percentage of user cpu consumed by host.

My proposed solution is to add an optional on property to the join node. Theon will take a list of dimensions to join on and if on of the parent streams has more dimensions than specified then an outer join is performed for each unique instance of the more specific stream.

puser.join(user)
      .as('puser', 'user')
       // Instruct the join to only join on host.
       // But since puser has a dimensions host and name,
      // the same user value will be used for each unique process name by host.
      // The resultant data points will continue to be grouped by both host and name.
      .on('host')
    .eval(lambda: 100.0 * ("puser.value" / "user.value"))
      .as('value')
    .influxDBOut()
    // Store the per process percentage of user cpu consumed by host.
    // The data points stored will have the tags 'host' and 'name'

This should continue to work for more dimensions. Taking a different example say we were trying to get percentage of network traffic per interface, protocol and port by host.

var packets = stream....
       .groupBy('host', 'interface', 'protocol', 'port')

var interface_packets = stream...
       .groupBy('host','interface')
       .mapReduce(influxql.sum('value')).as('value')

var total_packets = stream...
      .groupBy('host')
      .mapReduce(influxql.sum('value')).as('value')

// Get per interface protocol port by host percentages.
packets.join(total_packets)
       .as('packets', 'total')
       .on('host')
    .eval(lambda: 100.0 * "packets.value" / "total.value").as('value')
    .influxDBOut()
      .measurement('host_network_percentages')
    // store package used for each interface protocol and port as a percentage of the total host packets.
    // The data points stored will have the tags 'host', 'interface', 'protocol' and 'port'

// Get per protocol port by host, interface percentages.
packets.join(interface_packets)
       .as('packets', 'total')
       .on('host', 'interface')
    .eval(lambda: 100.0 * "packets.value" / "total.value").as('value')
    .influxDBOut()
      .measurement('interface_network_percentages')
    // store package used for each  protocol and port as a percentage of the total per host and interface.
    // The data points stored will have the tags 'host', 'interface', 'protocol' and 'port'

While both joins will have the same tags set the value will have a different meaning. As a result they will need to be stored in different measurements.

Thoughts, Is the on property intuitive on how it would work and what it would do?

melck · 2016-01-16T11:19:36Z

I like the way to do it, it's clear and simple but after implementation, it would be nice to have those examples in documentation.

In meantime, i will see in what other cases i need to use this feature and check if it theoricly works

@nathanielc When do you think it would be available with your current solution ?

nathanielc · 2016-02-25T22:07:03Z

@melck Could you give the new join.on feature a spin and let us know how it works? I implemented as discussed above there shouldn't surprises. It was merged just now so you can build from source or wait for the nightly build tonight at midnight PST.

melck · 2016-02-27T13:52:58Z

@nathanielc I will try asap and do a feedback. Thanks

nathanielc · 2016-02-27T14:19:51Z

Great, FYI, I discovered a bug where Kapacitor will hang when using theses
new features. I am working on the fix, so if you run into that just kill
the daemon and start over.
On Feb 27, 2016 6:53 AM, "Melck" notifications@github.com wrote:

@nathanielc https://github.com/nathanielc I will try asap and do a
feedback. Thanks

—
Reply to this email directly or view it on GitHub
#144 (comment)
.

nathanielc added this to the v0.11 milestone Feb 5, 2016

nathanielc mentioned this issue Feb 24, 2016

Implement join.On functionality #257

Merged

nathanielc closed this as completed in #257 Feb 25, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature Request] Be able to merge streams with different groups #144

[Feature Request] Be able to merge streams with different groups #144

melck commented Jan 14, 2016

nathanielc commented Jan 14, 2016

melck commented Jan 16, 2016

nathanielc commented Feb 25, 2016

melck commented Feb 27, 2016

nathanielc commented Feb 27, 2016

[Feature Request] Be able to merge streams with different groups #144

[Feature Request] Be able to merge streams with different groups #144

Comments

melck commented Jan 14, 2016

nathanielc commented Jan 14, 2016

melck commented Jan 16, 2016

nathanielc commented Feb 25, 2016

melck commented Feb 27, 2016

nathanielc commented Feb 27, 2016