
influxdb: Concurrent PeriodicFlusher #2190

Merged
merged 1 commit into master from 2185-flusher-wg on Oct 21, 2021

Conversation

codebien
Contributor

@codebien codebien commented Oct 19, 2021

  • Restored the concurrency support for the InfluxDB output.
  • Small optimization: skip the flush process when the SampleContainer slice is empty.

Closes #2185

@codebien codebien self-assigned this Oct 19, 2021
@inancgumus
Member

On a side note, can you also rename the file from helpers.go (vague) and move the types to their own files: periodic_flusher.go (for the PeriodicFlusher type) and sample_buffer.go (for the SampleBuffer type)?

@na--
Member

na-- commented Oct 19, 2021

Regarding the file name, in general I agree with @inancgumus, but in this specific case I think helpers.go may be the more practical choice. Consider the use case of someone new to k6 and its outputs who wants to write an extension.

If they look at https://pkg.go.dev/go.k6.io/k6/output, the file name doesn't matter: they mostly see the types, and the filenames are at the very bottom (or shown if you click through to the implementation of something).

But if they open https://github.com/grafana/k6/tree/master/output, they see 3 very simple files that should be almost self-explanatory:

  • output/extensions.go - things to do with output extensions
  • output/types.go - some type definitions related to outputs
  • output/helpers.go - helpers to write outputs

Contrast this to something like output/periodic_flusher.go - nobody will know what this beast is and why it's needed until they actually look at the code.

@inancgumus
Member

inancgumus commented Oct 19, 2021

  • output/helpers.go - helpers to write outputs

Fair enough.

What about:

  • output/flusher.go or,
  • output/writer.go.

(or without an -er suffix, so as not to confuse it with an interface)

IMHO, it's practical, makes more sense, and explains what it provides.

@na--
Member

na-- commented Oct 19, 2021

What about:

* `output/flusher.go` or,

* `output/writer.go`.

(or without an -er suffix, so as not to confuse it with an interface)

IMHO, it's practical, makes more sense, and explains what it provides.

Yes, but that explanation only makes sense if someone is familiar with the outputs enough to know that these are actually helpers 😅 Otherwise, flusher and writer might give someone the impression that these are essential parts of the k6 outputs when they are, in fact, very much not essential. They are simple helpers that remove the boilerplate for handling commonly occurring tasks in various outputs. However, they are not required to implement an output (or output extension); you can write one perfectly fine without them, e.g. the cloud output doesn't use them.

@inancgumus
Member

Yes, but that explanation only makes sense if someone is familiar with the outputs enough to know that these are actually helpers 😅 Otherwise, flusher and writer might give someone the impression that these are essential parts of the k6 outputs when they are, in fact, very much not essential. They are simple helpers that remove the boilerplate for handling commonly occurring tasks in various outputs. However, they are not required to implement an output (or output extension); you can write one perfectly fine without them, e.g. the cloud output doesn't use them.

OK, then 😅 You know, when you call something a helper, you can put anything in it. There are no limits.

@na--
Member

na-- commented Oct 19, 2021

You know, when you call something a helper, you can put anything in it. There are no limits.

There are - code reviews 😄 output/helpers.go should contain only helpers that are useful for a big chunk of the outputs k6 has or might have.

As I said, in general, I completely agree with you, just not about this specific case 😅 And k6 definitely suffers from the problem you're worried about in a few places, most notably with lib (https://github.com/grafana/k6/tree/master/lib). We really need to refactor that one, since it's a nasty, meaningless catch-all module that contains everything and the kitchen sink... 😞 And that mess has practical implications: #1302 is a huge PITA... 😞

@mstoykov mstoykov added this to the v0.35.0 milestone Oct 19, 2021
Member

@inancgumus inancgumus left a comment

I'm requesting changes so we don't forget to add @codebien's explanation here.

inancgumus
inancgumus previously approved these changes Oct 19, 2021
Member

@inancgumus inancgumus left a comment

Nice work!

@codebien
Contributor Author

@na-- I also added a log warning about long executions of the flush-metrics function. It could be very noisy in some situations. Do you think it's acceptable from a UX perspective? An experienced user can detect the same thing by inspecting the debug logs in verbose mode.
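
For reference, a hedged sketch of the kind of check described (flushCallback, flushInterval, and logger are assumed names for illustration, not necessarily the actual k6 identifiers):

start := time.Now()
flushCallback()
if took := time.Since(start); took > flushInterval {
	// Warn when a single flush takes longer than the flush interval,
	// since that means flushes have started to pile up.
	logger.WithField("t", took).Warn("the flush operation took longer than the flush interval")
}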

@codebien codebien requested review from na-- and inancgumus October 19, 2021 15:33
@codebien codebien requested a review from na-- October 19, 2021 16:14
Comment on lines 152 to 153
case <-ticker.C:
	limiter <- struct{}{}
Member

@na-- na-- Oct 20, 2021

Hmm this may not be ideal 😞

It will work fine when the InfluxDB instance is able to keep up with the load, but if it ever starts slowing down, the AsyncPeriodicFlusher will be stuck here, since all of the allowed goroutines will have been started and limiter will be full. But metrics data will keep coming to the output, so the next flushCallback() will have more data to flush, and thus more work to do and a bigger chunk to send to InfluxDB. And so on and so forth, potentially taking more and more time to free up a slot in limiter and sending ever bigger chunks of data.

The way it was implemented before, we always started a new goroutine:

for {
	select {
	case <-ticker.C:
		c.wg.Add(1)
		go c.commit()
	case <-ctx.Done():
		c.wg.Add(1)
		go c.commit()
		c.wg.Wait()
		return
	}
}

And we always got the currently buffered data in a local buffer in that goroutine before we potentially waited for the semaphore to have a free slot:
func (c *Collector) commit() {
	defer c.wg.Done()
	c.bufferLock.Lock()
	samples := c.buffer
	c.buffer = nil
	c.bufferLock.Unlock()
	// let's first get the data and then wait our turn
	c.semaphoreCh <- struct{}{}
	defer func() {
		<-c.semaphoreCh
	}()
	// ... the actual network write to InfluxDB follows here
}

We just waited to send the actual network requests until we had a free slot. So the chunk sizes were smaller, but we could potentially have a whole bunch of started goroutines, waiting, with big chunks of data in memory... But at least these chunks would have been roughly the same size, so InfluxDB likely wouldn't have choked on one.

Not sure which is better, and to be honest, given that InfluxDB v1 is dead, it probably doesn't really matter. But if we want to bring the old way back, we can't do it with a new helper; it needs to be done in the InfluxDB output itself. It should be easy to do: we can use the old PeriodicFlusher and just spin up the goroutine immediately at the start of the output's flushMetrics() method 🤷‍♂️

@mstoykov, @codebien, WDYT?

Contributor Author

@codebien codebien Oct 20, 2021

Yeah, it was done this way to keep the same logic as the current sync version (before we decided to split). We have the same problem with the Stop procedure.

With the current code, I have these suggestions:

  1. Add another select that skips requests, so we can avoid getting stuck (not optimal for performance):

select {
case limiter <- struct{}{}:
	go func() {
		....
	}()
default:
	continue LOOP
}

  2. Add a timeout to the requests; that way, we could also guarantee that Stop doesn't get stuck forever (see the sketch below).
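
A hedged sketch of the second suggestion (flushChunk is a hypothetical stand-in for the actual flush call, the 10-second bound is just an example, and the imports context and time plus stats from go.k6.io/k6/stats are assumed):

// Bound each flush with a timeout so that a slow InfluxDB instance
// can't make a single flush (and therefore Stop) wait forever.
func flushWithTimeout(samples []stats.SampleContainer) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	return flushChunk(ctx, samples) // flushChunk: hypothetical network call
}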

Contributor Author

@codebien codebien Oct 20, 2021

and to be honest, given that InfluxDB v1 is dead, it probably doesn't really matter.

This should also be used by the InfluxDB v2 output.

Contributor

@mstoykov mstoykov Oct 20, 2021

This was done this way in influxdbv1 for a couple of reasons, IIRC:

  • influxdbv1 has an upper limit on the size of the POST body, which means that just continuously increasing it is ... probably not great ;)
  • influxdbv1 does at least a non-zero amount of the ingestion while it reads the request, which means big requests take quite a while.
  • from my experiments, pushing multiple requests was faster/better; I tried that after noticing there was unused CPU while influxdb used just 1 core (from a quick look at htop, so grains of salt and all that jazz).

All of this was done by me probably 2 years ago, as far as I remember within a day, as a quick attempt to make the influxdbv1 output more ... stable, as it previously got to the point of writing 50MB+ requests that took upwards of 60s regardless of how you configured influxdb.

Whether the above is applicable to another output is a question I can't answer. The cloud output, for example, has parallel pushing as well, but it happens after things have been aggregated and as such can't be in the PeriodicFlusher (which it also doesn't use).

Now, given that I broke the concurrent writes in influxdbv1 a few months ago when I moved it to an output, it seems to me that it might not have worked all that much better, as nobody has come to complain. Also, if I have to add it again, it will literally be to add go func() { and }() around the part that pushes after the PeriodicFlusher has flushed.

Arguably, the additional change (that I decided not to spend time on, IIRC) is to split the samples so there aren't more than a certain (configurable) amount per request and push those concurrently if necessary; see the sketch below. But this will likely be terrible for some other output that ingests concurrently while we now push to it in multiple requests.
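
A sketch of that chunked concurrent push (pushBatch and maxBatchSize are illustrative assumptions; stats.SampleContainer is from go.k6.io/k6/stats, and sync is the only other import):

func pushChunked(samples []stats.SampleContainer, maxBatchSize int) {
	var wg sync.WaitGroup
	for len(samples) > 0 {
		batch := samples
		if len(batch) > maxBatchSize {
			batch = samples[:maxBatchSize]
		}
		samples = samples[len(batch):]
		wg.Add(1)
		go func(b []stats.SampleContainer) {
			defer wg.Done()
			pushBatch(b) // pushBatch: hypothetical per-output network call
		}(batch)
	}
	wg.Wait()
}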

To be honest, given how easy it is (IMO) to implement whatever concurrent push logic on top of the current PeriodicFlusher, and given that (IMO) it will need to be different depending on the output, I am for scrapping this additional utility type and just writing the 4 lines in each output that needs it. And I still don't know if influxdbv2 will actually be helped by this 🤷.

Additionally, arguably a lot of the problems come from the fact that if you have 1k VUs doing 5 iters/s, you need to write 5k iteration samples a second ... which probably should all be combined into 1 sample; doing that will likely fix all of the problems ;)

Also, AsyncPeriodicFlusher seems like the wrong name ... maybe ConcurrentPeriodicFlusher?

Member

Also, if I have to add it again, it will literally be to add go func() { and }() around the part that pushes after the PeriodicFlusher has flushed.

To be honest, given how easy it is (IMO) to implement whatever concurrent push logic on top of the current PeriodicFlusher, and given that (IMO) it will need to be different depending on the output, I am for scrapping this additional utility type and just writing the 4 lines in each output that needs it. And I still don't know if influxdbv2 will actually be helped by this

Yeah, you are probably right, this new helper seems to do more harm than good after all 😞 Restoring the original async logic, with smaller (even if not constant) chunk sizes, back in the influxdb code and using the old PeriodicFlusher seems to be the best way to go here...

Contributor Author

using the old PeriodicFlusher seems to be the best way to go here...

@na-- just to be sure: are we saying we want to remove the PeriodicFlusher and implement the ticker with the concurrent flush directly in the influxdb output?

Member

@na-- na-- Oct 20, 2021

I think that we should:

  • remove (i.e. not add) NewAsyncPeriodicFlusher
  • leave PeriodicFlusher how it used to be
  • also don't change how the old PeriodicFlusher was used in the influxdb output before this PR
  • leave the semaphore code in there as well, but do everything in the flushMetrics() method after the samples := o.GetBufferedSamples() in a new goroutine (with an added waitgroup to ensure we flush everything before Stop() ends):
    samples := o.GetBufferedSamples()
    o.semaphoreCh <- struct{}{}
    defer func() {
        <-o.semaphoreCh
    }()

I think this would mimic the old InfluxDB behavior, right?
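
A rough sketch of what that flushMetrics() could look like (a hedged reconstruction based on the old Collector code; o.wg, o.semaphoreCh, and o.flushSamples are assumed names, not necessarily the final implementation):

func (o *Output) flushMetrics() {
	samples := o.GetBufferedSamples()
	o.wg.Add(1)
	go func() {
		defer o.wg.Done()
		// Take a semaphore slot only after grabbing the buffered data, so
		// chunks stay roughly the same size even when InfluxDB is slow.
		o.semaphoreCh <- struct{}{}
		defer func() { <-o.semaphoreCh }()
		o.flushSamples(samples) // assumed: the actual write to InfluxDB
	}()
}

Stop() would then call o.wg.Wait() after the final flush, so every in-flight write finishes before the output shuts down.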

Contributor

I agree with @na-- that this should happen for the old influxdb output; whether it should be a priority is a different question, but it should be fairly straightforward.

For the new one (influxdbv2), I am still interested in a real-world test with and without the upstream influxdb library's async writer, instead of the bad benchmark I have written. I would expect that influxdbv2 will handle ingestion better than the old one and maybe even use multiple cores/goroutines for the ingestion without needing to send multiple requests 🤞

Contributor Author

I am still interested in a real-world test with and without the upstream influxdb library's async writer instead of the bad benchmark I have written

An attempt at this can be found in the influxdb v2 PR grafana/xk6-output-influxdb#2

@yorugac
Contributor

yorugac commented Oct 20, 2021

Just a suggestion: once the logic of spawning goroutines in NewAsyncPeriodicFlusher is finalized, it might make sense to add a test checking that AsyncPeriodicFlusher cannot spawn more than concurrency goroutines; a sketch of such a test follows below.
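
A hedged sketch of what such a test could look like (the limiter-channel pattern mirrors the diff under review; the names are illustrative, not the actual k6 test code; assumed imports: sync, sync/atomic, testing, time):

func TestFlusherConcurrencyLimit(t *testing.T) {
	const concurrency = 4
	limiter := make(chan struct{}, concurrency)
	var running, maxSeen int64
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		limiter <- struct{}{} // acquire a slot, as the flusher would
		go func() {
			defer wg.Done()
			defer func() { <-limiter }()
			cur := atomic.AddInt64(&running, 1)
			defer atomic.AddInt64(&running, -1)
			// track the highest number of simultaneously running flushes
			for {
				m := atomic.LoadInt64(&maxSeen)
				if cur <= m || atomic.CompareAndSwapInt64(&maxSeen, m, cur) {
					break
				}
			}
			time.Sleep(time.Millisecond) // simulate a slow flush
		}()
	}
	wg.Wait()
	if maxSeen > concurrency {
		t.Fatalf("observed %d concurrent flushes, limit is %d", maxSeen, concurrency)
	}
}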

Contributor

@yorugac yorugac left a comment

Turning the above ^^ into a "request change"

na--
na-- previously approved these changes Oct 21, 2021
Member

@na-- na-- left a comment

LGTM, and sorry for the whole confusion with me suggesting the async helper 😞

mstoykov
mstoykov previously approved these changes Oct 21, 2021
Contributor

@mstoykov mstoykov left a comment

LGTM, I think we might be (in general) writing too many debug messages in outputs, but this is fine and unrelated to the problem at hand so you can ignore it :)

	return nil
}

func (o *Output) flushMetrics() {
	samples := o.GetBufferedSamples()
	if len(samples) < 1 {
		o.logger.Debug("No buffered samples, skipping the flush operation")
Contributor

No need for a message IMO; this will just catch cases where someone is running a very light script that has really big response times.

yorugac
yorugac previously approved these changes Oct 21, 2021
@codebien codebien dismissed stale reviews from yorugac, mstoykov, and na-- via dd6cef2 October 21, 2021 10:22
@codebien
Contributor Author

@yorugac I added the test for asserting the concurrency and rate limiting.

@codebien codebien requested review from mstoykov, yorugac and na-- October 21, 2021 10:24
@codebien codebien changed the title from Async PeriodicFlusher to influxdb: Concurrent PeriodicFlusher Oct 21, 2021
Contributor

@yorugac yorugac left a comment

Actually, I meant that test for the new periodic flusher type -- now it's specific to the InfluxDB output... 🤔 But that's a good check to have anyway 🙂 LGTM 👍

@codebien codebien merged commit 65322ac into master Oct 21, 2021
@codebien codebien deleted the 2185-flusher-wg branch October 21, 2021 13:01
Development

Successfully merging this pull request may close these issues.

influxdb: Concurrent flush operations for PeriodicFlusher
5 participants