fluent-bit: multi-instance support #1294

Merged
merged 1 commit into from
Jan 6, 2020


JensErat
Contributor

What this PR does / why we need it:

To run multiple plugin instances at the same time, the global plugin
variable needs to be replaced by a map and lookup of the correct plugin
by reading an ID from the function call's context.

Based on the out_multiinstance example in the upstream repository.
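
A minimal sketch of the map-and-ID idea described above, not the code from this PR. The `lokiPlugin` and `registry` types are hypothetical, and the mutex is a conservative assumption (the initialization phase is reported to be sequential). In the real plugin the ID would be stored in and read back from the fluent-bit context rather than passed around directly.

```go
package main

import (
	"fmt"
	"sync"
)

// lokiPlugin is a hypothetical stand-in for one configured output instance.
type lokiPlugin struct {
	tenantID string
}

// registry replaces a single global plugin variable with a map keyed by ID.
type registry struct {
	mu      sync.RWMutex // assumption: guards against concurrent callbacks
	nextID  int
	plugins map[int]*lokiPlugin
}

func newRegistry() *registry {
	return &registry{plugins: make(map[int]*lokiPlugin)}
}

// register stores p and returns the ID that would be placed in the
// call context during initialization.
func (r *registry) register(p *lokiPlugin) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	id := r.nextID
	r.nextID++
	r.plugins[id] = p
	return id
}

// lookup resolves an ID read from the call context back to its plugin.
func (r *registry) lookup(id int) (*lokiPlugin, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	p, ok := r.plugins[id]
	return p, ok
}

func main() {
	reg := newRegistry()
	platform := reg.register(&lokiPlugin{tenantID: "platform"})
	audit := reg.register(&lokiPlugin{tenantID: "audit"})

	for _, id := range []int{platform, audit} {
		if p, ok := reg.lookup(id); ok {
			fmt.Printf("id=%d tenant=%s\n", id, p.tenantID)
		}
	}
}
```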

Signed-off-by: Jens Erat email@jenserat.de

Which issue(s) this PR fixes:

Special notes for your reviewer:

Checklist

  • Documentation added
  • Tests updated

@JensErat
Contributor Author

JensErat commented Nov 20, 2019

Probably to be discussed: I was not successful at getting go build -race running with the plugin. It immediately segfaults during initialization (but not because of a race). Is the Loki client code thread-safe? The initialization phase is definitely sequential (I went through the fluent-bit setup code). I'm not sure whether we might have concurrent calls to plugin.sendRecord(record, timestamp), though.

I see #1283 is working on thread safety for promtail, which uses pretty much the same code.

We're running this code (together with the tenantID change from my other PR) in integration already -- still having a bunch of out-of-order issues (which we also had with a single output), no further issues so far.

@cyriltovena
Contributor

We have #1254 to fix the out of order problem.

Not sure whether it is possible to add some tests for this multi-instance feature?

@pstibrany
Member

> We have #1254 to fix the out of order problem.

That fixes the issues in promtail, while #1247 fixes the problem in the ingesters.

To run multiple plugin instances at the same time, the plugin instance
must be registered and later retrieved.

Additionally, a list of registered plugin instances must be stored for
proper disposal during fluent-bit shutdown.

Based on the [out_multiinstance example] in the upstream repository.

[out_multiinstance example]: https://github.com/fluent/fluent-bit-go/blob/fc386d263885e50387dd0081a77adf4072e8e4b6/examples/out_multiinstance/out.go

Signed-off-by: Jens Erat <email@jenserat.de>
@JensErat
Contributor Author

This is an updated version, merging ideas of @cosmo0920 in #1454 into my work. The fluent-bit side configuration of an ID is indeed not required, but the code can be simplified even further by passing the plugin reference as pointer, not the ID. I was able to remove a whole bunch of lines.
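
A rough sketch of the pointer-as-context approach, together with the list of registered instances kept for disposal at shutdown. This is an illustration under stated assumptions, not the merged code: `lokiPlugin`, `initPlugin`, `flush` and `exit` are hypothetical names, and a plain `interface{}` stands in for the context value that fluent-bit-go hands back to the flush callback.

```go
package main

import "fmt"

// lokiPlugin is a hypothetical stand-in for one output instance.
type lokiPlugin struct {
	id      int // kept only so that log lines can carry an id label
	stopped bool
}

func (p *lokiPlugin) stop() { p.stopped = true }

// instances remembers every registered plugin so that shutdown can
// dispose of all of them.
var instances []*lokiPlugin

// initPlugin creates an instance, records it for shutdown, and returns the
// value that would be stored as the per-instance context.
func initPlugin() interface{} {
	p := &lokiPlugin{id: len(instances)}
	instances = append(instances, p)
	return p
}

// flush recovers the plugin from the context with a type assertion,
// so no ID-to-plugin map lookup is needed.
func flush(ctx interface{}) (int, error) {
	p, ok := ctx.(*lokiPlugin)
	if !ok || p == nil {
		return 0, fmt.Errorf("unexpected context type %T", ctx)
	}
	return p.id, nil
}

// exit stops every registered instance.
func exit() {
	for _, p := range instances {
		p.stop()
	}
}

func main() {
	ctx0, ctx1 := initPlugin(), initPlugin()
	for _, ctx := range []interface{}{ctx0, ctx1} {
		id, err := flush(ctx)
		fmt.Println(id, err)
	}
	exit()
}
```

The type assertion fails cleanly when the context holds something else, which a flush callback could turn into an error return instead of a panic.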

I had a shot at adding some kind of end-to-end test, but this is not a trivial task. cgo and Go tests do not really go together well, so I have to provide some mocks and function wrappers (a totally reasonable amount, but still work to do). At least for the receiver side there is reusable code in promtail. I will not be able to finish this within the next few days though, as I am out of office again for more than a week now. I totally agree we should have tests, but I'd love to see this split off into a follow-up PR, since this blocks at least us and @fmax in #1446 from using upstream releases.

I will have another go at the test code as soon as time allows, but for now I just have some vague ideas on how to test the plugin.
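
One possible shape for the mocks and function wrappers mentioned above, sketched under assumptions: the cgo-facing callback would stay a thin wrapper, while the logic moves into plain Go behind an interface that a test can fake. The `recordSender` interface, `flushAll`, `fakeSender`, and the `sendRecord` signature shown here are all hypothetical and may differ from the plugin's real code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// recordSender is a hypothetical seam between the cgo callback and Go logic.
type recordSender interface {
	sendRecord(record map[string]interface{}, ts time.Time) error
}

// entry pairs one decoded record with its timestamp.
type entry struct {
	record map[string]interface{}
	ts     time.Time
}

// flushAll has no cgo dependency and can be covered by ordinary Go tests.
// It returns the number of records sent before the first error.
func flushAll(s recordSender, entries []entry) (int, error) {
	for i, e := range entries {
		if err := s.sendRecord(e.record, e.ts); err != nil {
			return i, fmt.Errorf("record %d: %w", i, err)
		}
	}
	return len(entries), nil
}

// fakeSender is a mock that records what it was asked to send.
type fakeSender struct {
	sent   []entry
	failAt int // index at which to fail; a negative value disables failures
}

func (f *fakeSender) sendRecord(r map[string]interface{}, ts time.Time) error {
	if f.failAt >= 0 && len(f.sent) == f.failAt {
		return errors.New("simulated failure")
	}
	f.sent = append(f.sent, entry{record: r, ts: ts})
	return nil
}

func main() {
	fake := &fakeSender{failAt: -1}
	entries := []entry{
		{record: map[string]interface{}{"log": "first"}, ts: time.Now()},
		{record: map[string]interface{}{"log": "second"}, ts: time.Now()},
	}
	n, err := flushAll(fake, entries)
	fmt.Println(n, err, len(fake.sent))
}
```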

Unlike cosmo's PR, this also keeps the ID reference for logs. This makes understanding what's going wrong much easier. For example (see the id log label):

level=info caller=out_loki.go:56 id=0 [flb-go]="Starting fluent-bit-go-loki" version="(version=fluent-bit-multi-instance-8e5abdd-WIP, branch=fluent-bit-multi-instance, revision=8e5abdd3)"
level=info caller=out_loki.go:58 id=0 [flb-go]="provided parameter" URL=http://loki-distributor.logging.svc:3100/api/prom/push
level=info caller=out_loki.go:59 id=0 [flb-go]="provided parameter" TenantID=platform
level=info caller=out_loki.go:60 id=0 [flb-go]="provided parameter" BatchWait=5s
level=info caller=out_loki.go:61 id=0 [flb-go]="provided parameter" BatchSize=32000
level=info caller=out_loki.go:62 id=0 [flb-go]="provided parameter" Labels="[job=fluent-bit]"
level=info caller=out_loki.go:63 id=0 [flb-go]="provided parameter" LogLevel=info
level=info caller=out_loki.go:64 id=0 [flb-go]="provided parameter" AutoKubernetesLabels=false
level=info caller=out_loki.go:65 id=0 [flb-go]="provided parameter" RemoveKeys=[]
level=info caller=out_loki.go:66 id=0 [flb-go]="provided parameter" LabelKeys=[HOSTNAME]
level=info caller=out_loki.go:67 id=0 [flb-go]="provided parameter" LineFormat=0
level=info caller=out_loki.go:68 id=0 [flb-go]="provided parameter" DropSingleKey=true
level=info caller=out_loki.go:69 id=0 [flb-go]="provided parameter" LabelMapPath=map[]
level=info caller=out_loki.go:56 id=1 [flb-go]="Starting fluent-bit-go-loki" version="(version=fluent-bit-multi-instance-8e5abdd-WIP, branch=fluent-bit-multi-instance, revision=8e5abdd3)"
level=info caller=out_loki.go:58 id=1 [flb-go]="provided parameter" URL=http://loki-distributor.logging.svc:3100/api/prom/push
level=info caller=out_loki.go:59 id=1 [flb-go]="provided parameter" TenantID=audit
level=info caller=out_loki.go:60 id=1 [flb-go]="provided parameter" BatchWait=5s
level=info caller=out_loki.go:61 id=1 [flb-go]="provided parameter" BatchSize=32000
level=info caller=out_loki.go:62 id=1 [flb-go]="provided parameter" Labels="[job=fluent-bit]"
level=info caller=out_loki.go:63 id=1 [flb-go]="provided parameter" LogLevel=info
level=info caller=out_loki.go:64 id=1 [flb-go]="provided parameter" AutoKubernetesLabels=false
level=info caller=out_loki.go:65 id=1 [flb-go]="provided parameter" RemoveKeys=[]
level=info caller=out_loki.go:66 id=1 [flb-go]="provided parameter" LabelKeys=[TENANT]
level=info caller=out_loki.go:67 id=1 [flb-go]="provided parameter" LineFormat=0
level=info caller=out_loki.go:68 id=1 [flb-go]="provided parameter" DropSingleKey=true
level=info caller=out_loki.go:69 id=1 [flb-go]="provided parameter" LabelMapPath=map[]
[snip]
level=error caller=client.go:236 id=1 component=client host=loki-distributor.logging.svc:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-12-27 16:57:10.205839657 +0000 UTC ignored, reason: 'entry out of order' for stream: {job=\"fluent-bit\"},"
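
A small sketch of how a per-instance id label can be attached to every log line. The log lines above appear to come from a structured key=value logger; the standard library `log` package with a prefix is swapped in here purely to keep the example self-contained, and `newInstanceLogger` and `logLine` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"strings"
)

// newInstanceLogger returns a logger whose every line starts with id=<n>,
// so output from different plugin instances can be told apart.
func newInstanceLogger(w io.Writer, id int) *log.Logger {
	return log.New(w, fmt.Sprintf("id=%d ", id), 0)
}

// logLine is a demonstration helper: it logs one line for the given
// instance and returns the text that was written.
func logLine(id int, format string, args ...interface{}) string {
	var sb strings.Builder
	newInstanceLogger(&sb, id).Printf(format, args...)
	return sb.String()
}

func main() {
	for id, tenant := range []string{"platform", "audit"} {
		fmt.Print(logLine(id, "msg=%q TenantID=%s", "provided parameter", tenant))
	}
}
```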

Contributor

@cyriltovena cyriltovena left a comment


LGTM, I'll add our friend @cosmo0920 as co-author.

3 participants