Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Promtail Push API #1627

Merged
merged 5 commits into from
Mar 23, 2020
Merged

Conversation

rfratto
Copy link
Member

@rfratto rfratto commented Feb 4, 2020

This PR adds a proposal for a Promtail push API.

Feedback is welcome; please insert feedback as a comment in this PR or as a code review against the PR's added doc.

@rfratto rfratto added the proposal A design document to propose a large change to Promtail label Feb 4, 2020
Co-Authored-By: Owen Diehl <ow.diehl@gmail.com>

### Considerations

Users will be able to define multiple `http` scrape configs, but the base URL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense to have pipelines available via suffix instead of base url?
loki/api/v1/push/{A,B,C} instead of {A,B,C}/loki/api/v1/push?.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, could we still consider that Loki-compatible if it's a suffix?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this endpoint difference would actually be a problem, but I may be missing something here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just thinking of the Grafana use-case where it adds the Loki API as a suffix, and allows a base URL with whatever prefix you want. If other endpoints had custom suffixes, Grafana wouldn't be able to function properly.

OTOH, Grafana doesn't push logs to Loki, so this isn't a real concern; I'm just assuming there might be other tooling that works this way.

value must be different for each instance. This allows to cleanly separate
pipelines through different push endpoints.

Users must also be aware about problems with running Promtail with an HTTP
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The load balancing issue is very annoying and makes scaling difficult (users have to implement sharding or something else themselves).

I'm not sure I like it, but we could always support a simple sharding config where you can specify multiple promql endpoints and we can shard between them via labels. Changing this hash ring in such a primitive mode would require service restarts, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good point. I'm tempted to call this out of scope and handle it later on. IMO, the most flexible thing to do would be to use a DNS lookup to find all of our promtails and use that for sharding; we want to be able to support non-Kubernetes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be afraid to support DNS in sharding because of how ephemeral it may be. You'd still likely run into the same problems as you would with traditional round robin load balancing, albeit less often. That's why I was considering a simple inline config that hardcodes a number of endpoints, like:

promtails:
  prom1.<namespace>.svc.cluster.local
  prom2.<namespace>.svc.cluster.local
  prom3.<namespace>.svc.cluster.local

Note: I'm definitely not in love with this, but I'm trying to think of an expedient solution here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Owen and I had a quick chat offline and agreed that we shouldn't try to deal with load balancing ourselves right now. There's a few points to be made here:

  1. If you hardcode endpoints, you can't dynamically scale without rolling all your instances.
  2. If you use DNS, the sharding is eventually consistent and might cause improper writes.
  3. If instances rotate, the old instance might still be buffering some streams. If a new instance sends data for new values for streams before the old instance finishes flushing its buffer, you'll get out of order errors.

This is definitely a problem, but it's really complicated and not worth solving in Promtail right now. The (unfortunate) recommendation will be for people to write their own sharding mechanism in front of Promtail if they need to scale, and just use one Promtail instance otherwise.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think considering it out of scope until you have feedback on your first version is the right way to go, but since we are discussing possibilities, @owen-d has a good idea. Round robin over configured list is a solid option - it's flexible enough to work, and is simple.

That said, I always like when things seamlessly integrate with consul :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not happy with the consensus here. We are going to make it worse and it's already a problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to revisit the assumption that we need to be able to forward logs to multiple promtails.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically with the proposed API, you could chain together Promtails by having one Promtail write to another.

What use case are you thinking about?

to be parsed, but this will generally be faster than the reflection requirements
imposed by JSON.

However, note that this API limits Promtail to accepting one line at a time and
Copy link
Member

@owen-d owen-d Feb 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could support batching easily via multiline http bodies, like

<ts>
<log-line>
<ts>
<log-line>
...

This is similar to the elasticsearch bulk api which uses ndjson:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rfratto does ^ seem reasonable to you?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I'm still partial to just having Promtail expose the same push API as Loki and leaving decisions about text-based APIs to external tooling.

For me, I really want to be able to run this when doing local testing for something:

loki -log.level=debug -config.file=/my/config/file | \
  promtail-cat -lapp=loki http://localhost:1234/loki/api/v1/push

I think your batching idea makes sense, it just drifts a little bit from what I'm hoping to eventually be able to do with this API. (Although I suppose my promtail-cat tool could still exist on top of what you're suggesting here)

Copy link
Member

@owen-d owen-d Feb 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that's a good point. It seems weird to expose an unbatched endpoint, but I guess it's also weird/redundant to do double buffering here and in promtail. I guess the nice part is it's easy to scale out promtails linearly across infra/streams, so the cost of unbuffered pushes is amortized over that. I think you're taking the right approach here and we probably don't want to use the multiline approach.

Copy link
Contributor

@cyriltovena cyriltovena Feb 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're taking a big bet based on preference and possibility to cat logs to promtail.

I would like a benchmark here, batching vs non batching on high throughput. We don't have a good solution for scaling and we're saying it's fine we will scale easily promtail.

Also you can now pipe data to promtail.

Copy link
Member Author

@rfratto rfratto Feb 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm miscommunicating? I do want the solution we build here to support high throughput, and I do want it to be able to support batching.

The idea of a promtail-cat is a completely separate project that lives outside of the Loki repo and that builds on top of an existing high-throughput solution. That separate tool isn't meant for high throughput. It's out of scope for the design here and I probably shouldn't have brought it up.

@grafana grafana deleted a comment from codecov-io Feb 4, 2020
### Option 3: Plaintext Payload

Prometheus' [Push Gateway API](https://github.com/prometheus/pushgateway#command-line)
is cleverly designed and we should consider implementing our API in the same
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Push Gateway has been designed only for difficult metrics collection. Not to support big throughput of metrics. Batching is and will always perform better than this.

This is probably why we don't use this API in cortex to push metrics.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree it's a bad primary approach for the lack of throughput.

Copy link
Contributor

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good document overall ! I think you really pointed out well the problem.

But I'm not satisfied with solution, there is also other alternative where we could run pipelines in Loki configured per tenant/stream this would have at least the benefit of not affecting the roundtripping latency.

@rfratto
Copy link
Member Author

rfratto commented Feb 14, 2020

there is also other alternative where we could run pipelines in Loki configured per tenant/stream this would have at least the benefit of not affecting the roundtripping latency.

That's a good point. My concern with that is users would not have the ability to easily utilize metrics stages. I guess unless Loki could act as a Prometheus remote_read or remote_write?

@cyriltovena
Copy link
Contributor

cyriltovena commented Feb 21, 2020

I think the next step, is for @slim-bean to take action with all feedback, we should not wait for consensus. All feedback have been given.

@slim-bean
Copy link
Collaborator

I will take a look at this next week!

@cyriltovena cyriltovena requested review from slim-bean and owen-d and removed request for owen-d February 21, 2020 17:01
@cyriltovena cyriltovena removed their assignment Feb 21, 2020
@rfratto rfratto assigned rfratto and unassigned rfratto Feb 24, 2020
@mattmendick mattmendick self-assigned this Feb 24, 2020
@mattmendick mattmendick removed their assignment Mar 4, 2020
The preferred implementation choice is copying the Loki API, but 
that wasn't very clear in the document.
@slim-bean
Copy link
Collaborator

We've had a few discussions about this and I would like to move forward with Option 1 for the purpose of the original requirements (mainly being able to send from the docker logging driver to promtail).

I believe Option 4 is also valuable but should be submitted as a separate proposal as it solves a different use case.

Copy link
Collaborator

@slim-bean slim-bean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks @rfratto!

@rfratto rfratto merged commit 18e828c into grafana:master Mar 23, 2020
@rfratto rfratto deleted the rfc-promtail-push-api branch March 23, 2020 21:43
@slim-bean slim-bean mentioned this pull request Jul 3, 2020
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposal A design document to propose a large change to Promtail size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants