
Feature/http sink #67

Merged: 8 commits into noctarius:main, Oct 1, 2023
Conversation

@vidosits (Contributor) commented Sep 24, 2023

  • Implement Sink
  • Document config params and update README
  • Create tests

@vidosits vidosits mentioned this pull request Sep 24, 2023
@vidosits vidosits marked this pull request as ready for review September 24, 2023 20:52
func (h *httpSink) Emit(
	_ sink.Context, _ time.Time, topicName string, key, envelope schema.Struct,
) error {
	delete(envelope, "schema")
@noctarius (Owner):

I wonder if there should be a general property to disable schema sending. I think right now it'd be the only sink not including it. WDYT?

@vidosits (Contributor Author):

I think this mainly depends on where you'd like to take the project. Would you like to have everything or almost everything as a configurable parameter?
If so, then yeah, it would make sense to be able to turn the sending of the schema on/off from the config file.

I can also see a world where the package always emits everything and it's up to the receiver to discard whatever is not needed or if one would like to avoid unnecessary traffic then one could transform the payload with a plugin before emitting.

I think you probably have better things to work on in this project than having to make schema.Struct optional throughout all the sinks just because this one sink and the stdout one don't send it.

I have to be honest and tell you that I've started the http sink by copying the stdout one and that one starts the emitting process by deleting the schema.

For now I think it'd be best if all the sinks behaved the same way (aside from maybe the stdout one, see above), not just for consistency, but because if you decide to make the schema sending optional you'll have an easier time refactoring, without having to wade through code just to find out whether a particular sink omits the schema by default.

I've pushed a commit that adds back in the schema for this sink as well. I'll use a plugin to transform the payload before sending to the sink anyway, so I'll just omit it there.

@noctarius (Owner):

Yeah I agree. I think it could be interesting, since the schema creates quite some overhead and traffic. I also think it may be better to provide the option to create a separate stream of schema data, which could be routed to a schema registry, but I've never used one before so I wouldn't have enough insight into it.

I think for now we can just keep it in and make it up to the receiver to discard it. For traffic reasons it may be interesting to eventually provide the option. Tbh, even stdout should support it, since the idea (next to testing) was to pipe the output somewhere else. That's the reason why you can send any log message to stderr 😁

@vidosits (Contributor Author):

[...] since the schema creates quite some overhead and traffic. [...] For traffic reasons it may be interesting to eventually provide the option

Yeah, this is probably the biggest argument for making it optional.

I think for now we can just keep it in and make it up to the receiver to discard it.

What are your thoughts on discarding it with a Plugin before sending it? We're streaming anywhere between 10k and 100k events per second with timescaledb-event-streamer. Although it's on a private network, I can see how one would like to exclude the schema from being sent to save on ingress/egress/networking costs, e.g. with a cloud provider.
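
For illustration only, a minimal sketch of what such a config-gated (or plugin-based) schema strip could look like; Envelope, stripSchema and the omitSchema flag are hypothetical names, not part of this PR:

package main

import "fmt"

// Envelope stands in for schema.Struct in this sketch: a generic map payload.
type Envelope = map[string]any

// stripSchema drops the "schema" key when omitSchema is set. A hypothetical
// config property (or a payload-transforming plugin) could toggle this before
// the sink serializes and sends the envelope.
func stripSchema(envelope Envelope, omitSchema bool) Envelope {
	if omitSchema {
		delete(envelope, "schema")
	}
	return envelope
}

func main() {
	e := Envelope{
		"schema":  map[string]any{"type": "struct"},
		"payload": map[string]any{"id": 1},
	}
	fmt.Println(stripSchema(e, true)) // map[payload:map[id:1]]
}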

}
}

address := config.GetOrDefault(c, config.PropertyHttpUrl, "http://localhost:80")
@noctarius (Owner):

If you have a separate transport, wouldn't it make sense to remove the protocol part here? Alternatively, instead of using the TLS-enabled prop, you could just check the protocol and, if it's HTTPS, assume TLS is enabled (and just read the remaining properties).

@vidosits (Contributor Author):

You're absolutely right; I was just hacking it together, so the hard-coded one stayed in. I've pushed a commit where we infer the usage of TLS from the URL prefix and, when it indicates HTTPS, take the TLS settings into account.
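
As a rough sketch of that scheme-based inference (the function name and structure are illustrative, not the PR's actual code):

package main

import (
	"fmt"
	"net/url"
	"strings"
)

// useTLS infers whether TLS should be enabled from the configured sink URL,
// mirroring the prefix-based approach described above.
func useTLS(address string) (bool, error) {
	u, err := url.Parse(address)
	if err != nil {
		return false, err
	}
	return strings.EqualFold(u.Scheme, "https"), nil
}

func main() {
	enabled, err := useTLS("https://events.example.com/ingest")
	fmt.Println(enabled, err) // true <nil>
}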

})

go func() {
	if err := http.ListenAndServe(address, nil); err != nil {
@noctarius (Owner):

Wouldn't it be better to use an actual instance and shut it down cleanly in the tear down? That way you also wouldn't be required to use a fixed port number (though I don't think it's a real issue).

@vidosits (Contributor Author):

It absolutely would; that was my first approach as well, but I've written less than 10k lines of Go in total, so I couldn't figure out why my server instance seemed to stop instantly. I could make it work this way, though.

I tried my best on the second attempt, let me know if it's better this way or not. Thanks! :)
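
A minimal sketch of the suggested test-server pattern: httptest.NewServer binds to a free port and is shut down cleanly via Close, so no fixed port number is needed. The handler and request below are illustrative only, not code from the PR:

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

func main() {
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		fmt.Printf("received %d bytes\n", len(body))
		w.WriteHeader(http.StatusOK)
	}))
	defer server.Close() // clean teardown instead of a fire-and-forget goroutine

	// server.URL carries the dynamically assigned address, e.g. http://127.0.0.1:<port>
	resp, err := http.Post(server.URL, "application/json", nil)
	if err == nil {
		resp.Body.Close()
	}
}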

@noctarius (Owner):

I'll have a look! No worries, I'm not a Go master myself, even though I've written quite some Go code by now. I'm originally a Java guy 😊

@noctarius (Owner):

I'll be off until the weekend (a few days' trip). Just want to make sure you know that I'm not ignoring you, but I'll be less online for a few days 😂

@vidosits (Contributor Author):

I'll be off until the weekend (a few days' trip). Just want to make sure you know that I'm not ignoring you, but I'll be less online for a few days 😂

Thanks for letting me know! I didn't think you were ignoring me, and you don't owe me anything anyway, but I appreciate it nonetheless :)

@noctarius (Owner):

While I was out, I actually thought about the fact that we send data to the HTTP endpoint immediately. That means that all queuing has to happen somewhere else, either in here or right behind the HTTP endpoint. It also means we may have to add some back-pressure to slow down the handling of incoming messages (which in turn may prevent us from handling keep-alive messages in time and improve the reconnect functionality).

I think you mentioned you're already using it. Have you load-tested it? Maybe there should be a bit of a warning that long-running operations may cause issues 🤔

@vidosits (Contributor Author) commented Sep 30, 2023

That means that all queuing has to happen somewhere else, either in here or right behind the HTTP endpoint.

Yeah, although if you're using it in a way that requires buffering messages, maybe the HTTP sink is the wrong choice?
I'm not in charge of the direction, but from my POV it's entirely acceptable that this package does one thing and does it well. I'm not saying it's unreasonable to expect queuing to happen in this package, but IMHO it then raises the question whether one would expect this package to provide queuing for the other sinks as well.

For example, Kafka may be down or writing to it may be slow; Debezium provides a queue for that.

In summary, I agree, a queue would be nice, but then maybe it wouldn't have to be HTTP sink specific.
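
(For illustration, a minimal sketch of the kind of bounded-queue back-pressure being discussed; the names and buffer size are made up, this is not code from the PR or the project.)

package main

import (
	"fmt"
	"sync"
)

// A bounded channel acts as the queue: once the buffer is full, the producer
// blocks on send, which is exactly the back-pressure that slows down the
// handling of incoming change events.
func main() {
	events := make(chan string, 8) // small bounded buffer for the example
	var wg sync.WaitGroup

	wg.Add(1)
	go func() {
		defer wg.Done()
		for e := range events {
			// A real consumer would POST e to the HTTP endpoint here.
			fmt.Println("sent:", e)
		}
	}()

	// Producer: blocks when the buffer is full.
	for i := 0; i < 20; i++ {
		events <- fmt.Sprintf("event-%d", i)
	}
	close(events)
	wg.Wait()
}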

I think you mentioned you're already using it. Have you load-tested it?

No proper load testing has happened, but we are streaming 10k-100k changes per second from one of our Timescale clusters and there have been no problems so far whatsoever. That said, the service we're sending these events to is written with Actix and it just ingests the data, so the request doesn't block for long. It was specifically set up that way, because I knew it was up to us to guarantee that we can keep up with the event streamer (because there is no queuing). We haven't seen any issues (so far).

Maybe there should be a bit of a warning that long running operations may cause issues 🤔

Yeah, I guess that could be nice. A warning or a disclaimer, perhaps.

@noctarius (Owner):

10k-100k events per second sounds nice! I'll be ok leaving it that way then. Guess we can add a little note about the HTTP endpoint to make sure people understand it's up to them then! 👍

@noctarius merged commit 5ff11aa into noctarius:main on Oct 1, 2023
10 checks passed
@noctarius (Owner):

Thanks a lot again :) Keep the contributions coming if you have other ideas :)

@vidosits (Contributor Author) commented Oct 1, 2023

No problem, thanks for merging and thanks for the great package!
