NMT doesn't properly handle out of order data #41
It should be noted that this is really only a problem when trying to backfill data (i.e. sending 24 hours of data in the span of a few seconds). Under normal operation we wouldn't see this problem.
We really can't afford to let this block us from getting NMT into production. The way I see it, NSQ is just the wrong solution for this problem. The primary reason for using NSQ was to allow metrics to be buffered to disk when we couldn't process them due to faults. If writing to disk causes the messages to then be sent to consumers out of order, that is a huge problem (it wasn't a problem for kairosdb as we didn't really care about ordering).

Buffering messages for 10 seconds just introduces more issues. Either we ACK the messages immediately and then risk losing up to 10 seconds of data when NMT restarts, or we delay the ACKs for 10 seconds and put additional pressure on NSQ, causing it to buffer to disk and compound the ordering problem. Additionally, under normal operation metrics will generally be spaced at least 10 seconds apart anyway.

So I really feel that we should just accept that NSQ is the wrong tool for this job. We don't have time to change it right now and it will be good enough in the near term. This problem will only manifest when there is an outage causing metrics to be buffered on the collector, in grafana or in NSQ for more than 10 seconds, and even then the worst case scenario is that we drop some metrics for the duration of the outage. Moving forward I think we should just move to using Kafka, as it will also help us address the fault tolerance and scaling issues of NMT.

@Dieterbe thoughts/comments?
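To make the "delay the ACKs" trade-off concrete, here is a minimal sketch using the go-nsq client (the topic and channel names, the 10-second flush interval and the MaxInFlight value are assumptions for illustration, not anything NMT actually does): messages are held un-acked for a window and only FINed after processing, which is exactly what raises the in-flight count on nsqd.

```go
package main

import (
	"log"
	"sync"
	"time"

	nsq "github.com/nsqio/go-nsq"
)

// bufferingHandler holds messages for a window before acking them, so a
// restart only loses un-acked (and therefore redeliverable) data. The cost is
// a higher in-flight count on nsqd for the duration of the window.
type bufferingHandler struct {
	mu      sync.Mutex
	pending []*nsq.Message
}

func (h *bufferingHandler) HandleMessage(m *nsq.Message) error {
	m.DisableAutoResponse() // we will Finish() explicitly after the window
	h.mu.Lock()
	h.pending = append(h.pending, m)
	h.mu.Unlock()
	return nil
}

// flush acks everything buffered so far; in a real consumer this would run
// only after the data has been processed/persisted.
func (h *bufferingHandler) flush() {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, m := range h.pending {
		m.Finish()
	}
	h.pending = h.pending[:0]
}

func main() {
	cfg := nsq.NewConfig()
	cfg.MaxInFlight = 2500 // must cover everything left un-acked during the window (assumed value)

	consumer, err := nsq.NewConsumer("metrics", "nmt", cfg) // hypothetical topic/channel names
	if err != nil {
		log.Fatal(err)
	}
	h := &bufferingHandler{}
	consumer.AddHandler(h)

	// ack in 10-second batches
	go func() {
		for range time.Tick(10 * time.Second) {
			h.flush()
		}
	}()

	if err := consumer.ConnectToNSQD("127.0.0.1:4150"); err != nil {
		log.Fatal(err)
	}
	<-consumer.StopChan
}
```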
yeah it's unfortunate that when we addressed the original scope of the raintank-metrick (nsq) work we didn't have the foresight to anticipate the needs around data aggregation.
what additional pressure? we had this conversation before: nsqd doesn't have a performance penalty from delayed or out-of-order acks. (i have not personally verified that, but a lot of people - nsq users and developers - would be quite surprised if it were the case)

also keep in mind that nsqd is undergoing refactoring to send all data through a WAL, which is basically a disk-backed FIFO (no more separate memory vs disk channels) (see the link in OP); it's just that message delivery will still be based on an out-of-order model. we will probably effectively have an order much closer to natural time sorting if we use this new mechanism. likely we can even achieve guaranteed ordering as long as we have a single thread/connection, make sure messages don't time out and trigger a requeue, and don't do any defers.

i thought you had several arguments against kafka (what were they again?). we seem to agree that nsq will probably be ok for a bit longer, but let's try to get a really detailed picture of our current+future needs before starting a migration to kafka.
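As a rough illustration of the "single connection, no timeouts, no requeues" conditions mentioned above, a minimal go-nsq consumer sketch (the topic/channel names and timeout value are made up). This does not make nsqd guarantee ordering; it only avoids the client-side behaviors that cause reordering.

```go
package main

import (
	"log"
	"time"

	nsq "github.com/nsqio/go-nsq"
)

// A single nsqd connection, one message in flight, a long message timeout,
// and a handler that never requeues or defers: the conditions under which
// delivery order should roughly track queue order.
func main() {
	cfg := nsq.NewConfig()
	cfg.MaxInFlight = 1              // only one un-acked message at a time
	cfg.MsgTimeout = 5 * time.Minute // avoid timeouts that would trigger a requeue (and reordering)

	consumer, err := nsq.NewConsumer("metrics", "nmt", cfg) // hypothetical topic/channel names
	if err != nil {
		log.Fatal(err)
	}
	consumer.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
		// process synchronously; returning nil FINs the message, and we never REQ here
		log.Printf("message at %d: %d bytes", m.Timestamp, len(m.Body))
		return nil
	}))

	// connect to a single nsqd directly rather than discovering many via lookupd
	if err := consumer.ConnectToNSQD("127.0.0.1:4150"); err != nil {
		log.Fatal(err)
	}
	<-consumer.StopChan
}
```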
Memory pressure. If you delay acks then NSQ will consume more memory, resulting in more messages being sent to the diskqueue.
I really like Kafka. We were using it, but moved away from it due to problems with the NodeJS ZooKeeper library. We also had some issues with just keeping ZooKeeper up and running with large message rates, but I see the latest Kafka can now handle tracking the read offsets of consumers itself, which was the majority of the workload for ZooKeeper. http://blog.cloudera.com/blog/2015/07/deploying-apache-kafka-a-practical-faq/

I also agree that NSQ will be ok for a bit longer. I just don't want any more time sunk into trying to make it do something that it is not designed for.
maybe. i'm not sure the case where we ack immediately causes significantly less stress on the disk queue: the delay between a message going into the memqueue and being acked may be too long compared to the rate of inbound messages, and may send the vast majority of messages to the diskqueue anyway even if you ack "quickly".

well, that's good news re kafka. there's also a rich ecosystem around it (spark etc). if we can't actually build something that meets our needs then of course we'll need to switch away at some point, but i think there's value in exploring further how we can mold nsqd to meet our needs, especially with the big WAL refactoring. then again, this could be a timesink and cost more time than just switching to kafka; i'm not quite sure how much time that would take. i presume its Go libraries have matured by now.
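For context on the Kafka option, a minimal sketch using the Shopify/sarama Go client (the library choice and topic name are assumptions; nothing in this thread settles on them). Within a single partition Kafka delivers messages in offset order, which is the ordering property we are missing with NSQ.

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	consumer, err := sarama.NewConsumer([]string{"127.0.0.1:9092"}, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	// read partition 0 of a hypothetical "metrics" topic from the beginning
	pc, err := consumer.ConsumePartition("metrics", 0, sarama.OffsetOldest)
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()

	// messages arrive in strictly increasing offset order within this partition
	for msg := range pc.Messages() {
		log.Printf("offset=%d value=%s", msg.Offset, msg.Value)
	}
}
```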
we should also think about what the desired behavior should be in case the pipeline (whether nsq or kafka) accumulates several seconds', minutes', or hours' worth of data.
This is a good point, though much lower on the priority list than other items. This would be hard to implement in code though, as we know data needs to be processed in order. What I think would work though is if an outage is detected, or simply a large backlog of data is detected:
due to the way nsqd currently spills traffic over from an in-memory channel to a diskqueue-backed channel (by selecting on them), it can arbitrarily reorder your data. ideally, if the in-memory channel is always empty this shouldn't happen, but it may due to minor hiccups. we can see if increasing the size of the memory buffer helps, though obviously then we would incur more data loss in case of an nsqd crash.
@woodsaj confirmed this by feeding data into nsqd in order, with an NMT consumer using 1 concurrent handler, and the data still came out of order.
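A toy demonstration (plain Go, not nsqd's actual code) of the select-over-two-channels mechanism described above: when both the in-memory path and the disk-backed path hold data, select drains them in an arbitrary interleaving.

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	memChan := make(chan int, 10)  // stands in for nsqd's in-memory channel
	diskChan := make(chan int, 10) // stands in for the diskqueue read path

	// messages 0..9 arrive in order, but some "spill over" to the disk path
	for i := 0; i < 10; i++ {
		if rand.Intn(3) == 0 {
			diskChan <- i
		} else {
			memChan <- i
		}
	}

	// the consumer drains both paths via select, so the delivered order can
	// interleave arbitrarily, e.g. 0 1 4 2 5 3 ...
	for n := 0; n < 10; n++ {
		select {
		case m := <-memChan:
			fmt.Println("delivered", m)
		case m := <-diskChan:
			fmt.Println("delivered", m)
		}
	}
}
```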
I've had conversations with Matt Reiferson (of nsq) about how feasible it would be to add simple ordering guarantees to nsqd, even if merely per-topic per nsqd instance. but even that seems quite complex/tricky: it would require a different model for requeues, defers, msg timeouts etc and would be drastically different nsqd behavior, even with nsqio/nsq#625
his recommendation was to use an ephemeral channel to always read the latest data to serve to users from RAM, just dropping what we can't handle, and additionally use a diskqueue-backed channel which you read from and store into something like HDFS, so that you can then use hadoop to properly compute the chunks to store in archival storage (i.e. go-tsz chunks in cassandra) even on out-of-order data.
this seems like far more complexity than we want, though. i do like the idea of separating in-mem data from archival storage, which seems to let us simplify things, but using hadoop to work around poor ordering after the fact...
what we can also do:
however, this means that for update operations we might commit the wrong values if the 2 writes for the same slot happen in the wrong order (though we're not currently doing any updates), and it would also be less RAM-efficient to keep the data in such arrays.
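The proposal under "what we can also do" isn't quoted in full above, but going by the mentions of slots and arrays, here is a minimal sketch of one plausible reading: keep recent data in a fixed-size array indexed by timestamp slot, so a late point just lands in its slot regardless of arrival order. The slot width, buffer size, and all names are assumptions.

```go
package main

import "fmt"

const (
	slotWidth = 10  // seconds per slot (assumed)
	numSlots  = 360 // one hour of 10-second slots (assumed)
)

type slotBuffer struct {
	vals [numSlots]float64
}

// add writes a point into the slot derived from its timestamp. A second write
// to the same slot overwrites the first, which is exactly the "wrong value
// committed on out-of-order updates" caveat mentioned above.
func (b *slotBuffer) add(ts int64, val float64) {
	slot := (ts / slotWidth) % numSlots
	b.vals[slot] = val
}

func main() {
	var b slotBuffer
	// points arriving out of order still end up in the right slots
	b.add(120, 1.0)
	b.add(100, 2.0)
	b.add(110, 3.0)
	fmt.Println(b.vals[10], b.vals[11], b.vals[12]) // 2 3 1
}
```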
note that both approaches above assume ordering of messages is all we need.
in reality, messages from the collectors can contain points for different timestamps (and this is hard to address in the collectors, per AJ), so in NMT we would have to order the actual points, not just the messages.
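A minimal sketch of what ordering the actual points (rather than the messages) could look like; the types and field names are made up, not NMT's. Points from several messages are pooled and then processed in timestamp order.

```go
package main

import (
	"fmt"
	"sort"
)

// point is a hypothetical stand-in for a single metric datapoint.
type point struct {
	ts  int64
	val float64
}

// processInOrder sorts the pooled points by timestamp before handing them on,
// regardless of which message they arrived in or in what order.
func processInOrder(batch []point) {
	sort.Slice(batch, func(i, j int) bool { return batch[i].ts < batch[j].ts })
	for _, p := range batch {
		fmt.Printf("ts=%d val=%g\n", p.ts, p.val)
	}
}

func main() {
	// two messages whose points interleave in time
	msgA := []point{{ts: 100, val: 1}, {ts: 130, val: 4}}
	msgB := []point{{ts: 110, val: 2}, {ts: 120, val: 3}}
	processInOrder(append(msgA, msgB...))
}
```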