crazy memory peak, and pod evicted #1047
@mcuadros Sorry about that. I don't have an explanation for it at this time. Let me ask you some questions. Could you elaborate on:
Does that mean that a subscription started receiving messages from a channel with 10M messages of size <=1kb? That totals 10GB (well, if they are close to 1kb).
So while the above processing was going on, 50k messages total (or is it per machine, so 1M messages?) were being processed by each node. If it is 50k total, that is no more than 500MB; if it was per machine, it would be a total of 10GB. Messages are not kept in memory (at least no longer than 1 sec for the filestore implementation). But I wonder if this could be the Go GC not kicking in because it assumes that there is a lot of memory on that machine? That is, it may have set its target really high because of the amount of memory available. I assume that those reports are 1 per node and that only the streaming server is running there, not the applications, etc.
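As an aside on the GC theory above: one cheap way to test it (a minimal sketch, not something actually run in this thread) is to lower the collector's target inside the streaming server process, either via the GOGC environment variable or programmatically, and to compare the runtime's own memory stats with what k8s reports:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	// Make the GC run when the heap grows 20% over the live set instead of
	// the default 100%. Equivalent to running the process with GOGC=20.
	debug.SetGCPercent(20)

	// Dumping MemStats periodically helps tell "heap actually in use" apart
	// from "memory the runtime has not yet returned to the OS".
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapAlloc=%dMB HeapIdle=%dMB Sys=%dMB NextGC=%dMB\n",
		m.HeapAlloc>>20, m.HeapIdle>>20, m.Sys>>20, m.NextGC>>20)
}
```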
First, don't worry, this is a staging environment and this was a test run. The consumers are allocated in another namespace with dedicated nodes; the nats-streaming pods are in a services namespace with other services like NATS. The consumers were consuming the records (at a slow rate, due to the nature of the job) and producing a new message with the result in another channel; that channel is produced to at a high rate and consumed by another kind of consumer. The messages lost were in the channel where the consumers were dumping the results. The reports belong to the pod itself, not the node. Checking the graphs of the node I figured out that the 3 nodes of the streaming cluster were scheduled on the same Kubernetes node.
As you can see in the graph, the memory explosion was a chain effect on the cluster: one node went down and the effect moved on to the next one. Also, something very relevant: from the moment I started to publish events, the memory was growing. The MaxInflight is set to zero.
"Good". So if you can reproduce we could maybe get some metrics. If you are embedding the NAT Server (and not connecting to a remote NATS cluster), then you could add the
That is a bit confusing. If the consumption is at a slow rate (due to the nature of the job) and each consumed message produces a new message in another channel (the result of processing), how can that be "producing at a high rate"? Meaning that the producing rate here should be as slow as the consuming rate, no?
Mind sharing the server config?
You are using filestore (or SQL), aren't you? There is a defect that allows the MEMORY store to be used in clustering mode although it is not supported. I wonder if that's what is happening here. Because with file/SQL, the messages are stored on disk and not kept in memory (except for some small cache), so Go should GC them. Again, it could be that the GC is not kicking in fast enough because of the apparent amount of memory available.
This is not a valid value and the server would reject the subscription. I am guessing that you meant that you did not specify it, which means that the default of 1024 is used.
I am running the NATS and STAN operators, so it is hard to debug anything. The server config is empty besides this:
This is the NatsStreamingCluster resource: I am using the filestore writing to a remote SSD.
This is the nats-streaming operator, right?
So yes, it is file based. If you want to force sync (which will dramatically reduce performance), you would need to add that to the configuration file, because I think not many options can be set with the operator. So the config could be:
But to explain the memory usage, I am not sure how to go about it without using the profiler.
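Since the deployment turns out to be embedded (see the next comment), a rough Go sketch of forcing file sync programmatically, rather than via a config file, might look like the following. The helper and field names (GetDefaultOptions, FileStoreOpts.DoSync, etc.) are my assumption about the nats-streaming-server server package and should be verified against the version in use:

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"

	stand "github.com/nats-io/nats-streaming-server/server"
	"github.com/nats-io/nats-streaming-server/stores"
)

func main() {
	opts := stand.GetDefaultOptions()
	opts.ID = "test-cluster"               // placeholder cluster ID
	opts.StoreType = stores.TypeFile       // assumption: select the file store programmatically
	opts.FilestoreDir = "/data/stan"       // placeholder data directory
	opts.FileStoreOpts.DoSync = true       // assumption: fsync on flush, at a performance cost

	// Run the embedded NATS + NATS Streaming servers.
	nopts := stand.NewNATSOptions()
	s, err := stand.RunServerWithOpts(opts, nopts)
	if err != nil {
		log.Fatal(err)
	}
	defer s.Shutdown()

	// Block until terminated so the deferred Shutdown runs cleanly.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt, syscall.SIGTERM)
	<-sig
}
```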
I was able to replicate it; what I did this time was:
I am deploying an embedded version, and I will try to debug the problem. Also, during a review of the config I saw that, due to a misconfiguration, all the nodes were using the same disk; that's why they were scheduled on the same node.
This corresponds to the top of the peak; as you can see, the memory in the heap is about half of the memory reported by k8s.
And this is the channelz output at (more or less) the same time as the heap top. Let me know if you need more info; I have the heap and the channels for every node every 30 secs for the whole execution.
From channelz I see more than 1 consumer. Is that the queue group you are referring to? I see 53 members on the queue "queue_name": "analyzer.analyze_events:analyzer.analyze_events". Both the channelz output and the top result you provided show that most of the memory is used because of message redelivery. It seems that your queue members are very slow to consume messages but they have the default MaxInflight of 1024. My guess is that the server is consistently trying to redeliver messages to those queue subscriptions. A few things to try:
Reducing the MaxInflight to 1 would make the second issue listed above less of a problem, since it would be at most 1 message per member. Note that the current heap is at about 4GB, but it seems that about 12GB were allocated during the lifetime of this process; some will be garbage collected, I presume. I never had to tweak the Go garbage collection settings, so hopefully we don't have to get there if we solve/understand better why there are so many redeliveries.
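To make the MaxInflight/AckWait point concrete, here is a minimal stan.go queue-subscriber sketch limited to one in-flight message per member; the cluster ID, client ID and URL are placeholders, and the subject/queue names are only inferred from the channelz output quoted above:

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"

	stan "github.com/nats-io/stan.go"
)

func main() {
	sc, err := stan.Connect("test-cluster", "analyzer-worker-1",
		stan.NatsURL("nats://example-nats:4222")) // placeholder connection details
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	// MaxInflight(1): the server sends at most one unacknowledged message to
	// this member, so slow processing does not pile up redeliveries.
	// AckWait must exceed the worst-case processing time, otherwise the
	// server redelivers while the member is still working.
	_, err = sc.QueueSubscribe("analyzer.analyze_events", "analyzer.analyze_events",
		func(m *stan.Msg) {
			// ... long-running processing here ...
			m.Ack()
		},
		stan.MaxInflight(1),
		stan.SetManualAckMode(),
		stan.AckWait(2*time.Hour), // assumption: tasks can take hours, per the thread
		stan.DurableName("analyzer"),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Wait for termination so the deferred Close() can run.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt, syscall.SIGTERM)
	<-sig
}
```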
I already tried reducing the MaxInflight to the number of goroutines taking care of the tasks, and the result was almost the same. Also, to let you know, the tasks may take several minutes or even hours. About the Close(): we have a memory problem in the analyzer/consumer parts and they are killed for memory, so sadly we can't close the connections properly. So this makes the problem even worse.
How can that be? The top shows that it is all about the server trying to redeliver messages. And the slice size is based on the number of unacknowledged messages, so the more messages delivered but unacknowledged, the bigger the slice allocation will be.
I understand, but having a MaxInflight greater than 1 in this case is counterproductive. The server will send messages that the applications have no chance to process for a while. Furthermore, when a message is sent to a member while this member is already spending a long time processing an older message, that message is not available to other members that you may start in order to increase parallel processing.
Again, setting MaxInflight to 1 may help reduce the memory usage of each member because they deal with 1 message at a time. If they are killed due to memory constraints, this is a problem, especially if each member had thousands of unacknowledged messages.
In GKE, a cluster of 3 nodes, deployed with the operator, persisting to a network SSD disk.
In the middle of processing a queue, from a single machine, of 10M messages of 1kb or less each, plus a workload from 20 machines publishing around 50k messages of 10kb each, all the nodes went to a crazy memory peak and k8s evicted them.
As a result, many of the messages were lost.
What is the origin of this memory? Aren't the messages regularly persisted to disk?
The running image is nats-streaming:0.17.0 and the operator configuration is: