
Large memory usage by gcloud-logging API. #1449

Closed
gregw opened this issue Dec 6, 2016 · 37 comments
Assignees
Labels
api: logging, performance

Comments

@gregw

gregw commented Dec 6, 2016

The logging API has a large memory footprint, with approximately 4KB of memory consumed for each log entry generated.

This was measured with the following program:

package com.google.cloud.runtimes.jetty9;

import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

public class StackDriverLogging {

  static void init() throws Exception {
    Logger log = Logger.getLogger(StackDriverLogging.class.getName());
    log.info("test info");
    log.fine("test fine");

    long last = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();

    // Run 50 times to ensure everything is hot and any queues are max allocated
    for (int r=50;r-->0;) {
      long total = Runtime.getRuntime().totalMemory();
      long free = Runtime.getRuntime().freeMemory();
      long used = total - free;
      long delta = used - last;
      last = used;
      System.err.printf("total(%d)-free(%d)=used(%d) delta=%d%n", total, free, used, delta);

      for (int i = 1000; i-- > 0;)
        log.info("test info");
      
      if (r==10) {
        System.err.println("DRAINING...");
        // Wait a minute so all queued logs can drain out
        TimeUnit.MINUTES.sleep(1);
        System.gc();
        System.err.println("LAST 10...");
      }
      
      Thread.sleep(10);
    }

    System.err.println("DUMP!");
    Thread.sleep(1000000);
    
  }

  public static void main(String... args) throws Exception {
    init();
  }
}

Eclipse MAT was used to analyse a heap dump taken immediately after 10K log entries were generated, and then again once the program had returned to idle and a network monitor indicated that all data had been flushed.

After 10K log messages, the heap was 51MB, which reduced to 13.4MB once the logs had been flushed (but with no forced GC, so the difference may be larger). Thus at least 38MB appears to be consumed by logging 10K messages, which is about 3.9KB per message. The attached reports (10K Top Consumers.pdf & Idle Top Consumers.pdf) indicate that 22.9MB is used by io.grpc.netty.NettyClientTransport$2, with 12.1MB alone consumed by io.netty.handler.codec.http2.StreamBufferingEncoder.

Ideally, the memory usage per log line could be greatly reduced. Either way, it would also be highly desirable to be able to put limits on the memory consumed by the logging subsystem so that excessively verbose logs are either discarded, summarized, blocked, or rejected with an exception.

@meltsufin
Member

Thanks @gregw for a detailed analysis.
@garrettjonesgoogle Can you please take a look at this? 4KB per log message seems excessive.

@garrettjonesgoogle
Member

Unfortunately, google-cloud-java gets its version of Netty indirectly through grpc, so there isn't a super-clean way to update that. Regardless, I would be very interested in seeing what your analysis shows if you update the Netty dependency on your end to 4.1.5 using the exclusions feature of Maven described in #1433.

@garrettjonesgoogle added the api: logging label Dec 7, 2016
@gregw
Author

gregw commented Dec 8, 2016

@garrettjonesgoogle I ran the same test with Netty 4.1.5.Final. It improves the base memory usage, but not the dynamic usage per request. After the 10K log messages, total memory usage is now only 45.5MB, but when idle it now reduces to 8.7MB. So there is an approximately 5MB saving, but the per-message cost is about the same, dominated by 22.8MB used by io.grpc.netty.NettyClientTransport$2 and 11.4MB by io.netty.handler.codec.http2.StreamBufferingEncoder.

@gregw
Author

gregw commented Jan 19, 2017

As per the discussion in grpc/grpc-java#2554, the issue appears to be caused (or at least aggravated) by the flush size of 1, which results in only a single log entry being sent per RPC and many log-entry lists being created as garbage. The simple flush threshold will be replaced by a better mechanism called "bundling", to provide timely yet aggregated log entries.

@pongad self-assigned this Jan 20, 2017
@pongad
Contributor

pongad commented Jan 23, 2017

@garrettjonesgoogle I see two ways to resolve this

  • Add bundling options to logging and regenerate it with toolkit
  • Hand-write code to work with ThresholdBundlingForwarder and friends directly in this class

Do you have a preference?

Since this class allows configuring various options on a per-handler basis, I think the second option is probably easier to do. Due to the work we're doing on pubsub, the bundling surface feels somewhat unstable, so either way we decide to do it, we might have to rewrite pieces of it. I'll be happy to do the work though.

@garrettjonesgoogle
Member

Bundling was already added to Logging and the GAPIC client was regenerated in d4d494d .

@pongad
Contributor

pongad commented Jan 24, 2017

@garrettjonesgoogle I somehow missed this. Was the config not in the YAML? I see no reference to bundling in https://github.com/googleapis/googleapis/blob/622cee91552e025d8f2f5733308ee2f633356f98/google/logging/v2/logging_gapic.yaml

@garrettjonesgoogle
Member

Ahh you're right, I haven't submitted it there yet.

@pongad
Contributor

pongad commented Jan 24, 2017

@gregw After discussing this with Garrett, I think this issue should have been resolved by d4d494d . Could you try again?

@meltsufin
Member

@gregw Can you please take a look?

@gregw
Author

gregw commented Jan 27, 2017

@meltsufin @garrettjonesgoogle

I have rerun the test I did above. The good news is that the network usage appears to be a lot better!

Bad news is that the memory usage appears to be a lot worse:

  • After initialization and logging two simple messages, the heap size is only 5.3MB
  • After warm-up logging of 50K messages, the heap grew to 88MB, of which 83MB is consumed by com.google.api.gax.bundling
  • After logging 10K more messages, the heap grew to 108MB, of which 103MB is consumed by com.google.api.gax.bundling

I'll see if I can attach some detailed reports.

Bad news is that the total image size is a bit bigger at 62MB, of which 54MB is consumed by com.google.api.gax.bundling. But the good news is that the size remains pretty much constant over the generation of 10K logs. Also good is that the network activity is now hardly noticeable, so the RPC usage appears to be much more efficient.

@gregw
Author

gregw commented Jan 27, 2017

Here is a report. Tell me if you need more info.
Export.pdf

@meltsufin
Member

@gregw Thanks for the detailed analysis!

It sounds like there is a memory leak somewhere in the use of com.google.api.gax.bundling. @garrettjonesgoogle Can you take a look?

@garrettjonesgoogle
Member

@pongad will be the one to take a look.

@meltsufin
Member

ping, any updates?

@pongad
Contributor

pongad commented Feb 3, 2017

Apologies for the delay. I have been pulled a few different ways the past couple of weeks.

I looked deeper into this issue. Previously, d4d494d bundled the log messages so that we make fewer gRPC calls. This is good.

The problem is that flush() is now incorrect. Currently, the implementation of flush eventually calls writeAsync, which just puts the entries in an unbounded buffer and returns. It doesn't guarantee that an RPC is issued to the service, much less that the RPC is completed.

This would explain why Greg's benchmark reports so little time spent on the RPC: the RPCs might not have happened at all. Since flush wasn't working properly, the high memory consumption would also make sense: the program would add messages much faster than RPCs could be completed.

Getting flush to work properly with bundling proved more difficult than I thought. I plan to write and share a doc on this subject.
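
To make the problem concrete, here is a minimal sketch (not the library's actual code) of what a working flush would have to do: remember the result of every RPC that writeAsync actually issues and block until they have all completed. The class and method names are hypothetical, and CompletableFuture stands in for whatever async result type the client returns.

import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch only; not the google-cloud-java implementation.
class TrackingFlusher {
  // RPCs that have been issued but have not yet completed.
  private final Queue<CompletableFuture<Void>> outstanding = new ConcurrentLinkedQueue<>();

  // Called for each RPC actually sent to the service.
  void track(CompletableFuture<Void> rpc) {
    outstanding.add(rpc);
    rpc.whenComplete((result, error) -> outstanding.remove(rpc));
  }

  // Blocks until every RPC issued so far has completed, which is the
  // guarantee the current flush() does not provide.
  void flush() {
    for (CompletableFuture<Void> rpc : outstanding) {
      rpc.join();
    }
  }
}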

@gregw
Author

gregw commented Feb 9, 2017

@pongad I'm not sure we really need the flush mechanism, at least how it is currently written. See the discussion on grpc/grpc-java#2554.

Currently, the mechanism is such that as soon as the number of queued messages exceeds a configured limit (default 1), flush is called. So even if a write from a previous flush is still outstanding, another flush is done and another writeAsync is queued.

I think the semantic that is needed is that if a previous RPC is outstanding, then the next RPC is not initiated until the previous one is complete. This will allow logs to accumulate and for many to be written at once.

Perhaps this can be approximately achieved by having both a high flush limit (say 100) and a minimum time to wait before flushing (say 1000ms). So the log queue will be flushed and an RPC created only if more than 100 entries are created OR an entry has been waiting more than 1000ms.

@garrettjonesgoogle implied that this is exactly what the bundling system is meant to achieve. So perhaps you should be looking at removing the flush paradigm as it may be incompatible with bundling?
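
As an illustration, here is a minimal sketch of the count-or-age policy proposed above (hypothetical names, not the gax bundling API): a batch is handed off for sending only when it holds more than maxEntries entries or its oldest entry has waited longer than maxDelayMillis. A real implementation would also need a timer so an aged batch is still flushed when no new entries arrive.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; not the gax bundling implementation.
class CountOrAgeBatcher<T> {
  private final int maxEntries;      // e.g. 100
  private final long maxDelayMillis; // e.g. 1000
  private final List<T> queue = new ArrayList<>();
  private long oldestEntryMillis = -1;

  CountOrAgeBatcher(int maxEntries, long maxDelayMillis) {
    this.maxEntries = maxEntries;
    this.maxDelayMillis = maxDelayMillis;
  }

  // Adds an entry and returns a batch to send if a threshold was crossed, else null.
  synchronized List<T> add(T entry) {
    long now = System.currentTimeMillis();
    if (queue.isEmpty()) {
      oldestEntryMillis = now;
    }
    queue.add(entry);
    if (queue.size() > maxEntries || now - oldestEntryMillis > maxDelayMillis) {
      List<T> batch = new ArrayList<>(queue);
      queue.clear();
      return batch;
    }
    return null;
  }
}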

@pongad
Contributor

pongad commented Feb 10, 2017

@gregw That makes sense. "Proper" flushing is probably still desirable, but we don't need it for this.

I think that might be referring to the FlowController. The idea is that it limits the number of outstanding bytes/messages. So while it doesn't make sure "the next RPC isn't initiated until the previous one is complete" like you asked, it does make sure "if we already have N unsent messages, wait for at least one to finish before processing one more".
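
For readers unfamiliar with it, that semantic can be sketched with a plain Semaphore (a hypothetical illustration, not the actual gax FlowController): producers reserve a permit per queued message and block once the limit of outstanding messages is reached, and permits are released as RPCs complete.

import java.util.concurrent.Semaphore;

// Hypothetical sketch of the flow-control semantic, not the gax FlowController itself.
class SimpleFlowController {
  private final Semaphore permits;

  SimpleFlowController(int maxOutstandingMessages) {
    this.permits = new Semaphore(maxOutstandingMessages);
  }

  // Called before a message is queued; blocks when too many messages are unsent.
  void reserve() throws InterruptedException {
    permits.acquire();
  }

  // Called from the RPC completion callback for each message that was delivered.
  void release() {
    permits.release();
  }
}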

@meltsufin
Member

@michaelbausor Have you had a chance to take a look? This is our only blocker for integrating the library into the Java runtime.

@michaelbausor
Contributor

@meltsufin I have had an initial look - I will start working on a fix tomorrow.

@meltsufin
Member

@michaelbausor Thanks for the update! :)

@michaelbausor
Contributor

@meltsufin @garrettjonesgoogle A further update about the outstanding issues:

1. Unbounded queue of bundles
The bundling implementation will accept logs faster than it can process them, accumulating them until OOM. This is also flagged in this GAX issue: googleapis/gax-java#157
We can resolve this using the FlowController that @pongad has implemented.

2. Flush implementation is broken
The current implementation of flush is broken because it sits on top of bundling. We will need to enhance bundling with a working flush implementation, to be used by LoggingHandler's flush. This is the major outstanding work item.

3. Bundles are too small
Currently, the max bundle size is set to 1024 bytes, which is only about 5 request objects (each has a serialized size of 221 bytes). So each bundle only contains 5 objects, instead of the expected 100. (These settings come from here: https://github.com/GoogleCloudPlatform/google-cloud-java/blob/master/google-cloud-logging/src/main/java/com/google/cloud/logging/spi/v2/LoggingServiceV2Settings.java#L548).
We can resolve this by increasing the max bundle size to 20K. Note that in this case we are using very small log messages, so the max bundle size will still be the limiting constraint for larger logs.

4. Bundling consumes a lot of memory
My measurements (using the code from @gregw) put the memory consumed per bundle at 1872 bytes, of which 1616 bytes is the request object. So there isn't much we can do here, unless we want to store the request object in serialized form, which would give us approx. 7x reduction in the request size (1616 bytes to 221 bytes) and a total reduction of just under 4x once we account for the overhead. My feeling is that once we have flow control available to stop the stored bundles from causing an OOM error, we don't need to worry about this - what are your thoughts?
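
For reference, the serialized-form idea in item 4 amounts to something like the following sketch (a hypothetical helper, using the standard protobuf toByteArray/parseFrom methods): entries are kept as bytes while queued and re-parsed only when the bundle is sent, trading a little CPU for roughly the 7x size difference quoted above.

import com.google.logging.v2.LogEntry;
import com.google.protobuf.InvalidProtocolBufferException;

// Hypothetical sketch of storing queued entries in serialized form.
class SerializedEntryStore {
  // Serialize on enqueue: the byte[] is much smaller than the in-memory message object.
  static byte[] toBytes(LogEntry entry) {
    return entry.toByteArray();
  }

  // Re-parse just before the bundle is sent.
  static LogEntry fromBytes(byte[] bytes) throws InvalidProtocolBufferException {
    return LogEntry.parseFrom(bytes);
  }
}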

@pongad
Contributor

pongad commented Feb 22, 2017

@michaelbausor Flushing is indeed a big problem. I don't think it's immediately relevant to this issue though. Once we deal with items 1, 3, and 4, the memory problem should be resolved. We can deal with item 2 afterwards. I think I conflated the two issues because they were introduced in the same commit.

@meltsufin
Member

@garrettjonesgoogle It looks like we've now identified the underlying issues. How are you prioritizing the fixes? Do you have a rough ETA? Thanks!

@michaelbausor
Contributor

@meltsufin I will follow what @pongad said and prioritize the other fixes above the flushing implementation. Hopefully I can provide a fix for the bundling size and implement the use of FlowController by tomorrow.

@michaelbausor
Contributor

michaelbausor commented Feb 28, 2017

Status update: I have one PR in flight for adding FlowController here: googleapis/gax-java#213
I am also working on a second PR to address the excessive memory consumption of bundling. We will restructure bundling to combine messages as they are received, instead of storing individual request messages that are only combined once a bundle is taken to be sent. I will update this thread with a PR once it is available.
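
That restructuring could look roughly like the following sketch (a hypothetical class, not the actual gax change): each incoming LogEntry is folded straight into a single pending WriteLogEntriesRequest builder instead of being held as its own request object until send time.

import com.google.logging.v2.LogEntry;
import com.google.logging.v2.WriteLogEntriesRequest;

// Hypothetical sketch of combining messages as they are received.
class CombiningBundle {
  private WriteLogEntriesRequest.Builder pending = WriteLogEntriesRequest.newBuilder();

  // Fold a new entry into the single pending request.
  synchronized void add(LogEntry entry) {
    pending.addEntries(entry);
  }

  // Hand the accumulated request off for sending and start a fresh bundle.
  synchronized WriteLogEntriesRequest take() {
    WriteLogEntriesRequest request = pending.build();
    pending = WriteLogEntriesRequest.newBuilder();
    return request;
  }
}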

@meltsufin
Member

@michaelbausor I see that the FlowController PR was merged in. Do you have a link to the "bundling" PR? When can we expect all of these changes to make it into the client lib release? Thanks!

@michaelbausor
Contributor

I should have an open PR by the end of today; I will link to it here. cc @garrettjonesgoogle for when we will do a release, but hopefully as soon as the changes are merged.

@michaelbausor
Contributor

PR is open here: googleapis/gax-java#214

@michaelbausor
Contributor

Update: The above PR was replaced by googleapis/gax-java#226, which is now merged. There is one outstanding PR with the equivalent changes in toolkit: googleapis/gapic-generator#1095
Once this is merged, we will be ready to begin merging into google-cloud-java. ETA for completion is 3/8.

@michaelbausor
Contributor

@gregw @meltsufin The fix for this issue is now in master. We will try to create a release as soon as possible, but we also have other outstanding breaking changes to logging which we would like to include.

It would be great to get verification that these changes have fixed the issue, and if not to figure out what more needs to be done. Are you able to test using the master branch? If not, let me know and we can try to use a snapshot release or some other alternative.

@meltsufin
Member

@gregw Would you be able to run the test again with a build from master?

@gregw
Author

gregw commented Mar 9, 2017

@meltsufin @michaelbausor I'll rerun today... standby ...

@gregw
Author

gregw commented Mar 10, 2017

@meltsufin @michaelbausor I've run the same tests and the memory usage is much, much better:

  • after sending 80,000 logs and then waiting 30s, I have 5MB on the heap of which nothing much is related to logging.
  • immediately after sending another 10,000 logs I have a 13MB heap with 8MB of com.google.api.gax.bundling.ThresholdBundlingForwarder
  • 30s later I'm back to a 5MB heap.

So there is still a moderate penalty for queued logs, but it is not leaked and is quickly cleared. Few applications will log 10,000 messages in a tight loop.

@michaelbausor
Contributor

michaelbausor commented Mar 10, 2017

@gregw @meltsufin Great, thanks! That matches with what I would expect. I will close this issue once we create a new release.

Also note that @pongad has a PR in progress that will reduce the memory consumption of queued logs further in some cases: #1721

@meltsufin
Member

Thanks @michaelbausor!! Also, #1721 sounds like a great idea for compressing the log batches!
