Default worker childopts: GC logging, IPv4, -server; fixes #492 #632

Open
wants to merge 1 commit into
base: master

mrflip
Collaborator

@mrflip mrflip commented Jul 21, 2013

  • Enable verbose GC logging into {storm.home}/logs/gc-worker.log.N, capped at 100k
  • Added a new childopts interpolant, '%STORMHOME%', set to the storm.home property at runtime.
  • Put example production childopts in the storm.yaml.example
  • Added the 'preferIPv4Stack' setting, to make Storm not bind to IPv6 interfaces by default
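
To make the '%STORMHOME%' interpolant concrete, here is a minimal Python sketch of the substitution it implies. It is not Storm's own code: the function name `expand_storm_home` and the `/opt/storm` path are hypothetical, chosen only for illustration.

```python
def expand_storm_home(childopts: str, storm_home: str) -> str:
    """Replace every %STORMHOME% placeholder with the given storm.home path."""
    return childopts.replace("%STORMHOME%", storm_home)


# Hypothetical install path, used only for illustration.
template = "-Xloggc:%STORMHOME%/logs/gc-worker.log"
print(expand_storm_home(template, "/opt/storm"))
```

A plain string replacement is enough here because the placeholder is a fixed token; options that do not contain it pass through unchanged.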

@nathanmarz
Owner

Are you sure these GC settings should be default? What are the performance implications of this? This can easily be manually set by users.

The %STORMHOME% feature addition is fine, I'd like to merge that in as a separate pull request. Likewise for the preferIPv4Stack setting.

@mrflip
Collaborator Author

mrflip commented Jul 31, 2013

The fancypants JVM options are commented out -- I'm only enabling the GC logging settings, IPv4 and server mode.

The system shouldn't be doing new-gen GCs more than once every few seconds, so verbosity of GC logging shouldn't be an issue.

I'll move the "production" options to the wiki, and separate this pull request into pieces as requested. Each of the candidate settings is described below, so please recommend additional ones to include.

  • -server -- documents that it is in server mode: "intended for executing long-running server applications, which need the fastest possible operating speed more than a fast start-up time or smaller runtime memory footprint"
  • -Djava.net.preferIPv4Stack=true -- bind to IPv4, not IPv6, addresses by default.
  • -Xloggc:%STORMHOME%/logs/gc-worker-%ID%.log -- write GC activity to the named log
  • -verbose:gc -- be descriptive
  • -XX:GCLogFileSize=1m -- cap the gc log file size
  • -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -- rotate the GC log files out from the current one, keeping the last ten
  • -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps -XX:+PrintClassHistogram -XX:+PrintTenuringDistribution -- make the GC logs informative and readable
  • -XX:-PrintGCApplicationStoppedTime -- turned off, but noted as one you might want to enable. Shows how long application threads were stopped at each pause.
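
As a rough illustration of how the flags above combine, the Python sketch below joins them into one childopts string, expands the two placeholders, and works out the approximate disk space the rotated logs can occupy per worker. The helper names, the `/opt/storm` path and the identifier `6700` are hypothetical; only the flags themselves come from the list.

```python
GC_LOG_FILES = 10        # from -XX:NumberOfGCLogFiles=10
GC_LOG_FILE_SIZE_MB = 1  # from -XX:GCLogFileSize=1m

# The logging-related flags listed above, in the same order.
DEFAULT_FLAGS = [
    "-server",
    "-Djava.net.preferIPv4Stack=true",
    "-Xloggc:%STORMHOME%/logs/gc-worker-%ID%.log",
    "-verbose:gc",
    f"-XX:GCLogFileSize={GC_LOG_FILE_SIZE_MB}m",
    "-XX:+UseGCLogFileRotation",
    f"-XX:NumberOfGCLogFiles={GC_LOG_FILES}",
    "-XX:+PrintGCDetails",
    "-XX:+PrintHeapAtGC",
    "-XX:+PrintGCTimeStamps",
    "-XX:+PrintClassHistogram",
    "-XX:+PrintTenuringDistribution",
    "-XX:-PrintGCApplicationStoppedTime",
]


def render_childopts(flags, storm_home, worker_id):
    """Join the flags and expand the %STORMHOME% and %ID% placeholders."""
    joined = " ".join(flags)
    return joined.replace("%STORMHOME%", storm_home).replace("%ID%", str(worker_id))


def max_gc_log_mb(files=GC_LOG_FILES, size_mb=GC_LOG_FILE_SIZE_MB):
    """Approximate upper bound on rotated GC log size for one worker, in MB."""
    return files * size_mb


# Hypothetical path and worker identifier, for illustration only.
print(render_childopts(DEFAULT_FLAGS, "/opt/storm", 6700))
print(max_gc_log_mb(), "MB of GC logs per worker, at most (roughly)")
```

With ten files of about 1 MB each, the logs stay near 10 MB per worker regardless of how long the worker runs.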

Performance and Memory settings:

  • -XX:+AggressiveOpts -- "Turns on point performance optimizations that are expected to be on by default in upcoming releases." A bit of cargo cult tuning here; most big systems we use have this set on.
  • -XX:+UseCompressedOops -- significant savings in heap size on 64-bit machines; it only applies to heaps below roughly 32 GB, however, so it has to go once the heap grows past that.
  • -Xmx2500m -Xms2500m -- set to twice the new-gen size plus the steady-state old-gen occupancy (how much old-gen is used after you force a full stop-the-world GC)
    • Your goal is that there are no stop-the-world (STW) GCs, and nothing in the logs about aborted CMS, ever; and that old-gen GCs should not last longer than 1 second or happen more often than every 5 minutes.
  • -Xss256k -- the default stack size is much too large. This amount of memory is set aside for each and every thread, so it's important to decrease it for highly threaded applications.
  • -XX:MaxPermSize=128m -XX:PermSize=96m -- the initial perm-gen size is small, so set it high enough that system behaviour doesn't change after startup. Set a hard cap -- the perm-gen usage should never change, so you want it to blow up if it ever somehow did.
  • New-gen size: unless you have very large cache maps and such, almost all of your heap usage will be extremely short-lived objects. The default is to use 1/3 of the total heap for new-gen. We've found the following:
    • -XX:NewSize=1000m -XX:MaxNewSize=1000m -- for production use, the new-gen size should be at least one GB, and it should be pinned to a hard number so the system is predictable. (Note: you must never use more than 1/2 the heap for new-gen.) Faster flows and larger tuple values will require a larger new-gen size. Your goal is that new-gen GCs should not last longer than 50 ms or happen more often than every 5 seconds.
    • -XX:+UseParNewGC -- documents that the parallel new-gen collector is being used
    • -XX:MaxTenuringThreshold=1 -- some short-lived objects will survive the first new-gen GC simply because they were made right before the sweep, so it's important to have tenuring on. Essentially no short-lived objects make it through two GC cycles though, so rather than copying objects back and forth in the new-gen space many times (eden - survivor - ... - survivor - old-gen), tenure them after one cycle (eden - survivor - old-gen).
    • -XX:SurvivorRatio=6 -- the default survivor space size is far larger than typically needed; this makes the eden space use 75% of the new-gen and the two survivors 12.5% each. New-gen GCs should not fill the survivor space.
  • CMS for old-gen
    • -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -- use the low-pause garbage collector, and have it take good advantage of multiple cores.
    • -XX:CMSInitiatingOccupancyFraction=75 -- start each gc process such that there's plenty of time before the heap gets too crowded. Especially important for large heap sizes.
    • -XX:+UseCMSInitiatingOccupancyOnly -- don't let the JVM do old-gen GCs unpredictably
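
The sizing rules in the list above reduce to a little arithmetic, sketched below in Python. The function names are hypothetical. The 500 MB steady-state old-gen figure is not stated in the thread; it is inferred from 2500 = 2 × 1000 + 500 and should be treated as an assumption.

```python
def heap_size_mb(new_gen_mb, steady_old_gen_mb):
    """Heap size rule: twice the new-gen size plus steady-state old-gen occupancy."""
    return 2 * new_gen_mb + steady_old_gen_mb


def new_gen_layout_mb(new_gen_mb, survivor_ratio):
    """Split new-gen into (eden, one survivor space).

    SurvivorRatio is the size of eden relative to ONE survivor space,
    and there are two survivor spaces, so new-gen has ratio + 2 parts.
    """
    survivor = new_gen_mb / (survivor_ratio + 2)
    eden = new_gen_mb - 2 * survivor
    return eden, survivor


def cms_trigger_mb(heap_mb, new_gen_mb, occupancy_percent):
    """Old-gen occupancy at which a CMS cycle is started."""
    old_gen_mb = heap_mb - new_gen_mb
    return old_gen_mb * occupancy_percent / 100


# Assumed steady-state old-gen occupancy of 500 MB (inferred, see above).
heap = heap_size_mb(new_gen_mb=1000, steady_old_gen_mb=500)
eden, survivor = new_gen_layout_mb(new_gen_mb=1000, survivor_ratio=6)
trigger = cms_trigger_mb(heap_mb=heap, new_gen_mb=1000, occupancy_percent=75)

print(f"heap: {heap} MB")
print(f"eden: {eden} MB, each survivor: {survivor} MB")
print(f"CMS starts when old-gen reaches about {trigger} MB")
```

With these inputs, new-gen is 40% of the heap, which respects the "never more than 1/2 the heap" note.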

@mrflip
Collaborator Author

mrflip commented Aug 1, 2013

(I don't know if you need more convincing, but a recent thread on the list -- mysterious death of a worker on prepare() that turned out to be a heap blowout -- really convinces me that the default settings for development should have GC logging turned to max. If something like that is invisible to a clearly quite competent dev, then there's little chance folks will naturally suspect they're leaking fast data into the old-gen causing repeated avoidable major GCs)

@vbajaria

vbajaria commented Aug 1, 2013

I don't know if all the GC settings mentioned here should be on by default for production, but for dev mode they are definitely helpful. Maybe we could even make it easy to provide a flag when running the java command to turn on the most verbose level of GC logging in production too (if needed).

I agree with mrflip: verbose GC logging helped me figure out a memory leak in one of my topologies, which could only be reproduced in production after the topology had run for 4-6 hours.
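
The pause and frequency targets quoted earlier in this thread (new-gen GCs no longer than 50 ms and no more often than every 5 seconds) can be checked mechanically once pause data has been pulled out of a GC log. The Python sketch below assumes that extraction has already happened; the function name `check_gc_goals` and the sample events are hypothetical, and no particular log format is implied.

```python
from typing import List, Tuple


def check_gc_goals(
    events: List[Tuple[float, float]],
    max_pause_s: float,
    min_interval_s: float,
) -> List[str]:
    """Return one message per violated goal.

    events holds (start_time_s, pause_s) pairs sorted by start time.
    """
    problems = []
    previous_start = None
    for start, pause in events:
        if pause > max_pause_s:
            problems.append(
                f"pause of {pause:.3f}s at t={start:.1f}s exceeds {max_pause_s:.3f}s"
            )
        if previous_start is not None and start - previous_start < min_interval_s:
            problems.append(
                f"only {start - previous_start:.1f}s since the previous GC at t={start:.1f}s"
            )
        previous_start = start
    return problems


# Made-up new-gen events, for illustration only.
sample = [(10.0, 0.020), (17.0, 0.080), (19.0, 0.030)]
for message in check_gc_goals(sample, max_pause_s=0.050, min_interval_s=5.0):
    print(message)
```

The same function could be reused for the old-gen targets by passing `max_pause_s=1.0` and `min_interval_s=300.0`.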

3 participants