Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing Coordinated omission + replacing histogram with HdrHistogram #214

Closed
wants to merge 20 commits into from

Conversation

nitsanw
Copy link
Contributor

@nitsanw nitsanw commented Mar 4, 2015

Hi,
I'm submitting this PR as a halfway point to an agreed upon PR. There are a few changes, some are cosmetic (which I should have perhaps avoided in the name of having a cleaner PR for you to look at), but the 2 items listed in the title are of importance:

  1. Fixing coordinated omission(as described and discussed by Gil Tene[1][2][3]). This required the introduction of an intended start time per operation to be used in computing the latency. So if operation is delayed from executing due to previous operation not completing on time the latency is inclusive of that delay. This divorces the load generator intended execution schedule from the server push back behaviour.
  2. Replacing OneMeasurementHistogram with OneMeasurementHdrHistogram which uses HdrHistogram[4] to capture latency thus enjoying a better suited data structure and allowing loss less compressed logging of interval histograms[5].
    Your feedback is most welcome.
    Thanks,
    Nitsan
    PS: I would like to cover this PR in a blog post in the near future and would value your comments on the draft as well
    [1] https://groups.google.com/forum/#!msg/mechanical-sympathy/icNZJejUHfE/BfDekfBEs_sJ
    [2] http://www.infoq.com/presentations/latency-pitfalls
    [3] https://www.youtube.com/watch?v=9MKY4KypBzg
    [4] https://github.com/HdrHistogram/HdrHistogram
    [5] http://psy-lob-saw.blogspot.com/2015/02/hdrhistogram-better-latency-capture.html

@cmatser
Copy link
Collaborator

cmatser commented Mar 4, 2015

This is a great addition to YCSB. The usefulness of the current Histogram implementation is dubious. When I was evaluating performance values, I would only rarely look at the histogram results. This would require some adjustments and re-runs in order to capture the correct "window" of data. This PR sounds like it would solve that problem and help provide more insight into the performance results. Thanks!

@nitsanw
Copy link
Contributor Author

nitsanw commented Mar 5, 2015

Glad you approve, not sure what you mean by "window" of data? is that the bucket sizing for original histogram?
The extra benefit of this patch when serializing the HdrHistograms to file is that you can get correct histogram data over parts of the run by using the HistogramLogProcessor and setting a time period of interest (or crafting your own tooling). This is helpful when trying to gain insight to application latency behaviour over time.

@cmatser
Copy link
Collaborator

cmatser commented Mar 5, 2015

The bucket sizing, exactly. We would often find that the histogram data would not be properly represented because we had a bulk of data ending up in the last "overflow" bucket. This would change based on our test parameters, specifically when we changed threadcount. We would then try re-runs with the number of buckets large enough to fit the current run and then wondered if we were adding undue complexity to the client, thus skewing the results. Needless to say, we just did this a few times, enough to confirm expected results, and then gave up on it.

@nitsanw
Copy link
Contributor Author

nitsanw commented Mar 5, 2015

HdrHistogram completely solves that problem, as well as a few others :)

 On Thursday, March 5, 2015 6:44 PM, Chrisjan Matser <notifications@github.com> wrote:

The bucket sizing, exactly. We would often find that the histogram data would not be properly represented because we had a bulk of data ending up in the last "overflow" bucket. This would change based on our test parameters, specifically when we changed threadcount. We would then try re-runs with the number of buckets large enough to fit the current run and then wondered if we were adding undue complexity to the client, thus skewing the results. Needless to say, we just did this a few times, enough to confirm expected results, and then gave up on it.—
Reply to this email directly or view it on GitHub.

@busbey
Copy link
Collaborator

busbey commented May 17, 2015

Some feedback:

  • The details in CHANGES.md should be in a wiki page.
  • Please break the jar-with-dependencies change into a different PR.

Am I reading this correctly that this defaults to the same latency measuring behavior as before?

@nitsanw
Copy link
Contributor Author

nitsanw commented May 17, 2015

It's been so long I can't remember why I added the dependency packaging... I'll look into it and you can exclude it from the merge if you don't think it's important. If you think it's a good idea maybe open a separate issue for it and make this PR fix it. It does fix more than just one isolated thing and I realize not everyone likes to work that way.
I'm not sure which wiki you refer to and how can I edit it as part of a branch...

@nitsanw
Copy link
Contributor Author

nitsanw commented May 17, 2015

OK, having looked at the changes I think my change was the simplest way to deliver the dependency to the release package. The assembly configuration picks up the jars from the target folders of all modules and does not bring in modules dependencies. Removing the all in one change will break the HdrHistogram dependency for the end package.

@nitsanw
Copy link
Contributor Author

nitsanw commented May 17, 2015

Please note that given the slow response to PRs on this repository I have joined efforts with DB vendors and am trying to encourage collaboration on a fork:
https://github.com/YCSB/YCSB
The fork includes many contributions from @jbellis and has the CO fix merged. I'm not sure what would be the best course of action going forward in maintaining YCSB.

@busbey
Copy link
Collaborator

busbey commented May 23, 2015

The problem with grabbing top level dependencies is that not all of our modules use the same versions of things.

Instead, we should update the distribution module to pull in the core dependencies and add them to the classpath.

@nitsanw
Copy link
Contributor Author

nitsanw commented May 25, 2015

Are you sure this is something that needs to be done as part of this PR? I'm not sure this is the place to enforce new policy for new problems. Alternatively you can manually merge and fix the issue as you see fit? I'm saying this not to spite, but as a maintainer of an open-source library.

@nitsanw
Copy link
Contributor Author

nitsanw commented Jun 1, 2015

Is there any particular action pending on this merge? I see the dependency issue as a continuation of #250 but if that is not the case I'm not sure what you need from me to resolve this PR

@busbey
Copy link
Collaborator

busbey commented Jun 1, 2015

Here are the outstanding review feedback pieces, the first two are blockers:

  1. I need someone to answer this question from above, so that we can decide if something different needs to happen or if we need a release note:

    Am I reading this correctly that this defaults to the same latency measuring behavior as before?

  2. CHANGES.md needs to be removed from the code base and made into a part of our docs. Since wiki pages are just markdown (our wiki is over here), this should be straight forward. Placing the content in a gist and linking here will allow me to move it into the docs (or waiting for make community-maintainable documentation #258 would mean it could be a PR). It might need some edits (like referencing that it's a branch, 'further suggestions' should be filed as issues, etc) but I think those can largely be follow-on.

  3. It would be nice if the new core dependency was placed in the assembly (as the jackson dep in core must be already) instead of building a jar-with-dependencies. That the jackson dependencies work fine but the HdrHistogram one doesn't causes me some concern. So long as Improve deployment of optional dependencies #250 (refactor deployment) is a blocker for Release version 0.2.0 #266 (release 0.2.0) I'd be fine punting to that issue.

  4. It would be nice to drop the commit that disables mapkeeper (since it's already disabled). But that's a nit.

I've been trying to find someone to pick things up, since your previous comment indicated that you didn't have time to take care of the needed changes. If you do now, that'd be great. Eventually I'll end up taking care of things myself, but my availability to handle PRs that need work is limited.

@nitsanw
Copy link
Contributor Author

nitsanw commented Jun 1, 2015

  1. Default is latency measurement as before. I've changed this on https://github.com/YCSB/YCSB and would consider it a good idea to do so here to before release. The PR as it stands highlights the issue and allows before/after comparison. If we are agreed that the non-coordinated measurement is the way to go things could be simplified.
  2. I'll have a look and see if I can find the relevant place for it. Dumping it as is seems a bit harsh. I could add the notes to the PR, allowing you to split it as you see fit when you release (e.g. parts to the release notes, parts to the wiki etc.)
  3. I'll have a look at how the jackson dependency is handled and follow same pattern.
  4. Dropping mapkeeper was probably done to make the build pass.

@busbey
Copy link
Collaborator

busbey commented Jun 3, 2015

That plan sounds great. I'm hesitant to make the latency change the default until we've had some time to have folks knock it around (so maybe we can plan for making it the default in 0.3.0 or 0.4.0). The rest sounds great.

Let me know how things are going. I'm budgeting some dedicated time this weekend and need to decide if I can spend it on #250.

@nitsanw
Copy link
Contributor Author

nitsanw commented Jun 8, 2015

Did not end up having time for it. Life got in the way...
Will hopefully spend some time on it this week.
When are you aiming to release 0.2.0?

@busbey
Copy link
Collaborator

busbey commented Jun 8, 2015

The 0.2.0 RC will branch on June 15th. (The next release will branch on July 15th).

.3. I'll have a look at how the jackson dependency is handled and follow same pattern.

You can ignore this one and forget about the dependency thing. I got far enough on #250 to see that it's probably only luck that made the jackson dependency work before.

@nitsanw
Copy link
Contributor Author

nitsanw commented Jun 16, 2015

Adding the changes notes here:
--- Please Copy/Paste/Break up as you think is appropriate ---
When used as a latency under load benchmark YCSB in it's original form suffers from
Coordinated Omission[1] and related measurement issue:

  • Load is controlled by response time
  • Measurement does not account for missing time
  • Measurement starts at beginning of request rather than at intended beginning
  • Measurement is limited in scope as the histogram does not provide data on overflow values

To provide a minimal correction patch the following were implemented:

  1. Replace internal histogram implementation with HdrHistogram[2]:
    HdrHistogram offers a dynamic range of measurement at a given precision and will
    improve the fidelity of reporting. It allows capturing a much wider range of latencies.
    HdrHistogram also supports compressed loss-less serialization which enable capturing
    snapshot histograms from which lower resolution histograms can be constructed for plotting
    latency over time. Snapshot interval histograms are serialized on status reporting which
    must be enabled using the '-s' option.
  2. Track intended operation start and report latencies from that point in time:
    Assuming the benchmark sets a target schedule of execution in which every operation
    is supposed to happen at a given time the benchmark should measure the latency between
    intended start time and operation completion.
    This required the introduction of a new measurement point and inevitably
    includes measuring some of the internal preparation steps of the load generator.
    These overhead should be negligible in the context of a network hop, but could
    be corrected for by estimating the load-generator overheads (e.g. by measuring a
    no-op DB or by measuring the setup time for an operation and deducting that from total).
    This intended measurement point is only used when there is a target load (specified by
    the -target paramaeter)

This branch supports the following new options:

  • -p measurementtype=histogram|hdrhistogram|hdrhistogram+histogram|timeseries
    The new measurement types are hdrhistogram and hdrhistogram+histogram. Default is still
    histogram, which is the old histogram. Ultimately we would remove the old measurement types
    and use only HdrHistogram but the old measurement is left in there for comparison sake.
  • -p measurement.interval=op|intended|both
    This new option deferentiates between measured intervals and adds the intended interval(as described)
    above, and the option to record both the op and intended for comparison.
  • -p hdrhistogram.fileoutput=true|false
    This new option will enable periodical writes of the interval histogram into an output file. The path can be set using '-p hdrhistogram.output.path='.

Example parameters:
-target 1000 -s -p workload=com.yahoo.ycsb.workloads.CoreWorkload -p basicdb.verbose=false -p basicdb.simulatedelay=4 -p measurement.interval=both -p measurementtype=hdrhistogram -p hdrhistogram.fileoutput=true -p maxexecutiontime=60

Further changes made:

  • -p status.interval= (default=10)
    Controls the number of seconds between status reports and therefore between HdrHistogram snapshots reported.
  • -p basicdb.randomizedelay=true|false
    Controls weather the delay simulated by the mock DB is uniformly random or not.

Further suggestions:

  1. Correction load control: currently after a pause the load generator will do
    operations back to back to catchup, this leads to a flat out throughput mode
    of testing as opposed to controlled load.
  2. Move to async model: Scenarios where Ops have no dependency could delegate the
    Op execution to a threadpool and thus separate the request rate control from the
    synchronous execution of Ops. Measurement would start on queuing for execution.
  3. https://groups.google.com/forum/#!msg/mechanical-sympathy/icNZJejUHfE/BfDekfBEs_sJ
  4. https://github.com/HdrHistogram/HdrHistogram

busbey added a commit that referenced this pull request Jun 17, 2015
    [core] use HdrHistogram and fix coordinated omission

also closes #10
@busbey
Copy link
Collaborator

busbey commented Jun 17, 2015

I merged this manually in 4d68068. not sure why it didn't show up.

Filed a follow on to clean up the CHANGES.md stuff.

Thanks for working on this, it's a great addition.

@busbey busbey closed this Jun 17, 2015
@nitsanw
Copy link
Contributor Author

nitsanw commented Jun 17, 2015

AWESOME!!!!
Thanks for picking this up, this is great news :-)

@cmccoy
Copy link
Collaborator

cmccoy commented Jun 17, 2015

👍 - thanks!

@busbey busbey mentioned this pull request Jun 18, 2015
jaricftw pushed a commit to jaricftw/YCSB that referenced this pull request Jul 19, 2016
    [core] use HdrHistogram and fix coordinated omission

also closes brianfrankcooper#10
jaricftw pushed a commit to jaricftw/YCSB that referenced this pull request Jul 19, 2016
    [core] use HdrHistogram and fix coordinated omission

also closes brianfrankcooper#10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants