[dogstatsd] handle properly utf8 packets #1279

Merged: 1 commit, Jan 28, 2015

Conversation

LeoCavaille
Member

Fixes #1256. We should always assume that dogstatsd
receives a UTF-8 encoded string through its socket,
but still support unicode Python strings in case we
submit things programmatically (e.g. useful for tests).

# network socket, but if submit_packets is used
# programmatically and packets is unicode already
# then do not decode!
if not isinstance(packets, unicode):
Contributor

Could you use:

if not type(packets) == unicode:

instead?

It's slightly faster:

topdog@i-1b871af7(prod):~$ python -m timeit 'not isinstance("sdfsdfsd" , unicode)'
10000000 loops, best of 3: 0.197 usec per loop
topdog@i-1b871af7(prod):~$ python -m timeit 'not type("sdfsdfsd") == unicode'
10000000 loops, best of 3: 0.151 usec per loop

and this code runs in a pretty performance-sensitive loop.
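
To reproduce the comparison above from a script rather than the shell, here is a minimal sketch using the standard `timeit` module. The names `time_type_checks`, `text_type` and `TaggedText` are made up for illustration; `type(u"")` resolves to `unicode` on Python 2 and `str` on Python 3, so the snippet should run on either. Timings vary by machine and interpreter, so no numbers are claimed here. Note that the two checks are not equivalent: `isinstance` also accepts subclasses, while `type(...) ==` does not.

```python
import timeit

# `unicode` on Python 2, `str` on Python 3 (illustrative alias only).
text_type = type(u"")

# Both statements run against the same byte-string payload.
SETUP = "text_type = type(u''); payload = b'sdfsdfsd'"


def time_type_checks(number=1000000):
    """Return (isinstance_seconds, type_eq_seconds) for `number` runs each."""
    t_isinstance = timeit.timeit(
        "not isinstance(payload, text_type)", setup=SETUP, number=number)
    t_type_eq = timeit.timeit(
        "not type(payload) == text_type", setup=SETUP, number=number)
    return t_isinstance, t_type_eq


class TaggedText(text_type):
    """Hypothetical subclass, used only to show the semantic difference."""


if __name__ == "__main__":
    t_isinstance, t_type_eq = time_type_checks()
    print("isinstance: %.3fs  type ==: %.3fs" % (t_isinstance, t_type_eq))
    sample = TaggedText(u"caf\u00e9")
    # A subclass instance passes isinstance but fails the exact-type test.
    print(isinstance(sample, text_type), type(sample) == text_type)
```

If packets could ever be a subclass of the text type, the exact-type check would treat them as bytes and try to decode them again, which is worth weighing against the small timing difference.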

@remh
Contributor

remh commented Jan 9, 2015

Looks great besides the comment!

@remh remh self-assigned this Jan 9, 2015
@clutchski
Contributor

What's the perf impact?

@LeoCavaille
Member Author

Looks like the extra decode comes with a ~30% performance impact. cc @remh @clutchski

Benchmark utf8 handling (this branch only; the same test raises on the master branch)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.246    0.246    4.492    4.492 /Users/leo/datadog/vm/dd-agent/tests/performance/benchmark_aggregator.py:64(test_dogstatsd_utf8_events)
   300000    1.114    0.000    3.766    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:500(submit_packets)
   300000    1.434    0.000    1.667    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:463(parse_event_packet)
   300000    0.587    0.000    0.587    0.000 /Users/leo/datadog/vm/dd-agent/venv/lib/python2.7/encodings/utf_8.py:15(decode)
   300000    0.479    0.000    0.479    0.000 /Users/leo/datadog/vm/dd-agent/tests/performance/benchmark_aggregator.py:54(create_event_packet)
   300000    0.397    0.000    0.397    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:553(event)
   300000    0.234    0.000    0.234    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:460(_unescape_event_text)
       10    0.000    0.000    0.001    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:673(flush)

Benchmark ascii text with this branch

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.242    0.242    4.613    4.613 /Users/leo/datadog/vm/dd-agent/tests/performance/benchmark_aggregator.py:87(test_dogstatsd_ascii_events)
   300000    1.096    0.000    3.858    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:500(submit_packets)
   300000    1.588    0.000    1.852    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:463(parse_event_packet)
   300000    0.512    0.000    0.512    0.000 /Users/leo/datadog/vm/dd-agent/tests/performance/benchmark_aggregator.py:54(create_event_packet)
   300000    0.503    0.000    0.503    0.000 /Users/leo/datadog/vm/dd-agent/venv/lib/python2.7/encodings/utf_8.py:15(decode)
   300000    0.406    0.000    0.406    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:553(event)
   300000    0.265    0.000    0.265    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:460(_unescape_event_text)
       10    0.000    0.000    0.001    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:673(flush)

Benchmark ascii text with master branch

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.199    0.199    3.543    3.543 /Users/leo/datadog/vm/dd-agent/tests/performance/benchmark_aggregator.py:87(test_dogstatsd_ascii_events)
   300000    0.729    0.000    2.895    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:500(submit_packets)
   300000    1.546    0.000    1.793    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:463(parse_event_packet)
   300000    0.447    0.000    0.447    0.000 /Users/leo/datadog/vm/dd-agent/tests/performance/benchmark_aggregator.py:54(create_event_packet)
   300000    0.373    0.000    0.373    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:545(event)
   300000    0.247    0.000    0.247    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:460(_unescape_event_text)
       10    0.000    0.000    0.001    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:665(flush)
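
The ~30% figure follows from the cumulative times of the two ASCII runs above: 4.613 s on this branch against 3.543 s on master. A small sketch of that arithmetic, where `relative_overhead` is a hypothetical helper name and not something from the repository:

```python
def relative_overhead(branch_seconds, baseline_seconds):
    """Fractional slowdown of the branch relative to the baseline."""
    return (branch_seconds - baseline_seconds) / baseline_seconds


# cumtime of test_dogstatsd_ascii_events, copied from the profiles above
ascii_branch = 4.613
ascii_master = 3.543

if __name__ == "__main__":
    # Prints 30.2%
    print("%.1f%%" % (100 * relative_overhead(ascii_branch, ascii_master)))
```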

@LeoCavaille
Member Author

Also @remh, I tried your suggestion, but it looks like it results in slower code on my laptop:

        1    0.258    0.258    4.654    4.654 /Users/leo/datadog/vm/dd-agent/tests/performance/benchmark_aggregator.py:64(test_dogstatsd_utf8_events)
   300000    1.131    0.000    3.887    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:500(submit_packets)
   300000    1.510    0.000    1.761    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:463(parse_event_packet)
   300000    0.598    0.000    0.598    0.000 /Users/leo/datadog/vm/dd-agent/venv/lib/python2.7/encodings/utf_8.py:15(decode)
   300000    0.508    0.000    0.508    0.000 /Users/leo/datadog/vm/dd-agent/tests/performance/benchmark_aggregator.py:54(create_event_packet)
   300000    0.397    0.000    0.397    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:553(event)
   300000    0.251    0.000    0.251    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:460(_unescape_event_text)
       10    0.000    0.000    0.001    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:673(flush)
        1    0.248    0.248    4.656    4.656 /Users/leo/datadog/vm/dd-agent/tests/performance/benchmark_aggregator.py:87(test_dogstatsd_ascii_events)
   300000    1.073    0.000    3.878    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:500(submit_packets)
   300000    1.606    0.000    1.869    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:463(parse_event_packet)
   300000    0.529    0.000    0.529    0.000 /Users/leo/datadog/vm/dd-agent/tests/performance/benchmark_aggregator.py:54(create_event_packet)
   300000    0.525    0.000    0.525    0.000 /Users/leo/datadog/vm/dd-agent/venv/lib/python2.7/encodings/utf_8.py:15(decode)
   300000    0.411    0.000    0.411    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:553(event)
   300000    0.263    0.000    0.263    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:460(_unescape_event_text)
       10    0.000    0.000    0.001    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:673(flush)

@remh
Contributor

remh commented Jan 15, 2015

30% is a lot.
Accepting UTF-8 should probably be an option in the config file then.

Does the statsd protocol mention anything about accepting UTF-8 in the payloads?

@LeoCavaille
Member Author

@remh Using unicode(s, 'utf-8') is much faster than s.decode('utf-8')! Found this here: http://stackoverflow.com/a/440432

The performance impact is only 7.1% now.

As for the statsd protocol, I didn't find much in the protocol itself; however, I found a few .encode('utf-8') calls in the statsd clients.

        1    0.238    0.238    3.952    3.952 /Users/leo/datadog/vm/dd-agent/tests/performance/benchmark_aggregator.py:64(test_dogstatsd_utf8_events)
   300000    1.262    0.000    3.245    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:500(submit_packets)
   300000    1.388    0.000    1.612    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:463(parse_event_packet)
   300000    0.468    0.000    0.468    0.000 /Users/leo/datadog/vm/dd-agent/tests/performance/benchmark_aggregator.py:54(create_event_packet)
   300000    0.371    0.000    0.371    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:553(event)
   300000    0.223    0.000    0.223    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:460(_unescape_event_text)
       10    0.000    0.000    0.001    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:673(flush)
        1    0.218    0.218    3.794    3.794 /Users/leo/datadog/vm/dd-agent/tests/performance/benchmark_aggregator.py:87(test_dogstatsd_ascii_events)
   300000    1.056    0.000    3.102    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:500(submit_packets)
   300000    1.431    0.000    1.666    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:463(parse_event_packet)
   300000    0.473    0.000    0.473    0.000 /Users/leo/datadog/vm/dd-agent/tests/performance/benchmark_aggregator.py:54(create_event_packet)
   300000    0.380    0.000    0.380    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:553(event)
   300000    0.236    0.000    0.236    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:460(_unescape_event_text)
       10    0.000    0.000    0.001    0.000 /Users/leo/datadog/vm/dd-agent/aggregator.py:673(flush)
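
The two decoding spellings compared in this comment produce the same text; only the call path differs. A minimal sketch under that assumption: `text_type` is an illustrative alias (`unicode` on Python 2, `str` on Python 3) and both helper names are hypothetical. The speed difference was reported for Python 2.7 in this thread and should be re-measured on any other interpreter. With the ASCII run now at 3.794 s cumulative against 3.543 s on master, the overhead works out to about 7.1%.

```python
# `unicode` on Python 2, `str` on Python 3 (illustrative alias only).
text_type = type(u"")


def decode_with_constructor(raw):
    """Decode UTF-8 bytes through the text type's constructor."""
    return text_type(raw, "utf-8")


def decode_with_method(raw):
    """Decode UTF-8 bytes through the byte string's decode method."""
    return raw.decode("utf-8")


if __name__ == "__main__":
    raw = u"caf\u00e9".encode("utf-8")
    # Both routes yield the same decoded text.
    print(decode_with_constructor(raw) == decode_with_method(raw))
```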

@remh
Contributor

remh commented Jan 15, 2015

So 7% is still pretty high, I think. It might be worth making that optional?

@clutchski any additional thoughts?

@LeoCavaille
Member Author

@remh this one is ready to go too. However, I merged the histogram min branch into it first because that branch contains a refactor of aggregator.py, so that one should go out first.

Fixes #1256. We should always consider that dogstatsd
receives a utf-8 encoded string through its socket,
though decoding it comes with a non-negligible performance
overhead of ~7% in simple benchmarks.
The server therefore treats packets as ASCII-only content
by default; flipping the utf8_decoding flag in the config
allows it to parse utf-8 packets correctly.

Most (dog)statsd clients should already encode their packets
in utf-8 when sending data.
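
The opt-in behavior described in this commit message can be sketched as follows. This is not the agent's actual code: `normalize_packets` is a hypothetical helper and `text_type` an illustrative alias; only the `utf8_decoding` flag name comes from the message above.

```python
# `unicode` on Python 2, `str` on Python 3 (illustrative alias only).
text_type = type(u"")


def normalize_packets(packets, utf8_decoding=False):
    """Return packets ready for parsing.

    Hypothetical helper: with the flag off, the input is passed through
    untouched (the ASCII-only default); with it on, byte strings are
    decoded as UTF-8, while text that is already unicode is left alone.
    """
    if utf8_decoding and not isinstance(packets, text_type):
        return text_type(packets, "utf-8")
    return packets


if __name__ == "__main__":
    raw = u"caf\u00e9".encode("utf-8")
    print(normalize_packets(raw) == raw)
    print(normalize_packets(raw, utf8_decoding=True) == u"caf\u00e9")
```

Keeping the flag off by default means ASCII-only deployments pay nothing for the feature, which matches the trade-off discussed in the thread.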
LeoCavaille added a commit that referenced this pull request Jan 28, 2015
[dogstatsd] handle properly utf8 packets
@LeoCavaille LeoCavaille merged commit fbbf4de into master Jan 28, 2015
@LeoCavaille LeoCavaille deleted the leo/dogstatsdutf8 branch January 28, 2015 17:01
Development

Successfully merging this pull request may close these issues.

[dogstatsd] events payload does not support utf-8 properly