
Implement WSGI middleware integration. #84

Merged Aug 19, 2019 (38 commits)

Conversation

@a-feld (Member) commented Aug 12, 2019

This PR implements a WSGI middleware (#81) which invokes the OpenTelemetry tracer APIs to time the request/response cycle.

This PR currently uses the trace_api.tracer() and tracer.start_span() APIs; the intent is to also use the set_attribute APIs to add dimensions to the generated span.
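The request/response timing the description mentions can be sketched roughly as follows. This is a hypothetical illustration, not the PR's code: the class name TimingMiddleware is invented, and the tracer is assumed to expose start_span() as a context manager.

```python
# Hypothetical sketch of timing the WSGI request/response cycle with a tracer.
# TimingMiddleware is an invented name; the tracer is assumed to expose
# start_span() as a context manager that records the enclosed duration.

class TimingMiddleware:
    def __init__(self, wsgi, tracer):
        self.wsgi = wsgi      # the wrapped WSGI application
        self.tracer = tracer  # assumed to provide start_span()

    def __call__(self, environ, start_response):
        # Everything inside the with-block, including response iteration,
        # is attributed to the span.
        with self.tracer.start_span(environ.get("PATH_INFO", "/")):
            for chunk in self.wsgi(environ, start_response):
                yield chunk
```

Note that because `__call__` is a generator here, the span only starts once the server begins iterating; that trade-off comes up later in this thread.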

@a-feld a-feld force-pushed the wsgi-integration branch 2 times, most recently from bbd118c to 5938ea6 Compare August 12, 2019 23:46
@c24t (Member) commented Aug 13, 2019

@a-feld do you want reviews on this yet, or should we wait until it's out of draft?

@a-feld (Member, Author) commented Aug 13, 2019

@c24t Comments are always welcome! 😄 I was waiting for #83 to be merged before updating the PR with set_attribute calls.

@reyang (Member) commented Aug 14, 2019

Regarding the package naming convention, package structure, folder structure, README.rst, setup.py, setup.cfg and tox, please take a look at https://github.com/census-instrumentation/opencensus-python/tree/master/contrib.

@reyang (Member) commented Aug 14, 2019

@a-feld Just saw your comment that you expect early feedback. Tip: you can let us know the PR is ready for review by removing the draft status.
Any PR that has "WIP" or "Work in Progress" in the title, or is in draft status, is treated as under construction, and I personally wouldn't comment on it until it's ready.
If the PR is ready for comments but pending on another PR before it can be merged, we can tag it as do-not-merge.

(Resolved review thread on integrations/opentelemetry-wsgi/.flake8)
        self.span = None

    def _add_request_attributes(self, environ):
        self.span.add_attribute("http.method", environ["HTTP_METHOD"])
Member:

We should provide constants for these well-known attributes (like http.method).


        # TODO: change start_response to self._add_response_attributes after
        # set_attribute API is implemented
        for yielded in self.wsgi(environ, start_response):
Member:

We should not add our own loop here. For example, if the underlying app uses https://www.python.org/dev/peps/pep-3333/#optional-platform-specific-file-handling, we destroy any optimizations done by that. Maybe more relevant, some servers could optimize if an object with known len is returned. I think yield from would have the same problem but I'm not sure. We could try to instrument/add the close method of the result (though that might fail if a C object is returned that does not support setting that attribute).
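The "instrument the close method of the result" alternative mentioned above could be sketched like this. The helper name timed_call and the span's start()/end() methods are assumed names for illustration, not the PR's API:

```python
# Sketch of returning the app's result unchanged (preserving file_wrapper
# and len() optimizations) and wrapping only its close() method.
# `timed_call` and the span's start()/end() methods are hypothetical names.

def timed_call(wsgi, environ, start_response, span):
    span.start()
    result = wsgi(environ, start_response)
    original_close = getattr(result, "close", None)

    def close():
        try:
            if original_close is not None:
                original_close()
        finally:
            span.end()

    try:
        # May raise for C objects or objects with __slots__; a real
        # implementation would need a fallback so the span still ends.
        result.close = close
    except AttributeError:
        pass
    return result
```

The downside, as noted above, is that attribute assignment can fail on some result objects, in which case the span would never be ended without extra handling.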

@a-feld (Member, Author):

This is definitely an interesting problem that I haven't considered in depth yet. From reading the specification I believe the optimization you're referring to is this part of the spec?

The callable must return an iterable object, and must not perform any data transmission until and unless the server/gateway actually receives the iterable as a return value from the application. (To do otherwise would prevent middleware from being able to interpret or override the response data.)

From a functional perspective, I believe the behavior is correct since the loop is iterating over the wsgi.file_wrapper iterable.

From a timing perspective, I wonder what the response time should be recorded as in this case? Should it include the time the server spends streaming the file? For any other request, the answer would generally be yes.

Performance-wise, the intent was to avoid allocating a new response-like iterable object and instead pass the values through transparently, taking advantage of the context manager behavior for proper timing. In order to maintain the pass-through behavior, the application would have to be invoked and a response wrapper object that was iterable would have to be created.

def __call__(self, environ, start_response):
    iterable = self.wsgi(environ, start_response)
    return ResponseTimer(iterable)  # allocation of the ResponseTimer object

@a-feld (Member, Author) Aug 14, 2019:

Maybe more relevant, some servers could optimize if an object with known len is returned.

I'll have to think more about this. If there are examples of these types of optimizations I'd love to have a look!

@a-feld (Member, Author) Aug 14, 2019:

Also, as far as I can tell, the only way to preserve the wsgi.file_wrapper optimization is to add an isinstance check for wsgi.file_wrapper. That approach implies we cannot time the full response cycle in which the file contents are streamed.

So the trade-off is performance versus accuracy of the response time values.
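The isinstance approach described here could look roughly like the following. The helper name respond is hypothetical, and a real middleware would also manage the span around this logic:

```python
# Sketch of the isinstance check: pass wsgi.file_wrapper results through
# untouched (preserving the server's file-streaming optimization, but losing
# the streaming time from the span), and iterate everything else.
# `respond` is a hypothetical helper name.

def respond(wsgi, environ, start_response):
    result = wsgi(environ, start_response)
    file_wrapper = environ.get("wsgi.file_wrapper")
    if file_wrapper is not None and isinstance(result, file_wrapper):
        # The server streams the file itself; the span would end here,
        # before streaming completes.
        return result
    return iter(result)
```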

        if url is not None:
            self.span.add_attribute("http.url", url)

    def _add_response_attributes(self, status, response_headers, *args):
Member:

For later, we should consider processing the exception information in args. Maybe add a # TODO for now.

@Oberon00 (Member) commented:

Regarding the package naming convention, package structure, folder structure, README.rst, setup.py, setup.cfg and tox, please take a look at https://github.com/census-instrumentation/opencensus-python/tree/master/contrib.

-- (@reyang in #84 (comment))

For example, this means no per-package tox.ini, but adding the package to the root tox.ini?

@reyang (Member) commented Aug 14, 2019

For example, this means no per-package tox.ini, but adding the package to the root tox.ini?

Exactly. Personally I would highly recommend this way because:

  1. It makes CI much easier (especially if the CI involves installing multiple packages across the repo).
  2. It enforces all the packages to follow the same set of rules/tools (e.g. lint, doc, coverage).
  3. Maintenance cost is lower.

Later, if we decide to move a package to a git submodule, we can easily copy the tox.ini file with small changes.
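A root tox.ini in the style described above might look like this fragment. The env and path names here are illustrative, not taken from this repo:

```ini
# Illustrative only: a single root tox.ini drives every package in the repo,
# so lint/doc/coverage rules are shared across packages.
[tox]
envlist = py37-{api,wsgi}, lint

[testenv]
deps = pytest
changedir =
    wsgi: integrations/opentelemetry-wsgi
commands = pytest
```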

@a-feld a-feld force-pushed the wsgi-integration branch 3 times, most recently from 8380e96 to 71a7c6e Compare August 14, 2019 18:16
@a-feld a-feld marked this pull request as ready for review August 14, 2019 18:18
@a-feld a-feld force-pushed the wsgi-integration branch 5 times, most recently from bc6fed8 to 9b27c21 Compare August 14, 2019 21:43
        start_response = self._create_start_response(span, start_response)

        for yielded in self.wsgi(environ, start_response):
            yield yielded
@reyang (Member) Aug 16, 2019:

Interesting, first time I've seen yield yielded :)

Member:

I still vote to at least use yield from.

@a-feld (Member, Author) Aug 16, 2019:

yield from is equivalent in this circumstance, since WSGI applications are not coroutines. Unfortunately, yield from doesn't solve the issue of the WSGI application becoming a generator / deferring the call to the WSGI callable 😞

@a-feld (Member, Author) Aug 16, 2019:

I've been putting some thought into this and I think yield from would actually be incorrect in this case. While the call to self.wsgi returns an iterable, that iterable is not guaranteed to be a generator so delegating send and close calls to the result of self.wsgi would be incorrect.

Note that PEP 380 defines yield from as a way for a generator to delegate part of its operations to another generator.

Since self.wsgi does not always return a generator, using yield from here would be incorrect.

@Oberon00 (Member) Aug 19, 2019:

yield from supports, but does not require, the iterable to implement close, throw, or send (it is defined as yield from <iterable>, not yield from <generator>, even though generators seem to have been the main motivation). But the WSGI spec explicitly calls out close and generators; see https://www.python.org/dev/peps/pep-3333/#specification-details. So delegating close (if it exists) would actually be beneficial:

If the iterable returned by the application has a close() method, the server or gateway must call that method upon completion of the current request, whether the request was completed normally, or terminated early due to an application error during iteration or an early disconnect of the browser. (The close() method requirement is to support resource release by the application. This protocol is intended to complement PEP 342's generator support, and other common iterables with close() methods.)

What is more, yield from seems to be a tiny bit faster than a manual loop:

In [1]: def generate(xs): yield from xs

In [4]: def generate2(xs):
   ...:     for x in xs:
   ...:         yield x
   ...:

In [6]: xs = [b'<html>'] * 1024

In [18]: %timeit for _ in generate(xs): pass
23.5 µs ± 374 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [19]: %timeit for _ in generate2(xs): pass
30.2 µs ± 729 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [21]: xs = xs[:1]

In [22]: %timeit for _ in generate(xs): pass
196 ns ± 1.19 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [23]: %timeit for _ in generate2(xs): pass
202 ns ± 4.79 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

@a-feld (Member, Author):

The close call is definitely an omission on my part, great catch! I'll fix this before merging! 😄

For yield from, the performance numbers are pretty compelling, but the behavior does change between yield and yield from: we don't want to delegate close calls to an underlying iterable that's not a generator. I think the problem here is the "re-use" of the close name in the WSGI sense versus the generator sense. Here's an example where the close behavior differs:

import sys


class Response:
    def __init__(self, v, exc=None):
        self.iterable = iter(v)
        self.exc = exc

    def __next__(self):
        if self.exc:
            raise self.exc
        return next(self.iterable)

    def __iter__(self):
        return self

    def close(self):
        caller = sys._getframe(1)
        print("***** close called from {}***** ".format(str(caller)))


def gen_yielded(iterable):
    try:
        for yielded in iterable:
            yield yielded
    finally:
        iterable.close()


def gen_yield_from(iterable):
    try:
        yield from iterable
    finally:
        iterable.close()


def drive(g, first_value_only=False, exc=None):
    iterable = Response([1, 2, 3], exc=exc)
    g = g(iterable)

    try:
        for item in g:
            if first_value_only:
                break
    except Exception:
        return "exception"
    finally:
        g.close()

    return item


print("gen_yielded complete:")
print(drive(gen_yielded), "\n")
print("gen_yield_from complete:")
print(drive(gen_yield_from), "\n")

print("gen_yielded first_only:")
print(drive(gen_yielded, first_value_only=True), "\n")
print("gen_yield_from first_only:")
print(drive(gen_yield_from, first_value_only=True), "\n")

print("gen_yielded exc:")
print(drive(gen_yielded, exc=ValueError), "\n")
print("gen_yield_from exc:")
print(drive(gen_yield_from, exc=ValueError), "\n")

Result:

gen_yielded complete:
***** close called from <frame at 0x10ab915f0, file 'wsgi.py', line 27, code gen_yielded>***** 
3 

gen_yield_from complete:
***** close called from <frame at 0x10ac31bf0, file 'wsgi.py', line 34, code gen_yield_from>***** 
3 

gen_yielded first_only:
***** close called from <frame at 0x10ab915f0, file 'wsgi.py', line 27, code gen_yielded>***** 
1 

gen_yield_from first_only:
***** close called from <frame at 0x10ac82250, file 'wsgi.py', line 48, code drive>***** 
***** close called from <frame at 0x10ac31bf0, file 'wsgi.py', line 34, code gen_yield_from>***** 
1 

gen_yielded exc:
***** close called from <frame at 0x10ab915f0, file 'wsgi.py', line 27, code gen_yielded>***** 
exception 

gen_yield_from exc:
***** close called from <frame at 0x10ac31bf0, file 'wsgi.py', line 34, code gen_yield_from>***** 
exception 

(Resolved review threads on tox.ini)
@reyang (Member) left a comment:

LGTM after fixing the lint issues.

@c24t (Member) left a comment:

No more blocking comments from me. This PR looks great, and will be especially helpful as we write integration tests to prepare for the release.

Thanks for fielding all these comments @a-feld, and for forcing me to go read the WSGI PEPs.

Let's get @Oberon00's approval before merging.

@Oberon00 (Member) commented:

Even though I'd like to see the yield from discussion properly resolved, apart from that, this LGTM.

@Oberon00 (Member) commented Aug 19, 2019

Actually, I noticed one additional (probably minor) issue: by making __call__ a generator, we only start the span once the server starts iterating. Since we also delay all processing that the wrapped WSGI app might do directly in its __call__ to that point, the span's time should be more or less correct, though. (My LGTM still stands; if you want, this could be discussed in a separate issue.)

@a-feld (Member, Author) commented Aug 19, 2019

Actually, I noticed one additional (probably minor) issue: by making __call__ a generator, we only start the span once the server starts iterating. Since we also delay all processing that the wrapped WSGI app might do directly in its __call__ to that point, the span's time should be more or less correct, though. (My LGTM still stands; if you want, this could be discussed in a separate issue.)

Yes, this is the case, but the timing is still correct minus the time between the call and the start of iteration (which should encompass only the generator allocation plus anything the server does between those two calls).

The alternative is to allocate a custom response iterable and return that object:

def __call__(self, environ, start_response):
    span = tracer.create_span(...)
    span.start()
    iterable = self.wsgi(environ, start_response)
    return Response(span, iterable)

I figured the performance hit for the allocation of the Response object wasn't worth it, but it is an option 😄
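For illustration, that hypothetical Response wrapper might look like the following sketch. The span's end() method is an assumed name, and this class is invented here, not taken from the PR:

```python
# Sketch of the hypothetical Response wrapper: the span is started eagerly
# in __call__, and ended when the server finishes with the response.
# The span's end() method is an assumed name.

class Response:
    def __init__(self, span, iterable):
        self.span = span          # already started by the middleware
        self.iterable = iterable  # result of the wrapped WSGI app

    def __iter__(self):
        return iter(self.iterable)

    def close(self):
        # Forward close() per PEP 3333, then end the span regardless.
        try:
            close = getattr(self.iterable, "close", None)
            if close is not None:
                close()
        finally:
            self.span.end()
```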

@a-feld (Member, Author) commented Aug 19, 2019

Thanks for the reviews everybody! 🎉 I'm going to merge this in for now. We can always follow up if changes are required!

@a-feld a-feld merged commit 17afba5 into open-telemetry:master Aug 19, 2019
@a-feld a-feld deleted the wsgi-integration branch August 19, 2019 18:10
@Oberon00 Oberon00 mentioned this pull request Sep 18, 2019
srikanthccv pushed a commit to srikanthccv/opentelemetry-python that referenced this pull request Nov 1, 2020