Ability to transform proxied content #5

simonw · 2023-07-16T15:39:50Z

Right now this proxy isn't useful for anything other than forwarding traffic to somewhere else.

Including patterns for transforming proxied content would be really useful.

Some transformations that I would want to support:

- Manipulating headers: adding and removing HTTP headers sent to and received from the proxied backend
- Content transformations on the body
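
For the header case, here is a minimal sketch. It assumes headers arrive in the ASGI shape of (name, value) byte pairs. The `rewrite_headers` helper and its `remove` / `add` parameters are hypothetical and not part of this library:

```python
def rewrite_headers(headers, remove=(), add=()):
    """Return a new header list with some names dropped and some pairs appended.

    Hypothetical helper: headers is an iterable of (name, value) byte pairs.
    Header names are compared case-insensitively.
    """
    remove_set = {name.lower() for name in remove}
    kept = [(name, value) for name, value in headers if name.lower() not in remove_set]
    return kept + list(add)


upstream = [(b"Server", b"nginx"), (b"content-type", b"text/html")]
downstream = rewrite_headers(upstream, remove=[b"server"], add=[(b"x-proxied", b"1")])
```

The same function could be applied in both directions: to the request headers before they go to the backend, and to the response headers before they go back to the client.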

A challenge with content transformations is that they are harder to implement in a streaming fashion - often they'll need to accumulate the entire response body before being applied. Supporting a neat pattern for optionally doing that would be useful too.
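
One possible shape for the "accumulate first" option is sketched below. It assumes a body transformer is an async generator that takes an async generator of byte chunks. The `buffered` wrapper name is made up for illustration:

```python
import asyncio


def buffered(transform):
    """Wrap a whole-body function so it can be used where a streaming
    transformer (async generator in, async generator out) is expected."""

    async def wrapper(chunks):
        # Accumulate the full body; this gives up streaming on purpose
        body = b"".join([chunk async for chunk in chunks])
        yield transform(body)

    return wrapper


async def demo():
    async def body():
        yield b"hello, "
        yield b"world"

    shout = buffered(bytes.upper)
    return [chunk async for chunk in shout(body())]
```

With this approach, streaming and buffered transformers share one interface, at the cost of holding the whole response in memory for the buffered ones.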

simonw · 2023-07-16T15:51:28Z

It would be fun to experiment with transforming streaming content - using ijson for JSON and some streaming HTML parser for HTML.

simonw · 2023-07-16T15:52:33Z

Built this quick prototype with the help of ChatGPT:

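import asyncio
from html import escape
from html.parser import HTMLParser
import sys


class AddClassToPTagParser(HTMLParser):
    def __init__(self):
        # convert_charrefs=False stops the parser from decoding entities into
        # handle_data, so handle_entityref/handle_charref can re-emit them as-is
        super().__init__(convert_charrefs=False)
        self.modified_html = ""

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            attrs = dict(attrs)
            if attrs.get("class"):
                if "foo" not in attrs["class"].split():
                    attrs["class"] += " foo"
            else:
                attrs["class"] = "foo"

            # Values arrive unescaped; None marks a bare attribute like "hidden"
            attrs_str = " ".join(
                k if v is None else f'{k}="{escape(v)}"' for k, v in attrs.items()
            )
            self.modified_html += f"<{tag} {attrs_str}>"
        else:
            self.modified_html += self.get_starttag_text()

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> pass through without a stray </br>
        self.modified_html += self.get_starttag_text()

    def handle_endtag(self, tag):
        self.modified_html += f"</{tag}>"

    def handle_data(self, data):
        self.modified_html += data

    def handle_entityref(self, name):
        self.modified_html += f"&{name};"

    def handle_charref(self, name):
        self.modified_html += f"&#{name};"

    def handle_comment(self, data):
        self.modified_html += f"<!--{data}-->"

    def handle_decl(self, decl):
        self.modified_html += f"<!{decl}>"

    def feed(self, data):
        # Reset the buffer so each call returns only the newly produced output
        self.modified_html = ""
        super().feed(data)
        return self.modified_html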


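async def transform_html(async_generator):
    parser = AddClassToPTagParser()

    async for chunk in async_generator:
        yield parser.feed(chunk)
    # Flush anything the parser is still buffering at end of stream
    parser.modified_html = ""
    parser.close()
    if parser.modified_html:
        yield parser.modified_html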


async def test():
    async def html_generator():
        chunks = [
            "<html><head>",
            "<title>Te",
            "st</title></he",
            "ad><body><p class=",
            '"bar">Hello, ',
            "world!</p>",
            "<p>",
            "Hello again, ",
            "world!</p></body></html>",
        ]
        for chunk in chunks:
            yield chunk
            await asyncio.sleep(0.5)

    async for transformed_chunk in transform_html(html_generator()):
        print(transformed_chunk, end='')
        sys.stdout.flush()
    print()


# Run the test coroutine
asyncio.run(test())

simonw · 2023-07-16T16:02:11Z

For ijson I think this example is relevant, from https://github.com/ICRAR/ijson/tree/master#push-interfaces

import ijson

events = ijson.sendable_list()
coro = ijson.items_coro(events, 'earth.europe.item')
f = urlopen('http://.../')
for chunk in iter(functools.partial(f.read, buf_size)):
   coro.send(chunk)
   process_accumulated_events(events)
   del events[:]
coro.close()
process_accumulated_events(events)

simonw · 2023-07-16T16:05:45Z

ijson example (ChatGPT wanted remove_keys to be an async def but I fixed that):

import asyncio
import ijson


def remove_keys(obj):
    """Remove keys starting with '_' from an object"""
    if isinstance(obj, dict):
        return {
            k: remove_keys(v) for k, v in obj.items() if not k.startswith("_")
        }
    elif isinstance(obj, list):
        return [remove_keys(item) for item in obj]
    else:
        return obj


async def transform_json(async_generator):
    events = ijson.sendable_list()
    coro = ijson.items_coro(events, "item")

    async for chunk in async_generator:
        coro.send(chunk)
        while events:
            transformed_item = remove_keys(events.pop(0))
            yield transformed_item
    coro.close()


async def test():
    async def json_stream():
        chunks = [
            b'[{"item": {"_id": 1, "name": "test1"}},',
            b'{"item": {"_id": 2, "name": "test2"}},',
            b'{"item": {"_id": 3, "name": "test3"}}]',
        ]
        for chunk in chunks:
            yield chunk
            await asyncio.sleep(0.1)

    async for transformed_item in transform_json(json_stream()):
        print(transformed_item)


# Run the test coroutine
asyncio.run(test())

Note that this outputs:

{'item': {'name': 'test1'}}
{'item': {'name': 'test2'}}
{'item': {'name': 'test3'}}

This loses the enclosing [ and ], so the result is a sequence of separate items rather than one array. Needs more work.
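
One way to get the brackets back, sketched under the assumption that the top-level value is always an array, is to re-serialize the transformed items as a single JSON array stream. The `as_json_array` and `_json_default` names are invented for this sketch. The Decimal handling is there because, as far as I know, ijson yields Decimal for non-integer numbers unless told otherwise:

```python
import asyncio
import json
from decimal import Decimal


def _json_default(value):
    # json.dumps rejects Decimal, so convert it (this can lose precision)
    if isinstance(value, Decimal):
        return float(value)
    raise TypeError(f"Cannot serialize {type(value).__name__}")


async def as_json_array(items):
    """Re-emit an async stream of Python objects as the bytes of one JSON array."""
    yield b"["
    first = True
    async for item in items:
        prefix = b"" if first else b","
        first = False
        yield prefix + json.dumps(item, default=_json_default).encode("utf-8")
    yield b"]"


async def demo():
    async def items():
        yield {"item": {"name": "test1"}}
        yield {"item": {"name": "test2"}}

    return b"".join([chunk async for chunk in as_json_array(items())])
```

It would compose with the code above as `as_json_array(transform_json(json_stream()))`. It does not preserve the original whitespace or handle a top-level object, so it is only a partial answer.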

ChatGPT transcript that got me here: https://chat.openai.com/share/3461da01-6e49-4324-9ece-cc2be1134f04

simonw · 2023-07-16T16:09:49Z

So I think the interface for this ends up looking something like this:

# As seen above:
async def transform_html(async_generator):
    parser = AddClassToPTagParser()

    async for chunk in async_generator:
        yield parser.feed(chunk)

app = asgi_proxy("https://datasette.io", body_transformer=transform_html)

This may be too simple though, since the transformer should probably take into account things like the content-type header before deciding what to do.

Maybe something like this instead:

def transform_factory(httpx_response):
    # At this point we just have the headers
    if 'text/html' in httpx_response.headers.get('content-type', ''):
        return transform_html

app = asgi_proxy("https://datasette.io", body_transform_factory=transform_factory)
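
To make the factory idea concrete, here is a rough sketch of how the proxy might apply it internally. Nothing here exists in the library yet. `transformed_body`, `shout` and the `SimpleNamespace` stand-in for an httpx response are all assumptions made for illustration, and a factory that returns None is taken to mean "pass the body through untouched":

```python
import asyncio
from types import SimpleNamespace


async def shout(chunks):
    # Stand-in transformer: upper-cases every chunk
    async for chunk in chunks:
        yield chunk.upper()


def transform_factory(response):
    # Only the headers are inspected; the body has not been read yet
    if "text/html" in response.headers.get("content-type", ""):
        return shout
    return None


async def transformed_body(response, chunks, factory=None):
    """Hypothetical glue: pick a transformer from the headers, else pass through."""
    transformer = factory(response) if factory is not None else None
    stream = chunks if transformer is None else transformer(chunks)
    async for chunk in stream:
        yield chunk


async def demo(content_type):
    async def body():
        yield b"<p>hi</p>"

    # A plain dict stands in for httpx's case-insensitive headers object
    response = SimpleNamespace(headers={"content-type": content_type})
    return [c async for c in transformed_body(response, body(), transform_factory)]
```

One thing this highlights: once a transformer rewrites the body, the upstream content-length will usually no longer match, so the proxy would need to drop or recompute that header.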

simonw added the enhancement label Jul 16, 2023

simonw mentioned this issue Nov 6, 2024

Support for ["accept-encoding"] = "identity" #9

Closed
