Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A more memory-efficient HttpResponseProvider #95

Open
BurnzZ opened this issue Nov 21, 2022 · 3 comments
Open

A more memory-efficient HttpResponseProvider #95

BurnzZ opened this issue Nov 21, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@BurnzZ
Copy link
Contributor

BurnzZ commented Nov 21, 2022

https://github.com/scrapinghub/scrapy-poet/blob/master/scrapy_poet/page_input_providers.py#L165-L180

Currently, the HttpResponseProvider creates a new HttpResponse instance each time it's called:

class HttpResponseProvider(PageObjectInputProvider, CacheDataProviderMixin):
    """This class provides ``web_poet.page_inputs.HttpResponse`` instances."""

    provided_classes = {HttpResponse}
    name = "response_data"

    def __call__(self, to_provide: Set[Callable], response: Response):
        """Builds a ``HttpResponse`` instance using a Scrapy ``Response``"""
        return [
            HttpResponse(
                url=response.url,
                body=response.body,
                status=response.status,
                headers=HttpResponseHeaders.from_bytes_dict(response.headers),
            )
        ]

From another thread:

Suppose the average HTML size for a particular website is 256 KB. Let's also suppose that we have 12 POs that we need to support in our MultiLayoutPage subclass. This means that for every multi layout PO instance, it holds at least 256 KB * 12 = 3 MB in memory. Assuming that we're parsing at a rate of 10 pages per second, then we're holding at least 3 MB * 10 pages = 30 MB of memory per second.

It's not a crucial issue for now but it can certainly be made more efficient by having the provider return the same HttpResponse instance given a response identifier. HttpResponseProvider already inherits from CacheDataProviderMixin. Perhaps we can use an in-memory cache to determine if we can return the same instance instead of creating a new one.

@BurnzZ BurnzZ added the enhancement New feature or request label Nov 21, 2022
@kmike
Copy link
Member

kmike commented Nov 21, 2022

hey @BurnzZ! I'm not sure it's going to be an issue. The largest field in HttpResponse is response.body, and it's an immutable byte string, which shouldn't be copied multiple times even if you create several HttpResponse instances.

Though maybe you're right, as we're converting it to HttpResponseBody instances. I'm not sure what's Python behavior in this case.

@BurnzZ
Copy link
Contributor Author

BurnzZ commented Nov 21, 2022

The largest field in HttpResponse is response.body, and it's an immutable byte string, which shouldn't be copied multiple times even if you create several HttpResponse instances.

I think the byte string is still copied over by value and not by reference:

from web_poet import HttpResponse

text = b"Hello World"

r1 = HttpResponse(url="example.com", body=text)
r2 = HttpResponse(url="example.com", body=text)

r1.body == r2.body  # True

r1.body is r2.body          # False
id(r1.body) == id(r2.body)  # False

I think it has something to do with HttpResponse being an attrs dataclass.

If we define it as something like the following, then the object reference is used:

class HttpResponse:
    def __init__(self, url, body):
        self.url = url
        self.body = body

text = b"Hello World"

r1 = HttpResponse(url="example.com", body=text)
r2 = HttpResponse(url="example.com", body=text)

r1.body == r2.body  # True

r1.body is r2.body          # True
id(r1.body) == id(r2.body)  # True

So I guess one approach is to modify how we define HttpResponse.

@kmike
Copy link
Member

kmike commented Nov 21, 2022

I think it has something to do with HttpResponse being an attrs dataclass.

It could be more about HttpResponseBody objects created from the response.body passed to HttpResponse.__init__

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants