feat: Handle request list user input #326
Merged
Changes from all commits (21 commits):

- cc94023 Draft example of helper function to create RequestList (Pijukatel)
- ded055d Add test for simple input (Pijukatel)
- 57dd329 WIP (Pijukatel)
- 0a465be WIP (Pijukatel)
- 4b16737 Use Pydantic to handle raw inputs (Pijukatel)
- f2c2440 Properly pass request creation settings. (Pijukatel)
- 5af9405 Add tests fro regexp. (Pijukatel)
- cf4534a Use regex instead of re. (Pijukatel)
- ff3e047 Use re with \w (Pijukatel)
- b4ad24f Reduce some test code repetition. (Pijukatel)
- 05d048a Remove types-regex (Pijukatel)
- 376ae8b Make ruff happy (Pijukatel)
- 629939e Remove Input class (Pijukatel)
- 910d11f Review comments. (Pijukatel)
- 6ff9e90 Remove unnecessary pyproject setting value. (Pijukatel)
- 3f33145 Addresing review comments (Pijukatel)
- 318c9c8 Addresing review comments 2 (Pijukatel)
- feff6ba Update src/apify/storages/_request_list.py (Pijukatel)
- f18133f Merge remote-tracking branch 'origin/master' into handle-request-list… (Pijukatel)
- b150e1d Use docs_group decorator (Pijukatel)
- 470041f Update src/apify/storages/_request_list.py (janbuchar)
src/apify/storages/__init__.py

```diff
@@ -1,3 +1,5 @@
 from crawlee.storages import Dataset, KeyValueStore, RequestQueue
 
+from ._request_list import RequestList
+
-__all__ = ['Dataset', 'KeyValueStore', 'RequestQueue']
+__all__ = ['Dataset', 'KeyValueStore', 'RequestQueue', 'RequestList']
```
src/apify/storages/_request_list.py (new file, +150 lines)

````python
from __future__ import annotations

import asyncio
import re
from asyncio import Task
from functools import partial
from typing import Annotated, Any, Union

from pydantic import BaseModel, Field, TypeAdapter

from crawlee import Request
from crawlee._types import HttpMethod
from crawlee.http_clients import BaseHttpClient, HttpxHttpClient
from crawlee.storages import RequestList as CrawleeRequestList

from apify._utils import docs_group

URL_NO_COMMAS_REGEX = re.compile(
    r'https?:\/\/(www\.)?([^\W_]|[^\W_][-\w0-9@:%._+~#=]{0,254}[^\W_])\.[a-z]{2,63}(:\d{1,5})?(\/[-\w@:%+.~#?&/=()]*)?'
)


class _RequestDetails(BaseModel):
    method: HttpMethod = 'GET'
    payload: str = ''
    headers: Annotated[dict[str, str], Field(default_factory=dict)] = {}
    user_data: Annotated[dict[str, str], Field(default_factory=dict, alias='userData')] = {}


class _RequestsFromUrlInput(_RequestDetails):
    requests_from_url: str = Field(alias='requestsFromUrl')


class _SimpleUrlInput(_RequestDetails):
    url: str


url_input_adapter = TypeAdapter(list[Union[_RequestsFromUrlInput, _SimpleUrlInput]])


@docs_group('Classes')
class RequestList(CrawleeRequestList):
    """Extends the Crawlee RequestList.

    The open method is used to create a RequestList from the Actor's requestListSources input.
    """

    @staticmethod
    async def open(
        name: str | None = None,
        request_list_sources_input: list[dict[str, Any]] | None = None,
        http_client: BaseHttpClient | None = None,
    ) -> RequestList:
        """Creates a RequestList from the Actor input requestListSources.

        Args:
            name: Name of the returned RequestList.
            request_list_sources_input: List of dicts with either a url key or a requestsFromUrl key.
            http_client: Client used to send GET requests to the URLs given by the requestsFromUrl values.

        Returns:
            RequestList created from request_list_sources_input.

        ### Usage

        ```python
        example_input = [
            # Gather urls from response body.
            {'requestsFromUrl': 'https://crawlee.dev/file.txt', 'method': 'GET'},
            # Directly include this url.
            {'url': 'https://crawlee.dev', 'method': 'GET'}
        ]
        request_list = await RequestList.open(request_list_sources_input=example_input)
        ```
        """
        request_list_sources_input = request_list_sources_input or []
        return await RequestList._create_request_list(name, request_list_sources_input, http_client)

    @staticmethod
    async def _create_request_list(
        name: str | None, request_list_sources_input: list[dict[str, Any]], http_client: BaseHttpClient | None
    ) -> RequestList:
        if not http_client:
            http_client = HttpxHttpClient()

        url_inputs = url_input_adapter.validate_python(request_list_sources_input)

        simple_url_inputs = [url_input for url_input in url_inputs if isinstance(url_input, _SimpleUrlInput)]
        remote_url_inputs = [url_input for url_input in url_inputs if isinstance(url_input, _RequestsFromUrlInput)]

        simple_url_requests = RequestList._create_requests_from_input(simple_url_inputs)
        remote_url_requests = await RequestList._fetch_requests_from_url(remote_url_inputs, http_client=http_client)

        return RequestList(name=name, requests=simple_url_requests + remote_url_requests)

    @staticmethod
    def _create_requests_from_input(simple_url_inputs: list[_SimpleUrlInput]) -> list[Request]:
        return [
            Request.from_url(
                method=request_input.method,
                url=request_input.url,
                payload=request_input.payload.encode('utf-8'),
                headers=request_input.headers,
                user_data=request_input.user_data,
            )
            for request_input in simple_url_inputs
        ]

    @staticmethod
    async def _fetch_requests_from_url(
        remote_url_requests_inputs: list[_RequestsFromUrlInput], http_client: BaseHttpClient
    ) -> list[Request]:
        """Create a list of requests from URLs.

        Send a GET request to the URL defined in each requests_from_url of remote_url_requests_inputs. Run an
        extraction callback on each response body, using the URL_NO_COMMAS_REGEX regex to find all links. Create a
        list of Requests from the collected links and the additional inputs stored in the other attributes of each
        remote_url_requests_inputs entry.
        """
        created_requests: list[Request] = []

        def create_requests_from_response(request_input: _RequestsFromUrlInput, task: Task) -> None:
            """Callback to scrape the response body with the regex and create Requests from the matches."""
            matches = re.finditer(URL_NO_COMMAS_REGEX, task.result().read().decode('utf-8'))
            created_requests.extend(
                [
                    Request.from_url(
                        match.group(0),
                        method=request_input.method,
                        payload=request_input.payload.encode('utf-8'),
                        headers=request_input.headers,
                        user_data=request_input.user_data,
                    )
                    for match in matches
                ]
            )

        remote_url_requests = []
        for remote_url_requests_input in remote_url_requests_inputs:
            get_response_task = asyncio.create_task(
                http_client.send_request(
                    method='GET',
                    url=remote_url_requests_input.requests_from_url,
                )
            )

            get_response_task.add_done_callback(partial(create_requests_from_response, remote_url_requests_input))
            remote_url_requests.append(get_response_task)

        await asyncio.gather(*remote_url_requests)
        return created_requests
````
Even though crawlee.configuration.Configuration inherits from pydantic_settings.BaseSettings, which in turn inherits from pydantic.BaseModel, ruff has trouble following the inheritance hierarchy that far. So until that changes, some models will have to be mentioned explicitly even though they inherit from pydantic.BaseModel.
See the closed issue where this is, to a certain degree, described as a known limitation:
astral-sh/ruff#8725
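For reference, "explicitly mentioned" here means listing the base classes in ruff's configuration. A minimal sketch, assuming the project configures ruff via pyproject.toml and uses the flake8-type-checking rules (the setting name is ruff's documented `runtime-evaluated-base-classes` option):

```toml
[tool.ruff.lint.flake8-type-checking]
# Tell ruff that subclasses of these bases need their annotations at runtime,
# even when the inheritance chain is too deep for it to detect on its own.
runtime-evaluated-base-classes = ["pydantic.BaseModel", "pydantic_settings.BaseSettings"]
```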