Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: improve timing of checks depending on changes since last check #163

Merged
merged 95 commits into from
Dec 9, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
95 commits
Select commit Hold shift + click to select a range
739ef30
docs: update changelog
bolinocroustibat Sep 10, 2024
87d2c52
feat: select outdated checks with increasing time
bolinocroustibat Sep 11, 2024
ed5a584
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Sep 12, 2024
68f5556
fix: fix wrong config var format
bolinocroustibat Sep 13, 2024
d72ffac
tests: fix test
bolinocroustibat Sep 18, 2024
76ab578
docs: fix docstring
bolinocroustibat Sep 18, 2024
c72a100
docs: add comments
bolinocroustibat Sep 18, 2024
b8c8cac
clean: remove useless imports
bolinocroustibat Sep 18, 2024
7d07a2b
fix: update the count of outdated checks in get_crawler_status
bolinocroustibat Sep 18, 2024
825e793
docs: update docstrings
bolinocroustibat Sep 18, 2024
463f229
refactor: rename some vars
bolinocroustibat Sep 18, 2024
ab00ea7
docs: update changelog
bolinocroustibat Sep 18, 2024
0e47263
docs: update docstrings
bolinocroustibat Sep 18, 2024
1825df6
tests: add tests (wip)
bolinocroustibat Sep 18, 2024
941e894
tests: add tests for re-check before/after default delay
bolinocroustibat Sep 19, 2024
2e0fb48
tests: better names and comments
bolinocroustibat Sep 19, 2024
f07bcee
tests: refactor latest tests as parametrized (test_re_check_depending…
bolinocroustibat Sep 19, 2024
5f0c84c
feat: manage large resources exceptions differently (#148)
bolinocroustibat Sep 23, 2024
e33f090
WIP
bolinocroustibat Sep 23, 2024
6adec5c
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Sep 23, 2024
c25281c
fix: fix increasing check delay logic
bolinocroustibat Sep 23, 2024
35ecc23
tests: fix tests
bolinocroustibat Sep 23, 2024
3491421
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Sep 24, 2024
f5eef08
style: lint code
bolinocroustibat Sep 26, 2024
852663a
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Sep 26, 2024
6e6ac05
feat: use CHECK_DELAY_DEFAULT
bolinocroustibat Sep 26, 2024
431aa7f
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Sep 26, 2024
571a12d
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Sep 26, 2024
ff1fc70
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Sep 28, 2024
3daf61b
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Sep 30, 2024
524e57c
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Oct 2, 2024
70a2640
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Oct 2, 2024
3e2e6cd
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Oct 3, 2024
a8dc898
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Oct 22, 2024
9a10b7c
docs: update changelog
bolinocroustibat Oct 22, 2024
fe9e68c
docs: update comment
bolinocroustibat Oct 22, 2024
b9e4099
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Oct 24, 2024
2f59451
WIP
bolinocroustibat Oct 24, 2024
6d99e23
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Oct 24, 2024
461cf9a
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Oct 25, 2024
8c6372d
feat: add migratino to add a next_check column to checks
bolinocroustibat Oct 25, 2024
5aa528b
feat: new logic for select_batch
bolinocroustibat Oct 25, 2024
8f3e121
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Oct 29, 2024
96256a8
clean: remove useless imports
bolinocroustibat Oct 29, 2024
0d4275a
docs: add TODO
bolinocroustibat Oct 29, 2024
9661aa5
feat: add logic to calculate next_check datetime
bolinocroustibat Oct 29, 2024
79a0421
fix: better next_check behaviour
bolinocroustibat Oct 29, 2024
a7ae5dd
tests: revert test
bolinocroustibat Oct 29, 2024
04d324a
docs: add comment
bolinocroustibat Oct 29, 2024
9de1c48
fix: fix wrong var in code
bolinocroustibat Oct 30, 2024
23d5ec4
fix: fix type hint
bolinocroustibat Oct 30, 2024
80645f5
refactor: refactor crawler status logic
bolinocroustibat Oct 30, 2024
ae8ecaa
refactor: slight refactor of next_check logic
bolinocroustibat Oct 30, 2024
e809a52
clean: remove useless imports
bolinocroustibat Oct 30, 2024
3039850
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Oct 30, 2024
a69f5bc
docs: update changelog
bolinocroustibat Oct 30, 2024
4c4201e
fix: don't use CHECK_DELAY_DEFAULT
bolinocroustibat Oct 30, 2024
df0f967
refactor: change next_check to next_check_at
bolinocroustibat Oct 30, 2024
e68ab04
docs: comment fix
bolinocroustibat Oct 30, 2024
5a95774
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Oct 30, 2024
b32a703
fix: fix migration
bolinocroustibat Nov 4, 2024
e67f2a0
fix: fix crawler status bug
bolinocroustibat Nov 4, 2024
bf5a455
clean: add types
bolinocroustibat Nov 4, 2024
6285a54
refactor: calculate next_check_at date also depending on resource mod…
bolinocroustibat Nov 6, 2024
5df8670
docs: update docstring
bolinocroustibat Nov 6, 2024
3a14cd1
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Nov 6, 2024
9adbcff
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Nov 6, 2024
eef5f21
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Nov 6, 2024
7f0348f
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Nov 6, 2024
0f1e2e1
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Nov 7, 2024
913913d
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Nov 8, 2024
a60bff9
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Nov 15, 2024
85f0ef3
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Nov 15, 2024
2c07642
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Nov 15, 2024
4bcb89b
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Nov 15, 2024
927bb09
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Nov 15, 2024
5e651eb
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Nov 20, 2024
6a9a7ea
fix: fix calculate_next_check type bug issue
bolinocroustibat Nov 22, 2024
abc45b6
fix tests
bolinocroustibat Nov 22, 2024
8e7ec6f
fix: fix nasty bug in preprocess_check_data
bolinocroustibat Nov 22, 2024
0e89128
fix: fix select resource when last check has no next_check planned
bolinocroustibat Nov 22, 2024
cc69099
fix: fix select batch
bolinocroustibat Nov 22, 2024
d1590cb
tests: add tests
bolinocroustibat Nov 22, 2024
d58d4af
tests: improve tests
bolinocroustibat Nov 22, 2024
9e0f626
feat: check get_all and get_latest methods return next_check_at
bolinocroustibat Nov 22, 2024
e6c6eb0
Squashed commit of the following:
bolinocroustibat Nov 25, 2024
f88b458
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Nov 26, 2024
049e180
Squashed commit of the following:
bolinocroustibat Nov 27, 2024
5a914c6
fix: fix circular import issue
bolinocroustibat Nov 27, 2024
d56f2a5
fix: fix circular import issue
bolinocroustibat Dec 5, 2024
f9dc9fa
Revert "fix: fix circular import issue"
bolinocroustibat Dec 5, 2024
8bed2ae
Merge branch 'main' into unavailable-resources-management
bolinocroustibat Dec 5, 2024
a350e53
fix: fix tests
bolinocroustibat Dec 5, 2024
80f7f41
clean: remove SINCE from crawl monitor
bolinocroustibat Dec 6, 2024
2ebca63
docs: update README
bolinocroustibat Dec 6, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
- Use bump'X [#226](https://github.com/datagouv/hydra/pull/226)
- Get actual resource URL in case of 404 (change since last catalog load) [#225](https://github.com/datagouv/hydra/pull/225)
- Fix deadlocks errors when purging CSV tables by refactoring `purge_csv_tables` to use atomic transactions [#230](https://github.com/datagouv/hydra/pull/230)
- Improve timing of checks depending on changes since last check [#163](https://github.com/datagouv/hydra/pull/163)

## 2.0.5 (2024-11-08)

Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,7 @@ It will crawl (forever) the catalog according to the config set in `config.toml`

`BATCH_SIZE` URLs are queued at each loop run.

The crawler will start with URLs never checked and then proceed with URLs crawled before `SINCE` interval. It will then wait until something changes (catalog or time).
The crawler will start with URLs never checked and then proceed with URLs crawled before `CHECK_DELAYS` interval. It will then wait until something changes (catalog or time).

There's a by-domain backoff mecanism. The crawler will wait when, for a given domain in a given batch, `BACKOFF_NB_REQ` is exceeded in a period of `BACKOFF_PERIOD` seconds. It will retry until the backoff is lifted.

Expand Down
6 changes: 4 additions & 2 deletions tests/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -159,7 +159,7 @@ async def setup_catalog_with_resource_exception(setup_catalog):

@pytest.fixture
def produce_mock(mocker):
mocker.patch("udata_hydra.crawl.process_check_data.send", dummy())
mocker.patch("udata_hydra.crawl.preprocess_check_data.send", dummy())
mocker.patch("udata_hydra.analysis.resource.send", dummy())
mocker.patch("udata_hydra.analysis.csv.send", dummy())

Expand All @@ -168,7 +168,7 @@ def produce_mock(mocker):
def analysis_mock(mocker):
"""Disable analyse_resource while crawling"""
mocker.patch(
"udata_hydra.crawl.check_resources.analyse_resource",
"udata_hydra.analysis.resource.analyse_resource",
dummy({"error": None, "checksum": None, "filesize": None, "mime_type": None}),
)

Expand Down Expand Up @@ -218,6 +218,7 @@ async def _fake_check(
checksum=None,
resource_id=RESOURCE_ID,
detected_last_modified_at=None,
next_check_at=None,
parsing_table=False,
parquet_url=False,
domain="example.com",
Expand All @@ -234,6 +235,7 @@ async def _fake_check(
"error": error,
"checksum": checksum,
"detected_last_modified_at": detected_last_modified_at,
"next_check_at": next_check_at,
"parsing_table": hashlib.md5(url.encode("utf-8")).hexdigest()
if parsing_table
else None,
Expand Down
1 change: 1 addition & 0 deletions tests/test_api/test_api_checks.py
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,7 @@ async def test_get_latest_check(setup_catalog, client, query, fake_check, fake_r
"url": url,
"headers": {"x-do": "you"},
"timeout": False,
"next_check_at": None,
"dataset_id": DATASET_ID,
"status": 200,
"parsing_error": None,
Expand Down
61 changes: 40 additions & 21 deletions tests/test_crawl/test_crawl.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@
from udata_hydra.analysis.resource import analyse_resource
from udata_hydra.crawl import start_checks
from udata_hydra.crawl.check_resources import check_resource
from udata_hydra.crawl.process_check_data import get_content_type_from_header
from udata_hydra.crawl.preprocess_check_data import get_content_type_from_header
from udata_hydra.db.check import Check
from udata_hydra.db.resource import Resource

Expand Down Expand Up @@ -127,29 +127,46 @@ async def test_excluded_clause(setup_catalog, mocker, event_loop, rmock, produce
assert ("GET", URL(rurl)) not in rmock.requests


async def test_outdated_check(setup_catalog, rmock, fake_check, event_loop, produce_mock):
await fake_check(created_at=datetime.now() - timedelta(weeks=52))
rurl = RESOURCE_URL
rmock.head(rurl, status=200)
event_loop.run_until_complete(start_checks(iterations=1))
# url has been called because check is outdated
assert ("HEAD", URL(rurl)) in rmock.requests


async def test_not_outdated_check(
setup_catalog, rmock, fake_check, event_loop, mocker, produce_mock
@pytest.mark.parametrize(
"last_check_params",
[
# last_check, next_check_at, new_check_expected
(False, None, True),
(True, None, True),
(True, datetime.now() - timedelta(hours=1), True),
(True, datetime.now() + timedelta(hours=1), False),
],
)
async def test_next_check(
setup_catalog, db, rmock, fake_check, event_loop, produce_mock, last_check_params
):
mocker.patch("udata_hydra.config.SLEEP_BETWEEN_BATCHES", 0)
await fake_check()
last_check, next_check_at, new_check_expected = last_check_params
if last_check:
await fake_check(
created_at=datetime.now() - timedelta(hours=24), next_check_at=next_check_at
)
rurl = RESOURCE_URL
rmock.get(rurl, status=200)
event_loop.run_until_complete(start_checks(iterations=1))
# url has not been called because check is fresh
assert ("GET", URL(rurl)) not in rmock.requests
checks: list[Record] = await db.fetch(
f"SELECT * FROM checks WHERE url = '{rurl}' ORDER BY created_at DESC"
)
if new_check_expected:
assert ("HEAD", URL(rurl)) in rmock.requests
assert len(checks) == [1, 2][last_check]
assert checks[0]["url"] == rurl
# assert the next check datetime is very close to what's expected, let's say by 10 seconds
assert (
checks[0]["next_check_at"]
- (datetime.now(timezone.utc) + timedelta(hours=config.CHECK_DELAYS[0]))
).total_seconds() < 10
else:
assert ("HEAD", URL(rurl)) not in rmock.requests
assert len(checks) == [0, 1][last_check]


async def test_deleted_check(setup_catalog, rmock, fake_check, event_loop, produce_mock):
check = await fake_check(created_at=datetime.now() - timedelta(weeks=52))
check = await fake_check(created_at=datetime.now() - timedelta(hours=24))
# associate check with a resource
await Resource.update(resource_id=RESOURCE_ID, data={"last_check": check["id"]})
# delete check
Expand Down Expand Up @@ -199,7 +216,7 @@ async def test_analyse_resource(setup_catalog, mocker, fake_check):
mocker.patch("udata_hydra.config.WEBHOOK_ENABLED", False)

check = await fake_check()
await analyse_resource(check["id"], False)
await analyse_resource(check_id=check["id"], last_check=None)
result: Record | None = await Check.get_by_id(check["id"])

assert result["error"] is None
Expand All @@ -213,7 +230,7 @@ async def test_analyse_resource_send_udata(setup_catalog, mocker, rmock, fake_ch
rmock.put(udata_url, status=200, repeat=True)

check = await fake_check()
await analyse_resource(check["id"], True)
await analyse_resource(check_id=check["id"], last_check=None)

req = rmock.requests[("PUT", URL(udata_url))]
assert len(req) == 1
Expand All @@ -229,9 +246,11 @@ async def test_analyse_resource_send_udata_no_change(
rmock.put(udata_url, status=200, repeat=True)

# previous check with same checksum
await fake_check(checksum=hashlib.sha1(SIMPLE_CSV_CONTENT.encode("utf-8")).hexdigest())
last_check = await fake_check(
checksum=hashlib.sha1(SIMPLE_CSV_CONTENT.encode("utf-8")).hexdigest()
)
check = await fake_check()
await analyse_resource(check["id"], False)
await analyse_resource(check_id=check["id"], last_check=last_check)

# udata has not been called
assert ("PUT", URL(udata_url)) not in rmock.requests
Expand Down
33 changes: 24 additions & 9 deletions udata_hydra/analysis/resource.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@

from udata_hydra import config, context
from udata_hydra.analysis.csv import analyse_csv
from udata_hydra.crawl.calculate_next_check import calculate_next_check_date
from udata_hydra.db.check import Check
from udata_hydra.db.resource import Resource
from udata_hydra.db.resource_exception import ResourceException
Expand All @@ -32,7 +33,7 @@ class Change(Enum):


async def analyse_resource(
check_id: int, is_first_check: bool, force_analysis: bool = False
check_id: int, last_check: dict | None, force_analysis: bool = False
) -> None:
"""
Perform analysis on the resource designated by check_id:
Expand Down Expand Up @@ -105,11 +106,13 @@ async def analyse_resource(
)

if change_status == Change.HAS_CHANGED:
await store_last_modified_date(change_payload or {}, check_id)
await update_check_with_modification_and_next_dates(
change_payload or {}, check_id, last_check
)

analysis_results = {**dl_analysis, **(change_payload or {})}

if change_status == Change.HAS_CHANGED or is_first_check or force_analysis:
if change_status == Change.HAS_CHANGED or not last_check or force_analysis:
if is_tabular and tmp_file:
# Change status to TO_ANALYSE_CSV
await Resource.update(resource_id, data={"status": "TO_ANALYSE_CSV"})
Expand All @@ -132,14 +135,26 @@ async def analyse_resource(
await Resource.update(resource_id, data={"status": None})


async def store_last_modified_date(change_analysis: dict, check_id: int) -> None:
"""
Store last modified date in checks because it may be useful for later comparison
async def update_check_with_modification_and_next_dates(
change_analysis: dict, check_id: int, last_check: dict | None
) -> None:
"""Update check with last_modified date and next_check date if resource has changed

Args:
change_analysis: dict with optional "analysis:last-modified-at" key
check_id: the ID of the current check
last_check: the last check data, if any
"""
last_modified = change_analysis.get("analysis:last-modified-at")
last_modified: str | None = change_analysis.get("analysis:last-modified-at")
if last_modified:
last_modified = datetime.fromisoformat(last_modified)
await Check.update(check_id, {"detected_last_modified_at": last_modified})
last_modified_at: datetime = datetime.fromisoformat(last_modified)
next_check_at: datetime = calculate_next_check_date(
has_check_changed=True, last_check=last_check, last_modified_at=last_modified_at
)
await Check.update(
check_id,
{"detected_last_modified_at": last_modified_at, "next_check_at": next_check_at},
)


async def detect_resource_change_from_checksum(
Expand Down
10 changes: 6 additions & 4 deletions udata_hydra/config_default.toml
Original file line number Diff line number Diff line change
Expand Up @@ -38,14 +38,16 @@ BACKOFF_NB_REQ = 180
BACKOFF_PERIOD = 360 # in seconds
COOL_OFF_PERIOD = 86400 # 1 day to cool off when we've messed up


# crawl batch size, beware of open file limits
# check batch size, beware of open file limits
# ⚠️ do not exceed MAX_POOL_SIZE
BATCH_SIZE = 40
# crawl url if last check is older than
SINCE = "1w"

# check resource if last check is older than
CHECK_DELAYS = [12, 24, 168, 720] # in hours (1/2, 1, 7, 30 days)

# seconds to wait for between batches
SLEEP_BETWEEN_BATCHES = 60

# max download filesize in bytes (100 MB)
MAX_FILESIZE_ALLOWED.csv = 104857600
MAX_FILESIZE_ALLOWED.csvgz = 104857600
Expand Down
2 changes: 1 addition & 1 deletion udata_hydra/crawl/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ async def start_checks(iterations: int = -1) -> None:
"""
try:
context.monitor().init(
SINCE=config.SINCE,
CHECK_DELAYS=config.CHECK_DELAYS,
BATCH_SIZE=config.BATCH_SIZE,
BACKOFF_NB_REQ=config.BACKOFF_NB_REQ,
BACKOFF_PERIOD=config.BACKOFF_PERIOD,
Expand Down
44 changes: 44 additions & 0 deletions udata_hydra/crawl/calculate_next_check.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
from datetime import datetime, timedelta, timezone

from udata_hydra import config


def calculate_next_check_date(
has_check_changed: bool, last_check: dict | None, last_modified_at: datetime | None
) -> datetime:
"""Calculate the datetime of the next check, depending on the last check data.

Args:
has_check_changed: Whether the check has changed since the last check.
last_check: The last check data as a dict, or None if there is no last check.
last_modified_at: The last modification date of the ressource as analysed by the earliest change detection methods, or None if it could not be determined.

Returns:
The datetime of the next check.
"""

now: datetime = datetime.now(timezone.utc)

if not last_check or has_check_changed:
# No last check or check itself has changed, next check will happen in the earliest delay without looking at the ressource last modification date
next_check_at: datetime = now + timedelta(hours=config.CHECK_DELAYS[0])

else:
# Check has not changed since last check, we need to look at the ressource last modification date
if last_modified_at:
since_last_modif: timedelta = now - last_modified_at
else:
# If no last modification date, we use the last check date
since_last_modif: timedelta = now - last_check["created_at"]

if since_last_modif > timedelta(hours=config.CHECK_DELAYS[-1]):
# 1) Last check or last ressource modification happened change after the maximum delay, next check will be after the same maximum delay
next_check_at = now + timedelta(hours=config.CHECK_DELAYS[-1])
else:
# 2) Last check or last ressource modification happened before CHECK_DELAYS[i], next check will happen after CHECK_DELAYS[i]
for delay in config.CHECK_DELAYS:
if since_last_modif <= timedelta(hours=delay):
next_check_at = now + timedelta(hours=delay)
break

return next_check_at
20 changes: 11 additions & 9 deletions udata_hydra/crawl/check_resources.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,14 +8,13 @@
from asyncpg import Record

from udata_hydra import config, context
from udata_hydra.analysis.resource import analyse_resource
from udata_hydra.crawl.helpers import (
convert_headers,
fix_surrogates,
has_nice_head,
is_domain_backoff,
)
from udata_hydra.crawl.process_check_data import process_check_data
from udata_hydra.crawl.preprocess_check_data import preprocess_check_data
from udata_hydra.db.resource import Resource
from udata_hydra.utils import queue

Expand Down Expand Up @@ -65,6 +64,9 @@ async def check_resource(
) -> str:
log.debug(f"check {url}, sleep {sleep}, method {method}")

# Import here to avoid circular import issues
from udata_hydra.analysis.resource import analyse_resource

# Update resource status to CRAWLING_URL
await Resource.update(resource_id, data={"status": "CRAWLING_URL"})

Expand All @@ -77,7 +79,7 @@ async def check_resource(
if not domain:
log.warning(f"[warning] not netloc in url, skipping {url}")
# Process the check data. If it has changed, it will be sent to udata
await process_check_data(
await preprocess_check_data(
{
"resource_id": resource_id,
"url": url,
Expand Down Expand Up @@ -111,8 +113,8 @@ async def check_resource(
)
resp.raise_for_status()

# Process the check data. If it has changed, it will be sent to udata
check, is_first_check = await process_check_data(
# Preprocess the check data. If it has changed, it will be sent to udata
new_check, last_check = await preprocess_check_data(
{
"resource_id": resource_id,
"url": url,
Expand All @@ -130,8 +132,8 @@ async def check_resource(
# Enqueue the resource for analysis
queue.enqueue(
analyse_resource,
check["id"],
is_first_check,
new_check["id"],
last_check,
force_analysis,
_priority=worker_priority,
)
Expand All @@ -140,7 +142,7 @@ async def check_resource(

except asyncio.exceptions.TimeoutError:
# Process the check data. If it has changed, it will be sent to udata
await process_check_data(
await preprocess_check_data(
{
"resource_id": resource_id,
"url": url,
Expand Down Expand Up @@ -179,7 +181,7 @@ async def check_resource(

error = getattr(e, "message", None) or str(e)
# Process the check data. If it has changed, it will be sent to udata
await process_check_data(
await preprocess_check_data(
{
"resource_id": resource_id,
"url": url,
Expand Down
2 changes: 1 addition & 1 deletion udata_hydra/crawl/helpers.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
from datetime import datetime, timedelta, timezone
from typing import Any, Tuple
from typing import Any

from multidict import CIMultiDictProxy

Expand Down
Loading