
Port ErrorMonitor class from ruby connectors #2671

Open · wants to merge 19 commits into base: main

Conversation

@artem-shelkovnikov (Member) commented Jul 2, 2024

This PR ports the ErrorMonitor class from https://github.com/elastic/connectors-ruby/blob/main/lib/utility/error_monitor.rb.

The ErrorMonitor class adds transient error handling: errors are tracked and ignored until a certain threshold is reached.

For example, the default error monitor can be used from any class implementing BaseDataSource with the following code:

async def fetch_some_info(self, id):
    with self.with_error_monitoring():
        yield self.client.fetch_some_not_critical_entity(id)
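
If the call inside the block raises, the context manager logs and tracks the error and suppresses it; the exception only propagates once one of the thresholds described below is exceeded.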

An alternative way to use it:

try:
    entity = self.client.query(id=id)
    self.error_monitor.track_success()
except Exception as ex:
    self.error_monitor.track_error(ex)

What it will do:

  • Log the error
  • Ignore it, unless too many errors have already happened.

What does "too many errors" mean?

  1. Too many errors happened over the whole sync - the default limit is 1000
  2. Too many errors happened consecutively within with self.with_error_monitoring() blocks - the default limit is 10
  3. The error rate is too high over a window of recent with self.with_error_monitoring() blocks - the default rate is 15% (a sketch of all three checks follows this list)
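
For illustration, here is a minimal sketch of a monitor implementing these three checks (the window size of 100 and the TooManyErrors name are assumptions for the example, not necessarily the code being ported):

import collections

class TooManyErrors(Exception):
    pass

class ErrorMonitor:
    def __init__(self, max_total_errors=1000, max_consecutive_errors=10,
                 max_error_rate=0.15, window_size=100):
        self.max_total_errors = max_total_errors
        self.max_consecutive_errors = max_consecutive_errors
        self.max_error_rate = max_error_rate
        # Sliding window of recent outcomes: True = error, False = success
        self.window = collections.deque(maxlen=window_size)
        self.total_errors = 0
        self.consecutive_errors = 0
        self.last_error = None

    def track_success(self):
        self.consecutive_errors = 0
        self.window.append(False)

    def track_error(self, error):
        self.total_errors += 1
        self.consecutive_errors += 1
        self.last_error = error
        self.window.append(True)
        self._raise_if_necessary()

    def _raise_if_necessary(self):
        if self.total_errors > self.max_total_errors:
            raise TooManyErrors(f"Too many errors in total, last error: {self.last_error}")
        if self.consecutive_errors > self.max_consecutive_errors:
            raise TooManyErrors(f"Too many consecutive errors, last error: {self.last_error}")
        if len(self.window) == self.window.maxlen:
            rate = sum(self.window) / len(self.window)
            if rate > self.max_error_rate:
                raise TooManyErrors(f"Error rate is too high, last error: {self.last_error}")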

This PR already makes use of ErrorMonitor for generic framework operations.

A single error monitor is created and passed to the Connector, Sink, and Extractor. Sink and Extractor already track errors; connectors can track errors on a per-connector basis.

The Sink uses the error monitor to track errors while ingesting data: each bulk operation is checked, each document that fails to be ingested into Elasticsearch is tracked as an "error", and each successfully ingested document is tracked as a "success". In practice, with default settings this means that if 100 documents are sent to Elasticsearch and 16 of them fail, the sync will fail (a 16% error rate exceeds the 15% default).
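
Conceptually, the per-document tracking looks roughly like this (a sketch based on the Elasticsearch bulk response format, not the PR's exact sink code):

def track_bulk_response(error_monitor, response):
    # The bulk API reports the outcome of every operation in "items"
    for item in response["items"]:
        # Each item is keyed by its operation type: "index", "update" or "delete"
        result = next(iter(item.values()))
        if "error" in result:
            error_monitor.track_error(Exception(result["error"]))
        else:
            error_monitor.track_success()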

The same happens for downloads: if a download succeeds, it is tracked as a "success" in the Error Monitor; if it fails, it is a "failure", and the Error Monitor will transiently skip it unless too many errors have happened.

If a sync is terminated due to this error, the Kibana UI will show the reason (too many errors in total, too many consecutive errors, or error rate too high) along with the last error that happened.

So, as an example, if during a sync 200 documents fail to be downloaded and 801 documents fail to be ingested into Elasticsearch, the sync will fail: 1,001 errors in total exceeds the default overall limit of 1,000.

@seanstory previously approved these changes Jul 2, 2024
@seanstory (Member) left a comment:

Love it.
I'm fine to have the configuration be a follow-up. Just as long as it isn't used before it's configurable.

connectors/utils.py (review thread: outdated, resolved)
@seanstory (Member) left a comment:

A few little nits, but largely looks good!

config.yml.example (three review threads: outdated, resolved)
Comment on lines +484 to +485
except ForceCanceledError:
    raise
@seanstory (Member):

Should we also re-raise asyncio CancelledError?

@artem-shelkovnikov (Member, Author):

I'm not 100% sure it's needed. I'll double-check and add it if it is.
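
For reference, the re-raise pattern under discussion would look roughly like this (a sketch; do_some_work is a placeholder, and note that since Python 3.8 asyncio.CancelledError subclasses BaseException, so a broad except Exception does not swallow it anyway):

try:
    await do_some_work()  # placeholder for the monitored operation
except (ForceCanceledError, asyncio.CancelledError):
    # Cancellation must propagate so the sync can be terminated promptly
    raise
except Exception as ex:
    error_monitor.track_error(ex)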

connectors/es/sink.py (two review threads, resolved)
@@ -767,12 +780,14 @@ async def download_and_extract_file(
doc = await self.handle_file_content_extraction(
doc, source_filename, temp_filename
)
self.error_monitor.track_success()
@seanstory (Member):

Won't this end up double-counting, since sink.py also uses the monitor for the lazy downloads?

@artem-shelkovnikov (Member, Author):

True, will fix!

@artem-shelkovnikov (Member, Author):

I don't really know what to do with that; we have several potential points of failure/success:

Extraction:

  • File was successfully downloaded
  • File was not successfully downloaded

Ingestion:

  • File was successfully ingested
  • File was not successfully ingested

It's possible to have all the permutations!

For example, the file was not downloaded successfully, but we still chose to upload its metadata to Elasticsearch and everything worked out - is that a failure?

I'll open this up on Slack.

Comment on lines 590 to 596
def with_error_monitoring(self):
try:
yield
self.error_monitor.track_success()
except Exception as ex:
self._logger.error(ex)
self.error_monitor.track_error(ex)
@seanstory (Member):

Something we had in ent-search was the concept of "fatal exceptions". See: https://github.com/elastic/ent-search/blob/a1ddc884ad1543f61a91da7b9511a01090adcac5/connectors/lib/connectors/content_sources/base/extractor.rb#L132-L139

It might be worth adding that here too, for things like asyncio CancelledError, OOM errors, etc., to make sure we fail fast when necessary.

@artem-shelkovnikov (Member, Author):

Yeah, good call. OOME can't get caught, but stuff like CancelledError plus having a custom fatal exception class should help.
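
As an illustration of that idea, the context manager could grow a fail-fast clause along these lines (a sketch; FatalError is a hypothetical custom class, not code from this PR):

import asyncio
from contextlib import contextmanager

class FatalError(Exception):
    """Errors that must never be treated as transient."""

@contextmanager
def with_error_monitoring(self):
    try:
        yield
        self.error_monitor.track_success()
    except (FatalError, asyncio.CancelledError):
        # Fail fast: never count these towards the error thresholds
        raise
    except Exception as ex:
        self._logger.error(ex)
        self.error_monitor.track_error(ex)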

@artem-shelkovnikov (Member, Author):

Okay, since we don't have code that uses this, I'd address it later once we actually use it somewhere; for now it doesn't make any difference.

I'd also like to separately add handlers for terminal exceptions - for example, our download function and some others catch Exception, which can catch and "handle" cancellation errors in a weird way.

@seanstory (Member):

Looks like this may need a rebase - I think you've pulled in other commits you may not have meant to.

@seanstory (Member):

Related: #1582 (maybe even "fixes"?)
