
Stream returns AirbyteMessages #18572

Merged: 69 commits into master from alex/read_with_logs on Nov 8, 2022.

Note: this view shows changes from 3 of the 69 commits.

Commits (all 69 by girarda):
5f88aa6  method yielding airbytemessage (Oct 27, 2022)
0de8395  move to Stream (Oct 27, 2022)
e5acdcf  update abstract source (Oct 27, 2022)
8a63af7  reset (Oct 27, 2022)
d0dcb52  missing file (Oct 28, 2022)
adf07fa  add test docker image (Oct 31, 2022)
c68e01d  script to run acceptance tests with local cdk (Oct 31, 2022)
258990d  Update conftest to use a different image (Oct 31, 2022)
b0f80a7  extract to method (Oct 31, 2022)
68528b2  dont use a different image tag (Oct 31, 2022)
b31f9bd  Always install local cdk (Oct 31, 2022)
893beab  break the cdk (Oct 31, 2022)
c3964c9  get path from current working directory (Oct 31, 2022)
c69774a  or (Oct 31, 2022)
152f5df  ignore unit test (Oct 31, 2022)
8ea5e85  debug log (Oct 31, 2022)
1684823  Revert "AMI change: ami-0f23be2f917510c26 -> ami-005924fb76f7477ce (#… (Oct 31, 2022)
b08a9c4  build from the top (Oct 31, 2022)
4fd9ed0  Update source-acceptance-test (Oct 31, 2022)
06b1c70  fix (Oct 31, 2022)
3e4b135  copy setup (Oct 31, 2022)
4702f4d  some work on the gradle plugin (Oct 31, 2022)
ae7b7fb  reset to master (Nov 1, 2022)
3d86bce  delete unused file (Nov 1, 2022)
9dc4486  delete unused file (Nov 1, 2022)
d3852f3  reset to master (Nov 1, 2022)
179d689  optional argument (Nov 1, 2022)
65c416c  delete dead code (Nov 1, 2022)
b13fe7a  use latest cdk with sendgrid (Nov 2, 2022)
0fff056  Merge branch 'master' into alex/read_with_logs (Nov 2, 2022)
01dcd2a  fix sendgrid dockerfile (Nov 2, 2022)
5d46bb0  Merge branch 'alex/always_install_local_cdk' into alex/read_with_logs (Nov 2, 2022)
600c195  break the cdk (Nov 2, 2022)
0b48fea  use local file (Nov 2, 2022)
613e876  Revert "break the cdk" (Nov 2, 2022)
ea2089d  dont raise an exception (Nov 3, 2022)
e2d11b3  reset to master (Nov 3, 2022)
d901e1c  unit tests (Nov 3, 2022)
0600e59  missing test (Nov 3, 2022)
82e46c5  more unit tests (Nov 3, 2022)
50bb3ed  Merge branch 'master' into alex/read_with_logs (Nov 3, 2022)
acfcec8  newline (Nov 3, 2022)
3fa6a00  reset to master (Nov 3, 2022)
048508d  Merge branch 'master' into alex/read_with_logs (Nov 3, 2022)
3e85dca  Merge branch 'master' into alex/read_with_logs (Nov 3, 2022)
6d9d0aa  remove files (Nov 3, 2022)
464b247  reset (Nov 3, 2022)
4acd199  Merge branch 'master' into alex/read_with_logs (Nov 4, 2022)
0c36dcb  Update abstract source (Nov 4, 2022)
46a807e  remove method from stream (Nov 4, 2022)
bad305b  convert to airbytemessage (Nov 5, 2022)
d78628f  unittests (Nov 5, 2022)
7fd5cc9  Update (Nov 5, 2022)
0cbd28e  unit test (Nov 5, 2022)
a1a139e  remove debug logs (Nov 5, 2022)
b1d62cd  Revert "remove debug logs" (Nov 5, 2022)
1da2ed1  Revert "Revert "remove debug logs"" (Nov 5, 2022)
5dac7c2  Revert "reset to master" (Nov 5, 2022)
594ad06  Revert "Revert "reset to master"" (Nov 5, 2022)
2d91e9a  reset to master (Nov 5, 2022)
cff30a1  reset to master (Nov 5, 2022)
2f91b80  test (Nov 5, 2022)
3fc697c  Revert "test" (Nov 5, 2022)
62d95eb  test (Nov 5, 2022)
003b860  Revert "test" (Nov 5, 2022)
27150ba  test (Nov 5, 2022)
26c3289  Revert "test" (Nov 5, 2022)
b36842a  format (Nov 5, 2022)
fe3b01a  Merge branch 'master' into alex/read_with_logs (Nov 8, 2022)
72 changes: 24 additions & 48 deletions airbyte-cdk/python/airbyte_cdk/sources/abstract_source.py
@@ -4,15 +4,12 @@

 import logging
 from abc import ABC, abstractmethod
-from datetime import datetime
-from functools import lru_cache
 from typing import Any, Dict, Iterator, List, Mapping, MutableMapping, Optional, Tuple, Union

 from airbyte_cdk.models import (
     AirbyteCatalog,
     AirbyteConnectionStatus,
     AirbyteMessage,
-    AirbyteRecordMessage,
     AirbyteStateMessage,
     ConfiguredAirbyteCatalog,
     ConfiguredAirbyteStream,
@@ -25,7 +22,6 @@
 from airbyte_cdk.sources.streams import Stream
 from airbyte_cdk.sources.streams.http.http import HttpStream
 from airbyte_cdk.sources.utils.schema_helpers import InternalConfig, split_config
-from airbyte_cdk.sources.utils.transform import TypeTransformer
 from airbyte_cdk.utils.event_timing import create_timer
 from airbyte_cdk.utils.traced_exception import AirbyteTracedException

@@ -235,26 +231,28 @@ def _read_incremental(
             for _slice in slices:
                 has_slices = True
                 logger.debug("Processing stream slice", extra={"slice": _slice})
-                records = stream_instance.read_records(
+                records = stream_instance.read_records_as_messages(
                     sync_mode=SyncMode.incremental,
                     stream_slice=_slice,
                     stream_state=stream_state,
                     cursor_field=configured_stream.cursor_field or None,
                 )
-                for record_counter, record_data in enumerate(records, start=1):
-                    yield self._as_airbyte_record(stream_name, record_data)
-                    stream_state = stream_instance.get_updated_state(stream_state, record_data)
-                    checkpoint_interval = stream_instance.state_checkpoint_interval
-                    if checkpoint_interval and record_counter % checkpoint_interval == 0:
-                        yield self._checkpoint_state(stream_instance, stream_state, state_manager)
-
-                    total_records_counter += 1
-                    # This functionality should ideally live outside of this method
-                    # but since state is managed inside this method, we keep track
-                    # of it here.
-                    if self._limit_reached(internal_config, total_records_counter):
-                        # Break from slice loop to save state and exit from _read_incremental function.
-                        break
+                for record_counter, message in enumerate(records, start=1):
+                    yield message
+                    if message.type == MessageType.RECORD:
+                        record = message.record
+                        stream_state = stream_instance.get_updated_state(stream_state, record.data)
+                        checkpoint_interval = stream_instance.state_checkpoint_interval
+                        if checkpoint_interval and record_counter % checkpoint_interval == 0:
+                            yield self._checkpoint_state(stream_instance, stream_state, state_manager)
+
+                        total_records_counter += 1
+                        # This functionality should ideally live outside of this method
+                        # but since state is managed inside this method, we keep track
+                        # of it here.
+                        if self._limit_reached(internal_config, total_records_counter):
+                            # Break from slice loop to save state and exit from _read_incremental function.
+                            break

                 yield self._checkpoint_state(stream_instance, stream_state, state_manager)
                 if self._limit_reached(internal_config, total_records_counter):

[Comment] girarda (Contributor, Author), Oct 27, 2022, on the get_updated_state line: same logic moved inside the if block because we only want to update the state and record counter on records.
@@ -277,16 +275,17 @@ def _read_full_refresh(
         total_records_counter = 0
         for _slice in slices:
             logger.debug("Processing stream slice", extra={"slice": _slice})
-            records = stream_instance.read_records(
+            messages = stream_instance.read_records_as_messages(
                 stream_slice=_slice,
                 sync_mode=SyncMode.full_refresh,
                 cursor_field=configured_stream.cursor_field,
             )
-            for record in records:
-                yield self._as_airbyte_record(configured_stream.stream.name, record)
-                total_records_counter += 1
-                if self._limit_reached(internal_config, total_records_counter):
-                    return
+            for message in messages:
+                yield message
+                if message.type == MessageType.RECORD:
+                    total_records_counter += 1
+                    if self._limit_reached(internal_config, total_records_counter):
+                        return

     def _checkpoint_state(self, stream: Stream, stream_state, state_manager: ConnectorStateManager):
         # First attempt to retrieve the current state using the stream's state property. We receive an AttributeError if the state
@@ -298,29 +297,6 @@ def _checkpoint_state(self, stream: Stream, stream_state, state_manager: ConnectorStateManager):
         state_manager.update_state_for_stream(stream.name, stream.namespace, stream_state)
         return state_manager.create_state_message(stream.name, stream.namespace, send_per_stream_state=self.per_stream_state_enabled)

-    @lru_cache(maxsize=None)
-    def _get_stream_transformer_and_schema(self, stream_name: str) -> Tuple[TypeTransformer, Mapping[str, Any]]:
-        """
-        Lookup stream's transform object and jsonschema based on stream name.
-        This function would be called a lot so using caching to save on costly
-        get_json_schema operation.
-        :param stream_name name of stream from catalog.
-        :return tuple with stream transformer object and discover json schema.
-        """
-        stream_instance = self._stream_to_instance_map[stream_name]
-        return stream_instance.transformer, stream_instance.get_json_schema()
-
-    def _as_airbyte_record(self, stream_name: str, data: Mapping[str, Any]):
-        now_millis = int(datetime.now().timestamp() * 1000)
-        transformer, schema = self._get_stream_transformer_and_schema(stream_name)
-        # Transform object fields according to config. Most likely you will
-        # need it to normalize values against json schema. By default no action
-        # taken unless configured. See
-        # docs/connector-development/cdk-python/schemas.md for details.
-        transformer.transform(data, schema)  # type: ignore
-        message = AirbyteRecordMessage(stream=stream_name, data=data, emitted_at=now_millis)
-        return AirbyteMessage(type=MessageType.RECORD, record=message)

     @staticmethod
     def _apply_log_level_to_stream_logger(logger: logging.Logger, stream_instance: Stream):
         """

[Comment] girarda (Contributor, Author), on _as_airbyte_record: extracted to a function so it can be reused by the SimpleRetriever.
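girarda's note above is the crux of the change: the read loops now see a mix of message types, and only RECORD messages may advance state or count toward the record limit. A tiny standalone illustration of that invariant (not code from the PR, just the idea it protects):

from airbyte_cdk.models import AirbyteLogMessage, AirbyteMessage, AirbyteRecordMessage, Level, Type

# A stream may now interleave non-record messages (e.g. logs) with records.
messages = [
    AirbyteMessage(type=Type.LOG, log=AirbyteLogMessage(level=Level.INFO, message="fetched page 1")),
    AirbyteMessage(type=Type.RECORD, record=AirbyteRecordMessage(stream="s", data={"id": 1}, emitted_at=0)),
]

# Everything is yielded downstream, but only RECORD messages are counted,
# so log or trace traffic can never trigger a checkpoint or the record limit.
record_count = sum(1 for m in messages if m.type == Type.RECORD)
assert record_count == 1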
20 changes: 19 additions & 1 deletion airbyte-cdk/python/airbyte_cdk/sources/streams/core.py
@@ -6,10 +6,14 @@
 import inspect
 import logging
 from abc import ABC, abstractmethod
+from functools import lru_cache
 from typing import Any, Iterable, List, Mapping, MutableMapping, Optional, Union

 import airbyte_cdk.sources.utils.casing as casing
-from airbyte_cdk.models import AirbyteStream, SyncMode
+from airbyte_cdk.models import AirbyteMessage, AirbyteStream, SyncMode
+
+# list of all possible HTTP methods which can be used for sending of request bodies
+from airbyte_cdk.sources.utils.record_helper import data_to_airbyte_record
 from airbyte_cdk.sources.utils.schema_helpers import ResourceSchemaLoader
 from airbyte_cdk.sources.utils.transform import TransformConfig, TypeTransformer
 from deprecated.classic import deprecated
@@ -87,6 +91,19 @@ def get_error_display_message(self, exception: BaseException) -> Optional[str]:
         """
         return None

+    def read_records_as_messages(
+        self,
+        sync_mode: SyncMode,
+        cursor_field: List[str] = None,
+        stream_slice: Mapping[str, Any] = None,
+        stream_state: Mapping[str, Any] = None,
+    ) -> Iterable[AirbyteMessage]:
+        """ """
+        for record_mapping in self.read_records(
+            sync_mode=sync_mode, cursor_field=cursor_field, stream_slice=stream_slice, stream_state=stream_state
+        ):
+            yield data_to_airbyte_record(self.name, record_mapping, self.transformer, self.get_json_schema())
+

[Comment] Contributor, on read_records_as_messages: Something that gives me slight pause is that this new function is defined to make it seem like read_records_as_messages could be overwritten as part of the public Stream interface. But I can't see a scenario where someone should want to do that, since this record remapping is more of an internal implementation detail. If we were to pull it back into abstract_source.py then we can at least mark it private as _read_records_as_messages(), so developers know not to override it. Or alternatively we can mark it private here, but that's also awkward because we invoke it from outside abstract_source. So we're marking it private but using it publicly, which smells bad.

[Comment] Contributor: It's a bit smelly that I can override read_records and read_records_as_messages, but only the latter ever gets called. Are we just saying that a developer can return either Mapping or AirbyteMessage from read_records? Should we just change the type of that method to be a union, and move this logic into AbstractSource so that it wraps any Mappings into AirbyteMessages?

[Comment] girarda (Contributor, Author): @brianjlai the Stream interface has to return the message because we'll want to return more types of AirbyteMessages (like logs and trace messages). @sherifnada: in Java-world, I would change the scope of read_records to protected so it can be overridden by subclasses, but it shouldn't be part of the interface (the source only calls read_records_as_messages now). "Are we just saying that a developer can return either Mapping or AirbyteMessage from read_records?" No, read_records always returns a Mapping. The mapping gets converted to an AirbyteRecord on line 105.
     @abstractmethod
     def read_records(
         self,
@@ -99,6 +116,7 @@ def read_records(
         This method should be overridden by subclasses to read records based on the inputs
         """

+    @lru_cache(maxsize=None)
     def get_json_schema(self) -> Mapping[str, Any]:
         """
         :return: A dict of the JSON schema representing this stream.

[Comment] girarda (Contributor, Author), on read_records: the method name is a little misleading now, because it doesn't only return records anymore.
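The helper data_to_airbyte_record is imported from airbyte_cdk.sources.utils.record_helper, but its body is not part of this diff. A minimal sketch of what it presumably does, reconstructed from the _as_airbyte_record logic deleted from abstract_source.py above (the signature matches the call site; the body is an assumption):

from datetime import datetime
from typing import Any, Mapping

from airbyte_cdk.models import AirbyteMessage, AirbyteRecordMessage, Type
from airbyte_cdk.sources.utils.transform import TypeTransformer


def data_to_airbyte_record(
    stream_name: str, data: Mapping[str, Any], transformer: TypeTransformer, schema: Mapping[str, Any]
) -> AirbyteMessage:
    # Transform object fields according to config; no action is taken unless the
    # transformer was configured (see docs/connector-development/cdk-python/schemas.md).
    transformer.transform(data, schema)  # type: ignore
    now_millis = int(datetime.now().timestamp() * 1000)
    record = AirbyteRecordMessage(stream=stream_name, data=data, emitted_at=now_millis)
    return AirbyteMessage(type=Type.RECORD, record=record)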
airbyte-cdk/python/airbyte_cdk/sources/streams/http/http.py
@@ -17,11 +17,11 @@
 from requests.auth import AuthBase
 from requests_cache.session import CachedSession

-# list of all possible HTTP methods which can be used for sending of request bodies
 from .auth.core import HttpAuthenticator, NoAuth
 from .exceptions import DefaultBackoffException, RequestBodyException, UserDefinedBackoffException
 from .rate_limiting import default_backoff_handler, user_defined_backoff_handler

+# list of all possible HTTP methods which can be used for sending of request bodies
 BODY_REQUEST_METHODS = ("GET", "POST", "PUT", "PATCH")
4 changes: 3 additions & 1 deletion airbyte-cdk/python/airbyte_cdk/sources/utils/__init__.py
@@ -1,5 +1,7 @@
 #
-# Copyright (c) 2021 Airbyte, Inc., all rights reserved.
+# Copyright (c) 2022 Airbyte, Inc., all rights reserved.
 #

 # Initialize Utils Package
+
+__all__ = ["record_helper"]
34 changes: 33 additions & 1 deletion airbyte-cdk/python/unit_tests/sources/streams/http/test_http.py
@@ -3,20 +3,28 @@
 #


+import datetime
 import json
+import unittest
 from http import HTTPStatus
 from typing import Any, Iterable, Mapping, Optional
 from unittest.mock import ANY, MagicMock, patch

 import pytest
 import requests
-from airbyte_cdk.models import SyncMode
+from airbyte_cdk.models import AirbyteMessage, AirbyteRecordMessage, SyncMode, Type
 from airbyte_cdk.sources.streams.http import HttpStream, HttpSubStream
 from airbyte_cdk.sources.streams.http.auth import NoAuth
 from airbyte_cdk.sources.streams.http.auth import TokenAuthenticator as HttpTokenAuthenticator
 from airbyte_cdk.sources.streams.http.exceptions import DefaultBackoffException, RequestBodyException, UserDefinedBackoffException
 from airbyte_cdk.sources.streams.http.requests_native_auth import TokenAuthenticator

+datetime_format = "%Y-%m-%dT%H:%M:%S.%f%z"
+FAKE_NOW = datetime.datetime(2022, 1, 1, tzinfo=datetime.timezone.utc)
+
+config = {"start_date": "2021-01-01T00:00:00.000000+0000", "start_date_ymd": "2021-01-01"}
+timezone = datetime.timezone.utc
+

 class StubBasicReadHttpStream(HttpStream):
     url_base = "https://test_base_url.com"
@@ -37,6 +45,16 @@ def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping]:
         self.resp_counter += 1
         yield stubResp

+    def get_json_schema(self) -> Mapping[str, Any]:
+        return {}
+
+
+@pytest.fixture()
+def mock_datetime_now(monkeypatch):
+    datetime_mock = unittest.mock.MagicMock(wraps=datetime.datetime)
+    datetime_mock.now.return_value = FAKE_NOW
+    monkeypatch.setattr(datetime, "datetime", datetime_mock)
+

 def test_default_authenticator():
     stream = StubBasicReadHttpStream()
@@ -79,6 +97,20 @@ def test_stub_basic_read_http_stream_read_records(mocker):
     assert [{"data": 1}] == records


+def test_stub_basic_read_http_stream_read_records_as_messages(mocker, mock_datetime_now):
+    stream = StubBasicReadHttpStream()
+    blank_response = {}  # Sending a blank response is fine, as we ignore the response in `parse_response` anyway.
+    mocker.patch.object(StubBasicReadHttpStream, "_send_request", return_value=blank_response)
+
+    records = list(stream.read_records_as_messages(SyncMode.full_refresh))
+
+    assert [
+        AirbyteMessage(
+            type=Type.RECORD, record=AirbyteRecordMessage(stream="stub_basic_read_http_stream", data={"data": 1}, emitted_at=1640995200000)
+        )
+    ] == records
+
+
 class StubNextPageTokenHttpStream(StubBasicReadHttpStream):
     current_page = 0


[Comment] sherifnada (Contributor), Nov 4, 2022, on the _send_request patch: I feel like mocking private methods is a smell. I know it didn't originate in this PR, but it feels like it's telling us we need a dependency injection of the requester component. (mostly meaning to start a discussion in this comment, not asking that it's fixed in this PR)
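A rough sketch of the dependency-injection idea sherifnada floats above, purely hypothetical and not part of this PR or the CDK: the stream would receive a "requester" collaborator, and tests would substitute a fake instead of patching the private _send_request.

from typing import Any, Mapping, Protocol

import requests


class Requester(Protocol):  # hypothetical interface, not in the CDK
    def send(self, request: requests.PreparedRequest, request_kwargs: Mapping[str, Any]) -> requests.Response:
        ...


class FakeRequester:
    """Test double: returns a canned response with no network I/O."""

    def __init__(self, response: requests.Response):
        self._response = response

    def send(self, request: requests.PreparedRequest, request_kwargs: Mapping[str, Any]) -> requests.Response:
        return self._response

A stream built as SomeHttpStream(requester=FakeRequester(canned_response)) could then be exercised in tests without touching private methods.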
airbyte-cdk/python/unit_tests/sources/test_abstract_source.py
@@ -118,7 +118,7 @@
     def name(self):
         return self._name

-    def read_records(self, **kwargs) -> Iterable[Mapping[str, Any]]:  # type: ignore
+    def read_records(self, *awgs, **kwargs) -> Iterable[Mapping[str, Any]]:  # type: ignore
         # Remove None values
         kwargs = {k: v for k, v in kwargs.items() if v is not None}
         if self._inputs_and_mocked_outputs:
4 changes: 2 additions & 2 deletions airbyte-cdk/python/unit_tests/sources/test_source.py
@@ -425,8 +425,8 @@ def test_source_config_no_transform(abstract_source, catalog):
     records = [r for r in abstract_source.read(logger=logger_mock, config={}, catalog=catalog, state={})]
     assert len(records) == 2 * 5
     assert [r.record.data for r in records] == [{"value": 23}] * 2 * 5
-    assert http_stream.get_json_schema.call_count == 1
-    assert non_http_stream.get_json_schema.call_count == 1
+    assert http_stream.get_json_schema.call_count == 5
+    assert non_http_stream.get_json_schema.call_count == 5


 def test_source_config_transform(abstract_source, catalog):

[Comment] girarda (Contributor, Author): this assert verified that we used the cached transformer and schema instead of calling get_json_schema. We now call get_json_schema more often, but the value is cached, so the behavior is effectively the same.

[Comment] Contributor: makes sense! but side question: could we add a test into core.py that calls get_json_schema() and asserts that some part of ResourceSchemaLoader(package_name_from_class(self.__class__)).get_schema(self.name) only gets invoked once? given that it's cached, we'd only expect self.__class__ or the underlying get_schema() to get called once.

[Comment] girarda (Contributor, Author): done: 0cbd28e
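For reference, a minimal sketch of the kind of caching test added in 0cbd28e, assuming pytest-mock and patching ResourceSchemaLoader.get_schema where core.py imports it; the stream class and test names here are illustrative, not the PR's actual test:

from airbyte_cdk.sources.streams.core import Stream


class _SchemaStream(Stream):  # hypothetical minimal concrete Stream for the test
    primary_key = None

    def read_records(self, *args, **kwargs):
        yield from []


def test_get_json_schema_is_cached(mocker):
    mocked = mocker.patch("airbyte_cdk.sources.streams.core.ResourceSchemaLoader.get_schema", return_value={})
    stream = _SchemaStream()
    for _ in range(5):
        stream.get_json_schema()
    # @lru_cache on get_json_schema means the underlying loader runs only once
    assert mocked.call_count == 1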