Connector builder: support for test read with message grouping per slices #23925

Merged · 81 commits · Mar 16, 2023
Commits (81)
f829ac5
New connector_builder module for handling requests from the Connector…
clnoll Mar 8, 2023
13a9a14
Automated Commit - Formatting Changes
clnoll Mar 8, 2023
1c85330
Rename ConnectorBuilderSource to ConnectorBuilderHandler
clnoll Mar 9, 2023
0dccb4e
Update source_declarative_manifest README
clnoll Mar 9, 2023
f7a475a
Reorganize
clnoll Mar 9, 2023
bd71e91
read records
girarda Mar 9, 2023
c6ac119
paste unit tests from connector builder server
girarda Mar 9, 2023
f949521
compiles but tests fail
girarda Mar 9, 2023
45741cd
first test passes
girarda Mar 9, 2023
0e7d6f4
Second test passes
girarda Mar 9, 2023
81ff6e9
3rd test passes
girarda Mar 10, 2023
14aa8ca
one more test
girarda Mar 10, 2023
b3764ba
another test
girarda Mar 10, 2023
c6040cc
one more test
girarda Mar 10, 2023
5f0ead1
test
girarda Mar 10, 2023
1614bad
return StreamRead
girarda Mar 10, 2023
b323578
test
girarda Mar 10, 2023
9902dd5
test
girarda Mar 10, 2023
3b3255e
rename
girarda Mar 10, 2023
6242e3a
test
girarda Mar 10, 2023
c8631f1
test
girarda Mar 10, 2023
fe1da29
test
girarda Mar 10, 2023
5b0750c
main seems to work
girarda Mar 10, 2023
7dd5ada
Update
girarda Mar 10, 2023
2f048e9
Update
girarda Mar 10, 2023
dd778ed
Update
girarda Mar 10, 2023
e79adf8
Update
girarda Mar 10, 2023
ab009a6
update
girarda Mar 10, 2023
71f94c1
error message
girarda Mar 10, 2023
c4d8b84
rename
girarda Mar 10, 2023
4c26009
update
girarda Mar 10, 2023
31425f0
Update
girarda Mar 10, 2023
a7911f5
CR improvements
clnoll Mar 9, 2023
11f35cb
merge
girarda Mar 10, 2023
59fea95
fix test_source_declarative_manifest
girarda Mar 10, 2023
b190176
fix tests
girarda Mar 10, 2023
8c51bb1
Update
girarda Mar 10, 2023
ad3a9c4
Update
girarda Mar 10, 2023
e1e2598
Update
girarda Mar 10, 2023
639accd
Update
girarda Mar 11, 2023
1fbbc8f
rename
girarda Mar 11, 2023
bf0175d
rename
girarda Mar 11, 2023
1e6b3d2
rename
girarda Mar 11, 2023
032c44c
format
girarda Mar 13, 2023
aea625e
Give connector_builder its own main.py
clnoll Mar 13, 2023
70a9052
merge
girarda Mar 13, 2023
0b15013
Update
girarda Mar 13, 2023
5ad0fba
reset
girarda Mar 13, 2023
f782328
delete dead code
girarda Mar 13, 2023
1e9c159
remove debug print
girarda Mar 13, 2023
691e957
update test
girarda Mar 13, 2023
2280924
Update
girarda Mar 13, 2023
754c61c
set right stream
girarda Mar 13, 2023
d38a760
Add --catalog argument
clnoll Mar 13, 2023
d64cf5d
Remove unneeded preparse
clnoll Mar 14, 2023
8732639
Update README
clnoll Mar 14, 2023
a32898a
merge
girarda Mar 14, 2023
0616250
handle error
girarda Mar 14, 2023
fa491f7
tests pass
girarda Mar 14, 2023
1e04904
more explicit test
girarda Mar 14, 2023
aa2839c
reset
girarda Mar 14, 2023
5d3163f
format
girarda Mar 14, 2023
fc9c28c
merge master
girarda Mar 14, 2023
9cd9105
fix merge
girarda Mar 14, 2023
e37851e
raise exception
girarda Mar 14, 2023
0b809f9
fix
girarda Mar 14, 2023
33022c9
black format
girarda Mar 14, 2023
cccb968
raise with config
girarda Mar 14, 2023
a79b43b
update
girarda Mar 14, 2023
8ca4d64
Merge branch 'master' into alex/test_read
girarda Mar 14, 2023
4ace71b
fix flake
girarda Mar 14, 2023
e74c040
Merge branch 'alex/test_read' of github.com:airbytehq/airbyte into al…
girarda Mar 14, 2023
fa198be
Merge branch 'master' into alex/test_read
girarda Mar 14, 2023
455e65a
__test_read_config is optional
girarda Mar 15, 2023
4059751
Merge branch 'master' into alex/test_read
girarda Mar 15, 2023
f78637c
merge
girarda Mar 15, 2023
36152a1
fix
girarda Mar 15, 2023
b308f04
Automated Commit - Formatting Changes
girarda Mar 15, 2023
40550b3
fix
girarda Mar 15, 2023
ae66445
Merge branch 'alex/test_read' of github.com:airbytehq/airbyte into al…
girarda Mar 15, 2023
b1beeb3
exclude_unset
girarda Mar 15, 2023
5 changes: 3 additions & 2 deletions airbyte-cdk/python/airbyte_cdk/sources/source.py
Original file line number Diff line number Diff line change
Expand Up @@ -85,5 +85,6 @@ def _emit_legacy_state_format(self, state_obj) -> Union[List[AirbyteStateMessage
return []

# can be overridden to change an input catalog
def read_catalog(self, catalog_path: str) -> ConfiguredAirbyteCatalog:
return ConfiguredAirbyteCatalog.parse_obj(self._read_json_file(catalog_path))
@classmethod
Review comment (PR author): make this a classmethod so it can be used before creating the source

def read_catalog(cls, catalog_path: str) -> ConfiguredAirbyteCatalog:
return ConfiguredAirbyteCatalog.parse_obj(cls._read_json_file(catalog_path))
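The point of the classmethod change can be sketched in isolation. `DemoSource` below is an illustrative stand-in, not CDK code — the real method wraps the parsed dict in `ConfiguredAirbyteCatalog.parse_obj`, which is omitted here to keep the sketch self-contained:

```python
import json
import os
import tempfile


class DemoSource:
    """Stand-in for the pattern in the diff: read_catalog is a classmethod,
    so the catalog can be parsed before any source instance exists."""

    @classmethod
    def _read_json_file(cls, path):
        with open(path) as f:
            return json.load(f)

    @classmethod
    def read_catalog(cls, catalog_path):
        # In the real CDK this returns ConfiguredAirbyteCatalog.parse_obj(...).
        return cls._read_json_file(catalog_path)


# Parse the catalog first, then decide how to construct the source.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"streams": [{"stream": {"name": "users"}}]}, f)
    path = f.name
catalog = DemoSource.read_catalog(path)  # no instance needed
os.unlink(path)
```

This matters in the connector builder flow because the catalog is read during argument parsing, before the manifest config needed to build the source has been validated.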
32 changes: 29 additions & 3 deletions airbyte-cdk/python/connector_builder/connector_builder_handler.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,19 +2,45 @@
# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
#

import dataclasses
from datetime import datetime
from typing import Any, Mapping

from airbyte_cdk.models import AirbyteMessage, AirbyteRecordMessage, Type
from airbyte_cdk.models import AirbyteMessage, AirbyteRecordMessage, ConfiguredAirbyteCatalog
from airbyte_cdk.models import Type
from airbyte_cdk.models import Type as MessageType
from airbyte_cdk.sources.declarative.declarative_source import DeclarativeSource
from airbyte_cdk.sources.declarative.manifest_declarative_source import ManifestDeclarativeSource
from airbyte_cdk.utils.traced_exception import AirbyteTracedException
from connector_builder.message_grouper import MessageGrouper


def list_streams() -> AirbyteMessage:
Review comment (PR author): this should return an AirbyteMessage containing an AirbyteRecord

raise NotImplementedError


def stream_read() -> AirbyteMessage:
raise NotImplementedError
DEFAULT_MAXIMUM_NUMBER_OF_PAGES_PER_SLICE = 5
DEFAULT_MAXIMUM_NUMBER_OF_SLICES = 5
DEFAULT_MAX_RECORDS = 100


def read_stream(source: DeclarativeSource, config: Mapping[str, Any], configured_catalog: ConfiguredAirbyteCatalog) -> AirbyteMessage:
try:
command_config = config.get("__test_read_config", {})
max_pages_per_slice = command_config.get("max_pages_per_slice", DEFAULT_MAXIMUM_NUMBER_OF_PAGES_PER_SLICE)
max_slices = command_config.get("max_slices", DEFAULT_MAXIMUM_NUMBER_OF_SLICES)
max_records = command_config.get("max_records", DEFAULT_MAX_RECORDS)
handler = MessageGrouper(max_pages_per_slice, max_slices)
stream_name = configured_catalog.streams[0].stream.name # The connector builder only supports a single stream
stream_read = handler.get_message_groups(source, config, configured_catalog, max_records)
return AirbyteMessage(type=MessageType.RECORD, record=AirbyteRecordMessage(
data=dataclasses.asdict(stream_read),
stream=stream_name,
emitted_at=_emitted_at()
))
except Exception as exc:
error = AirbyteTracedException.from_exception(exc, message=f"Error reading stream with config={config} and catalog={configured_catalog}")
return error.as_airbyte_message()


def resolve_manifest(source: ManifestDeclarativeSource) -> AirbyteMessage:
Expand Down
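How `read_stream` derives its limits from the optional `__test_read_config` block can be shown standalone. The constants mirror the defaults in the diff; `resolve_limits` is a hypothetical helper extracted for illustration, not part of the CDK:

```python
# Defaults copied from connector_builder_handler.py in this PR.
DEFAULT_MAXIMUM_NUMBER_OF_PAGES_PER_SLICE = 5
DEFAULT_MAXIMUM_NUMBER_OF_SLICES = 5
DEFAULT_MAX_RECORDS = 100


def resolve_limits(config):
    # "__test_read_config" is optional; every key inside it is optional too.
    command_config = config.get("__test_read_config", {})
    return (
        command_config.get("max_pages_per_slice", DEFAULT_MAXIMUM_NUMBER_OF_PAGES_PER_SLICE),
        command_config.get("max_slices", DEFAULT_MAXIMUM_NUMBER_OF_SLICES),
        command_config.get("max_records", DEFAULT_MAX_RECORDS),
    )


print(resolve_limits({}))  # → (5, 5, 100)
print(resolve_limits({"__test_read_config": {"max_slices": 2}}))  # → (5, 2, 100)
```

Leaving `__test_read_config` out entirely is valid (see the later commit "__test_read_config is optional"), so every lookup falls back to a default.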
30 changes: 18 additions & 12 deletions airbyte-cdk/python/connector_builder/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,47 +4,53 @@


import sys
from typing import Any, List, Mapping
from typing import Any, List, Mapping, Tuple

from airbyte_cdk.connector import BaseConnector
from airbyte_cdk.entrypoint import AirbyteEntrypoint
from airbyte_cdk.models import ConfiguredAirbyteCatalog
from airbyte_cdk.sources.declarative.manifest_declarative_source import ManifestDeclarativeSource
from airbyte_cdk.utils.traced_exception import AirbyteTracedException
from connector_builder.connector_builder_handler import resolve_manifest
from connector_builder.connector_builder_handler import read_stream, resolve_manifest


def create_source(config: Mapping[str, Any]) -> ManifestDeclarativeSource:
manifest = config.get("__injected_declarative_manifest")
return ManifestDeclarativeSource(manifest)
return ManifestDeclarativeSource(manifest, True)
Review comment (PR author): set debug to True so the source returns raw requests and responses



def get_config_from_args(args: List[str]) -> Mapping[str, Any]:
def get_config_and_catalog_from_args(args: List[str]) -> Tuple[Mapping[str, Any], ConfiguredAirbyteCatalog]:
parsed_args = AirbyteEntrypoint.parse_args(args)
config_path, catalog_path = parsed_args.config, parsed_args.catalog
if parsed_args.command != "read":
raise ValueError("Only read commands are allowed for Connector Builder requests.")

config = BaseConnector.read_config(parsed_args.config)
config = BaseConnector.read_config(config_path)
catalog = ConfiguredAirbyteCatalog.parse_obj(BaseConnector.read_config(catalog_path))

if "__injected_declarative_manifest" not in config:
raise ValueError(
f"Invalid config: `__injected_declarative_manifest` should be provided at the root of the config but config only has keys {list(config.keys())}"
)

return config
return config, catalog


def handle_connector_builder_request(source: ManifestDeclarativeSource, config: Mapping[str, Any]):
def handle_connector_builder_request(source: ManifestDeclarativeSource, config: Mapping[str, Any], catalog: ConfiguredAirbyteCatalog):
command = config.get("__command")
if command == "resolve_manifest":
return resolve_manifest(source)
raise ValueError(f"Unrecognized command {command}.")
elif command == "test_read":
return read_stream(source, config, catalog)
else:
raise ValueError(f"Unrecognized command {command}.")


def handle_request(args: List[str]):
config = get_config_from_args(args)
source = create_source(config)
config, catalog = get_config_and_catalog_from_args(args)
if "__command" in config:
return handle_connector_builder_request(source, config).json()
source = create_source(config)
return handle_connector_builder_request(source, config, catalog).json(exclude_unset=True)
else:
raise ValueError("Missing __command argument in config file.")

Expand All @@ -55,4 +61,4 @@ def handle_request(args: List[str]):
except Exception as exc:
error = AirbyteTracedException.from_exception(exc, message="Error handling request.")
m = error.as_airbyte_message()
print(error.as_airbyte_message().json())
print(error.as_airbyte_message().json(exclude_unset=True))
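The `__command` dispatch added to `handle_connector_builder_request` reduces to a small pattern. The handlers here are stubs standing in for `resolve_manifest` and `read_stream`; only the control flow is taken from the diff:

```python
def dispatch(config):
    # "__command" selects the connector-builder operation; anything else raises,
    # mirroring the error paths in handle_connector_builder_request.
    command = config.get("__command")
    if command == "resolve_manifest":
        return "resolve_manifest result"  # stub for resolve_manifest(source)
    elif command == "test_read":
        return "test_read result"  # stub for read_stream(source, config, catalog)
    else:
        raise ValueError(f"Unrecognized command {command}.")
```

A missing `__command` hits the `else` branch with `command = None`, which matches the behavior callers see when the key is absent from the config.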
190 changes: 190 additions & 0 deletions airbyte-cdk/python/connector_builder/message_grouper.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,190 @@
#
# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
#

import json
import logging
from copy import deepcopy
from json import JSONDecodeError
from typing import Any, Iterable, Iterator, Mapping, Optional, Union
from urllib.parse import parse_qs, urlparse

from airbyte_cdk.models import AirbyteLogMessage, AirbyteMessage, Type
from airbyte_cdk.sources.declarative.declarative_source import DeclarativeSource
from airbyte_cdk.utils.schema_inferrer import SchemaInferrer
from airbyte_protocol.models.airbyte_protocol import ConfiguredAirbyteCatalog
from connector_builder.models import HttpRequest, HttpResponse, StreamRead, StreamReadPages, StreamReadSlices


class MessageGrouper:

logger = logging.getLogger("airbyte.connector-builder")

def __init__(self, max_pages_per_slice: int, max_slices: int, max_record_limit: int = 1000):
Review comment (reviewer): nit (also I know this may just be copy/pasted): Would it be preferable to enforce the maximums in this class, instead of allowing them to be fully configurable?
Reply (PR author): the main reason why I didn't enforce maximums here is that we'll need to update the value in two places if we decide to increase it. I think the risk of not enforcing a limit here is negligible because we control the caller

self._max_pages_per_slice = max_pages_per_slice
self._max_slices = max_slices
self._max_record_limit = max_record_limit

def get_message_groups(self,
source: DeclarativeSource,
config: Mapping[str, Any],
configured_catalog: ConfiguredAirbyteCatalog,
record_limit: Optional[int] = None,
) -> StreamRead:
if record_limit is not None and not (1 <= record_limit <= 1000):
raise ValueError(f"Record limit must be between 1 and 1000. Got {record_limit}")
Review comment (PR author): moved from a validator because dataclasses don't have builtin validators

schema_inferrer = SchemaInferrer()

if record_limit is None:
record_limit = self._max_record_limit
else:
record_limit = min(record_limit, self._max_record_limit)

slices = []
log_messages = []
state = {} # No support for incremental sync
for message_group in self._get_message_groups(
source.read(self.logger, config, configured_catalog, state),
schema_inferrer,
record_limit,
):
if isinstance(message_group, AirbyteLogMessage):
log_messages.append({"message": message_group.message})
else:
slices.append(message_group)

return StreamRead(
logs=log_messages,
slices=slices,
test_read_limit_reached=self._has_reached_limit(slices),
inferred_schema=schema_inferrer.get_stream_schema(configured_catalog.streams[0].stream.name) # The connector builder currently only supports reading from a single stream at a time
)

def _get_message_groups(
self, messages: Iterator[AirbyteMessage], schema_inferrer: SchemaInferrer, limit: int
) -> Iterable[Union[StreamReadPages, AirbyteLogMessage]]:
"""
Message groups are partitioned according to when request log messages are received. Subsequent response log messages
and record messages belong to the prior request log message; when another request is encountered, the current
message group is closed and appended. This continues until <limit> records have been read.

Messages received from the CDK read operation will always arrive in the following order:
{type: LOG, log: {message: "request: ..."}}
{type: LOG, log: {message: "response: ..."}}
... 0 or more record messages
{type: RECORD, record: {data: ...}}
{type: RECORD, record: {data: ...}}
Repeats for each request/response made

Note: The exception is that normal log messages can be received at any time; these are not incorporated into the grouping
"""
records_count = 0
at_least_one_page_in_group = False
current_page_records = []
current_slice_pages = []
current_page_request: Optional[HttpRequest] = None
current_page_response: Optional[HttpResponse] = None

while records_count < limit and (message := next(messages, None)):
if self._need_to_close_page(at_least_one_page_in_group, message):
self._close_page(current_page_request, current_page_response, current_slice_pages, current_page_records)
current_page_request = None
current_page_response = None

if at_least_one_page_in_group and message.type == Type.LOG and message.log.message.startswith("slice:"):
yield StreamReadSlices(pages=current_slice_pages)
current_slice_pages = []
at_least_one_page_in_group = False
elif message.type == Type.LOG and message.log.message.startswith("request:"):
if not at_least_one_page_in_group:
at_least_one_page_in_group = True
current_page_request = self._create_request_from_log_message(message.log)
elif message.type == Type.LOG and message.log.message.startswith("response:"):
current_page_response = self._create_response_from_log_message(message.log)
elif message.type == Type.LOG:
yield message.log
elif message.type == Type.RECORD:
current_page_records.append(message.record.data)
records_count += 1
schema_inferrer.accumulate(message.record)
else:
self._close_page(current_page_request, current_page_response, current_slice_pages, current_page_records)
yield StreamReadSlices(pages=current_slice_pages)

@staticmethod
def _need_to_close_page(at_least_one_page_in_group, message):
return (
at_least_one_page_in_group
and message.type == Type.LOG
and (message.log.message.startswith("request:") or message.log.message.startswith("slice:"))
)

@staticmethod
def _close_page(current_page_request, current_page_response, current_slice_pages, current_page_records):
if not current_page_request or not current_page_response:
raise ValueError("Every message grouping should have at least one request and response")

current_slice_pages.append(
StreamReadPages(request=current_page_request, response=current_page_response, records=deepcopy(current_page_records))
)
current_page_records.clear()

def _create_request_from_log_message(self, log_message: AirbyteLogMessage) -> Optional[HttpRequest]:
# TODO: As a temporary stopgap, the CDK emits request data as a log message string. Ideally this should come in the
# form of a custom message object defined in the Airbyte protocol, but this unblocks us in the immediate while the
# protocol change is worked on.
raw_request = log_message.message.partition("request:")[2]
try:
request = json.loads(raw_request)
url = urlparse(request.get("url", ""))
full_path = f"{url.scheme}://{url.hostname}{url.path}" if url else ""
parameters = parse_qs(url.query) or None
return HttpRequest(
url=full_path,
http_method=request.get("http_method", ""),
headers=request.get("headers"),
parameters=parameters,
body=request.get("body"),
)
except JSONDecodeError as error:
self.logger.warning(f"Failed to parse log message into request object with error: {error}")
return None

def _create_response_from_log_message(self, log_message: AirbyteLogMessage) -> Optional[HttpResponse]:
# TODO: As a temporary stopgap, the CDK emits response data as a log message string. Ideally this should come in the
# form of a custom message object defined in the Airbyte protocol, but this unblocks us in the immediate while the
# protocol change is worked on.
raw_response = log_message.message.partition("response:")[2]
try:
response = json.loads(raw_response)
body = response.get("body", "{}")
return HttpResponse(status=response.get("status_code"), body=body, headers=response.get("headers"))
except JSONDecodeError as error:
self.logger.warning(f"Failed to parse log message into response object with error: {error}")
return None

def _has_reached_limit(self, slices):
if len(slices) >= self._max_slices:
return True

for slice in slices:
if len(slice.pages) >= self._max_pages_per_slice:
return True
return False

@classmethod
def _create_configure_catalog(cls, stream_name: str) -> ConfiguredAirbyteCatalog:
Review comment (PR author): creating the catalog on the Java side would allow us to keep the same read interface as connectors

return ConfiguredAirbyteCatalog.parse_obj(
{
"streams": [
{
"stream": {
"name": stream_name,
"json_schema": {},
"supported_sync_modes": ["full_refresh", "incremental"],
},
"sync_mode": "full_refresh",
"destination_sync_mode": "overwrite",
}
]
}
)
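The grouping algorithm in `_get_message_groups` can be distilled into a standalone sketch. Messages are plain dicts here rather than `AirbyteMessage` objects, response logs are ignored, and pages are reduced to lists of records — only the "request:/slice: log lines close the current page/slice" mechanic is kept:

```python
def group_messages(messages):
    """Partition a message stream into slices of pages of records.

    A "request:" or "slice:" log line closes the page in progress; a
    "slice:" line additionally closes the slice in progress. Records
    accumulate onto the current page.
    """
    slices, pages, records = [], [], []
    saw_request = False
    for m in messages:
        text = m.get("log", "")
        # Close the current page when a new request or slice begins.
        if m["type"] == "LOG" and saw_request and text.startswith(("request:", "slice:")):
            pages.append(list(records))
            records.clear()
        if m["type"] == "LOG" and saw_request and text.startswith("slice:"):
            slices.append(pages)  # close the current slice
            pages = []
            saw_request = False
        elif m["type"] == "LOG" and text.startswith("request:"):
            saw_request = True
        elif m["type"] == "RECORD":
            records.append(m["data"])
    if saw_request:
        pages.append(list(records))
    slices.append(pages)  # the trailing slice is always emitted
    return slices
```

The real implementation additionally pairs each page with its parsed `HttpRequest`/`HttpResponse`, yields ordinary log messages through, and stops once the record limit is reached.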