✨ Source GCS: Add Gzip and Bzip compression support (#36373)
Co-authored-by: Artem Inzhyyants <artem.inzhyyants@gmail.com>
tolik0 and artem1205 authored Mar 26, 2024
1 parent 441bc77 commit 7382c87
Showing 13 changed files with 2,449 additions and 188 deletions.
90 changes: 41 additions & 49 deletions airbyte-integrations/connectors/source-gcs/README.md
@@ -1,68 +1,55 @@
# Gcs Source
# Gcs source connector


This is the repository for the Gcs source connector, written in Python.
For information about how to use this connector within Airbyte, see [the documentation](https://docs.airbyte.com/integrations/sources/gcs).

## Local development

### Prerequisites
**To iterate on this connector, make sure to complete this prerequisites section.**

#### Minimum Python version required `= 3.9.0`
* Python (~=3.9)
* Poetry (~=1.7) - installation instructions [here](https://python-poetry.org/docs/#installation)

#### Build & Activate Virtual Environment and install dependencies
From this connector directory, create a virtual environment:
```
python -m venv .venv
```

This will generate a virtualenv for this module in `.venv/`. Make sure this venv is active in your
development environment of choice. To activate it from the terminal, run:
```
source .venv/bin/activate
pip install -r requirements.txt
### Installing the connector
From this connector directory, run:
```bash
poetry install --with dev
```
If you are in an IDE, follow your IDE's instructions to activate the virtualenv.

Note that while we are installing dependencies from `requirements.txt`, you should only edit `setup.py` for your dependencies. `requirements.txt` is
used for editable installs (`pip install -e`) to pull in Python dependencies from the monorepo and will call `setup.py`.
If this is mumbo jumbo to you, don't worry about it, just put your deps in `setup.py` but install using `pip install -r requirements.txt` and everything
should work as you expect.

#### Create credentials
### Create credentials
**If you are a community contributor**, follow the instructions in the [documentation](https://docs.airbyte.com/integrations/sources/gcs)
to generate the necessary credentials. Then create a file `secrets/config.json` conforming to the `source_gcs/spec.yaml` file.
Note that the `secrets` directory is gitignored by default, so there is no danger of accidentally checking in sensitive information.
See `integration_tests/sample_config.json` for a sample config file.
Note that any directory named `secrets` is gitignored across the entire Airbyte repo, so there is no danger of accidentally checking in sensitive information.
See `sample_files/sample_config.json` for a sample config file.

**If you are an Airbyte core member**, copy the credentials in Lastpass under the secret name `source gcs test creds`
and place them into `secrets/config.json`.

### Locally running the connector
```
python main.py spec
python main.py check --config secrets/config.json
python main.py discover --config secrets/config.json
python main.py read --config secrets/config.json --catalog integration_tests/configured_catalog.json
poetry run source-gcs spec
poetry run source-gcs check --config secrets/config.json
poetry run source-gcs discover --config secrets/config.json
poetry run source-gcs read --config secrets/config.json --catalog sample_files/configured_catalog.json
```

### Locally running the connector docker image

### Running unit tests
To run unit tests locally, from the connector directory run:
```
poetry run pytest unit_tests
```

#### Build
**Via [`airbyte-ci`](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/pipelines/README.md) (recommended):**
### Building the docker image
1. Install [`airbyte-ci`](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/pipelines/README.md)
2. Run the following command to build the docker image:
```bash
airbyte-ci connectors --name=source-gcs build
```

An image will be built with the tag `airbyte/source-gcs:dev`.
An image will be available on your host with the tag `airbyte/source-gcs:dev`.

**Via `docker build`:**
```bash
docker build -t airbyte/source-gcs:dev .
```

#### Run
### Running as a docker container
Then run any of the connector commands as follows:
```
docker run --rm airbyte/source-gcs:dev spec
@@ -71,29 +58,34 @@ docker run --rm -v $(pwd)/secrets:/secrets airbyte/source-gcs:dev discover --con
docker run --rm -v $(pwd)/secrets:/secrets -v $(pwd)/integration_tests:/integration_tests airbyte/source-gcs:dev read --config /secrets/config.json --catalog /integration_tests/configured_catalog.json
```

## Testing
### Running our CI test suite
You can run our full test suite locally using [`airbyte-ci`](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/pipelines/README.md):
```bash
airbyte-ci connectors --name=source-gcs test
```

### Customizing acceptance Tests
Customize `acceptance-test-config.yml` file to configure tests. See [Connector Acceptance Tests](https://docs.airbyte.com/connector-development/testing-connectors/connector-acceptance-tests-reference) for more information.
Customize `acceptance-test-config.yml` file to configure acceptance tests. See [Connector Acceptance Tests](https://docs.airbyte.com/connector-development/testing-connectors/connector-acceptance-tests-reference) for more information.
If your connector requires to create or destroy resources for use during acceptance tests create fixtures for it and place them inside integration_tests/acceptance.py.

## Dependency Management
All of your dependencies should go in `setup.py`, NOT `requirements.txt`. The requirements file is only used to connect internal Airbyte dependencies in the monorepo for local development.
We split dependencies between two groups, dependencies that are:
* required for your connector to work need to go to `MAIN_REQUIREMENTS` list.
* required for the testing need to go to `TEST_REQUIREMENTS` list
### Dependency Management
All of your dependencies should be managed via Poetry.
To add a new dependency, run:
```bash
poetry add <package-name>
```

Please commit the changes to `pyproject.toml` and `poetry.lock` files.

### Publishing a new version of the connector
## Publishing a new version of the connector
You've checked out the repo, implemented a million dollar feature, and you're ready to share your changes with the world. Now what?
1. Make sure your changes are passing our test suite: `airbyte-ci connectors --name=source-gcs test`
2. Bump the connector version in `metadata.yaml`: increment the `dockerImageTag` value. Please follow [semantic versioning for connectors](https://docs.airbyte.com/contributing-to-airbyte/resources/pull-requests-handbook/#semantic-versioning-for-connectors).
2. Bump the connector version (please follow [semantic versioning for connectors](https://docs.airbyte.com/contributing-to-airbyte/resources/pull-requests-handbook/#semantic-versioning-for-connectors)):
- bump the `dockerImageTag` value in `metadata.yaml`
- bump the `version` value in `pyproject.toml`
3. Make sure the `metadata.yaml` content is up to date.
4. Make the connector documentation and its changelog is up to date (`docs/integrations/sources/gcs.md`).
4. Make sure the connector documentation and its changelog is up to date (`docs/integrations/sources/gcs.md`).
5. Create a Pull Request: use [our PR naming conventions](https://docs.airbyte.com/contributing-to-airbyte/resources/pull-requests-handbook/#pull-request-title-convention).
6. Pat yourself on the back for being an awesome contributor.
7. Someone from Airbyte will take a look at your PR and iterate with you to merge it into master.

8. Once your PR is merged, the new version of the connector will be automatically published to Docker Hub and our connector registry.
@@ -11,8 +11,6 @@ acceptance_tests:
tests:
- config_path: "secrets/config.json"
status: succeed
- config_path: "secrets/old_config.json"
status: succeed
- config_path: "integration_tests/invalid_config.json"
status: exception
discovery:
@@ -21,11 +19,7 @@ acceptance_tests:
timeout_seconds: 2400
basic_read:
tests:
- config_path: "secrets/old_config.json"
configured_catalog_path: "integration_tests/configured_catalog.json"
expect_trace_message_on_failure: false
- config_path: "secrets/config.json"
configured_catalog_path: "integration_tests/configured_catalog.json"
expect_trace_message_on_failure: false
incremental:
tests:
@@ -26,5 +26,19 @@
"name": "example_2"
}
}
},
{
"type": "STREAM",
"stream": {
"stream_state": {
"_ab_source_file_last_modified": "2024-03-21T16:13:20.571000Z\"_https://storage.googleapis.com/airbyte-integration-test-source-gcs/test_folder/simple_test.csv.gz",
"history": {
"https://storage.googleapis.com/airbyte-integration-test-source-gcs/test_folder/simple_test.csv.gz": "2024-03-21T16:13:20.571000Z"
}
},
"stream_descriptor": {
"name": "example_gzip"
}
}
}
]
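
The `history` map in the state above pairs each file URI with the last-modified timestamp that was recorded for it. A minimal sketch of how such a map could drive an incremental decision, assuming a file is re-read only when it is unseen or newer than its recorded timestamp. `needs_sync` is a hypothetical helper for illustration, not the CDK's actual cursor logic:

```python
from datetime import datetime

# Timestamp layout used by the state entries shown above.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"

GZIP_URI = "https://storage.googleapis.com/airbyte-integration-test-source-gcs/test_folder/simple_test.csv.gz"


def needs_sync(history: dict[str, str], uri: str, last_modified: str) -> bool:
    """Return True if the file is unseen or newer than its recorded timestamp.

    Hypothetical helper: an assumption about the behaviour, not the CDK implementation.
    """
    recorded = history.get(uri)
    if recorded is None:
        return True  # the file never appeared in a previous sync
    seen_at = datetime.strptime(recorded, TIMESTAMP_FORMAT)
    modified_at = datetime.strptime(last_modified, TIMESTAMP_FORMAT)
    return modified_at > seen_at


history = {GZIP_URI: "2024-03-21T16:13:20.571000Z"}
unchanged = needs_sync(history, GZIP_URI, "2024-03-21T16:13:20.571000Z")
updated = needs_sync(history, GZIP_URI, "2024-03-22T08:00:00.000000Z")
```

With the state above, `unchanged` is `False` and `updated` is `True` under this assumption.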
@@ -17,6 +17,15 @@
},
"sync_mode": "incremental",
"destination_sync_mode": "overwrite"
},
{
"stream": {
"name": "example_gzip",
"json_schema": {},
"supported_sync_modes": ["full_refresh", "incremental"]
},
"sync_mode": "incremental",
"destination_sync_mode": "overwrite"
}
]
}
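
The `example_gzip` stream added above reads a `.csv.gz` object. As a rough illustration of what Gzip and Bzip support involves, the sketch below picks a decompressor from the file extension using only the Python standard library. The function name `open_maybe_compressed` and the extension-based dispatch are assumptions for illustration; the connector's real reader may detect compression differently:

```python
import bz2
import csv
import gzip
import io
from typing import IO


def open_maybe_compressed(raw: IO[bytes], name: str) -> IO[str]:
    """Wrap a binary stream in a text reader, decompressing by file extension.

    Hypothetical helper: the extension check is an assumption, not the connector's code.
    """
    if name.endswith(".gz"):
        return gzip.open(raw, mode="rt", encoding="utf-8", newline="")
    if name.endswith(".bz2"):
        return bz2.open(raw, mode="rt", encoding="utf-8", newline="")
    # Uncompressed files are only decoded as text.
    return io.TextIOWrapper(raw, encoding="utf-8", newline="")


# Usage: build a small gzip payload in memory and parse it as CSV.
payload = gzip.compress(b"id,name\n1,alice\n")
with open_maybe_compressed(io.BytesIO(payload), "simple_test.csv.gz") as handle:
    rows = list(csv.DictReader(handle))
```

Passing an already-open binary stream keeps the sketch independent of any storage client, so the same function works for a local file or a downloaded blob.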
@@ -35,9 +35,7 @@
"description": "The pattern used to specify which files should be selected from the file system. For more information on glob pattern matching look <a href=\"https://en.wikipedia.org/wiki/Glob_(programming)\">here</a>.",
"order": 1,
"type": "array",
"items": {
"type": "string"
}
"items": { "type": "string" }
},
"legacy_prefix": {
"title": "Legacy Prefix",
@@ -118,9 +116,7 @@
"description": "A set of case-sensitive strings that should be interpreted as null values. For example, if the value 'NA' should be interpreted as null, enter 'NA' in this field.",
"default": [],
"type": "array",
"items": {
"type": "string"
},
"items": { "type": "string" },
"uniqueItems": true
},
"strings_can_be_null": {
@@ -144,9 +140,7 @@
"header_definition": {
"title": "CSV Header Definition",
"description": "How headers will be defined. `User Provided` assumes the CSV does not have a header row and uses the headers provided and `Autogenerated` assumes the CSV does not have a header row and the CDK will generate headers using for `f{i}` where `i` is the index starting from 0. Else, the default behavior is to use the header from the CSV file. If a user wants to autogenerate or provide column names for a CSV having headers, they can skip rows.",
"default": {
"header_definition_type": "From CSV"
},
"default": { "header_definition_type": "From CSV" },
"oneOf": [
{
"title": "From CSV",
@@ -188,9 +182,7 @@
"title": "Column Names",
"description": "The column names that will be used while emitting the CSV records",
"type": "array",
"items": {
"type": "string"
}
"items": { "type": "string" }
}
},
"required": ["column_names", "header_definition_type"]
@@ -203,19 +195,15 @@
"description": "A set of case-sensitive strings that should be interpreted as true values.",
"default": ["y", "yes", "t", "true", "on", "1"],
"type": "array",
"items": {
"type": "string"
},
"items": { "type": "string" },
"uniqueItems": true
},
"false_values": {
"title": "False Values",
"description": "A set of case-sensitive strings that should be interpreted as false values.",
"default": ["n", "no", "f", "false", "off", "0"],
"type": "array",
"items": {
"type": "string"
},
"items": { "type": "string" },
"uniqueItems": true
},
"inference_type": {
@@ -224,6 +212,12 @@
"default": "None",
"airbyte_hidden": true,
"enum": ["None", "Primitive Types Only"]
},
"ignore_errors_on_fields_mismatch": {
"title": "Ignore errors on field mismatch",
"description": "Whether to ignore errors that occur when the number of fields in the CSV does not match the number of columns in the schema.",
"default": false,
"type": "boolean"
}
},
"required": ["filetype"]
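
The new `ignore_errors_on_fields_mismatch` option concerns rows whose field count differs from the number of columns. The sketch below illustrates one possible reading of the flag: with the default `false` a mismatched row raises an error, and with `true` the row is dropped. Skipping the row is an assumption made here for illustration, as is the helper name `read_rows`; the spec only says that the error is ignored:

```python
import csv
import io


def read_rows(text: str, ignore_errors_on_fields_mismatch: bool = False) -> list[dict[str, str]]:
    """Parse CSV text, checking each row's field count against the header.

    Hypothetical helper; what happens to a mismatched row is an assumption.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    records = []
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            if ignore_errors_on_fields_mismatch:
                continue  # assumption: the mismatched row is skipped
            raise ValueError(
                f"line {line_number}: expected {len(header)} fields, got {len(row)}"
            )
        records.append(dict(zip(header, row)))
    return records


sample = "a,b\n1,2\n3\n4,5\n"
tolerant = read_rows(sample, ignore_errors_on_fields_mismatch=True)
```

Calling `read_rows(sample)` with the default value raises `ValueError` for the short row on line 3.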
2 changes: 1 addition & 1 deletion airbyte-integrations/connectors/source-gcs/metadata.yaml
@@ -7,7 +7,7 @@ data:
connectorSubtype: file
connectorType: source
definitionId: 2a8c41ae-8c23-4be0-a73f-2ab10ca1a820
dockerImageTag: 0.3.7
dockerImageTag: 0.4.0
dockerRepository: airbyte/source-gcs
documentationUrl: https://docs.airbyte.com/integrations/sources/gcs
githubIssueLabel: source-gcs
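
The tag change from `0.3.7` to `0.4.0` increments the minor component and resets the patch component. A small sketch of that arithmetic, assuming plain `MAJOR.MINOR.PATCH` strings with no pre-release suffix; `bump_version` is a hypothetical helper and not part of `airbyte-ci`:

```python
def bump_version(version: str, part: str) -> str:
    """Return the next version string for a 'major', 'minor' or 'patch' bump.

    Hypothetical helper; assumes a plain MAJOR.MINOR.PATCH version.
    """
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


next_tag = bump_version("0.3.7", "minor")
```

Here `next_tag` is `"0.4.0"`, matching the new `dockerImageTag`. The README in this commit asks for the same value in the `version` field of `pyproject.toml`.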