[Bug] Cannot create datasets in projects with model names containing multibyte characters #1197

Closed
1 task done
t0momi219 opened this issue Sep 7, 2024 · 1 comment · Fixed by #1198
Labels
area:datasets Related to the Airflow datasets feature/module bug Something isn't working execution:virtualenv Related to Virtualenv execution environment profile:snowflake Related to Snowflake ProfileConfig triage-needed Items need to be reviewed / assigned to milestone

Comments

@t0momi219
Contributor

t0momi219 commented Sep 7, 2024

Astronomer Cosmos Version

main (development)

If "Other Astronomer Cosmos version" selected, which one?

No response

dbt-core version

1.8

Versions of dbt adapters

No response

LoadMode

AUTOMATIC

ExecutionMode

VIRTUALENV

InvocationMode

None

airflow version

2.10.0

Operating System

Debian GNU/Linux 12 (official airflow docker image)

If you think it's a UI issue, what browsers are you seeing the problem on?

No response

Deployment

Docker-Compose

Deployment details

No response

What happened?

In projects containing models with names like the following, dataset creation fails, and an error occurs during execution.

└── dbt
    └── my_project
        ├── dbt_packages
        ├── dbt_project.yml
        └── models
            ├── 日本語名モデル.sql
            └── 日本語名モデル.yml

The error message is as follows:

[2024-09-07T04:10:58.435+0000] {local.py:455} DEBUG - URIs to be converted to Dataset: ['snowflake://***.ap-northeast-1.aws/TEST_DB.DBT.日本語名モデル']
[2024-09-07T04:10:58.437+0000] {taskinstance.py:3301} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/cosmos/operators/local.py", line 459, in get_datasets
    datasets = [Dataset(uri) for uri in uris]
                ^^^^^^^^^^^^
  File "<attrs generated init airflow.datasets.Dataset>", line 3, in __init__
    _setattr('uri', __attr_converter_uri(uri))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/datasets/__init__.py", line 78, in _sanitize_uri
    raise ValueError("Dataset URI must only consist of ASCII characters")
ValueError: Dataset URI must only consist of ASCII characters

Relevant log output

No response

How to reproduce

Add a model whose name contains multibyte characters.

Example: 日本語名モデル.sql

Anything else :)?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Contact Details

No response

@t0momi219 t0momi219 added bug Something isn't working triage-needed Items need to be reviewed / assigned to milestone labels Sep 7, 2024

dosubot bot commented Sep 7, 2024

@t0momi219 Hello! I'm here to assist you with any issues you encounter. I can help you solve bugs, answer questions, and even guide you on how to become a contributor. Let me know how I can assist you with the bug you've encountered.

The error occurs because the Dataset class in airflow.datasets rejects any URI that contains non-ASCII characters. Cosmos builds each dataset URI from the model name, so a model name with multibyte characters produces a URI that fails this check.
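The rejection can be illustrated in isolation. The following is a minimal sketch, not Airflow's actual code: `check_ascii_uri` is a hypothetical stand-in that assumes the validation amounts to an ASCII-only test and raises the same message shown in the traceback, and `example-account` is a placeholder for the masked account name in the log.

```python
def check_ascii_uri(uri: str) -> str:
    """Hypothetical stand-in for the ASCII check; not Airflow's implementation."""
    if not uri.isascii():
        raise ValueError("Dataset URI must only consist of ASCII characters")
    return uri


# "example-account" is a placeholder for the masked account in the log.
namespace = "snowflake://example-account.ap-northeast-1.aws"

# An ASCII-only model name passes the check unchanged.
ok_uri = check_ascii_uri(namespace + "/TEST_DB.DBT.my_model")

# A model name with multibyte characters is rejected.
try:
    check_ascii_uri(namespace + "/TEST_DB.DBT.日本語名モデル")
    rejected = False
except ValueError:
    rejected = True

print(ok_uri)
print(rejected)
```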

The relevant part of the code in cosmos.operators.local that leads to the creation of the Dataset instances is:

def get_datasets(self, source: Literal["inputs", "outputs"]) -> list[Dataset]:
    uris = []
    for completed in self.openlineage_events_completes:
        for output in getattr(completed, source):
            dataset_uri = output.namespace + "/" + output.name
            uris.append(dataset_uri)
    self.log.debug("URIs to be converted to Dataset: %s", uris)

    datasets = []
    try:
        datasets = [Dataset(uri) for uri in uris]
    except ValueError:
        raise AirflowCompatibilityError(
            """
            Apache Airflow 2.9.0 & 2.9.1 introduced a breaking change in Dataset URIs, to be fixed in newer versions:
            https://github.com/apache/airflow/issues/39486

            If you want to use Cosmos with one of these Airflow versions, you will have to disable emission of Datasets:
            By setting ``emit_datasets=False`` in ``RenderConfig``. For more information, see https://astronomer.github.io/astronomer-cosmos/configuration/render-config.html.
            """
        )
    return datasets

To address this issue, you have a few options:

  1. Disable Emission of Datasets: As suggested in the AirflowCompatibilityError message in the code above, you can disable the emission of datasets by setting emit_datasets=False in RenderConfig. This will bypass the creation of Dataset instances and avoid the ASCII character restriction.

  2. Modify URI Generation: You could modify the URI generation process to ensure that only ASCII characters are used. This might involve transliterating or encoding the multibyte characters into a valid ASCII representation.

  3. Patch Airflow: If feasible, you could patch the Dataset class in Airflow to support multibyte characters, though this would be a more complex and less maintainable solution.

For now, the quickest solution would be to disable the emission of datasets if that fits your use case. If you need further assistance or have any questions about implementing these changes, feel free to ask!
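Option 2 could be sketched with percent-encoding from the Python standard library. This is an illustration under stated assumptions, not the change that was merged into Cosmos: `build_dataset_uri` is a hypothetical helper, and `example-account` is a placeholder for the masked account name.

```python
from urllib.parse import quote, unquote


def build_dataset_uri(namespace: str, name: str) -> str:
    """Hypothetical helper: join namespace and name, then percent-encode."""
    raw_uri = namespace + "/" + name
    # quote() UTF-8 encodes each non-ASCII character and writes the bytes as
    # %XX escapes; ":" and "/" are kept so the scheme and path stay readable.
    return quote(raw_uri, safe=":/")


uri = build_dataset_uri(
    "snowflake://example-account.ap-northeast-1.aws",  # placeholder account
    "TEST_DB.DBT.日本語名モデル",
)
print(uri.isascii())  # whether the encoded URI is ASCII-only
print(unquote(uri))   # decoding gives back the unencoded URI
```

Percent-encoding is reversible, so the original model name can be recovered with `unquote`, whereas transliteration would discard information.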


@dosubot dosubot bot added area:datasets Related to the Airflow datasets feature/module execution:virtualenv Related to Virtualenv execution environment profile:snowflake Related to Snowflake ProfileConfig labels Sep 7, 2024
pankajkoti pushed a commit that referenced this issue Sep 26, 2024
In projects containing models with names like the following, dataset
creation fails, and an error occurs during execution.
```txt
└── dbt
    └── my_project
        └── models
            ├── 日本語名モデル.sql
            └── 日本語名モデル.yml

```

```
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/datasets/__init__.py", line 78, in _sanitize_uri
    raise ValueError("Dataset URI must only consist of ASCII characters")
ValueError: Dataset URI must only consist of ASCII characters
```

To support model names with multibyte characters, it might be good to
URL encode the names.

closes: #1197

Co-authored-by: Tatiana Al-Chueyr <tatiana.alchueyr@gmail.com>
slords pushed a commit to slords/astronomer-cosmos that referenced this issue Sep 26, 2024
…1198)