
Setting remote_target_path and remote_target_path_conn_id, but no files land in S3 after running project dag #1259

Closed
johnmcochran opened this issue Oct 17, 2024 · 12 comments

@johnmcochran
Contributor

I'm trying to push data test failures to S3. To do this, I'm using the documentation here (https://astronomer.github.io/astronomer-cosmos/configuration/cosmos-conf.html#remote-target-path) to set up the target directory pointing to my S3 bucket.

The Cosmos documentation gives limited guidance on how to get this up and running beyond creating the environment variables. What am I supposed to do after adding them? When I run my dbt project DAG, no files are dropped into S3.

I successfully used the S3 connection for deploying docs, so I know that part is good, and I am using cosmos 1.7.0. I verified that I have a failing data test in my project.

bug?

Thank you,
John


dosubot bot commented Oct 17, 2024

Hey there, @johnmcochran! I'm here to help you with any bugs, questions, or contributions. Let's tackle this issue together and get it resolved!

To troubleshoot the issue of files not being transferred to S3 after setting remote_target_path and remote_target_path_conn_id, here are a few steps you can take:

  1. Verify Environment Variables: Ensure that the environment variable AIRFLOW__COSMOS__REMOTE_TARGET_PATH is correctly set to your S3 bucket path (e.g., s3://your_bucket/target_dir/). Also, make sure that remote_target_path_conn_id is set to the correct Airflow connection ID for your S3 connection (see the sketch after this list).

  2. Check DAG Configuration: Confirm that your DAG is configured to use these settings. The remote_target_path should be correctly referenced in your DAG code to ensure that dbt outputs are directed to the specified S3 path.

  3. Review Logs: Check the Airflow logs for any errors or warnings during the DAG run. This can provide insights into whether the dbt tasks are executing successfully and if there are any issues with the file transfer process.

  4. Permissions: Double-check that the S3 bucket permissions allow writing from your Airflow instance. Ensure that the IAM role or user associated with your S3 connection has the necessary permissions to write to the specified bucket and path.

  5. Test S3 Connection: Since you mentioned that the S3 connection works for deploying docs, it might be worth testing the connection specifically for this use case to ensure there are no configuration issues.
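For step 1, a minimal sketch of the two deployment-level settings (the bucket path and connection ID below are placeholders, not values taken from this thread):

export AIRFLOW__COSMOS__REMOTE_TARGET_PATH="s3://your_bucket/target_dir/"
export AIRFLOW__COSMOS__REMOTE_TARGET_PATH_CONN_ID="aws_s3_conn"

The connection ID must refer to an existing Airflow connection whose credentials can write to that bucket.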

If these steps do not resolve the issue, it might be helpful to look into any specific error messages in the logs that could indicate what might be going wrong during the file transfer process.

To continue talking to Dosu, mention @dosu.

@dosubot dosubot bot added area:config Related to configuration, like YAML files, environment variables, or executor configuration dbt:test Primarily related to dbt test command or functionality labels Oct 17, 2024
@johnmcochran
Contributor Author

Hey @dosu,

Do I need to add the remote_target_path to my DbtTaskGroup function?

I was under the impression that simply adding the environment variables would let Cosmos detect that a remote target path is wanted and automatically send the target files to S3. Maybe you could help me understand why creating the environment variable for remote_target_path is necessary if remote_target_path also needs to be specified in the DAG?

I'm also curious whether the S3 remote_target_path gets overwritten each time the DAG is run, or whether it automatically adds subdirectories.

Thank you,
John


dosubot bot commented Oct 18, 2024

The remote_target_path configuration in Cosmos is indeed designed to be set using the AIRFLOW__COSMOS__REMOTE_TARGET_PATH environment variable, which should allow Cosmos to detect and use this path for storing dbt outputs in S3 [1]. However, if you're not seeing the expected behavior, it might be worth explicitly specifying the remote_target_path in your DbtTaskGroup function to ensure that the configuration is correctly applied.

Regarding the behavior of the remote_target_path when a DAG is run multiple times, the documentation does not specify whether the path gets overwritten or if subdirectories are automatically created. This behavior might depend on how the dbt tasks and the S3 storage are configured. It's a good idea to check the Airflow logs for any clues or test the behavior by running the DAG multiple times to observe how the files are managed in S3.

To continue talking to Dosu, mention @dosu.

@tatiana tatiana added the bug Something isn't working label Oct 18, 2024
@johnmcochran
Contributor Author

Updated code:

transform_data = DbtTaskGroup(
    group_id="transform_data",
    project_config=ProjectConfig(DBT_PROJECT_PATH),
    profile_config=profile_config,
    execution_config=execution_config,
    default_args={"retries": 2},
    remote_target_path=os.environ['AIRFLOW__COSMOS__REMOTE_TARGET_PATH'],
)

I get this error when trying to add remote_target_path to my DbtTaskGroup:

Broken DAG: [/usr/local/airflow/dags/sagitta_dags/sagitta_staff_dbt_dag.py]
Traceback (most recent call last):
  File "/usr/local/airflow/dags/sagitta_dags/sagitta_staff_dbt_dag.py", line 49, in sagitta_staff_dag
    transform_data = DbtTaskGroup(
                     ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/cosmos/airflow/task_group.py", line 26, in __init__
    TaskGroup.__init__(self, *args, **airflow_kwargs(**kwargs))
TypeError: TaskGroup.__init__() got an unexpected keyword argument 'remote_target_path'

I tried digging through the DbtTaskGroup function and couldn't find any references to remote_target_path either.

@johnmcochran
Contributor Author

johnmcochran commented Oct 23, 2024

Hello @tatiana,

Does the Bug tag that was assigned to this mean that it is a confirmed bug and there isn't something I need to fix on my end? Is there an ETA on how long issues normally take to be fixed?

I'm moving to Astronomer soon and would like to know whether I can rely on the cosmos stored test failures functionalities for some upcoming data quality tests I am implementing at my company.

Thank you,
John

@tatiana
Collaborator

tatiana commented Oct 28, 2024

Hi @johnmcochran ! I had temporarily marked it as a bug until we could reproduce/troubleshoot it further.

The variables remote_target_path and remote_target_path_conn_id were introduced as part of #1147 (Cosmos 1.6), with the intent of providing an alternative to the dbt ls cache stored in Airflow variables, which was implemented in #1014 (Cosmos 1.5). The goal of #1147 was to meet the needs of users uncomfortable with caching in the Airflow metadata database.

This is an example of how I set them up locally using environment variables (this is a deployment-wide setting) when using GCS:

export AIRFLOW__COSMOS__REMOTE_TARGET_PATH="gs://cosmos_dev/target_compiled"
export AIRFLOW__COSMOS__REMOTE_TARGET_PATH_CONN_ID="google_cloud_default" 

When doing this, Cosmos creates specific paths for each DbtTaskGroup or DbtDag within the user-defined remote_target_path. An example of the files created in GCS when I tested this feature:
(screenshot: GCS bucket listing showing one target subdirectory per DbtDag / DbtTaskGroup)

In this case, you can see three Cosmos Dbt "entities" were cached:

  • the DbtDag async_bq_profile_example
  • the DbtDag simple_dag_async
  • the DAG simple_dag_async_task_group that had one DbtTaskGroup task_group

For this to work, please confirm the following conditions are met:

  • you are running a Cosmos version that includes this feature (1.6 or later)
  • caching is enabled (AIRFLOW__COSMOS__ENABLE_CACHE and AIRFLOW__COSMOS__ENABLE_CACHE_DBT_LS, both of which default to True)
  • the project is being parsed with load_method=LoadMode.DBT_LS

Please, could you confirm whether this works for you?

BTW: would you be up for helping us improve our documentation so that, in the future, users don't face the same issues you ran into when trying out this feature?

@tatiana tatiana removed the bug Something isn't working label Oct 28, 2024
@johnmcochran
Contributor Author

Hey @tatiana,

TLDR: the cache config doesn't seem to be automatically generating AIRFLOW__COSMOS__ENABLE_CACHE=True and AIRFLOW__COSMOS__ENABLE_CACHE_DBT_LS=True as you suggested, despite using 1.7.0. Manually adding the two config variables did not resolve the issue, and neither did manually setting the RenderConfig in the DbtTaskGroup.

  • cosmos version

I'm using cosmos 1.7.0, set in my requirements.txt file

  • cache config

Neither of these was listed on the Admin > Configurations page in the Airflow UI. I added them to my Dockerfile and am still not seeing test failures stored in the S3 directory defined by my remote_target_path. I tried just setting these before looking into load_method, since I saw that dbt_ls should be the default parsing method. If dbt_ls isn't the default, I think the documentation isn't clear on that point (I see that automatic looks for a manifest file, and then falls back to dbt_ls).

I see for dbt_ls: "this requires the dbt executable to be installed on your machine". My assumption is that the below code would satisfy this requirement, but I'm not sure.

execution_config = ExecutionConfig(
    dbt_executable_path=DBT_EXECUTABLE_PATH,
)
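
(For context, a typical way to satisfy the "dbt executable installed" requirement is to install dbt into a dedicated virtualenv baked into the Airflow image and point dbt_executable_path at that binary; the path below is only illustrative, not necessarily what this project uses:)

import os
from cosmos import ExecutionConfig

# Hypothetical location of a dbt binary installed in a virtualenv inside the image
DBT_EXECUTABLE_PATH = os.getenv("DBT_EXECUTABLE_PATH", "/usr/local/airflow/dbt_venv/bin/dbt")

execution_config = ExecutionConfig(
    dbt_executable_path=DBT_EXECUTABLE_PATH,  # dbt binary used by Cosmos tasks
)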

This execution config is passed into my DbtTaskGroup:

transform_data = DbtTaskGroup(
    group_id="transform_data",
    project_config=ProjectConfig(DBT_PROJECT_PATH),
    profile_config=profile_config,
    execution_config=execution_config,
    default_args={"retries": 2},
)

  • load_method=LoadMode.DBT_LS

The example included in the documentation mentions setting the load_method in RenderConfig. I did not have this, so I updated my DbtTaskGroup:

transform_data = DbtTaskGroup(
    group_id="transform_data",
    project_config=ProjectConfig(DBT_PROJECT_PATH),
    profile_config=profile_config,
    execution_config=execution_config,
    render_config=RenderConfig(
        load_method=LoadMode.DBT_LS,
    ),
    default_args={"retries": 2},
)

Unfortunately, none of these changes yielded test failures being stored in my designated S3 directory. I still have the failing test within my dbt project.

@tatiana
Collaborator

tatiana commented Oct 30, 2024

@johnmcochran thanks for your prompt reply

TLDR: the cache config doesn't seem to be automatically generating AIRFLOW__COSMOS__ENABLE_CACHE=True and AIRFLOW__COSMOS__ENABLE_CACHE_DBT_LS=True as you suggested, despite using 1.7.0. Manually adding the two config variables did not resolve the issue, and neither did manually setting the RenderConfig in the DbtTaskGroup.

Cosmos does not materialize configuration on behalf of the user (e.g. as environment variables). Still, the default behaviour is to consider those settings true if the user does not specify them, as shown in the code below:

enable_cache = conf.getboolean("cosmos", "enable_cache", fallback=True)
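
If you want to double-check what your deployment actually resolves, a quick diagnostic you could run from a python shell inside the scheduler container (this just mirrors the pattern above; it is not part of Cosmos itself):

from airflow.configuration import conf

# Airflow picks up env vars such as AIRFLOW__COSMOS__ENABLE_CACHE automatically,
# so these calls show the effective values Cosmos would see.
print(conf.getboolean("cosmos", "enable_cache", fallback=True))
print(conf.getboolean("cosmos", "enable_cache_dbt_ls", fallback=True))
print(conf.get("cosmos", "remote_target_path", fallback=None))
print(conf.get("cosmos", "remote_target_path_conn_id", fallback=None))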

Next steps

  1. Airflow version
    Could you confirm that you're using Airflow 2.8.0 or higher?

  2. Since things don't seem to be working, please could you check what the scheduler logs say?

It would be great to see whether your scheduler logs include entries such as 'Trying to parse the dbt project using dbt ls cache' or any 'cache miss' messages.

@johnmcochran
Copy link
Contributor Author

Hey @tatiana ,


  1. Airflow version

I see in the airflow UI the following: Astronomer Runtime 12.1.1 based on Airflow 2.10.2+astro.1

2. Since things don't seem to be working, please could you check what the scheduler logs say?

I tried checking the logs in the Airflow UI for the specific failed task and didn't see either of those phrases ('Trying to parse' or 'cache miss'); no hits. I then checked the scheduler docker container and inspected the log files to see if there was different info there, and found the following.

Hopefully there's an easier way to check scheduler logs than what I just did, ha. Could the warning about the conversion function be the issue? This chunk of logs repeated itself in a similar fashion many times within the scheduler logs.

[2024-10-30T00:06:32.820+0000] {logging_mixin.py:190} INFO - [2024-10-30T00:06:32.819+0000] {dagbag.py:588} INFO - Filling up the DagBag from /usr/local/airflow/dags/sagitta_dags/sagitta_staff_dbt_dag.py
[2024-10-30T00:06:33.021+0000] {logging_mixin.py:190} INFO - [2024-10-30T00:06:33.020+0000] {graph.py:470} INFO - Trying to parse the dbt project using dbt ls cache cosmos_cache__sagitta_staff_dag__transform_data...
[2024-10-30T00:06:33.240+0000] {logging_mixin.py:190} INFO - [2024-10-30T00:06:33.240+0000] {cache.py:320} INFO - Cosmos performance: time to calculate cache identifier cosmos_cache__sagitta_staff_dag__transform_data for current version: 0.19846626208163798
[2024-10-30T00:06:33.240+0000] {logging_mixin.py:190} INFO - [2024-10-30T00:06:33.240+0000] {graph.py:487} INFO - Cosmos performance [aac60146e4b1|35569]: The cache size for cosmos_cache__sagitta_staff_dag__transform_data is 10340
[2024-10-30T00:06:33.241+0000] {logging_mixin.py:190} INFO - [2024-10-30T00:06:33.241+0000] {graph.py:495} INFO - Cosmos performance: Cache hit for cosmos_cache__sagitta_staff_dag__transform_data - eea696cc7e441696d559d15d80775827,d41d8cd98f00b204e9800998ecf8427e
[2024-10-30T00:06:33.241+0000] {logging_mixin.py:190} INFO - [2024-10-30T00:06:33.241+0000] {graph.py:412} INFO - Total nodes: 14
[2024-10-30T00:06:33.241+0000] {logging_mixin.py:190} INFO - [2024-10-30T00:06:33.241+0000] {graph.py:413} INFO - Total filtered nodes: 14
[2024-10-30T00:06:33.241+0000] {logging_mixin.py:190} INFO - [2024-10-30T00:06:33.241+0000] {converter.py:264} INFO - Cosmos performance (sagitta_staff_dag__transform_data) - [aac60146e4b1|35569]: It took 0.221s to parse the dbt project for DAG using LoadMode.DBT_LS_CACHE
[2024-10-30T00:06:33.241+0000] {logging_mixin.py:190} INFO - [2024-10-30T00:06:33.241+0000] {graph.py:206} WARNING - Unavailable conversion function for <DbtResourceType.EXPOSURE> (node <exposure.mjdw_dbt_project.wes-aperture-1>). Define a converter function using render_config.node_converters.
[2024-10-30T00:06:33.245+0000] {logging_mixin.py:190} INFO - [2024-10-30T00:06:33.245+0000] {converter.py:308} INFO - Cosmos performance (sagitta_staff_dag__transform_data) - [aac60146e4b1|35569]: It took 0.00423s to build the Airflow DAG.
[2024-10-30T00:06:33.247+0000] {processor.py:925} INFO - DAG(s) 'sagitta_staff_dag' retrieved from /usr/local/airflow/dags/sagitta_dags/sagitta_staff_dbt_dag.py
[2024-10-30T00:06:33.299+0000] {logging_mixin.py:190} INFO - [2024-10-30T00:06:33.299+0000] {dag.py:3239} INFO - Sync 1 DAGs
[2024-10-30T00:06:33.309+0000] {logging_mixin.py:190} INFO - [2024-10-30T00:06:33.309+0000] {dag.py:4180} INFO - Setting next_dagrun for sagitta_staff_dag to None, run_after=None
[2024-10-30T00:06:33.623+0000] {processor.py:208} INFO - Processing /usr/local/airflow/dags/sagitta_dags/sagitta_staff_dbt_dag.py took 0.809 seconds

@johnmcochran
Contributor Author

Hi @tatiana,

Have you had a chance to take a look at the logs I posted above? I bolded the portions that may be relevant, but unfortunately didn't see any hits for the things you wanted me to look for.

Sincerely,
John

@johnmcochran
Contributor Author

Hi @tatiana,

Checking in again to see if you could take a look at the response I posted to your questions :)

Sincerely,
John

@phanikumv phanikumv added this to the Cosmos 1.8.0 milestone Dec 3, 2024
@tatiana tatiana changed the title Setting remote_target_path and remote_target_path_conn_id, but no files land in S3 after running project dag Dec 12, 2024
@tatiana
Collaborator

tatiana commented Dec 12, 2024

Hi @johnmcochran, I'm sorry for the delay. I've been sidetracked with other priorities at Astronomer. I'm back to this issue.

At the beginning of the issue, you stated:

I'm trying to push data test failures to S3. To do this, I'm using the documentation here (https://astronomer.github.io/astronomer-cosmos/configuration/cosmos-conf.html#remote-target-path) to set up the target directory pointing to my S3 bucket.

There was an issue in our documentation, which was fixed in #1305. By setting remote_target_path and remote_target_path_conn_id, you will not be uploading test failures to S3. As mentioned in the docs:

Cosmos currently only supports copying files from the compiled directory within the target folder

And also:

only when the execution mode is set to ExecutionMode.AIRFLOW_ASYNC
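
In other words, with the current implementation the remote upload only applies to a setup roughly like the sketch below (a minimal sketch; the AIRFLOW_ASYNC mode is a newer, experimental execution mode, and even then only the compiled files are copied, not test failures):

from cosmos import ExecutionConfig, ExecutionMode

# Sketch only: with AIRFLOW__COSMOS__REMOTE_TARGET_PATH and
# AIRFLOW__COSMOS__REMOTE_TARGET_PATH_CONN_ID set at the deployment level,
# Cosmos uploads the compiled/ directory of the dbt target folder,
# and only for this execution mode.
execution_config = ExecutionConfig(
    execution_mode=ExecutionMode.AIRFLOW_ASYNC,
)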

As part of Cosmos 1.8, we're aiming to improve Cosmos callback support and implement the feature you're asking for.

This feature request seems to be a duplicate of #801, which is in our current sprint. It is part of a bigger goal: #1349.

Therefore, I suggest you track one of those tickets, and I'll close this one as a duplicate.

Sorry again for the delay.

In regards to the scheduler logs, you mentioned:

Tried checking the logs in Airflow UI for the specific failed task and didn't see any of these phrases ('Trying to parse' or 'cache miss'), no hits. I checked the scheduler docker container, inspected the log files to see if there's different info, and found the following.

These logs will be in the Airflow scheduler, not in the Airflow UI, which currently only displays task logs. Usually, the easiest way to see scheduler logs with the Astro CLI is:

astro dev logs -s
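
If there's a lot of output, filtering for the Cosmos parsing lines might help; something along these lines (the grep pattern is just a suggestion):

astro dev logs -s | grep -i "dbt ls"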
