Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Database Connection Failure on AML clusters using kedro ThreadRunner #3951

Open
gitgud5000 opened this issue Jun 13, 2024 · 4 comments
Open

Comments

@gitgud5000
Copy link

Description

I have an issue running Kedro with ThreadRunner to execute the following pipeline:
Pasted image 20240613004553

The primary layer shown in the Kedro Viz above is a series of 21 SQLScriptDataset objects (a pandas.sql_dataset.SQLQueryDataset subclass which formats input queries in a special way using parameters in the catalog and then calls super().__init__).

This Kedro pipeline is triggered as part of a CommandJob in Azure Machine Learning (AML), using a command_job.py which runs a Kedro session with something like this:

if __name__ == "__main__":
...
with KedroSession.create() as session:
	session.run(...,runner = runner)

Problem/Error

After most or all of the datasets in the primary layer are loaded, SQLAlchemy produces the following error:

sqlalchemy.pool.impl.QueuePool Error on connect(): ORA-28547: connection to server failed, probable Oracle Net admin error
...
DatabaseError: (cx_Oracle.DatabaseError) ORA-28547: connection to server failed, probable Oracle Net admin error
(Background on this error at: https://sqlalche.me/e/20/4xp6)

Context

In AML, these jobs can be run on two types of compute: a Compute Instance, which is an Ubuntu VM used for development, and Clusters, which are managed infrastructures that allow for the creation of single/multi-node computes for deployment.

When executing the CommandJob, essentially running kedro run with ThreadRunner on a Cluster, the job fails. However, this issue does not occur when running the same job on a Compute Instance, or when run locally from source using kedro run.

These command jobs run with the same environment image in both cases.

Steps to Reproduce

  • Set up an AML Cluster with the specified environment.
  • Execute the CommandJob to run the Kedro pipeline with ThreadRunner.
  • Observe the error.

Expected Result

I would expect the job to run successfully on the cluster, as it does on other compute instances with the same configuration.

Actual Result

The job fails with the following error:

sqlalchemy.pool.impl.QueuePool Error on connect(): ORA-28547: connection to server failed, probable Oracle Net admin error
...
DatabaseError: (cx_Oracle.DatabaseError) ORA-28547: connection to server failed, probable Oracle Net admin error
(Background on this error at: https://sqlalche.me/e/20/4xp6)

Attempts to Resolve

  • Using a small max_workers in the runner configuration.
  • Used connection parameters for the engine
  • Tried different engine parameters, including:
{
    "pool_size": 5,
    "max_overflow": 10,
    "pool_timeout": 120,
    "pool_recycle": 1800,
    "pool_reset_on_return": null,
    "isolation_level": "AUTOCOMMIT",
    "pool_pre_ping": true,
    "echo_pool": "debug"
}
  • Different Oracle (ugh, I know) drivers
  • Different versions of oracledb and cx-Oracle no luck.

Logs

Here is a log file of a run with 'echo_pool': 'debug' and a similar setup, with 5 SQLScriptDataset as input.
Running in AzureML.log
Pasted image 20240613014731

Your Environment

  • Python version: 3.11.9
  • kedro version: 0.19.6
  • kedro-datasets version: 3.0.0
  • cx-Oracle version: 8.3.0
  • OS: mcr.microsoft.com/azureml/inference-base-2204:20240531.v1
  • Oracle client: oracle/instantclient_21_8
  • Azure ML compute type: Standard_D16_v3
  • Oracle DB version: 19
@gitgud5000 gitgud5000 changed the title Database Connection Failure on AML clusters using kedro ThreadRunner Database Connection Failure on AML clusters using kedro ThreadRunner Jun 13, 2024
@astrojuanlu
Copy link
Member

Hi @gitgud5000, thanks for opening this issue and sorry you had a bumpy experience. We will look into this shortly.

@ArmandoRl1
Copy link

I'm having a similar problem where the issue appears when running on a cluster, but not running locally or on a compute instance.

@astrojuanlu
Copy link
Member

Hi @ArmandoRl1, could you give more details on your setup? @gitgud5000 already gave a good writeup but the more information we have about this the better.

@ricardorodl
Copy link

I replicated @gitgud5000 environment and I am getting the same error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants