Database Connection Failure on AML clusters using kedro `ThreadRunner` #3951

gitgud5000 · 2024-06-13T01:06:31Z

Description

I have an issue running Kedro with ThreadRunner to execute the following pipeline:

[Image: Kedro Viz rendering of the pipeline]

The primary layer shown in the Kedro Viz above is a series of 21 SQLScriptDataset objects (a pandas.sql_dataset.SQLQueryDataset subclass which formats input queries in a special way using parameters in the catalog and then calls super().__init__).
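To make the dataset description above concrete, here is a minimal sketch of what such a subclass might look like. The actual SQLScriptDataset is not shown in this issue, so the `query_params` argument, the `render_query` helper, and the use of `string.Template` placeholders are all assumptions for illustration.

```python
from string import Template


def render_query(template: str, params: dict) -> str:
    """Fill $name placeholders in a SQL template (hypothetical formatting rule)."""
    return Template(template).substitute(params)


try:
    from kedro_datasets.pandas import SQLQueryDataset
except ImportError:  # keeps the sketch importable without kedro-datasets
    SQLQueryDataset = None

if SQLQueryDataset is not None:

    class SQLScriptDataset(SQLQueryDataset):
        """Assumed shape of the dataset: format the query, then defer to the parent."""

        def __init__(self, sql, query_params=None, credentials=None, load_args=None):
            formatted = render_query(sql, query_params or {})
            super().__init__(sql=formatted, credentials=credentials, load_args=load_args)
```

Because the formatting happens before `super().__init__`, the connection handling is entirely the parent class's, which is why the rest of this report focuses on the SQLAlchemy pool rather than on the subclass.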

This Kedro pipeline is triggered as part of a CommandJob in Azure Machine Learning (AML), using a command_job.py which runs a Kedro session with something like this:

if __name__ == "__main__":
    ...
    with KedroSession.create() as session:
        session.run(..., runner=runner)
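For reference, a fuller version of such an entry point might look like the sketch below. It is not the actual command_job.py: the option names (`--project-path`, `--pipeline`, `--max-workers`) and the `parse_args`/`main` helpers are hypothetical, and the Kedro imports are done lazily so the argument parsing can be exercised on its own.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the (hypothetical) command-line options of the job script."""
    parser = argparse.ArgumentParser(description="Run a Kedro pipeline with ThreadRunner")
    parser.add_argument("--project-path", default=".", help="Root of the Kedro project")
    parser.add_argument("--pipeline", default=None, help="Registered pipeline name")
    parser.add_argument("--max-workers", type=int, default=4, help="Thread pool size")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)

    # Imported here so this module can be inspected without Kedro installed.
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project
    from kedro.runner import ThreadRunner

    project_path = Path(args.project_path).resolve()
    bootstrap_project(project_path)  # loads project settings and pipelines

    # max_workers caps how many nodes (and hence dataset loads) run at once.
    runner = ThreadRunner(max_workers=args.max_workers)
    with KedroSession.create(project_path=project_path) as session:
        session.run(pipeline_name=args.pipeline, runner=runner)
```

In the job script, `main()` would be called under the usual `if __name__ == "__main__":` guard.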

Problem/Error

After most or all of the datasets in the primary layer are loaded, SQLAlchemy produces the following error:

sqlalchemy.pool.impl.QueuePool Error on connect(): ORA-28547: connection to server failed, probable Oracle Net admin error
...
DatabaseError: (cx_Oracle.DatabaseError) ORA-28547: connection to server failed, probable Oracle Net admin error
(Background on this error at: https://sqlalche.me/e/20/4xp6)

Context

In AML, these jobs can be run on two types of compute: a Compute Instance, which is an Ubuntu VM used for development, and Clusters, which are managed single-node or multi-node compute used for deployment.

When the CommandJob (essentially kedro run with ThreadRunner) is executed on a Cluster, the job fails. However, the issue does not occur when the same job runs on a Compute Instance, or when it is run locally from source using kedro run.

These command jobs run with the same environment image in both cases.

Steps to Reproduce

Set up an AML Cluster with the specified environment.
Execute the CommandJob to run the Kedro pipeline with ThreadRunner.
Observe the error.
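A Kedro-free reproducer may help narrow down whether the failure depends on Kedro at all or only on opening several connections from threads on the cluster. The sketch below is an assumption about how one could test that, not something taken from the failing job: `probe_connections` is a hypothetical helper, and `connect` is any zero-argument callable you supply that opens a database connection.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def probe_connections(connect: Callable[[], object], attempts: int, max_workers: int) -> List[str]:
    """Call connect() `attempts` times across threads and return the error messages."""

    def attempt(_: int) -> Optional[str]:
        try:
            conn = connect()
        except Exception as exc:  # record the failure instead of stopping the probe
            return f"{type(exc).__name__}: {exc}"
        close = getattr(conn, "close", None)
        if callable(close):
            close()  # release the connection if the object supports it
        return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(attempt, range(attempts)))
    return [message for message in results if message is not None]
```

For example, `connect` could wrap `cx_Oracle.connect(...)` or an SQLAlchemy `engine.connect` with the same credentials the pipeline uses. Running the probe on both a Cluster and a Compute Instance with the same `max_workers` would show whether ORA-28547 appears outside of Kedro.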

Expected Result

I would expect the job to run successfully on the cluster, as it does on other compute instances with the same configuration.

Actual Result

The job fails with the following error:

sqlalchemy.pool.impl.QueuePool Error on connect(): ORA-28547: connection to server failed, probable Oracle Net admin error
...
DatabaseError: (cx_Oracle.DatabaseError) ORA-28547: connection to server failed, probable Oracle Net admin error
(Background on this error at: https://sqlalche.me/e/20/4xp6)

Attempts to Resolve

Used a small max_workers in the runner configuration.
Passed connection parameters to the engine.
Tried different engine parameters, including:

{
    "pool_size": 5,
    "max_overflow": 10,
    "pool_timeout": 120,
    "pool_recycle": 1800,
    "pool_reset_on_return": null,
    "isolation_level": "AUTOCOMMIT",
    "pool_pre_ping": true,
    "echo_pool": "debug"
}

Tried different Oracle (ugh, I know) drivers.
Tried different versions of oracledb and cx-Oracle, with no luck.
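As an illustration of how the JSON engine parameters listed above map onto SQLAlchemy, the sketch below parses them and forwards them to `create_engine`. How they were actually passed in the failing job is not shown in this issue; the `make_engine` helper and its `url` argument are hypothetical, while the keyword names themselves are standard `create_engine` arguments.

```python
import json

ENGINE_PARAMS_JSON = """
{
    "pool_size": 5,
    "max_overflow": 10,
    "pool_timeout": 120,
    "pool_recycle": 1800,
    "pool_reset_on_return": null,
    "isolation_level": "AUTOCOMMIT",
    "pool_pre_ping": true,
    "echo_pool": "debug"
}
"""

# json.loads turns JSON null/true into Python None/True.
engine_kwargs = json.loads(ENGINE_PARAMS_JSON)


def make_engine(url: str, **overrides):
    """Build an engine with the pool settings above (hypothetical helper)."""
    from sqlalchemy import create_engine  # imported lazily; requires SQLAlchemy

    return create_engine(url, **{**engine_kwargs, **overrides})
```

With `pool_size=5` and `max_overflow=10`, a single engine's QueuePool may open up to 15 connections at once, so lowering `max_workers` alone does not necessarily bound the number of concurrent connections.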

Logs

Here is a log file of a run with 'echo_pool': 'debug' and a similar setup, with 5 SQLScriptDataset objects as input.
Running in AzureML.log

Your Environment

Python version: 3.11.9
kedro version: 0.19.6
kedro-datasets version: 3.0.0
cx-Oracle version: 8.3.0
OS: mcr.microsoft.com/azureml/inference-base-2204:20240531.v1
Oracle client: oracle/instantclient_21_8
Azure ML compute type: Standard_D16_v3
Oracle DB version: 19

astrojuanlu · 2024-06-15T14:44:14Z

Hi @gitgud5000, thanks for opening this issue and sorry you had a bumpy experience. We will look into this shortly.

ArmandoRl1 · 2024-06-26T18:29:37Z

I'm having a similar problem where the issue appears when running on a cluster, but not running locally or on a compute instance.

astrojuanlu · 2024-06-27T07:30:07Z

Hi @ArmandoRl1, could you give more details on your setup? @gitgud5000 already gave a good writeup but the more information we have about this the better.

ricardorodl · 2024-07-11T18:45:00Z

I replicated @gitgud5000's environment and I am getting the same error.

gitgud5000 changed the title ~~Database Connection Failure on AML clusters using kedro ThreadRunner~~ Database Connection Failure on AML clusters using kedro ThreadRunner Jun 13, 2024

github-actions bot mentioned this issue Jul 1, 2024

Monthly issue metrics report #3975

Open
