SSHHook get_conn() does not re-use client #10874

baolsen · 2020-09-11T05:31:27Z

Apache Airflow version: 1.10.8

Environment:

Cloud provider or hardware configuration: 4 VCPU 8GB RAM VM
OS (e.g. from /etc/os-release): RHEL 7.7
Kernel (e.g. uname -a): Linux 3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
Others:

What happened:

Sub-classing the SSHOperator and calling its execute repeatedly will create a new SSH connection each time, to run the command.

Not sure if this is a bug or an enhancement / feature. I can re-log as a feature request if needed.

What you expected to happen:

SSH client / connection should be re-used if it was already established.

How to reproduce it:

Sub-class the SSHOperator.
In your sub class execute method, call super().execute() a few times.
Observe in the logs how an SSH Connection is created each time.

Anything else we need to know:

The SSHHook.get_conn() method creates a new Paramiko SSH client each time. Despite storing the client on self.client before returning, the hook get_conn() method does not actually use the self.client next time. A new connection is therefore created.
I think this is because the SSH Operator uses a context manager to operate on the Paramiko client, so the Hook needs to create a new client if a previous context manager had closed the last one.

Fixing this would mean changing the SSH Operator execute() to not use the ssh_hook.get_conn() as a context manager since this will open and close the session each time. Perhaps the conn can be closed with the operator's post_execute method rather than in the execute.

Example logs

[2020-09-11 07:04:37,960] {ssh_operator.py:89} INFO - ssh_conn_id is ignored when ssh_hook is provided.
[2020-09-11 07:04:37,960] {logging_mixin.py:112} INFO - [2020-09-11 07:04:37,960] {ssh_hook.py:166} WARNING - Remote Identification Change is not verified. This wont protect against Man-In-The-Middle attacks
[2020-09-11 07:04:37,961] {logging_mixin.py:112} INFO - [2020-09-11 07:04:37,961] {ssh_hook.py:170} WARNING - No Host Key Verification. This wont protect against Man-In-The-Middle attacks
[2020-09-11 07:04:37,976] {logging_mixin.py:112} INFO - [2020-09-11 07:04:37,975] {transport.py:1819} INFO - Connected (version 2.0, client OpenSSH_7.4)
[2020-09-11 07:04:38,161] {logging_mixin.py:112} INFO - [2020-09-11 07:04:38,161] {transport.py:1819} INFO - Auth banner: b'Authorized uses only. All activity may be monitored and reported.\n'
[2020-09-11 07:04:38,161] {logging_mixin.py:112} INFO - [2020-09-11 07:04:38,161] {transport.py:1819} INFO - Authentication (publickey) successful!
[2020-09-11 07:04:38,161] {ssh_operator.py:109} INFO - Running command: [REDACTED COMMAND 1]
...
[2020-09-11 07:04:38,383] {ssh_operator.py:89} INFO - ssh_conn_id is ignored when ssh_hook is provided.
[2020-09-11 07:04:38,383] {logging_mixin.py:112} INFO - [2020-09-11 07:04:38,383] {ssh_hook.py:166} WARNING - Remote Identification Change is not verified. This wont protect against Man-In-The-Middle attacks
[2020-09-11 07:04:38,383] {logging_mixin.py:112} INFO - [2020-09-11 07:04:38,383] {ssh_hook.py:170} WARNING - No Host Key Verification. This wont protect against Man-In-The-Middle attacks
[2020-09-11 07:04:38,399] {logging_mixin.py:112} INFO - [2020-09-11 07:04:38,399] {transport.py:1819} INFO - Connected (version 2.0, client OpenSSH_7.4)
[2020-09-11 07:04:38,545] {logging_mixin.py:112} INFO - [2020-09-11 07:04:38,545] {transport.py:1819} INFO - Auth banner: b'Authorized uses only. All activity may be monitored and reported.\n'
[2020-09-11 07:04:38,546] {logging_mixin.py:112} INFO - [2020-09-11 07:04:38,546] {transport.py:1819} INFO - Authentication (publickey) successful!
[2020-09-11 07:04:38,546] {ssh_operator.py:109} INFO - Running command: [REDACTED COMMAND 2]
....
[2020-09-11 07:04:38,722] {ssh_operator.py:89} INFO - ssh_conn_id is ignored when ssh_hook is provided.
[2020-09-11 07:04:38,722] {logging_mixin.py:112} INFO - [2020-09-11 07:04:38,722] {ssh_hook.py:166} WARNING - Remote Identification Change is not verified. This wont protect against Man-In-The-Middle attacks
[2020-09-11 07:04:38,723] {logging_mixin.py:112} INFO - [2020-09-11 07:04:38,723] {ssh_hook.py:170} WARNING - No Host Key Verification. This wont protect against Man-In-The-Middle attacks
[2020-09-11 07:04:38,734] {logging_mixin.py:112} INFO - [2020-09-11 07:04:38,734] {transport.py:1819} INFO - Connected (version 2.0, client OpenSSH_7.4)
[2020-09-11 07:04:38,867] {logging_mixin.py:112} INFO - [2020-09-11 07:04:38,867] {transport.py:1819} INFO - Auth banner: b'Authorized uses only. All activity may be monitored and reported.\n'
[2020-09-11 07:04:38,868] {logging_mixin.py:112} INFO - [2020-09-11 07:04:38,867] {transport.py:1819} INFO - Authentication (publickey) successful!
[2020-09-11 07:04:38,868] {ssh_operator.py:109} INFO - Running command: [REDACTED COMMAND 3]

The text was updated successfully, but these errors were encountered:

rootcss · 2020-09-17T19:06:59Z

@kaxil Please assign this to me.

eladkal · 2021-07-16T17:31:25Z

@rootcss do you wish to finish the PR you started ?

potiuk · 2021-07-18T16:45:28Z

I read the issue and - unless I misunderstood it - I think we should close both the issue and PR.

@baolsen Why would you like to run execute() method multiple times? The whole idea of execute() method is that it is run ONCE per task execution, in a separate process and the process is killed. This is pretty much how it always works, so the fact that it opens a new connection every time it is called is pretty much expected. Caching is hard, and if we do not have to do it, we should not do it.

The right way of reusing a connection (even if you want to compose or subclass something) is to reuse HOOK not operator. You can easily create single Hook and pass it around and use it many times while keeping the connection open, but you should not try to do it with the operator - pretty much always when you want to write some behaviour where you want to reuse connection for multiple API calls etc, the hook reuse is right not the operator. You should always in this case write a custom operator. You can even subclass existing operator, if this operator has some reusable methods, but then you should never call super execute method multiple times. Execute is not designed to be called more than once.

I am tempted to close it straight away, but this a very old issue @kaxil and @eladkal reacted to it - without pointing that out - maybe I do understand it wrongly ? Can someone shed some light on it :)

uranusjr · 2021-07-19T03:24:17Z

The right way of reusing a connection (even if you want to compose or subclass something) is to reuse HOOK not operator.

The problem is SSHHook also does not reuse the underlying Paramiko connection, a new paramiko.SSHClient is created every time SSHHook.get_conn() is called, and it’s very non-obvious how that connection should be reused.

hook = SSHHook(...)

client = hook.get_conn()

# Does not reuse! A brand new connection is established.
# Plus the previous connection is now dangling.
client = hook.get_conn()

But I don’t think there’s a good quick solution to the problem; SSHHook is fundamentally not designed to be reused. So this probably should be removed from good-first-issue and we need to properly re-design SSHHook’s interface to make it work.

baolsen · 2021-07-19T06:47:06Z

Thank you all for the feedback.

My use case is to split and zip some files on a remote server if those files are above a specific size.

To do this, I do the following over SSH:

list the input files and check their size
zip and split the largest ones using a Python script I send over SSH
Check the output by iterating the output files

I put them in a single task using CustomSSHOperator which inherits SSHOperator. I want my task to be:

Able to resume if the process fails on file N
Simplify the support process so we only re-run 1 task if it fails
Avoid performance issues with 1.10.x Airflow scheduler if we had many tasks

In my environment connections are expensive (for local and remote), complex and error prone.
This is mainly because of the corporate authentication implementation at my site.

But SSHOperator does not allow multiple commands to be run over 1 connection even when subclassed.
I had to find a way to run multiple commands over the same connection.

I could have subclassed SSHOperator and copied all of the "boiler plate" client code.
That code is complex and doesn't directly add value to my use case. I don't want to maintain and update it myself.

So I ended up for each command, changing the self.command and calling super.execute() as it was simplest.
I agree it is not the right way to solve this just because it was convenient for me ;)

One option could be to refactor SSHOperator to isolate the "run an SSH command" part which can be reused by subclasses.

But we seem not decided yet whether the hook is a better place to do this. If so I think the changes could be more extensive and impactful.

uranusjr · 2021-07-19T07:52:55Z

One option could be to refactor SSHOperator to isolate the "run an SSH command" part which can be reused by subclasses.

This makes total sense to me, but I believe we still need to rewrite the hook a bit to make this happen. The operator is backend by the hook, so it can’t do anything the hook does not do; in order to be able to run multiple commands with the same connection, the hook needs to be able to return that connection for reuse in the first place, and currently the hook does not do that. And that change is not trivial, is what I was trying to say.

potiuk · 2021-07-19T12:18:24Z

@baolsen:

| One option could be to refactor SSHOperator to isolate the "run an SSH command" part which can be reused by subclasses.

Oh yeah Makes total sense. And (see below) it can be easily fixed to provide client caching in Hook

@uranusjr :

# Does not reuse! A brand new connection is established.
# Plus the previous connection is now dangling.
client = hook.get_conn()

Yeah. Because SSHHook pretty much does not have its own methods so it was never a problem, I guess. It delegates everything to SSHClient. It is a Hook that on it's own is not even a thin wrapper around the SSHClient but delegates all the work to it. I think quick fixing it and caching the client should be two lines code (it happens in other operators):

if self.client:
    return client

uranusjr · 2021-07-20T04:44:34Z

That would also mean SSHHook cannot be re-used though, because afaict paramiko.SSHClient does not support reopening a client once it’s closed. So if we do

if self.client:
    return client

This would no longer work:

hook = SSHHook(...)

with hook.get_conn() as client:
    ...

with hook.get_conn() as client:
    # Will fail! The client was already closed, and calling __enter__ again does not reopen it.

So it’s breaking one usage to accomodate another, which feels wrong to me.

potiuk · 2021-07-25T20:19:59Z

Proposals?

baolsen · 2021-07-30T12:19:41Z

I can put together a PR, to refactor SSHOperator to isolate the "run an SSH command" part

baolsen · 2021-08-06T05:23:27Z

PR is up :) Observations and suggestions welcome.
Happy to make other improvements based on the feedback.
#17378

…7378)

baolsen added the kind:bug This is a clearly a bug label Sep 11, 2020

kaxil assigned rootcss Sep 17, 2020

rootcss mentioned this issue Sep 17, 2020

Re-use existing paramiko connection in SSHHook instead of recreating #11002

Closed

rootcss mentioned this issue Oct 6, 2020

Re-use paramiko connection in SSHHook instead of recreating #11316

Closed

eladkal added the area:providers label Jan 13, 2021

eladkal added the good first issue label Jul 16, 2021

baolsen mentioned this issue Aug 2, 2021

Refactor SSHOperator so a subclass can run many commands (#10874) #17378

Merged

potiuk added the hacktoberfest label Oct 6, 2021

potiuk closed this as completed in #17378 Oct 13, 2021

potiuk pushed a commit that referenced this issue Oct 13, 2021

Refactor SSHOperator so a subclass can run many commands (#10874) (#1…

73fcbb0

…7378)

fredthomsen mentioned this issue Nov 1, 2021

Status of testing Providers that were prepared on October 30, 2021 #19328

Closed

54 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SSHHook get_conn() does not re-use client #10874

SSHHook get_conn() does not re-use client #10874

baolsen commented Sep 11, 2020 •

edited by uranusjr

Loading

rootcss commented Sep 17, 2020

eladkal commented Jul 16, 2021

potiuk commented Jul 18, 2021 •

edited

Loading

uranusjr commented Jul 19, 2021

baolsen commented Jul 19, 2021

uranusjr commented Jul 19, 2021

potiuk commented Jul 19, 2021

uranusjr commented Jul 20, 2021

potiuk commented Jul 25, 2021

baolsen commented Jul 30, 2021

baolsen commented Aug 6, 2021

SSHHook get_conn() does not re-use client #10874

SSHHook get_conn() does not re-use client #10874

Comments

baolsen commented Sep 11, 2020 • edited by uranusjr Loading

rootcss commented Sep 17, 2020

eladkal commented Jul 16, 2021

potiuk commented Jul 18, 2021 • edited Loading

uranusjr commented Jul 19, 2021

baolsen commented Jul 19, 2021

uranusjr commented Jul 19, 2021

potiuk commented Jul 19, 2021

uranusjr commented Jul 20, 2021

potiuk commented Jul 25, 2021

baolsen commented Jul 30, 2021

baolsen commented Aug 6, 2021

baolsen commented Sep 11, 2020 •

edited by uranusjr

Loading

potiuk commented Jul 18, 2021 •

edited

Loading