-
Notifications
You must be signed in to change notification settings - Fork 16.4k
Improve SFTP hook's directory transfer to use a single connection for multiple files #46582
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve SFTP hook's directory transfer to use a single connection for multiple files #46582
Conversation
|
I also added the Hi @dabla , I think this might be related to your recent PRs. Please feel free to take a look. |
|
@Dawnpool really good catch, as indeed this will speed up the store_directory and retrieve_directory functions as only one connection will be created, but personally, and that's a personal opinion I don't like the wrapper that much even though it solves the performance issue. Wouldn't it be possible to refactor the get_conn method so that it caches the connection so when invoked a second time it just returns the cached instance and once it's finished removed the cached connection, then the code using the context manager stays as it was. This is how it could be implemented, I've tested it locally and it works: The hook would have 3 new fields, namely self._ssh_conn, self._sftp_conn and self._conn_count. The for example in retrieve_directory: Of course if we could replace the "dummy" usage of the self.get_conn() context manager as a decorator, then we would have best of both worlds and only annotate methods where we want that behaviour without changing the methods signature. To be clear if it's not possible, then I would stick with your proposed solution. WDYT? |
|
Hi @dabla , |
|
Hi, @dabla |
dabla
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, looks good to me!
|
Nice, but can you please add some unit tests testing the behaviour (identity of the client). |
Good point @potiuk. You could indeed add some assertions within the existing tests to make sure connection gets opened/closed only once during the whole operation. |
| file_path = os.path.join(root, file_name) | ||
| new_remote_path = os.path.join(remote_full_path, os.path.relpath(file_path, local_full_path)) | ||
| self.store_file(new_remote_path, file_path, confirm) | ||
| with self.get_conn(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh no, I didn't realize it. I'll create a fix PR right away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries I forgot to warn you as we had to fix backward compatibility yesterday, my bad
… multiple files (apache#46582) * Improve SFTP directory transfer to use a single connection in multiple files * Add with_conn wrapper * Fix delete_directory * Delete wrapper and update get_conn * Add test code
… multiple files (apache#46582) * Improve SFTP directory transfer to use a single connection in multiple files * Add with_conn wrapper * Fix delete_directory * Delete wrapper and update get_conn * Add test code
This PR improves SFTP hook's
store_directoryandretrieve_directoryfunctions to use a single connection when transferring a directory with multiple files.Previously, these functions relied on
store_fileandretrieve_filefunctions. And with this PR, thestore_fileandretrieve_filefunctions were modified to open and close sftp connection each time.This leads to the
store_directoryandretrieve_directoryfunctions to open and close too many connections repeatedly when there are many files in a directory, which causes significant overhead.To address this, I modified them to open a connection, transfer all files in a directory, and then close the connection afterward.
I also did a performance test in my local environment by transferring a directory containing 1,000 small files. This reduced the transfer time by approximately 8-9 seconds. The results are shown below.