Stale SSH connections created when working with saga.filesystem.File objects #519

Open
jcohen02 opened this issue Mar 16, 2016 · 5 comments

@jcohen02
Contributor

I'm experiencing the creation of stale SSH connections when attempting file transfers via the saga-python API which ultimately result in my application either blocking when trying to make an SSH connection or throwing a "RuntimeError: LeaseObject is already leased:" exception. I'm unsure whether this is a bug or something I'm doing wrong in the way I'm using the API.

My application is using gevent and I originally thought the LeaseObject exception may be the result of some attempt at concurrent access to a shared session object, however I've now got a simple single-threaded standalone test case based on file copying that demonstrates the problem.

I wonder if someone can see if this issue can be reproduced or whether this is something specific to the platforms that I'm using. I'm using a Mac OS X client running Python 2.7.10 and using saga-python 0.40.1. I can reproduce the problem copying files from localhost or from a remote Linux server running Ubuntu 14.04.

The test case is as follows:

  1. Create a group of files containing random data:

mkdir /tmp/test_files

mkdir /tmp/test_files_copy

for i in $(seq 1 50); do dd if=/dev/urandom of=/tmp/test_files/test_file_${i}.dat bs=512k count=1; done

  2. First test copying of data with valid filenames:
from saga import Session, Context, DoesNotExist
from saga.filesystem import File

ctx = Context('ssh')
ctx.user_id = 'myuser'
ctx.user_key = '/full/path/to/key/file'
s = Session(default=False)
s.add_context(ctx)

# Create a list of the source files:
source_files = []

for i in range(1,51):
    source_files.append('test_file_%s.dat' % (i))

target_dir = 'file://localhost/tmp/test_files_copy/'
source_dir = 'sftp://localhost/tmp/test_files/'

# Now try and copy some files...
for filename in source_files:
    source_file_url = source_dir + filename
    print ('About to copy <%s> to <%s>' 
           % (source_file_url, target_dir))

    try:
        file_obj = File(source_file_url, session=s)
        file_obj.copy(target_dir)
        print ('Copied %s to %s (size: %s bytes)\n' 
               % (source_file_url, target_dir, file_obj.get_size())) 
        file_obj.close()
    except DoesNotExist as e:
        print('File <%s> does not exist!' % (source_file_url))

print('File copy complete...')

This copy works correctly and I observe a consistent group of 3 ssh and 1 sftp processes created by the script.
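One way to watch these process counts while the script runs is a small helper around ps. This is only a sketch: the function names are made up for illustration, and it assumes a ps that accepts `-A -o comm=` (the stock ps on both OS X and Linux should).

```python
import subprocess
from collections import Counter


def count_ssh_processes(ps_output):
    """Count processes named exactly 'ssh' or 'sftp' in `ps -A -o comm=` output."""
    counts = Counter()
    for line in ps_output.splitlines():
        # OS X prints the full executable path, Linux only the command name.
        name = line.strip().rsplit('/', 1)[-1]
        if name in ('ssh', 'sftp'):
            counts[name] += 1
    return counts


def snapshot():
    """Run ps once and return the current ssh/sftp process counts."""
    out = subprocess.check_output(['ps', '-A', '-o', 'comm='])
    return count_ssh_processes(out.decode('utf-8', 'replace'))
```

Calling `snapshot()` before and after each copy attempt shows whether the counts stay flat or keep growing. Note that it counts every ssh/sftp process on the machine, not only those started by the script.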

  3. Now retry the copy using invalid filenames - in my use case, it is sometimes the case that a file copy will be attempted using a filename that does not exist:

Replace the for loop in lines 13-14 above with:

for i in range(51,101):
    source_files.append('test_file_%s.dat' % (i))

Now run the script again with the revised loop. When the line file_obj = File(source_file_url, session=s) is run, a new SSH process is created but an exception is raised, so execution jumps to the line except DoesNotExist as e:. At this point, file_obj does not exist so close() cannot be called, but the SSH connection remains. I think this is now a stale SSH connection?

After attempting around 10 file copies, further SSH connections cannot be made and the code hangs.
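To make the suspected failure mode concrete, here is a saga-free model of it. Everything in it (Connection, LeakyFile, SafeFile, leaked_after) is hypothetical and is not saga-python code; it only shows how a constructor that acquires a connection before raising leaves the caller with nothing to close, and one possible shape of a constructor that cleans up after itself.

```python
class Connection(object):
    """Stand-in for an SSH connection; open_count tracks unclosed ones."""
    open_count = 0

    def __init__(self):
        Connection.open_count += 1

    def close(self):
        Connection.open_count -= 1


class LeakyFile(object):
    """Hypothetical handle that connects first and validates afterwards."""

    def __init__(self, exists):
        self.conn = Connection()
        if not exists:
            raise IOError('does not exist')  # self.conn is never closed

    def close(self):
        self.conn.close()


class SafeFile(LeakyFile):
    """Same handle, but it releases the connection if validation fails."""

    def __init__(self, exists):
        self.conn = Connection()
        try:
            if not exists:
                raise IOError('does not exist')
        except Exception:
            self.conn.close()
            raise


def leaked_after(handle_cls, attempts):
    """Open `attempts` missing files; return how many connections stay open."""
    Connection.open_count = 0
    for _ in range(attempts):
        try:
            f = handle_cls(exists=False)
            f.close()
        except IOError:
            pass  # f was never bound, so there is nothing to close here
    return Connection.open_count
```

With this model, `leaked_after(LeakyFile, 10)` leaves ten connections open while `leaked_after(SafeFile, 10)` leaves none, which matches the pattern of growth I see with the invalid filenames.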

  4. Now trying a scenario similar to my own code, where some copy tasks succeed and some fail:

Replace the for loop in lines 13-14 with:

for i in range(51,101):
    source_files.append('test_file_%s.dat' % (i))
    source_files.append('test_file_%s.dat' % (int(i)+50))
    source_files.append('test_file_%s.dat' % (int(i)-50))

Again, I see a rapidly growing number of SSH connections when running the script but with this example, the script always fails after some number of file copy tasks with:

raise RuntimeError ("LeaseObject is already leased: %s" % self)
RuntimeError: LeaseObject is already leased: <radical.utils.lease_manager._LeaseObject object at 0x104a38d90>

I'm using saga-python in a service environment and caching session/service objects, so the stale connections build up over time and I eventually hit one of the above errors.

Any help or suggestions you can give in resolving or working around these issues would be much appreciated.

Thanks, Jeremy

@jcohen02
Contributor Author

Just to update the above issue, we have a workaround for this problem by first creating a saga.filesystem.Directory object and then calling is_file('filename') on this object to see if the target file that we wish to copy exists on the remote platform.
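A minimal sketch of this workaround follows. The helper name copy_if_present and its parameters are made up for illustration; `directory` is expected to be a saga.filesystem.Directory and `open_file` a callable that builds a File for a URL, but the helper only relies on the is_file, copy and close calls already used in this ticket, so it can be tried with stand-in objects too.

```python
def copy_if_present(directory, open_file, base_url, filenames, target_url):
    """Copy only the names that directory.is_file() confirms.

    Returns a (copied, missing) pair of filename lists.
    """
    copied, missing = [], []
    for name in filenames:
        # Check existence first so no File is constructed for a missing file.
        if not directory.is_file(name):
            missing.append(name)
            continue
        file_obj = open_file(base_url + name)
        try:
            file_obj.copy(target_url)
            copied.append(name)
        finally:
            file_obj.close()
    return copied, missing


# Intended use with saga-python (not exercised here):
#   remote_dir = Directory(source_dir, session=s)
#   copy_if_present(remote_dir, lambda url: File(url, session=s),
#                   source_dir, source_files, target_dir)
```

The check and the copy are separate steps, so a file that disappears in between would still trigger the original problem; this reduces the number of failing File constructions rather than eliminating them.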

However, I also note that if the file exists and I then try to call open('filename') on the Directory object, this fails. Whereas other functions such as is_file, is_dir and make_dir on the Directory object seem to accept a relative path, open appears not to operate within the context of the Directory object, e.g. given /tmp on a remote node my.remote.host containing a file test.txt:

# The following two statements succeed
remote_dir = Directory('sftp://my.remote.host/tmp/', session=s)
remote_dir.is_file('test.txt') # returns True

# This fails
remote_dir.open('test.txt')

Looking at the internal state where saga-python is requesting a lease for the SSH connection when remote_dir.open('test.txt') is called, it's passing the URL file://localhost/test.txt.

If the full remote URL of the file is passed to the open call then the file can be opened successfully but it would be good if this could be done using a relative path.
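Until relative paths work, the full URL can be built from the directory URL by hand. The helper below is a hypothetical convenience, not part of saga-python, and it assumes plain string URLs like the ones used above.

```python
def full_url(directory_url, relative_path):
    """Join a directory URL and a relative path with exactly one slash."""
    return directory_url.rstrip('/') + '/' + relative_path.lstrip('/')


# Intended use (not exercised here):
#   remote_dir.open(full_url('sftp://my.remote.host/tmp/', 'test.txt'))
```

It does no escaping or normalisation, so it only suits simple filenames such as the ones in this test case.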

@andre-merzky
Member

Hey Jeremy,

thanks for the ticket! It looks like we don't release shells correctly on the failing ops, or something.

The second problem you reported (rel path interpreted as absolute) deserves a second ticket, as it seems quite unrelated. I'll open one if you don't mind.

@vivek-bala
Contributor

Hey Andre, there doesn't seem to be a commit or PR associated with this ticket. Is it fair to assume the bug with incorrect shell release still exists?

@andre-merzky
Member

Yes, I also assume that this is not fixed. The lease manager has seen some updates wrt. garbage collection, but I doubt that this case is covered. We may want to confirm though.

@vivek-bala
Contributor

I'll leave this open in that case. I think Jeremy has already provided info to reproduce the issue and a workaround.

@andre-merzky andre-merzky removed their assignment Apr 3, 2019