Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Responses from FTP client are not properly read #3

Open
martin-ueding opened this issue Aug 10, 2018 · 5 comments
Open

Responses from FTP client are not properly read #3

martin-ueding opened this issue Aug 10, 2018 · 5 comments

Comments

@martin-ueding
Copy link
Contributor

@pittlerf has notified me that the files are not transmitted correctly using the put command. I now have reproduced it myself.

I have a file sWC_A2p1_Mpi270_L32T96-forward-logs-1161-1-2626.tar.gz which is 291M of size (304938506). This got transferred to the archive with the this command:

hpss-wrapper/start-hpss-agent put sWC_A2p1_Mpi270_L32T96-forward-logs-1161-1-2626.tar.gz

Looking at the dir output within an interactive FTP session shows a size of 33554432 where we would want 304938506. That only 11% of the original file. A test with a smaller file (size 7850) shows that the numbers from ls -l on the normal file system and dir in the FTP session are of the same units, namely bytes.

The transfer of the large file was done almost instantly. I have checked a few minutes later and the file size has not been updated in the archive. Retrieving the file with get also gives a file which is much smaller than the original file. Therefore I conclude that the transfers get aborted at some point.

The exact details of the breaking are very likely out of our control and have to be treated by the HLRS support. For our sanity it is important that the file sizes are verified after storing things into the archive. Similarly we should verify the file size after retrieving a file from the archive.

I will work on the wrapper script to make those additional checks.

@martin-ueding
Copy link
Contributor Author

It turns out that I had implemented this feature a while ago and my checkout on Hazel Hen was not up to date. The problem is that this check fails because the output of the ls -l is not properly captured.

While I was at it, I introduced the Python logging module and a -v flag that can be given twice for extra debug output. Instead of printing the output of the program directly, I now instead print the string representation. This will also show any characters that are deleted with command characters like \r.

This is the output that I have when trying to copy a large file:

$ hpss-wrapper/start-hpss-agent -vv put sWC_A2p1_Mpi270_L32T96-forward-logs-2067-1-2652.tar.gz
DEBUG:__main__:Before: 'Using /opt/hlrs/hpss/config/HPSS.conf\r\nParallel block size set to 1048576.\r\nConnected to 193.196.155.141.\r\n220-#    HLRS HPSS 7.5.1 Parallel FTP Daemon on hpsscore.hww.de\r\n#\t\tcoming from eslogin008.hww.hlrs.de\r\n#\r\n\r\n220 hpsscore.hww.de FTP server (HPSS 7.5.1 Parallel FTP Daemon) ready.\r\nParallel stripe width set to (4).\r\n\r\x1b[K'
DEBUG:__main__:After: 'Name (hpsscore:xskmuedi): '
DEBUG:__main__:Before: '\r\x1b[KName (hpsscore:xskmuedi): x\r\x1b[KName (hpsscore:xskmuedi): xs\r\x1b[KName (hpsscore:xskmuedi): xsk\r\x1b[KName (hpsscore:xskmuedi): xskm\r\x1b[KName (hpsscore:xskmuedi): xskmu\r\x1b[KName (hpsscore:xskmuedi): xskmue\r\x1b[KName (hpsscore:xskmuedi): xskmued\r\x1b[KName (hpsscore:xskmuedi): xskmuedi\r\n331 Password required for xskmuedi.\r\n\r\x1b[K'
DEBUG:__main__:After: 'Password:'
DEBUG:__main__:Before: '\r\x1b[KPassword:*\r\x1b[KPassword:**\r\x1b[KPassword:***\r\x1b[KPassword:****\r\x1b[KPassword:*****\r\x1b[KPassword:******\r\x1b[KPassword:*******\r\x1b[KPassword:********\r\x1b[KPassword:*********\r\x1b[KPassword:**********\r\x1b[KPassword:***********\r\x1b[KPassword:************\r\x1b[KPassword:*************\r\x1b[KPassword:**************\r\x1b[KPassword:***************\r\x1b[KPassword:****************\r\x1b[KPassword:*****************\r\x1b[KPassword:******************\r\x1b[KPassword:*******************\r\x1b[KPassword:********************\r\n230 User xskmuedi logged in as xskmuedi@HPSSCORE.HWW.DE\r\nRemote system type is UNIX.\r\n215 Parallel protocol is PDATA_AND_MOVER\r\n215 SAN3P support disabled\r\n\r\x1b[K'
DEBUG:__main__:After: 'ftp> '
Putting sWC_A2p1_Mpi270_L32T96-forward-logs-2067-1-2652.tar.gz
INFO:__main__:Using extra timeout 3600
DEBUG:__main__:Before: '\r\x1b[K'
DEBUG:__main__:After: 'ftp> '
DEBUG:__main__:Before: 'd\r\x1b[K'
DEBUG:__main__:After: 'ftp> '
DEBUG:__main__:Assertion of correct transfer. Lines read are the following:
DEBUG:__main__:  'di\r\x1b[K'
Traceback (most recent call last):
  File "hpss-wrapper/hpss-agent", line 226, in <module>
    _parse_args()
  File "hpss-wrapper/hpss-agent", line 222, in _parse_args
    options.func()
  File "hpss-wrapper/hpss-agent", line 62, in main_put
    assert_correct_transfer(options.source, lsl)
  File "hpss-wrapper/hpss-agent", line 70, in assert_correct_transfer
    assert lines[0] == 'ls -l'
AssertionError

The assertion fails because there is no output of ls -l. When one is in an interactive session, the command produces the desired output:

ftp> ls -l
200 PORT command successful.
150 Opening ASCII mode data connection for file list.
-rw-r-----   1 xskmuedi hsk44113          1018240 Nov 20 2017  Thesis.pdf
-rw-r-----   1 xskmuedi hsk44113                2 Nov 20 2017  d-1.txt
-rw-r-----   1 xskmuedi hsk44113                2 Nov 20 2017  dd-1.txt
-rw-r-----   1 xskmuedi hsk44113             4292 Aug 10 14:30 hpss-agent
-rw-r-----   1 xskmuedi hsk44113               90 Nov 21 2017  openmp.cpp
-rw-r-----   1 xskmuedi hsk44113             7850 Aug 10 14:38 random.bin
-rw-r-----   1 xskmuedi hsk44113         33554432 Aug 10 15:42 sWC_A2p1_Mpi270_L32T96-forward-logs-1161-1-2626.tar.gz
-rw-r-----   1 xskmuedi hsk44113         67108864 Aug 10 16:33 sWC_A2p1_Mpi270_L32T96-forward-logs-2067-1-2652.tar.gz
-rw-r-----   1 xskmuedi hsk44113              437 Dec  4 2017  tags
drwxr-x---   2 xskmuedi hsk44113              512 Nov 20 2017  test-dir
226 Transfer complete.
815 bytes received in 0.0154 seconds (0.050 MBytes/sec)

Since this broke without us changing something, I presume that somehow the FTP server or clients now does something slightly differently. I will have to figure out how that is. Perhaps it is because the connection broke? In the interactive session I can perform put and it will happily transfer the complete file, taking a few seconds. A ls -l also just works.

It seems that the wrapping of the FTP client does not properly work any more.

@martin-ueding martin-ueding changed the title File sizes need to be verified Responses from FTP client are not properly read Aug 10, 2018
@martin-ueding
Copy link
Contributor Author

The HLRS wiki for HPSS states that one can either use the pftp_client or use the regular FTP client and set the class of service for it. Initially I decided to wrap the pftp_client as it promises most performance. With this regression here I have revised that choice and have written the hpss-wrapper using the Python 3 ftplib which nicely wraps an FTP connection. By setting the class of service I would hope to achieve the same performance as before. But performance without stability and reliability is not worth anything to start with.

For the users, nothing has changed except that the interactive mode is no longer offered. @pittlerf, could you please deploy the latest version of this to your Hazel Hen account and try whether it works as expected? It might have some minor mistakes in the path handling, so please be a bit careful.

File sizes are gathered by API methods now, so the parsing of ls -l has been exchanged for a directory listing function.

These changes also remove any guards for filenames with spaces that were in there before, filenames with spaces will now break the thing. But since we are used to brittle software, we haven't used those in the first place.

I hope that this issue is then resolved and the performance is still acceptable if not unchanged. @pittlerf, please feel free to close this issue if it is resolved now.

@pittlerf
Copy link

pittlerf commented Aug 10, 2018

Hi Martin.

Thanks for the update. I have actually a small problem. When I try to setup I got the following error message:

Traceback (most recent call last):
  File "./hpss-agent", line 159, in <module>
    _parse_args()
  File "./hpss-agent", line 155, in _parse_args
    options.func()
  File "./hpss-agent", line 47, in main_setup
    username = raw_input('Your HPSS username: ')
NameError: name 'raw_input' is not defined

I actually tried to circumvent this by setting the old location of the pass file in the new code. Then by the copying I get the following error:

xskfpitt@eslogin007:/lustre/cray/ws8/ws/xskfpitt-ferenc_piN2piN/cA2.09.64.b/propagators> /zhome/academic/HLRS/hsk/xskfpitt/hpss-wrapper-master/start-hpss-agent --timeout 1000000 put piN2piNcA20964b_268_aff.tar.gz .
Traceback (most recent call last):
  File "/zhome/academic/HLRS/hsk/xskfpitt/hpss-wrapper-master/hpss-agent", line 99, in main_put
    ftp.storbinary('STOR {}'.format(source_base), f)
  File "/opt/python/3.6.1.1/lib/python3.6/ftplib.py", line 507, in storbinary
    conn.sendall(buf)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/zhome/academic/HLRS/hsk/xskfpitt/hpss-wrapper-master/hpss-agent", line 159, in <module>
    _parse_args()
  File "/zhome/academic/HLRS/hsk/xskfpitt/hpss-wrapper-master/hpss-agent", line 155, in _parse_args
    options.func()
  File "/zhome/academic/HLRS/hsk/xskfpitt/hpss-wrapper-master/hpss-agent", line 101, in main_put
    assert_correct_transfer(ftp, source_base)
  File "/opt/python/3.6.1.1/lib/python3.6/ftplib.py", line 128, in __exit__
    self.quit()
  File "/opt/python/3.6.1.1/lib/python3.6/ftplib.py", line 663, in quit
    resp = self.voidcmd('QUIT')
  File "/opt/python/3.6.1.1/lib/python3.6/ftplib.py", line 276, in voidcmd
    return self.voidresp()
  File "/opt/python/3.6.1.1/lib/python3.6/ftplib.py", line 249, in voidresp
    resp = self.getresp()
  File "/opt/python/3.6.1.1/lib/python3.6/ftplib.py", line 244, in getresp
    raise error_perm(resp)
ftplib.error_perm: 529 Bad Data Transfer.(error = -27, moved = 8606711808).

Could you help me to fix this?
Feri

@martin-ueding
Copy link
Contributor Author

The first issue is fixed now. In Python 3 there is input and not raw_input any more.

The second is harder. In the beginning of the traceback it says ConnectionResetError: [Errno 104] Connection reset by peer. If then tries to close the connection and fails to do so because the connection has already broken off.

The file in question, piN2piNcA20964b_268_aff.tar.gz has a size of 660 GB. That should be just fine for the archive system as at least the Jülich Supercomputing Center recommends file sizes of 1 TB in the archive. I was able to reproduce the crash using the hpss-agent.

This program here seems to be robust enough to catch such errors. I have started the same transfer in an interactive ftp session and it seems to work.

I noticed that I did not set the “class of service”, and now my script does this. With the correct class of service set, it is currently transferring the file and has not crashed. From the interactive session I see that it will take another 50 minutes for it to be transferred.

The changes are already pushed, so you can pull down the latest version and try it for yourself. We will have to see how the performance is. Using the interactive ftp with correct class of service, it seems to have around 200 M/s of speed, I presume it is MB/s and not MBit/s (= Mb/s)

@martin-ueding
Copy link
Contributor Author

The 660 GB file has been successfully transferred to the archive, both with an interactive ftp session as well as with the latest version of hpss-agent. I believe that this problem is now resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants