Incomplete downloads #424
Queries: Is it returning an …
Yes, …
Hm… this is proving pretty hard to pin down; it's not obvious what might be going wrong. Clearly some number of trailing bytes are not being committed, and that can be seen plainly because the two mistaken file sizes in the examples fall exactly on 32768-byte boundaries…
Sounds right; other examples like 1343488, 1638400 and 786432 also fit that pattern.
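As a quick arithmetic check (a standalone sketch, not code from the thread), every truncated size reported in this issue divides evenly into 32768-byte (32 KiB) packets:

```go
package main

import "fmt"

func main() {
	// Sizes at which downloads have been reported to stop in this issue.
	sizes := []int64{65536, 98304, 786432, 1343488, 1638400, 2031616}
	for _, n := range sizes {
		fmt.Printf("%7d bytes = %2d * 32768 (remainder %d)\n", n, n/32768, n%32768)
	}
}
```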
I'm thinking we might want to try a cherry-pick that covers the unexpected channel close case. I'm entirely unsure how that could ever trigger, but it might be the problem, as that is the only other way this could happen. It would also explain the observed behavior: some number of 32 KiB writes succeed, then we receive on a closed channel and stop all transfers.
@puellanivis thank you. So if I want to try to replicate this issue, it should happen only for unexpected transfer errors, right?
@drakkan That's the thing: this really shouldn't happen for unexpected transfer errors either. Those should all percolate some non-EOF error up the line as well? 🤔
I tried to replicate this issue with no luck using something like this:
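A representative sketch of such a reproducer (the address, credentials, and paths below are placeholders, not values from this issue):

```go
package main

import (
	"io"
	"log"
	"os"

	"github.com/pkg/sftp"
	"golang.org/x/crypto/ssh"
)

func main() {
	// Placeholder connection details.
	conn, err := ssh.Dial("tcp", "sftp.example.com:22", &ssh.ClientConfig{
		User:            "user",
		Auth:            []ssh.AuthMethod{ssh.Password("password")},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
	})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client, err := sftp.NewClient(conn)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	src, err := client.Open("/remote/testfile")
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	dst, err := os.Create("testfile")
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	// Download, then compare the transferred size against what the server reports.
	n, err := io.Copy(dst, src)
	if err != nil {
		log.Fatal(err)
	}
	fi, err := src.Stat()
	if err != nil {
		log.Fatal(err)
	}
	if n != fi.Size() {
		log.Fatalf("incomplete download: copied %d bytes, Stat reports %d", n, fi.Size())
	}
	log.Printf("downloaded %d bytes OK", n)
}
```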
@mrwonko, can you please confirm that #425 fixes this issue in your case? @puellanivis I tried this over a VPN connection; to trigger the error I disconnected the VPN, and in that case more than 15 minutes are needed before the connection-lost error is triggered.
What do you think? Should we add a context or something similar? Thank you
I will give #425 a try next week and let you know if it helps.
I deployed #425 on our preproduction system a couple of hours ago, and there has already been the first partial download: 98304 bytes instead of 240430. I'll keep it there until tomorrow to see if it happens again, but then I'll probably roll back; it doesn't seem to work.
There have been about a dozen failed downloads at this point; #425 doesn't seem to fix our issue.
@mrwonko can you please provide a reproducer? I tried the above code with no luck. I'm not very familiar with client-side code, but if I have an easy way to reproduce the issue I'll try to fix it, thank you
The only thing I can think of now is the possibility that there are writes and reads happening to the remote file at the same time. So, when the read tries to read a value at offset XYZ, it's met with an …
Going through with a fine-toothed comb again, I'm seeing numerous small issues here and there. I'll be making a PR to address them, and ensure that … Since proper EOFs in the SFTP library should only ever come via an …
@puellanivis thanks for your patch
I'll give that new patch a try. I've not yet been able to reproduce this locally due to unrelated issues, so instead I'm testing using our normal downloaders on our preproduction cluster, which reliably reproduces it. That said, the snippet you posted should theoretically be sufficient for a reproduction.
I still get partial downloads with #429.
🙃 wth?
What are the architectures and programs running on both sides of this? Like, clearly the client in this situation is from our library, but what kind of server is it connecting to? Is that server running off of our package as well?
I know this might be a long shot, but do you still get incomplete downloads with #435?
Our client runs in an x86_64 docker container. All I know about the server is that it runs on Windows, and I'm pretty sure it doesn't use this library. Since we don't run in 32-bit, I see no point in trying #435?
So, while the problem was manifest in 32-bit, there were still other possible edge conditions that could have produced the same behavior. I'm honestly not sure; I'm grasping at straws here.
As far as I understand, the server is running JSCAPE and accesses the files it serves via some network file system.
@carsten-luckmann @mrwonko we tried to reproduce the issue with no luck; you have two choices:
…
Thank you!
I might be seeing this problem too. Using …
There are no errors (so I assume …
😱 I think I might see what's going on with your situation @j0hnsmith. While using concurrent reads, if we request the end-of-file read, the end server might trigger its "download is complete" functionality, delete the file from its side, and maybe even … This would mean any later read of the file at any offset would end up with an EOF, since it is now beyond the length of the file, causing a premature and silent EOF at some odd 32k boundary less than the length of the whole file. I had considered needing to have the writes well-ordered, but I had not anticipated a need to ensure monotonically increasing read offsets in …
🤔 I'm thinking about whether I should work on a patch to get well-ordered, monotonically increasing reads into … That other PR doesn't really touch any of the …
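For anyone hitting a server that behaves this way, one hedged workaround sketch (assuming the UseConcurrentReads client option available in v1.13 and later) is to disable concurrent reads when constructing the client, so read requests are issued strictly in order:

```go
import (
	"github.com/pkg/sftp"
	"golang.org/x/crypto/ssh"
)

// newSequentialClient is a hypothetical helper: it constructs an SFTP client
// with concurrent reads disabled, so read requests are issued in ascending
// offset order rather than in parallel.
func newSequentialClient(conn *ssh.Client) (*sftp.Client, error) {
	return sftp.NewClient(conn, sftp.UseConcurrentReads(false))
}
```

This trades some download throughput for request ordering, which may matter for servers that treat a read at the end of the file as "transfer finished".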
#436: can people try this PR and see if it fixes the issue?
hopefully, fixed in v1.13.1 |
This seems to still be the case in v1.13.6.

fp, err := sftpClient.Open("somepath")
...
bs, err := io.ReadAll(fp)
// vs
var buf bytes.Buffer
_, err = io.Copy(&buf, fp)
Hm… trippy. So, … We'll probably want to do some sort of check that …
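For context, the two calls above can take very different code paths in the standard library: io.Copy first checks whether the source implements io.WriterTo and, if so, delegates the whole transfer to WriteTo (*sftp.File provides WriteTo), whereas io.ReadAll only ever calls Read in a loop and stops at the first io.EOF it sees. A small sketch of that dispatch (a standalone illustration, not code from this package):

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// copyPath reports whether io.Copy would hand the transfer to the source's
// WriteTo method, which is what happens for *sftp.File.
func copyPath(src io.Reader) string {
	if _, ok := src.(io.WriterTo); ok {
		return "WriteTo fast path (used by io.Copy)"
	}
	// io.ReadAll never makes this check; it always loops over Read until io.EOF.
	return "plain Read loop (the only path io.ReadAll uses)"
}

func main() {
	fmt.Println(copyPath(strings.NewReader("x")))                      // WriteTo fast path
	fmt.Println(copyPath(struct{ io.Reader }{strings.NewReader("x")})) // plain Read loop
}
```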
Actually reopening this issue. |
I tried to break on … The EOF "error" is generated by …
If the …
Ok, we just got hit by this issue when transitioning from a stock OpenSSH SFTP server to one from GoAnywhere. We have no issues with the stock server, but we do have issues that manifest exactly as described here when using the GoAnywhere server. We are going to try to set up something to trace the SFTP communication so that we can compare and contrast and maybe help solve the problem.
I also have the same issue that @Coffeeri was describing. Specifically, my use case includes reading the file twice:
1. calculating an md5sum of the file, and
2. uploading the file to S3.
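A sketch of that read-twice pattern, assuming the handle is rewound with File.Seek between passes (the helper name and destination writer are illustrative, not from the original code):

```go
import (
	"crypto/md5"
	"encoding/hex"
	"io"

	"github.com/pkg/sftp"
)

// hashThenStream is an illustrative helper: it reads the remote file once to
// compute its md5sum, rewinds, then reads it again to stream the contents to
// dst (for example, an S3 uploader's writer).
func hashThenStream(f *sftp.File, dst io.Writer) (string, error) {
	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	// Rewind before the second pass; otherwise the next read starts at EOF.
	if _, err := f.Seek(0, io.SeekStart); err != nil {
		return "", err
	}
	if _, err := io.Copy(dst, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}
```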
What version are you using? Your use case should work otherwise. Also, you can use …
Hi @puellanivis,
Yes, sorry, I meant to say that I was doing two separate reads because the file doesn't fit into memory, and that the first read (the md5sum calculation) works perfectly every time, while the second read (the S3 upload) returns different contents each time until it eventually succeeds after several retries. I guess I'm currently on version 1.3.1. Haven't tried the latest version. Do you see something that might explain it for the older version? Thanks.
Hi varunbpatil, we've fixed a few things in later versions that could possibly be the cause here. I would recommend trying the newest version (there isn't any reason not to update; it's safe) and seeing if the issue persists.
Adding in here that we ran into this, bumped to version v1.13.7, but had the same issue. It silently truncates the file at the 1638400-byte boundary. Ultimately, we had to change our ReadAll() calls to Copy(), and that solved the issue.
Since upgrading to 1.13, the library will often report EOF before a file opened with Client.Open() is downloaded completely. For example, a 634618-byte file has stopped downloading after just 65536 bytes, and a 2367599-byte file after 2031616. As a workaround I've added a call to File.Stat().Size(), which returns the correct file size, and compare that with the downloaded size to check for success.
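A minimal sketch of that size-verification workaround as a reusable helper (the function name is illustrative, not part of the library):

```go
import (
	"fmt"
	"io"

	"github.com/pkg/sftp"
)

// verifiedDownload is an illustrative helper: it copies the remote file to dst
// and fails if the number of bytes copied differs from the size reported by Stat.
func verifiedDownload(f *sftp.File, dst io.Writer) error {
	n, err := io.Copy(dst, f)
	if err != nil {
		return err
	}
	fi, err := f.Stat()
	if err != nil {
		return err
	}
	if n != fi.Size() {
		return fmt.Errorf("incomplete download: copied %d bytes, expected %d", n, fi.Size())
	}
	return nil
}
```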