Incomplete downloads #424

Open

mrwonko opened this issue Mar 29, 2021 · 40 comments

Comments

@mrwonko

mrwonko commented Mar 29, 2021

Since upgrading to 1.13, the library will often report EOF before a file opened with Client.Open() has been downloaded completely. For example, a 634618-byte file stopped downloading after just 65536 bytes, and a 2367599-byte file after 2031616 bytes.

As a workaround I've added a call to File.Stat().Size(), which returns the correct file size, and I compare that with the downloaded size to check for success.
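
Roughly, the check looks like this (a minimal sketch; downloadChecked is an illustrative name, and it assumes the usual fmt/io imports plus an already-connected *sftp.Client):

    // downloadChecked downloads remotePath into dst and fails if fewer bytes
    // than Stat reports were copied.
    func downloadChecked(client *sftp.Client, remotePath string, dst io.Writer) error {
        f, err := client.Open(remotePath)
        if err != nil {
            return err
        }
        defer f.Close()

        info, err := f.Stat()
        if err != nil {
            return err
        }

        n, err := io.Copy(dst, f)
        if err != nil {
            return err
        }
        if n != info.Size() {
            return fmt.Errorf("incomplete download: got %d of %d bytes", n, info.Size())
        }
        return nil
    }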

@puellanivis
Collaborator

Queries: Is it returning an io.EOF error, or a nil error? How are you transferring the file? Client.Open() then an io.Copy()?

@mrwonko
Author

mrwonko commented Mar 29, 2021

Yes, Client.Open() followed by io.Copy(), which returns a nil error, implying that the reader itself returned io.EOF.

@puellanivis
Collaborator

Hm… it’s proving pretty hard to pin down what might be going wrong with WriteTo. I see one path that could result in err == nil without an io.EOF, but nothing closes any of those channels anyway, so it seems unlikely to be that, and I don’t really see how we could be getting an io.EOF from anything other than an SSH_FXP_STATUS packet.

Clearly some number of following bytes are not being committed; this can be clearly seen because the two mistaken file size examples are at 32768 byte boundaries…

@mrwonko
Author

mrwonko commented Mar 29, 2021

the two mistaken file size examples are at 32768 byte boundaries

Sounds right; other examples like 1343488, 1638400 and 786432 also fit that pattern.
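
For reference, a quick check that all of these sizes are exact multiples of the 32768-byte chunk size (illustrative snippet, not from our downloader; assumes fmt is imported):

    // Truncation points reported so far; each divides evenly by 32768.
    for _, n := range []int64{65536, 2031616, 1343488, 1638400, 786432} {
        fmt.Printf("%d = %d * 32768 (remainder %d)\n", n, n/32768, n%32768)
    }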

@puellanivis
Collaborator

I’m thinking we might want to try a cherry-pick that covers the unexpected-channel-close case. I’m entirely unsure how that could ever trigger, but it might be the problem, as it is the only other way we can get a nil error without an EOF, and it very clearly shouldn’t be a nil error… it was just left like that as a hold-over from an earlier draft.

This could also explain the observed behavior: some number of 32 KiB writes work, then we receive on a closed channel and stop all transfers.
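
To illustrate the suspected path (a hypothetical sketch, not the actual pkg/sftp internals):

    // A receive from a closed channel yields the zero value and ok == false;
    // mapping that case to a nil error looks like a clean, but premature, EOF.
    func receiveChunk(results <-chan []byte) ([]byte, error) {
        chunk, ok := <-results
        if !ok {
            // Unexpected channel close: this should surface as an error,
            // not be reported as a successful end of the transfer.
            return nil, io.ErrUnexpectedEOF
        }
        return chunk, nil
    }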

@drakkan
Collaborator

drakkan commented Mar 30, 2021

@puellanivis thank you. So if I want to try to replicate this issue, it should happen only for unexpected transfer errors, right?

@puellanivis
Collaborator

@drakkan That’s the thing, this really shouldn’t happen for unexpected transfer errors either. Those should all percolate some non-EOF error up the line as well? 🤔

@drakkan
Collaborator

drakkan commented Apr 2, 2021

I tried to replicate this issue with no luck, using something like this:

    for i := 0; i < 1000; i++ {
        start := time.Now()
        f, err := sftpClient.Open("file.zip")
        if err != nil {
            panic(err)
        }
        dst, err := os.Create("/tmp/file.zip")
        if err != nil {
            panic(err)
        }
        n, err := io.Copy(dst, f)
        fmt.Printf("it %v written %v, err %v elapsed %v\n", i, n, err, time.Since(start))
        if n != 61449068 {
            fmt.Printf("errrrrrrrrrrrrrrrrrrrrrr\n\n\n\n")
        }
        f.Close()
        dst.Close()
    }

@mrwonko, can you please confirm that #425 fixes this issue in your case?

@puellanivis I tried this over a VPN connection. To trigger the error I disconnected the VPN; in this case more than 15 minutes are needed before the connection-lost error is triggered:

it 0 written 2654208, err connection lost elapsed 17m10.208717111s

What do you think? Should we add a context or something similar? Thank you.

@mrwonko
Author

mrwonko commented Apr 7, 2021

I will give #425 a try next week and let you know if it helps.

@mrwonko
Author

mrwonko commented Apr 12, 2021

I deployed #425 on our preproduction system a couple of hours ago, and there has already been the first partial download: 98304 bytes instead of 240430. I'll keep it there until tomorrow to see if it happens again, but then I'll probably roll back; it doesn't seem to work.

@mrwonko
Author

mrwonko commented Apr 13, 2021

There have been about a dozen failed downloads at this point; #425 doesn't seem to fix our issue.

@drakkan
Collaborator

drakkan commented Apr 13, 2021

@mrwonko can you please provide a reproducer? I tried the above code with no luck.

I'm not very familiar with the client-side code, but if I have an easy way to reproduce the issue I'll try to fix it. Thank you!

@puellanivis
Collaborator

The only thing I can think of now is the possibility that there are writes and reads happening to the remote file at the same time. So, when the read tries to read a value at offset XYZ, it’s met with an io.EOF and so it assumes the read is complete. Since the client can really only receive either data or an io.EOF, that means that setup will only ever write a multiple of 32k. 🤔 This still seems so weird though.

@puellanivis
Collaborator

Going through with a fine-toothed comb again, I’m seeing numerous small issues here and there. I’ll be making a PR to address them, ensure that sendPacket never returns an io.EOF error, and make readChunkAt refuse to recognize an err == io.EOF from sendPacket as well.

Proper EOFs in the SFTP library should only ever come via an SSH_FX_EOF status packet, and never raw from sendPacket.
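
Roughly the intent (a hedged sketch with illustrative names, not the actual sendPacket/readChunkAt code):

    // An io.EOF bubbling up from the transport means the connection ended
    // mid-request; it is not a file-level EOF and must not be passed on as one.
    func requestChunk(send func() ([]byte, error)) ([]byte, error) {
        data, err := send()
        if err == io.EOF {
            return data, io.ErrUnexpectedEOF
        }
        return data, err
    }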

@drakkan
Collaborator

drakkan commented Apr 13, 2021

@puellanivis thanks for your patch.

@mrwonko can you please give #429 a try?

@mrwonko
Author

mrwonko commented Apr 13, 2021

I'll give that new patch a try.

I've not yet been able to reproduce this locally due to unrelated issues, so instead I'm testing using our normal downloaders on our preproduction cluster, which reliably reproduces it. That said, the snippet you posted should theoretically be sufficient for a reproduction.

@mrwonko
Author

mrwonko commented Apr 14, 2021

I still get partial downloads with #429.

@puellanivis
Collaborator

I still get partial downloads with #429.

🙃 wth?

@puellanivis
Collaborator

What are the architectures and programs running on both sides of this? Like, clearly the client in this situation is from our library, but what kind of server is it connecting to? Is that server running off of our package as well?

@puellanivis
Collaborator

I know this might be a long shot, but do you still get incomplete downloads with #435?

@mrwonko
Author

mrwonko commented Apr 23, 2021

Our client runs in an x86_64 docker container. All I know about the server is that it runs on Windows, and I'm pretty sure it doesn't use this library.

Since we don't run in 32 bit, I see no point in trying #435?

@puellanivis
Collaborator

So, the problem was manifest on 32-bit, but there were still other possible edge conditions that could have manifested the same behavior. I’m honestly not sure, and I’m grasping at straws here.

@carsten-luckmann

As far as I understand, the server is running JSCAPE and accesses the files it serves via some network file system.

@drakkan
Collaborator

drakkan commented Apr 24, 2021

@carsten-luckmann @mrwonko we tried to reproduce the issue with no luck; you have two choices:

  1. use v1.12 forever and/or hope that a refactoring of the library will also fix this regression
  2. provide more information: a small reproducer would be ideal, but adding some debug logs in pkg/sftp (printing, for example, the concurrency value for the non-working case) could also help us

Thank you!

@j0hnsmith

I might be seeing this problem too. Using v1.13.0 we're doing

fp, err := sftpClient.Open("somepath")
...
data, err := ioutil.ReadAll(fp)
...

There are no errors (so I assume io.EOF was seen), but the data is short: in our case, for 74 kB and 69 kB files the exact size written is 65024 bytes. Unfortunately the server deletes the files as soon as they're read to completion (the server did delete the files, which suggests the data was read but not returned fully), so we can't try again with different branches/PRs. Given the size is different from those reported above and io.Copy() wasn't used, I'm not sure if it's the same problem or not.

@puellanivis
Collaborator

😱 I think I might see what’s going on with your situation, @j0hnsmith. While using concurrent reads, if we request the end-of-file read early, the end server might trigger its “download is complete” functionality, delete the file from its side, and maybe even truncate it to 0 length.

This would mean any later read, at any point in the file, would end up with an EOF, since it is now beyond the length of the file, causing a premature and silent EOF at some odd 32k boundary less than the length of the whole file.

I had considered needing to have the writes well-ordered, but I had not anticipated a need to ensure monotonically increasing read offsets in WriteTo.
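
To illustrate what well-ordered reads would mean here (a sketch only, not the actual WriteTo code): with strictly increasing offsets, the request that reaches end-of-file is necessarily the last one the server handles.

    // copySequential is an illustrative helper: read 32 KiB chunks at strictly
    // increasing offsets and write them out in order.
    func copySequential(w io.Writer, r io.ReaderAt, size int64) (int64, error) {
        const chunk = 32768
        buf := make([]byte, chunk)
        var written int64
        for off := int64(0); off < size; off += chunk {
            n, err := r.ReadAt(buf, off)
            if n > 0 {
                wn, werr := w.Write(buf[:n])
                written += int64(wn)
                if werr != nil {
                    return written, werr
                }
            }
            if err == io.EOF {
                break
            }
            if err != nil {
                return written, err
            }
        }
        return written, nil
    }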

@puellanivis
Collaborator

🤔 I’m thinking about whether I should work on a patch to get well-ordered, monotonically increasing reads into WriteTo first, before the other PR with the big rewrite to use the internal filexfer package.

That other PR doesn’t really touch any of the WriteTo or ReadFrom code, so I think it should be pretty safe, even if it means the other PR will need some adjustment after the fix is merged. I’ll try to get something out soon. I think there is good cause to prioritize this over the refactor right now, as it is causing serious issues.

@puellanivis
Collaborator

#436: can people try this PR and see if it fixes the issue?

@drakkan
Collaborator

drakkan commented Jun 12, 2021

Hopefully fixed in v1.13.1.

@Coffeeri

Coffeeri commented Jan 10, 2024

#436: can people try this PR and see if it fixes the issue?

This still seems to be the case in v1.13.6.
Some files are unexpectedly and silently cut off when using io.ReadAll.
However, as noted earlier, using io.Copy solves this.

fp, err := sftpClient.Open("somepath")
...
bs, err := io.ReadAll(fp)
// vs
var buf bytes.Buffer
_, err = io.Copy(&buf, fp)

@puellanivis
Collaborator

Hm… trippy. So, io.ReadAll only uses Read from the io.Reader. And looking at the code, it seems like we’ve made the assumption that a short read means err = io.EOF, but this might coincide with a short packet that failed to read the whole way.

We’ll probably want to do some sort of check like if n != l { err = io.ErrUnexpectedEOF } else { err = io.EOF } 🤔
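
For context on why the truncation is silent (illustrative snippet, assuming fmt/io/strings imports): io.ReadAll treats io.EOF as a normal end of stream, so only a distinct error such as io.ErrUnexpectedEOF would surface the problem to the caller.

    // Simulate a reader that is cut short: ReadAll happily returns the
    // truncated data with a nil error, because the cut-off looks like EOF.
    r := io.LimitReader(strings.NewReader("only part of the file"), 9)
    data, err := io.ReadAll(r)
    fmt.Println(len(data), err) // 9 <nil>: the caller cannot tell it was cut short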

@puellanivis
Collaborator

Actually reopening this issue.

@puellanivis puellanivis reopened this Jan 11, 2024
@Coffeeri

Coffeeri commented Jan 17, 2024

And looking at the code, it seems like we’ve made the assumption that a short read means err = io.EOF, but this might coincide with a short packet that failed to read the whole way.

I tried to break on err = io.EOF; however, it never reaches that point.

The EOF "error" is generated by err = normaliseError(unmarshalStatus(packet.id, s.data)). I reach this point four times. Does this even make sense? Shouldn't only one request reach the EOF?

@puellanivis
Collaborator

If the io.EOF error is coming from line 1155, then that is a proper EOF, according to the SFTP server.

@DrHayt

DrHayt commented Jun 12, 2024

Ok, we just got hit by this issue when transitioning from a stock OpenSSH SFTP server to one from GoAnywhere.

We have no issues with the stock server, but we do have issues that manifest exactly as described here when using the GoAnywhere server.

We are going to try to set up something to trace the SFTP communication so that we can compare and contrast and maybe help solve the problem.

@varunbpatil

I also have the same issue that @Coffeeri was describing. Specifically, io.Copy has no issues, but io.Read* reads incomplete data.

My use case includes reading the file twice:

  1. The first read is to calculate md5sum of the file contents.

    srcFile, err := sftpClient.Open("somepath")
    hash := md5.New()
    _, err = io.Copy(hash, srcFile)
    md5sum := hash.Sum(nil)
    srcFile.Close()

    This works correctly every time. It returns the correct md5sum.

  2. The second read is to actually upload the file to S3.

    srcFile, err := sftpClient.Open("somepath")
    _, err = uploader.Upload(ctx, &s3.PutObjectInput{
    	Bucket: aws.String(s3Bucket),
    	Key:    aws.String(s3Path),
    	Body:   srcFile,
    	ContentMD5: aws.String(<md5sum calculated in the previous step>),
    })

    This step fails many times before succeeding. Specifically, AWS returns BadDigest, which means the ContentMD5 we specified does not match the data we uploaded. Even more interestingly, the md5sum that AWS returns is equal to the md5sum of the file truncated at some random 32K byte boundary (sometimes even 0 bytes).

    The reason I cannot use io.Copy here is that the files I'm uploading are large and do not fit into memory all at once.

@puellanivis
Collaborator

io.Copy does not read the whole file into memory, but instead makes 32 KiB reads into a temporary internal buffer, which it then writes out to the destination. But I guess you mean copying the whole file into a bytes.Buffer before uploading.

What version are you using? Your use case should work otherwise. Also, you can use srcFile.Seek(0, 0) to reset the offset to the start of the file.
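
For example (a sketch of that suggestion; hashThenRewind is an illustrative name, and it assumes the usual crypto/md5, io, and github.com/pkg/sftp imports):

    // Hash the file, then rewind the same handle so the subsequent S3 upload
    // reads it from the start instead of reopening the path.
    func hashThenRewind(f *sftp.File) ([]byte, error) {
        hash := md5.New()
        if _, err := io.Copy(hash, f); err != nil {
            return nil, err
        }
        // io.SeekStart is 0, so this matches the suggested f.Seek(0, 0).
        if _, err := f.Seek(0, io.SeekStart); err != nil {
            return nil, err
        }
        return hash.Sum(nil), nil
    }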

@varunbpatil

Hi @puellanivis ,

But I guess you mean copying the whole file into a bytes.Buffer before uploading.

Yes, sorry, I meant to say that I was doing two separate reads because the file doesn't fit into memory, and that the first read (the md5sum calculation) works perfectly every time, while the second read (the S3 upload) returns different contents each time until it eventually succeeds after several retries.

I guess srcFile.Seek(0, 0) would also work instead of a separate open for the second read, but I was more intrigued by io.Copy() being very consistent (in returning the full contents of the file) compared to Read(), which is what I assume the AWS S3 upload is using, because I've also tried wrapping sftp.File in a custom struct that only exposes the Read() interface.

I'm currently on version 1.3.1. Haven't tried the latest version. Do you see something that might explain it for the older version? Thanks.

@puellanivis
Collaborator

Hi @varunbpatil,

We’ve fixed a few things in later versions that could possibly address this. I would recommend trying the newest version (there isn’t any reason not to update; it’s safe) and seeing if the issue persists.

@almk-dev

almk-dev commented Nov 7, 2024

Adding in here that we ran into this and bumped to version v1.13.7, but we have the same issue. It silently truncates the file at the 1638400 byte boundary. Ultimately, we had to swap our ReadAll() calls for Copy(), and that solved the issue.
