Need better error handling when there is a failure to fetch a file #78

Open
suhastech opened this issue May 27, 2021 · 24 comments

@suhastech

We stumbled upon a tricky bug where no matter what, the ZIP produced by zipline was corrupt. Tracking it down was quite difficult.

It seems that the download of one of the S3 files failed here:

https://github.com/fringd/zipline/blob/master/lib/zipline/zip_generator.rb#L96

file.download { |chunk| writer_for_file << chunk }
  S3 Storage (46.4ms) Downloaded file from key: 123
Traceback (most recent call last):
Aws::S3::Errors::NotFound ()

Possible solutions:

  • Abort the download and raise the error properly, with debug information (see the sketch below). I'm not sure if this is possible, as it's a stream.
  • Write an empty file, or one containing a note about the error. At least the other files would be delivered properly this way. I agree this isn't ideal, since it fails silently; an "ignore errors" flag could at least make users opt into that behaviour knowingly.
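
A minimal sketch of the first option, wrapping the quoted download loop; name is the entry name that ZipGenerator#write_file receives, and the logger call is only illustrative:

begin
  file.download { |chunk| writer_for_file << chunk }
rescue Aws::S3::Errors::ServiceError => e
  # Part of this entry may already have been streamed to the client, so the
  # most we can reliably do here is record which entry failed and re-raise.
  Rails.logger.error("zipline: failed to fetch #{name.inspect}: #{e.class} #{e.message}")
  raise
end
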
@fringd
Owner

fringd commented Oct 19, 2022

yeah we should handle this better.

I'm not sure how, but if we could abort the whole download that's probably the best option...

@fringd
Owner

fringd commented Feb 6, 2023

i googled a bit for some possible solutions here... https://stackoverflow.com/questions/15305203/what-to-do-with-errors-when-streaming-the-body-of-an-http-request

these two options seem like the best leads:

  1. Write a malformed chunk (and close the connection), which will trigger a client error
  2. Add an HTTP trailer telling the client that something went wrong.

the trailer seems pretty robust, but i'm not sure how to do that.
https://www.rubydoc.info/gems/rack/Rack/Chunked is the only mention of trailers in rails i could find.

a malformed chunk? not sure how to make rails do that either.

if anybody else has any clues, let me know

@Urist-McUristurister

Hey, just an idea; it may be stupid, but maybe not, so...

If zipline fails to fetch a file, or any other error occurs:

  • log this
  • if it was a failure to fetch a file, add the file URL to some kind of internal @skipped list
  • if possible, do NOT add the failed file to the archive. Even if that's not possible, it's better to have a working archive with a few broken files than a wholly broken archive, especially if it's multiple GBs in size.
  • move on to the next item on the list
  • In the end, add a text file named, say, "failures.log", listing the @skipped files with their corresponding HTTP errors (see the sketch after this comment)

I don't know zipline's internals, so if skipping is not possible, then write the closing sequence for the broken file (if needed/possible), write the "failures.log", and finish up with a proper ZIP header (well, footer, technically), to allow the client to unarchive at least what has been fetched so far.

Optionally, add a configuration/options/settings/initialiser to customise the name and content (using a proc) of the failures log file (or disable it completely).

WDYT?...
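
A rough sketch of that failures.log idea against zip_tricks' Streamer directly; entries, fetch and io here are placeholders, not zipline's real internals:

ZipTricks::Streamer.open(io) do |zip|
  skipped = []

  entries.each do |entry_name, url|
    begin
      zip.write_deflated_file(entry_name) do |sink|
        fetch(url) { |chunk| sink << chunk } # fetch stands in for the real download call
      end
    rescue => e
      # Remember the failure and keep going with the rest of the archive.
      # NB: as discussed later in this thread, an entry that fails mid-write
      # still ends up in the central directory; see the SaferZipWrite patch below.
      skipped << [entry_name, "#{e.class}: #{e.message}"]
    end
  end

  unless skipped.empty?
    zip.write_deflated_file("failures.log") do |sink|
      skipped.each { |entry_name, error| sink << "#{entry_name}: #{error}\n" }
    end
  end
end
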

@fringd
Owner

fringd commented Jun 22, 2023 via email

@fringd
Owner

fringd commented Jun 22, 2023

that said, an errors.log would be better than nothing. i'll keep this in mind if we try the other options and can't get them working.

@Urist-McUristurister

if it happened in the middle of the file that data has already been streamed out to the user and it's not possible to walk it back as far as i know

But there's no need to walk back.

Before switching to zipline, for many years we used a "classic" approach: download the files to the worker machine, zip them, and then send the archive to the client. From that experience, the vast majority of errors related to remote files happen when first trying to fetch a file - for example, a 404 or a 5XX. An error in the middle of a transfer is something I can't say I have ever encountered. So if it's possible to just log the failure and skip the file, that would be best.

But even in the case where we've fetched, say, 20 MB of a 25 MB file and the connection to the remote host goes down or times out mid-transfer: is it possible to stream to the client "OK, that's the end of this file, here's the next one" and just jump to the next file in the list? It is still much preferable to have a log + a corrupted 20 MB file + 1999 complete files, rather than a broken 15 GB archive that you can't even examine or salvage.

@fringd
Owner

fringd commented Jun 23, 2023

yeah, actually that makes some sense to me... this gem is mostly for those giant downloads, so maybe partial success is generally better. my worry is that the user won't know that the one file in there is actually corrupt.

@Urist-McUristurister

Urist-McUristurister commented Jun 24, 2023

my worry is the user will not know that the 1 file in there is actually corrupt.

Hence the error.log ;) Better one corrupt file than a 15 GB corrupt archive...

@julik
Contributor

julik commented Feb 5, 2024

The problem is that a ZIP file which has been "chopped off" at the end is likely not to unarchive well, because the list of files inside the ZIP comes at the end of the file. You can read the ZIP "straight-ahead", but in the presence of data descriptors this is likely to fail in some apps, and it is very inefficient. It can also be faulty - for example, if entries are themselves ZIP files. What you need to make this work is a resumable download, similar to what we did at WeTransfer back in the day. You might be able to put something together using raw zip_tricks and interval_response so that when you request the download again you can resume where it left off. It is not trivial to do by any means, but doable.

An error.log would be a nice feature, although once your download gets interrupted you can no longer add anything to the response.

But even in the case where we've fetched, say, 20 MB of a 25 MB file and the connection to the remote host goes down or times out mid-transfer: is it possible to stream to the client "OK, that's the end of this file, here's the next one" and just jump to the next file in the list?

If your concern is the "upstream" (where you are pulling the file from) timing out on you or similar, you can pull data from it in chunks and resume where the transfer left off. HTTP Range headers can be used for that.
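
For illustration, a rough sketch of such a ranged, resumable pull with Net::HTTP; the helper name, chunk size and retry policy are arbitrary, and the upstream has to honour Range requests (200/206 responses):

require "net/http"

# Hypothetical helper: pull a remote file into a sink in ranged chunks,
# resuming from the last received byte if the connection drops.
def stream_with_ranges(uri, sink, chunk_size: 5 * 1024 * 1024, max_retries: 3)
  offset = 0
  retries = 0
  loop do
    request = Net::HTTP::Get.new(uri)
    request["Range"] = "bytes=#{offset}-#{offset + chunk_size - 1}"
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.request(request)
    end
    break if response.code == "416" # we asked for a range past the end of the file: done
    raise "Unexpected response #{response.code}" unless %w[200 206].include?(response.code)

    sink << response.body
    offset += response.body.bytesize
    break if response.code == "200" # server ignored Range and sent the whole body
  rescue Timeout::Error, Errno::ECONNRESET
    retries += 1
    raise if retries > max_retries # give up after a few attempts; offset marks the resume point
  end
end
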

@Urist-McUristurister

Urist-McUristurister commented Feb 5, 2024

@julik I'm not talking about "chopping off" the ZIP file, I'm talking about the requested files being chopped off. Or, which is actually more frequent, non-existent 404 files that zipline is fetching from a URL. If I specify a list of 500+ URLs (S3, for example) and one of them in the middle of the list does not exist, the whole 50 GB archive is botched. And the frustration of downloading a 30 GB file that turns out to be broken, then downloading it again only to find it's still broken, is immense.

What I am talking about is at least skipping the failed archivable file. If the file being fetched throws an error, any error, then either skip writing it altogether or, if it's a partially fetched file, write what has been fetched so far (say, 50 MB of 150) and move on to the next file. And at the very end, in the list of zipped files, say "OK, this file is 50 MB in size, next".

That way, when the user downloads the archive, only one file in it will be "broken" (zero-sized, partial, or not present at all), not the whole archive.

@julik
Contributor

julik commented Feb 5, 2024

That should actually work if you surround your call to write_deflated_file with a rescue. If you do that, zip_tricks will move on to the next file and not add the file you have written to the central directory. The likelihood is high that the unarchiving software opening the file will be able to unarchive it correctly based on the central directory, which gets output at the end. Try adding a rescue around ZipGenerator#write_file and test with some large-ish outputs, interrupting them halfway?

Update - I was wrong, because even though ZipTricks will not finalize the file it will still add it to the central directory - albeit in a "broken" form. You would need to muck about with the internals a bit:

module SaferZipWrite
  # Removes a file that does not finish writing from the
  # @files array in the Streamer, so that the file will not
  # be present in the ZIP central directory. Some unarchivers
  # might then still be able to unarchive files which do complete writing.
  class StreamerWrapper < SimpleDelegator
    def write_deflated_file(...)
      files = __getobj__.instance_variable_get("@files")
      n_files_before_writing = files.size
      super
    rescue => e
      # do something with the exception
      files.pop if files.size > n_files_before_writing
    end

    def write_stored_file(...)
      files = __getobj__.instance_variable_get("@files")
      n_files_before_writing = files.size
      super
    rescue => e
      # do something with the exception
      files.pop if files.size > n_files_before_writing
    end
  end

  def write_file(streamer, file, name, options)
    super(StreamerWrapper.new(streamer), file, name, options)
  end
end

Zipline::ZipGenerator.prepend(SaferZipWrite)

@Urist-McUristurister

Well... I actually did something similar: I patched the write_file method to first attempt to download the target file, with up to 3 retries; if it still fails after 3 retries, it writes a text file with the same name and the error message, and then passes that file to streamer.write_stored_file (stored because we're serving mostly pictures, no BMPs, so it doesn't really make much sense to compress them further. A "compression" option for the zipline method, maybe?). Sure, it's slower, since it has to download the file first, but it's more reliable. Still, not something that would fit everyone.

I'll try your suggestion, though, thanks! 👍 Maybe I don't have to download the whole file after all...

Still, it would be nice to have this feature built in ;)
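
Roughly the shape of the patch described above; this is only a sketch, with fetch_to_tempfile standing in for the actual download call and the retry count picked arbitrarily:

module DownloadFirstWrite
  MAX_ATTEMPTS = 3

  def write_file(streamer, file, name, options)
    attempts = 0
    begin
      tempfile = fetch_to_tempfile(file) # placeholder: download the remote file into a Tempfile first
    rescue => e
      attempts += 1
      retry if attempts < MAX_ATTEMPTS
      # After the final failed attempt, write a text entry under the same name
      # carrying the error message, then let the rest of the archive continue.
      streamer.write_stored_file(name) do |sink|
        sink << "Could not fetch this file: #{e.class} #{e.message}\n"
      end
      return
    end

    # The file is now local, so streaming it into the ZIP should no longer fail
    # because of the remote side. Stored rather than deflated, since the
    # pictures being served are already compressed.
    streamer.write_stored_file(name) do |sink|
      IO.copy_stream(tempfile, sink)
    end
  ensure
    tempfile&.close!
  end
end

Zipline::ZipGenerator.prepend(DownloadFirstWrite)
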

@Urist-McUristurister

even though ZipTricks will not finalize the file it will still add it to the central directory - albeit in a "broken" form

That's still much preferable to a broken archive! 👍 😉

@julik
Contributor

julik commented Feb 5, 2024

"compression" option for zipline method, maybe?

that is for @fringd to ponder 😆 The feature would be nice, indeed - I'd wager it would have to be added in zip_tricks though, and I haven't yet sorted out the maintenance there. There could be a heuristic in zipline where it picks stored or deflated depending on the filename extension, for instance (a trivial sketch below). Note that storing an Excel or Word file uncompressed can lead to interesting bugs, because if a program reads a ZIP "straight ahead" it will sometimes think that parts of those Office files are themselves entries of the "outer" ZIP 🤡 I remember the macOS Archive Utility used to do this, but I don't know in which version they fixed it.
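
A trivial sketch of that extension heuristic; the extension list and helper name are illustrative only, and Office formats are deliberately deflated because of the straight-ahead reader issue mentioned above:

# Hypothetical helper: store formats that are already compressed, deflate the rest.
ALREADY_COMPRESSED = %w[.jpg .jpeg .png .gif .webp .mp3 .mp4 .zip .gz .7z].freeze

def streamer_method_for(filename)
  ALREADY_COMPRESSED.include?(File.extname(filename).downcase) ? :write_stored_file : :write_deflated_file
end

# e.g. streamer.public_send(streamer_method_for(name), name) { |sink| sink << data }
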

@fringd
Owner

fringd commented Feb 5, 2024 via email

@julik
Contributor

julik commented Feb 11, 2024

For those pondering the "heuristic" choice between the stored and deflated modes, here is something that could work nicely. It monitors how well the input compresses (you could also pick your own ratio), buffering roughly 64 KB of data; it then picks the more efficient option and opens the file for you in the Streamer.

module HeuristicWrite
  class DeflateHeuristic
    BYTES_WRITTEN_THRESHOLD = 65*1024

    def initialize(filename, streamer)
      @filename = filename
      @streamer = streamer

      @buf = StringIO.new.binmode
      @deflater = ::Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, -::Zlib::MAX_WBITS)
      @bytes_deflated = 0

      @winner = nil
    end

    def <<(bytes)
      if @winner
        @winner << bytes
      else
        @buf << bytes
        @deflater.deflate(bytes) { |chunk| @bytes_deflated += chunk.bytesize }
        decide if @buf.size > BYTES_WRITTEN_THRESHOLD
      end
      self
    end

    def close
      decide unless @winner
      @winner.close
    end

    private def decide
      # Finish and then close the deflater - it has likely buffered some data
      @bytes_deflated += @deflater.finish.bytesize until @deflater.finished?

      # If the deflated version is smaller than the stored one
      # - use deflate, otherwise stored
      ratio = @bytes_deflated / @buf.size.to_f
      if ratio < 0.75
        warn "using flate"
        @winner = @streamer.write_deflated_file(@filename)
      else
        warn "using stored"
        @winner = @streamer.write_stored_file(@filename)
      end
      @buf.rewind
      IO.copy_stream(@buf, @winner)
    end
  end

  def write_file(name)
    writable = DeflateHeuristic.new(name, self)
    yield(writable)
    writable.close
  end
end
ZipTricks::Streamer.include(HeuristicWrite)

io = StringIO.new
ZipTricks::Streamer.open(io) do |zip|
  zip.write_file("this will be stored.bin") do |io|
    90_000.times do
      io << Random.bytes(1024)
    end
  end

  zip.write_file("this will be deflated.bin") do |io|
    90_000.times do
      io << "many many delicious, compressible words"
    end
  end
end
warn io.size

julik added a commit to julik/zip_kit that referenced this issue Feb 12, 2024
This allows us to rescue during writes if, for example, the upstream server has a timeout.
There is an interesting discussion here fringd/zipline#78
julik added a commit to julik/zip_kit that referenced this issue Feb 29, 2024
This allows us to rescue during writes if, for example, the upstream server has a timeout.
There is an interesting discussion here fringd/zipline#78
@julik
Contributor

julik commented Mar 17, 2024

@Urist-McUristurister @suhastech @fringd Now that the zip_tricks situation has been resolved - maybe we should think about an API for this? I've added a feature to zip_kit which excludes the failed file from the ZIP central directory and removes it from the path set - so you can add the same file multiple times if it fails halfway.

The issue is that you need to specify a retry policy, or have some code for it in some way. For example, you could have a retry policy where, if one of the files does not get included correctly or fails to download from the remote, zipline continues to the next one. Or one that makes 5 attempts and then aborts. Or one that makes 3 attempts with exponential backoff and then moves on to the next file if the output still fails.

With "naked" zip_kit you could do that with something like https://github.com/kamui/retriable by doing

Retriable.retriable(on: TimeoutError) do
  zip.write_file("my_stuff.bin") do |sink|
    HTTP.get(the_url) { |response_body| IO.copy_stream(response_body, sink) }
  end
end

The configuration for Retriable could be passed into zipline as a keyword argument, but then retriable would become a dependency for zipline. It also does not allow specifying a policy such as "continue to the next file on that error", and the errors that can be raised by cloud services when fetching from remote are many.

@fringd
Owner

fringd commented Mar 18, 2024

if fetching a file over HTTP fails after x of n bytes written, it would be nice if we could try to resume using Range requests to restart the download at byte x, if supported

@fringd
Owner

fringd commented Mar 18, 2024

i'm guessing most CDNs support range requests

@julik
Contributor

julik commented Mar 18, 2024

Storage backends like S3 do, but it is not guaranteed - also S3 will not respond to a HEAD, only to a GET - you will need to do a "start request", and so on. I mean, all of these are doable; what I am wondering about is how to make it sufficiently generic so that it would be "good enough for most folks", or alternatively how to provide an extension point for it. If you can resume an upstream download you can rescue inside of the write_file block - but even ActiveStorage will not give you an API for this, not right now at least, unless you pre-chop ranges to fetch. There is a lot of ranged-fetching work in https://github.com/WeTransfer/format_parser/blob/master/lib/remote_io.rb, https://github.com/WeTransfer/format_parser/blob/master/lib/care.rb and https://github.com/WeTransfer/format_parser/blob/master/lib/active_storage/blob_io.rb, for instance, and it has (for one) to be HTTP-client-specific 🤔

@fringd
Owner

fringd commented Mar 18, 2024

Okay, attempting to resume gracefully would be a nice-to-have feature request, I think. We can attempt it later; just handling errors gracefully would be a good start for now.

@julik
Contributor

julik commented Mar 20, 2024

Ok, I will need to do a bit of refactoring then. What would be the default: "give up" if a file fails and print to the Rails log? This would need a refactor for all the file types, I think (since you need to match on storage-specific exceptions).

@fringd
Owner

fringd commented Mar 20, 2024

i think the default would be to remove the file from the list of files to be written in the header at the end, move on to the next file, and yeah, log it. that's at least a baseline we can improve from

@julik
Contributor

julik commented Mar 20, 2024

i think the default would be to remove the file from the list of files to be written in the header at the end, move on to the next file, and yeah, log it. that's at least a baseline we can improve from

Aight
