Optimize retrieval from Filecoin #404
Conversation
Optimize retrieval so that when requested retrieval ranges do not align with singularity file ranges, only the minimal number of retrieval requests are made.

This is accomplished by creating a separate reader for each singularity file range. For reads that are larger than a range, multiple ranges are read until the read request is satisfied or until all data is read. For reads smaller than the amount of data remaining in the range, the range reader is maintained so that it can continue to be read from by subsequent reads.

This approach associates a reader with each Singularity file range, not with the ranges requested via the API (in the HTTP range header). This avoids needing to parse the range header in order to set up a reader that reads some number of Singularity ranges. Rather, as arbitrary requested ranges are read, an existing reader for a range is reused if the requested range falls on a singularity range from a previous read. This also means that there is only a single retrieval for each singularity range, whereas if readers were associated with requested ranges then multiple readers could overlap the same singularity range and require multiple retrievals of the same range.

Fixes #366

Additional changes:
- The filecoinReader implementation supports the io.WriterTo interface to allow direct copying to an io.Writer.
- Local and non-local readers can handle forward seek within the current range without additional retrieval.
- The FilecoinRetriever interface supports the RetrieveReader function that returns an io.ReadCloser to read data from.
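For illustration only, here is a rough sketch of the per-range reader idea described above; the struct fields and the openRange helper are assumptions, not the PR's actual code:

```go
package filecoin

import "io"

// Hypothetical types for illustration; the field names and the openRange
// helper are assumptions, not the PR's actual code.
type filecoinReader struct {
	offset      int64
	size        int64
	rangeReader io.ReadCloser                             // reader for the current Singularity file range
	openRange   func(offset int64) (io.ReadCloser, error) // starts one retrieval for the range containing offset
}

// Read may span multiple Singularity file ranges: it keeps reading from the
// current range's reader and only starts a new retrieval when that range is
// exhausted. The current range's reader is kept so that later Reads can
// continue from it without another retrieval.
func (r *filecoinReader) Read(p []byte) (int, error) {
	total := 0
	for total < len(p) {
		if r.offset >= r.size {
			if total > 0 {
				return total, nil
			}
			return 0, io.EOF
		}
		if r.rangeReader == nil {
			rr, err := r.openRange(r.offset)
			if err != nil {
				return total, err
			}
			r.rangeReader = rr
		}
		n, err := r.rangeReader.Read(p[total:])
		total += n
		r.offset += int64(n)
		if err == io.EOF {
			// Current range exhausted; the next iteration opens the next range.
			r.rangeReader.Close()
			r.rangeReader = nil
			continue
		}
		if err != nil {
			return total, err
		}
	}
	return total, nil
}
```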
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
## main #404 +/- ##
==========================================
- Coverage 73.94% 73.83% -0.12%
==========================================
Files 148 149 +1
Lines 9669 9813 +144
==========================================
+ Hits 7150 7245 +95
- Misses 1776 1815 +39
- Partials 743 753 +10
☔ View full report in Codecov by Sentry.
I think this is OK and solves the bulk of the issues at the singularity level, but unless I'm not looking at it right, you're going to end up having to do basically the same style of thing in motion. Over there, the singularity blob reader still only implements a basic
So what to do about that? I think you're going to have to do something persistent in motion that hangs on to a
But then after all that I guess you have the complexity question—is it better to solve this through the abstractions that let us continue to use
Co-authored-by: Rod Vagg <rod@vagg.org>
Also add a test to check that the retrieve goroutines have exited.
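A hedged sketch of what such a goroutine-exit check might look like; the retrieval setup is elided and none of this is taken from the PR's actual test:

```go
package filecoin

import (
	"runtime"
	"testing"
	"time"
)

// Illustrative only: assert that retrieval goroutines have exited by
// comparing goroutine counts before and after the retrieval completes.
func TestRetrieveGoroutinesExit(t *testing.T) {
	before := runtime.NumGoroutine()

	// ... perform a retrieval and fully read/close the returned reader ...

	// Give any cleanup goroutines a moment to finish before checking.
	deadline := time.Now().Add(5 * time.Second)
	for runtime.NumGoroutine() > before && time.Now().Before(deadline) {
		time.Sleep(50 * time.Millisecond)
	}
	if after := runtime.NumGoroutine(); after > before {
		t.Fatalf("retrieval goroutines did not exit: before=%d after=%d", before, after)
	}
}
```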
Good PR for improving the retrieval performance, except for a few things that still need to be addressed.
- I don't see io.WriterTo being used so I assume you're still serving io.ReadSeekCloser to http.ServeContent so that writeToN is called repeatedly for each io.Read()
- If so, a major concern I'm seeing is too much logic inside io.Read() -> writeToN().
My understanding of io.Reader is that each Read() call should be very lightweight, with all heavyweight SQL queries done inside io.Seek() or the object constructor.
Below are some facts that I'm aware of and would leverage:
- http.ServeContent seeks to the beginning and end to figure out the file size, after which it seeks to the range start and only does Read() (see the code @rvagg refers to)
- Singularity Files are split into FileRanges of 1GB each
- Read() is usually called repeatedly with a fixed buffer size of 32K (see io.Copy)
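To make those observations concrete, here is a small stand-alone probe (not part of the PR) that logs the Seek and Read calls http.ServeContent makes when serving a range request; the content size, range, and names are arbitrary:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// loggingReadSeeker records the Seek and Read calls that http.ServeContent
// makes against the content it serves.
type loggingReadSeeker struct{ rs *bytes.Reader }

func (l *loggingReadSeeker) Read(p []byte) (int, error) {
	n, err := l.rs.Read(p)
	fmt.Printf("Read(len=%d) -> %d\n", len(p), n)
	return n, err
}

func (l *loggingReadSeeker) Seek(offset int64, whence int) (int64, error) {
	pos, err := l.rs.Seek(offset, whence)
	fmt.Printf("Seek(%d, %d) -> %d\n", offset, whence, pos)
	return pos, err
}

func main() {
	content := &loggingReadSeeker{rs: bytes.NewReader(make([]byte, 1<<20))}

	req := httptest.NewRequest("GET", "/blob", nil)
	req.Header.Set("Range", "bytes=500000-700000")

	rec := httptest.NewRecorder()
	// Pre-set the content type so ServeContent does not read bytes to sniff it.
	rec.Header().Set("Content-Type", "application/octet-stream")

	// Expected pattern: Seek to end (to get the size), Seek back to start,
	// Seek to the range start, then repeated Reads of at most ~32 KiB.
	http.ServeContent(rec, req, "blob", time.Time{}, content)
}
```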
My recommendation for how the ReadSeekCloser can be implemented (a rough sketch follows this list):
- still leverage http.ServeContent, which does all the parsing of the range header
- The constructor of filecoinReader should look up all file ranges and their corresponding storage providers, so that if there are no SPs serving a certain range of that file, we can return the error status early rather than aborting the HTTP response when it reaches that range. This also eliminates all database calls inside Read()
- Seeking to the rangeStart will close all underlying io.Closers, e.g. the lassie pipe. This looks inefficient, but given that http.ServeContent will never seek again once Read() starts, this path will never be reached
- Read() will open the underlying lassie pipe for that offset if it is not already open. We have a little overhead for the first Read(), but after that it will be all sequential reads. We don't need to optimize forward seeking once reading happens, as http.ServeContent never seeks again after it starts reading.
- Avoid loops inside Read(): if a single Read() crosses the end boundary of a file range, simply set the underlying lassie pipe to nil and increment the file range index so the next Read() will create a new lassie pipe for the new file range.
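A rough sketch of that recommended design, using invented identifiers (rangeInfo, rangeSeekCloser, openPipe) rather than the PR's actual ones, and omitting error handling for invalid seeks:

```go
package filecoin

import "io"

// rangeInfo is a hypothetical record produced by the constructor's single
// database lookup: one entry per 1 GB Singularity FileRange.
type rangeInfo struct {
	start, length int64
}

type rangeSeekCloser struct {
	ranges   []rangeInfo
	size     int64
	offset   int64
	idx      int           // index of the range containing offset
	pipe     io.ReadCloser // current lassie pipe, opened lazily
	openPipe func(r rangeInfo, skip int64) (io.ReadCloser, error)
}

// Seek does no retrieval work; it closes any open pipe and records the new
// offset. http.ServeContent seeks before it starts reading, never after.
func (r *rangeSeekCloser) Seek(offset int64, whence int) (int64, error) {
	switch whence {
	case io.SeekStart:
		r.offset = offset
	case io.SeekCurrent:
		r.offset += offset
	case io.SeekEnd:
		r.offset = r.size + offset
	}
	if r.pipe != nil {
		r.pipe.Close()
		r.pipe = nil
	}
	// Find the file range containing the new offset.
	for r.idx = 0; r.idx < len(r.ranges); r.idx++ {
		ri := r.ranges[r.idx]
		if r.offset < ri.start+ri.length {
			break
		}
	}
	return r.offset, nil
}

// Read has no loop: it opens the pipe for the current range if needed, reads
// once, and when the range boundary is hit it nils the pipe and advances the
// index so the next Read opens the next range.
func (r *rangeSeekCloser) Read(p []byte) (int, error) {
	if r.idx >= len(r.ranges) {
		return 0, io.EOF
	}
	if r.pipe == nil {
		ri := r.ranges[r.idx]
		pipe, err := r.openPipe(ri, r.offset-ri.start)
		if err != nil {
			return 0, err
		}
		r.pipe = pipe
	}
	n, err := r.pipe.Read(p)
	r.offset += int64(n)
	if err == io.EOF {
		r.pipe.Close()
		r.pipe = nil
		r.idx++
		err = nil // the next Read opens the next range
	}
	return n, err
}

func (r *rangeSeekCloser) Close() error {
	if r.pipe != nil {
		return r.pipe.Close()
	}
	return nil
}
```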
I agree, as that seems like the more correct/expected behavior. I have removed that handling of a seek within a read, as well as the corresponding handling here. Now the code does not continue reading from the same stream (pipe reader) following a Seek.
Discussed offline in a meeting. Looks good now. Waiting for some great benchmark results.
Worth clarifying this point because I think it's kind of important to what's being achieved here:
This is unfortunately not true because as long as we accept HTTP spec semantics of
I hate this about the spec and would love to know what utility it has for general use, and specifically whether it's even going to be useful to Motion users. Perhaps we can't know this ahead of time and we just have to accept it. My inclination would be to ditch the Go
But, we can also just go along with it, and I think the solution @gammazero has here gets at this problem.
Optimize retrieval so that when requested retrieval ranges do not align with singularity file ranges, only the minimal number of retrieval requests are made.
This is accomplished by creating a separate reader for each singularity file range. For reads that are larger than a range, multiple ranges are read until the read request is satisfied or until all data is read. For reads smaller than the amount of data remaining in the range, the range reader is maintained so that it can continue to be read from by subsequent reads.
This approach associates a reader with each Singularity file range, and not the ranges requested via the API (in HTTP range header). This avoids needing to parse the range header in order to create readers where each reads some number of Singularity ranges. Rather, as arbitrary requested ranges are read, an existing reader for the corresponding singularity range(s) is reused if the requested range falls on a singularity range from a previous read. This also means that there is only a single retrieval for each singularity range, whereas if readers were associated with requested ranges then multiple readers could overlap the same singularity range and require multiple retrievals of the same range.
Fixes #366
Fixes filecoin-project/motion#143
As an optimization, only one singularity range reader is maintained at a time. This works because once a new singularity range is selected by a requested read, it is highly unlikely that a subsequent read request will fall on a singularity range that was already read from before the new one.
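A minimal sketch of that reuse-or-replace decision, again with hypothetical names (rangeReader, retrieveRange) and simplified so the maintained reader is reused only when a read continues exactly where the previous one left off:

```go
package filecoin

import "io"

// Hypothetical, simplified types; the names are illustrative, not the PR's.
type rangeReader struct {
	io.ReadCloser
	offset   int64 // next file offset this reader will produce
	rangeEnd int64 // end of the Singularity range it covers
}

type filecoinReader struct {
	rangeReader   *rangeReader
	retrieveRange func(offset int64) (*rangeReader, error) // one retrieval per Singularity range
}

// readerFor returns a reader positioned at offset. The single maintained
// range reader is reused when the read continues where the previous one left
// off within the same Singularity range; otherwise it is discarded and a new
// retrieval is started for the range containing offset.
func (r *filecoinReader) readerFor(offset int64) (io.Reader, error) {
	cur := r.rangeReader
	if cur != nil && offset == cur.offset && offset < cur.rangeEnd {
		return cur, nil
	}
	if cur != nil {
		cur.Close() // a new range (or position) was selected; drop the old reader
	}
	rr, err := r.retrieveRange(offset)
	if err != nil {
		return nil, err
	}
	r.rangeReader = rr
	return rr, nil
}
```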
Additional changes:
- The filecoinReader implementation supports the io.WriterTo interface to allow direct copying to an io.Writer.
- The FilecoinRetriever interface supports the RetrieveReader function that returns an io.ReadCloser to read data from.
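For illustration, the new interface method might look roughly like the following; the parameters shown (provider, pieceCID, byte range) are placeholders, not the PR's actual signature:

```go
package filecoin

import (
	"context"
	"io"
)

// Illustrative only: the actual FilecoinRetriever method set and the
// RetrieveReader parameters in the PR may differ.
type FilecoinRetriever interface {
	// RetrieveReader returns an io.ReadCloser that streams the requested
	// bytes, so callers can read incrementally and close early instead of
	// receiving the whole range up front.
	RetrieveReader(ctx context.Context, provider string, pieceCID string, rangeStart, rangeEnd int64) (io.ReadCloser, error)
}
```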
Benchmark
Added a benchmark of filecoin retrievals to compare before and after optimization.
The benchmark uses a file composed of 4 sections, each 16 MiB in size. The entire file is retrieved by requesting 1 MiB chunks. The 1 MiB reads are done through io.CopyN, which copies the data through a 32 KiB buffer.
The non-optimized version does a retrieval for each buffer copy to copy the file data. The optimized version only does as many retrievals as there are independently retrievable sections of the file.
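The benchmark shape might look roughly like the following; this is not the PR's actual benchmark, and the in-memory newTestReader stands in for the filecoin-backed reader so the sketch is self-contained:

```go
package filecoin

import (
	"crypto/rand"
	"io"
	"testing"
)

// newTestReader is an in-memory stand-in for the filecoin-backed reader so
// that this sketch compiles and runs on its own.
func newTestReader(size int64) io.ReadCloser {
	return io.NopCloser(io.LimitReader(rand.Reader, size))
}

func BenchmarkRetrieveInChunks(b *testing.B) {
	const (
		fileSize  = int64(4 * 16 << 20) // 4 sections of 16 MiB each
		chunkSize = int64(1 << 20)      // read the file in 1 MiB chunks
	)
	for i := 0; i < b.N; i++ {
		r := newTestReader(fileSize)
		for off := int64(0); off < fileSize; off += chunkSize {
			// io.CopyN copies through a 32 KiB buffer, so each 1 MiB chunk
			// becomes many smaller Read calls on the underlying reader.
			if _, err := io.CopyN(io.Discard, r, chunkSize); err != nil {
				b.Fatal(err)
			}
		}
		r.Close()
	}
}
```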
Before optimization:
After optimization: (PR #404)