Add Cloud Optimized GeoTIFF Support #211
Conversation
Hi Josh,
Thanks for contributing your work.
I started a first quick review of your commits to provide an initial round of feedback.
We may come back with more comments after further passes.
- [ ] Please remove all committed .DS_Store files (I see you added a rule to .gitignore, but some files are still there).
- [ ] It would be better if all the commits were squashed into a single one (messages can be preserved). Not a big issue; we may just click squash-and-merge during the merge.
I'll have to check it a second time to understand how it works when using JAI ImageRead deferred reads.
```java
}

public void addTileRange(int tileIndex, long offset, long byteLength) {
    //if ((offset < firstTileOffset && offset > 0) || tileIndex == 0) {
```
Please remove commented-out lines
```java
    firstTileByteLength = byteLength;
}
if (offset < headerSize && offset > 0) {
    //headerSize = (int)offset - 1;
```
Please remove commented-out lines.
```java
package it.geosolutions.imageioimpl.plugins.cog;

/**
 * This utility class examines will look for either a system property or environment variable. When calling
```
I will probably remove the word "examines"
```java
if (ranges[i][0] < headerLength - 1) {
    // this range starts inside of what we already read for the header
    modified = true;
    if (ranges[i][1] < headerLength - 1 || ranges[i][1] == headerLength - 1) {
```
I think that we could simplify this check with a "<=".
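The suggested simplification is a pure refactor: `a < h - 1 || a == h - 1` is exactly `a <= h - 1`. A minimal sketch (method names are hypothetical, just for the comparison) checking the two forms agree:

```java
public class RangeChecks {

    // The original check from the PR: two comparisons against headerLength - 1
    static boolean original(long rangeEnd, long headerLength) {
        return rangeEnd < headerLength - 1 || rangeEnd == headerLength - 1;
    }

    // The reviewer's suggestion: a single "<=" comparison
    static boolean simplified(long rangeEnd, long headerLength) {
        return rangeEnd <= headerLength - 1;
    }

    public static void main(String[] args) {
        // The two predicates agree for every input
        for (long end = 0; end < 40; end++) {
            assert original(end, 20) == simplified(end, 20);
        }
        System.out.println("equivalent");
    }
}
```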
```java
    completed.add(key);
} catch (Exception e) {
    e.printStackTrace();
```
Printing a stack trace isn't good practice; better to use the LOGGER instead.
Question: what can we do when an exception occurs? Will the data be corrupted? Should we fail the whole read?
If this fails, one or more tiles may become corrupt, which could lead to an error decoding those tiles. Because TIFFImageReader reads one tile at a time, I wasn't 100% certain whether it would fail on a single bad tile or whether only that tile would not render. Being optimistic, I didn't fail hard -- I wanted to give the other ranges/tiles a chance to be read and cached in case they can still be rendered.
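A small sketch of the logging pattern being suggested, using `java.util.logging`'s `Logger.log(Level, String, Throwable)` overload so the stack trace is preserved in the configured log output rather than dumped to stderr (the `readTile` helper here is hypothetical, standing in for the tile-caching logic):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class TileReadLogging {

    private static final Logger LOGGER = Logger.getLogger(TileReadLogging.class.getName());

    // Instead of e.printStackTrace(), log at SEVERE and attach the exception
    // as the Throwable argument so its stack trace reaches the log handlers.
    static String readTile(int tileIndex) {
        try {
            if (tileIndex < 0) {
                throw new IllegalArgumentException("negative tile index");
            }
            return "tile-" + tileIndex;
        } catch (Exception e) {
            LOGGER.log(Level.SEVERE, "Failed to read tile " + tileIndex
                    + "; skipping it so remaining tiles can still be cached", e);
            return null; // optimistic: let the other tiles proceed
        }
    }

    public static void main(String[] args) {
        System.out.println(readTile(3));
        System.out.println(readTile(-1));
    }
}
```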
```java
try {
    buffer.put(bytes);
} catch (Exception e) {
    LOGGER.severe("Error writing bytes to ByteBuffer for source " + uri);
```
Wondering if we need to also report the causing exception in the logged error
```java
    writeValue((int)key, future.get().asByteArray());
    completed.add(key);
} catch (Exception e) {
    e.printStackTrace();
```
See my comments above on the "XYZRangeReader" about printing stack traces.
```java
try {
    streamPos = delegate.getStreamPosition();
} catch (IOException e) {
    e.printStackTrace();
```
Same comment as above about printing stacktrace :)
```java
try {
    delegate.seek(streamPos);
} catch (IOException e) {
    e.printStackTrace();
```
Here too :)
```java
    Arrays.copyOfRange(b, (int) tileRange.getStart(), (int) (tileRange.getEnd() + 1));
    CacheManagement.DEFAULT.cacheTile(key, tileBytes);
} catch (Exception e) {
    e.printStackTrace();
```
Same as above on printing stack trace
I believe I have addressed all of the issues pointed out. Please let me know if there are more findings and I will address those as well.
Had a look; I like the design. I found a bunch of minor issues, but also a very serious one (memory usage).
One thing that's left out is control over the async requests. I guess they are parallelized; how does one control how many concurrent requests will hit the server (HTTP, S3, Azure) in the worst case?
```diff
@@ -125,6 +125,10 @@ public TIFFImageReaderSpi() {
    );
}

public static void setReaderClassName(String readerClassName) {
```
This does not look right. I see you changing the value in CogImageReaderSpi, but the method should be non-static and protected (or even better, turned into a protected constructor taking the reader names), and the two fields should also be non-static; otherwise both SPIs might end up exposing the same reader and writer names (maybe not in practice, but it still seems problematic to me).
Updated to make this non-static.
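The reviewer's "protected constructor taking the reader names" idea can be sketched in miniature. The class names below are simplified stand-ins for `TIFFImageReaderSpi`/`CogImageReaderSpi`, just to show why instance fields keep the two SPIs independent:

```java
public class SpiNames {

    // Base SPI with an instance field set through a protected constructor,
    // so each subclass carries its own reader name.
    static class BaseSpi {
        protected final String readerClassName;

        protected BaseSpi(String readerClassName) {
            this.readerClassName = readerClassName;
        }
    }

    static class TiffSpi extends BaseSpi {
        TiffSpi() { super("TIFFImageReader"); }
    }

    static class CogSpi extends BaseSpi {
        CogSpi() { super("CogImageReader"); }
    }

    public static void main(String[] args) {
        // With a static field, constructing CogSpi would have overwritten the
        // name TiffSpi exposes; instance fields keep the two independent.
        assert new TiffSpi().readerClassName.equals("TIFFImageReader");
        assert new CogSpi().readerClassName.equals("CogImageReader");
        System.out.println("names independent");
    }
}
```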
```java
 */
public interface CogImageInputStream {

    void readRanges();
```
Some javadocs on the methods would not hurt (or else, would help review :-D).
Same goes for the other interfaces.
All of the interfaces in the commons module are now thoroughly commented at the method level.
```java
    this.headerSize = headerSize;
}

public void addTileRange(int tileIndex, long offset, long byteLength) {
```
So, based on the offset, one call has two different side effects? This is probably best explained in the javadoc of the method, if dedicated methods for each side effect are not desirable.
Could you provide a bit more detail about your concern here? Is it just that the two separate if statements are confusing? I will definitely take another pass at documenting methods better.
It's one method call that does two entirely different things based on input parameters, at the very least they should be explained. Or just have two separate methods with clear names instead (javadocs are still a plus in this case too).
Removed tracking the start tile index, added documentation about the remaining logic to check header size.
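The "two separate methods with clear names" alternative discussed above can be sketched as follows. This is a simplified stand-in (method and field names are assumptions, not the PR's final API): the tile-range bookkeeping and the header-shrinking side effect each get an explicit entry point.

```java
import java.util.Map;
import java.util.TreeMap;

public class TileRangeTracker {

    private long headerSize = 16384; // default initial header read
    private final Map<Integer, long[]> tileRanges = new TreeMap<>();

    // Does exactly one thing: record the byte range for a tile.
    void addTileRange(int tileIndex, long offset, long byteLength) {
        tileRanges.put(tileIndex, new long[] {offset, byteLength});
    }

    // The second effect, as its own clearly named method: if the first tile
    // starts before the assumed header end, the header must be shorter.
    void shrinkHeaderToFirstTile(long firstTileOffset) {
        if (firstTileOffset > 0 && firstTileOffset < headerSize) {
            headerSize = firstTileOffset;
        }
    }

    long getHeaderSize() { return headerSize; }

    int getTileCount() { return tileRanges.size(); }

    public static void main(String[] args) {
        TileRangeTracker t = new TileRangeTracker();
        t.addTileRange(0, 400, 1000);
        t.shrinkHeaderToFirstTile(400);
        assert t.getHeaderSize() == 400;
        assert t.getTileCount() == 1;
        System.out.println("header shrunk to " + t.getHeaderSize());
    }
}
```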
```xml
    <version>${azure.version}</version>
</dependency>
<dependency>
    <groupId>com.squareup.okhttp3</groupId>
```
Interesting, haven't heard of this library before. In the GeoServer ecosystem we use commons httpclient (mostly version 3, a bit of 4). Seems a moderately big new dependency for the ecosystem (700k). Would moving to commons-httpclient impair functionality significantly, or cause performance regressions?
OkHttp seems to be a pretty popular client, but I'm not tied to it. This was actually a rewrite as I originally used the Java 11 HttpClient, but Torben pointed out that this needs to be compatible with Java 8, so I moved to OkHttp. I have no issue moving to commons-httpclient as it seems to support async connections. My question now is, would you be against introducing v5 of the client? It supports http/2 multiplexing, which eliminates the concerns of how many connections get created. This seems like the best modern approach to take, but still introduces a new dependency. Or potentially I could provide an http1 and http2 client?
If it's not aligned with current GeoTools usage, it would be one more dependency nevertheless; at that point, it's probably better to keep it as is and concentrate on the other issues.
Left it alone for now. I'm fine with rewriting this in the future or refactoring this to be the http2 module and providing another http module based on commons-httpclient.
Thanks!
```java
    return headerLength;
}

public abstract void readAsync(long[]... ranges);
```
Found the name confusing... the method is actually synchronous; it's just that internally it makes calls that are parallelized and run asynchronously, but it waits for all of them to complete before returning. I'd just call it "read"? How it works internally is an implementation detail of the subclasses... or not?
Also, since it returns nothing, it would be nice to explain what needs to be done next (I'm guessing, calling getBytes()).
All correct. Looks like I'll be refactoring all of this to avoid the use of ByteBuffer. I had async requests stuck in my brain when I created that method, and originally I also had another read method that made a single, purely synchronous request. I'll just rename the method, as it is in fact synchronous, and have it return the result after I refactor.
Updated the name and modified the method to return the data.
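The shape of the renamed API can be sketched like this: a blocking `read(...)` that returns the fetched ranges directly instead of a `void readAsync(...)` that parks bytes in a shared buffer. The in-memory `remote` array is a stand-in for an HTTP/S3/Azure fetch, and the method name mirrors the discussion above rather than the project's final signature:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RangeReaderSketch {

    // Blocking read: fetch each [start, end] range (inclusive) and return the
    // results keyed by range start, so the caller gets the data back directly.
    static Map<Long, byte[]> read(byte[] remote, long[]... ranges) {
        Map<Long, byte[]> result = new LinkedHashMap<>();
        for (long[] range : ranges) {
            int start = (int) range[0];
            int end = (int) range[1];
            byte[] chunk = new byte[end - start + 1];
            System.arraycopy(remote, start, chunk, 0, chunk.length);
            result.put(range[0], chunk);
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] remote = new byte[] {10, 20, 30, 40, 50, 60};
        Map<Long, byte[]> data = read(remote, new long[] {1, 3}, new long[] {4, 5});
        assert data.get(1L).length == 3;  // bytes 20, 30, 40
        assert data.get(4L)[1] == 60;
        System.out.println("read " + data.size() + " ranges");
    }
}
```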
```java
 * @author joshfix
 * Created on 2019-08-23
 */
public class DefaultCogImageInputStream implements ImageInputStream, CogImageInputStream {
```
So this class works in two steps: initially it has no content and won't work; then one inits the header, and it can provide the header only; and after calling readRanges, it can offer the tiles too (only the requested ones). So it's not a generic image input stream but a special-purpose one that can only be used in concert with the COG reader. This should be explicit, imho.
Also all correct. The central concept of my design is to "warm" the streams with the data of (only) the requested tiles up front, before we actually start to decode tiles. The only way that can be accomplished is by reading the header to determine the tile offsets and lengths. Basically the input streams are completely non-functional without being initialized with the header data.
I guess I assumed from the name and the interface it would be apparent that it was purpose-built for COG, but I will add more documentation stating and explaining its purpose and limitations.
ImageInputStreams are an extension point that can be looked up and used by any other format, so making one that's format-specific departs from the intended usage. I understand the reason behind the current implementation, and hopefully it's not going to cause side effects for other concurrent sources in the same server (e.g., other GeoTIFFs, ECWs and the like), but it's best to be quite clear when going off design.
Added explicit documentation regarding purpose.
```java
}

public void setFilesize(int filesize) {
    buffer = ByteBuffer.allocate(filesize);
```
Wow, danger zone! The byte buffer here is actually allocated in memory, so if one has to deal with a BigTIFF that's, say, 100GB in the cloud, the code will allocate 100GB of local memory, even if the current read only aims at reading one tile?
See: https://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html#allocate(int)
If I understand correctly, the current implementation works for (very) small images under light load, but it's all too easy to make it go OOM. It would need some sort of sparse byte buffer, or, instead of delegating to a MemoryCacheImageInputStream, its own implementation that only keeps in memory the tiles needed for the current read... This should be doable by keeping an "index" of offsets pointing at the actual bytes and doing some offset math whenever a read is called.
Thanks for pointing this out. In fact, I did not take into account very large files or consider the ramifications of blindly allocating the full file size in the byte buffer. I'm going to spend a bit of time contemplating the best way to handle this. My initial design was to keep the RangeReader implementations' responsibilities as narrow as possible -- read the final ranges required for the requested tiles, but know nothing about tiles, then fill in the byte buffer at the appropriate places. I'm thinking maybe this should be modified so that:
- The `CogTileInfo` object is passed to the range reader so that it has full knowledge of each requested tile index, offset, length, etc.
- The range reader uses the `RangeBuilder` to build the contiguous ranges to be read.
- The range reader then decomposes the results back into per-tile byte arrays (possibly storing them in the corresponding `TileRange` objects in `CogTileInfo`).
- The `DefaultCogImageInputStream` is updated accordingly with offset math, etc.
What do you think about this solution?
I think the RangeReader classes can be kept compact and focused if, instead of returning a `byte[]`, they end up returning a `Map<long[], byte[]>` with just the byte ranges read. The interesting and slightly complicated bit will be updating the `CogImageInputStream` implementations so that they keep track of the current reading position and do the offset-math dance. I don't see `TileInfo` needing to be updated, but that's more of an implementation preference.
Updated to return `Map<Long, byte[]>`, where the Long value is the start of the range. The end of the range, if needed, can be calculated using the `byte[]` length.
Oh, I suppose it's important to mention that I completely removed a) any reference or attempt to understand the actual file size and b) usage of the class-level ByteBuffer to store all range data :)
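The offset-math approach agreed on above can be sketched with a sparse, map-backed stream: only the fetched ranges are held in memory, keyed by their absolute start offset, so a huge BigTIFF never triggers a full-file allocation. This is a hypothetical simplification (class and method names are not the project's), just to show the "index of offsets plus offset math" idea:

```java
import java.util.Map;
import java.util.TreeMap;

public class SparseStreamSketch {

    // Fetched ranges only, keyed by absolute start offset in the remote file.
    private final TreeMap<Long, byte[]> data = new TreeMap<>();
    private long streamPos = 0;

    void addRange(long start, byte[] bytes) {
        data.put(start, bytes);
    }

    void seek(long pos) {
        streamPos = pos;
    }

    // Map the absolute stream position into the chunk that contains it.
    int read() {
        Map.Entry<Long, byte[]> e = data.floorEntry(streamPos);
        if (e == null || streamPos >= e.getKey() + e.getValue().length) {
            // the position was never fetched -- fail loudly rather than return junk
            throw new IllegalStateException("position " + streamPos + " not in any fetched range");
        }
        int b = e.getValue()[(int) (streamPos - e.getKey())] & 0xFF;
        streamPos++;
        return b;
    }

    public static void main(String[] args) {
        SparseStreamSketch s = new SparseStreamSketch();
        s.addRange(0, new byte[] {1, 2});       // header bytes
        s.addRange(100, new byte[] {7, 8, 9});  // one tile, far into the file
        s.seek(101);
        assert s.read() == 8; // absolute offset 101 -> second byte of the tile
        System.out.println("sparse read ok");
    }
}
```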
```java
String length = contentRange.split("/")[1];
try {
    filesize = Integer.parseInt(length);
    buffer = ByteBuffer.allocate(filesize);
```
OOM issue, another problematic point.
ByteBuffer has been removed.
```java
try {
    HeadObjectResponse headResponse = client.headObject(headObjectRequest).get();
    filesize = headResponse.contentLength().intValue();
    buffer = ByteBuffer.allocate(filesize);
```
OOM issue, another problematic point.
ByteBuffer has been removed.
```java
connector = new AzureConnector(AzureUrlParser.getAccountName(uri));
client = connector.getAzureClient(containerName, filename);
filesize = (int) client.getProperties().block().blobSize();
buffer = ByteBuffer.allocate(filesize);
```
OOM issue, another problematic point.
ByteBuffer has been removed.
Committed changes that should address all concerns @aaime. In addition, I made a few other noteworthy changes:
Looks good, thanks for all the changes! Added a few comments, hopefully quick to fix.
Now... I'd like to test this in GeoServer. Is it just a matter of putting it on the classpath and using an http URI, or is anything else needed?
```java
/**
 * ImageReader implementation extending from TIFFImageReader. If this class encounters an ImageInputStream that does
 * not implement `CogImageInputStream`, it will simply pas the request on to TIFFImageReader. Otherwise, it will
```
pas -> pass
fixed
```java
protected int headerLength = DEFAULT_HEADER_LENGTH;
protected Map<Integer, TileRange> tileRanges = new TreeMap<>();
public static final int HEADER_TILE_INDEX = -100;
public static final int DEFAULT_HEADER_LENGTH = 16384;
```
Could it be that the header is longer than 16k (e.g., a BigTIFF with a very large tile directory)? Would that badly affect functionality?
I personally have not seen this and decided on 16K because it's the default for GDAL COG reads as well. I would assume that if a TIFF had enough overviews and enough tiles you could eventually break past 16K, but it seems unlikely to me? I just realized that I made an effort to make the initial headerLength configurable, but provided no real mechanism to allow an external user to set it. I'm thinking I could just add it as a field to CogImageReadParam.... thoughts?
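The proposed configuration point is small enough to sketch. The nested class below is a hypothetical stand-in for the suggested `CogImageReadParam` field (accessor names are assumptions, not the final API):

```java
public class HeaderLengthParamSketch {

    // Stand-in for adding a header-length field to the read param.
    static class CogReadParamSketch {
        // 16384 matches GDAL's default initial read for COGs
        private int headerLength = 16384;

        public void setHeaderLength(int headerLength) {
            this.headerLength = headerLength;
        }

        public int getHeaderLength() {
            return headerLength;
        }
    }

    public static void main(String[] args) {
        CogReadParamSketch param = new CogReadParamSketch();
        assert param.getHeaderLength() == 16384;
        // e.g. a BigTIFF with a very large tile directory
        param.setHeaderLength(65536);
        assert param.getHeaderLength() == 65536;
        System.out.println("header length configurable");
    }
}
```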
```diff
@@ -0,0 +1,54 @@
package it.geosolutions.imageioimpl.plugins.cog;
```
Used to have copyright header, now gone.
re-added
```java
@Override
public int read() throws IOException {
    streamPos++;
    return 0;
```
Seems a bit harsh, I'm guessing the TIFF reader never does a single byte read call?
Both are true: it is a bit harsh, and it never gets called :) Either way, I updated it to read a single byte via the other read method.
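The fix described above -- implementing the single-byte `read()` in terms of the bulk `read(byte[], int, int)` instead of returning a dummy value -- looks roughly like this. This is a self-contained sketch over an in-memory array, not the project's stream class:

```java
public class SingleByteReadSketch {

    private final byte[] source;
    private int pos = 0;

    SingleByteReadSketch(byte[] source) {
        this.source = source;
    }

    // Bulk read, as in the underlying stream implementation.
    int read(byte[] b, int off, int len) {
        if (pos >= source.length) return -1;
        int n = Math.min(len, source.length - pos);
        System.arraycopy(source, pos, b, off, n);
        pos += n;
        return n;
    }

    // Single-byte read delegating to the bulk read, mirroring the fix:
    // no more "streamPos++; return 0;".
    int read() {
        byte[] one = new byte[1];
        int n = read(one, 0, 1);
        return n == -1 ? -1 : one[0] & 0xFF; // unsigned, per InputStream contract
    }

    public static void main(String[] args) {
        SingleByteReadSketch s = new SingleByteReadSketch(new byte[] {(byte) 0xFF, 5});
        assert s.read() == 255;
        assert s.read() == 5;
        assert s.read() == -1;
        System.out.println("single-byte read delegates correctly");
    }
}
```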
```java
// this should never happen -- we should have read all bytes from all tiles in the request envelope
if (contiguousRange == null || rangeStart == -1L) {
    streamPos += len;
```
If it can really never happen, maybe best to throw an exception here.
updated to throw an exception
```java
int relativeStreamPos = (int)(streamPos - rangeStart) + off;
// copy the bytes from the fetched tile into the destination byte array
for (int i = 0; i < len; i++) {
```
Eeek, pretty inefficient! Can you use System.arraycopy instead? It accepts offsets into both arrays :-D
updated this and the caching implementation to use System.arraycopy.
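For reference, the per-byte loop and `System.arraycopy` produce identical results; the latter is a JVM-intrinsic bulk copy that avoids per-element call overhead. A minimal side-by-side (helper names are just for this sketch):

```java
import java.util.Arrays;

public class CopySketch {

    // The original pattern: copy one byte at a time in a loop.
    static byte[] copyLoop(byte[] src, int srcOff, int len) {
        byte[] dst = new byte[len];
        for (int i = 0; i < len; i++) {
            dst[i] = src[srcOff + i];
        }
        return dst;
    }

    // The suggested fix: a single bulk copy, with offsets into both arrays.
    static byte[] copyFast(byte[] src, int srcOff, int len) {
        byte[] dst = new byte[len];
        System.arraycopy(src, srcOff, dst, 0, len);
        return dst;
    }

    public static void main(String[] args) {
        byte[] src = {0, 1, 2, 3, 4, 5};
        assert Arrays.equals(copyLoop(src, 2, 3), copyFast(src, 2, 3));
        System.out.println("copies agree");
    }
}
```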
```xml
    <version>${azure.version}</version>
</dependency>
<dependency>
    <groupId>com.squareup.okhttp3</groupId>
```
Thanks!
Re: how to test in GeoServer... There is a blocker :) I think we would really just want the ability to use the GeoTools GeoTiffReader. The problem is that the GeoTiffFormat class doesn't allow you to modify the image reader SPI. To circumvent the issue, I created my own format class that is nearly an exact duplicate of GeoTiffFormat (note this is utilizing an old version of the COG project). I have tested this in the past using a custom mosaic store that uses a STAC catalog as a feature source and uses the aforementioned format class as the suggested format (forgive the ugly code, I just pushed the dev branch, which was full of in-progress tests). It worked fine for me and I was able to dynamically render Landsat 8 mosaics. Not sure what the best/recommended approach would be here, but I think we're going to need to update some GeoTools code to get this working in GeoServer.
Doh, sorry I missed your earlier comment. How about modifying the GeoTIFF format to use a list of potential image readers instead, and skip the subclass?
No worries! I think you may have missed my other comment/question as well about setting the header length. I realized that even though most of my classes accept the header length as a variable, there is no way outside of the CogImageReader to set it. Previously, the default value came from
RE: modifying GeoTiffFormat... Given that it's a GeoTools class, does that mean this imageio-ext code is good to go? Should we move this conversation to GeoTools? I am definitely open to suggestions here on how to handle the GeoTools code. It seems we would need to deal with this in GeoTiffFormat:
GeoTiffReader has the same private static final field. In addition, there may be some situations we need to handle. Sorry for the long-winded comment. Let me know how you'd like to proceed from here!
Hi @joshfix, so I wanted to go ahead and merge, and then move attention to GT/GS indeed.
If there are online tests that need specific setups, or even just internet access, it's best if they are enabled only with a profile or a config file that the dev has to provide.
This is very interesting. I can't reproduce locally, and the test runs against a publicly available Landsat 8 image on S3, so no AWS key is required to read the image.
Hum... a test that requires an internet connection should still be activated only by profiles.
Ring any bells? Could it be that you're not seeing the error because you have the "right" credentials, or at least a set of valid ones?
I renamed my AWS config files and was able to reproduce. It was a simple fix -- when building the S3 client, if no credentials were provided, I build an anonymous provider. I also added an online profile for all tests that require internet access.
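The shape of that fix -- falling back to an anonymous provider when no credentials are configured, so public buckets like the Landsat 8 archive stay readable -- can be sketched as follows. The provider interface here is a self-contained stand-in; in the AWS SDK v2 the equivalent calls would be `AnonymousCredentialsProvider.create()` and `StaticCredentialsProvider`:

```java
public class CredentialsFallbackSketch {

    // Stand-in for the SDK's credentials-provider abstraction.
    interface CredentialsProvider {
        String name();
    }

    // If either key is missing, fall back to anonymous access rather than
    // failing the client build; otherwise use the supplied static credentials.
    static CredentialsProvider chooseProvider(String accessKey, String secretKey) {
        if (accessKey == null || secretKey == null) {
            return () -> "anonymous"; // AnonymousCredentialsProvider.create() in SDK v2
        }
        return () -> "static";        // StaticCredentialsProvider in SDK v2
    }

    public static void main(String[] args) {
        assert chooseProvider(null, null).name().equals("anonymous");
        assert chooseProvider("AKIA-example", "secret").name().equals("static");
        System.out.println("anonymous fallback in place");
    }
}
```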
Merged finally! Going to have a look at the GeoTools/GeoServer side.
Ok, had a look. You're right, having to initialize the streams with the range reader is a source of trouble indeed... At this point I don't see much alternative to making a custom subclass of the GeoTiff reader. There you can possibly dodge the SPI mechanism, manually set up the stream, and init it using the desired area to read and target resolution (which are available in the read method of the reader). If no area is required, then the whole TIFF at max resolution is read; to just read the metadata, make it read only the header, I believe? I would support making changes to the base gt-geotiff classes, enough to make it possible to set up the streams correctly. Does this sound like a plan?
Thanks for getting this merged! Very excited to take the next step and get this working in GeoTools. I'm a bit buried at the moment, so it may be a while before I can circle back to this. Totally open to any suggestions, however, and if you see a clear path forward I'll be happy to implement it.
This PR introduces a new plugin that seeks to take advantage of the COG structure by asynchronously reading byte ranges for contiguous TIFF tiles. It includes caching and non-caching ImageInputStream implementations, as well as clients (RangeReaders) capable of reading remote COGs using HTTP, Azure, and AWS S3 client libraries.