HADOOP-19131. Assist reflection IO with WrappedOperations class #6686
Conversation
prepared parquet for this by renaming vectorio package to
This is in sync with apache/hadoop#6686, which has renamed one of the methods to load. The new DynamicWrappedIO class is based on one being written as part of that PR; as both are based on the Parquet DynMethods class, a copy-and-paste is straightforward.
steveloughran
left a comment
- I think I might cut the new read formats (parquet, orc) from the read policy, though parquet/1 and parquet/3 may be good
Files with review comments:
- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Options.java (outdated)
- ...ls/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/contract/ITestAbfsWrappedIO.java
- ...s-project/hadoop-hdfs/src/test/java/org/apache/hadoop/fs/contract/hdfs/TestDFSWrappedIO.java
@mukund-thakur this PR renames; my Iceberg PR apache/iceberg#10233 looks for the new name. It is now dynamic and should build and link up if we can think of a way to test it (proposed: make it an option, "use if present", default is true).
💔 -1 overall
This message was automatically generated.

The javadoc and checkstyle warnings are all about the use of _ in method names, except for one.
💔 -1 overall
This message was automatically generated.

Legitimate failure.
Class WrappedIO extended with more filesystem operations
- openFile()
- PathCapabilities
- StreamCapabilities
- ByteBufferPositionedReadable
* test on supported filesystems (hdfs)
* Plus tests with validation of degradation when IO methods are
not found.
Explicitly add read policies for columnar, parquet and orc
Add IOStatistics context accessors and reset()
* columnar
* orc
* parquet
* avro
This is to make it clearer to the filesystem implementations that they
should optimize for whatever their data traces recommend.
Class DynamicWrappedIO to access the WrappedIO Methods
through Parquet's DynMethods API.
This class becomes easy to copy and paste into
Parquet and Iceberg and then be immediately used.
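The DynMethods-style binding described above can be sketched with plain `java.lang.invoke` reflection. This is an illustrative pattern only, not the actual Hadoop/Parquet code; the class and method names probed below are stand-ins chosen so the sketch is self-contained:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

/**
 * Sketch of dynamic binding with graceful degradation: look up a static
 * method by name at runtime, returning null when it is absent so callers
 * can fall back rather than fail.
 */
public class DynamicBinding {

  /** Look up a static method; return null instead of throwing if missing. */
  public static MethodHandle loadStatic(String className, String methodName,
      MethodType type) {
    try {
      Class<?> cls = Class.forName(className);
      return MethodHandles.lookup().findStatic(cls, methodName, type);
    } catch (ReflectiveOperationException e) {
      return null;  // class or method not found: caller degrades gracefully
    }
  }

  public static void main(String[] args) throws Throwable {
    // Binds: Integer.parseInt(String) exists.
    MethodHandle parse = loadStatic("java.lang.Integer", "parseInt",
        MethodType.methodType(int.class, String.class));
    System.out.println((int) parse.invokeExact("42"));  // prints 42

    // Does not bind: no such method, so we get null and can fall back.
    MethodHandle missing = loadStatic("java.lang.Integer", "noSuchMethod",
        MethodType.methodType(int.class, String.class));
    System.out.println(missing == null);  // prints true
  }
}
```

The same "bind if present, else null" shape is what lets a copied-and-pasted binding class work against both old and new Hadoop releases.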
Class WrappedStatistics to provide an equivalent to access
IOStatistics interfaces, objects and operations.
Ability to
* Get a serializable IOStatisticsSnapshot from an IOStatisticsSource or
IOStatistics instance
* Save an IOStatisticsSnapshot to file
* Convert an IOStatisticsSnapshot to JSON
* Given an object which may be an IOStatisticsSource, return an object
whose toString() value is a dynamically generated, human readable summary.
This is for logging.
* Separate getters to the different sections of IOStatistics.
* mean values are returned as a Map.Pair<Long, Long> of (samples, sum)
from which means may be calculated.
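Deriving a mean from the (samples, sum) pair can be sketched as below. The pair is shown here as a standard java.util.Map.Entry; the exact return type in the patch may differ, and the method name is illustrative:

```java
import java.util.AbstractMap;
import java.util.Map;

/** Sketch: computing a mean from a (samples, sum) pair. */
public class MeanFromPair {

  /** @return sum / samples, or 0 when there are no samples. */
  public static double mean(Map.Entry<Long, Long> samplesAndSum) {
    long samples = samplesAndSum.getKey();
    long sum = samplesAndSum.getValue();
    return samples == 0 ? 0.0 : (double) sum / samples;
  }

  public static void main(String[] args) {
    Map.Entry<Long, Long> pair = new AbstractMap.SimpleEntry<>(4L, 10L);
    System.out.println(mean(pair));  // prints 2.5
  }
}
```

Returning the raw (samples, sum) pair rather than a precomputed mean lets callers aggregate several snapshots before dividing.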
Tuned AbstractContractBulkDeleteTest
* make setUp() an override of the existing setup();
this makes initialization more deterministic.
* inline some variables in setup()
Important: this change renames bulkDelete_PageSize to be bulkDelete_pageSize
so it is consistent with all the new methods being added.
This is in sync with the initial implementation of PARQUET-2493;
tuning the code to suit actual use.
In particular:
- WrappedIO methods raise UncheckedIOExceptions
- DynamicWrappedIO methods unwrap these
- a static method to switch between openFile() and open() based on
method availability
Change-Id: Ib4f177d5409156217f4c3d14f1c99adfe82b96d2
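The availability-based switch between openFile() and open() mentioned above boils down to "probe for the newer method, use the legacy one if absent". A minimal sketch of the probe, using a stand-in class (StringBuilder) rather than the real FileSystem API:

```java
/**
 * Sketch: decide between a newer API and a legacy one by checking
 * whether the newer method exists on the target class at runtime.
 * The class/method names probed in main() are stand-ins, not Hadoop's.
 */
public class OpenFallback {

  /** True iff the named public method exists with the given parameters. */
  public static boolean hasMethod(Class<?> cls, String name, Class<?>... params) {
    try {
      cls.getMethod(name, params);
      return true;
    } catch (NoSuchMethodException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // StringBuilder has append(String) but no "openFile" method:
    System.out.println(hasMethod(StringBuilder.class, "append", String.class)); // true
    System.out.println(hasMethod(StringBuilder.class, "openFile"));             // false
  }
}
```

In the real helper the probe result would select which bound method handle to invoke, so the same caller works on Hadoop versions with and without openFile().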
Move the DynMethods and related classes under oah.utils.dynamic, marked as private.
Change-Id: I9ff52ab02d51bf2175862a3020b41e969088fb65
Add a boolean to enable/disable footer caching. These are all hints with no wiring up. Google GCS does footer caching, and abfs has it as a WiP; those clients can adopt it as desired. The reason for the footer cache flag is that some query engines do their own caching: having the input stream try to "be helpful" is at best needless and at worst counterproductive.
Change-Id: Ibf5914d9fa327438790b946b29b9369d098ae14c
Indicates that locations are generated client side and don't refer to real hosts. If found, list calls which return LocatedFileStatus are low cost. Added for: file, s3a, abfs, oss.
Change-Id: Id94be4cbf1a41ac84818c7b2e061423b9b24d149
Got the signature wrong; logging the loading at debug to diagnose this.
Change-Id: I9c96ffe61d123b9461636380ef77f55d8ddbe3a4
Moving the unchecking into default methods in CallableRaisingIOE, FunctionRaisingIOE etc. makes for a clean and flexible design. Some test enhancements.
Change-Id: If25b6d0377bc9e4e8d4a6e689692ddfa96b1c756
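The "unchecking as a default method" design described above can be sketched as follows. The interface here is written from scratch for illustration; it mirrors the idea of CallableRaisingIOE but is not Hadoop's actual declaration:

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/**
 * Sketch: a functional interface whose method raises IOException, plus a
 * default method that invokes it and converts any IOException into an
 * UncheckedIOException, so lambdas stay usable in unchecked contexts.
 */
public class Unchecking {

  @FunctionalInterface
  public interface CallableRaisingIOE<T> {
    T apply() throws IOException;

    /** Invoke, converting any IOException to an UncheckedIOException. */
    default T unchecked() {
      try {
        return apply();
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
  }

  public static void main(String[] args) {
    CallableRaisingIOE<String> ok = () -> "data";
    System.out.println(ok.unchecked());  // prints data

    CallableRaisingIOE<String> fails = () -> { throw new IOException("boom"); };
    try {
      fails.unchecked();
    } catch (UncheckedIOException e) {
      System.out.println(e.getCause().getMessage());  // prints boom
    }
  }
}
```

Putting the conversion on the interface itself means every call site gets the unchecked form for free, instead of each wrapper class repeating the try/catch.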
Javadoc, checkstyle and a unit test for the new method.
Change-Id: Id16d01c193814c46215c81e8040ffa7a25720f1c
Declare that hbase means an hbase table; s3a maps this to random IO. abfs recommends disabling prefetch for these files; it should do that automatically once support for read policies is wired up.
Change-Id: I0823cd307a059bf0f3499e7555d9ccc87fb4ae70
💔 -1 overall
This message was automatically generated.

💔 -1 overall
This message was automatically generated.
Change-Id: Ibd158b3a14bacc95059f0e4e86179e78bebdb53c
🎊 +1 overall
This message was automatically generated.
All the checkstyle warnings are from the underscores; I tried to set up a style rule to disable this, but it didn't work as there are no checkstyle overrides in hadoop-common right now.
mukund-thakur
left a comment
It's a big patch. I was reviewing it a few weeks ago and checked again today. Overall looks great to me, +1.
I just don't understand why we added fs.capability.virtual.block.locations in this patch.
It's to say "this fs makes up block locations": the cost of looking up block locations is a lot less (no remote calls) and you don't really need to schedule work elsewhere. Now that hasPathCapability() is being exported to legacy code, I just felt this would be useful. Currently things look for the default (host == localhost) and go from there, but they only get to do that after the lookup.
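The point of the capability probe above is that a client can skip locality-aware scheduling when locations are synthetic. A minimal sketch, where the Probe interface is a hypothetical stand-in for the real PathCapabilities API (only the capability string itself comes from the patch):

```java
/**
 * Sketch: using a path capability probe to decide whether scheduling
 * work near the data is worthwhile.
 */
public class VirtualLocations {

  /** Capability string added by this patch. */
  public static final String VIRTUAL_BLOCK_LOCATIONS =
      "fs.capability.virtual.block.locations";

  /** Hypothetical stand-in for the hasPathCapability() probe. */
  public interface Probe {
    boolean hasCapability(String capability);
  }

  /** Only schedule for locality when block locations refer to real hosts. */
  public static boolean shouldScheduleForLocality(Probe fs) {
    return !fs.hasCapability(VIRTUAL_BLOCK_LOCATIONS);
  }

  public static void main(String[] args) {
    Probe s3aLike = cap -> cap.equals(VIRTUAL_BLOCK_LOCATIONS);  // virtual
    Probe hdfsLike = cap -> false;                               // real hosts
    System.out.println(shouldScheduleForLocality(s3aLike));   // prints false
    System.out.println(shouldScheduleForLocality(hdfsLike));  // prints true
  }
}
```

This replaces the "host == localhost" heuristic mentioned above with an explicit, pre-lookup signal.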
…he#6686)

1. The class WrappedIO has been extended with more filesystem operations:
   - openFile()
   - PathCapabilities
   - StreamCapabilities
   - ByteBufferPositionedReadable
   All these static methods raise UncheckedIOExceptions rather than checked ones.

2. The adjacent class org.apache.hadoop.io.wrappedio.WrappedStatistics provides similar access to IOStatistics/IOStatisticsContext classes and operations. Allows callers to:
   * Get a serializable IOStatisticsSnapshot from an IOStatisticsSource or IOStatistics instance
   * Save an IOStatisticsSnapshot to file
   * Convert an IOStatisticsSnapshot to JSON
   * Given an object which may be an IOStatisticsSource, return an object whose toString() value is a dynamically generated, human readable summary. This is for logging.
   * Separate getters to the different sections of IOStatistics.
   * Mean values are returned as a Map.Pair<Long, Long> of (samples, sum) from which means may be calculated.
   There are examples of the dynamic bindings to these classes in:
   org.apache.hadoop.io.wrappedio.impl.DynamicWrappedIO
   org.apache.hadoop.io.wrappedio.impl.DynamicWrappedStatistics
   These use DynMethods and other classes in the package org.apache.hadoop.util.dynamic, which are based on the Apache Parquet equivalents. This makes it easy to re-implement these in that library and others which have their own fork of the classes (example: Apache Iceberg).

3. The openFile() option "fs.option.openfile.read.policy" has added specific file format policies for the core filetypes:
   * avro
   * columnar
   * csv
   * hbase
   * json
   * orc
   * parquet
   S3A chooses the appropriate sequential/random policy. A policy `parquet, columnar, vector, random, adaptive` will use the parquet policy for any filesystem aware of it, falling back to the first entry in the list which the specific version of the filesystem recognizes.

4. New path capability fs.capability.virtual.block.locations. Indicates that locations are generated client side and don't refer to real hosts.

Contributed by Steve Loughran
Hi @steveloughran, sorry, I should have added that comment here itself.
HADOOP-19131
Assist reflection IO with WrappedOperations class
How was this patch tested?
Needs new tests going through reflection, maybe some in the openfile contract tests to guarantee full use.
For code changes:
LICENSE, LICENSE-binary, NOTICE-binary files?