ODPi Runtime v3 Draft
ODPi Runtime Specification: 3.0
Status: Draft
Specifications covering Platforms based upon Apache Hadoop, Apache Hive, Apache Spark and Apache Ranger. Compatibility guidelines for Applications running on Platforms.
This specification covers:
- Apache Hadoop® 3.0, including all maintenance releases.
- Apache Hadoop® 3.0-compatible filesystems (HCFS).
- Apache Hive™ 3.0
- Apache Spark™ 2.2
- Apache Ranger™ 1.0
- Apache Atlas™ 1.0.0
The goals of the ODPi Runtime Specification are:
- For End-Users: the ability to run any Application on any Platform and have it work.
- For Software Vendors: compatibility guidelines that enable them to ensure their Applications are interoperable across any Platform.
- For Platform Vendors: compliance guidelines that enable Applications to run successfully on their Platform. The guidelines must still allow Platform Vendors to deliver patches to their End-Users in an expeditious manner, to deal with emergencies.
The methodology used in the ODPi Runtime Specification is to define the interface(s) between Services on a Platform (such as HDFS) and Applications in a way that achieves the above goals. These interfaces can in turn be used by Software Vendors to properly build their software, and will serve as the basis of a compliance test suite that Platform Vendors can use to verify compliance.
At this time the ODPi Runtime Specification is strongly based on the exact behaviour of the underlying Apache projects. Part of compliance is specified as shipping a Platform built from a specific line of Hadoop, namely 3.0. It is expected that the Hadoop version the spec is based on will evolve as both Hadoop and this specification evolve.
The Hadoop implementation leaves many degrees of freedom in how Hadoop is deployed and configured--and also how it is used (e.g., nothing stops Applications from calling private interfaces). These degrees of freedom can interfere with the objectives of the ODPi Runtime Specification. The goal of this spec is to close enough of those freedoms to achieve those objectives.
The source code approach is not followed in the same way for Hive. Instead a set of interfaces that are deemed to be important for applications and users are specified to be fully compatible with Hive 3.0.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Each entry will have a designation (free text in square brackets) in order to pinpoint which parts of the specification are in violation during certification.
The designation [TEST_ENVIRONMENT] is reserved for entries that are defining constraints on the environment in which Platforms are expected to run. The output of the reference implementation validation testsuite is expected to capture enough information to identify whether the test execution environment conforms to the specification.
Important upgrade considerations are listed below in this section.
All Hadoop JARs are now compiled targeting a runtime version of Java 8. Users still on Java 7 or earlier must upgrade to Java 8.
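As a minimal sketch of checking this requirement, the helper below parses the version banner printed by `java -version` and flags anything older than Java 8. The sample banner strings are assumptions for illustration; it handles both the legacy `1.x` numbering (Java 8 and earlier) and the modern single-number scheme.

```python
# Hypothetical sketch: extract the Java major version from a "java -version"
# banner string and flag anything below Java 8.
import re

def java_major_version(banner: str) -> int:
    """Handle both legacy '1.x' numbering and modern 'x' numbering."""
    m = re.search(r'version "(?:1\.)?(\d+)', banner)
    if m is None:
        raise ValueError("unrecognized banner: " + banner)
    return int(m.group(1))

# Sample banners (assumptions for illustration):
for banner in ('openjdk version "1.7.0_80"',
               'openjdk version "1.8.0_292"',
               'openjdk version "11.0.2"'):
    v = java_major_version(banner)
    print(v, "OK" if v >= 8 else "upgrade to Java 8 required")
```

In practice the banner would come from running `java -version` on each cluster node.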
The YARN Timeline Service v.2 should only be used in a test capacity. It addresses two major challenges: improving the scalability and reliability of the Timeline Service, and enhancing usability by introducing flows and aggregation. It is provided for users and developers to test and to provide feedback.
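For test clusters, Timeline Service v.2 is enabled through `yarn-site.xml`. The fragment below is a sketch following the Hadoop 3 documentation; the exact set of properties needed may vary by deployment.

```xml
<!-- yarn-site.xml: enable Timeline Service v.2 (test use only) -->
<property>
  <name>yarn.timeline-service.version</name>
  <value>2.0f</value>
</property>
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>
```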
The Hadoop shell scripts have been rewritten to fix bugs. Some of the changes made may break existing installations. The incompatible changes are documented in the release notes and one can find more details here.
Erasure Coding has been introduced to reduce overhead in storage space as well as other hardware resources. It provides the same level of fault tolerance as replication but takes up much less storage: the storage overhead is no more than 50%, compared with 200% for 3x replication, and the effective replication factor of an erasure-coded file is always 1.
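The overhead comparison can be made concrete with a small calculation. The sketch below assumes the Reed-Solomon RS(6,3) layout (6 data blocks plus 3 parity blocks), which can tolerate the loss of any 3 blocks, matching the fault tolerance of 3x replication:

```python
# Sketch: storage overhead of erasure coding vs. 3x replication for the
# same logical data. RS(6,3) is assumed for the erasure-coding layout.
def storage_overhead(data_units: int, extra_units: int) -> float:
    """Extra storage as a fraction of the logical data size."""
    return extra_units / data_units

ec = storage_overhead(6, 3)           # RS(6,3): 3 parity blocks per 6 data blocks
replication = storage_overhead(1, 2)  # 3x replication: 2 extra copies per block

print(f"EC overhead: {ec:.0%}, 3x replication overhead: {replication:.0%}")
```

Both schemes survive the loss of three blocks of a stripe, but erasure coding does so at a quarter of the extra storage cost.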
In order to achieve a higher degree of fault tolerance, support has been introduced for running more than two NameNodes. The documentation has been updated to describe how to configure them. Three NameNodes are recommended, and it is suggested not to exceed five due to communication overhead.
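Configuring a third NameNode follows the existing HA pattern in `hdfs-site.xml`, with an additional ID in the NameNode list. The fragment below is illustrative; the nameservice ID `mycluster`, the NameNode IDs, and the hostnames are placeholders.

```xml
<!-- hdfs-site.xml: three NameNodes for nameservice "mycluster" (IDs illustrative) -->
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2,nn3</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>master1.example.com:8020</value>
</property>
<!-- ...repeat the rpc-address (and http-address) entries for nn2 and nn3... -->
```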
Opportunistic containers were created to improve cluster resource utilization and increase task throughput. Unlike guaranteed YARN containers, which are scheduled on a node only when there are unallocated resources, opportunistic containers can be dispatched to a NodeManager even if their execution cannot start immediately. This is best suited to workloads made up of relatively short tasks, on the order of seconds.
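Opportunistic container allocation is off by default. A minimal sketch for enabling centralized allocation in `yarn-site.xml`, per the Hadoop 3 documentation (the queue length value is illustrative):

```xml
<!-- yarn-site.xml: enable centralized opportunistic container allocation -->
<property>
  <name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- maximum number of opportunistic containers a NodeManager may queue -->
  <name>yarn.nodemanager.opportunistic-containers-max-queue-length</name>
  <value>10</value>
</property>
```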
MapReduce has introduced a performance enhancement: a native implementation of the map output collector, catered toward shuffle-intensive jobs, which can reduce job runtime by up to 30%.
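Jobs can opt in to the native collector through job configuration. The fragment below is a sketch, assuming the native task library is present on the cluster nodes:

```xml
<!-- mapred-site.xml or per-job configuration: use the native map output collector -->
<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <value>org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator</value>
</property>
```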
From the perspective of an ISV with dependent build code, upgrading to HDP v3 may present some challenges in getting a successful build:
- Dependent JAR versions: moving to Hive 3.0 with ACID tables brings API changes, deprecated code, and package name changes, so dependent code must be updated before it will build successfully.
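For example, a Maven build that pulls in Hive client libraries would bump its dependency versions. The artifact below is illustrative; the exact modules a given project depends on will vary.

```xml
<!-- pom.xml: illustrative dependency bump for code built against Hive 3.0 -->
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>3.0.0</version>
</dependency>
```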
This work is licensed under a Creative Commons Attribution 4.0 International License.