HBASE-24106 Update getting started documentation after HBASE-24086 #1422

Merged (1 commit, Apr 6, 2020)
127 changes: 52 additions & 75 deletions src/main/asciidoc/_chapters/getting_started.adoc
@@ -55,85 +55,67 @@ See <<java,Java>> for information about supported JDK versions.
. Choose a download site from this list of link:https://www.apache.org/dyn/closer.lua/hbase/[Apache Download Mirrors].
Click on the suggested top link.
This will take you to a mirror of _HBase Releases_.
Click on the folder named _stable_ and then download the binary file that looks like
_hbase-<version>-bin.tar.gz_.
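+
The Apache mirrors publish checksum files alongside each release artifact, and verifying the tarball before extracting it is good practice. A minimal sketch, using a stand-in file since the real _hbase-<version>-bin.tar.gz_ name and the published checksum format depend on the release:

```shell
# Create a stand-in for the downloaded tarball, checksum it, then verify.
# With a real download you would fetch the published .sha512 file from the
# Apache download page instead of generating one locally.
echo 'stand-in archive bytes' > hbase-X.Y.Z-bin.tar.gz.sample
sha512sum hbase-X.Y.Z-bin.tar.gz.sample > hbase-X.Y.Z-bin.tar.gz.sample.sha512

# Prints "hbase-X.Y.Z-bin.tar.gz.sample: OK" when the file is intact.
sha512sum -c hbase-X.Y.Z-bin.tar.gz.sample.sha512
```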
. Extract the downloaded file and change to the newly-created directory.
+
[source,subs="attributes"]
----
$ tar xzvf hbase-<version>-bin.tar.gz
$ cd hbase-<version>/
----
. Set the `JAVA_HOME` environment variable in _conf/hbase-env.sh_.
First, locate the installation of `java` on your machine. On Unix systems, you can use the
_whereis java_ command. Once you have the location, edit the _conf/hbase-env.sh_ file, found inside
the extracted _hbase-<version>_ directory, uncomment the line starting with `#export JAVA_HOME=`,
and then set it to your Java installation path.
+
.Example extract from _conf/hbase-env.sh_ where `JAVA_HOME` is set
----
# Set environment variables here.
# The java implementation to use.
export JAVA_HOME=/usr/jdk64/jdk1.8.0_112
----
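+
The edit above can also be made without opening an editor. A minimal sketch, assuming GNU `sed` and using a sample file in place of the real _conf/hbase-env.sh_; the JDK path is the example value from above, not a detected one:

```shell
# Demonstrate the hbase-env.sh edit on a sample file; in a real install,
# target conf/hbase-env.sh inside the extracted hbase-<version> directory.
printf '%s\n' \
  '# Set environment variables here.' \
  '# The java implementation to use.' \
  '#export JAVA_HOME=' > hbase-env.sh.sample

# Uncomment the JAVA_HOME line and point it at an example JDK path;
# substitute the path reported by `whereis java` on your machine.
sed -i 's|^#\s*export JAVA_HOME=.*|export JAVA_HOME=/usr/jdk64/jdk1.8.0_112|' \
  hbase-env.sh.sample

cat hbase-env.sh.sample
```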
+
. Optionally set the <<hbase.tmp.dir,`hbase.tmp.dir`>> property in _conf/hbase-site.xml_.
At this time, you may consider changing the location on the local filesystem where HBase writes
its application data and the data written by its embedded ZooKeeper instance. By default, HBase
uses paths under <<hbase.tmp.dir,`hbase.tmp.dir`>> for these directories.
+
NOTE: On most systems, this is a path created under _/tmp_. Many systems periodically delete the
contents of _/tmp_. If you start working with HBase in this way, and then return after the
cleanup operation takes place, you're likely to find strange errors. The following
configuration will place HBase's runtime data in a _tmp_ directory found inside the extracted
_hbase-<version>_ directory, where it will be safe from this periodic cleanup.
+
Open _conf/hbase-site.xml_ and paste the `<property>` tags between the empty `<configuration>`
tags.
+
.Example _hbase-site.xml_ for Standalone HBase
====
[source,xml]
----
<configuration>
<property>
<name>hbase.tmp.dir</name>
<value>tmp</value>
</property>
</configuration>
----
====
+
You do not need to create the HBase _tmp_ directory; HBase will do this for you.
+
NOTE: When unconfigured, HBase uses <<hbase.tmp.dir,`hbase.tmp.dir`>> as a starting point for many
important configurations. Notable among them are <<hbase.rootdir,`hbase.rootdir`>>, the path under
which HBase stores its data. You can specify values for this configuration directly, as you'll see
in the subsequent sections.
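+
To make the substitution concrete, here is a small illustration. The derived defaults shown in the comments are assumptions about recent HBase 2.x behavior and should be confirmed against the _hbase-default.xml_ shipped with your release:

```shell
# Assumed defaults (verify against hbase-default.xml for your release):
#   hbase.rootdir                    = ${hbase.tmp.dir}/hbase
#   hbase.zookeeper.property.dataDir = ${hbase.tmp.dir}/zookeeper
HBASE_TMP_DIR="tmp"   # the relative value set in hbase-site.xml above

# A relative hbase.tmp.dir resolves against the process working directory,
# i.e. the extracted hbase-<version> directory when using bin/start-hbase.sh.
ROOTDIR="file://$(pwd)/${HBASE_TMP_DIR}/hbase"
ZK_DATADIR="$(pwd)/${HBASE_TMP_DIR}/zookeeper"
echo "rootdir candidate:        ${ROOTDIR}"
echo "zookeeper data candidate: ${ZK_DATADIR}"
```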
+
NOTE: In this example, HBase is running on Hadoop's `LocalFileSystem`. That abstraction doesn't
provide the durability promises that HBase needs to operate safely. This is most likely acceptable
for local development and testing use cases. It is not appropriate for production deployments;
eventually you will lose data. Instead, ensure your production deployment sets
<<hbase.rootdir,`hbase.rootdir`>> to a durable `FileSystem` implementation.

Review comments on this note:

Contributor:
Given this change is only going into specific versions, should we add a note for users on older versions? Something like: if you are on v < 2.3, disable stream enforcement by setting this flag; later versions of HBase take care of it automatically.

Contributor:
Generally, documentation changes are merged to master and cherry-picked to appropriate branches. In this case the commit should be backported to branch-2 and branch-2.3 but not to branch-2.2.

Contributor:
True, but I think most people refer to the ref-guide here, https://hbase.apache.org/book.html, rather than the version-specific one because it shows up in search results; hence my comment.

Member Author:
We've never really solved the challenges of branch-specific documentation. I didn't try older versions of 2.x... let me see where they stand.

Member Author:
The book says HBase 2.1 and 2.2 are "untested" with Hadoop 2.10. branch-2.2 builds with -Dhadoop-two.version=2.10.0 and produces a similar error at runtime:

java.lang.IllegalStateException: The procedure WAL relies on the ability to hsync for proper operation during component failures, but the underlying filesystem does not support doing so. Please check the config value of 'hbase.procedure.store.wal.use.hsync' to set the desired level of robustness and ensure the config value of 'hbase.wal.dir' points to a FileSystem mount that can provide it.
        at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.rollWriter(WALProcedureStore.java:1092)
        at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.recoverLease(WALProcedureStore.java:424)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:586)
        at org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1522)
        at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:579)
        at java.lang.Thread.run(Thread.java:748)

branch-2.1 does not build with hadoop-2.10.0:

[INFO] --- maven-enforcer-plugin:3.0.0-M2:enforce (banned-jsr305) @ hbase-client ---
[WARNING] Rule 0: org.apache.maven.plugins.enforcer.BannedDependencies failed with message:
We don't allow the JSR305 jar from the Findbugs project, see HBASE-16321.
Found Banned Dependency: com.google.code.findbugs:jsr305:jar:1.3.9

Member Author:
> Given this change is only going into specific versions, should we add a note for users on older versions?

This is tricky. My understanding is that HBase has never had these durability guarantees when running on LocalFileSystem, because no version of Hadoop has ever provided an implementation that provides hflush or hsync on that class. Thus this warning is applicable everywhere, to everyone.

Now as far as HBase's behavior in the presence of a LocalFileSystem, that's a little different. On Hadoop 2.8.x, we had no way to ask Hadoop if the OutputStream supported these characteristics, so we simply move forward with a warning. It's not clear to me when we first exposed hbase.unsafe.stream.capability.enforce (on first glance, all the places that config is referenced appear to have been changed since its inception) or what our behavior was before then. Let me look into this further and see if I can make a recommendation.

If we backport the parent issue to branch-2.2, then I think the behavior will be the same on all branch-2 derivatives.

Member Author:
HBASE-19289 has some nice history here...

Member:
> My understanding is that HBase has never had these durability guarantees when running on LocalFileSystem because no version of Hadoop has ever provided an implementation that provides hflush or hsync on that class.

Yeah, agree on this point. Just driving by to also mention that there is the RawLocalFileSystem implementation, which works with the local filesystem and does implement hflush/hsync. There's just some more trickery to get it set up for file:// instead of LocalFileSystem. I don't think we have this configured for HBase at all (I remember Accumulo used to do a bunch with it for UTs).

Member Author:
> Yeah, agree on this point. Just driving by to also mention that there is the RawLocalFileSystem implementation which works with the local filesystem and does implement hflush/hsync.

Thanks for the pointer @joshelser. I'm not up to speed on the differences between the two. I looked long enough to see that they follow different inheritance hierarchies.

If we could converge the majority of our small and medium test suite onto some equivalent to LocalFileSystem, it would make a huge difference in the runtime and resource usage of tests...
. The _bin/start-hbase.sh_ script is provided as a convenient way to start HBase.
Issue the command, and if all goes well, a message is logged to standard output showing that HBase started successfully.
@@ -308,26 +290,21 @@ In the next sections we give a quick overview of other modes of hbase deploy.
[[quickstart_pseudo]]
=== Pseudo-Distributed Local Install
After working your way through the <<quickstart,quickstart>> using standalone mode, you can
re-configure HBase to run in pseudo-distributed mode. Pseudo-distributed mode means that HBase
still runs completely on a single host, but each HBase daemon (HMaster, HRegionServer, and
ZooKeeper) runs as a separate process. Previously in <<quickstart,standalone mode>>, all these
daemons ran in a single jvm process, and your data was stored under
<<hbase.tmp.dir,`hbase.tmp.dir`>>. In this walk-through, your data will be stored in HDFS
instead, assuming you have HDFS available. This is optional; you can skip the HDFS configuration
to continue storing your data in the local filesystem.
NOTE: This procedure assumes that you have configured Hadoop and HDFS on your local system and/or a
remote system, and that they are running and available. It also assumes you are using Hadoop 2.
The guide on
link:https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html[Setting up a Single Node Cluster]
in the Hadoop documentation is a good starting point.
. Stop HBase if it is running.
+
@@ -348,8 +325,8 @@ First, add the following property which directs HBase to run in distributed mode
</property>
----
+
Next, add a configuration for `hbase.rootdir` so that it points to the address of your HDFS instance, using the `hdfs:////` URI syntax.
In this example, HDFS is running on the localhost at port 8020.
+
[source,xml]
----
@@ -360,10 +337,10 @@
</property>
----
+
You do not need to create the directory in HDFS; HBase will do this for you.
If you create the directory, HBase will attempt to do a migration, which is not what you want.
+
Finally, remove the configuration for `hbase.tmp.dir`.
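+
As a quick sanity check of the resulting configuration, here is a sketch run against a sample file; in a real install you would point the `grep` commands at _conf/hbase-site.xml_ instead. The sample content mirrors the properties this walk-through establishes:

```shell
# Build a sample of what conf/hbase-site.xml should contain after the edits
# above: distributed mode on, rootdir on HDFS, and no hbase.tmp.dir entry.
printf '%s\n' \
  '<configuration>' \
  '  <property>' \
  '    <name>hbase.cluster.distributed</name>' \
  '    <value>true</value>' \
  '  </property>' \
  '  <property>' \
  '    <name>hbase.rootdir</name>' \
  '    <value>hdfs://localhost:8020/hbase</value>' \
  '  </property>' \
  '</configuration>' > hbase-site.xml.sample

# Check the three conditions; prints a confirmation when all hold.
grep -q '<name>hbase.cluster.distributed</name>' hbase-site.xml.sample \
  && grep -q 'hdfs://' hbase-site.xml.sample \
  && ! grep -q '<name>hbase.tmp.dir</name>' hbase-site.xml.sample \
  && echo 'pseudo-distributed config looks consistent'
```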
. Start HBase.
+
Use the _bin/start-hbase.sh_ command to start HBase.