
Support Accumulo installs on Microsoft Azure #270

Merged: 2 commits into apache:master on Sep 9, 2019

Conversation

srajtiwari
Contributor

  • Add new cluster type 'azure' which leverages VM Scale Sets
  • Add HA (high-availability) capabilities for the Hadoop Name Node,
    Accumulo master, and Zookeeper roles within Muchos. Note: HA is on by
    default and should not be disabled
  • Enable central collection of metrics and logs using Azure Monitor
  • Increase some CentOS defaults to improve cluster stability
  • Fix latent bugs which prevent Spark from being set up correctly
  • Add checksums for specific Spark and Hadoop versions, as well as for
    Accumulo 2.0.0

@keith-turner keith-turner self-assigned this Aug 8, 2019
Contributor

@keith-turner keith-turner left a comment

Thanks for the contribution, @srajtiwari. This is a large PR, so it's going to take some time for me to work through it. I noticed some binary files, like OMS-collectd.pp; what are those?

Review comments: lib/muchos/azure.py, ansible/roles/common/tasks/os.yml
@ctubbsii
Member

ctubbsii commented Aug 8, 2019

Hi @srajtiwari. This is quite a large pull request, and I think it's going to take a bit of time to review. In the future, it would probably help to submit smaller, more narrowly scoped contributions. This pull request accomplishes 6 different bullet points; those could probably have been 6 different pull requests.

Also, I noticed that there are some binary files in here. Can you explain what those are? We probably aren't going to be able to accept binary files in the pull request, since binaries are not "open source" by any standard definition.

@arvindshmicrosoft
Member

@ctubbsii and @keith-turner, firstly, thank you for your comments. We acknowledge the feedback about the size of the PR, and in the future we will definitely scope contributions down to much smaller chunks.

About the binary files: these are SELinux policy modules obtained using audit2allow, which permit the statsd plugin (for collectd) to bind to port 8125 and allow collectd to talk to the Azure Log Analytics agent. We will figure out a way to avoid checking in those binary files and instead have Ansible tasks that generate and copy them per deployment, again with a conditional for cluster_type == azure.

@ctubbsii
Member

ctubbsii commented Aug 8, 2019

About the binary files, these are SELinux policy modules

Oh, that makes sense. You could probably just check in the .te file then, along with instructions or a script to compile it.
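
A minimal sketch of what that approach could look like, assuming a checked-in plain-text OMS-collectd.te source and hypothetical task names and paths (not the actual tasks from this PR), compiled with the standard SELinux toolchain and gated on cluster_type == azure:

# Hypothetical Ansible tasks: compile a checked-in SELinux type-enforcement
# source (.te) into a policy package (.pp) on each host and load it, for
# Azure clusters only. File names and paths are illustrative.
- name: Copy SELinux type enforcement source
  copy:
    src: OMS-collectd.te
    dest: /tmp/OMS-collectd.te
  when: cluster_type == 'azure'

- name: Compile and load the SELinux policy module
  become: yes
  shell: |
    checkmodule -M -m -o /tmp/OMS-collectd.mod /tmp/OMS-collectd.te
    semodule_package -o /tmp/OMS-collectd.pp -m /tmp/OMS-collectd.mod
    semodule -i /tmp/OMS-collectd.pp
  when: cluster_type == 'azure'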

Contributor

@keith-turner keith-turner left a comment

I am still looking at this, but here are the comments I have so far.

Review comments: README.md, conf/muchos.props.example.azure, conf/muchos.props.example, ansible/roles/azure/tasks/create_vmss.yml, ansible/roles/common/tasks/os.yml, ansible/roles/hadoop/tasks/start-dn.yml, ansible/hadoop.yml, ansible/roles/common/tasks/main.yml
@keith-turner
Contributor

@arvindshmicrosoft, @srajtiwari, or @karthick-rn: so far I have only been looking at the code changes for this. I would like to try running these changes in the next day or two, but I think you may still be making changes. Let me know if you think I should wait before giving this a run.

@arvindshmicrosoft
Member

Hi @keith-turner, we have addressed all of the comments (to the best of our knowledge) except for the HA configuration one, which (as discussed) we are tracking via #271; we hope to push that as a separate PR in the next week or so. Please let us know about any other immediate issues or comments you find, and we will quickly triage them and decide whether to address them in this PR or create issue(s) to track their resolution through later PR(s).

Thank you very much again for your patience and advice working through this.

@keith-turner
Contributor

I tried running this branch against EC2 to set up a 12-node cluster, and I ran into a few issues.

  • Accumulo 2.0 does not download; I am still trying to figure out why. Looking at the changes made to the Accumulo Ansible files, I don't see any problems. I am going to look into this some more tomorrow.
  • I am seeing some errors with the zookeeper setup because I chose a leader node type that only had a single ephemeral drive.
  • I saw some errors with the jps commands, but these did not seem to cause a problem.

Below are the zookeeper setup errors I saw. The leader nodes only have /media/ephemeral0.

TASK [zookeeper : Create zookeeper log dir] ****************************************************************************************************************************************************************
Thursday 15 August 2019  21:40:11 +0000 (0:00:01.844)       0:03:27.249 ******* 
fatal: [leader2]: FAILED! => {"changed": false, "msg": "There was an issue creating /media/ephemeral1 as requested: [Errno 13] Permission denied: '/media/ephemeral1'", "path": "/media/ephemeral1/logs/zookeeper"}
fatal: [leader3]: FAILED! => {"changed": false, "msg": "There was an issue creating /media/ephemeral1 as requested: [Errno 13] Permission denied: '/media/ephemeral1'", "path": "/media/ephemeral1/logs/zookeeper"}
changed: [worker1]
fatal: [leader1]: FAILED! => {"changed": false, "msg": "There was an issue creating /media/ephemeral1 as requested: [Errno 13] Permission denied: '/media/ephemeral1'", "path": "/media/ephemeral1/logs/zookeeper"}
changed: [worker6]
changed: [worker7]
changed: [worker8]
changed: [worker9]
changed: [worker4]
changed: [worker5]
changed: [worker3]
changed: [worker2]

Below are the jps errors and the failure to download Accumulo.

PLAY [journalnode] *****************************************************************************************************************************************************************************************

PLAY [namenode[0]] *****************************************************************************************************************************************************************************************

PLAY [namenode[0]] *****************************************************************************************************************************************************************************************

PLAY [namenode] ********************************************************************************************************************************************************************************************

PLAY [namenode[0]] *****************************************************************************************************************************************************************************************

PLAY [namenode[1]] *****************************************************************************************************************************************************************************************

PLAY [workers] *********************************************************************************************************************************************************************************************

TASK [Check if DataNode is running] ************************************************************************************************************************************************************************
Thursday 15 August 2019  21:41:31 +0000 (0:00:00.576)       0:04:47.205 ******* 
fatal: [worker1]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.139149", "end": "2019-08-15 21:41:32.007976", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:31.868827", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [worker2]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.143752", "end": "2019-08-15 21:41:32.085573", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:31.941821", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [worker4]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.143829", "end": "2019-08-15 21:41:32.101101", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:31.957272", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [worker3]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.147803", "end": "2019-08-15 21:41:32.102506", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:31.954703", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [worker5]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.131862", "end": "2019-08-15 21:41:32.120846", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:31.988984", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [worker6]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.146547", "end": "2019-08-15 21:41:32.143614", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:31.997067", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [worker8]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.130241", "end": "2019-08-15 21:41:32.165315", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:32.035074", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [worker7]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.142849", "end": "2019-08-15 21:41:32.183773", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:32.040924", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [worker9]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.131396", "end": "2019-08-15 21:41:32.211191", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:32.079795", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring

TASK [start datanodes] *************************************************************************************************************************************************************************************
Thursday 15 August 2019  21:41:32 +0000 (0:00:00.663)       0:04:47.868 ******* 
changed: [worker1]
changed: [worker2]
changed: [worker3]
changed: [worker4]
changed: [worker5]
changed: [worker6]
changed: [worker7]
changed: [worker8]
changed: [worker9]

PLAY [resourcemanager] *************************************************************************************************************************************************************************************

PLAY [metrics] *********************************************************************************************************************************************************************************************

PLAY [proxy] ***********************************************************************************************************************************************************************************************

PLAY [all:!_] *************************************************************************************************************************************************************************************************

TASK [accumulo : install accumulo from tarball] ************************************************************************************************************************************************************
Thursday 15 August 2019  21:41:34 +0000 (0:00:02.700)       0:04:50.568 ******* 
fatal: [worker1]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
fatal: [worker7]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
fatal: [worker6]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
fatal: [worker8]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
fatal: [worker9]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
fatal: [worker4]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
fatal: [worker5]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
fatal: [worker3]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
fatal: [worker2]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}

PLAY [accumulomaster[0]] ***********************************************************************************************************************************************************************************

PLAY [accumulo] ********************************************************************************************************************************************************************************************

PLAY [workers] *********************************************************************************************************************************************************************************************

PLAY [accumulomaster] **************************************************************************************************************************************************************************************

@arvindshmicrosoft
Member

arvindshmicrosoft commented Aug 15, 2019

@keith-turner, thanks for testing! Could you let me know the EC2 instance type that you were using for the tests? I'll take a look at the logic for assigning the worker data dirs ASAP. Now, given that leader1 failed (due to the drive folder issue), and presuming that leader1 is also the proxy host, the downstream "download Accumulo tarball" task did not run on the proxy (that is Ansible's behavior once a host has failed a task), and hence the install failed as well. So IMHO the second observation, about the Accumulo install failing, is "by design".

Will reply ASAP on the data directory issue.
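
As a minimal illustration of the Ansible behavior described above (a hypothetical two-play playbook, not taken from this PR): once a host fails a task, it is dropped from the remaining plays of the same run, so later tasks never execute on it.

# Play 1: fails on leaders whose instance type lacks /media/ephemeral1.
- hosts: leaders
  tasks:
    - name: Create a log dir on an ephemeral drive (fails if the mount is missing)
      file:
        path: /media/ephemeral1/logs/zookeeper
        state: directory

# Play 2: never runs on any host that failed in Play 1, so the tarball is
# not downloaded there.
- hosts: leaders
  tasks:
    - name: Download the Accumulo tarball
      get_url:
        url: https://archive.apache.org/dist/accumulo/2.0.0/accumulo-2.0.0-bin.tar.gz
        dest: /home/centos/tarballs/accumulo-2.0.0-bin.tar.gz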

@keith-turner
Contributor

Could you let me know the EC2 instance type that you were using for tests?

@arvindshmicrosoft I am trying to use the following settings.

default_instance_type = m5d.xlarge
worker_instance_type = d2.xlarge

given that leader1 failed (due to the drive folder issue) and presuming that leader1 is also the proxy host, it looks like the downstream download Accumulo tarball task did not run on the proxy (that is Ansible's behavior)

@arvindshmicrosoft, thanks for that explanation; it was immensely helpful! I do not know Ansible very well, but I'm quickly learning. While debugging, I ran the following commands on the cluster; the first would not download the Accumulo tarball and the second would.

ansible-playbook ansible/site.yml --extra-vars "azure_proxy_host=_"
ansible-playbook ansible/accumulo.yml --extra-vars "azure_proxy_host=_"

After reading your message I now understand the difference in behavior.

@arvindshmicrosoft
Member

Thanks, @keith-turner, for the details. m5d.xlarge has 1 ephemeral disk (as you said) and d2.xlarge has 3. Based on this, default_data_dirs contains just the '/media/ephemeral0' location, while worker_data_dirs contains all 3 ephemeral disks.

Now, unfortunately, it looks like the previous Zookeeper and Hadoop roles were loosely using worker_data_dirs, and our changes continued to use worker_data_dirs. From our understanding of the implementation, all references on the leader nodes should only use default_data_dirs. This appears to be a latent bug which has now been uncovered indirectly.
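
A rough sketch of the kind of change implied here, using the variable names discussed above (hypothetical task; not the actual fix that was later committed): a leader-side task such as the ZooKeeper log-dir creation would reference default_data_dirs rather than worker_data_dirs.

# Before (hypothetical): assumes the worker disk layout exists on every node.
- name: Create zookeeper log dir
  file:
    path: "{{ worker_data_dirs[1] }}/logs/zookeeper"
    state: directory

# After (hypothetical): leaders rely only on the default single-disk layout.
- name: Create zookeeper log dir
  file:
    path: "{{ default_data_dirs[0] }}/logs/zookeeper"
    state: directory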

To unblock, using a homogeneous cluster with a common instance type across leaders and workers would work around this issue, but @karthick-rn and I discussed this and we feel we should address the latent bug. The only question is when: would you prefer that we create a new issue and address it in the next week or so with a fresh PR, or would you prefer that we bundle the fix for this new (latent) bug into this existing PR itself?

@keith-turner
Contributor

The only question is when: would you prefer that we create a new issue and address it in the next week or so with a fresh PR, or would you prefer that we bundle the fix for this new (latent) bug into this existing PR itself?

I can't really say, because I don't know enough about the bug and how it manifests in this branch vs. master. I do know the config using m5d.xlarge and d2.xlarge works in the master branch. One of my goals before merging this was to ensure that EC2 still works, but I have not been able to do that yet.

@arvindshmicrosoft
Member

I think you have implicitly answered our question, @keith-turner. Given that you are seeing a regression from current master, we will work to make sure our current PR is stable on EC2.

@SlickNik
Contributor

@keith-turner @ctubbsii I've updated the PR to encompass the minimum set of changes that are needed to support Azure-based installs. We will follow up with separate PRs for the other (monitoring, optional HA) pieces. Thanks for your help with this!

Member

@ctubbsii ctubbsii left a comment

This is a much smaller and easier-to-parse change than the previous one. Thank you for trimming this down to the minimum needed for Azure support and deferring the further changes.

Review comments: README.md, ansible/roles/azure/vars/.gitignore, ansible/roles/common/tasks/hosts.yml, conf/checksums, lib/muchos/config.py, lib/muchos/util.py
* Add new cluster type 'azure' which leverages VM Scale Sets
* Increase some CentOS defaults to improve cluster stability
* Add checksums for specific Spark and Hadoop versions, as well as for
  Accumulo 2.0.0
Member

@ctubbsii ctubbsii left a comment

This looks good to me.
@keith-turner what do you think?
@SlickNik In the future, it's probably best to avoid force pushes, especially when responding to review feedback, since they make it slightly harder to review. We typically squash-merge at the end anyway. When you went back to the "minimal" approach, a force push probably made sense, but this last change could have been a regular commit. No big deal, though; I manually did a diff since my last review 😸

@SlickNik
Contributor

SlickNik commented Sep 5, 2019

@ctubbsii 👍 Sounds good; I will keep that in mind. It makes total sense since we're doing a squash and merge. It's just a habit from working with other projects that merge without squashing, where 'checkpoint' commits littering the git log can make the history distracting! 😇

Contributor

@keith-turner keith-turner left a comment

@SlickNik, this was much easier to review; thanks for simplifying it. I have finished reviewing this, and now I am going to test running it to ensure it still works on EC2.

Review comments: README.md, ansible/roles/common/tasks/azure.yml, conf/muchos.props.example
* Addition of comments to [azure] section in muchos.props.example
* Changes to README.md
* Minor spelling error corrections
* Update config.py to be stylistically consistent

Signed-off-by: Nikhil Manchanda <SlickNik@gmail.com>
Contributor

@keith-turner keith-turner left a comment

I was able to run this branch on EC2 w/o issue.

@keith-turner keith-turner merged commit 5861e87 into apache:master Sep 9, 2019
@keith-turner
Contributor

Thanks for the contribution, everyone! If any of you would like to be listed as a contributor, please edit Fluo's people page.

The Apache Fluo project tweets about first contributions. If any of you would like a tweet about this PR, just let me know what Twitter handles to include.

@ctubbsii
Member

ctubbsii commented Sep 9, 2019

In addition to being listed as contributors, if anybody is interested in continuing to contribute to Fluo, please consider subscribing to the developer mailing list and perhaps introducing yourself on that list.
