
Support Accumulo installs on Microsoft Azure #270

Merged: 2 commits into apache:master on Sep 9, 2019

Conversation

srajtiwari
Contributor

  • Add new cluster type 'azure' which leverages VM Scale Sets
  • Add HA (high-availability) capabilities for the Hadoop Name Node,
    Accumulo master, and Zookeeper roles within Muchos. Note: HA is on by
    default and should not be disabled
  • Enable central collection of metrics and logs using Azure Monitor
  • Increase some CentOS defaults to improve cluster stability
  • Fix latent bugs which prevent Spark from being set up correctly
  • Add checksums for specific Spark and Hadoop versions, as well as for
    Accumulo 2.0.0

@keith-turner keith-turner self-assigned this Aug 8, 2019
Contributor

@keith-turner keith-turner left a comment

Thanks for the contribution, @srajtiwari. This is a large PR, so it's going to take some time for me to work through it. I noticed some binary files, like OMS-collectd.pp; what are those?

Review comments: lib/muchos/azure.py, ansible/roles/common/tasks/os.yml
@ctubbsii
Member

ctubbsii commented Aug 8, 2019

Hi @srajtiwari. This is quite a large pull request, and I think it's going to take a bit of time to review. In the future, it would probably help to submit smaller, more narrowly scoped contributions. This pull request accomplishes 6 different bullet points; those could probably have been 6 different pull requests.

Also, I noticed that there are some binary files in here. Can you explain what those are? We probably aren't going to be able to accept binary files in the pull request, since binaries are not "open source" by any standard definition.

@arvindshmicrosoft
Member

@ctubbsii and @keith-turner, firstly, thank you for your comments. We acknowledge the feedback about the size of the PR, and in the future we will definitely scope contributions down to much smaller chunks.

About the binary files: these are SELinux policy modules obtained using audit2allow, which permit the statsd plugin (for collectd) to bind to port 8125 and allow collectd to talk to the Azure Log Analytics agent. We will figure out a way to avoid checking in those binary files and instead have Ansible tasks that generate and copy them per deployment, again with a conditional for cluster_type == azure.

@ctubbsii
Member

ctubbsii commented Aug 8, 2019

About the binary files, these are SELinux policy modules

Oh, that makes sense. You could probably just check in the .te file then, along with instructions or a script to compile it.
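
A minimal sketch of what that approach could look like, assuming a checked-in plain-text OMS-collectd.te source and hypothetical task names and paths (not the actual tasks from this PR), compiled with the standard SELinux toolchain and gated on cluster_type == azure:

# Hypothetical Ansible tasks: compile a checked-in SELinux type-enforcement
# source (.te) into a policy package (.pp) on each host and load it, for
# Azure clusters only. File names and paths are illustrative.
- name: Copy SELinux type enforcement source
  copy:
    src: OMS-collectd.te
    dest: /tmp/OMS-collectd.te
  when: cluster_type == 'azure'

- name: Compile and load the SELinux policy module
  become: yes
  shell: |
    checkmodule -M -m -o /tmp/OMS-collectd.mod /tmp/OMS-collectd.te
    semodule_package -o /tmp/OMS-collectd.pp -m /tmp/OMS-collectd.mod
    semodule -i /tmp/OMS-collectd.pp
  when: cluster_type == 'azure'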

Contributor

@keith-turner keith-turner left a comment

I am still looking at this, but here are the comments I have so far.

Review comments: README.md, conf/muchos.props.example.azure, conf/muchos.props.example, ansible/roles/azure/tasks/create_vmss.yml, ansible/roles/common/tasks/os.yml, ansible/roles/hadoop/tasks/start-dn.yml, ansible/hadoop.yml, ansible/roles/common/tasks/main.yml
@keith-turner
Contributor

@arvindshmicrosoft, @srajtiwari, or @karthick-rn: so far I have only been looking at the code changes for this. I would like to try running these changes in the next day or two, but I think you may still be making changes. Let me know if you think I should wait before giving this a run.

@arvindshmicrosoft
Member

Hi @keith-turner, we have addressed all of the comments (to the best of our knowledge) except for the HA configuration one, which (as discussed) we are tracking via #271; we hope to push that as a separate PR in the next week or so. Please let us know about any other immediate issues or comments you find, and we will quickly triage them and decide whether to address them in this PR or create issue(s) to track their resolution through later PR(s).

Thank you very much again for your patience and advice working through this.

@keith-turner
Contributor

I tried running this branch against EC2 to set up a 12-node cluster, and I ran into a few issues.

  • Accumulo 2.0 does not download; I am still trying to figure out why. Looking at the changes made to the Accumulo Ansible files, I don't see any problems. I am going to look into this some more tomorrow.
  • I am seeing some errors with the zookeeper setup because I chose a leader node type that only had a single ephemeral drive.
  • I saw some errors with the jps commands, but these did not seem to cause a problem.

Below are the zookeeper setup errors I saw. The leader nodes only have /media/ephemeral0.

TASK [zookeeper : Create zookeeper log dir] ****************************************************************************************************************************************************************
Thursday 15 August 2019  21:40:11 +0000 (0:00:01.844)       0:03:27.249 ******* 
fatal: [leader2]: FAILED! => {"changed": false, "msg": "There was an issue creating /media/ephemeral1 as requested: [Errno 13] Permission denied: '/media/ephemeral1'", "path": "/media/ephemeral1/logs/zookeeper"}
fatal: [leader3]: FAILED! => {"changed": false, "msg": "There was an issue creating /media/ephemeral1 as requested: [Errno 13] Permission denied: '/media/ephemeral1'", "path": "/media/ephemeral1/logs/zookeeper"}
changed: [worker1]
fatal: [leader1]: FAILED! => {"changed": false, "msg": "There was an issue creating /media/ephemeral1 as requested: [Errno 13] Permission denied: '/media/ephemeral1'", "path": "/media/ephemeral1/logs/zookeeper"}
changed: [worker6]
changed: [worker7]
changed: [worker8]
changed: [worker9]
changed: [worker4]
changed: [worker5]
changed: [worker3]
changed: [worker2]

Below are the jps errors and the failure to download Accumulo.

PLAY [journalnode] *****************************************************************************************************************************************************************************************

PLAY [namenode[0]] *****************************************************************************************************************************************************************************************

PLAY [namenode[0]] *****************************************************************************************************************************************************************************************

PLAY [namenode] ********************************************************************************************************************************************************************************************

PLAY [namenode[0]] *****************************************************************************************************************************************************************************************

PLAY [namenode[1]] *****************************************************************************************************************************************************************************************

PLAY [workers] *********************************************************************************************************************************************************************************************

TASK [Check if DataNode is running] ************************************************************************************************************************************************************************
Thursday 15 August 2019  21:41:31 +0000 (0:00:00.576)       0:04:47.205 ******* 
fatal: [worker1]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.139149", "end": "2019-08-15 21:41:32.007976", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:31.868827", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [worker2]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.143752", "end": "2019-08-15 21:41:32.085573", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:31.941821", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [worker4]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.143829", "end": "2019-08-15 21:41:32.101101", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:31.957272", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [worker3]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.147803", "end": "2019-08-15 21:41:32.102506", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:31.954703", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [worker5]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.131862", "end": "2019-08-15 21:41:32.120846", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:31.988984", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [worker6]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.146547", "end": "2019-08-15 21:41:32.143614", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:31.997067", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [worker8]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.130241", "end": "2019-08-15 21:41:32.165315", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:32.035074", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [worker7]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.142849", "end": "2019-08-15 21:41:32.183773", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:32.040924", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [worker9]: FAILED! => {"changed": false, "cmd": "jps | grep \" DataNode\" | grep -v grep", "delta": "0:00:00.131396", "end": "2019-08-15 21:41:32.211191", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 21:41:32.079795", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring

TASK [start datanodes] *************************************************************************************************************************************************************************************
Thursday 15 August 2019  21:41:32 +0000 (0:00:00.663)       0:04:47.868 ******* 
changed: [worker1]
changed: [worker2]
changed: [worker3]
changed: [worker4]
changed: [worker5]
changed: [worker6]
changed: [worker7]
changed: [worker8]
changed: [worker9]

PLAY [resourcemanager] *************************************************************************************************************************************************************************************

PLAY [metrics] *********************************************************************************************************************************************************************************************

PLAY [proxy] ***********************************************************************************************************************************************************************************************

PLAY [all:!_] *************************************************************************************************************************************************************************************************

TASK [accumulo : install accumulo from tarball] ************************************************************************************************************************************************************
Thursday 15 August 2019  21:41:34 +0000 (0:00:02.700)       0:04:50.568 ******* 
fatal: [worker1]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
fatal: [worker7]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
fatal: [worker6]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
fatal: [worker8]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
fatal: [worker9]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
fatal: [worker4]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
fatal: [worker5]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
fatal: [worker3]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
fatal: [worker2]: FAILED! => {"changed": false, "msg": "Could not find or access '/home/centos/tarballs/accumulo-2.0.0-bin.tar.gz' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}

PLAY [accumulomaster[0]] ***********************************************************************************************************************************************************************************

PLAY [accumulo] ********************************************************************************************************************************************************************************************

PLAY [workers] *********************************************************************************************************************************************************************************************

PLAY [accumulomaster] **************************************************************************************************************************************************************************************

@arvindshmicrosoft
Member

arvindshmicrosoft commented Aug 15, 2019

@keith-turner, thanks for testing! Could you let me know the EC2 instance type that you were using for the tests? I'll take a look at the logic for assigning the worker data dirs ASAP. Now, given that leader1 failed (due to the drive folder issue), and presuming that leader1 is also the proxy host, the downstream "download Accumulo tarball" task did not run on the proxy (that is Ansible's behavior once a host has failed a task), and hence the install failed as well. So IMHO the second observation, about the Accumulo install failing, is "by design".

Will reply ASAP on the data directory issue.
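
As a minimal illustration of the Ansible behavior described above (a hypothetical two-play playbook, not taken from this PR): once a host fails a task, it is dropped from the remaining plays of the same run, so later tasks never execute on it.

# Play 1: fails on leaders whose instance type lacks /media/ephemeral1.
- hosts: leaders
  tasks:
    - name: Create a log dir on an ephemeral drive (fails if the mount is missing)
      file:
        path: /media/ephemeral1/logs/zookeeper
        state: directory

# Play 2: never runs on any host that failed in Play 1, so the tarball is
# not downloaded there.
- hosts: leaders
  tasks:
    - name: Download the Accumulo tarball
      get_url:
        url: https://archive.apache.org/dist/accumulo/2.0.0/accumulo-2.0.0-bin.tar.gz
        dest: /home/centos/tarballs/accumulo-2.0.0-bin.tar.gz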

@keith-turner
Contributor

Could you let me know the EC2 instance type that you were using for tests?

@arvindshmicrosoft I am trying to use the following settings.

default_instance_type = m5d.xlarge
worker_instance_type = d2.xlarge

given that leader1 failed (due to the drive folder issue) and presuming that leader1 is also the proxy host, it looks like the downstream download Accumulo tarball task did not run on the proxy (that is Ansible's behavior)

@arvindshmicrosoft, thanks for that explanation; it was immensely helpful! I do not know Ansible very well, but I'm quickly learning. While debugging, I ran the following commands on the cluster; the first would not download the Accumulo tarball and the second would.

ansible-playbook ansible/site.yml --extra-vars "azure_proxy_host=_"
ansible-playbook ansible/accumulo.yml --extra-vars "azure_proxy_host=_"

After reading your message I now understand the difference in behavior.

@arvindshmicrosoft
Member

Thanks, @keith-turner, for the details. m5d.xlarge has 1 ephemeral disk (as you said) and d2.xlarge has 3. Based on this, default_data_dirs contains just the '/media/ephemeral0' location, while worker_data_dirs contains all 3 ephemeral disks.

Now, unfortunately, it looks like the previous Zookeeper and Hadoop roles were loosely using worker_data_dirs, and our changes continued to use worker_data_dirs. From our understanding of the implementation, all references on the leader nodes should only use default_data_dirs. This appears to be a latent bug which has now been uncovered indirectly.
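
A rough sketch of the kind of change implied here, using the variable names discussed above (hypothetical task; not the actual fix that was later committed): a leader-side task such as the ZooKeeper log-dir creation would reference default_data_dirs rather than worker_data_dirs.

# Before (hypothetical): assumes the worker disk layout exists on every node.
- name: Create zookeeper log dir
  file:
    path: "{{ worker_data_dirs[1] }}/logs/zookeeper"
    state: directory

# After (hypothetical): leaders rely only on the default single-disk layout.
- name: Create zookeeper log dir
  file:
    path: "{{ default_data_dirs[0] }}/logs/zookeeper"
    state: directory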

To unblock, using a homogeneous cluster with a common instance type across leaders and workers would work around this issue, but @karthick-rn and I discussed this and we feel we should address the latent bug. The only question is when: would you prefer that we create a new issue and address it in the next week or so with a fresh PR, or would you prefer that we bundle the fix for this new (latent) bug into this existing PR itself?

@keith-turner
Contributor

The only question is when: would you prefer that we create a new issue and address it in the next week or so with a fresh PR, or would you prefer that we bundle the fix for this new (latent) bug into this existing PR itself?

I can't really say, because I don't know enough about the bug and how it manifests in this branch vs. master. I do know the config using m5d.xlarge and d2.xlarge works in the master branch. One of my goals before merging this was to ensure that EC2 still works, but I have not been able to do that yet.

@arvindshmicrosoft
Member

I think you have implicitly answered our question, @keith-turner. Given that you are seeing a regression from current master, we will work to make sure our current PR is stable on EC2.

@SlickNik
Contributor

@keith-turner @ctubbsii I've updated the PR to encompass the minimum set of changes that are needed to support Azure-based installs. We will follow up with separate PRs for the other (monitoring, optional HA) pieces. Thanks for your help with this!

Member

@ctubbsii ctubbsii left a comment

This is a much smaller and easier-to-parse change than the previous one. Thank you for trimming this down to the minimum needed for Azure support and deferring the further changes.

Review comments: README.md, ansible/roles/azure/vars/.gitignore, ansible/roles/common/tasks/hosts.yml, conf/checksums, lib/muchos/config.py, lib/muchos/util.py
* Add new cluster type 'azure' which leverages VM Scale Sets
* Increase some CentOS defaults to improve cluster stability
* Add checksums for specific Spark and Hadoop versions, as well as for
  Accumulo 2.0.0
Member

@ctubbsii ctubbsii left a comment

This looks good to me.
@keith-turner what do you think?
@SlickNik In the future, it's probably best to avoid force pushes, especially when responding to review feedback, since they make it slightly harder to review. We typically squash-merge at the end anyway. When you went back to the "minimal" approach, a force push probably made sense, but this last change could have been a regular commit. No big deal, though; I manually did a diff since my last review 😸

@SlickNik
Contributor

SlickNik commented Sep 5, 2019

@ctubbsii 👍 Sounds good; I will keep that in mind. It makes total sense since we're doing a squash and merge. It's just a habit from working with other projects that merge without squashing, where 'checkpoint' commits littering the git log can make the history distracting! 😇

Contributor

@keith-turner keith-turner left a comment

@SlickNik, this was much easier to review; thanks for simplifying it. I have finished reviewing this, and now I am going to test running it to ensure it still works on EC2.

Review comments: README.md, ansible/roles/common/tasks/azure.yml, conf/muchos.props.example
* Addition of comments to [azure] section in muchos.props.example
* Changes to README.md
* Minor spelling error corrections
* Update config.py to be stylistically consistent

Signed-off-by: Nikhil Manchanda <SlickNik@gmail.com>
Contributor

@keith-turner keith-turner left a comment

I was able to run this branch on EC2 w/o issue.

@keith-turner keith-turner merged commit 5861e87 into apache:master Sep 9, 2019
@keith-turner
Contributor

Thanks for the contribution, everyone! If any of you would like to be listed as a contributor, please edit Fluo's people page.

The Apache Fluo project tweets about first contributions. If any of you would like a tweet about this PR, just let me know what Twitter handles to include.

@ctubbsii
Member

ctubbsii commented Sep 9, 2019

In addition to being listed as contributors, if anybody is interested in continuing to contribute to Fluo, please consider subscribing to the developer mailing list and perhaps introducing yourself on that list.
