This repository has been archived by the owner on Feb 9, 2022. It is now read-only.

Workaround for crashing Elasticsearch 6.x under EKS. #430
Merged: 1 commit, merged Mar 14, 2019

Conversation

@falfaro (Contributor) commented Mar 12, 2019

No description provided.

@falfaro falfaro added the bug Something isn't working label Mar 12, 2019
@falfaro falfaro self-assigned this Mar 12, 2019
@falfaro falfaro requested review from wojciechka and arapulido March 12, 2019 15:25
@wojciechka left a review comment on docs/quickstart-eks.md (outdated, resolved):

I was thinking of also doing something like --node-ami ${AMI_ID} to avoid someone copy-pasting the wrong AMI ID, but I am not insisting on it.
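
For illustration, the flag could be wired into the quickstart roughly like this (a minimal sketch; the cluster name and AMI ID below are placeholders, not values from this PR):

AMI_ID=ami-0123456789abcdef0   # placeholder: pin the known-good EKS-optimized AMI here
eksctl create cluster \
  --name my-cluster \
  --node-ami "${AMI_ID}"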

@arapulido (Contributor) left a comment:

LGTM

@falfaro (Contributor, Author) commented Mar 12, 2019

bors r+

bors bot commented Mar 12, 2019

👎 Rejected by PR status

@sameersbn (Contributor) commented Mar 13, 2019

IMO we should instead ask users to SSH into their EKS cluster nodes and remove the ulimits configuration of the docker daemon. This can be achieved with (tested):

sudo sed -i '/"nofile": {/,/}/d' /etc/docker/daemon.json
sudo systemctl restart docker

This workaround can go in the troubleshooting doc instead of the quickstart guide.

edit:

The above sed command turns

{
  "bridge": "none",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "10"
  },
  "live-restore": true,
  "max-concurrent-downloads": 10,
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Soft": 2048,
      "Hard": 8192
    }
  }
}

to

{
  "bridge": "none",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "10"
  },
  "live-restore": true,
  "max-concurrent-downloads": 10,
  "default-ulimits": {
  }
}

This is essentially the change in awslabs/amazon-eks-ami#206.
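
A quick, hedged way to confirm the fix took effect after restarting Docker (the pod name and image below are illustrative, not part of the original workaround):

# Run a throwaway pod and print the soft open-files limit it sees; after the
# fix it should no longer report the 2048 soft limit from the old daemon.json.
kubectl run ulimit-check --rm -it --restart=Never --image=busybox -- sh -c 'ulimit -n'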

@falfaro (Contributor, Author) commented Mar 13, 2019

@sameersbn in my opinion, we should make things as easy as possible for users, and adding --node-ami to the eksctl command line is way easier than SSH-ing into N machines, changing the Docker configuration, etc.

Also, if the change is going to be reverted soon, what is the point of adding this documentation to the troubleshooting section? Instead, we can keep it in the Quickstart and, once Amazon fixes the issue, remove this block of documentation.

@arapulido what do you think?

@sameersbn (Contributor):

This workaround only works for newly created clusters. Users with an existing EKS cluster will not be able to take advantage of it.

@wojciechka:

> This workaround only works for newly created clusters. Users with an existing EKS cluster will not be able to take advantage of it.

That's true. What I managed to do was create a new nodegroup with the proper AMI, kubectl drain all my old nodes, and then delete the old nodegroup.

I think eksctl even helps with that - i.e. eksctl-io/eksctl#592

It was a painful process, and the MongoDB instance I had installed for my Kubeapps deployment could not be moved for some reason, but I just uninstalled and reinstalled Kubeapps.
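
A rough sketch of that migration, assuming eksctl (the cluster name, nodegroup names, node name, and AMI ID below are placeholders):

# Create a replacement nodegroup pinned to a known-good AMI.
eksctl create nodegroup --cluster my-cluster --name workers-v2 --node-ami ami-0123456789abcdef0

# Drain each node in the old nodegroup so workloads reschedule onto the new one.
kubectl drain <old-node-name> --ignore-daemonsets --delete-local-data

# Once the old nodes are empty, delete the old nodegroup.
eksctl delete nodegroup --cluster my-cluster --name workers-v1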

@arapulido (Contributor):

I think we should have both: pointing to a previous AMI in the quickstart (and reverting this once it is fixed), and a troubleshooting section on manually changing the ulimits that can stay (for people who already have an EKS cluster).

@falfaro (Contributor, Author) commented Mar 14, 2019

@sameersbn please take another look; I've also added a section for this issue to the troubleshooting doc.


## Troubleshooting Elasticsearch

### Elasticsearch crashloop under EKS
@sameersbn (Contributor) commented on this hunk, Mar 14, 2019:

We should be following the pattern used in the rest of the document, where we first state the issue that's observed (i.e. Elasticsearch enters a crashloop) and then include a Troubleshooting section that walks the user through confirming that the issue is due to the ulimits (i.e. inspecting the logs) and then suggests using the sed command to resolve it.
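
For example, the confirmation step could look like this (the pod name is a placeholder; the quoted log line paraphrases Elasticsearch's standard bootstrap-check message):

# Look for the file-descriptor bootstrap check in the Elasticsearch pod logs.
kubectl logs <elasticsearch-pod-name> | grep "max file descriptors"
# Expected symptom, roughly:
#   max file descriptors [2048] for elasticsearch process is too low, increase to at least [65536]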

@falfaro (Contributor, Author) replied:

What about now?

Two further review comments on docs/troubleshooting.md (outdated, resolved).
@sameersbn (Contributor):

lgtm! minor typos to be resolved

@falfaro (Contributor, Author) commented Mar 14, 2019

bors r+

bors bot added a commit that referenced this pull request Mar 14, 2019
430: Workaround for crashing Elasticsearch 6.x under EKS. r=falfaro a=falfaro



Co-authored-by: Felipe Alfaro Solana <felipe.alfaro@gmail.com>
bors bot commented Mar 14, 2019

Build succeeded
