API call controller #3598
Comments
Hi, is there any progress regarding this configuration? I need to configure the ILM policy with the operator but am not able to find anything helpful. The only solution left for me is a wrapper command over kubeconfig that first deploys the operator and then makes a PUT call to apply that policy.
I would love to see this going upstream. In my org we have multiple Elasticsearch clusters with similar configuration. Having the ability to configure ILM and SLM policies as well as RBAC via configuration files would be of great help! Hopefully ECK will handle this shortly. Thanks for this.
@uric, @bleskes - @anyasabo and I discussed this today; to reiterate the concerns around this (to make sure I heard this right):
The proposed solution POC (#3775) resolved #1 by having a generic implementation that can be extended to any API (as long as the PUTs are idempotent). I am not sure whether #2 is a concern; I wanted to check on this and on whether there are other concerns. From our conversations with the Elasticsearch team, we decided to implement this in ECK and not in the stack because it seemed significantly lower in terms of effort.
I love the creativity in coming up with the "Sequence of API calls" direction. I have a ton of questions, out of curiosity, but I trust the team will figure it out and I'm looking forward to hearing the solutions. There is also an interesting relationship to the features of the package manager being developed (which may conceptually include things like SLM/ILM policy), but I can see how this API-based approach can serve as a foundation for many creative use cases. I do have one ask - can we maybe think about another name for it that better describes the scope/foundational character of the feature? There's a big gap between Declarative Config and a series of API calls you can issue against a cluster. I actually had to see the examples to understand what it is about. I think naming it right will help people understand what they can do with it and how to think about it, which will actually result in wider adoption. Just to give an idea of what I mean - "API sequences", "API fixtures", etc.
A few things I'm wondering while looking at the draft PR:
For the first three, that's definitely valid and I tried to address them in the initial issue. My main hesitation with adding it to the ES CRD is the amount of uncertainty around how we want to model this feature. Modifying the ES CRD is hard at this point as it is v1. We can add things to it, but if we need to remove or modify them later it's a Big Deal. That was my main motivation for making it a separate, alpha resource. Once we've proved out the design in practice we can still add it to the ES CRD later if we think it makes sense. If it weren't for this concern I'd have been in favor of including it in the ES CRD. A configmap works (since this is essentially a list of k/v pairs) but has at least two downsides:
🤷♀️ I contacted the ES team to take a look but haven't heard back yet. I'll ask again on Monday.
That's valid. It feels a bit like an anti-pattern to store all of the state indefinitely in the CR, where most controllers reconstruct their state via observation whenever possible and only store it in etcd when this is not possible. I think I'd be open to it if we find there are more severe problems with this approach and we need to look for alternatives, but as is I prefer retrieving the state from ES.
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "10d",
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
On create, many APIs take an "anonymous" object like the one above; the name is communicated via the path in the request URL. On GET, however, the response wraps the object under its name:
{
  "my_policy": {
    "version": 1,
    "modified_date": 82392349,
    "policy": {
      "phases": {
        "warm": {
          "min_age": "10d",
          "actions": {
            "forcemerge": {
              "max_num_segments": 1
            }
          }
        },
        "delete": {
          "min_age": "30d",
          "actions": {
            "delete": {}
          }
        }
      }
    }
  }
}
If I am understanding the suggested approach correctly, this means the diff would always be non-empty and we would re-apply the PUT on every reconciliation. I wonder how many Elasticsearch APIs are truly idempotent? With SLM and ILM, and thus two major APIs, already excluded, maybe we should consider the hashing approach @sebgl suggested after all to make this feature more useful?
PS: An alternative approach to hashing, which admittedly has the drawback of accumulating all the state in annotations, would be to make this feature not fully generic. Instead we would encode knowledge about the structure of the major APIs we would support. Based off of this ILM example, we could have a map of API paths to the key path under which the comparable object is nested in the GET response:
map[string][]string{
  "_ilm/policy/\1": []string{"\1", "policy"},
}
Note that the path contains a dynamic component in the form of the name of the policy, which I suggest matching from the last path segment of the API. If the user specs an unknown API we would fall back to the current behaviour from the draft PR and assume the GET response can be diffed directly against the PUT body.
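For illustration, here is a rough sketch (not from the draft PR; the function and variable names are made up) of how such a map could be used to pull the comparable object out of a GET response before diffing it against the user-supplied PUT body:
package apicompare

import (
	"encoding/json"
	"fmt"
)

// knownAPIs maps an API path pattern to the key path under which the comparable
// object is nested in the GET response. "\1" stands for the resource name
// captured from the last path segment, mirroring the map sketched above.
var knownAPIs = map[string][]string{
	`_ilm/policy/\1`: {`\1`, "policy"},
}

// extractComparable walks a GET response along keyPath and returns the
// sub-document that can be diffed against the user-supplied PUT body.
func extractComparable(response []byte, keyPath []string) (map[string]interface{}, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal(response, &doc); err != nil {
		return nil, err
	}
	for _, key := range keyPath {
		child, ok := doc[key].(map[string]interface{})
		if !ok {
			return nil, fmt.Errorf("key %q not found in response", key)
		}
		doc = child
	}
	return doc, nil
}
For /_ilm/policy/my_policy the captured name would be my_policy, so the key path {"my_policy", "policy"} would strip the server-added version and modified_date fields before the comparison.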
I was wondering how we should handle deletion. What if the user wants to delete something? (Creating objects at scale is great; being able to fix a mistake at scale is great too.) Is it out of the scope of this feature? Also, I think I agree with @bleskes about the naming. Modern declarative tools like Terraform have advanced management of object lifecycles. It might be hard to offer a similar experience if the user has to provide a URL rather than a desired state.
My initial thinking was that yes, it is out of scope (at least for now) because it is hard. I think it may be possible, though, to extend the
I think we need to be explicit about deletes -- just removing an item from the list should not trigger deleting the resource, IMO.
One other extension point I thought of is that adding another field containing query parameters to use only on PUTs might make sense. For instance, component templates have a flag allowing you to update them. You can't put the parameters in the URL because then that will break the GETs. I suppose alternately we could strip any query parameters when doing GETs as well, but that feels less obvious to the end user. It doesn't seem like something that needs to be added at launch, but it could make sense later.
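As a sketch of the strip-on-GET alternative mentioned above (assuming the user-provided path may carry query parameters, e.g. ?create=true, that are only meant for the PUT):
package apicompare

import "net/url"

// getPath strips query parameters from a user-provided path so they apply only
// to the PUT, not to the GET used for comparison.
func getPath(rawPath string) (string, error) {
	u, err := url.Parse(rawPath)
	if err != nil {
		return "", err
	}
	u.RawQuery = "" // drop PUT-only parameters so the GET still resolves
	return u.String(), nil
}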
In an out-of-band discussion, we decided to put this approach on pause for the time being. It is feasible, but we would need to add support for specific APIs one by one, as we cannot depend on consistency between different Elasticsearch APIs. This could have been caught sooner as an issue with the approach, and that's on me. We decided to investigate an alternative option of having an example k8s job that simply sends PUTs until a success response code is returned from the API. This is more of a "bootstrap" use case than a configuration management use case, but it should meet a lot of needs for not that much effort. I'm not sure there is an approach that can satisfy the "reconciliation loop" model without explicitly supporting various APIs. I left my branch https://github.com/anyasabo/cloud-on-k8s/tree/esdeclare up in case we decide to revisit this approach. We would just need to weigh the benefits against the large maintenance costs of supporting such a feature. It may also be worth further investigation with the ES clients team to see how they have resolved similar issues; my initial look was that they mostly did not, but it was very preliminary and could use further investigation. The current state of the branch has unit tests for some specific API responses (pulled from an actual ES server) and their transformations, and the framework for e2e tests, but no e2e tests yet.
We decided not to call it "Declarative ES Config"; a more intelligent name is TBD.
If we consider the differences between having an operator vs. using Helm/Terraform/another tool to orchestrate an Elasticsearch cluster, the advantage of the operator is that it is able to constantly ensure that the cluster is in the desired state and prevent configuration drift (immediate feedback vs. delayed feedback). Ensuring that all changes to a cluster are documented (e.g. as Git commits) and having the confidence that re-creating the cluster via the last manifest brings up the same cluster are quite desirable traits from an ops perspective. As Anya's work illustrates, building such a feature externally is a brittle proposition because it is impossible to do without intimate knowledge of the Elasticsearch API across different versions, and it means constantly playing catch-up to keep abreast of developments in new versions. Furthermore, the need to employ polling techniques to ensure that the cluster is in the desired state is inefficient and could even end up overloading busy clusters. Therefore, I think the best place to implement this feature is in Elasticsearch itself. I experimented a bit with the "fire-and-forget" alternative.
This work is available in the https://github.com/charith-elastic/cloud-on-k8s/tree/feature/api-conf branch.
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: hulk
spec:
  version: 7.10.0
  nodeSets:
  - name: default
    count: 3
    config:
      node.store.allow_mmap: false
    podTemplate:
      spec:
        readinessGates:
        - conditionType: eck.k8s.elastic.co/post-provision
---
apiVersion: batch/v1
kind: Job
metadata:
  name: hulk-post-provision
spec:
  ttlSecondsAfterFinished: 300
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: postprovision
        args:
        - postprovision
        - --jobdef=/jobs/job.yaml
        - --log-verbosity=1
        image: docker.elastic.co/eck-dev/eck-operator-cell:1.4.0-SNAPSHOT-5851b163
        volumeMounts:
        - mountPath: /jobs
          name: jobs
          readOnly: true
        securityContext:
          runAsNonRoot: true
      serviceAccountName: hulk-post-provision
      volumes:
      - name: jobs
        configMap:
          name: hulk-post-provision
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: hulk-post-provision
data:
  job.yaml: |-
    target:
      kind: Elasticsearch
      namespace: default
      name: hulk
    clientConf:
      requestTimeout: 30s
      retryAttempts: 3
      retryBackoff: 10s
      retryMaxDuration: 150s
    apiCalls:
    - method: PUT
      path: /_ilm/policy/my_policy
      payload: {
        "policy": {
          "phases": {
            "hot": {
              "actions": {
                "rollover": {
                  "max_size": "25GB"
                }
              }
            },
            "delete": {
              "min_age": "30d",
              "actions": {
                "delete": {}
              }
            }
          }
        }
      }
      successCodes: [200]
      retry: true
    - method: PUT
      path: /_cluster/settings
      payload: {
        "transient": {
          "indices.lifecycle.poll_interval": "3m"
        }
      }
      successCodes: [200]
      retry: true
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hulk-post-provision
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hulk-post-provision
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - secrets
  verbs:
  - get
  - list
- apiGroups:
  - elasticsearch.k8s.elastic.co
  resources:
  - elasticsearches
  verbs:
  - get
  - list
  - update
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hulk-post-provision
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: hulk-post-provision
subjects:
- kind: ServiceAccount
  name: hulk-post-provision
The issues with this approach are:
We could potentially introduce a new CRD or a new field to the existing CRD to alleviate some of the above problems. The question is whether we are comfortable with that level of deep integration for a brittle feature like this.
I'd like to expand a bit on the hashing approach I proposed some time ago.
CRD
First, what the CRD would look like (nothing new here) - I think this would be a section in the existing Elasticsearch CRD:
api_requests:
- path: /_security/role_mapping/sso-kibana-123
  method: PUT
  body: |
    {
      "rules": {
        "all": [
          {
            "field": {
              "realm.name": "cloud-saml-kibana-abcdeaad866da82f61ed1"
            }
          },
          {
            "field": {
              "groups": "superuser"
            }
          }
        ]
      },
      "enabled": true,
      "roles": ["superuser"],
      "metadata": {}
    }
- path: /_snapshot/cloud-snapshots
  method: PUT
  body: |
    {
      "cloud-snapshots": {
        "type": "gcs",
        "settings": {
          "bucket": "abcdfd3f246639bfbaa5b01fb0",
          "client": "client-a9fc50",
          "base_path": "snapshots/abcdfd3f246639bfbaa5b01fb0",
          "region": "us-central1",
          "email": "abcd@cloud-email.com",
          "aws_account": "abcd-us-central1"
        }
      }
    }
- path: /_slm/policy/cloud-snapshot-policy
  verb: PUT
  body: |
    {
      "name": "<cloud-snapshot-{now/d}>",
      "schedule": "0 */30 * * * ?",
      "repository": "cloud-snapshots",
      "config": {
        "partial": true
      },
      "retention": {
        "expire_after": "259200s",
        "min_count": 10,
        "max_count": 100
      }
    }
- path: /_cluster/settings
  verb: PUT
  body: |
    {
      "persistent": {
        "action": {
          "auto_create_index": "true"
        },
        "cluster": {
          "routing": {
            "allocation": {
              "disk": {
                "threshold_enabled": "true"
              }
            }
          },
          "indices": {
            "close": {
              "enable": "false"
            }
          }
        }
      }
    }
- path: /_cluster/settings
  verb: PUT
  body: |
    {
      "persistent.snapshot.max_concurrent_operations": null
    }
- path: /_template/.cloud-hot-warm-allocation
  verb: PUT
  body: |
    {
      ".cloud-hot-warm-allocation-0": {
        "index_patterns": [
          "*"
        ],
        "mappings": {},
        "aliases": {},
        "order": 0,
        "settings": {
          "index": {
            "routing": {
              "allocation": {
                "require": {
                  "data": "hot"
                }
              }
            }
          }
        }
      }
    }
- path: /_template/.another-deprecated-template
  verb: DELETE
What the feature is about
I think the general philosophy would be to:
Persisting hashes of successful http calls
Although many Elasticsearch requests materialize to idempotent PUTs, I think it would be a shame to execute the same request over and over again. In most cases users just want to configure things once. For example, the sha1 hash of each successfully executed request could be persisted in an annotation on the Elasticsearch resource:
metadata:
  annotations:
    elasticsearch.k8s.elastic.co/successful_http_requests: "[\"b3b07b553ff32cd0ad9767eebc3bb70b6a4206ca\"]"
We only need to persist hashes matching API requests specified in the specification. If the user removes the snapshot repository request from the specification, we can also remove its hash from the annotation.
At least once http calls, not declarative configuration management
If a user patches the snapshot repository through the Elasticsearch API directly, bypassing the yaml manifest, then it's considered out of the control of the operator. The operator won't try to execute the same http request again, since it already succeeded. This feature is "just" about executing http calls at least once. It does not allow a user to declaratively manage all objects of a given type. For example, let's imagine a user sets up 2 index templates in the specification: removing an index template from the spec does not lead the controller to also notice it should be removed in Elasticsearch. That removal can only be made explicit through a DELETE call:
- path: /_template/.cloud-hot-warm-allocation
  verb: PUT
  body: |
    {
      ".cloud-hot-warm-allocation-0": {
        "index_patterns": [
          "*"
        ],
        "mappings": {},
        "aliases": {},
        "order": 0,
        "settings": {
          "index": {
            "routing": {
              "allocation": {
                "require": {
                  "data": "hot"
                }
              }
            }
          }
        }
      }
    }
- path: /_template/.another-deprecated-template
  verb: DELETE
Another example: remote clusters. They can be configured through the cluster settings API:
PUT /_cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "my-remote-cluster-1": {
          "mode": "proxy",
          "proxy_address": "a542184a7a7d45b88b83f95392f450ab.192.168.44.10.ip.es.io:9400",
          "server_name": "a542184a7a7d45b88b83f95392f450ab.192.168.44.10.ip.es.io"
        }
      }
    }
  }
}
In order to remove that remote cluster, its settings must be explicitly set to null:
PUT /_cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "my-remote-cluster-1": {
          "mode": null,
          "proxy_address": null,
          "server_name": null
        }
      }
    }
  }
}
Those cases may be better handled by an explicit, dedicated mechanism.
Status reporting
In addition to the internal annotation used to persist successful request hashes, I think it would be useful to provide a way for users to know whether all requests have been successfully executed. Which naturally falls into the Elasticsearch resource status:
status:
  health: green
  phase: Ready
  version: 7.15.0
  api_requests: 12/12 # 12 requests in the spec, 12 matching hashes in the annotation
I think it makes sense to keep this status very lightweight, and not go in a direction where the result of each call would be serialized. Rather, if a user notices that not all requests have been applied, they can investigate further.
Success criteria
All Elasticsearch API calls I know about can be considered successful under the following conditions:
Sequential ordering of the requests
There is an order in which requests may need to be executed. For example, a snapshot repository must be created before an SLM policy that refers to it. Therefore all http calls defined in the spec must be executed in the sequential order in which they are defined. If a request fails, the following requests are not executed.
Additional settings
We could provide more options in the specification for users to tweak how requests get executed. However, I think in practice the following works just fine (so I'd rather not add those options for now):
Caching issues with the successful hashes annotation
If we execute all requests and then update the annotation, chances are that the annotation may not be visible yet at the next reconciliation attempt. That's because the controller works with a cached version of the Elasticsearch resource, and the cache has not been populated yet with the updated annotation. We could consider that's not a problem: just run the same idempotent requests again. Or (my preference) rely on the existing expectations mechanism: check the updated resourceVersion, and only reconcile again once the cache has a higher resourceVersion.
Pseudo-code implementation
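A rough sketch of that reconciliation logic, based on the description above (type and helper names are hypothetical, not an actual implementation):
package apicalls

import (
	"context"
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// APIRequest mirrors one entry of the proposed api_requests section.
type APIRequest struct {
	Method string
	Path   string
	Body   string
}

// ESClient is a minimal, hypothetical Elasticsearch HTTP client abstraction.
type ESClient interface {
	Do(ctx context.Context, method, path, body string) (statusCode int, err error)
}

// hashOf identifies a request by the sha1 of its method, path and body.
func hashOf(r APIRequest) string {
	sum := sha1.Sum([]byte(r.Method + " " + r.Path + "\n" + r.Body))
	return hex.EncodeToString(sum[:])
}

// reconcileAPIRequests executes the spec'd requests in order, skipping those whose
// hash is already recorded as successful, and stops at the first failure so that
// later requests that may depend on it are not attempted. It returns the hashes to
// persist in the annotation (only hashes still matching the spec are kept).
func reconcileAPIRequests(ctx context.Context, es ESClient, reqs []APIRequest, alreadyDone map[string]bool) (map[string]bool, error) {
	successful := map[string]bool{}
	for _, r := range reqs {
		h := hashOf(r)
		if alreadyDone[h] {
			successful[h] = true // executed in a previous reconciliation, skip
			continue
		}
		status, err := es.Do(ctx, r.Method, r.Path, r.Body)
		if err != nil {
			return successful, err
		}
		if status >= 400 { // simplification: the real success criteria may accept more codes
			return successful, fmt.Errorf("%s %s returned HTTP %d", r.Method, r.Path, status)
		}
		successful[h] = true
	}
	return successful, nil
}
The caller would then write the returned hashes back to the elasticsearch.k8s.elastic.co/successful_http_requests annotation and requeue on error, relying on the expectations mechanism to avoid acting on a stale cache.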
To solve this minimally, we have implemented a small utility that can be used as a sidecar / job to configure things against the ES/Kibana API.
(sorry for the duplicate) I think it is interesting to see that Elastic made an official Terraform provider so users can natively use Terraform. I hope we will soon get the same UX using the k8s-native way (CRDs).
We would also be very grateful if this could be implemented. Our current hack is to inject a sidecar:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
spec:
  nodeSets:
  - name: default
    podTemplate:
      spec:
        containers:
        - name: elasticsearch-s3-configuration
          image: curlimages/curl
          command:
          - sh
          - -c
          - |
            echo "machine localhost login $USERNAME password $PASSWORD" > $HOME/elastic-credentials
            set -x
            while [ "$(curl -k -s -o /dev/null -w '%{http_code}\n' --netrc-file $HOME/elastic-credentials https://localhost:9200/_snapshot/s3_snapshots)" != "200" ]; do
              curl -k -X PUT --netrc-file $HOME/elastic-credentials https://localhost:9200/_snapshot/s3_snapshots -H 'Content-Type: application/json' -d'
              {
                "type": "s3",
                "settings": {
                  "bucket": "es-backup",
                  "endpoint": "minio.example.org"
                }
              }'
              sleep 10
            done
            sleep infinity
          env:
          - name: USERNAME
            value: elastic
          - name: PASSWORD
            valueFrom:
              secretKeyRef:
                name: logserver-es-elastic-user
                key: elastic
About to implement a hack for this as well. Mergey MacMergeface! MVP! 😄
I believe a direct resource controller would be a better solution.
Closing this issue for now, as it describes a solution to a now-solved problem: the StackConfigPolicy CRD in ECK allows you to configure what people here describe the "API controller" should do.
Hello, is there a standalone tool that supports declarative ES API calls? While this thread, I believe, is mainly about k8s integration, we are a small company that uses managed ES. We have a similar situation where we would like all operations done on ES to be driven by configuration, so that they can be checked into source control, providing consistency across environments and auditability of the changes made. What are our options if we don't use k8s?
@aravindh-odyssey, as mentioned in #3598 (comment), Elastic supports an official Elastic Stack Terraform provider that allows you to manage and configure the Elastic Stack as code using Terraform.
@thbkrkr Thank you for pointing me towards that. This helps.
This issue will be updated with current status as it progresses.
Goals
We would like to be able to declaratively configure (create/update) common configuration operations in Elasticsearch that are currently API-only, such as ILM, index settings, and index templates. This is helpful for organizations that have more than one Elasticsearch cluster but want them to be configured identically. For instance, having separate dev, staging, and production environments, or having an environment per geography. It is helpful to be able to store these settings in configuration management to ensure all environments are identical (see the “infrastructure as code” philosophy). Similar functionality exists for users and roles today.
Today, to have a "fully functional" cluster, users need to create their ES cluster, wait for it to become available, and then interact with the API to create the extra settings. If this could be handled by ECK it would make the process much simpler.
Initial thoughts
We may want to have generic support for essentially any endpoint that supports a PUT. Users could provide a URL and the body, and we could GET/PUT against that URL as necessary. Even idempotent PUTs cause extra load on Elasticsearch, so it would be better for us to do a comparatively cheap GET first rather than blindly sending PUTs.
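As an illustration of that idea, a minimal sketch (the client interface here is hypothetical, not an existing ECK API):
package apicalls

import (
	"context"
	"reflect"
)

// GetPutClient is a minimal, hypothetical Elasticsearch client abstraction.
type GetPutClient interface {
	// Get returns the parsed JSON object at path, or found=false if it does not exist.
	Get(ctx context.Context, path string) (body map[string]interface{}, found bool, err error)
	Put(ctx context.Context, path string, body map[string]interface{}) error
}

// applyIfChanged issues a cheap GET first and only PUTs the desired body when the
// object is missing or differs, instead of blindly PUTting on every reconciliation.
func applyIfChanged(ctx context.Context, es GetPutClient, path string, desired map[string]interface{}) error {
	current, found, err := es.Get(ctx, path)
	if err != nil {
		return err
	}
	if found && reflect.DeepEqual(current, desired) {
		return nil // already in the desired state, avoid the extra write load
	}
	return es.Put(ctx, path, desired)
}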
This resource could be modeled as a configmap, a CRD, or as part of the Elasticsearch manifest. My gut leans towards a CRD to start. Adding fields to a CRD that is already v1 is challenging because you cannot revise them easily. A configmap works well for a key/value list, which is all we really need, but it cannot be versioned easily, whereas with CRDs we at least have the option to use webhook conversion (though we do not use that feature today and there are challenges with deploying it in all environments).
Caveat: some PUTs are not idempotent (e.g. indexing a document revs the version), and SLM policies include execution status information. This may be something where we do not do a strict comparison before setting -- for instance, if a user specifies a doc to be indexed, as long as all the fields the user specified match, we do not consider it as needing an update. This has similar problems to ones we have experienced in the past, in that it makes it difficult to remove fields previously set by the user.
Another option would be to simply leave this as a potential footgun, and to document that users shouldn't use non-idempotent API operations.
Caveat cont'd: some operations have dependencies. For instance, a snapshot repository must be created before an SLM policy referencing it can be created. This can likely be supported by executing operations sequentially within a given resource.
Proposed CRD example
Current status
I'll do a POC of a CRD that looks like that example.
Related
ILM CRD issue
Kibana dashboard issue (ECK repo)
Kibana dashboard issue (Kibana repo)
Elasticsearch has rest api tests that function similarly here