chore: add skeleton for the stack-monitoring tests #113
Conversation
@@ -66,6 +66,12 @@ var supportedProducts = map[string]*contextMetadata{
},
modules: []string{"metricbeat"},
},
"parity-tests": &contextMetadata{
For newcomers, long story about this map here: #104
It needs an Elasticsearch instance acting as the monitoring cluster, and we will create it using the lifecycle hooks (Befores and Afters)
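To make the idea concrete for newcomers, here is a minimal sketch of how such hooks could bring a monitoring Elasticsearch up and down around each scenario. The helper names, the compose file name and the project name are assumptions for illustration, and the hook signatures follow recent godog releases; the PR itself may wire this differently.

```go
// Sketch only: hypothetical hook wiring, not the actual code in this PR.
package stackmonitoring

import (
	"context"
	"fmt"
	"os/exec"

	"github.com/cucumber/godog"
)

// startMonitoringElasticsearch brings up a second Elasticsearch that acts as
// the monitoring cluster for the scenario. (assumed compose file and project name)
func startMonitoringElasticsearch(ctx context.Context) error {
	cmd := exec.CommandContext(ctx, "docker-compose",
		"-f", "monitoring-elasticsearch.yml",
		"-p", "monitoring", // separate compose project so it coexists with the stack under test
		"up", "-d")
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("starting monitoring elasticsearch: %w\n%s", err, out)
	}
	return nil
}

// stopMonitoringElasticsearch removes the monitoring cluster and its volumes,
// so every scenario starts from a clean state.
func stopMonitoringElasticsearch(ctx context.Context) error {
	cmd := exec.CommandContext(ctx, "docker-compose",
		"-f", "monitoring-elasticsearch.yml", "-p", "monitoring", "down", "-v")
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("stopping monitoring elasticsearch: %w\n%s", err, out)
	}
	return nil
}

// InitializeScenario wires the lifecycle hooks (the "Befores and Afters").
func InitializeScenario(sc *godog.ScenarioContext) {
	sc.Before(func(ctx context.Context, _ *godog.Scenario) (context.Context, error) {
		return ctx, startMonitoringElasticsearch(ctx)
	})
	sc.After(func(ctx context.Context, _ *godog.Scenario, _ error) (context.Context, error) {
		return ctx, stopMonitoringElasticsearch(ctx)
	})
}
```

Running the monitoring cluster under its own compose project name is what keeps it isolated from the Elasticsearch instance under test.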
…er-compose to a method in the test runner
This will allow us to run multiple elasticsearch instances in the same test, as we do for monitoring
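As a sketch of what moving docker-compose into a method of the test runner could buy us, the snippet below parameterises the compose project name so the same service definition can be started more than once in a single test. The `ServiceManager` type and its method are illustrative assumptions, not the framework's actual API.

```go
package runner

import (
	"fmt"
	"os"
	"os/exec"
)

// ServiceManager wraps docker-compose invocations for the test runner.
// (Illustrative type; the real framework may expose a different API.)
type ServiceManager struct{}

// AddService starts one instance of a service from its compose file. Because
// the compose project name is a parameter, the same file can be brought up
// several times side by side, e.g. the Elasticsearch under test plus a second
// Elasticsearch acting as the monitoring cluster.
func (sm *ServiceManager) AddService(service, project string, env map[string]string) error {
	cmd := exec.Command("docker-compose", "-f", service+".yml", "-p", project, "up", "-d")
	cmd.Env = os.Environ()
	for k, v := range env {
		cmd.Env = append(cmd.Env, k+"="+v)
	}
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("could not start %s (project %s): %w\n%s", service, project, err, out)
	}
	return nil
}
```

With that shape, a scenario could call `sm.AddService("elasticsearch", "stack", nil)` and `sm.AddService("elasticsearch", "monitoring", nil)` and get two independent clusters.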
This is a requirement
We don't need them for stack monitoring's parity tests purpose
These keys were found during this Docker-based testing
Thanks for scoping down this PR to just Elasticsearch parity tests, @mdelapenya. As stated in #94, the goal of this PR was:
Now that this PR is only dealing with Elasticsearch parity tests, I was able to do an apples-to-apples comparison by running the parity tests as implemented by this PR and comparing the results with the existing parity tests. In both cases I started out with completely clean environments (no pre-built Docker images to aid this PR and similarly no VM images or running VMs to aid the existing parity tests). This PR
Existing parity tests
As you can see the existing parity tests took 4 minutes and 32 seconds while the ones in this PR took 274s == 4 minutes and 34 seconds — almost exactly the same amount of time! Given this, we can conclude that there doesn't seem to be any major performance benefit in porting over the existing parity tests to the E2E framework. However, performance was only one of the goals stated in #94. The other one was maintainability. Given that the @elastic/stack-monitoring-ui team is currently maintaining these parity tests, I will defer to them on whether they think porting them over to the E2E framework (as illustrated in this PR) would be more maintainable for them in the long run over the current implementation.
About performance, I'd run benchmarks for both implementations in a controlled environment to verify this and avoid drawing conclusions from simplistic results. As an example, on my local machine (MacBook 13'', 16 GB RAM, 2.7 GHz Intel Core i7, SSD) it takes:
which is faster than the above numbers (curious about the CPU usage: 4% vs 26%). As a side note, the tests for the 4 products (elasticsearch, kibana, logstash and filebeat) complete fairly fast:
Talking about maintainability, I'm going to list the things that I (and the software industry, via the ISO standards) consider important, based on working on both projects, creating this framework and reading the current parity tests. Hope I don't get anybody bored here 😄

Testability
Analyzability
Functionality
Usability
Installability
Isn't that what I did? Can you explain why your environment should be considered more controlled than mine? Also, your results prove that the new parity tests run faster in your environment than mine. But the question is about the relative performance between the existing and new parity tests, right? In my comment I ran both implementations (existing and new) and posted both their results. However, so far you have only posted the results of the new parity tests in your comment. For an apples-to-apples comparison, shouldn't you also run the existing parity tests in your environment and post those results as well?

As for maintainability, thanks for listing out the various factors that go into it. We should indeed consider these in deciding which implementation is more maintainable long term. I have no horse in this race as I'm no longer the primary maintainer of these tests. So I'll reiterate what I said in my previous comment: it's up to the @elastic/stack-monitoring-ui team to weigh in on this aspect as they are primarily maintaining the parity tests.

I think it's worth remembering that we already have stack monitoring parity tests implemented and running in CI for well over a year at this point. They're stable and have been catching bugs for us already. We made an investment in them to get them to be stable and useful, which took a few months, and since then that investment has been paying off with very little upkeep. The framework they are built on is being actively maintained and is well supported. So there is no inherent reason to abandon this investment now. Essentially, we've paid off the costs and are reaping the benefits.

Whereas with a new implementation we're going to have to put in some work first before we start seeing the same level of stability and usefulness. For example, right now my involvement in maintaining the existing parity tests is pretty much zero. By comparison, I've had to be quite involved in helping debug the new implementation and I suspect I would be for at least some more time. Basically, in my mind, there has to be enough of a benefit to move away from the existing implementation that's stable, useful, and well-supported; otherwise I'm not sure it's worth it.
I wasn't really aware of this effort, but I'm having a hard time understanding why this is happening. I don't think we (the stack monitoring team) have any issues with these tests as is. Is there any background I can read about this effort?
Sorry if I gave the impression that I ran the tests in a controlled environment; my intention was to clarify that neither mine nor yours is controlled 😊 I only posted the results to demonstrate exactly that. I can picture a bare-metal, dedicated machine running on CI across different runs to do this performance comparison properly. Besides that, I tried to run the existing ones without success, that's why I added the

Please do not get me wrong, I'm not saying there is no value in the parity tests so we have to replace them. I totally value what they provide in detecting changes between both JSON schemas. The one-month work in this PR is trying to address the initial request (#94) to check whether this new dockerised, BDD-based framework is able to mimic what the parity tests do, and that's what we did: bring them to the framework, where we would be able to run them with the frequency that we need (with every PR or merge, daily, etc.).

TBH I was simply surprised that measuring performance in a few runs would set the performance mark for both use cases, which is why I brought in the maintainability aspects to consider when making a final decision.
I've not mentioned it, but any initiative from our team (Eng. Productivity) is trying to assist where needed: if we see no improvements in this new approach, then it's totally OK 😊 so closing this PR would be fine too.
Ah, thanks for clarifying. I agree with you that a controlled environment is ideal for benchmarking. Given that we're more interested in the relative comparison between two implementations I think, as long as they run in the same environment, it's good enough — not perfect, but good enough IMO. BTW, it is for this exact reason that I avoided comparing the performance results of the existing and new implementations in a CI environment — I would consider a CI environment to be even less controlled than our local machines. So I'm right there with you on how the environment being controlled vs. not can influence the numbers.
👍 Please do. We have one comparative performance analysis done on my system. It would be great to have another one done on yours as well.
Indeed, and I'm grateful for the work you've done in bringing this PR to where it is. Without it we couldn't be having the discussion we're having right now about performance and maintainability, which is what #94 was about. Quoting from it:
This is a fair criticism. I will re-run both implementations locally multiple times (resetting state by removing all docker images and VMs+images and making sure the CPU usage drops to a low stable rate each time). I'll post the results of each run here. If you do the same in your environment and post the results, hopefully we can get a fairer picture of relative performance.
@chrisronline This effort is an exploration of whether there might be benefits (performance and maintainability, specifically) to migrating the existing parity tests over to the E2E testing framework being developed by the Observability Engineering Productivity team. I requested this exploration in #94. My thinking was/is that if there are significant benefits in either or both areas (performance and/or maintainability), it would be worth moving the tests over. Now, thanks to @mdelapenya's efforts, we have this PR so we can do a concrete comparison between the existing and new implementations, rather than talking in abstract terms. As the primary maintainer of the existing tests along with @igoristic, WDYT?
@ycombinator I have found some errors while preparing the local environment on Mac, so I created this issue in the repo: https://github.com/elastic/elastic-stack-testing/issues/585
It sounds like the ask here is for myself and @igoristic to pull this down and run the tests and get a sense of how they work and how we can debug them. We can do that and report back.
@chrisronline Yes, that would be an ideal way to get a sense of maintainability. Thanks!
Hey @ycombinator, I managed to run the parity test locally. These are the time results for the parity tests: 1st run:
2nd run:
3rd run:
Thanks @mdelapenya. Those are about the same numbers I saw for my single run for the existing parity tests. I still need to run them again multiple times to get more results; just haven't gotten to it. I'm curious why the new tests (this PR) ran so much faster in your environment than mine when the existing tests took about the same time. Did you clear out all your docker images before you ran the new tests? I did that with
I'm just running these tests after removing the elasticsearch and metricbeat images from the Docker host. To speed things up on CI, those images can be cached in the CI workers, as we already do for other projects.
New results! 🍞 Removing docker images, so both elasticsearch and metricbeat are pulled during the process: 1st run
2nd run
One of the benefits of Docker is this reusability of the Docker cache, so I would like to treat these results as valid, because that's how Docker works. If the existing framework is not caching resources between runs (it downloads a TAR file from the network), then maybe that could be a good candidate for improvement.
++ agreed! I don't know too much about our CI environments so I might need some education here. Given that we use ephemeral workers for CI, does that impact cacheability? Specifically, is it possible to cache Docker images (or other artifacts) on each worker and have them preserved between two uses of the same worker?
Yes! We can bake Packer images with some Docker images already present in the host, under

We also generate most of the Beats integrations Docker images daily, and push them to our own registry, to avoid building them on each CI build (see https://apm-ci.elastic.co/blue/organizations/jenkins/beats%2Fbeats-docker-images-pipeline/detail/beats-docker-images-pipeline/200/pipeline/57). But this is not our case, as we consume already-existing Docker images. Nevertheless, I could imagine creating images for a specific PR on metricbeat, pushing them to our registry, and running these tests against them. Everything automated, of course.
Gotcha, thanks. In that case I agree that there will be a performance gain from using the E2E testing framework for the parity tests (this PR), which uses Docker images, over the existing implementation. Even if we were to try and cache tar files for the existing implementation, that counts towards extra effort in that framework because it's not something we get "for free" already. Whereas it sounds like the analogous effort for caching Docker images on CI workers has already been made, so we will get it "for free". Concretely, we're looking at 195.666s with the E2E testing framework (taken from this comment) vs. 291.333s (taken as the average of 3 values from this comment). That's a performance gain of roughly 95s, or 32%, which is not insignificant IMO. So purely from a performance perspective, I'm now 👍 to migrate the tests over to the E2E framework. I do think we should wait to hear @chrisronline and @igoristic's thoughts on the maintainability perspective since they are the ones maintaining the existing tests and would have to maintain the new ones if we migrate.
I'm still curious about the CPU usage 🤔 but I guess it's a Docker vs. Vagrant thing. I did not expect such a difference! Docker: 3%
Nice performance analysis. I am planning to remove the use of Vagrant in CI. Ansible can be used directly, with Docker or a cloud provider.
We use Ansible as a Cloud provisioner, replacing Terraform, and it works pretty well!
@chrisronline @igoristic @ycombinator Please let us know if you have any questions about this PR 🙏 we are glad to help!
Hi 👋 Would you mind if I close this one until a final decision is taken? You could reopen it at any time if/when needed. Thanks in advance!
Sounds good to me. We are a bit swamped and won't be able to look at this in the near future.
No worries! We have the commits here, ready to come back to whenever there is more bandwidth. Thank you!
What is this PR doing?
It ports the existing Python parity tests for stack monitoring to Go
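For context on what "parity" means here, the core check is that a monitoring document collected the legacy way and one collected by Metricbeat expose the same set of fields. The snippet below is only a rough sketch of that comparison under that assumption; the actual step definitions and field handling live in the PR itself.

```go
package parity

import (
	"fmt"
	"sort"
)

// fieldPaths flattens a monitoring document into dotted key paths, so that two
// documents can be compared structurally regardless of their values.
func fieldPaths(prefix string, doc map[string]interface{}) []string {
	var paths []string
	for key, value := range doc {
		path := key
		if prefix != "" {
			path = prefix + "." + key
		}
		if child, ok := value.(map[string]interface{}); ok {
			paths = append(paths, fieldPaths(path, child)...)
			continue
		}
		paths = append(paths, path)
	}
	sort.Strings(paths)
	return paths
}

// assertParity reports the field paths present in one document but not in the
// other, which is the condition the parity tests guard against.
func assertParity(legacy, metricbeat map[string]interface{}) error {
	missing := diff(fieldPaths("", legacy), fieldPaths("", metricbeat))
	extra := diff(fieldPaths("", metricbeat), fieldPaths("", legacy))
	if len(missing) > 0 || len(extra) > 0 {
		return fmt.Errorf("parity broken: missing %v, extra %v", missing, extra)
	}
	return nil
}

// diff returns the elements of a that are not present in b.
func diff(a, b []string) []string {
	seen := make(map[string]struct{}, len(b))
	for _, p := range b {
		seen[p] = struct{}{}
	}
	var out []string
	for _, p := range a {
		if _, ok := seen[p]; !ok {
			out = append(out, p)
		}
	}
	return out
}
```

A Gherkin scenario would then feed both documents into a step backed by something along the lines of `assertParity`.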
Why is it important?
We want to add support for Stack Monitoring in this test framework.
How to test these changes?
You could run:
or from VSCode, which supports debug capabilities:
Related issues
Logs