
MiqServer.seed should not call Zone.seed #18173

Merged
merged 1 commit into ManageIQ:master from remove_incorrect_zone_seed on Nov 8, 2018

Conversation

bdunne
Member

@bdunne bdunne commented Nov 7, 2018

Mistakes were made, this seed shouldn't be here.

@bdunne bdunne mentioned this pull request Nov 7, 2018
@carbonin carbonin self-assigned this Nov 8, 2018
@carbonin carbonin merged commit 879bc8f into ManageIQ:master Nov 8, 2018
@bdunne bdunne deleted the remove_incorrect_zone_seed branch November 8, 2018 20:01
Member

@jrafanie jrafanie left a comment


👍

@Hyperkid123

Hyperkid123 commented Nov 12, 2018

@bdunne it seems that this has broken the ui-classic tests. We are getting the same error across multiple tests:

  1) ApplicationHelper::ToolbarBuilder update_url_parms when the request query string has a few specific params to be excluded excludes specific parameters and adds the new one
     Failure/Error: MiqServer.seed
     
     NoMethodError:
       undefined method `name' for nil:NilClass
     # /home/mmarosi/manageiq/app/models/miq_server.rb:158:in `seed'
     # ./spec/helpers/application_helper/toolbar_builder_spec.rb:335:in `block (3 levels) in <top (required)>'

It is related to Zone in manageiq/app/models/miq_server.rb

157:        _log.info("Creating Default MiqServer with guid=[#{my_guid}], zone=[#{Zone.default_zone.name}]")

default_zone is nil. When I remove it, the tests pass, but I don't think that removing this line is the right solution. Can you guys fix it, please?
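
A minimal sketch of one possible spec-side workaround, assuming the spec is free to seed Zone explicitly before MiqServer.seed (this before block is illustrative, not the actual toolbar_builder_spec.rb setup):

```ruby
# Illustrative spec setup: seed Zone first so Zone.default_zone exists,
# since MiqServer.seed no longer seeds it after this PR.
before do
  Zone.seed       # creates the default zone
  MiqServer.seed  # can now log Zone.default_zone.name without NoMethodError
end
```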

@martinpovolny
Member

martinpovolny commented Nov 12, 2018

I would like to remind everyone that there is still very strong coupling between the UI code and the backend code, and this has not been improving recently, for reasons we all know.

The UI specs also fully depend on the backend models, factories, and seeding, among other things.

@bdunne
Member Author

bdunne commented Nov 12, 2018

This PR was not created in an attempt to intentionally break the ui-classic or any other repo. Someone referenced this code from another PR as an example and since it is incorrect, I removed it. As the app grows (and more repos are created), I expect that this will happen more frequently.

@martinpovolny
Member

martinpovolny commented Nov 12, 2018

@bdunne : Sure, no one thinks that this was intentional and we all understand that it happens. It's just that since last week we have had several issues that broke master, and this PR was just one more of those. It is consuming a significant amount of the UI people's energy and it is a bit frustrating.

I am not keeping a score but I hear people complaining (or not complaining but fixing UI's travis due to changes in other repos) all the time. Maybe @himdel remembers what the frequency has been recently.

If you really think that this is going to happen more frequently, then I suggest that we revisit the idea of core running the UI tests as part of the GitHub PR automation.

In the end it's the UI that integrates everything, so from that perspective it's most likely to catch the most issues, and therefore everyone should be using it for testing stuff. Does that make sense?

@himdel
Contributor

himdel commented Nov 14, 2018

I am not keeping a score but I hear people complaining (or not complaining but fixing UI's travis due to changes in other repos) all the time. Maybe @himdel remembers what the frequency has been recently.

Not exact, but looks like we had backend-related test issues on:

10/24
11/05
11/07
11/08 through 11/12 (no green in between)
11/13

So, yes, we shouldn't single out this PR, but in general, it is a persistent and annoying problem :).

@jrafanie
Member

This is the difficulty with riding the edge of branches for different repos. We've had many core test failures due to other gems changing on their master branch. We figure them out, fix them, and move on. Adding additional unrelated tests to this or other repos just in case feels like the wrong solution.

We could version branches with tags and only go against tags, but then we will discover incompatibilities much later in the development process, and probably multiple issues at once.

If you encounter a core or other repo bug breaking ui-classic, investigate it and get us involved to fix it. If anyone doesn't know how to reset their linked manageiq master branch locally to before the breakage, we need to teach them. Being blocked by another repo's change should be a temporary problem. Even if you don't know when manageiq was good, you could just reset the head of master locally to 1 or 2 days prior and try again.

We try to do the same thing when other repos break manageiq tests.

I don't think versioning branches will help and neither will running another repo's tests against every pull request.

If it's a communication / ownership problem for the underlying change, we should work together to fix that. In some cases, core changes need to be made because assumptions elsewhere are just plain wrong. If that's the case, we should be proactive about involving other teams to see if the change may break others, but in some cases we won't be able to predict which changes will break others.

@himdel
Contributor

himdel commented Nov 14, 2018

As long as any external code depends on some core conventions, it seems reasonable that those conventions should be tested in core. That's hardly unrelated.

If this were a once a month issue, sure.

But as it is, this is adding days' worth of delay per PR. I think we need a better solution than "meh, deal with it" ;).

@slemrmartin
Contributor

slemrmartin commented Nov 14, 2018

@bdunne sorry, but I have said it many times: this approach of "it had to be seeded before in production" is bad. I said it was a mistake in #17452; we cannot rely on these assumptions. It leads to unexpected behaviour.

@martinpovolny
Member

Just to explain how serious a problem this is: My estimate is that right now this issue costs around 1/2 of a senior engineer. The discussions, the delays, the "down time" when stuff is not being merged. The fixing. More discussions.

Adding additional unrelated tests to this or other repos just in case feels like the wrong solution.

Agreed, except our tests are very related. We have two repositories that are very tightly coupled: we reference the classes, methods, attributes, structures, constants, everything, and we do not have any kind of formal contract, no header files as in C or typings as in TypeScript.

So these UI tests are not unrelated; these tests actually test the code in the core repo.

I have heard before that the reason not to run the UI tests is time. Over time I have concluded that this argument is a weak one. Usually a PR stays open for days or weeks, while the UI tests run for hours at worst. We could run just the Ruby test suite, and on a single version of Ruby; that would be enough to catch a significant part of the problems.

If you encounter a core or other repo bug breaking ui-classic, investigate it and get us involved to fix it.

This does not work because it is slow. Stuff (frequently) gets broken in your afternoon, which is our night, and many of us start working during your late night. Waiting to get someone from the US involved basically means extending the "down time" by one day.

@jrafanie
Member

Just to explain how serious a problem this is: My estimate is that right now this issue costs around 1/2 of a senior engineer. The discussions, the delays, the "down time" when stuff is not being merged. The fixing. More discussions.

👍 If this downtime includes many team members not able to work because they don't know how to reset their manageiq head locally to yesterday, we need to address that because that's a solvable problem.

@Fryguy
Member

Fryguy commented Nov 14, 2018

My opinion on this matter is that we need to be more scientific about it. A lot of us are bringing emotion into it without working out the numbers, which IMO should be the thing that gets us to a real solution. In a perfect world, of course, we'd run every single test on every single PR, for all providers, separate repos, plugins, ui, api, etc., but that's just unrealistic across multiple constraint dimensions. So we're trying to find the best balance here. I agree that the current balance might be non-optimal and I'm willing to debate ways to improve that.

I am not keeping a score but I hear people complaining (or not complaining but fixing UI's travis due to changes in other repos) all the time.
...
If this were a once a month issue, sure.

@himdel @martinpovolny I think we need some accounting here. If possible, can we take the next month or two and literally count how many times a core PR broke the UI repo AND how long it takes to fix? Then we can average that over the number of PRs in each repo to get to some good numbers. I know you have estimates, but hard numbers might make this a lot simpler discussion.

Hypothetically, if this is happening once a month, and on average takes a couple hours to fix, then I am not willing to slow down every push to every core PR by a factor of 215% by introducing UI tests (core tests take 17 minutes, ui tests take 37 minutes = 54 minutes = ~215% time increase).
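
For reference, the arithmetic behind that figure, taking the 17- and 37-minute runtimes at face value:

$$\frac{37\ \text{min (ui)}}{17\ \text{min (core)}} \approx 2.18,\ \text{i.e. roughly the quoted}\ {\sim}215\%\ \text{extra wall-clock time per push}$$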

However if this is happening, say, daily, and takes multiple days to fix, then the overhead is clearly worth it and I would be willing to take that kind of time hit in the core repo.


Given the numbers we discover, if we decide we do need to invest some time into fixing this, I did some brainstorming on a possible technical solution. I was speaking earlier with @carbonin, @bdunne, @jrafanie, and @yrudman, and one thing we have always wanted was some sort of "on-demand" request that a PR author could make. Usually we have a good idea when a change we make could break another repo, and it would be really nice if, at the end of a PR's dev cycle, we could just ask @miq-bot to run some other test suite for us. However, Travis doesn't do this natively, and setting up and managing a full Jenkins instance, publicly, is just something that none of us wants to maintain.

So, the alternate idea is to have a new GitHub repo (and thus Travis repo) whose sole purpose is to run tests from a different repo. I can imagine a PR author commenting with @miq-bot run-tests manageiq-ui-classic; the bot would create a dummy PR on the new GitHub repo and wait for the tests to complete on that dummy PR. When the tests complete, it would update the source PR with a comment, or even better with the Status API. I could see the PR author requesting tests for any particular repo, so this could also help provider authors, api authors, etc. I also think this approach would take much less developer overhead to maintain.
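
A rough sketch of what that bot flow might look like; the runner repo name, the branch naming, and the run-tests trigger are all hypothetical, and it assumes the bot has already pushed a branch to the runner repo pinning the SHAs under test:

```ruby
require "octokit"

# Hypothetical miq-bot helper, not an existing feature.
def run_cross_repo_tests(source_repo, source_sha, source_pr_number)
  client = Octokit::Client.new(:access_token => ENV["GITHUB_TOKEN"])

  # 1. Open a dummy PR on the dedicated runner repo so its Travis build
  #    (which checks out and runs the manageiq-ui-classic suite) kicks off.
  client.create_pull_request(
    "ManageIQ/cross-repo-tests",               # hypothetical runner repo
    "master",                                  # base branch
    "run-ui-classic-for-#{source_pr_number}",  # head branch prepared by the bot
    "Run manageiq-ui-classic specs for #{source_repo}##{source_pr_number}"
  )

  # 2. Mark the source PR as pending via the Status API; the runner repo
  #    would flip this to success/failure when its build finishes.
  client.create_status(
    source_repo, source_sha, "pending",
    :context     => "cross-repo/manageiq-ui-classic",
    :description => "Waiting for ui-classic specs on the runner repo"
  )
end
```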

I feel like this is a decent compromise and would catch a lot of the issues. Would it catch all of them? No, probably not. But... "we're trying to find the best balance here."

@jrafanie
Member

I agree we need an integration suite to test core / ui / providers together, but it shouldn't be the core repo. We need to develop a solution that integrates these repos and that the mergers of the other repos monitor, so as not to destroy everyone else's productivity.

@himdel
Contributor

himdel commented Nov 14, 2018

@himdel @martinpovolny I think we need some accounting here. If possible, can we take the next month or two and literally count how many times a core PR broke the UI repo AND how long it takes to fix? Then we can average that over the number of PRs in each repo to get to some good numbers. I know you have estimates, but hard numbers might make this a lot simpler discussion.

Agreed, created ManageIQ/manageiq-ui-classic#4921.

Everybody please mention that issue whenever something from other manageiq repos breaks ui-classic specs.

@himdel
Contributor

himdel commented Nov 14, 2018

As for the spec-only repo idea, it's definitely better than nothing :) And yes, maybe having such a feature handy whenever somebody thinks their PR could break the UI should help.

I'm not quite sure it will actually help, at least not without running automatically and a lot of automation around finding the cause of the failure and notifying the relevant people, but it could be a decent first step.

@Fryguy
Member

Fryguy commented Nov 14, 2018

a lot of automation around finding the cause of the failure and notifying the relevant people

I would love to understand this step the most, as I'm curious if this is where the majority of time is lost. I've added a comment to ManageIQ/manageiq-ui-classic#4921 to see if we can account for that as well as some other timings.

@bdunne
Member Author

bdunne commented Nov 14, 2018

@himdel Should we move the issue to ManageIQ/manageiq instead? Then we can also track the same stats for the API and other repos that this happens to.

@martinpovolny
Member

martinpovolny commented Nov 15, 2018

A few notes.

@jrafanie : I wrote "down-time" -- stuff is not being merged.

If this downtime includes many team members not able to work because they don't know how to reset their manageiq head locally to yesterday, we need to address that because that's a solvable problem.

I never wrote that. Of course people can for example carry on working on their topic branches.

On top of that, "resetting manageiq to yesterday" would not help in many cases. There are a number of other issues that come up in these situations, the most significant being schema changes. Most UI people need to work with a populated database, so a db restore from backup might be involved in the process, as well as reverting manageiq-schema and possibly provider repos too.

But this is actually not very relevant. I was speaking about merging being blocked.

As it happens, the people who do most of the reviews and merges are the same people who are most likely to be fixing the travis breakage, and these people get spread thin. So the "down time" that I wrote about eventually affects parallel teams and partners: their PRs are red and need a travis restart (again, the same people can do that), and someone needs to go and check whether those PRs are red due to an issue with the PR itself or due to the master breakage. The amount of work adds up, and it's work that is not planned.

@Fryguy wrote:

Hypothetically, if this is happening once a month, and on average takes a couple hours to fix, then I am not willing to slow down every push to every core PR by a factor of 215% by introducing UI tests (core tests take 17 minutes, ui tests take 37 minutes = 54 minutes = ~215% time increase).

No point in starting with a hypothesis that we know is wrong. It's happening much more frequently than that; once a week would be a conservative estimate. And yes, we will collect the data as you asked. 👍

Speaking about a factor of 215% looks bad, until you realize that it's 30-40 minutes per PR. Do you know what the average lifetime of a PR is (before it is merged)? Put into percent, that may give us a completely different picture.

@jrafanie :

We need to develop a solution that integrates these repos and mergers to the other repos monitor so as not to destroy everyone else's productivity.

Well, I am not going to reproduce what people were saying on this side of the ocean after reading this ;-) I will just try to tell you what impression that statement makes.

From our perspective this sounds like you are saying: "We are not willing to give a (small) amount of time (40 minutes of PR delay) because it would cause a loss of our productivity (perceived as small from our side), in exchange for fixing a huge (from our side this seems huge) problem that you have."

I understand very well that this is not what @jrafanie wrote or meant to say. I know that it's a problem of different perspectives and incomplete understanding of the situation from both sides.

Overall I am really thankful that you acknowledge that there might be a problem and are willing to carry on the discussion on how to size it and solve it. 👍

@cben
Contributor

cben commented Nov 15, 2018

... then I am not willing to slow down every push to every core PR by a factor of 215% by introducing UI tests (core tests take 17 minutes, ui tests take 37 minutes)

So we have a tension between feedback latency while iterating on a PR vs the unplanned productivity damage once a breaking PR is merged.

  • Could we apply a different standard of testing while iterating on a PR vs. finally merging it?
  • Why do we care so much about Travis latency? If the extra tests are separate travis jobs, it's still quite easy for developers to check the fast tests. I suspect we care mostly because we wait for green before we can merge?

So perhaps some merge-on-green automation could make everyone happier?
As soon as the reviewer is happy, they'd just write some magic /merge comment (maybe miq-bot, maybe something else); it'd run the tests, including some dependent repos, and merge IFF all passed.

(Many such automations also implement a strict merge queue — test & merge 1 PR at a time. We can't afford that, and we also don't need it — actual race conditions missed by test runs against older master are very rare.)
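
A rough sketch of what such a merge-on-green helper could look like; it is hypothetical, not an existing miq-bot feature, and assumes the dependent-repo suites report back through the commit Status API:

```ruby
require "octokit"

# Hypothetical /merge handler: merge the PR only once every status check
# (core suite plus any dependent-repo suites) has reported success.
def merge_when_green(repo, pr_number)
  client = Octokit::Client.new(:access_token => ENV["GITHUB_TOKEN"])
  pr     = client.pull_request(repo, pr_number)

  loop do
    case client.combined_status(repo, pr.head.sha).state
    when "success"          then return client.merge_pull_request(repo, pr_number)
    when "failure", "error" then return nil  # leave the PR open for the author
    else sleep 60                            # still pending, poll again
    end
  end
end
```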

@himdel
Contributor

himdel commented Nov 15, 2018

@himdel Should we move the issue to ManageIQ/manageiq instead? Then we can also track the same stats for the API and other repos that this happens to.

@bdunne Depends on what we want to track. Naturally, I'm more interested in "other repos breaking ui-classic", because my feeling is this is about 50:50 core vs some provider, and there was at least one API PR that also broke our travis.

If you're more interested in "manageiq core breaking other repos" it makes sense to have the issue elsewhere :).

Perhaps we should track both.

@jrafanie
Member

jrafanie commented Nov 15, 2018

On top of that, "resetting manageiq to yesterday" would not help in many cases. There are a number of other issues that come up in these situations, the most significant being schema changes. Most UI people need to work with a populated database, so a db restore from backup might be involved in the process, as well as reverting manageiq-schema and possibly provider repos too.

But this is actually not very relevant. I was speaking about merging being blocked.

👍 @martinpovolny

Speaking about a factor of 215% looks bad, until you realize that it's 30-40 minutes per PR. Do you know what the average lifetime of a PR is (before it is merged)? Put into percent, that may give us a completely different picture.

While it's very true it's only 30-40 minutes per PR, I'm more concerned with the larger test suite producing more sporadic failures due to test contamination in general. I know we have a few sporadic test failures on core that appear very infrequently and seem to be exposed more when new tests are added, perhaps because of the way the rspec spec randomizer works. Additionally, while we're talking about bringing UI-classic tests here, I can see someone in a very active provider repo wanting the same thing for areas where they are tightly coupled to manageiq. This doesn't scale very well.

From our perspective this sounds like you are saying: "We are not willing to give a (small) amount of time (40 minutes of PR delay) because it would cause a loss of our productivity (perceived as small from our side), in exchange for fixing a huge (from our side this seems huge) problem that you have."

I understand very well that this is not what @jrafanie wrote or meant to say. I know that it's a problem of different perspectives and incomplete understanding of the situation from both sides.

Thanks for providing that. This was clearly not my intent. What I meant is that adding somewhat related tests from other repos to repositories without the expertise in those tests brings a lot of potential for test failures, as I mentioned above. If other repos also want this behavior, it gets much worse: suddenly, provider/core/ui developers will have to wait longer on travis only to find out that an unrelated test is failing in an area they don't know. Because it's important to track down the failure, time will be spent finding out whether it's a valid failure.

Overall I am really thankful that you acknowledge that there might be a problem and are willing to carry on the discussion on how to size it and solve it. 👍

Oh, I know there's a definite problem; it's just hard to measure how bad it is. In the past, when other gems or repos have broken travis for the core repo (or even when we were all in one repo), we've spent an unaccounted amount of time tracking down the failures. It was never measured or really tracked.

I'm glad you're bringing this to our attention, because if we can measure how much time we spend on this, and how the breakage generally happens, it will help drive how we try to solve it.

My gut feeling is that we need a single repo that pulls in all of our repos/gems and runs all/most of the tests, on demand at specific git SHA1s (via a bot command) and/or every 6 hours, with a subset of the repo contributors watching (although anyone who cares can also watch), and we can discuss only this in that repo's gitter room.

@jrafanie jrafanie added this to the Sprint 99 Ending Nov 19, 2018 milestone Nov 20, 2018