This repository has been archived by the owner on Jul 23, 2020. It is now read-only.

Launcher should not allow users to continue if environment(s) have reached resource quotas #3346

Open
andrewazores opened this issue Apr 27, 2018 · 29 comments

Comments

@andrewazores
Collaborator

Parent: #3344

Task

As seen in the design link in the parent issue, the Launcher should display an overlay and block the user from continuing to create a new application if they do not have sufficient resources available for the application to be deployed.

@andrewazores
Collaborator Author

@mceledonia to clarify the intent here: should the Launcher block creation if either the Stage or Run environment has reached its resource quota?

@mceledonia
Collaborator

mceledonia commented Apr 27, 2018

@andrewazores I was always under the assumption that this would apply when both Stage and Run are full, but for this stage of implementation and for the sake of simplicity I think we should block if either is full. If in the future we have the granular control to allow deployment into one stage or the other in this wizard, I think we can change that, but for now this seems to be the way to go.

@qodfathr
Collaborator

Please see my comments here: #3344 (comment)

I do not feel this should be a p0.

@bmicklea bmicklea removed the priority/P1 Critical label Apr 30, 2018
@bmicklea
Collaborator

+1 to @qodfathr's assessment that this is a SEV2, but not a P0. Removing label.

@sunilk747
Collaborator

sunilk747 commented May 4, 2018

@invincibleJai @arunkumars08 I guess this issue can be solved from the client side. We just need to add a check comparing the available resources against the used resources. We can use this API endpoint to get the metadata:
https://api.starter-us-east-2.openshift.com/api/v1/namespaces/ssamal/resourcequotas. It will return something like this:
[screenshot from 2018-05-04 14-16-21: resource quota API response]
Let me know what you guys think.
cc: @animuk
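
For illustration, a minimal sketch of such a client-side check, assuming the standard Kubernetes/OpenShift ResourceQuota response shape (status.hard and status.used) and a bearer token; the function and parameter names are hypothetical, not existing fabric8-ui APIs:

// Hypothetical client-side quota check against the resourcequotas endpoint above.
// The status.hard / status.used fields follow the Kubernetes ResourceQuota API.
interface ResourceQuota {
  status: {
    hard: { [resource: string]: string };
    used: { [resource: string]: string };
  };
}

interface ResourceQuotaList {
  items: ResourceQuota[];
}

async function hasReachedQuota(apiUrl: string, namespace: string, token: string): Promise<boolean> {
  const response = await fetch(`${apiUrl}/api/v1/namespaces/${namespace}/resourcequotas`, {
    headers: { Authorization: `Bearer ${token}` }
  });
  const quotas: ResourceQuotaList = await response.json();
  // Report "at quota" if any tracked resource has used >= hard in any quota object.
  return quotas.items.some(q =>
    Object.keys(q.status.hard).some(resource =>
      parseQuantity(q.status.used[resource]) >= parseQuantity(q.status.hard[resource])
    )
  );
}

// Very simplified quantity parser; real Kubernetes quantities use suffixes
// such as "m", "Ki", "Mi", "Gi" and need proper unit handling.
function parseQuantity(value: string | undefined): number {
  if (value === undefined) {
    return 0;
  }
  const match = value.match(/^([0-9.]+)([a-zA-Z]*)$/);
  if (!match) {
    return NaN;
  }
  const scale: { [suffix: string]: number } = { '': 1, m: 1e-3, Ki: 2 ** 10, Mi: 2 ** 20, Gi: 2 ** 30 };
  return parseFloat(match[1]) * (scale[match[2]] !== undefined ? scale[match[2]] : 1);
}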

@invincibleJai
Collaborator

@sunilk747 agree it should be handled on the client side, but it needs some development effort, and the flow still needs to be decided, i.e. when to show the quota screen (https://redhat.invisionapp.com/share/CTGJ6F7V4K9#/screens). It's more like a task to me than an issue, which we need to have in the sprint plan.

@andrewazores
Collaborator Author

@sunilk747 you can use this service: https://github.com/fabric8-ui/fabric8-ui/blob/3683fb70017dd16a96951d259a08cf30698e09ea/src/app/space/create/deployments/services/deployment-api.service.ts

This provides access to the backend WIT endpoints developed for the Deployments page. This way you don't need to reimplement proxy handling etc. since it is done by the backend.

@andrewazores
Collaborator Author

In particular you can get CPU and Memory usage vs quota by calling getEnvironments, which returns an Observable<EnvironmentStat[]>. You can narrow this down from EnvironmentStat -> EnvironmentAttributes -> Quota -> {MemoryStat or CpuStat}. Both MemoryStat and CpuStat extend an interface Stat, which looks like this:

export interface Stat {
  readonly used: number;      // amount of the resource currently consumed
  readonly quota: number;     // hard limit available for the resource
  readonly timestamp?: number;
}

used and quota are the exact info you need here.
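
As a sketch of how that might be consumed (the exact property path from EnvironmentStat down to the CPU and memory Stats is an assumption based on the description above, and the rxjs import style may differ in fabric8-ui):

import { Observable } from 'rxjs';
import { map } from 'rxjs/operators';

// Shapes assumed from the description above; the real interfaces live in the
// fabric8-ui deployments services.
interface Stat {
  readonly used: number;
  readonly quota: number;
  readonly timestamp?: number;
}

interface EnvironmentStat {
  attributes: {
    name: string;
    quota: {
      cpucores: Stat;
      memory: Stat;
    };
  };
}

// Emits true if any environment has exhausted its CPU or memory quota --
// the condition under which the Launcher would block creation.
function anyEnvironmentAtQuota(environments$: Observable<EnvironmentStat[]>): Observable<boolean> {
  return environments$.pipe(
    map(envs =>
      envs.some(env => {
        const { cpucores, memory } = env.attributes.quota;
        return cpucores.used >= cpucores.quota || memory.used >= memory.quota;
      })
    )
  );
}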

@jyasveer jyasveer self-assigned this May 7, 2018
@jyasveer jyasveer added this to the Sprint 149 milestone May 7, 2018
@jyasveer
Collaborator

jyasveer commented May 7, 2018

@catrobson @bdellasc
As per my understanding, we have to show this overlay while a user is trying to create a space. The steps could be something like this:

  1. User clicks create space link/button.
  2. As soon as the user clicks the button/link, we check the resource limits and display the overlay with the resource warning message (a rough sketch of this flow follows below).

Please correct me if I am wrong.
cc @animuk @invincibleJai @arunkumars08 @sunilk747
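
A rough sketch of that flow, assuming a quota check like the ones sketched earlier; the service and overlay hooks here are placeholders, not existing fabric8-ui APIs:

// Hypothetical gate on the "Create Space" action: check quota first, then
// either show the blocking overlay or continue into the wizard.
interface QuotaCheck {
  isAtQuota(): Promise<boolean>;
}

async function onCreateSpaceClick(
  quota: QuotaCheck,
  showQuotaOverlay: () => void,
  openCreateSpaceWizard: () => void
): Promise<void> {
  if (await quota.isAtQuota()) {
    showQuotaOverlay(); // overlay with the resource warning message
  } else {
    openCreateSpaceWizard();
  }
}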

@bdellasc
Collaborator

bdellasc commented May 7, 2018

@jyasveer I believe this would work. For this use case, showing this overlay instead of the actual "Create Space" overlay makes sense to me.

CC @catrobson @mceledonia

@qodfathr
Collaborator

qodfathr commented May 7, 2018

I want to make sure we all understand and agree to the implications of this.

In the free tier, very quickly, all users will essentially be at quota. So we agree that a good experience is that a very disruptive overall is presented to them every single time they try to create a new app. Even though there is nothing wrong with creating a new app -- the user will not be stuck in any way if they create a new app. The only thing that will "go wrong" is the deployment may fail (but may not -- this is highly situationally dependent). They will be able to edit the app in Che, create work items, etc. etc. Moreover, it is entirely possible that if the user is not at quota, but is close to quota, the app deployment will fail anyway, and we are not warning them in this case. Lastly, there are several ways a deployment may fail -- CPU, mem, disk, routes, services, etc. Are we committing to (1) ensure we are checking all of those, (2) keeping pace with new artefacts with limits in the future (e.g. stateful sets), and (3) keeping pace with changes to those limits (e.g. combing Run and Stage into one pool of resources rather than two independent pools).

I remain greatly concerned that we are trying to solve the wrong problem, and to do so in the wrong way. If we pursue this change as described, then I would like to see those in support of the change actually use the product, with that feature flag turned on, to do real work with the service for a few weeks and see if they still feel the same way about the proposal. We have to balance what is logical/analytical against the emotional response to the feel of the service. I believe this proposal is far too analytical and does not give enough consideration to the desired emotional experience.

@catrobson
Collaborator

I agree @qodfathr - I think the blocking of creation is being put into place because we do not have the right information, actions, and capabilities built into the system today to gracefully handle how the system acts, and what the user can do about it, in the case that they go over quota. This is something that I think will be really important for us to focus on improving as soon as possible.

I'm not convinced that, as of today, not blocking users will result in a better experience than blocking them, only because recovery from an over-quota experience seems very difficult to handle - the user can get into too much trouble without clear recovery paths. I'd like to talk about this as a team as soon as possible so we can consider the smallest set of changes required for the information, user recovery, and resource management to be good enough that we no longer need to block users in an attempt to protect them.

@bmicklea
Collaborator

bmicklea commented May 7, 2018

Is this a case where, if the deployment fails, we could warn them and suggest they try again later? If this was paired with some kind of alert on the system delivery side, could we then clear the issue for them from the back end?

@jyasveer
Collaborator

@catrobson can you please let me know if the UX team is still working on the UXD for this?

@bdellasc
Collaborator

bdellasc commented May 11, 2018

My impression, reading the last comments from @catrobson and @qodfathr: it sounds like we (UXD, Development and PM) need to look at the information, user recovery, and resource management topics. Based on that conversation, we would try to figure out how to add or improve those so we don't have to "block" users when they are maxed out on current resource usage. "Blocking" them feels like an interim solution, when we really need to improve those other supporting pieces.

Since Summit was this week, I don’t think this is something UXD has taken up yet, as a number of key stakeholders have been focused on Summit. If this is the path we agree needs to be taken, we’ll need to lay some groundwork to create stories/issues in order to begin fielding the wider topics mentioned above.

We might need to stick with a simple “block” solution (+ some basic mechanisms to get unblocked) for now because the issues above might take some time to work through. If we run with the interim “block” solution now, keep in mind that we need to address those other areas as a more permanent solution. I’m not sure if @catrobson and @qodfathr would be alright with this approach and I don't want to speak for them.

So, bottom line is - we do intend on fielding the wider issues with the rest of the team, but haven’t yet because of Summit this week.

@ldimaggi
Collaborator

I think that as a general rule, if we block a user from performing a task, we have to also provide them with an easy path to get un-blocked. Blocking a user from performing a task without providing a path forward/around the block is a dead-end situation. Thx!

@jyasveer
Collaborator

@ldimaggi I agree with you.
@qodfathr @bdellasc @catrobson
My question here is: do we want a temporary solution of blocking the user from creating a space for now, OR do we want to wait for a permanent solution and implement that later?

@bdellasc
Collaborator

bdellasc commented May 11, 2018

If it is critical to have something implemented ASAP, it might make sense to arrive at a very basic solution where we block the user but give them a path to get unblocked, so we don't end up with the dead end that @ldimaggi mentioned. It may not be an ideal flow, but it would serve the purpose for now. Then we can take a wider look at a better, more permanent solution for the longer term?

I think these are our options going forward:

  • spend more time NOW on a more integrated, longer-term solution (meaning the current implementation remains a dead end for the user in the meantime)
  • get something quick/basic implemented NOW that prevents the dead end, then look at a better, longer-term solution once the basic one is in place...

I suspect the second option is the least painful for our users...something that is potentially clunky, but doesn't dead-end them right now.

@catrobson @qodfathr

@mceledonia
Collaborator

mceledonia commented May 11, 2018

My understanding of this is that it was a temporary first step of a multi-step implementation to solve the problem. I'm assuming (let me know if I'm wrong) that the dead end is remedied with a direct link to a page that allows the user to scale down, delete, or both... However, there are a couple of issues with this which need to be solved, which is why we are implementing only the first increment (blocking the user and using language to direct them to a solution):

  • We don't have backend support to know which space is using the most resources
  • Even if we did, we can't be sure we're sending them to the right space, because that space may not be the one the user wishes to compromise

So we need to figure out what the best next step is for the user. Because we don't have the ability to upgrade the account we need to offer a solution that exists within the current resource limitations.

  • Do we block and send them to an account-wide overview page with quick actions to scale up/down and or delete? Do we offer this experience in an abbreviated way directly in the overlay?
  • Do we offer a way to quickly "disable" or "turn off" all deployments in a given space or application, which scales them all the way down with the click of a button? Essentially streamlining the scale down process in this view?
  • Do we allow the user to continue but offer adequate/detailed information as to why their deployments fail when they do, and defer the "clear path forward" to that point?
  • Do we allow the user to continue and give them a one-time trial experience of what it's like to have more resources (perhaps a slightly extreme answer to Todd's point about emotional response), with persistent reminders that they are over quota and that the trial period will end after a certain time? Then, after it ends, defer to a more hard-blocked solution?

Those are only a few ideas off the top of my head (it's Friday and almost lunch time), but all that to say I understood this solution as an incremental first step towards a better solution, with the assumption that scaling up/down or deleting was the only path forward for the user. That assumption sounds like it's not correct, as we could instead defer the blocking of the user to later, when a deployment fails (if it does). It sounds like there needs to be more discussion around those two options and the possible solutions that branch from them.

I agree with @qodfathr that we need to consider the balance between emotional and analytical responses, and with @catrobson that this needs more discussion with the larger team.

@animuk
Collaborator

animuk commented May 11, 2018

I would say it would be good if we give users

  1. A warning message that their quota limit is exceeded and the pipeline build may fail.
  2. A URL/link to the OpenShift console so that they can free up unused resources in a separate window.
  3. Then proceed with creating the new app.

That way, openshift.io is not blocking its users, and it also warns them that their deployment may fail and displays the action points to counter that.

@qodfathr
Collaborator

Being out of quota (which has something like 5 dimensions -- disk, cpu, ram, services, routes, others?) should be something known to users at all times, in a non-intrusive way. Perhaps a warning icon in the topnav, which, when clicked, gives details plus links on how to remediate. But even as a "short term" solution, I wouldn't want us to proceed until all quota dimensions are known and documented and we've got the APIs to properly articulate them. Moreover, we cannot hard-code the quota limits into the front-end logic -- e.g., don't query to find out there are 5 routes in use and assume that the user cannot create a 6th. The API must make this determination in conjunction with the backend. Assume the answer is potentially different for every single user. (i.e. a short-term solution is not an excuse to add to technical debt.)
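
To make the "don't hard-code limits" point concrete, a sketch of a quota summary whose dimensions come entirely from the backend response; the shape here is hypothetical:

// Hypothetical quota summary: the set of dimensions (cpu, memory, routes,
// services, ...) is whatever the backend reports, so the front end never
// assumes which limits exist or what their values are.
interface QuotaDimension {
  name: string;
  used: number;
  hard: number;
}

interface QuotaReport {
  dimensions: QuotaDimension[];
}

// Returns the exhausted dimensions, e.g. to drive a non-blocking topnav
// warning icon rather than a blocking overlay.
function exhaustedDimensions(report: QuotaReport): QuotaDimension[] {
  return report.dimensions.filter(d => d.used >= d.hard);
}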

Being out of quota may impact users in many ways, but completing the Launcher is not one of those ways, so I still argue that solving it in the Launcher is the wrong approach. Rather, if the user knows that they are out of quota before they even start the Launcher, I feel this is a better overall experience. An unobtrusive, non-blocking topnav warning is one way to achieve that goal. (Read: the user sees the warning icon and elects to do something about it before clicking Add to Space; I am not suggesting the user clicks Add to Space and then we put a pop-up in their face. It's more like this: I get in my car and the low-gas light is on. I likely choose to go to the gas station and fill up before embarking on my journey, but maybe I'm only going on a very short trip and know I can deal with buying gas later. In any event, the light is a non-obtrusive warning. It doesn't stop me from starting my car or driving to the destination of my choice. If I go too far and run out of gas, oh well, I made a choice. Given that this rarely happens, the unobtrusive warning model is proven to work.)

The primary impact of quota maximization is that app deployment fails. So, as a longer-term goal, IMHO, this is where we should look to improve the experience. E.g., rather than failing the pipeline, put it in a warning state requiring input (much like the Promotion step we have today). The two options could be "try again" or "cancel." The idea being that if the deployment fails due to some quota limitation (or really for any reason), the user is given the option to go correct the cause of the deployment failure and then try the deployment again (without having to rerun the whole pipeline). Taken further, imagine if the user could redirect the deployment elsewhere -- e.g. "oh, yeah, my free Stage env is full or too small to ever run that app, but let's deploy it to OSO Pro instead." A quick edit to the pipeline (perhaps via a GUI) and a click of "Try Again" and the user is back on track.

Related to all of this, pipelines that are in any "input required" state probably should also have some sort of non-blocking notification to the user (e.g. an icon in the topnav). Clicking would give details and a link to the awaiting pipeline. This helps to address the primary impact of failed deployments (assuming that the pipeline does not fail but rather goes into the remediation state as described above). Moreover, this would help address a concern whereby a pipeline is not starting in Space S2 because another pipeline is awaiting input in Space S1 and I only have the authority to run one build at a time. (Note that in the current UI, you cannot see the awaiting S1 pipeline when in S2, so, today, the user is somewhat in the dark as to why the S2 pipeline is not starting.)

@ldimaggi
Collaborator

+1 to Todd's description of handling a deployment that would fail due to a lack of resources. Instead of blocking a user, I think it's better to inform the user, and provide the user with a path forward.

Also +1 to Todd's recommendation of our having a clear definition of all the dimensions of a user being out of quota (we've been focusing on a subset of these in our discussions) AND to our providing easy to understand quota information to our users.

@animuk
Collaborator

animuk commented May 14, 2018

@pradeepto Please see Todd's comments above. As part of the long-term approach, instead of a pipeline failure when the resource quota limit is exceeded in OpenShift, can we give users a warning message with the option to go correct the cause of the deployment failure at the pipeline build level? Can you analyse the possibility with the build team as a future process improvement?
cc @krishnapaparaju

@animuk
Collaborator

animuk commented Jun 14, 2018

@krishnapaparaju As discussed, we need an endpoint in the Jenkins proxy layer which will call the OpenShift API to fetch the usage quota details and show the user two options, "try again" or "cancel", in the pipeline screen before it starts deployment in OpenShift.
cc @pradeepto

@andrewazores
Collaborator Author

The API you're asking for sounds like a combination of ones that already exist and one we have in progress for the Deployments page: #3388. We also have an OpenShift events bus API in progress which would allow you to discover when deployment failures occur: #2528.

@piyush-garg
Collaborator

piyush-garg commented Jun 20, 2018

@andrewazores @animuk @jyasveer Is there anything the build-cd team needs to do? Can we drop that label if everything is fine?

@animuk
Collaborator

animuk commented Jun 25, 2018

@piyush1594 As discussed, if you already have that endpoint in the Jenkins proxy layer, then use it to prompt the user with "try again" and/or "cancel" in the pipeline screen before it starts deployment in OpenShift.

@bartoszmajsak
Contributor

@GeorgeActon @animuk just wanted to understand how much has been done and what is still needed to make this feature a reality. /cc @slemeur

@animuk
Collaborator

animuk commented Aug 13, 2018

@bartoszmajsak this was assigned to the build team before the start of the J train. Can you ask them?
