Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RW-2704][risk=low] Split up listCluster API and move initialization logic to the client-side. #3135

Merged
merged 13 commits into from
Feb 27, 2020

Conversation

gjuggler
Copy link
Collaborator

@gjuggler gjuggler commented Feb 14, 2020

This PR resolves RW-2704 by splitting up the server-side listCluster endpoint into component get/create/delete pieces.

As a result of that server-side simplification, the initialization state management and retry logic had to move somewhere. I opted to implement this on the client side, since we already had a few client-side classes which were effectively repeating the same or similar initialization logic (notebook-redirect, interactive-notebook, and reset-cluster-button).

The resulting controller class, cluster-initializer, is meant to consolidate this logic into one well-tested module. Each React component uses this in a more-or-less simple way (see below for notes on why reset-cluster-button is special) with the Promise based API.

Some notes:

  • Unit testing Promises with time-based delays isn't currently possible in Jest (see useFakeTimers breaks with native promise implementation jestjs/jest#7151). I had originally implemented the polling loop using a while loop within an async function, but I ultimately switched to using a poll() method and setTimeout to make unit testing more feasible.
  • The reset cluster button is a bit of a special case here, since we don't want the button to take any action until the user explicitly chooses to "Reset" (this is the whole point of RW-2704). The solution I took was to create a "no-action" initializer when the button is first loaded, and to create a "action-ready" initializer when the button is clicked. It's a bit awkward, but less awkward than almost all other options I could think of.

Ultimately, while working through this PR it became clear that this functionality should really live somewhere on the server-side, ideally within the analysis application (e.g. Leo) itself. That would allow Terra and AoU, for example, to simplify both of our clients. Something worth thinking about as Framework platform convergence discussions continue.

200:
description: Returns the cluster that was created for this user and workspace.
schema:
$ref: '#/definitions/CreateClusterResponse'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideal REST semantics would have getCluster and createCluster return a Cluster directly rather than wrapping


export class ClusterInitializer {
private workspaceNamespace: string;
private onStatusUpdate: (ClusterStatus) => void;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

if (this.isClusterErrored()) {
// TODO: report errors to Stackdriver here.
console.log('Cluster error -- deleting for retry');
await this.deleteCluster();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per ticket discussion, I might move this above the loop so it only happens on initialization - errors encountered later could always throw and require user intervention. I think this limits the risk of concurrent pollers causing cluster churn - delete is the only really problematic/destructive action here. The rest should be relatively safe for concurrent requests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I owed you a reply on this one. I thought hard about treating delete cluster specially somehow, but ultimately I concluded that it was simpler & more readable, and arguably not too much more dangerous, to have it treated the same as our other state transitions. My hope is that the other guard rails (max initialization time, max delete count, max 500s) will help mitigate any potential out of control loops.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes makes sense, I surmised as much from the changes - thanks.

if (isAbortError(e)) {
return Promise.reject();
}
// If we received a NOT_FOUND error, we need to create a cluster for this workspace.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a pre-emptive comment, but this should actually check for NOT_FOUND

// Overall strategy: continue polling the get-cluster endpoint, with capped exponential backoff,
// until we either reach our goal state (a RUNNING cluster) or run up against the overall
// timeout threshold.
while (!this.isClusterRunning()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing which isn't covered here yet is error retries. Since this is a very long polling cycle (expected to commonly be ~7-10m), would be ideal if this were resilient to the occasional transient 500. Using this package on the outgoing calls would probably suffice:

export async function fetchAbortableRetry<T>(fetchFn: () => Promise<T>, timeoutMillis: number, maxRetries: number): Promise<T> {

(e.g. only fail the polling loop if the request fails X times in a row)

@gjuggler gjuggler changed the title [RW-2704][risk=low] Split up listCluster API and move recovery to the client-side. [RW-2704][risk=low] Split up listCluster API and move initialization logic to the client-side. Feb 21, 2020
@gjuggler
Copy link
Collaborator Author

@calbach this is finally worth another look. I'm really not in love with how complex the solution got, but IMO it's probably better to have one gnarly-but-well-tested thing rather than simpler-looking-but-more-fragile logic spread around in a few components. Curious to hear whether you agree.

@jaycarlton as I warned you, here's a client-side PR that feels like a server-side PR (for obvious reasons...). Feel free to just browse, or treat this like a full code review & dig into questionable-looking areas, etc.

Both: I pushed a relatively-recent version of this branch to my test GAE service, accessible here: https://gjordan-dot-all-of-us-workbench-test.appspot.com/ There's one known issue I need to fix (the cluster reset button looks weird while initially fetching cluster status), but please use that to run it through the ringer.

@gjuggler gjuggler marked this pull request as ready for review February 21, 2020 15:10
throw new BadRequestException("Must specify billing project");
}
public ResponseEntity<Cluster> getCluster(String workspaceNamespace) {
String firecloudWorkspaceName = lookupWorkspace(workspaceNamespace).getFirecloudName();
Copy link
Contributor

@jaycarlton jaycarlton Feb 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can the body of this function be pushed into the WorkspaceService, and have it return Optional<Cluster>?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it wouldn't be worth it. This controller is <10 lines (including wrapping!) which is way shorter than before, and very much in the spirit of https://github.com/all-of-us/workbench/blob/master/api/doc/code-structure.md#controllers.

(And, upon reading that section again, I would probably disagree that 'controllers should contain no actual logic'. IMO, having some calls to well-factored access enforcement and validity checking service methods is more reasonably the scope of a controller than of a service.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can go along with that. I just haven't see much of a clean example of how to draw the line. If we're not really MVC, then I don't know what the "C" means.

Copy link
Contributor

@jaycarlton jaycarlton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my backend notes.

For the big PRs, I like to pull the branch so I can navigate better. The downside is that I might have seen things that are slightly outside the scope of the PR 🙈 .

workspaceService.enforceWorkspaceAccessLevel(
workspaceNamespace, firecloudWorkspaceName, WorkspaceAccessLevel.WRITER);

this.leonardoNotebooksClient.deleteCluster(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: drop the this..

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. It gets hard switching between Typescript and Java sometimes...

api/src/main/resources/workbench.yaml Show resolved Hide resolved
// a workspace's namespace is always its billing project ID
private static final String WORKSPACE_NS = BILLING_PROJECT_ID;
private static final String WORKSPACE_ID = "wsid";
private static final String WORKSPACE_NAME = "wsn";
// Workspace ID is also known as firecloud_name. This identifier is generated by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment would be very useful in the code as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it to the workbench.yaml parameters definition description, which feels like the most centralized place where this lives. Maybe it could go on the DB model class as well, but we tend to not have as much comment type stuff over there.

@Autowired ClusterAuditor clusterAuditor;
@Autowired LeonardoNotebooksClient notebookService;
@Autowired FireCloudService fireCloudService;
@MockBean AdminActionHistoryDao adminActionHistoryDao;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I like to name mocked members with a mock prefix, especially for large test classes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@MockBean ComplianceService complianceService;
@MockBean DirectoryService directoryService;
@MockBean FireCloudService fireCloudService;
@MockBean LeonardoNotebooksClient notebookService;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this name intentional? Looks like there's a separate org.pmiops.workbench.notebooks.NotebooksService.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.


assertThrows(
ForbiddenException.class, () -> clusterController.listClusters(WORKSPACE_NS, WORKSPACE_ID));
assertThrows(ForbiddenException.class, () -> clusterController.getCluster(WORKSPACE_NS));
verify(workspaceService, never()).validateActiveBilling(anyString(), anyString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need the first doThrow if we're going to verify that the method that does the throwing isn't called? Here and below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch – fixed.

Copy link
Contributor

@jaycarlton jaycarlton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the nits that's fit to crit....ique.

// Fetch the current cluster status, with some graceful error handling for NOT_FOUND response
// and abort signals.
try {
this.currentCluster = await this.getCluster();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not have getCluster() call onStatusUpdate() itself?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's easier to follow when all of the status updates are happening in the main poll method.

new ClusterInitializationFailedError('Abort signal received during cluster API call',
this.currentCluster));
} else if (this.isNotFoundError(e)) {
// A not-found error is somewhat expected, if a cluster has recently been deleted or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally Leo would expose info on recently deleted clusters (say for 7 days) so that we don't need to guess.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does expose that information via getClusters?includeDeleted=true. However, whether or not it was deleted is irrelevant here. In either situation we're going to create a new cluster.

} else if (this.isClusterErrored()) {
// If cluster is in error state, delete it so it can be re-created at the next poll loop.
reportError(
`Cluster ${this.currentCluster.clusterNamespace}/${this.currentCluster.clusterName}` +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: exposing a property for the full cluster path/fqn would be cleaner than doing it here I think.

}
}

setTimeout(() => this.poll(), this.currentDelay);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do we already have backoff delay programmed someplace? Seems odd for it to live here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have some stuff over here http://github.com/all-of-us/workbench/blob/67ab8fb211b1b76440cff340343c7a107903a87b/ui/src/app/utils/retry.tsx#L69-L69 but nothing which fit this use case. Trying to generalize this would have caused more maintenance overhead than inlining the two lines to increment exponential decay.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but there's also a lot of maintenance overhead if we have multiple, slightly inconsistent ways of doing basic things like retrying.

@@ -29,7 +29,7 @@ export function reportError(err: Error) {

/** Returns true if the given error is an AbortError, as used in fetch() aborts. */
export function isAbortError(e: Error) {
return (e instanceof DOMException) && e.name === 'AbortError';
return e instanceof DOMException && e.name === 'AbortError';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: fwiw I think the parentheses help here.

Copy link
Contributor

@calbach calbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Client side looks reasonable, mostly small things. Thanks for the detailed comments.

retry();
});
private async runCluster(onClusterReady: Function): Promise<void> {
const initializer = new ClusterInitializer({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: looking at this interface, I'm wondering whether it is worth exposing this class at all. Seems like a static helper like initializeCluster({ ... }); might restrict the API surface a bit more clearly, since the current API is basically: create instance, call run() exactly once.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can take a stab at that. I ended up modeling this as a static method on ClusterInitializer rather than a separately-exported function. LMK if you feel it looks out of place and we should really only export a wrapper function.


private pollClusterTimer: NodeJS.Timer;
private aborter = new AbortController();
private initializer?: ClusterInitializer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to hold this reference?

await this.initializer.run();
this.setState({isPollingCluster: false});
} catch (e) {
if (this.aborter.signal.aborted) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opt: this may be a more direct way of checking:

export function isAbortError(e: Error) {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd originally had that, but the awkward thing here is that the ClusterInitializer captures its own AbortError and throws a more generic initialization failure error (

if (isAbortError(e)) {
).

I'll attempt to clarify this by doubling-down on the error subclassing approach, with a more specific "abort error" thrown from the initializer.

// If the initializer has completed and the status is null, it means that
// a cluster doesn't exist for this workspace.
return this.createTooltip(
'You have not yet created a notebook server for this workspace.',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this language is slightly misleading, I would suggest:

"You do not currently have an active notebook server for this workspace."

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point – done. (I assume this is mostly because we leave no clusters behind when we run the periodic reset-everything cron.)

p.then(() => isSettled = true).catch((e) => {
isSettled = true;
// Re-throw the error.
// throw e;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uncomment?

export class ClusterInitializationFailedError extends Error {
public readonly cluster: Cluster;

constructor(message: string, cluster?: Cluster) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Typescript you can just do constructor(private readonly cluster?: Cluster) and remove the rest of the instance property boilerplate

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* App Engine server-side request timeout. To reliably implement this control on the server side
* would likely require new database persistence, tasks queues, and additional APIs in order to
* provide the client with status updates.
* - Ultimately, we might expect this type of functionality ("Get me a cluster for workspace X and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I agree with this statement. We've conceptually merged several Leo concepts into one within the AoU product, primarily on account of product decisions to avoid exposing the inner workings of the cluster state. This creates some impedance mismatches which generates complexity here. For example, we don't really expose the fact that a cluster can become paused, or that it can become deleted. If we decided to surface these things (as Terra UI probably does), it would create less business logic here.

  • Polling cannot be removed, unless cluster creation is redesigned and made instantaneous somehow. Even if we pushed the maximum amount of logic into Leo, the operation is fundamentally asynchronous.
  • Logic around when to create/start/stop a cluster at the request of a user needs to live at the product level.
  • "recreate this cluster" could potentially be a Leo operation, but since we customize Leo clusters anyways, we probably want a hand in recreating clusters anyways.

The other avenue we could explore would be to say that Leo should not have an error state, but ultimately some cluster creations will fail regardless of how many retries Leo adds. So I do think a product-level cluster error state is still a necessity.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points all around — but I still feel that if Leo had an async API which was effectively "Create or reuse a cluster with config X, for project Y and user Z, and get it to a running state", it would be heavily used by both Terra and AoU, since to both of these UIs it's effectively a means to an end (which is to have a Jupyter notebook running in my browser).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the less we know about the underlying clusters the better. We shouldn't know whether we have a new or existing cluster ideally (provided we can supply config).

const DEFAULT_MAX_CREATE_COUNT = 2;
const DEFAULT_MAX_DELETE_COUNT = 2;
const DEFAULT_MAX_RESUME_COUNT = 2;
const DEFAULT_MAX_SERVER_ERROR_COUNT = 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 seems reasonable for consecutive errors, but might be a bit low for global error count during the polling cycle. May want to make this distinction.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wary of adding too much more complexity without letting this settle in for awhile. Okay if I bump to 10 and call it a day?

if (this.abortSignal && this.abortSignal.aborted) {
// We'll bail out early if an abort signal was triggered while waiting for the poll cycle.
return this.reject(
new ClusterInitializationFailedError('Request was aborted.', this.currentCluster));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opt nit: probably wouldn't put trailing punctuation in my error messages, though I don't know we have any real conventions/guidelines on this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack – done.

ui/src/app/utils/cluster-initializer.tsx Show resolved Hide resolved
Copy link
Collaborator Author

@gjuggler gjuggler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reviews!

throw new BadRequestException("Must specify billing project");
}
public ResponseEntity<Cluster> getCluster(String workspaceNamespace) {
String firecloudWorkspaceName = lookupWorkspace(workspaceNamespace).getFirecloudName();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it wouldn't be worth it. This controller is <10 lines (including wrapping!) which is way shorter than before, and very much in the spirit of https://github.com/all-of-us/workbench/blob/master/api/doc/code-structure.md#controllers.

(And, upon reading that section again, I would probably disagree that 'controllers should contain no actual logic'. IMO, having some calls to well-factored access enforcement and validity checking service methods is more reasonably the scope of a controller than of a service.)

workspaceService.enforceWorkspaceAccessLevel(
workspaceNamespace, firecloudWorkspaceName, WorkspaceAccessLevel.WRITER);

this.leonardoNotebooksClient.deleteCluster(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. It gets hard switching between Typescript and Java sometimes...


FirecloudWorkspace fcWorkspace;
FirecloudWorkspace firecloudWorkspace;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if (this.abortSignal && this.abortSignal.aborted) {
// We'll bail out early if an abort signal was triggered while waiting for the poll cycle.
return this.reject(
new ClusterInitializationFailedError('Request was aborted.', this.currentCluster));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack – done.

ui/src/app/utils/cluster-initializer.tsx Show resolved Hide resolved
// Fetch the current cluster status, with some graceful error handling for NOT_FOUND response
// and abort signals.
try {
this.currentCluster = await this.getCluster();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's easier to follow when all of the status updates are happening in the main poll method.

if (this.isClusterErrored()) {
// TODO: report errors to Stackdriver here.
console.log('Cluster error -- deleting for retry');
await this.deleteCluster();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I owed you a reply on this one. I thought hard about treating delete cluster specially somehow, but ultimately I concluded that it was simpler & more readable, and arguably not too much more dangerous, to have it treated the same as our other state transitions. My hope is that the other guard rails (max initialization time, max delete count, max 500s) will help mitigate any potential out of control loops.

}
}

setTimeout(() => this.poll(), this.currentDelay);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have some stuff over here http://github.com/all-of-us/workbench/blob/67ab8fb211b1b76440cff340343c7a107903a87b/ui/src/app/utils/retry.tsx#L69-L69 but nothing which fit this use case. Trying to generalize this would have caused more maintenance overhead than inlining the two lines to increment exponential decay.

@gjuggler
Copy link
Collaborator Author

Should be ready for another look now

throw new BadRequestException("Must specify billing project");
}
public ResponseEntity<Cluster> getCluster(String workspaceNamespace) {
String firecloudWorkspaceName = lookupWorkspace(workspaceNamespace).getFirecloudName();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can go along with that. I just haven't see much of a clean example of how to draw the line. If we're not really MVC, then I don't know what the "C" means.

fcResponse = fireCloudService.getWorkspace(workspaceNamespace, workspaceId);
fcResponse =
fireCloudService.getWorkspace(
dbWorkspace.getWorkspaceNamespace(), dbWorkspace.getFirecloudName());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking @ericsong was leading the charge to get these out of the service level. Maybe I misunderstood.

api/src/main/resources/workbench.yaml Show resolved Hide resolved
ui/src/app/utils/cluster-initializer.spec.tsx Show resolved Hide resolved
* App Engine server-side request timeout. To reliably implement this control on the server side
* would likely require new database persistence, tasks queues, and additional APIs in order to
* provide the client with status updates.
* - Ultimately, we might expect this type of functionality ("Get me a cluster for workspace X and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the less we know about the underlying clusters the better. We shouldn't know whether we have a new or existing cluster ideally (provided we can supply config).

ui/src/app/utils/cluster-initializer.tsx Show resolved Hide resolved
ui/src/app/utils/cluster-initializer.tsx Show resolved Hide resolved
}
}

setTimeout(() => this.poll(), this.currentDelay);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but there's also a lot of maintenance overhead if we have multiple, slightly inconsistent ways of doing basic things like retrying.

Copy link
Contributor

@calbach calbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. Looks good, thanks!

@@ -1,5 +1,7 @@
package org.pmiops.workbench.notebooks;

import static org.pmiops.workbench.exceptions.ExceptionUtils.convertNotebookException;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Ignore if this has already been discussed to death elsewhere, but not a big fan of this static import. In general I don't love them outside of tests and in this case it's not saving much space below

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah – I'm a blind slave to IntelliJ, and they put the "import static..." option above the "quality usage of..." so I chose the first one. Fixed!

new ClusterInitializationFailedError('Abort signal received during cluster API call',
this.currentCluster));
} else if (this.isNotFoundError(e)) {
// A not-found error is somewhat expected, if a cluster has recently been deleted or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does expose that information via getClusters?includeDeleted=true. However, whether or not it was deleted is irrelevant here. In either situation we're going to create a new cluster.

if (this.isClusterErrored()) {
// TODO: report errors to Stackdriver here.
console.log('Cluster error -- deleting for retry');
await this.deleteCluster();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes makes sense, I surmised as much from the changes - thanks.

@gjuggler gjuggler merged commit efc5fa2 into master Feb 27, 2020
@gjuggler gjuggler deleted the gj/cluster-api branch February 27, 2020 22:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants