JobUpdateTask cleanups #650

ches · 2020-04-25T19:11:55Z

What this PR does / why we need it:

Cleanup: this is purely a refactoring, aiming to make the code easier to read, understand, and modify.

I've been doing some work—and reviews of work—around this code, and found it a bit hard to follow and work with. This is my attempt at improving that.

This PR is meant to be reviewed commit-by-commit; it should hopefully make sense quickly that way. Tests are untouched until the final commit and should pass at every step.

I have a question or two that I'll ask inline, so I'm marking this WIP. I may update a bit more based on the answers, especially docstrings where warranted.

Which issue(s) this PR fixes:

No issue

Does this PR introduce a user-facing change?:

NONE

ches · 2020-04-25T19:19:07Z

core/src/main/java/feast/core/service/JobService.java

    Optional<Job> getJob = this.jobRepository.findById(request.getId());
    if (getJob.isEmpty()) {
+      // FIXME: if getJob.isEmpty then constructing this error message will always throw an error...
      throw new NoSuchElementException(
          "Attempted to stop nonexistent job with id: " + getJob.get().getId());


This is a bug, I'll raise it in a separate issue/PR, didn't want to change it here because it should probably have a hotfix instead of waiting on this refactoring PR to get through. There are two or three instances of the same problem in this file.

The intention here was probably to use request.getId() for the exception message. getJob.get() will throw NoSuchElementException by definition. @mrzzy

Good catch. It should be request.getId() instead.
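To make the failure mode concrete, here is a minimal sketch of the bug and of the suggested fix. The class, the in-memory map, and the method names are hypothetical stand-ins for illustration, not the actual JobService or JobRepository code.

```java
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Optional;

// Hypothetical stand-in for the repository lookup; not the Feast JobRepository API.
final class StopJobSketch {
  private static final Map<String, String> JOBS = Map.of("job-1", "RUNNING");

  static Optional<String> findById(String id) {
    return Optional.ofNullable(JOBS.get(id));
  }

  // Buggy shape: the message is built by calling get() on the empty Optional,
  // so get() throws its own NoSuchElementException before ours is constructed.
  static String buggyLookup(String requestId) {
    Optional<String> job = findById(requestId);
    if (job.isEmpty()) {
      throw new NoSuchElementException(
          "Attempted to stop nonexistent job with id: " + job.get());
    }
    return job.get();
  }

  // Fixed shape: build the message from the id in the request instead.
  static String fixedLookup(String requestId) {
    return findById(requestId)
        .orElseThrow(
            () ->
                new NoSuchElementException(
                    "Attempted to stop nonexistent job with id: " + requestId));
  }
}
```

Both variants throw the same exception type, which is probably why this went unnoticed; only the fixed one carries the id in its message.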

ches · 2020-04-25T19:42:16Z

core/src/main/java/feast/core/job/JobUpdateTask.java

@@ -73,147 +74,131 @@ public JobUpdateTask(
    this.currentJob = currentJob;
    this.jobManager = jobManager;
    this.jobUpdateTimeoutSeconds = jobUpdateTimeoutSeconds;
+    this.runnerType = jobManager.getRunnerType().toString();
  }

  @Override
  public Job call() {
    ExecutorService executorService = Executors.newSingleThreadExecutor();


@zhilingc I'm trying to understand why this is used—JobUpdateTask is a Callable and its instances are all dispatched on separate threads by JobCoordinatorService, it seems redundant to have the tasks then use another thread again where they invoke a job manager (that will [usually] do network I/O). If anything, it would seem that futures should be at the layer of the job manager instead. call() submits and then blocks on them before returning.

One more question for you is about the parameters of helper methods like startJob and updateJob: most of them except the job are instance state of the JobUpdateTask itself, and they were able to be final, so it seems like there's no need for passing them around through arguments. I removed them in the process of working out what call() actually needed to touch. Am I missing anything, or were they just a vestige of some refactoring in your initial work?

After reading through the code I'm not sure what my past self was thinking tbqh :( You're right, it's completely redundant.

As for the second point, you're right. I must've missed it.

ches · 2020-04-25T20:22:54Z

core/src/main/java/feast/core/model/JobStatus.java

-  public static Collection<JobStatus> getTerminalState() {
-    return TERMINAL_STATE;
+  public static Set<JobStatus> getTerminalStates() {
+    return TERMINAL_STATES;


Got a little carried away, JobStatus changes are further than I meant to go for now… but I started touching JobCoordinatorService for filter(job -> !job.hasTerminated()) and I ended up here…

These semantically are/should be Sets I think. I'm not sure I see value in these static getters wrapping constants, after this PR their only remaining use is in a few tests. Considered removing them, but thought I'd leave it to discuss first.

Makes sense.
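As an illustration of the Set-based shape discussed here, the following is a small sketch. The enum name, its values, and the choice of which states count as terminal are assumptions for the example and do not mirror the real JobStatus.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative enum only; the real JobStatus has its own values.
enum StatusSketch {
  PENDING,
  RUNNING,
  COMPLETED,
  ABORTED,
  ERROR;

  // Assumed terminal states for this sketch, exposed as an unmodifiable Set.
  static final Set<StatusSketch> TERMINAL_STATES =
      Collections.unmodifiableSet(EnumSet.of(COMPLETED, ABORTED, ERROR));

  boolean hasTerminated() {
    return TERMINAL_STATES.contains(this);
  }

  // Mirrors the filter(job -> !job.hasTerminated()) style mentioned above.
  static List<StatusSketch> stillActive(List<StatusSketch> statuses) {
    return statuses.stream().filter(s -> !s.hasTerminated()).collect(Collectors.toList());
  }
}
```

With a constant like this, a static getter adds little beyond what the constant already offers, which is the point raised about removing the wrappers.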

ches · 2020-04-26T04:00:54Z

/assign @pradithya

zhilingc · 2020-04-26T16:34:57Z

Thanks for the PR, I'll have a look tomorrow!

zhilingc · 2020-04-27T02:47:23Z

No complaints here, thanks so much for the PR, and sorry for the mess :/

zhilingc · 2020-04-27T02:47:32Z

/test test-end-to-end

zhilingc · 2020-04-27T02:55:47Z

/test test-end-to-end-batch

zhilingc · 2020-04-27T03:40:00Z

/lgtm

zhilingc · 2020-04-27T03:40:18Z

/test test-end-to-end-batch

woop · 2020-04-27T05:52:02Z

Thanks @ches, looks great. The only open item from my side is @mrzzy's comment. Otherwise it looks good to me.

ches · 2020-04-27T14:36:12Z

/hold

ches · 2020-04-28T02:00:45Z

/test test-end-to-end-batch

woop · 2020-04-28T02:13:34Z

Just FYI we are considering enabling the e2e tests again by default. The added benefit of not having to wait for them in small PRs isn't worth the manual effort for big PRs.

ches · 2020-04-28T02:18:40Z

Okay, so I did the spec renaming for consistency as @mrzzy suggested, and when I stepped back and looked at it, all the instance variables of Job were protobuf types and it made me sad. So in the last commit I went for the full monty and converted them all to domain types. JobCoordinatorService now unmarshals protos to domain models as soon as it takes them off the wire, so they're not passed around the program anywhere. I think the result is much nicer, with no to-and-fro conversions happening in several places across classes.

I tried removing the redundant Executor in JobUpdateTask and made a realization: it currently serves to impose the per-task timeout. In JobCoordinatorService we do CompletionService#take() which can block indefinitely, so it relies on the tasks timing out internally in which case the callable returns null. Changing from take() to poll(long timeout, TimeUnit unit) would require updating logic further in the coordinator to preserve existing behavior, so I'm going to leave it alone for now.
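The pattern described above, where an inner single-thread executor exists only to bound how long a task may run, can be sketched roughly as follows. The class and method names are hypothetical, and the handling of interruption and task failure is an assumption; only the "return null on timeout" behavior is taken from the description.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; not the actual JobUpdateTask implementation.
final class TimeoutSketch {
  // Runs work on a helper thread and gives up after timeoutMillis,
  // returning null on timeout so a caller blocked on take() is released.
  static <T> T callWithTimeout(Callable<T> work, long timeoutMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<T> future = executor.submit(work);
    try {
      return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the helper thread
      return null;
    } catch (InterruptedException e) {
      // Assumption: treat interruption like a timeout, preserving the flag.
      Thread.currentThread().interrupt();
      return null;
    } catch (ExecutionException e) {
      // Assumption: surface task failures as unchecked exceptions.
      throw new IllegalStateException(e.getCause());
    } finally {
      executor.shutdownNow();
    }
  }
}
```

Moving the timeout to the coordinator would mean replacing take() with poll(timeout, unit) and handling a null Future there, which is the follow-on logic change mentioned above.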

I'm leaning more toward my instinct in the original comment/question for @zhilingc being the right answer eventually: the asynchrony (and possibility of failure) should be expressed at the JobManager interface layer, indeed there is an existing TODO that it needs a timeout.

ches · 2020-04-28T02:19:07Z

/hold cancel

ches · 2020-04-28T02:27:37Z

The job manager soon after converts things back to proto again, but that's natural, these are I/O boundaries of the system. The overhead of these conversions is the price we pay if we don't want wire format in business logic layer, which is probably worth it. It ought to be less than the actual de/serialization anyway, and this stuff isn't a hot path/tight loop 🤞

ches · 2020-04-28T02:34:09Z

Oh, finally, let me know if you'd like me to try to squash a few commits. I think you're typically squashing merges anyway so it'd not be time well spent in that case.

woop · 2020-04-28T07:32:00Z

core/src/main/java/feast/core/service/JobCoordinatorService.java

-
-    log.info("Updating feature set status");
-    updateFeatureSetStatuses(jobUpdateTasks);
+    executorService.shutdown();
  }

  // TODO: make this more efficient
  private void updateFeatureSetStatuses(List<JobUpdateTask> jobUpdateTasks) {
    Set<FeatureSet> ready = new HashSet<>();


Not really related to this PR per se, but would it be possible for us to remove this updateFeatureSetStatuses call and use the real job state?

I think you're asking: rather than cache feature set statuses in bulk, could status be returned on-demand by checking job state for one given feature set, when requested? I guess the team will have more context on the design.

As it stands, I'd have to get my head more into functional consideration of what it's doing beyond the mechanical refactoring I've done to this class, but it seems like there could be more to look at later:

1. In the Poll() loop we are:

   * At the beginning, doing getJob for the source/store combo of all feature sets that are subscribed to
   * Then, doing the startOrUpdate of tasks for all those, and waiting on all those tasks
   * Throwing away the results of waiting (updated Job references), because the startOrUpdate returns void
   * Doing getJob again for each of the tasks, in the updateFeatureSetStatuses

   It seems like this second round of DB lookups could be avoided by keeping and using the Jobs we just got back. Unless I'm missing something, if it's trying to avoid a race condition or something.

2. getJob does jobRepository.findBySourceIdAndStoreNameOrderByLastUpdatedDesc returning all matching records, and then takes the first (most recent) one client-side instead of having SQL do it. And this happens in loops, 2x per polling loop as noted in 1.
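The first point, keeping the results the tasks hand back instead of looking them up again, might look something like this sketch. The class and method names are hypothetical, and skipping null or failed results is an assumption about the desired behavior rather than what JobCoordinatorService does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical coordinator helper; not the Feast implementation.
final class CollectResultsSketch {
  // Submits every task, waits for each to finish, and returns the non-null
  // results so the caller can reuse them rather than re-querying the database.
  static <T> List<T> runAll(List<Callable<T>> tasks) {
    ExecutorService executor = Executors.newFixedThreadPool(Math.max(1, tasks.size()));
    List<T> results = new ArrayList<>();
    try {
      CompletionService<T> completion = new ExecutorCompletionService<>(executor);
      for (Callable<T> task : tasks) {
        completion.submit(task);
      }
      for (int i = 0; i < tasks.size(); i++) {
        try {
          T result = completion.take().get();
          if (result != null) { // null stands for a task that timed out internally
            results.add(result);
          }
        } catch (ExecutionException e) {
          // Assumption: a failed task simply contributes no result.
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // stop waiting, keep what we have
    } finally {
      executor.shutdown();
    }
    return results;
  }
}
```

Results arrive in completion order, not submission order, so a caller that needs to match results to feature sets would have to carry an identifier inside each result.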

Tracking your comments here: #664

and then takes the first (most recent) one client-side instead of having SQL do it.

That's very low-hanging fruit.

woop · 2020-04-29T15:06:05Z

core/src/main/java/feast/core/job/JobUpdateTask.java

    this.store = store;
    this.currentJob = currentJob;
    this.jobManager = jobManager;
    this.jobUpdateTimeoutSeconds = jobUpdateTimeoutSeconds;
+    this.runnerName = jobManager.getRunnerType().toString();


The runnerName as we have it above refers to the runnerType right? Should we deduplicate that from the configuration (we could do that at the configuration level as well)?

    feast:
      jobs:
        polling_interval_milliseconds: 60000
        job_update_timeout_seconds: 240
        active_runner: my_direct_runner
        runners:
          - name: my_direct_runner
            type: DirectRunner
            options: {}

https://github.com/gojek/feast/blame/ded7ca59a2bd5b40715af73fdaa4e4f19ad0b915/core/src/main/java/feast/core/config/FeastProperties.java#L95

The idea with the configuration was to provide a more forward compatible configuration schema, but we could just as easily use type there instead of name.

Yes, from your example config, jobManager.getRunnerType().toString() will be "DirectRunner" (whereas jobManager.getRunnerType().name() would be "DIRECT").

In JobUpdateTask this value is being used only for log messages.
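For readers unfamiliar with the distinction, here is a small sketch of an enum whose toString() differs from name(), matching the "DirectRunner" versus "DIRECT" behavior described above. The enum itself, including the second constant and the fromString helper, is hypothetical and not copied from the Feast source.

```java
// Hypothetical enum modeled on the behavior described above; not the Feast source.
enum RunnerSketch {
  DIRECT("DirectRunner"),
  DATAFLOW("DataflowRunner");

  private final String humanName;

  RunnerSketch(String humanName) {
    this.humanName = humanName;
  }

  // toString() returns the configuration-facing name; name() stays the constant name.
  @Override
  public String toString() {
    return humanName;
  }

  // Looks up a constant by the string used in configuration, e.g. the "type" field.
  static RunnerSketch fromString(String humanName) {
    for (RunnerSketch runner : values()) {
      if (runner.humanName.equals(humanName)) {
        return runner;
      }
    }
    throw new IllegalArgumentException("Unknown runner type: " + humanName);
  }
}
```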

ches · 2020-04-29T17:04:40Z

Is there anything further I should update on this one?

woop · 2020-04-30T00:32:26Z

/lgtm

woop · 2020-04-30T00:32:53Z

/approve

feast-ci-bot · 2020-04-30T00:32:59Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ches, woop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [woop]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

woop · 2020-04-30T00:35:30Z

I'm leaning more toward my instinct in the original comment/question for @zhilingc being the right answer eventually: the asynchrony (and possibility of failure) should be expressed at the JobManager interface layer, indeed there is an existing TODO that it needs a timeout.

This seems reasonable, although I have only superficially looked at job management. I do think we need to look at this in a bit more detail soon though, especially if folks are bringing their own runners.

* Fix Optional#get() and string comparison bugs in JobService

  Clears a FIXME left from #650, get() uses that will always fail. Also fixes a bunch of == comparisons on strings, reference equality is not what you want there.

* Hide Maven transfer progress, for CI

  It makes trying to read output in GitHub Actions horrible. Seeing failures in red is kind of nice, so `--no-transfer-progress` keeps colors unlike `--batch-mode`.
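As a brief illustration of the string comparison point, the sketch below contrasts reference equality with content equality. The class and method names are made up for the example.

```java
import java.util.Objects;

// Hypothetical helper class, for illustration only.
final class StringEqualitySketch {
  // Reference equality: true only when both variables point at the same object.
  static boolean sameReference(String a, String b) {
    return a == b;
  }

  // Content equality, null-safe: this is what status or id comparisons usually need.
  static boolean sameContent(String a, String b) {
    return Objects.equals(a, b);
  }
}
```

A string built at runtime, for instance one read from a database or assembled with a StringBuilder, has the same content as a literal but is a different object, so an == check on it evaluates to false.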

ches added the do-not-merge/work-in-progress label Apr 25, 2020

ches requested review from zhilingc and mrzzy April 25, 2020 19:11

ches requested a review from pradithya as a code owner April 25, 2020 19:11

feast-ci-bot added size/L and removed do-not-merge/work-in-progress labels Apr 25, 2020

feast-ci-bot assigned pradithya Apr 26, 2020

ches mentioned this pull request Apr 26, 2020

core: Use Runner enum type instead of string for Job model #651

Merged

feast-ci-bot assigned zhilingc Apr 27, 2020

feast-ci-bot added the lgtm label Apr 27, 2020

mrzzy reviewed Apr 27, 2020

mrzzy reviewed Apr 27, 2020

feast-ci-bot added the do-not-merge/hold label Apr 27, 2020

ches added 3 commits April 27, 2020 21:38

core: Clean up repetition in JobUpdateTask

ac1143d

Logic is a bit complicated here and trimming excess noise starts to help make it quicker to grok. We were doing the equivalent of Set#equals by hand.

core: Reduce noise of AuditLogger in JobUpdateTask

59f0352

core: Simplify JobUpdateTask internal helper calls

5871d13

The helper methods like startJob and updateJob had parameters that are instance state, so there seemed to be no reason to pass around arguments for them.

core: Refactor JobUpdateTask to use domain model types

ded7ca5

And factor out things that don't differ between its test cases. JobCoordinatorService unmarshals protos to model types eagerly as they come off the wire, so we're left dealing with models everywhere else.

ches force-pushed the JobUpdateTask-cleanup branch from a3b0452 to ded7ca5 Compare April 28, 2020 01:46

feast-ci-bot added size/XL and removed lgtm size/L labels Apr 28, 2020

feast-ci-bot removed the do-not-merge/hold label Apr 28, 2020

woop reviewed Apr 28, 2020

woop reviewed Apr 29, 2020

woop mentioned this pull request Apr 30, 2020

Job Management Optimizations #664

Closed

feast-ci-bot assigned woop Apr 30, 2020

feast-ci-bot added the lgtm label Apr 30, 2020

feast-ci-bot added the approved label Apr 30, 2020

feast-ci-bot merged commit 497b08d into master Apr 30, 2020

woop mentioned this pull request Apr 30, 2020

Document release steps #476

Merged

ches deleted the JobUpdateTask-cleanup branch May 7, 2020 12:11

ches mentioned this pull request Jun 18, 2020

Fix Optional#get() and string comparison bugs in JobService #804

Merged
