Global search hangs UI for tens of seconds with large numbers of jobs #8549

Closed
optiz0r opened this issue Jul 28, 2020 · 4 comments · Fixed by #8571

Comments

@optiz0r
Contributor

optiz0r commented Jul 28, 2020

Nomad version

Nomad v0.12.1

Issue

On a busy cluster with large numbers of jobs, the global search hangs up the UI for tens of seconds while it retrieves and processes the job lists and displays the results. For example, a call to v1/jobs has been observed to take 18s to return 762 rows of job data, and the UI was hung for all that time. There is no visible indication that the search is doing anything, page elements stop reacting to input events, and the cursor in the search box stops flashing.

Reproduction steps

On a cluster with a large number of jobs:

  • Type something into the global search box
  • Observe UI is stalled

Notes

I see the UI is making its own calls to /v1/jobs and /v1/nodes and doing the processing client side, and seems to be doing this in the main thread. I see there is also a /v1/search API that can do prefix matching. Would it make more sense to use this API endpoint to do the filtering server side, and update the UI asynchronously?
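
For illustration, a rough sketch of what server-side filtering might look like (the request and response shapes follow Nomad's documented /v1/search API; the surrounding wiring and the async call site are just assumptions on my part):

```typescript
// Sketch only: prefix filtering done server side via /v1/search instead of
// fetching all of /v1/jobs and matching client side.
interface SearchResponse {
  Matches: { [context: string]: string[] };
  Truncations: { [context: string]: boolean };
}

async function searchJobs(prefix: string): Promise<string[]> {
  const res = await fetch('/v1/search', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ Prefix: prefix, Context: 'jobs' }),
  });
  const data: SearchResponse = await res.json();
  return data.Matches['jobs'] ?? [];
}

// Because the request is async, the main thread stays responsive while the
// server does the matching; results are rendered whenever they arrive.
searchJobs('my-prefix').then((ids) => console.log(ids));
```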

@DingoEatingFuzz
Contributor

Hi @optiz0r, thank you for this report! It's incredibly helpful to hear how features are performing in the wild. I have some back story and a couple questions for you.

First, a note on the /v1/search API: it only supports prefix search (to power CLI autocomplete) and we really wanted fuzzy find behavior. It wasn't an easy call to make but we figured since we already have to fetch all jobs for the jobs list page and all nodes for the clients list page that we could get away with this approach and ultimately have a better UX.

Clearly having the UI hang for 18s is not a good UX, so we need to do something about that.

For example, a call to v1/jobs has been observed to take 18s to return 762 rows of job data, and the UI was hung for all that time.

When you say the call takes 18s is that the time it takes for the API request to respond or the amount of time before the UI is interactive again?

With a busy cluster, are you experiencing the UI hanging or generally chugging elsewhere or is it just the global search feature?

Are you only experiencing the UI being slow or are all cluster actions slow?

@optiz0r
Contributor Author

optiz0r commented Jul 30, 2020

First, a note on the /v1/search API: it only supports prefix search (to power CLI autocomplete) and we really wanted fuzzy find behavior. It wasn't an easy call to make but we figured since we already have to fetch all jobs for the jobs list page and all nodes for the clients list page that we could get away with this approach and ultimately have a better UX.

Makes sense! I agree substring matches are more useful to have here, particularly as many of our jobs have a common prefix.

When you say the call takes 18s is that the time it takes for the API request to respond or the amount of time before the UI is interactive again?

The 18s I quoted is what Chrome dev tools had as the total time for the /v1/jobs request. The total traffic for the search requests is ~47KB, so not tiny but also not enormous. This was the longest time I saw. I had repeated the search multiple times in succession and saw this drop down to about 12s. The UI definitely feels frozen for longer than the download time of that one request.

It's worth noting, the UI locks up immediately on first keypress into the search box, so quickly that often it doesn't render any of the characters typed into the search box until after it finished doing its thing (or maybe just the first one). The typed characters suddenly appear in the text field at the same time as the results are rendered. It feels very much like the search operation is being done in the main thread, blocking everything else, and would be better done in a separate thread/worker.
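
To illustrate what I mean, something along these lines (the substring filter stands in for whatever matcher the UI actually uses; the file names and message shapes are made up):

```typescript
// search.worker.ts (sketch): do the matching off the main thread.
onmessage = (e: MessageEvent<{ query: string; items: string[] }>) => {
  const { query, items } = e.data;
  const matches = items.filter((name) =>
    name.toLowerCase().includes(query.toLowerCase())
  );
  postMessage(matches);
};

// main thread (sketch): keystrokes keep rendering while the worker filters.
const worker = new Worker('search.worker.js');
worker.onmessage = (e: MessageEvent<string[]>) => {
  console.log(`rendering ${e.data.length} matches`);
};
function search(query: string, items: string[]) {
  worker.postMessage({ query, items });
}
```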

--

Repeating my tests this morning, I'm seeing the /v1/jobs and /v1/nodes calls complete in tens of milliseconds rather than seconds, but the UI is still locking up for ~10 seconds on each search. I was doing rolling upgrades of Nomad clients when I first noticed, so perhaps the cluster was more busy than usual running allocations at the time, but that's not the whole story here.

If it makes any difference, I have been running that search while I had the CSI storage view open (because there's no background ajax calls, which made it easier for me to see in dev tools what was happening during the search). Does that mean you don't have any of the job/node data already available in memory to search on and so it takes longer to fetch them on-demand?

Repeating the search on the jobs list view, I see slightly different behaviour in the dev tools. I only see a /v1/nodes request appear in the network tab, no /v1/jobs request. I do see regular hits for jobs?index=... as the jobs list updates in the background, but these also stop appearing for the duration of the search operation.

Are you only experiencing the UI being slow or are all cluster actions slow?

I have not been looking for, or noticed any other slowness.

With a busy cluster, are you experiencing the UI hanging or generally chugging elsewhere or is it just the global search feature?

UI seems reasonably snappy otherwise, it's just the global search that's noticeably slow.

Most of our workload on this cluster is batch jobs which are completed but not yet GC'd. You could probably re-create the experience quite easily by spawning a few hundred dummy batch jobs that terminate instantly (e.g. exec "echo hello world"). There's no need to have hundreds of running service jobs.
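
Something like this hypothetical helper would do it (assumes the nomad CLI is on PATH, the raw_exec driver is enabled, and the datacenter is called dc1; adjust to taste):

```typescript
// Sketch: register a few hundred dummy batch jobs that exit immediately.
import { execFileSync } from 'child_process';
import { writeFileSync } from 'fs';
import { tmpdir } from 'os';
import { join } from 'path';

for (let i = 0; i < 300; i++) {
  const spec = `
job "dummy-${i}" {
  datacenters = ["dc1"]
  type        = "batch"
  group "g" {
    task "t" {
      driver = "raw_exec"
      config {
        command = "echo"
        args    = ["hello world"]
      }
    }
  }
}
`;
  const path = join(tmpdir(), `dummy-${i}.nomad`);
  writeFileSync(path, spec);
  execFileSync('nomad', ['job', 'run', '-detach', path], { stdio: 'inherit' });
}
```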

Ben

backspace added a commit that referenced this issue Jul 30, 2020
This closes #8549. Thanks to @optiz0r for the bug report. Having
the global search attempt to render every returned result is
obviously a mistake!

I chose to have the full number of matches still render, though
I also considered having it display (10+) instead. The choice of
truncating at 10 is arbitrary, maybe a higher number would be
preferable, I’m open to suggestions.
backspace added a commit that referenced this issue Aug 5, 2020
This closes #8549. Thanks to @optiz0r for the bug report. Having
the global search attempt to render every returned result is
obviously a mistake!
@DingoEatingFuzz
Contributor

Hey @optiz0r!

Thanks again for your thorough report of your experience. It led @backspace to find a definite UI performance issue that will go out with the next Nomad release!

@github-actions

github-actions bot commented Nov 4, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 4, 2022