
Conversation

@XuTingjun
Contributor

The bug: when clicking an app on the incomplete page, the task progress shows 100/2000. After the app finishes, clicking into the app again still shows 100/2000.

In this patch I override the handle function in UIHandlerCollection. I just added lines 137-151 to make sure that when the "App ID" link is clicked, the HistoryPage refreshes the SparkUI if the application is incomplete.

The remaining problem:
Although this makes the SparkUI refreshable, the refresh only takes effect when the "App ID" link is clicked. Once inside the SparkUI, refreshing the page changes nothing.

@srowen
Member

srowen commented Jun 1, 2015

Can you explain this more? What does this accomplish that isn't accomplished by refreshing the page? Unless the page is refreshed, I would not expect an update.

This code also has a lot of style problems, like indentation.

@WangTaoTheTonic
Contributor

Hey Sean, what we want to fix is the "load once, show it forever" issue. That is to say, when a user clicks a link on the history page, the provider loads that application's information from the persistence layer into memory and never updates it.

So if the click happens while the application is incomplete, then no matter what operations follow, the history page will show the incomplete information forever. Obviously that is a problem.

We don't have much experience with Jetty, so this may not be a perfect solution yet. Any feedback is welcome.

@squito
Contributor

squito commented Jun 1, 2015

Can I suggest that you start by creating a failing test case? You can try to follow the examples in UISeleniumSuite. That might help us see what is going wrong.

@XuTingjun XuTingjun changed the title [SPARK-7889] make sure click the "App ID" on HistoryPage, the SparkUI will be refreshed. [SPARK-7889] [UI] make sure click the "App ID" on HistoryPage, the SparkUI will be refreshed. Jun 2, 2015
@XuTingjun
Contributor Author

@tsudukim Hey, I think this bug was introduced by #3467; can you have a look?

@tsudukim
Contributor

tsudukim commented Jun 2, 2015

I wonder if having a custom HandlerCollection is a good idea.
The cause of the problem is the cache mechanism in HistoryServer. The current HistoryServer caches all UI objects in appCache regardless of whether they are complete or incomplete. I think it's reasonable to fix the cache layer so that it does not retain incomplete applications.
Setting spark.history.retainedApplications = 0 (in order not to use the cache) can be a workaround for this problem.
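For reference, a sketch of how that workaround would be configured (assuming the setting accepts 0, which is disputed below) in conf/spark-defaults.conf before starting the history server:

```
# conf/spark-defaults.conf
# Workaround sketch: retain no application UIs in the cache,
# forcing a reload from the event logs on every view.
spark.history.retainedApplications  0
```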

@XuTingjun
Contributor Author

@tsudukim Sorry, I don't agree with you; I don't think "spark.history.retainedApplications" can be set to 0.

@squito
Contributor

squito commented Jun 2, 2015

It makes sense that the issue might be from caching applications. I think @tsudukim wasn't saying that "spark.history.retainedApplications = 0" is a permanent solution, just a temporary workaround if you need this in a currently deployed system.

It seems like this one commit was trying to solve the problem by preventing caching of incomplete apps, right? Those changes look reasonable -- do they solve the problem on their own? There is a lot of other stuff in this PR that seems pretty irrelevant, so it's hard to make sense of it. Again, a test case would help out a lot.

@XuTingjun
Contributor Author

I don't think the cause is just the appCache. After debugging the code, I found two main reasons:
1. The first time a "/history/appid" request is sent, the "/history/*" handler handles it. While doing so, that handler adds the SparkUI's handlers into the history server, including a handler for "/history/appid". So when a second "/history/appid" request is sent, the "/history/appid" handler handles it instead of "/history/*", and the second UI is the same as the first one.
2. In the "/history/*" handler, the code uses appCache.get() to cache the app UI. I think we should use appCache.refresh() instead, so that it can refresh.
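The get-vs-refresh distinction in point 2 can be sketched with a toy stand-in for the Guava LoadingCache that backs appCache (the class and loader here are hypothetical, for illustration only): get() returns whatever was loaded the first time, while refresh() re-invokes the loader.

```scala
import scala.collection.mutable

// Toy stand-in for a loading cache, illustrating why appCache.get()
// keeps serving a stale UI while refresh() would pick up new state.
class ToyLoadingCache[K, V](load: K => V) {
  private val store = mutable.Map.empty[K, V]

  // get: load on first access, then always return the cached value
  def get(key: K): V = store.getOrElseUpdate(key, load(key))

  // refresh: re-run the loader and replace the cached value
  def refresh(key: K): V = {
    val fresh = load(key)
    store(key) = fresh
    fresh
  }
}
```

Under this model, the first click on an App ID caches the incomplete UI, and every later get() returns that same object even after the app finishes; only a refresh() (or an eviction) picks up the final state.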

@squito
Contributor

squito commented Jun 8, 2015

Hi @XuTingjun, the updated description makes sense. I also took a really quick look at the patch and it seems reasonable. However, I'm going to defer reviewing this closely until I see a new test that reproduces this. I am happy to help write that test if you need it, but I'd prefer if you do it :)

in any case I'll have jenkins start testing this as is ...

@squito
Contributor

squito commented Jun 8, 2015

jenkins, ok to test

@XuTingjun
Contributor Author

Hi @squito, I think I need your help; I don't really know how to write this test.

@squito
Contributor

squito commented Jun 9, 2015

OK, no problem @XuTingjun, happy to help, though unfortunately I think I am unlikely to get to this until after Spark Summit. Sorry for the delay; feel free to bug me again if you don't hear from me by the end of next week.

and anyone else is free to jump in the meantime as well :)

@squito
Contributor

squito commented Jun 9, 2015

Jenkins, this is ok to test

@steveloughran
Contributor

  1. I can do the test as part of SPARK-1537; indeed, I already have it replicated, I've just turned that test off. That code already has everything needed (a listener for Spark events, event publishing, fetching of events, spinning up the history server, and GETs of the relevant pages).
  2. I'd add the test to IncompleteSparkUISuite, which already walks an app through its lifecycle and verifies that the `GET /?incomplete=true` listing changes state. All it needs is an extra `GET /history/appId` to verify that it has changed state too.
  3. My proposed patch in SPARK-8275 was a lot simpler. All it did was, on a `GET /history/appId`, look in the cache, and if the app was incomplete, check that the entry hadn't been in the cache longer than an expiry time. This ensures that repeated GETs within a short period of time don't incur a performance hit, especially with multiple viewers, but the caching isn't eternal. After all, the history server already discards old entries once the cache gets full.

Accordingly, why don't I implement my simple discard algorithm and the test? If that isn't enough, we can look at this code.
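The expiry idea in point 3 above can be sketched roughly as follows (all names here are hypothetical; the real SPARK-8275 patch works against the history server's Guava-backed cache): a cached entry for a completed app is reused indefinitely, while an incomplete entry is reused only until it is older than a refresh interval, after which the loader runs again.

```scala
import scala.collection.mutable

// Sketch of an expiring cache for incomplete application UIs.
// `load` returns the value plus a completed flag; `now` is injectable
// so tests can control the clock.
class ExpiringAppCache[K, V](
    load: K => (V, Boolean),
    expiryMs: Long,
    now: () => Long = () => System.currentTimeMillis()) {

  private case class Entry(value: V, completed: Boolean, loadedAt: Long)
  private val store = mutable.Map.empty[K, Entry]

  def get(key: K): V = store.get(key) match {
    // completed apps never change; fresh incomplete entries are reused
    case Some(e) if e.completed || now() - e.loadedAt < expiryMs => e.value
    // missing, or incomplete and older than the expiry interval: reload
    case _ =>
      val (v, completed) = load(key)
      store(key) = Entry(v, completed, now())
      v
  }
}
```

Picking a good default for `expiryMs` is the open question discussed in the following comments: too short and repeated GETs hammer the event logs, too long and the page stays stale.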

@XuTingjun
Contributor Author

@squito, could you take a look at this? Thanks.

@squito
Contributor

squito commented Jun 23, 2015

Sorry for the delay, I'm just looking at this again. @steveloughran I want to make sure I understand what you are proposing -- that fix would be totally independent of the ATS work for SPARK-1537, right? I see the relationship, but I think they are best tackled in separate PRs (I think that is what you are describing; just want to double-check).

Just thinking aloud here -- my only concern with adding an expiry time is that you never know what to set it to, and it might be a little confusing to users. I was thinking we could add another query param ?refresh=true to avoid the performance hit all the time, but I suppose that will just lead to users always setting that param, so it probably doesn't help. So the cache expiry time seems like the way to go.

@XuTingjun how about you help take a look at what @steveloughran comes up with and see if it addresses your concerns?

@steveloughran
Contributor

Yes, independent. SPARK-1537 implements the HistoryProvider for YARN, gets its data from ATS, etc. The JIRA we are looking at here, where my patch is pull/6935, says "we ask the provider for the app UI again if it is (a) incomplete and (b) being asked for after some defined interval". I know deciding what's a good interval is hard, as there's a tradeoff between responsiveness and system load; I'll leave it to others to suggest a good default.

@steveloughran
Contributor

...I'm looking at this and my PR; I think they are interdependent in different ways.

  1. This patch adds a new method to ApplicationHistoryProvider so the history provider can report its status for an application.
  2. Plus an implementation in FsHistoryProvider that chooses the flag based on its knowledge of the loaded path and whether the file has the `.inprogress` extension.
  3. The web UI is registered in SparkUI such that hitting paths in the UI can trigger a reload.

There are some extra map data structures added in HistoryServer and FsHistoryProvider to aid this.

In comparison, my pull request #6935 does all its work in a pulled-out ApplicationCache, which adds a completed flag and a timestamp; the goal is that a refresh is triggered if !completed and now > timestamp + interval.

But that PR doesn't join up the web UI, or let the providers declare that an app has completed except by a complete reload of its UI.

I think the two can be merged; I'd have mine able to query (indirectly) the history provider to see whether the app has completed, or even better, whether it has changed since the last reload. That would eliminate wasteful reloads. How to achieve this? Either the cache calls through to the history provider, or the history provider itself provides a callback for the cache entries to invoke.
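One way that merged design could look, as a rough sketch with entirely hypothetical names: before reusing a cached incomplete entry, the cache asks the provider whether the app has been updated since the entry was loaded, and reloads only then, avoiding both eternal caching and wasteful reloads.

```scala
import scala.collection.mutable

// Hypothetical provider interface: exposes a last-updated timestamp
// (e.g. the event log file's mtime) and a completed flag per application.
trait HistoryProvider[K] {
  def lastUpdated(key: K): Long
  def isCompleted(key: K): Boolean
}

// Cache that reloads an incomplete app's UI only when the provider
// reports newer data than the cached entry was built from.
class ProviderAwareCache[K, V](provider: HistoryProvider[K], load: K => V) {
  private case class Entry(value: V, asOf: Long, completed: Boolean)
  private val store = mutable.Map.empty[K, Entry]

  def get(key: K): V = {
    val stale = store.get(key).exists { e =>
      !e.completed && provider.lastUpdated(key) > e.asOf
    }
    store.get(key) match {
      case Some(e) if !stale => e.value
      case _ => // missing or stale: reload and remember what we loaded from
        val v = load(key)
        store(key) = Entry(v, provider.lastUpdated(key), provider.isCompleted(key))
        v
    }
  }
}
```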

There's one more option to consider here. Why do we actually replay-then-pause in-progress applications? Why not have a thread continually loading them and forwarding events until they complete? In that mode there's no need to reload the cache or replay events; the UI just stays up to date.

I'll cover that on my PR.

@rxin
Contributor

rxin commented Dec 31, 2015

I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!

@asfgit asfgit closed this in 7b4452b Dec 31, 2015