
Conversation

@cloud-fan
Contributor

It's inefficient to drop memory blocks to disk inside a synchronized block, as IO is slow. As the TODO says, we only need to synchronize the selection of the blocks to be dropped. So my implementation is: in ensureFreeSpace, we iterate over entries and select blocks to be dropped. But instead of dropping blocks inside ensureFreeSpace, we just mark the selected entries as dropping and return those blocks, letting the caller do the dropping. When other threads call ensureFreeSpace again, they skip entries marked as dropping while iterating. The caller, tryToPut, then does the dropping before putting the new block into entries. In this way, we can do the dropping in parallel.
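The select-then-drop flow described above can be sketched roughly as follows. This is a minimal sketch, not the actual MemoryStore code; `Entry`, `DropSelector`, and `selectBlocksToDrop` are simplified stand-ins for the names discussed in this thread:

```scala
import scala.collection.mutable

// Simplified stand-in for MemoryStore's entry record, with the proposed flag.
case class Entry(size: Long, var dropping: Boolean = false)

class DropSelector {
  val entries = mutable.LinkedHashMap[String, Entry]()

  // Select victims while holding the entries lock, but only *mark* them;
  // the slow disk IO is left to the caller, outside any lock.
  def selectBlocksToDrop(space: Long): Seq[String] = entries.synchronized {
    var freed = 0L
    val selected = mutable.Buffer[String]()
    for ((id, e) <- entries if !e.dropping && freed < space) {
      e.dropping = true // later selections skip this entry
      freed += e.size
      selected += id
    }
    selected.toList
  }
}
```

Because marking happens in the same critical section as selection here, a second caller only sees entries whose dropping flag is still false and therefore picks different victims.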

@AmplabJenkins

Can one of the admins verify this patch?

@mridulm
Contributor

mridulm commented May 16, 2014

IMO this makes things fragile.
First off, it is not MT safe.
Secondly, it does not handle corner cases - for example, exception handling.

@cloud-fan
Contributor Author

This is thread safe. tryToPut calls ensureFreeSpace in a synchronized block, so only one thread can run ensureFreeSpace at a time, which means each thread will select different to-be-dropped blocks. @mridulm could you point out where a multi-threading problem I may have missed could arise?
About exception handling: it may happen that some entries are marked as dropping but no thread is dropping them. I will work on it.

@mridulm
Contributor

mridulm commented May 17, 2014

Use of dropping is not my safe

@cloud-fan
Contributor Author

As far as I know, the reasons for task failure may be: an exception during task execution, an executor being lost and relaunched, or the stage being cancelled by the user. But I'm not sure I've listed all the reasons. And I don't know the details of how Spark relaunches executors and cancels stages, or how to handle these while dropping memory blocks. Is a try-catch enough for it? I want to reset the dropping flag if the task is terminated.

@mridulm
Contributor

mridulm commented May 19, 2014

It should read MT safe - phone "autocorrected" it, sigh.

There could be any number of reasons for dropping a block to fail (including disk issues, etc.).
When it does fail, we should not be left in an inconsistent state.

@tdas
Contributor

tdas commented May 19, 2014

Can you please create a JIRA for this and update the title of the PR?

@cloud-fan cloud-fan changed the title improve performance of MemoryStore#tryToPut by elimating unnecessary lock [SPARK-1888] enhance MEMORY_AND_DISK mode by dropping blocks in parallel May 20, 2014
@cloud-fan
Contributor Author

@mridulm @tdas I have created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-1888

@cloud-fan
Contributor Author

@mridulm Sorry, I may have misunderstood you because of my poor English :(
Let me list things one by one so that we can make it clear:

  1. Currently, Spark's MEMORY_AND_DISK mode is sometimes slower than DISK_ONLY mode because of the lock held during IO (dropping blocks).
  2. As the TODO says, the solution is to synchronize only the selection of to-be-dropped blocks and do the dropping in parallel.
  3. My solution is fragile, but it works if nothing goes wrong.
  4. My solution is not MT safe. For example, if a block is being dropped by one thread while another thread is trying to remove it, oops.
  5. There could be any number of reasons for dropping a block to fail, but not many kinds of them. As far as I know: an exception (including disk issues, etc.), executor loss, and stage cancellation.

I would appreciate it if you could discuss these with me one by one as listed above. Thanks!

@cloud-fan
Contributor Author

As we know, the memory store is used to add, read, and remove blocks. Reading and removing are quite simple, so let's focus on adding.
Adding may trigger a dropping action. As I said before, the dropping flag makes each thread select different to-be-dropped blocks, so it's safe to do the dropping in parallel.
When dropping and reading happen at the same time, the entry is still there until the drop finishes, so it's safe.
When dropping and removing happen at the same time: if the drop finishes first, the remove will fail, which is OK. If the remove finishes first, the dropping thread will try to remove the entry after writing the block to the disk store; this remove will fail because the entry has already been removed, and the dropping thread will then remove the block from the disk store to cancel the drop. You can find this logic in BlockManager#dropFromMemory.
So I think my solution is already MT safe.
As for task termination, one cause is an exception in the task itself (like a disk error), another is being killed by the executor. I now put the dropping code in a try-catch; if any exception is caught, I reset the dropping flag of the to-be-dropped blocks and rethrow the exception. I think this should handle the corner cases.

@mridulm
Contributor

mridulm commented May 20, 2014

It is not MT safe because the PR is checking/modifying shared state (like the dropping variable) in an unsafe manner.
I will comment in detail on the patch later today, since I don't seem to be conveying what I mean properly :-)

Contributor

You are modifying entry.dropping here - there is no guarantee this change will be visible to other threads anytime soon.

@cloud-fan
Contributor Author

@mridulm Thanks very much for your comments! I think a big difference is: the earlier code called BlockManager#dropFromMemory within putLock, but now we call it in parallel, so we have to check it carefully.
About the try-catch: I agree using finally would be better; I will work on it.
About entry.getSize: I found I needed pair.getValue twice, so I extracted it into val entry = pair.getValue. Is that right?
By the way, what do you think of declaring the dropping flag as volatile?

@cloud-fan
Contributor Author

ensureFreeSpace has 2 jobs: 1) iterate over entries and select blocks to be dropped; 2) if the to-be-dropped blocks can free enough space, mark them as dropping and return them to the caller.
ensureFreeSpace is called within putLock, so each thread will see the dropping-flag modifications (I will discuss flag resetting under exception handling later) and thus get different to-be-dropped blocks. Block reading doesn't touch the dropping flag, so there is no conflict there. Let's consider block removing and exception handling (resetting the dropping flag).
Job 1 of ensureFreeSpace (selecting) and removing are both synchronized on entries, so they must proceed in turn.
If a block is removed first, everything is OK.
If a block is removed after Job 2 of ensureFreeSpace (marking), which is also synchronized on entries (in my modification), the block will have been dropped to disk and managed by diskStore, which I think is OK.
If a block is removed between selecting and marking, the marking checks whether the entry is null, so it's OK too.
As for exception handling: flag resetting is also synchronized on entries, so it cannot run during selecting or marking.
If resetting happens before selecting, the selection will be able to select these blocks and re-drop them.
If resetting happens after selecting, the selected to-be-dropped blocks won't include the reset blocks, so there is no conflict.
In total there are 3 places that read or write the dropping flag (selecting, marking, and resetting), and they are all synchronized on entries, so I don't think we need to declare the flag volatile.
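The flag-reset-on-failure idea described here could look roughly like this. The helper is hypothetical; `entries`, `Entry`, and the `dropping` flag follow the sketch in this discussion, not the actual patch:

```scala
import scala.collection.mutable

case class Entry(size: Long, var dropping: Boolean = false)

// On any failure while dropping, clear the flags under the entries lock so a
// later selection can pick these blocks up again, then rethrow.
def dropAll(entries: mutable.LinkedHashMap[String, Entry],
            ids: Seq[String], drop: String => Unit): Unit = {
  try ids.foreach(drop)
  catch {
    case e: Throwable =>
      entries.synchronized {
        ids.foreach(id => entries.get(id).foreach(_.dropping = false))
      }
      throw e
  }
}
```

Since the reset itself runs inside entries.synchronized, it serializes with selecting and marking exactly as the comment above argues.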

@mridulm
Contributor

mridulm commented May 21, 2014

  • With the latest commit, the issue with the dropping flag is gone - which is great.
  • There is a change of behavior w.r.t. the earlier code.
    Whether the earlier code was the way it was intentionally or accidentally, I am not sure - I will let @mateiz or others comment.

Essentially there are a few things here:

a) What happens if an existing block is re-added? It looks like this was probably handled earlier too?
I went up the call tree a bit, and it did not look like this was prevented - but maybe I missed it. Any comments, @mateiz?

b) What happens if the same block is added in parallel by two threads?
If this was a supported use case, then the current PR breaks it - it is possible for the first thread to add it and the second to evict it from memory, in case it was not possible to host both copies in memory (according to the free space computed).

@cloud-fan
Contributor Author

@mridulm I checked the code of BlockManager#doPut.

val putBlockInfo = {
  val tinfo = new BlockInfo(level, tellMaster)
  // Do atomically !
  val oldBlockOpt = blockInfo.putIfAbsent(blockId, tinfo)

  if (oldBlockOpt.isDefined) {
    if (oldBlockOpt.get.waitForReady()) {
      logWarning("Block " + blockId + " already exists on this machine; not re-adding it")
      return updatedBlocks
    }

    // TODO: So the block info exists - but previous attempt to load it (?) failed.
    // What do we do now ? Retry on it ?
    oldBlockOpt.get
  } else {
    tinfo
  }
}

BlockManager creates a BlockInfo for the block to be added and calls val oldBlockOpt = blockInfo.putIfAbsent(blockId, tinfo), so if multiple threads are adding the same block, one thread will put the BlockInfo successfully and the others will fail and stop the put.
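The dedup behavior relies on ConcurrentHashMap#putIfAbsent returning null only for the thread that actually inserted the value. A small standalone illustration (the block ID and values are made up):

```scala
import java.util.concurrent.ConcurrentHashMap

val blockInfo = new ConcurrentHashMap[String, String]()

// First adder wins: putIfAbsent returns null only on a successful insert.
val first = blockInfo.putIfAbsent("rdd_0_0", "info from thread 1")
// A concurrent second adder gets the existing value back and must bail out,
// which is exactly the "already exists ... not re-adding it" path in doPut.
val second = blockInfo.putIfAbsent("rdd_0_0", "info from thread 2")
```

Both calls are atomic on the map, so even with real concurrency exactly one caller proceeds with the put.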

@tdas
Contributor

tdas commented May 21, 2014

This seems really promising!! However, can you explain whether the following sequence of events is possible or not in ensureFreeSpace and tryToPut?

Both thread 1 and thread 2 want to insert blocks of 100 bytes. The existing blocks are block A and block B of 100 bytes each, and the total capacity is 200 bytes. Next,

  • Thread 1 selects block A (not marked yet) and exits the entries.synchronized { // select }
  • Thread 2 selects block A as well (as it is not marked yet) and exits entries.synchronized { // select }
  • Thread 1 enters entries.synchronized { // mark } and marks block A to be dropped
  • Thread 2 also enters entries.synchronized { // mark } and marks block A to be dropped again (this seems possible, since there is no double check of whether each block has already been marked)
  • Thread 1 then drops block A to disk
  • Thread 2 tries to drop block A to disk as well, but since it is already dropped, no further action is taken.
  • Both threads think that 100 bytes have been cleared. Hence 2 × 100 bytes are inserted after dropping only 100 bytes.

Is this sequence possible?
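For what it's worth, this interleaving can be replayed deterministically under two hypothetical assumptions: select and mark run in separate critical sections with no outer lock, and mark does not re-check the dropping flag. The names below are stand-ins for illustration, not the patch's code:

```scala
import scala.collection.mutable

case class Blk(size: Long, var dropping: Boolean = false)
val entries = mutable.LinkedHashMap("A" -> Blk(100), "B" -> Blk(100))

// Critical section 1: select a victim but do not mark it.
def select(): Seq[String] = entries.synchronized {
  entries.collect { case (id, b) if !b.dropping => id }.take(1).toList
}
// Critical section 2: mark, with no re-check of the dropping flag.
def mark(ids: Seq[String]): Long = entries.synchronized {
  ids.foreach(id => entries(id).dropping = true)
  ids.map(id => entries(id).size).sum
}

val s1 = select() // thread 1 picks A
val s2 = select() // thread 2 also picks A: nothing was marked yet
val freed = mark(s1) + mark(s2) // the same 100 bytes get counted twice
```

An outer lock around both sections (as tryToPut's putLock provides in the patch) removes the window between the two selects.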

@mridulm
Contributor

mridulm commented May 21, 2014

@cloud-fan there are multiple calls into memoryStore that directly put a block - not just from external addition.
So looking only at doPut might not help?

@mridulm
Contributor

mridulm commented May 21, 2014

@tdas there is a dropping flag which prevents this.
Or did I misunderstand your query ?

@tdas
Contributor

tdas commented May 21, 2014

@mridulm i may be missing something as well. Are you referring to the new dropping flag inside the case class Entry?

@mridulm
Contributor

mridulm commented May 21, 2014

@tdas yes - thread 1 should set A's dropping to true; so thread 2 should not select it

@tdas
Contributor

tdas commented May 21, 2014

@mridulm Is that so? Since selection and marking occur in different entries.synchronized blocks, they are not "atomic together". So two threads can select the same block before either marks it.

@cloud-fan
Contributor Author

@tdas you missed an important thing: tryToPut calls ensureFreeSpace within putLock, so one thread has to wait for another thread to finish both selection and marking, which means selection and marking are "atomic together" for the block-adding path.

@mridulm
Contributor

mridulm commented May 21, 2014

@tdas as @cloud-fan stated, the code relies on the implementation detail that the private method is always called within the context of the tryToPut lock - and not called by anyone else. I don't like having locking state spread out like this, but then this is how it already was, I guess...
Maybe we should at least annotate the method with a comment? And possibly assert that it is called within the tryToPut lock?

@cloud-fan
Contributor Author

@tdas @mridulm what about moving the putLock.synchronized into ensureFreeSpace and letting tryToPut call ensureFreeSpace directly? I think it will be clearer this way.

@mridulm
Contributor

mridulm commented May 21, 2014

@cloud-fan makes more sense.
Also, please rename it to something more appropriate (since it is no longer trying to put within that block!)

@tdas, can you also comment about the usecases/flows I mentioned above ?

@cloud-fan
Contributor Author

@tdas I think we shouldn't synchronize on this. When one thread is running ensureFreeSpace, others should not enter ensureFreeSpace, but they should still be able to add and remove blocks. So using a putLock is better.
About the test: I haven't yet, but I'm going to. Can spark-perf do this?

@cloud-fan
Contributor Author

@mridulm I checked all callers of MemoryStore#putValues and putBytes via the IDE; it shows that only BlockManager calls them, and with the block info synchronized. So maybe we don't need to worry about putting the same block in parallel?

@cloud-fan
Contributor Author

@mridulm @tdas I have moved putLock.synchronized into ensureFreeSpace and renamed the method to getToBeDroppedBlocks. I also updated the scaladoc to explain the selection, marking, then dropping process. Please take a look to see if I missed something.
Can one of the admins ask @AmplabJenkins to run the unit tests? I want to make sure my PR doesn't break basic functionality...

Contributor

Instead of 'get', can you rename it to 'find' or some such?

@cloud-fan
Contributor Author

did a manual merge :)

@SparkQA

SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

@andrewor14
Contributor

@cloud-fan This is now outdated. There have been relatively significant changes that went into MemoryStore recently. Do you mind updating this to master?

@andrewor14
Contributor

Actually, before you do that, have you looked at #2134, which seems to be doing something really similar on the new code?

@andrewor14
Contributor

Let me raise the same question here that I raised in #2134. If my understanding is correct, by the time ensureFreeSpace returns we aren't guaranteed to have actually freed the requested space, but we return "success" to the caller anyway. Doesn't this cause a potential race condition where we put a new block faster than the old block is dropped? This is likely if we put the new block into memory but the old block onto disk. I haven't followed the details of the conversation above, but the consequence of this condition could be an OOM if we're already at the edge of total memory.

@cloud-fan
Contributor Author

First, ensureFreeSpace (I renamed it to findToBeDroppedBlocks) doesn't always return true. If it can't find enough to-be-dropped blocks to "free space", it returns false.
If ensureFreeSpace returns true, you can regard it as returning a Future that promises to free a certain amount of memory once you complete it. Until you finish that Future, freeMemory is not updated (meaning the memory has not actually been freed yet).
So if we put new blocks faster than the old blocks are dropped, ensureFreeSpace will return false (because there is not enough memory), no to-be-dropped blocks are selected, and that thread will put the block into the disk store instead.

@cloud-fan
Contributor Author

It seems a big change has been made to the memory store; I will digest it and update my PR.

Contributor

@cloud-fan @andrewor14
Hi cloud-fan, I think there will be a problem when you don't update currentMemory. Assume there are two threads: the first gets the selectLock, finishes running, and releases the lock; at this point currentMemory has not yet been updated. Then the second thread gets the selectLock, and the value of currentMemory it sees is the same as the first thread saw. So freeMemory = maxMemory - currentMemory is counted twice by the two threads, which means the selectedMemory for the second thread is smaller than it actually requires.
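Hypothetical numbers make the double count concrete. The names maxMemory, currentMemory, and freeMemory follow the comment above; the values are made up for illustration:

```scala
val maxMemory = 200L
var currentMemory = 150L // stale: thread 1's put has not updated it yet

val freeSeenByThread1 = maxMemory - currentMemory // 50 bytes
// Thread 1 releases selectLock before bumping currentMemory...
val freeSeenByThread2 = maxMemory - currentMemory // still 50: counted twice

// Together the two threads believe 100 bytes are free when only 50 are,
// so thread 2 selects fewer victims than it actually needs.
```

Updating currentMemory (or an equivalent reservation counter) inside the same critical section that computes freeMemory would close this window.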

@cloud-fan
Contributor Author

Hi @liyezhang556520 , thanks for pointing this out! I have updated my PR, please review @andrewor14

Contributor

Hi @cloud-fan, you removed accountingLock.synchronized here, so more than one thread can call planFreeSpace here to reserve memory. And each thread will ask for memory of size maxUnrollMemory - currentUnrollMemory. I think this logic differs from the original intention.

Second question: what if maxUnrollMemory is large (maxMemory*unrollFraction might be dozens of GB) while the requested memory amountToRequest is small (maybe dozens of MB)? Then you only use one thread to free that size, spaceToEnsure, which doesn't seem to solve the IO issue.

Third, since you drop the to-be-dropped blocks lazily, how can you avoid the OOM that @andrewor14 pointed out (when the putting speed is faster than the dropping)?

Do these three problems exist in the current patch? Maybe I missed something.

@cloud-fan
Contributor Author

@liyezhang556520 Thanks for your comments. 1) Yes, the logic differed from the original intention; I have updated my PR to fix this. 2) The original logic for calculating spaceToEnsure assumed a single thread; I have updated that logic now. 3) I don't drop blocks lazily: unrollSafely calls DroppingTask#runTask and waits for it to finish.

@pwendell
Contributor

This has mostly gone stale, so I'd suggest we close this issue and revisit it later. This is a decent idea, but it complicates things a good amount, and this particular piece of code is IMO already quite complicated. As with any performance change, it would be useful to quantify the performance problems observed as a result of this issue. For instance, has it been observed as a bottleneck in real clusters? Putting information of this type on the JIRA would be useful.

@liyezhang556520
Contributor

@pwendell, I posted a design doc for SPARK-3000 several days ago that is also mainly aimed at resolving this issue; there might be performance problems in some cases. You can have a look at it.

@cloud-fan cloud-fan closed this Nov 10, 2014
agirish pushed a commit to HPEEzmeral/apache-spark that referenced this pull request May 5, 2022
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024
mapr-devops pushed a commit to mapr/spark that referenced this pull request May 8, 2025
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025