
Splash consistency with Spark's RDD guarantees #7

Open
fvlankvelt opened this issue Dec 16, 2016 · 0 comments

Reviewing Splash's code, I notice quite a number of places where a workset is modified inside an RDD#foreach or RDD#map operation. This of course works fine as long as every change is kept in memory and Spark retains all of that memory.
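
For reference, a minimal sketch of the pattern I mean, using a toy `WorkSet` class (the class, field, and app names are hypothetical, not Splash's actual API):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for a Splash workset record; the real
// structures differ, this just shows the mutation pattern.
case class WorkSet(var weight: Double)

val sc = new SparkContext(
  new SparkConf().setAppName("workset-mutation").setMaster("local[2]"))

// The pattern in question: cache an RDD, then mutate its records in
// place from inside foreach instead of deriving a new RDD from it.
val workSets = sc.parallelize(Seq(WorkSet(0.0), WorkSet(0.0))).cache()
workSets.foreach(ws => ws.weight += 1.0)
```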

However, AFAICS this is a poor match for Spark's fault-tolerance guarantees. E.g. Spark assumes that a #foreach operation performs no mutations, so it is free to discard an in-memory copy of data that is also available on disk, regardless of whether a foreach loop has iterated over it. When records in the RDD have been changed "behind Spark's back", results will differ depending on whether there was a GC or, e.g., a node crashed.
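
Continuing the hypothetical sketch above, this is roughly the failure mode I have in mind. When a cached partition is evicted or lost, Spark recomputes it from lineage, i.e. from the original immutable inputs, so the in-place mutations silently disappear; `unpersist()` stands in here for that eviction or node loss:

```scala
val base = sc.parallelize(1 to 4).map(i => WorkSet(i.toDouble)).cache()
base.foreach(ws => ws.weight *= 2.0)  // mutates the cached objects in place
base.map(_.weight).collect()          // typically Array(2.0, 4.0, 6.0, 8.0) while cached
base.unpersist(blocking = true)       // stands in for GC eviction or a node crash
base.map(_.weight).collect()          // Array(1.0, 2.0, 3.0, 4.0): mutations gone
```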

Now, perhaps there's a good reason why this is not an issue for the approach Splash takes. I would certainly be curious to know under which conditions it is possible to perform in-memory mutations without telling Spark - and still get the same fault-tolerance guarantees.
