
Splash consistency with Spark's RDD guarantees #7

Open
fvlankvelt opened this issue Dec 16, 2016 · 0 comments

Reviewing Splash's code, I notice quite a number of places where a workset is modified inside an RDD#foreach or RDD#map operation. This of course works fine as long as every change is kept in memory and Spark retains all of that memory.
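
For reference, a minimal sketch of the pattern I mean, using a toy `WorkSet` class (the class, field, and app names are hypothetical, not Splash's actual API):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for a Splash workset record; the real
// structures differ, this just shows the mutation pattern.
case class WorkSet(var weight: Double)

val sc = new SparkContext(
  new SparkConf().setAppName("workset-mutation").setMaster("local[2]"))

// The pattern in question: cache an RDD, then mutate its records in
// place from inside foreach instead of deriving a new RDD from it.
val workSets = sc.parallelize(Seq(WorkSet(0.0), WorkSet(0.0))).cache()
workSets.foreach(ws => ws.weight += 1.0)
```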

However, AFAICS this is a poor match for Spark's fault-tolerance guarantees. E.g. Spark assumes that a #foreach operation performs no mutations, so it is free to discard an in-memory copy of data that is also available on disk, regardless of whether a foreach loop has iterated over it. When records in the RDD have been changed "behind Spark's back", results will differ depending on whether there was a GC or, e.g., a node crashed.
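
Continuing the hypothetical sketch above, this is roughly the failure mode I have in mind. When a cached partition is evicted or lost, Spark recomputes it from lineage, i.e. from the original immutable inputs, so the in-place mutations silently disappear; `unpersist()` stands in here for that eviction or node loss:

```scala
val base = sc.parallelize(1 to 4).map(i => WorkSet(i.toDouble)).cache()
base.foreach(ws => ws.weight *= 2.0)  // mutates the cached objects in place
base.map(_.weight).collect()          // typically Array(2.0, 4.0, 6.0, 8.0) while cached
base.unpersist(blocking = true)       // stands in for GC eviction or a node crash
base.map(_.weight).collect()          // Array(1.0, 2.0, 3.0, 4.0): mutations gone
```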

Now, perhaps there's a good reason why this is not an issue for the approach Splash takes. I would certainly be curious to know under which conditions it is possible to perform in-memory mutations without telling Spark - and still get the same fault-tolerance guarantees.
