Reasons to use EEL for Data Ingestion at REST over Spark #265

hannesmiller · 2017-03-07T08:40:52Z

For documentation???

Spark is great at parallel processing of data that already sits in a distributed store such as HDFS, but it is not really designed for ingesting data at rest from a non-distributed store such as a local file system, although it does support this through local mode.

The disadvantages of using Spark to ingest data at rest from a local file system:

There's no advantage in using YARN with a local file system because it's not a distributed store. You would need to distribute the data beforehand, which defeats the purpose.
It's tempting to use local mode with all the file formats that Spark supports out of the box, but you still face the same issue. Moreover, in local mode all partitioned data is spread across N threads in the same process as the client, which can become a memory bottleneck.
Memory is also a bottleneck if you decide to collect the data in Spark: collect pulls all the data into memory before you can process it. It is not stream based like a Java InputStream or a JDBC ResultSet iterator.
Spark does support JDBC datasets, but you still need to provide a partitioning strategy so that Spark can split your query into one select statement per partition. That can give you more throughput, but if you are saving to HDFS you can end up with lots of small files, which is not good for Hadoop. With EEL you have more control because you can specify N ioThreads for your sources and sinks, i.e. use more threads to read in parallel from your Source and fewer on your Sink, resulting in sensible file sizes if you are writing to HDFS.
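
To make the JDBC partitioning point concrete, here is a rough sketch of how a numeric column range can be split into one select statement per partition. It is a simplified illustration only, not Spark's actual implementation. The object name, the `predicates` helper, and the `orders` / `order_id` names are all hypothetical.

```scala
// Simplified illustration of range partitioning for a JDBC read.
// NOT Spark's real code; all names here are hypothetical.
object JdbcPartitionSketch {

  /** Splits the range [lower, upper) on `column` into `n` WHERE clauses, one per partition. */
  def predicates(column: String, lower: Long, upper: Long, n: Int): Seq[String] = {
    require(n > 0, "need at least one partition")
    require(upper > lower, "upper bound must exceed lower bound")
    // Width of each partition's slice of the column range (at least 1).
    val stride = math.max(1L, (upper - lower) / n)
    (0 until n).map { i =>
      val from = lower + i * stride
      val to   = from + stride
      if (n == 1) "1 = 1"                                          // single partition reads everything
      else if (i == 0) s"$column < $to OR $column IS NULL"         // first slice is open-ended below
      else if (i == n - 1) s"$column >= $from"                     // last slice is open-ended above
      else s"$column >= $from AND $column < $to"
    }
  }

  def main(args: Array[String]): Unit =
    predicates("order_id", 0L, 1000L, 4)
      .foreach(p => println(s"SELECT * FROM orders WHERE $p"))
}
```

Each predicate becomes its own query and, when the result is written straight out, typically its own output file. That is why a high partition count on the read side can translate into many small files on HDFS unless the data is repartitioned before the write.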

sksamuel · 2017-07-14T22:24:29Z

I think the collect issue is the same in eel anyway - if you collect into memory, it doesn't matter whether the input is stream based or not.
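
A small plain-Scala sketch of that point, using neither Spark nor EEL: the same lazy source is cheap when consumed row by row and memory-bound as soon as it is materialised. The `rows` source and both helper names are hypothetical stand-ins.

```scala
// Plain Scala illustration; no Spark or EEL APIs are used.
object StreamVsCollect {

  // Hypothetical row source: produces rows lazily, one at a time.
  def rows(n: Int): Iterator[String] = Iterator.tabulate(n)(i => s"row-$i")

  // Streaming: only the current row and the running total are held at once.
  def streamedLength(n: Int): Long =
    rows(n).foldLeft(0L)((acc, row) => acc + row.length)

  // Collecting: every row is materialised in memory before any processing starts,
  // even though the source itself is lazy.
  def collectedLength(n: Int): Long = {
    val all = rows(n).toVector
    all.foldLeft(0L)((acc, row) => acc + row.length)
  }

  def main(args: Array[String]): Unit =
    println(streamedLength(1000) == collectedLength(1000))
}
```

Both functions return the same answer. The difference is only in peak memory, which grows with `n` in the collected version regardless of how the input was read.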

Local mode with Spark does work, so that isn't really an issue, although it never seems proper. Are there any docs that state they don't really want you using it?

hannesmiller changed the title ~~Reasons to use EEL for batch Ingestion over Spark~~ Reasons to use EEL for Data Ingestion at REST over Spark Mar 7, 2017

sksamuel assigned hannesmiller Apr 24, 2017

sksamuel added the documentation label Apr 24, 2017

garyfrost added the help wanted label Feb 1, 2018

garyfrost added this to the 1.3 milestone Feb 1, 2018

garyfrost added priority and removed help wanted labels Feb 5, 2018
