Spark is great at processing data in parallel when that data already lives in a distributed store like HDFS, but it's not really designed for ingesting data at rest from a non-distributed store like a local file system, even though local mode does support it.
The disadvantages of ingesting data at rest from a local file system:
There's no advantage in using YARN on a local file system because it's not a distributed store. You would need to distribute the data beforehand, which defeats the purpose.
Though it's tempting to use local mode with all the file formats Spark supports out of the box, you still face the issues mentioned above. Moreover, in local mode all partitioned data is distributed across N threads in the same process as the client, which can become a memory bottleneck.
Memory is also a bottleneck if you decide to collect the data in Spark: collect pulls all the data into memory before you can process it. It is not stream-based like a Java InputStream or a JDBC ResultSet iterator.
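To make the collect point concrete, here is a minimal sketch in plain Python (not Spark or EEL code). `read_rows` is a hypothetical stand-in for any row source; the two functions differ only in whether they materialise every row first or consume rows one at a time, the way a ResultSet iterator would.

```python
from typing import Dict, Iterator, Tuple


def read_rows(n: int) -> Iterator[Dict[str, int]]:
    """Hypothetical source: yields one row at a time, like a cursor."""
    for i in range(n):
        yield {"id": i, "value": i * 2}


def collect_then_sum(n: int) -> Tuple[int, int]:
    """Collect-style: materialise every row before processing.

    Returns (total, number of rows held in memory at once).
    """
    rows = list(read_rows(n))  # the whole dataset now lives in memory
    return sum(row["value"] for row in rows), len(rows)


def stream_sum(n: int) -> Tuple[int, int]:
    """Stream-style: process each row as it arrives and then drop it.

    Returns (total, number of rows held in memory at once).
    """
    total = 0
    held = 0
    for row in read_rows(n):
        total += row["value"]
        held = 1  # only the current row is referenced at any time
    return total, held
```

Both functions return the same total; only the number of rows held at once changes, which is what matters when the dataset is larger than the client's heap.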
Spark does support JDBC datasets, but you still need to provide a partitioning strategy so that Spark can split your query into multiple select statements, one per partition. That makes more throughput possible, but if you are saving to HDFS you can end up with lots of small files, which is not good for Hadoop. With EEL you have more control because you can specify N ioThreads for your sources and sinks, i.e. use more threads to read in parallel on your Source and fewer on your Sink, resulting in sensible file sizes if you are writing to HDFS.
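As a rough illustration of what "split your query into multiple select statements" means, here is a simplified sketch in plain Python. It is not Spark's actual implementation; the function name, its parameters, and the `orders`/`id` names in the usage are all made up for the example. It divides a numeric column range into equal strides and emits one query per partition, leaving the first and last ranges open-ended so no rows are missed.

```python
from typing import List


def partition_queries(table: str, column: str, lower: int, upper: int,
                      num_partitions: int) -> List[str]:
    """Split one table scan into range-bounded SELECTs (illustrative only)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if num_partitions == 1 or upper <= lower:
        # Nothing to split: a single unbounded query.
        return [f"SELECT * FROM {table}"]

    # Width of each range; never let it drop to zero.
    stride = max((upper - lower) // num_partitions, 1)
    queries = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lo + stride
        if i == 0:
            # First range is open below and also picks up NULL keys.
            where = f"{column} < {hi} OR {column} IS NULL"
        elif i == num_partitions - 1:
            # Last range is open above.
            where = f"{column} >= {lo}"
        else:
            where = f"{column} >= {lo} AND {column} < {hi}"
        queries.append(f"SELECT * FROM {table} WHERE {where}")
    return queries


if __name__ == "__main__":
    for query in partition_queries("orders", "id", 0, 100, 4):
        print(query)
```

Each query becomes one partition, and each partition typically becomes one output file, which is how a high partition count on the read side can turn into many small files on HDFS unless the write side uses fewer writers.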
hannesmiller changed the title from "Reasons to use EEL for batch Ingestion over Spark" to "Reasons to use EEL for Data Ingestion at REST over Spark" on Mar 7, 2017
I think the collect issue is the same in EEL anyway: if you collect into memory, it doesn't matter whether the input is stream-based or not.
Local mode with Spark does work, so that isn't really an issue, although it never seems proper. Are there any docs stating that they don't really want you using it?
For documentation???