In many big data and near-real-time use cases we need to analyse the data in place rather than copying large datasets around. This principle is also called zero copy or compute-to-data.
There is a similar upstream EDC discussion about code2data and confidential compute from 2023; here I want to discuss the need in Catena-X / Eclipse Tractus-X.
The aim is to process data where it resides rather than moving large datasets to separate compute resources. This improves cost efficiency and performance by reducing data movement, and it enhances data security and privacy by keeping sensitive data in its original location.
The Manufacturing-X family in particular aims to implement a federated, decentralized and collaborative data ecosystem for smart manufacturing. It recognizes that deployments need customization across an infrastructure continuum from cloud to edge, depending on the application. Manufacturing-X addresses cross-industry use cases based on the collaborative use of data, which will likely involve integrating cloud and edge computing capabilities. Edge computing is seen as a key trend reshaping manufacturing: it allows data processing closer to the source, enabling manufacturers to filter data, reduce server load, and perform local data analysis in real time.
Apache Iceberg provides features such as time travel and incremental queries that enable efficient data processing where the data is stored, which can be seen as a way of bringing compute closer to the big data.
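To make the two features concrete, here is a minimal sketch of the ideas behind time travel and incremental reads, modeled with plain Python data structures instead of a real Iceberg catalog. The class, snapshot IDs and row values are purely illustrative; real Iceberg tracks snapshots in table metadata and data in Parquet/ORC files.

```python
# Toy model of Iceberg-style snapshots: every commit produces an immutable
# snapshot, so old table states stay readable (time travel) and the delta
# between two snapshots is cheap to compute (incremental query).
class ToyTable:
    def __init__(self):
        self.snapshots = []  # list of (snapshot_id, tuple_of_rows)

    def append(self, rows):
        previous = self.snapshots[-1][1] if self.snapshots else ()
        snapshot_id = len(self.snapshots) + 1
        self.snapshots.append((snapshot_id, previous + tuple(rows)))
        return snapshot_id

    def read(self, snapshot_id=None):
        """Time travel: read the table as of a given snapshot."""
        if snapshot_id is None:
            snapshot_id = self.snapshots[-1][0]
        return next(rows for sid, rows in self.snapshots if sid == snapshot_id)

    def incremental_read(self, from_snapshot, to_snapshot):
        """Incremental query: only the rows appended between two snapshots."""
        old = self.read(from_snapshot)
        new = self.read(to_snapshot)
        return new[len(old):]

table = ToyTable()
s1 = table.append([("sensor-1", 20.5)])
s2 = table.append([("sensor-1", 21.0), ("sensor-2", 19.8)])
print(table.read(s1))                  # state as of the first commit
print(table.incremental_read(s1, s2))  # only the rows added afterwards
```

In real Iceberg the same pattern is exposed through SQL (`FOR SYSTEM_VERSION AS OF`) or the snapshot APIs of the processing engine.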
So why do we need Apache Iceberg table support?
Apache Iceberg is emerging as a quasi-standard for many popular data platforms in cloud environments, offering features that enhance data management and analytics at scale:
Snowflake: Specializes in cloud-based data warehousing
Databricks: Built on Apache Spark, excels in real-time data processing, machine learning, and large-scale data processing
Google Cloud has announced Iceberg support for BigLake
Amazon Web Services (AWS) mentions Iceberg as a solution for transactional data lakes
Cloudera offers Iceberg as part of their open data lakehouse solution
Microsoft has announced plans to support Iceberg in OneLake in Fabric
Iceberg is designed to work with various data processing engines and storage systems, including Spark, Trino, Flink, Presto, Hive and Impala
Large technology companies like Netflix and Apple were involved in Iceberg's creation, and it is being deployed by some of the largest technology companies. Here you can find an article on what Apache Iceberg means for the data community. An Iceberg extension would help to increase the acceptance of the EDC connector and of data spaces in general.
Integrating Apache Iceberg as an extension in EDC would require custom development, as there is no out-of-the-box integration between the two technologies. Such an integration would allow EDC to use Iceberg tables as data sources and sinks, leveraging Iceberg features like schema evolution and time travel within the context of data spaces.
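The source/sink idea could take a shape like the following sketch. All class and method names here are invented for illustration; EDC's actual data-plane SPI is Java (`DataSource`/`DataSink` in the data-plane framework), and a real extension would implement those interfaces against an Iceberg catalog.

```python
# Hypothetical sketch: an "Iceberg" source and sink wired together by a
# data address, mirroring how a connector data plane moves data between
# a provider-side source and a consumer-side sink.
from dataclasses import dataclass, field

@dataclass
class DataAddress:
    type: str        # e.g. "IcebergTable" (illustrative type name)
    properties: dict # catalog URI, table name, snapshot id, ...

class IcebergSource:
    """Reads parts from an Iceberg table described by a DataAddress."""
    def __init__(self, address, table_reader):
        self.address = address
        self.table_reader = table_reader  # injected; a real impl opens a catalog

    def open_parts(self):
        snapshot = self.address.properties.get("snapshotId")
        return self.table_reader(self.address.properties["table"], snapshot)

@dataclass
class IcebergSink:
    rows_out: list = field(default_factory=list)

    def transfer(self, source):
        for part in source.open_parts():
            self.rows_out.append(part)

# Wiring it together with a stubbed table reader:
def fake_reader(table, snapshot):
    return [("part-1", b"..."), ("part-2", b"...")]

address = DataAddress(type="IcebergTable",
                      properties={"table": "warehouse.sensors", "snapshotId": None})
sink = IcebergSink()
sink.transfer(IcebergSource(address, fake_reader))
print(len(sink.rows_out))
```

The point of the sketch is the separation of concerns: the data address carries only metadata (catalog, table, snapshot), while the heavy data never leaves the provider's storage until the sink pulls the parts it needs.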
Many data providers don't want to send big data sets or IoT streams (e.g. Apache Kafka, MQTT) to the cloud for regulatory, cost, and performance reasons. The Eclipse Dataspace Components connector gives a practical example of how to handle streaming data: https://github.com/eclipse-edc/Samples/tree/main/transfer/streaming/streaming-02-kafka-to-http
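The pattern behind that sample can be sketched as: poll a stream, forward each message to an HTTP endpoint on the consumer side. The Kafka consumer and HTTP call below are stubbed so the flow is self-contained; a real setup would use an actual Kafka client and HTTP library, and the endpoint path and message shape are assumptions.

```python
# Minimal kafka-to-http flow: consume messages from a stream and push
# each one to an HTTP sink, without ever materializing the whole stream.
def fake_consumer():
    """Stands in for a Kafka consumer polling a topic."""
    for i in range(3):
        yield {"offset": i, "value": f"sensor-reading-{i}"}

def forward(message, http_post):
    # A real implementation would POST to the consumer-provided endpoint.
    return http_post("/data-sink", message["value"])

sent = []
def fake_http_post(path, body):
    sent.append((path, body))
    return 200  # simulated HTTP status

for msg in fake_consumer():
    status = forward(msg, fake_http_post)
    assert status == 200

print(len(sent))
```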
This workshop shows how to set up a local S3-compatible data lake to store the IoT data with MinIO: https://github.com/tlepple/iceberg-intro-workshop
It also installs a single-node Apache Iceberg processing engine and lays the groundwork for supporting our Apache Iceberg tables and catalog.
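For a quick local start, a MinIO instance like the one the workshop uses can be brought up with a docker-compose fragment along these lines. Credentials, ports, and the volume path are placeholders; the workshop itself may use a different setup.

```yaml
# Illustrative docker-compose fragment for a local S3-compatible MinIO store.
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: admin            # placeholder credential
      MINIO_ROOT_PASSWORD: change-me    # placeholder credential
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console
    volumes:
      - ./minio-data:/data
```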
Implementation in EDC with Iceberg: via a push transfer (#PushTransfer), according to the transfer methods supported by the Dataspace Protocol: https://docs.internationaldataspaces.org/ids-knowledgebase/v/dataspace-protocol/transfer-process/transfer.process.protocol
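To place the push transfer in context, here is a sketch of the transfer-process states defined by the Dataspace Protocol (REQUESTED, STARTED, SUSPENDED, COMPLETED, TERMINATED). The transition rules below are simplified; after STARTED, the provider would push the Iceberg data to the consumer's sink.

```python
# Simplified state machine for a Dataspace Protocol transfer process.
ALLOWED = {
    "REQUESTED":  {"STARTED", "TERMINATED"},
    "STARTED":    {"SUSPENDED", "COMPLETED", "TERMINATED"},
    "SUSPENDED":  {"STARTED", "TERMINATED"},
    "COMPLETED":  set(),   # terminal
    "TERMINATED": set(),   # terminal
}

class TransferProcess:
    def __init__(self):
        self.state = "REQUESTED"

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

tp = TransferProcess()
tp.transition("STARTED")    # provider begins pushing data
tp.transition("COMPLETED")  # all data delivered
print(tp.state)
```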
More information on Iceberg can be found here: https://iomete.com/the-ultimate-guide-to-apache-iceberg