In many big data and near-real-time use cases we need to analyse the data in place rather than copying large datasets around. This principle is also called zero copy or compute-to-data.
There is a similar upstream EDC discussion about code2data and confidential compute from 2023; here I want to discuss the need in Catena-X / Eclipse Tractus-X.
The aim is to process data where it resides rather than moving large datasets to separate compute resources. This improves cost efficiency and performance by reducing data movement, and it enhances data security and privacy by keeping sensitive data in its original location.
The Manufacturing-X family in particular aims to implement a federated, decentralized and collaborative data ecosystem for smart manufacturing. It recognizes that deployments need customization across an infrastructure continuum from cloud to edge, depending on the application. Manufacturing-X addresses cross-industry use cases based on the collaborative use of data, which will likely involve integrating cloud and edge computing capabilities. Edge computing is seen as a key trend reshaping manufacturing: it allows data processing closer to the source, enabling manufacturers to filter data, reduce server load, and perform local data analysis in real time.
Apache Iceberg provides features such as time travel and incremental queries that enable efficient data processing where the data is stored, which can be seen as a way of bringing compute closer to the big data.
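To make the two features concrete, here is a minimal sketch of the ideas behind time travel and incremental reads, modeled with plain Python data structures instead of a real Iceberg catalog. The class, snapshot IDs and row values are purely illustrative; real Iceberg tracks snapshots in table metadata and data in Parquet/ORC files.

```python
# Toy model of Iceberg-style snapshots: every commit produces an immutable
# snapshot, so old table states stay readable (time travel) and the delta
# between two snapshots is cheap to compute (incremental query).
class ToyTable:
    def __init__(self):
        self.snapshots = []  # list of (snapshot_id, tuple_of_rows)

    def append(self, rows):
        previous = self.snapshots[-1][1] if self.snapshots else ()
        snapshot_id = len(self.snapshots) + 1
        self.snapshots.append((snapshot_id, previous + tuple(rows)))
        return snapshot_id

    def read(self, snapshot_id=None):
        """Time travel: read the table as of a given snapshot."""
        if snapshot_id is None:
            snapshot_id = self.snapshots[-1][0]
        return next(rows for sid, rows in self.snapshots if sid == snapshot_id)

    def incremental_read(self, from_snapshot, to_snapshot):
        """Incremental query: only the rows appended between two snapshots."""
        old = self.read(from_snapshot)
        new = self.read(to_snapshot)
        return new[len(old):]

table = ToyTable()
s1 = table.append([("sensor-1", 20.5)])
s2 = table.append([("sensor-1", 21.0), ("sensor-2", 19.8)])
print(table.read(s1))                  # state as of the first commit
print(table.incremental_read(s1, s2))  # only the rows added afterwards
```

In real Iceberg the same pattern is exposed through SQL (`FOR SYSTEM_VERSION AS OF`) or the snapshot APIs of the processing engine.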
So why do we need Apache Iceberg table support?
Apache Iceberg is emerging as a quasi-standard for many popular data platforms in cloud environments, offering features that enhance data management and analytics at scale:
Snowflake: Specializes in cloud-based data warehousing
Databricks: Built on Apache Spark, excels in real-time data processing, machine learning, and large-scale data processing
Google Cloud has announced Iceberg support for BigLake
Amazon Web Services (AWS) mentions Iceberg as a solution for transactional data lakes
Cloudera offers Iceberg as part of their open data lakehouse solution
Microsoft has announced plans to support Iceberg in OneLake in Fabric
Iceberg is designed to work with various data processing engines and storage systems, including Spark, Trino, Flink, Presto, Hive and Impala
Large technology companies like Netflix and Apple were involved in Iceberg's creation, and it is being deployed by some of the largest technology companies. Here you can find an article on what Apache Iceberg means for the data community. An Iceberg extension would help to increase the acceptance of the EDC connector and of data spaces in general.
Integrating Apache Iceberg as an extension in EDC would require custom development, as there is no out-of-the-box integration between the two technologies. Such an integration would allow EDC to use Iceberg tables as data sources and sinks, leveraging Iceberg features like schema evolution and time travel within the context of data spaces.
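The source/sink idea could take a shape like the following sketch. All class and method names here are invented for illustration; EDC's actual data-plane SPI is Java (`DataSource`/`DataSink` in the data-plane framework), and a real extension would implement those interfaces against an Iceberg catalog.

```python
# Hypothetical sketch: an "Iceberg" source and sink wired together by a
# data address, mirroring how a connector data plane moves data between
# a provider-side source and a consumer-side sink.
from dataclasses import dataclass, field

@dataclass
class DataAddress:
    type: str        # e.g. "IcebergTable" (illustrative type name)
    properties: dict # catalog URI, table name, snapshot id, ...

class IcebergSource:
    """Reads parts from an Iceberg table described by a DataAddress."""
    def __init__(self, address, table_reader):
        self.address = address
        self.table_reader = table_reader  # injected; a real impl opens a catalog

    def open_parts(self):
        snapshot = self.address.properties.get("snapshotId")
        return self.table_reader(self.address.properties["table"], snapshot)

@dataclass
class IcebergSink:
    rows_out: list = field(default_factory=list)

    def transfer(self, source):
        for part in source.open_parts():
            self.rows_out.append(part)

# Wiring it together with a stubbed table reader:
def fake_reader(table, snapshot):
    return [("part-1", b"..."), ("part-2", b"...")]

address = DataAddress(type="IcebergTable",
                      properties={"table": "warehouse.sensors", "snapshotId": None})
sink = IcebergSink()
sink.transfer(IcebergSource(address, fake_reader))
print(len(sink.rows_out))
```

The point of the sketch is the separation of concerns: the data address carries only metadata (catalog, table, snapshot), while the heavy data never leaves the provider's storage until the sink pulls the parts it needs.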
Many data providers don't want to send big data sets or IoT streams (e.g. Apache Kafka, MQTT) to the cloud for regulatory, cost, and performance reasons. The Eclipse Dataspace Components connector gives a practical example of how to handle streaming data: https://github.com/eclipse-edc/Samples/tree/main/transfer/streaming/streaming-02-kafka-to-http
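The pattern behind that sample can be sketched as: poll a stream, forward each message to an HTTP endpoint on the consumer side. The Kafka consumer and HTTP call below are stubbed so the flow is self-contained; a real setup would use an actual Kafka client and HTTP library, and the endpoint path and message shape are assumptions.

```python
# Minimal kafka-to-http flow: consume messages from a stream and push
# each one to an HTTP sink, without ever materializing the whole stream.
def fake_consumer():
    """Stands in for a Kafka consumer polling a topic."""
    for i in range(3):
        yield {"offset": i, "value": f"sensor-reading-{i}"}

def forward(message, http_post):
    # A real implementation would POST to the consumer-provided endpoint.
    return http_post("/data-sink", message["value"])

sent = []
def fake_http_post(path, body):
    sent.append((path, body))
    return 200  # simulated HTTP status

for msg in fake_consumer():
    status = forward(msg, fake_http_post)
    assert status == 200

print(len(sent))
```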
This workshop shows how to set up a local S3-compatible data lake to store the IoT data with MinIO: https://github.com/tlepple/iceberg-intro-workshop
It also installs a single-node Apache Iceberg processing engine and lays the groundwork for supporting our Apache Iceberg tables and catalog.
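For a quick local start, a MinIO instance like the one the workshop uses can be brought up with a docker-compose fragment along these lines. Credentials, ports, and the volume path are placeholders; the workshop itself may use a different setup.

```yaml
# Illustrative docker-compose fragment for a local S3-compatible MinIO store.
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: admin            # placeholder credential
      MINIO_ROOT_PASSWORD: change-me    # placeholder credential
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console
    volumes:
      - ./minio-data:/data
```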
Implementation in EDC with Iceberg: via a push transfer (#PushTransfer), according to the transfer methods supported by the Dataspace Protocol: https://docs.internationaldataspaces.org/ids-knowledgebase/v/dataspace-protocol/transfer-process/transfer.process.protocol
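To place the push transfer in context, here is a sketch of the transfer-process states defined by the Dataspace Protocol (REQUESTED, STARTED, SUSPENDED, COMPLETED, TERMINATED). The transition rules below are simplified; after STARTED, the provider would push the Iceberg data to the consumer's sink.

```python
# Simplified state machine for a Dataspace Protocol transfer process.
ALLOWED = {
    "REQUESTED":  {"STARTED", "TERMINATED"},
    "STARTED":    {"SUSPENDED", "COMPLETED", "TERMINATED"},
    "SUSPENDED":  {"STARTED", "TERMINATED"},
    "COMPLETED":  set(),   # terminal
    "TERMINATED": set(),   # terminal
}

class TransferProcess:
    def __init__(self):
        self.state = "REQUESTED"

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

tp = TransferProcess()
tp.transition("STARTED")    # provider begins pushing data
tp.transition("COMPLETED")  # all data delivered
print(tp.state)
```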
More information on Iceberg can be found here: https://iomete.com/the-ultimate-guide-to-apache-iceberg