[Umbrella] InLong Transform feature #10022

luchunliang · 2024-04-20T09:01:47Z

Motivation

InLong Transform extends InLong's access and distribution capabilities. On the access side, it adapts to a richer variety of data protocols and reporting scenarios; on the distribution side, it adapts to complex and diverse data distribution scenarios. It provides connection, aggregation, filtering, grouping, value extraction, sampling, and other computing capabilities that are decoupled from the computing engine, which improves data quality and collaboration. This simplifies the pre-processing users must do both when reporting data and before starting data analysis, lowers the threshold for data usage, and lets users focus on the business value of their data.

Scenarios

Data Cleansing: During the data integration process, it is necessary to clean data from different sources to eliminate errors, duplicates, and inconsistencies. Transform capabilities can help companies perform data cleansing more effectively and improve data quality.
Data Fusion: Combining data from different sources for unified analysis and reporting. Transform capabilities can handle data in different formats and structures, enabling data fusion and integration.
Data Standardization: Converting data into a unified standard format for cross-system and cross-platform data analysis. Transform capabilities can help companies achieve data standardization and normalization.
Data Partitioning and Indexing: To improve the performance of data queries and analysis, data needs to be partitioned and indexed. Transform capabilities can dynamically adjust field values for partitioning and indexing, thereby improving the performance of the data warehouse.
Data Aggregation and Calculation: During the data analysis process, data needs to be aggregated and calculated to extract valuable information. Transform capabilities can perform complex data aggregation and calculations, supporting multi-dimensional data analysis.
Data Security and Privacy Protection: During the data integration process, it is essential to ensure data security and privacy. Transform capabilities can implement data de-identification, encryption, and authorization management to protect data security and privacy.
Cross-team Data Sharing: For data security reasons, only filtered subsets of data streams are shared. To decouple data dependencies, data interfaces are agreed upon with collaborating teams, and the merging of multiple streams into the agreed data stream interface can be adjusted dynamically.

Feature list

Rich Data Protocols

In addition to CSV and KV, standard protocols such as PB, JSON, and Thrift are supported, as well as business-customized HTTP packet and TCP packet protocols.
In the collection and distribution stages, Transform is integrated as an SDK to implement protocol processing and data conversion.
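As a rough illustration of the kind of protocol conversion described above, the following sketch turns one CSV record into a KV string. It is not the InLong Transform SDK API: the function name `csv_to_kv`, its parameters, and the `key=value&key=value` output layout with percent-encoded values are all assumptions made for this example.

```python
import csv
import io
from urllib.parse import quote


def csv_to_kv(line, field_names, delimiter=",", kv_sep="=", pair_sep="&"):
    """Convert one CSV record into a KV string (hypothetical helper, not an InLong API)."""
    # csv.reader handles quoted fields that contain the delimiter.
    row = next(csv.reader(io.StringIO(line), delimiter=delimiter), [])
    if len(row) != len(field_names):
        raise ValueError(f"expected {len(field_names)} fields, got {len(row)}")
    # Percent-encode values so separators inside a value cannot break the KV framing
    # (the encoding scheme is an assumption of this sketch).
    return pair_sep.join(
        f"{name}{kv_sep}{quote(value, safe='')}"
        for name, value in zip(field_names, row)
    )
```

For example, `csv_to_kv('1,"a,b",x=y', ["id", "name", "expr"])` keeps the quoted comma inside the `name` value instead of splitting on it.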

Decoupling from the Computing Engine

Transform processes data with its own internal flow instead of referencing the computing engine's operators, which decouples it from the computing engine.
Data output Writers and aggregation flows are registered with the Transform framework through defined interfaces, so the same Transform logic can be adapted to different computing engines.
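One way to picture this registration pattern is the minimal sketch below. The names `Writer`, `TransformPipeline`, `register_writer`, and `ListWriter` are hypothetical and chosen only for illustration; the actual interfaces are to be defined by the tasks under this umbrella issue.

```python
from abc import ABC, abstractmethod


class Writer(ABC):
    """Hypothetical output interface; each computing engine would supply its own."""

    @abstractmethod
    def write(self, record: dict) -> None:
        ...


class TransformPipeline:
    """Runs an engine-independent transform function and fans out to registered writers."""

    def __init__(self, transform_fn):
        self._transform_fn = transform_fn
        self._writers = []

    def register_writer(self, writer: Writer) -> None:
        self._writers.append(writer)

    def process(self, record: dict) -> None:
        out = self._transform_fn(record)
        if out is None:  # the transform filtered this record out
            return
        for writer in self._writers:
            writer.write(out)


class ListWriter(Writer):
    """In-memory writer standing in for an engine-specific sink."""

    def __init__(self):
        self.records = []

    def write(self, record: dict) -> None:
        self.records.append(record)
```

In this sketch the pipeline never imports anything from a computing engine; only the `Writer` implementations would, which is the decoupling the text describes.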

Seamless and Lossless Changes

Transform supports periodically pulling its configuration from the Manager, so configuration changes take effect seamlessly and losslessly.
This avoids the job restarts that changes to FlinkSQL and SparkSQL would otherwise require.
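The periodic-pull idea can be sketched as follows. This is an assumption-laden illustration, not InLong's implementation: `ReloadableTransform`, `fetch_config`, `build_transform`, and the `version` field in the configuration are hypothetical names, and how the Manager is actually queried is left out.

```python
import threading


class ReloadableTransform:
    """Swaps in a newly built transform when the fetched config version changes."""

    def __init__(self, fetch_config, build_transform):
        self._fetch_config = fetch_config        # hypothetical: returns a dict with "version"
        self._build_transform = build_transform  # hypothetical: config -> callable(record)
        self._lock = threading.Lock()
        self._version = None
        self._transform = None
        self.refresh()

    def refresh(self):
        """Pull the config once; return True if a new transform was installed."""
        config = self._fetch_config()
        if config["version"] == self._version:
            return False
        # Build the new transform fully before swapping, so processing never
        # observes a half-initialized state and no restart is needed.
        new_transform = self._build_transform(config)
        with self._lock:
            self._version = config["version"]
            self._transform = new_transform
        return True

    def process(self, record):
        with self._lock:
            transform = self._transform
        return transform(record)

    def start_polling(self, interval_seconds, stop_event):
        """Call refresh() every interval_seconds until stop_event is set."""
        def loop():
            while not stop_event.wait(interval_seconds):
                self.refresh()

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread
```

Records processed before the swap use the old configuration and records processed after it use the new one, without stopping the job that calls `process`.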

Automatic Scaling

Transform tasks support scheduling between different computing jobs, achieving seamless and lossless automatic scaling.

Task list

InLong Component

Other for not specified component

Are you willing to submit PR?

Yes, I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

github-actions · 2024-07-07T01:55:36Z

This issue is stale because it has been open for 60 days with no activity.

luchunliang added the type/umbrella label Apr 20, 2024

luchunliang added this to the 1.13.0 milestone Apr 20, 2024

luchunliang self-assigned this Apr 20, 2024

dockerzhang mentioned this issue May 6, 2024

[Feature][SDK] Support to transform from PB protocol to CSV/KV protocol by single SQL #10117

Closed

2 tasks

github-actions bot added the stage/stale label (Issues or PRs that had no activity for a long time) Jul 7, 2024

luchunliang modified the milestones: 1.13.0, 1.14.0 Jul 18, 2024

github-actions bot removed the stage/stale label (Issues or PRs that had no activity for a long time) Jul 19, 2024

vernedeng mentioned this issue Aug 15, 2024

[Umbrella] Tencent Rhino-bird: Expand InLong Transform functions #10796

Open

2 tasks
