[Umbrella] InLong Transform feature #10022

Open
14 of 16 tasks
luchunliang opened this issue Apr 20, 2024 · 1 comment

luchunliang commented Apr 20, 2024

Motivation

InLong Transform extends InLong's access and distribution capabilities. On the access side, it adapts to a wider variety of data protocols and reporting scenarios; on the distribution side, it adapts to complex and diverse data distribution scenarios. It improves data quality and collaboration by providing joining, aggregation, filtering, grouping, value extraction, sampling, and other computing capabilities that are decoupled from the computing engine. It simplifies the pre-processing that users must do when reporting data and before starting data analysis, lowering the threshold for data usage so that users can focus on the business value of their data.

Scenarios

  • Data Cleansing: During the data integration process, it is necessary to clean data from different sources to eliminate errors, duplicates, and inconsistencies. Transform capabilities can help companies perform data cleansing more effectively and improve data quality.

  • Data Fusion: Combining data from different sources for unified analysis and reporting. Transform capabilities can handle data in different formats and structures, enabling data fusion and integration.

  • Data Standardization: Converting data into a unified standard format for cross-system and cross-platform data analysis. Transform capabilities can help companies achieve data standardization and normalization.

  • Data Partitioning and Indexing: To improve the performance of data queries and analysis, data needs to be partitioned and indexed. Transform capabilities can dynamically adjust field values for partitioning and indexing, thereby improving the performance of the data warehouse.

  • Data Aggregation and Calculation: During the data analysis process, data needs to be aggregated and calculated to extract valuable information. Transform capabilities can perform complex data aggregation and calculations, supporting multi-dimensional data analysis.

  • Data Security and Privacy Protection: During the data integration process, it is essential to ensure data security and privacy. Transform capabilities can implement data de-identification, encryption, and authorization management to protect data security and privacy.

  • Cross-team Data Sharing: For data security, only a filtered subset of a data stream is shared with other teams. To decouple data dependencies, a data interface is agreed upon with the collaborating teams, and the merging of multiple streams into that interface can be adjusted dynamically.
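
As a rough illustration of the filtering and de-identification scenarios above, the sketch below shares only a filtered, field-restricted, masked view of a stream. It is written in Python for brevity and is not InLong Transform's API: `mask_tail`, `pseudonymize`, `share_subset`, the salt value, and the field names are all hypothetical.

```python
import hashlib
from typing import Callable, Dict, Iterable, Iterator, Optional, Sequence

Record = Dict[str, str]


def mask_tail(value: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with '*'."""
    if visible <= 0:
        return "*" * len(value)
    return "*" * max(len(value) - visible, 0) + value[-visible:]


def pseudonymize(value: str, salt: str = "demo-salt") -> str:
    """Return a stable, non-reversible token for `value` (hypothetical salt)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


def share_subset(
    records: Iterable[Record],
    allowed_fields: Sequence[str],
    predicate: Callable[[Record], bool],
    maskers: Optional[Dict[str, Callable[[str], str]]] = None,
) -> Iterator[Record]:
    """Yield only matching records, restricted to allowed fields, with masking."""
    maskers = maskers or {}
    for rec in records:
        if not predicate(rec):
            continue  # the record is outside the agreed subset
        out: Record = {}
        for name in allowed_fields:
            if name in rec:
                masker = maskers.get(name)
                out[name] = masker(rec[name]) if masker else rec[name]
        yield out
```

Because the filter, the field allow-list, and the maskers are plain parameters, the shared view can be changed without touching the producer of the stream.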

Feature list

Rich Data Protocols

  • In addition to CSV and KV, standard protocols such as PB (Protocol Buffers), JSON, and Thrift are supported, as well as business-specific custom HTTP and TCP packet protocols.
  • In the collection and distribution stages, Transform is integrated as an SDK to implement protocol processing and data conversion.
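
One way to picture the protocol handling described above is a set of decoders that all produce the same record shape, so downstream transform logic does not care which wire format arrived. The sketch below covers only CSV, KV, and JSON using the Python standard library. The `decode` dispatcher, the `DECODERS` table, and the `&`/`=` KV separators are assumptions for illustration, not the InLong Transform SDK.

```python
import csv
import io
import json
from typing import Any, Callable, Dict, Sequence


def decode_csv(payload: str, fields: Sequence[str], delimiter: str = ",") -> Dict[str, Any]:
    """Map one CSV line onto the given field names (quoted values are honored)."""
    row = next(csv.reader(io.StringIO(payload), delimiter=delimiter), [])
    return dict(zip(fields, row))


def decode_kv(payload: str, entry_sep: str = "&", kv_sep: str = "=") -> Dict[str, Any]:
    """Parse 'k1=v1&k2=v2' style text; the separators are assumed defaults."""
    record: Dict[str, Any] = {}
    for entry in payload.split(entry_sep):
        if not entry:
            continue
        key, _, value = entry.partition(kv_sep)
        record[key] = value
    return record


def decode_json(payload: str) -> Dict[str, Any]:
    """Parse a JSON object; anything other than an object is rejected."""
    value = json.loads(payload)
    if not isinstance(value, dict):
        raise ValueError("expected a JSON object")
    return value


# Hypothetical registry: format name -> decoder producing a plain dict.
DECODERS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "csv": decode_csv,
    "kv": decode_kv,
    "json": decode_json,
}


def decode(fmt: str, payload: str, **options: Any) -> Dict[str, Any]:
    """Dispatch to the decoder registered for `fmt`."""
    try:
        decoder = DECODERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None
    return decoder(payload, **options)
```

Adding a binary protocol such as PB or Thrift would mean registering another decoder under a new key; the rest of the pipeline would stay unchanged.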


Decoupling from the Computing Engine

  • Transform processes data with its own internal flow rather than referencing the computing engine's operators, which decouples it from the computing engine.
  • Data output Writers and aggregation flows are registered with the Transform framework through defined interfaces, so the same Transform logic can be adapted to different computing engines.
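
The registration idea above can be sketched as a small pipeline that knows only an abstract writer interface, while each engine supplies its own implementation. This is a minimal sketch under stated assumptions: `Writer`, `ListWriter`, `TransformPipeline`, and `register_writer` are hypothetical names, not the interfaces defined by InLong Transform.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, str]
# A step returns the transformed record, or None to drop it.
Step = Callable[[Record], Optional[Record]]


class Writer(ABC):
    """Engine-specific output adapter (hypothetical interface)."""

    @abstractmethod
    def write(self, record: Record) -> None:
        ...


class ListWriter(Writer):
    """In-memory writer standing in for an engine's real sink."""

    def __init__(self) -> None:
        self.records: List[Record] = []

    def write(self, record: Record) -> None:
        self.records.append(record)


class TransformPipeline:
    """Runs steps in order and hands results to every registered writer."""

    def __init__(self, steps: Iterable[Step]) -> None:
        self._steps = list(steps)
        self._writers: List[Writer] = []

    def register_writer(self, writer: Writer) -> None:
        self._writers.append(writer)

    def process(self, record: Record) -> None:
        current: Optional[Record] = record
        for step in self._steps:
            current = step(current)
            if current is None:
                return  # the record was filtered out by this step
        for writer in self._writers:
            writer.write(current)
```

The pipeline never imports an engine operator; moving to a different engine in this sketch means writing another `Writer` subclass and registering it.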


Seamless and Lossless Changes

  • Transform periodically pulls its configuration from the Manager, so configuration changes take effect seamlessly and without data loss.
  • This avoids the job restarts that changes to FlinkSQL and SparkSQL jobs require.
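
A periodic pull of this kind might look roughly like the sketch below: a background poller fetches a versioned configuration and swaps it in under a lock, so a record in flight keeps the snapshot it started with and later records see the new one. The `fetch` callable stands in for a call to the Manager, and `ReloadableConfig`, `refresh`, and `start_polling` are hypothetical names, not InLong's actual client.

```python
import threading
from typing import Any, Callable, Dict, Optional, Tuple

Config = Dict[str, Any]
# Assumed shape of a Manager response: (version, configuration).
Fetch = Callable[[], Tuple[int, Config]]


class ReloadableConfig:
    """Holds the latest configuration and replaces it without a restart."""

    def __init__(self, fetch: Fetch) -> None:
        self._fetch = fetch
        self._lock = threading.Lock()
        self._version: Optional[int] = None
        self._config: Config = {}
        self.refresh()  # load the initial configuration

    def refresh(self) -> bool:
        """Pull once; return True only if a new version was applied."""
        version, config = self._fetch()
        with self._lock:
            if version == self._version:
                return False
            self._version, self._config = version, config
            return True

    def current(self) -> Config:
        """Return the snapshot to use for the next record."""
        with self._lock:
            return self._config

    def start_polling(self, interval_seconds: float, stop: threading.Event) -> threading.Thread:
        """Refresh every `interval_seconds` until `stop` is set."""

        def loop() -> None:
            # Event.wait returns False on timeout, True once stop is set.
            while not stop.wait(interval_seconds):
                self.refresh()

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread
```

Processing code would call `current()` once per record (or batch) and use that snapshot throughout, which is what keeps the change from interrupting work already under way.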


Automatic Scaling

  • Transform tasks can be scheduled across different computing jobs, enabling seamless and lossless automatic scaling.


Task list

InLong Component

Other for not specified component

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!

github-actions bot commented Jul 7, 2024

This issue is stale because it has been open for 60 days with no activity.
