cbdb overview test

apache · Jun 5, 2023 · 7917e99 · 7917e99
1 parent dc7265a
commit 7917e99
Showing 1 changed file with 110 additions and 0 deletions.
diff --git a/docs/cbdb-overview-test.md b/docs/cbdb-overview-test.md
@@ -0,0 +1,110 @@
+---
+id: cbdb-overview
+title: Product Introduction
+---
+
+# Cloudberry Database Introduction
+
+Cloudberry Database, built on the latest PostgreSQL 14.4 core (released in mid-2022), stands as one of the most advanced and mature open-source MPP databases available. It encompasses a range of high-performance features, including high concurrency and high availability, to efficiently handle complex tasks and meet the demands of managing and processing massive amounts of data. Its widespread adoption across various domains speaks to its extensive applicability.
+
+- Outstanding Performance: Cloudberry exhibits significant advantages in data storage, high concurrency, high availability, linear scalability, responsiveness, ease of use, and cost-effectiveness. In the era of big data, Cloudberry excels in processing terabyte-level datasets, surpassing Hadoop in terms of stand-alone performance.
+- Strong Syntax Compatibility: Cloudberry offers superior usability and functionality compared to Hive, the SQL engine on Hadoop. Its user-friendly approach makes it more accessible for both novice and experienced users alike.
+- Comprehensive Tooling: Cloudberry boasts a comprehensive suite of tools, eliminating the need for extensive tool customization. This makes it an ideal solution for large-scale data warehouse projects, saving users valuable time and effort.
+- Flexible Deployment: Cloudberry supports flexible deployment options, including the traditional hardware deployment, as well as the multi-cloud and cross-cloud deployments.
+- Support for Different Data Types, Formats, and Storage Media: Cloudberry provides comprehensive support for diverse data types, formats, and storage media. Its flexibility ensures it can effectively meet the various requirements of users.
+
+This manual focuses on introducing the product architecture and the underlying mechanisms of Cloudberry Database's internal modules, and highlights their significance for users.
+
+## 产品架构
+
+In most cases, Cloudberry Database closely resembles PostgreSQL in terms of SQL support, functionality, configuration options, and end-user capabilities. The interaction experience for database users with Cloudberry Database closely mirrors that of interacting with a stand-alone PostgreSQL.
+
+Cloudberry Database utilizes MPP (Massively Parallel Processing) architecture technology to store and process vast amounts of data by distributing data and workloads across multiple servers or hosts.
+
+MPP, also known as a shared-nothing architecture, refers to a system consisting of multiple hosts that collaborate to execute operations. Each host has its own processor, memory, disk, network resources, and operating system. Cloudberry Database leverages this high-performance system architecture to efficiently distribute the workload of massive data and fully utilize all system resources in parallel for query processing.
+
+From a user's perspective, Cloudberry Database is a comprehensive Relational Database Management System (RDBMS). At the physical level, it encompasses multiple PostgreSQL instances. To achieve efficient division of labor and cooperation among multiple independent PostgreSQL instances, Cloudberry Database implements distributed clustering, including data storage, computation, communication, and management. While being a cluster, Cloudberry Database encapsulates all the intricate distributed details, providing users with a unified logical database. This encapsulation significantly simplifies the work of developers and operations personnel.
+
+The architecture diagram of Cloudberry Database is shown below: [Insert Architecture Diagram]
+
+![Cloudberry Architecture](./media/cbdb-arch.png)
+
+HashData is comprised of the following components:
+
+- The **master** serves as the gateway to the Cloudberry Database system. It handles client connections and SQL queries, while delegating tasks to segment instances. Users interact with Cloudberry Database by connecting to the master using client programs (such as psql) or application programming interfaces (APIs) like JDBC, ODBC, or the libpq PostgreSQL C API.
+    - The master acts as the global system directory, containing a set of system tables that hold metadata about the Cloudberry Database system itself. 
+    - The master doesn't store any user data. The data is exclusively stored on segment instances.
+    - The master performs client connection authentication, processes SQL commands, distributes workload among segments, coordinates the results returned by each segment, and presents the final results to the client program.
+    - Cloudberry Database utilizes Write-Ahead Logging (WAL) for master/standby mirroring. In WAL-based logging, all modifications are logged before being written to disk, ensuring data integrity for any in-process operations.
+
+- The **segment** instances are independent Postgres processes. Each segment stores a specific portion of the data and handles the corresponding queries. When a user connects to the database through the master and submits a query request, processes are created on each segment to handle the query. User-defined tables and their indexes are distributed across all available segments within Cloudberry Database, with each segment containing distinct portions of the data. The processes responsible for processing different portions of the data run on their respective segments. Users interact with the segment through the master, and the segment operate on servers known as the segment host.
+
+    Typically, a segment host runs 2 to 8 segment instances, depending on the processor, memory, storage, network interfaces, and workload. Balancing the configuration of segment hosts is critical as Cloudberry Database achieves optimal performance by evenly distributing data and workload among segments, allowing all segments to initiate and complete tasks simultaneously.
+
+- The **interconnect** serves as the network layer in the Cloudberry Database system architecture. It refers to the network infrastructure relied upon for communication between the master and segments, utilizing standard Ethernet switching.
+
+    For optimal performance, it is recommended to employ a network with a speed of 10GB or faster. By default, the interconnect module employs the UDP protocol with flow control (UDPIFC) for communication and message transmission across the network. Cloudberry Database performs packet validation beyond what UDP offers, which means reliability equivalent to using the TCP protocol, and performance and scalability beyond that of TCP. If the interconnect is changed to the TCP protocol instead, Cloudberry Database's scalability is limited to 1000 Data Nodes. However, this restriction does not apply when UDPIFC is used as the default protocol.
+
+- Cloudberry Database ensures data consistency through Multiversion Concurrency Control (MVCC). When querying the database, each transaction sees only a snapshot of the data, which ensures that the current transaction does not see changes made by other transactions on the same records. MVCC provides transaction isolation for every database transaction.
+
+    MVCC avoids explicit locking of database transactions, which minimize lock contention to ensure performance in multi-user environments. One significant advantage of MVCC over locks is that read and write operations do not conflict, and they never block each other.
+
+## 数据加载
+
+Cloudberry Database 通过外部表技术支持大批量并行、持续化的数据加载，能够支持GBK/UTF8等字符集间的自动转换。由于基于 MPP 架构，Scatter-Gather Streaming<sup>TM</sup> 技术提供性能线性扩张。能够支持外部文件服务器、Hive、Hbase、HDFS、S3 多种存储介质以及 CSV、Text、JSON、ORC、Parquet等多种文件格式，支持 Zip 等压缩数据文件加载，被 DataStage、Informatica、Kettle 等多款 ETL 工具集成。
+
+Cloudberry Database 同时支持流式数据加载，针对订阅的 Kafka Topic，根据设置的Task最大值，启动多个 Task 并行读取 Partition 数据，读取后将记录缓存，到一定时间或记录数，通过 gpfdist 加载到 Cloudberry Database 保证数据不重、不丢，用于流数据采集、实时分析场景。支持达到每分钟几千万的数据加载吞吐量。
+
+PXF 是 Cloudberry Database 内置组件，支持将外部数据源映射到 Cloudberry Database 外部表，实现 Data Fabric 架构。并基于 MPP 引擎实现并行、高速的数据访问，支持混合数据生态管理和访问。
+
+![数据加载架构图](./media/cbdb-data-loading-arch.png)
+
+## 数据存储和安全
+
+Cloudberry Database 计算的并行化基于数据在存储层的均匀分布，数据均匀分布是并行处理的关键，Cloudberry Database 数据库提供了 Hash 和 Random 两种方式存储层分布数据，保证：数据均匀分布在每一块磁盘上面发挥每一块磁盘性能，根本上解决I/O瓶颈Cloudberry Database 提供了更灵活的分布方式。
+
+针对小表可以采用Replication Table 支持用户在创建表示指定自定义Hash算法，灵活控制数据分布。同时支持行式存储和列式存储。
+
+- 行式存储：更新速度快，大多数字段频繁查询，随机行访问较多。
+- 列式存储：少数字段查询，大幅节省 I/O 操作，大数据量频繁访问。
+
+Cloudberry 可以按照应用类型设计存储模式，最细粒度到分区，实现一张表多种存储模式，达到最优化访问性能。查询执行时，Cloudberry Database 优化器会根据用户使用的存储形态根据统计信息生成对应最优的查询计划，而不需要用户干预。
+
+![行式存储和列式存储](./media/cbdb-row-col-storage.png)
+
+数据压缩提高数据处理性能，压缩比依赖于压缩算法和数据内容，针对移动信令、话单、点击流数据压缩比可以达到 20 倍以上。无论哪种存储模式，均支持压缩，一张表的不同列支持不同的压缩算法。Cloudberry Database 提供多种压缩算法：
+
+- Zlib 1-9，压缩比高，占用 CPU 资源较多，适用于 CPU 计算能力较强的场景。
+- Zstandard 1~19，实现 CPU 与压缩比的平衡。
+
+同时，数据安全也非常重要，Cloudberry 支持多数据库，数据库之间数据不共享，跨数据库访问可通过 DBLink。数据库内部数据的逻辑组织，包括多累数据对象，如：表、视图、索引、函数等，数据访问可以跨 Schema。
+
+在存储安全性上，支持不同存储模式，支持数据冗余，支持数据加密 AES 128、192，256 DES等以及国密加密。支持密文认证，支持各类加密算法 SCRAM-SHA-256、MD5、LDAP、RADIUS 等。针对不同的用户，在不同级别的对象（如：Schema、表、行、列、视图、函数等）上进行多种类型的权限设定，可以设定的权限包括：SELECT、UPDATE、执行权、所有权等等。
+
+## 数据分析
+
+Cloudberry Database 内核内置了强大的并行优化器和执行器，能够兼容 PostgreSQL 生态，能够支持数据分区裁剪、索引（BTree，Bitmap，Hash，Brin，GIN等），JIT（表达式即时编译处理）等技术。
+
+除此之外，Cloudberry Database 集成了大量丰富的分析组件：
+
+- 机器学习组件。MADlib on Cloudberry Database：全部 SQL 驱动，算法 + 算力 + 数据。
+- PL language。开发人员可以使用 R、Python、Perl、Java、PostgreSQL 等语言编写用户自定义函数。
+- 基于 MPP 引擎，实现高性能、并行计算，与 SQL 无缝集成，针对 SQL 执行结果计算、分析。
+- PostGIS。基于PostGIS 2.X 进行了企业级改进，支持 Cloudberry Database MPP 架构，集成对象存储，支持大容量对象从 OSS 加载入库，支持所有的空间数据类型（geometry、geography 、Raster等），支持时空索引，支持复杂的空间和地理位置计算，球体长度计算 空间聚集函数（包含、覆盖、相交等）。
+- Cloudberry Database Text 组件。支持利用 ElasticSearch 加速文件检索能力，相比传统的 GIN 数据文本查询性能达到数量级的明显提升，并且支持多种分词，自然语言处理，查询结果渲染等能力。
+
+## 灵活的工作负载管理
+
+- 连接池 PGBouncer（Connection 级，在连接级别支持 Cloudberry 集群高并发）：数据库端统一管理会话，控制同时有多少用户可以接入，避免频繁创建销毁服务进程，占用内存小，支持高并发，使用 libevent 进行 Socket 通信，效率更高。
+- 资源组 Resource Group（Session 级，在会话级别量化控制 Cloudberry 集群资源）：梳理典型工作负载，分析负载 CPU、内存、并发度需求，基于对工作负载的分析设置 Resource Group，监控 GP 运行，动态调整 RS，利用规则清理空闲会话。
+- 动态分配资源组（Query 级，在 SQL 级别动态调整 CBDB 集群资源）：在 SQL 语句执行前或执行过程中，动态实现资源的灵活、动态调配，用于优待特定查询，从而缩短其运行时间。
+
+## 高度兼容第三方产品
+
+Cloudberry Database 数据库与 BI 工具、挖掘预测工具、ETL 工具、J2EE/.NET 应用程序、以及其他数据源/计算引擎均有良好的连通性。
+
+![第三方兼容](./media/cbdb-third-party-compati.png)
+
+## 跨平台和国产化支持
+
+Cloudberry Database 支持多种包括 X86、ARM、飞腾、鲲鹏、海光等系统硬件架构，以及 CentOS、Ubuntu、Kylin、BC-Linux 等多种操作系统环境。