Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Epic: rename old names for ASF brand Compliance #213

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
The table of contents is too big for display.
Diff view
Diff view
  •  
  •  
  •  
10 changes: 5 additions & 5 deletions docs/advanced-analytics/directory-tables.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,11 +4,11 @@ title: Directory Table

# Directory Table

Cloudberry Database has introduced *directory tables* in v1.5.3 for unified management of unstructured data on local or object storage.
Apache Cloudberry has introduced *directory tables* in v1.5.3 for unified management of unstructured data on local or object storage.

In the context of large-scale AI, AI applications have generated the need to manage unstructured multi-modal corpora. There is a need to continuously prepare a large amount of high-quality curated unstructured corpora, train large models through data iteration, and summarize rich knowledge bases. Therefore, there are technical challenges in the management and processing of structured corpora.

To address these challenges, Cloudberry Database introduces directory tables for managing multiple types of unstructured data. Developer users can use simple SQL statements to take advantage of the capabilities of multiple computing engines to achieve one-stop data processing and application development.
To address these challenges, Apache Cloudberry introduces directory tables for managing multiple types of unstructured data. Developer users can use simple SQL statements to take advantage of the capabilities of multiple computing engines to achieve one-stop data processing and application development.

Directory tables store, manage, and analyze unstructured data objects. They reside within tablespaces. When unstructured data files are imported, a directory table record (file metadata) is created, and the file itself is loaded into object storage. The table metadata remains associated with the corresponding object storage file.

Expand Down Expand Up @@ -40,7 +40,7 @@ CREATE DIRECTORY TABLE <table_name>;

To create a directory table in an external storage, you first need to create a tablespace in that storage. You'll need to provide connection information of the external storage server, such as server IP address, protocol, and access credentials. The following examples show how to create directory tables on QingCloud Object Storage and HDFS.

1. Create server objects and define connection methods for external data sources. Cloudberry Database supports protocols for multiple storage options, including S3 object storage and HDFS. The following examples create server objects named `oss_server` and `hdfs_server` on QingCloud and HDFS, respectively.
1. Create server objects and define connection methods for external data sources. Apache Cloudberry supports protocols for multiple storage options, including S3 object storage and HDFS. The following examples create server objects named `oss_server` and `hdfs_server` on QingCloud and HDFS, respectively.

- For QingCloud:

Expand All @@ -58,7 +58,7 @@ To create a directory table in an external storage, you first need to create a t

- `protocol`: The protocol used to connect to the external data source. In the examples above, `'qingstor'` indicates using the QingCloud object storage service protocol, and `'hdfs'` indicates using the HDFS storage service protocol.
- `prefix`: Sets the path prefix when accessing object storage. If this prefix is set, all operations will be limited to this specific path, such as `prefix '/rose-oss-test4/usf1'`. This is typically used to organize and isolate data stored in the same bucket.
- `endpoint`: Specifies the network address of the external object storage service. For example, `'pek3b.qingstor.com'` is a specific regional node of the QingCloud service. Through this endpoint, Cloudberry Database can access external data.
- `endpoint`: Specifies the network address of the external object storage service. For example, `'pek3b.qingstor.com'` is a specific regional node of the QingCloud service. Through this endpoint, Apache Cloudberry can access external data.
- `https`: Specifies whether to connect to the object storage service using the HTTPS protocol. In this command, `'false'` indicates using an unencrypted HTTP connection. This setting might be influenced by data transmission security requirements, and it is generally recommended to use HTTPS to ensure data security.
- `virtual_host`: Determines whether to access the bucket using virtual hosting. `'false'` means that bucket access is not done in virtual host style (which means that the bucket name is not included in the URL). This option is typically dependent on the URL format support provided by the storage service provider.
- `namenode`: Represents the IP of the HDFS node. You need to replace `<HDFS node IP:port>` with the actual IP address and port number, such as `'192.168.51.106:8020'`.
Expand Down Expand Up @@ -134,7 +134,7 @@ In general, the fields of a directory table are as follows:

### Upload file into directory table

After uploading a file to a directory table, Cloudberry Database manages the file's upload to local storage or object storage and stores the file's metadata in the directory table. In Cloudberry Database v1.5.3, users cannot directly manage object storage directory files.
After uploading a file to a directory table, Apache Cloudberry manages the file's upload to local storage or object storage and stores the file's metadata in the directory table. In Apache Cloudberry v1.5.3, users cannot directly manage object storage directory files.

Upload files from local storage to database object storage:

Expand Down
24 changes: 12 additions & 12 deletions docs/advanced-analytics/postgis.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,17 +4,17 @@ title: Geospatial Analytics

# Geospatial Analytics

[PostGIS](https://postgis.net/) extends the capabilities of the PostgreSQL by adding support for storing, indexing, and querying geospatial data. Cloudberry Database supports PostGIS for geospatial analytics.
[PostGIS](https://postgis.net/) extends the capabilities of the PostgreSQL by adding support for storing, indexing, and querying geospatial data. Apache Cloudberry supports PostGIS for geospatial analytics.

This document introduces how to compile and build PostGIS for your Cloudberry Database cluster.
This document introduces how to compile and build PostGIS for your Apache Cloudberry cluster.

You can access the Cloudberry Database PostGIS project repo at [`cloudberrydb/postgis`](https://github.com/cloudberrydb/postgis). The PostGIS code in this repo is dedicated to Cloudberry Database. The compilation and building method introduced in this document is based on the code of this repo.
You can access the PostGIS for Apache Cloudberry project repo at [`cloudberry-contrib/postgis`](https://github.com/cloudberry-contrib/postgis). The PostGIS code in this repo is dedicated to Apache Cloudberry. The compilation and building method introduced in this document is based on the code of this repo.

## Compile PostGIS for Cloudberry Database
## Compile PostGIS for Apache Cloudberry

Before installing PostGIS for Cloudberry Database, install the required dependencies and compile several components. This process is currently supported only on CentOS, with plans to support Rocky Linux in the future.
Before installing PostGIS for Apache Cloudberry, install the required dependencies and compile several components. This process is currently supported only on CentOS, with plans to support Rocky Linux in the future.

Before you get started, ensure that the Cloudberry Database is correctly installed on your machine. If it is not installed, see the [documentation](https://cloudberrydb.org/docs/) for installation instructions.
Before you get started, ensure that the Apache Cloudberry is correctly installed on your machine. If it is not installed, see the [documentation](https://cloudberry.apache.org/docs/) for installation instructions.

1. Install the pre-requested dependencies.

Expand Down Expand Up @@ -93,10 +93,10 @@ Before you get started, ensure that the Cloudberry Database is correctly install

3. Build and install PostGIS.

1. Download the `cloudberrydb/postgis` repo to your `/home/gpadmin` directory:
1. Download the `cloudberry-contrib/postgis` repo to your `/home/gpadmin` directory:

```bash
git clone https://github.com/cloudberrydb/postgis.git /home/gpadmin/postgis
git clone https://github.com/cloudberry-contrib/postgis.git /home/gpadmin/postgis
chown -R gpadmin:gpadmin /home/gpadmin/postgis
```

Expand All @@ -105,8 +105,8 @@ Before you get started, ensure that the Cloudberry Database is correctly install
Before starting the compilation process, run the following commands to make sure the environment variables are set ready:

```bash
source /usr/local/cloudberrydb/greenplum_path.sh
source /home/gpadmin/cloudberrydb/gpAux/gpdemo/gpdemo-env.sh
source /usr/local/cloudberry/greenplum_path.sh
source /home/gpadmin/cloudberry/gpAux/gpdemo/gpdemo-env.sh
scl enable devtoolset-10 bash
source /opt/rh/devtoolset-10/enable
```
Expand All @@ -120,9 +120,9 @@ Before you get started, ensure that the Cloudberry Database is correctly install
make && make install
```

## Use PostGIS in Cloudberry Database
## Use PostGIS in Apache Cloudberry

After you have compiled and built PostGIS and the supporting extensions successfully on your Cloudberry Database cluster and have started the demo cluster, you can run the following commands to enable PostGIS and the supporting extensions:
After you have compiled and built PostGIS and the supporting extensions successfully on your Apache Cloudberry cluster and have started the demo cluster, you can run the following commands to enable PostGIS and the supporting extensions:

```sql
$ psql -p 7000 postgres
Expand Down
8 changes: 4 additions & 4 deletions docs/basic-query-syntax.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,11 +2,11 @@
title: Basic Query Syntax
---

# Basic Queries of Cloudberry Database
# Basic Queries of Apache Cloudberry

This document introduce the basic queries of Cloudberry Database.
This document introduce the basic queries of Apache Cloudberry.

Cloudberry Database is a high-performance, highly parallel data warehouse developed based on PostgreSQL and Greenplum. Here are some examples of the basic query syntax.
Apache Cloudberry is a high-performance, highly parallel data warehouse developed based on PostgreSQL and Greenplum. Here are some examples of the basic query syntax.

- `SELECT`: Used to retrieve data from databases & tables.

Expand Down Expand Up @@ -59,7 +59,7 @@ Cloudberry Database is a high-performance, highly parallel data warehouse develo
WHERE department_id IN (SELECT id FROM departments WHERE location = 'New York'); -- Queries all employees working in New York.
```

The above is just a brief overview of the basic query syntax in Cloudberry Database. Cloudberry Database also provides more advanced queries and functions to help developers perform complex data operations and analyses.
The above is just a brief overview of the basic query syntax in Apache Cloudberry. Apache Cloudberry also provides more advanced queries and functions to help developers perform complex data operations and analyses.

## See also

Expand Down
30 changes: 15 additions & 15 deletions docs/cbdb-architecture.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,37 +2,37 @@
title: Architecture
---

# Cloudberry Database Architecture
# Apache Cloudberry Architecture

This document introduces the product architecture and the implementation mechanism of the internal modules in Cloudberry Database.
This document introduces the product architecture and the implementation mechanism of the internal modules in Apache Cloudberry.

In most cases, Cloudberry Database is similar to PostgreSQL in terms of SQL support, features, configuration options, and user functionalities. Users can interact with Cloudberry Database in a similar way to how they interact with a standalone PostgreSQL system.
In most cases, Apache Cloudberry is similar to PostgreSQL in terms of SQL support, features, configuration options, and user functionalities. Users can interact with Apache Cloudberry in a similar way to how they interact with a standalone PostgreSQL system.

Cloudberry Database uses MPP (Massively Parallel Processing) architecture to store and process large volumes of data, by distributing data and computing workloads across multiple servers or hosts.
Apache Cloudberry uses MPP (Massively Parallel Processing) architecture to store and process large volumes of data, by distributing data and computing workloads across multiple servers or hosts.

MPP, known as the shared-nothing architecture, refers to systems with multiple hosts that work together to perform a task. Each host has its own processor, memory, disk, network resources, and operating system. Cloudberry Database uses this high-performance architecture to distribute data loads and can use all system resources in parallel to process queries.
MPP, known as the shared-nothing architecture, refers to systems with multiple hosts that work together to perform a task. Each host has its own processor, memory, disk, network resources, and operating system. Apache Cloudberry uses this high-performance architecture to distribute data loads and can use all system resources in parallel to process queries.

From users' view, Cloudberry Database is a complete relational database management system (RDBMS). In a physical view, it contains multiple PostgreSQL instances. To make these independent PostgreSQL instances work together, Cloudberry Database performs distributed cluster processing at different levels for data storage, computing, communication, and management. Cloudberry Database hides the complex details of the distributed system, giving users a single logical database view. This greatly eases the work of developers and operational staff.
From users' view, Apache Cloudberry is a complete relational database management system (RDBMS). In a physical view, it contains multiple PostgreSQL instances. To make these independent PostgreSQL instances work together, Apache Cloudberry performs distributed cluster processing at different levels for data storage, computing, communication, and management. Apache Cloudberry hides the complex details of the distributed system, giving users a single logical database view. This greatly eases the work of developers and operational staff.

The architecture diagram of Cloudberry Database is as follows:
The architecture diagram of Apache Cloudberry is as follows:

![Cloudberry Database Architecture](./media/cbdb-arch.png)
![Apache Cloudberry Architecture](./media/cbdb-arch.png)

- **Coordinator node** (or control node) is the gateway to the Cloudberry Database system, which accepts client connections and SQL queries, and allocates tasks to data node instances. Users interact with Cloudberry Database by connecting to the coordinator node using a client program (such as psql) or an application programming interface (API) (such as JDBC, ODBC, or libpq PostgreSQL C API).
- The coordinator node acts as the global system directory, containing a set of system tables that record the metadata of Cloudberry Database.
- **Coordinator node** (or control node) is the gateway to the Apache Cloudberry system, which accepts client connections and SQL queries, and allocates tasks to data node instances. Users interact with Apache Cloudberry by connecting to the coordinator node using a client program (such as psql) or an application programming interface (API) (such as JDBC, ODBC, or libpq PostgreSQL C API).
- The coordinator node acts as the global system directory, containing a set of system tables that record the metadata of Apache Cloudberry.
- The coordinator node does not store user data. User data is stored only in data node instances.
- The coordinator node performs authentication for client connections, processes SQL commands, distributes workload among segments, coordinates the results returned by each segment, and returns the final results to the client program.
- Cloudberry Database uses Write Ahead Logging (WAL) for coordinator/standby mirroring. In WAL-based logging, all modifications are first written to a log before being written to the disk, which ensures the data integrity of in-process operations.
- Apache Cloudberry uses Write Ahead Logging (WAL) for coordinator/standby mirroring. In WAL-based logging, all modifications are first written to a log before being written to the disk, which ensures the data integrity of in-process operations.

- **Segment** (or data node) instances are individual Postgres processes, each storing a portion of the data and executing the corresponding part of the query. When a user connects to the database through the coordinator node and submits a query request, a process is created on each segment node to handle the query. User-defined tables and their indexes are distributed across the available segments, and each segment node contains distinct portions of the data. The processes of data processing runs in the corresponding segment. Users interact with segments through the coordinator, and the segment operate on servers known as the segment host.

Typically, a segment host runs 2 to 8 data nodes, depending on the processor, memory, storage, network interface, and workload. The configuration of the segment host needs to be balanced, because evenly distributing the data and workload among segments is the key to achieving optimal performance with Cloudberry Database, which allows all segments to start processing a task and finish the work at the same time.
Typically, a segment host runs 2 to 8 data nodes, depending on the processor, memory, storage, network interface, and workload. The configuration of the segment host needs to be balanced, because evenly distributing the data and workload among segments is the key to achieving optimal performance with Apache Cloudberry, which allows all segments to start processing a task and finish the work at the same time.

- **Interconnect** is the network layer in the Cloudberry Database system architecture. Interconnect refers to the network infrastructure upon which the communication between the coordinator node and the segments relies, which uses a standard Ethernet switching structure.
- **Interconnect** is the network layer in the Apache Cloudberry system architecture. Interconnect refers to the network infrastructure upon which the communication between the coordinator node and the segments relies, which uses a standard Ethernet switching structure.

For performance reasons, a 10 GB or faster network is recommended. By default, the Interconnect module uses the UDP protocol with flow control (UDPIFC) for communication to send messages through the network. The data packet verification performed by Cloudberry Database exceeds the scope provided by UDP, which means that its reliability is equivalent to using the TCP protocol, and its performance and scalability surpass the TCP protocol. If the Interconnect is changed to the TCP protocol instead, the scalability of Cloudberry Database is limited to 1000 segments. This limit does not apply when UDPIFC is used as the default protocol.
For performance reasons, a 10 GB or faster network is recommended. By default, the Interconnect module uses the UDP protocol with flow control (UDPIFC) for communication to send messages through the network. The data packet verification performed by Apache Cloudberry exceeds the scope provided by UDP, which means that its reliability is equivalent to using the TCP protocol, and its performance and scalability surpass the TCP protocol. If the Interconnect is changed to the TCP protocol instead, the scalability of Apache Cloudberry is limited to 1000 segments. This limit does not apply when UDPIFC is used as the default protocol.

- Cloudberry Database uses Multiversion Concurrency Control (MVCC) to ensure data consistency. When querying the database, each transaction only sees a snapshot of the data, ensuring that current transactions do not see modifications made by other transactions on the same records. In this way, MVCC provides transaction isolation in the database.
- Apache Cloudberry uses Multiversion Concurrency Control (MVCC) to ensure data consistency. When querying the database, each transaction only sees a snapshot of the data, ensuring that current transactions do not see modifications made by other transactions on the same records. In this way, MVCC provides transaction isolation in the database.

MVCC minimizes lock contention to ensure performance in a multi-user environment. This is done by avoiding explicit locking for database transactions.

Expand Down
Loading
Loading