From 1be784c666155460e4bcaa1c00801847dc209272 Mon Sep 17 00:00:00 2001
From: Nebulus <609117264@qq.com>
Date: Fri, 26 Jul 2019 22:15:53 +0800
Subject: [PATCH 1/7] =?UTF-8?q?=E6=95=B0=E6=8D=AE=E5=88=86=E7=89=87?=
 =?UTF-8?q?=E6=98=AF=E5=A6=82=E4=BD=95=E5=9C=A8=E5=88=86=E5=B8=83=E5=BC=8F?=
 =?UTF-8?q?=20SQL=20=E6=95=B0=E6=8D=AE=E5=BA=93=E4=B8=AD=E8=B5=B7=E4=BD=9C?=
 =?UTF-8?q?=E7=94=A8=E7=9A=84?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

翻译完成,辛苦校对的同学了
---
 ...ing-works-in-a-distributed-sql-database.md | 88 +++++++++----------
 1 file changed, 44 insertions(+), 44 deletions(-)

diff --git a/TODO1/how-data-sharding-works-in-a-distributed-sql-database.md b/TODO1/how-data-sharding-works-in-a-distributed-sql-database.md
index 2e836f274ec..54d6be63c6c 100644
--- a/TODO1/how-data-sharding-works-in-a-distributed-sql-database.md
+++ b/TODO1/how-data-sharding-works-in-a-distributed-sql-database.md
@@ -2,104 +2,104 @@
 > * 原文作者:[Sid Choudhury](https://blog.yugabyte.com/author/sidchoudhury/)
 > * 译文出自:[掘金翻译计划](https://github.com/xitu/gold-miner)
 > * 本文永久链接:[https://github.com/xitu/gold-miner/blob/master/TODO1/how-data-sharding-works-in-a-distributed-sql-database.md](https://github.com/xitu/gold-miner/blob/master/TODO1/how-data-sharding-works-in-a-distributed-sql-database.md)
-> * 译者:
+> * 译者:[Ultrasteve](https://github.com/Ultrasteve)
 > * 校对者:

-# How Data Sharding Works in a Distributed SQL Database
+# 数据分片是如何在分布式 SQL 数据库中起作用的

-Enterprises of all sizes are embracing rapid modernization of user-facing applications as part of their broader digital transformation strategy. The relational database (RDBMS) infrastructure that such applications rely on suddenly needs to support much larger data sizes and transaction volumes. However, a monolithic RDBMS tends to quickly get overloaded in such scenarios. One of the most common architectures to get more performance and scalability in an RDBMS is to “shard” the data. In this blog, we will learn what sharding is and how it can be used to scale a database. We will also review the pros and cons of common sharding architectures, plus explore how sharding is implemented in distributed SQL-based RDBMS like [YugaByte DB.](https://github.com/YugaByte/yugabyte-db)
+如今,各种规模的企业都在快速推进面向用户的应用的现代化,以此作为其更广阔的数字化转型战略的一部分。这些应用所依赖的关系型数据库(RDBMS)基础设施,也因此突然需要支持大得多的数据量和事务量。然而,在这种场景中,单体 RDBMS 往往很快就会过载。让 RDBMS 获得更好性能和更高扩展性的最常见架构之一,就是对数据进行“分片”。在这篇文章中,我们会了解什么是分片,以及如何用分片来扩展数据库;我们还会探讨几种常见分片架构的优劣,并探索像 [YugaByte DB](https://github.com/YugaByte/yugabyte-db) 这样的分布式 SQL 数据库是如何实现数据分片的。

-## What is Data Sharding?
+## 数据分片到底是什么?

-Sharding is the process of breaking up large tables into smaller chunks called **shards** that are spread across multiple servers. A **shard** is essentially a horizontal data partition that contains a subset of the total data set, and hence is responsible for serving a portion of the overall workload. The idea is to distribute data that can’t fit on a single node onto a **cluster** of database nodes. Sharding is also referred to as **horizontal partitioning**. The distinction between horizontal and vertical comes from the traditional tabular view of a database. A database can be split vertically — storing different table columns in a separate database, or horizontally — storing rows of the same table in multiple database nodes.
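To make the horizontal/vertical distinction concrete, here is a minimal, self-contained sketch; the table contents, the column split, and the even/odd row split are all invented for illustration:

```python
# Toy illustration of vertical vs. horizontal partitioning.
# The `users` rows and both splits below are hypothetical examples.
users = [
    {"id": 1, "name": "Ada",  "country": "UK"},
    {"id": 2, "name": "Bob",  "country": "US"},
    {"id": 3, "name": "Carl", "country": "US"},
    {"id": 4, "name": "Dana", "country": "UK"},
]

# Vertical partitioning: different columns live in different stores.
names_store     = [{"id": u["id"], "name": u["name"]} for u in users]
countries_store = [{"id": u["id"], "country": u["country"]} for u in users]

# Horizontal partitioning (sharding): whole rows are spread across nodes.
node_a = [u for u in users if u["id"] % 2 == 0]  # rows with even ids
node_b = [u for u in users if u["id"] % 2 == 1]  # rows with odd ids

print(node_a)
print(node_b)
```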
+分片是一种把大表切分成多个更小的数据块(称为**数据分片**)的过程,这些分片会分布在多个服务器中。**数据分片**本质上是一种水平的数据切分,每个分片包含整个数据集的一个子集,并相应地负责总体工作量的一部分。这种方法的中心思想,便是把单个节点放不下的数据,分散到一个数据库节点**集群**中。分片也称为**水平切分**,水平切分和垂直切分的区别来自于传统的表式数据库视角:一个数据库可以被垂直切分(把表中不同的列存放在不同的数据库中),也可以被水平切分(把同一张表的不同行分散到多个数据库节点中)。

![](https://3lr6t13cowm230cj0q42yphj-wpengine.netdna-ssl.com/wp-content/uploads/2019/06/data-sharding-distributed-sql-1.png)

-**Figure 1 : Vertical and Horizontal Data Partitioning (Source: Medium)**
+**图一 :垂直切分与水平切分(来源:Medium)**

-## Why Shard a Database?
+## 为什么要对数据库进行分片?

-Business applications that rely on monolithic RDBMS hit bottlenecks as they grow. With limited CPU, storage capacity and memory, database performance is bound to suffer. Query performance and routine maintenance of an unsharded database becomes extremely slow. When it comes to adding resources to support database operations, vertical scaling (aka scaling up) has its own set of limits and eventually reaches a point of diminishing returns.
+随着业务规模的扩大,依赖单体 RDBMS 的商业应用会碰到性能瓶颈。受限于 CPU、存储容量和内存,数据库的性能必然会受到影响:在一个未分片的数据库中,查询性能和日常维护都会变得极度缓慢。而在为数据库操作追加资源这件事上,垂直扩展(又称向上扩展)有它自身的一系列限制,最终会达到收益递减的地步。

-On the other hand, horizontally partitioning a table means more compute capacity to serve incoming queries, and therefore you end up with faster query response times and index builds. By continuously balancing the load and data set over additional nodes, sharding also allows easy expansion to accommodate more capacity. Moreover, a network of smaller, cheaper servers may be more cost effective in the long term than maintaining one big server.
+从另一方面来看,对表进行水平切分意味着有更多的计算能力来处理查询请求,因此你会得到更快的查询响应和索引构建速度。通过在新增节点上持续地平衡数据和负载,分片还能轻松地通过扩容来容纳更多数据。不仅如此,从长远来看,维护一组更小更廉价的服务器可能比维护一台大型服务器要实惠得多。

-Besides resolving scaling challenges, sharding can potentially alleviate the impact of unplanned outages. During downtime, all the data in an unsharded database is inaccessible, which can be disruptive or downright disastrous. When done right, sharding can provide high availability: even if one or two nodes hosting a few shards are down, the rest of the database is still available for read/write operations as long as the other nodes (hosting the remaining shards) run in different failure domains. Overall, sharding can increase total cluster storage capacity, speed up processing, and offer higher availability at a lower cost than vertical scaling.
+除了解决扩展性的问题,分片还能减轻意外宕机带来的影响。宕机期间,未分片数据库中的所有数据都不可访问,这可能造成中断,甚至是彻底的灾难。而分片做得好时可以提供高可用性:即使托管少数分片的一两个节点宕掉,只要其余节点(托管剩下的分片)运行在不同的故障域中,数据库的其余部分仍然能够提供读写服务。总的来说,分片可以提升集群的总存储容量,加快处理速度,并以低于垂直扩展的成本提供更高的可用性。

-## The Perils of Manual Sharding
+## 手动分片的隐患

-Sharding, including the day-1 creation and day-2 rebalancing, when completely automated can be a boon to high-volume data apps. Unfortunately, monolithic databases like Oracle, PostgreSQL, MySQL and even newer distributed SQL databases like Amazon Aurora do not support sharding automatically. This means manual sharding at the application layer if you want to continue to use these databases. The net result is a massive increase in development complexity. Your application has to have additional sharding logic to know exactly how your data is distributed, and how to fetch it. You also have to decide what sharding approach to adopt, how many shards to create, and how many nodes to use. You also have to account for shard key changes, and even sharding approach changes, if your business needs change.
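To see what that application-layer burden looks like, here is a minimal sketch of hand-rolled shard routing; the shard count, DSN strings, and table are assumptions made up for this example, not part of any real deployment:

```python
# Hypothetical application-layer routing for a manually sharded database.
NUM_SHARDS = 4
SHARD_DSNS = [f"postgres://db-shard-{i}.internal/app" for i in range(NUM_SHARDS)]

def shard_for(user_id: int) -> str:
    # Every data-access path in the application must repeat logic like this.
    return SHARD_DSNS[user_id % NUM_SHARDS]

def fetch_user(user_id: int) -> str:
    dsn = shard_for(user_id)
    # connect(dsn); SELECT * FROM users WHERE id = ...
    return dsn

# If NUM_SHARDS ever changes, user_id % NUM_SHARDS changes for most keys,
# so almost every row would have to be migrated: one concrete reason why
# manual sharding is hard to evolve with the business.
print(fetch_user(42))
```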
+对于大数据量的应用来说,如果分片(包括初期的分片创建和后期的再平衡)能够完全自动化,将会带来巨大收益。不幸的是,像 Oracle、PostgreSQL、MySQL 这些单体数据库,甚至一些更新的分布式 SQL 数据库,如 Amazon Aurora,并不支持自动分片。这意味着如果你想继续使用这些数据库,就必须在应用层进行手动分片,而这会大大增加开发的复杂度。你的应用需要一套额外的分片逻辑,以确切了解数据是如何分布的,以及如何获取这些数据。你还需要决定采用什么分片方法,需要创建多少分片,以及使用多少个节点。一旦业务需求发生变化,还要考虑分片主键乃至分片方法的调整。

-One of the most significant challenges with manual sharding is uneven shard allocation. Disproportionate distribution of data could cause shards to become unbalanced, with some overloaded while others remain relatively empty. It’s best to avoid accruing too much data on a shard, because a hotspot can lead to slowdowns and server crashes. This problem could also arise from a small shard set, which forces data to be spread across too few shards. This is acceptable in development and testing environments, but not in production. Uneven data distribution, hotspots, and storing data on too few shards can all cause shard and server resource exhaustion.
+手动分片的其中一个重大挑战便是分片分配不均。不成比例地分配数据会导致分片不平衡,一些分片过载时,其他分片却近乎空置。最好避免在单个分片上积累过多数据,因为热点会导致响应变慢甚至服务器崩溃。分片集过小也会引发这个问题,因为这会迫使数据分散到极少数量的分片中。这在开发和测试环境中是可以接受的,但在生产环境中是不允许的。数据分配不均、热点以及把数据存放在过少的分片中,都会导致分片和服务器资源的枯竭。

-Finally, manual sharding can complicate operational processes. Backups will now have to be performed for multiple servers. Data migration and schema changes must be carefully coordinated to ensure all shards have the same schema copy. Without sufficient optimization, database joins across multiple servers could be highly inefficient and difficult to perform.
+最后,手动分片会使运维过程复杂化:备份现在需要在多个服务器上分别进行;为了保证所有分片持有相同的模式副本,数据迁移和模式变更需要更小心地协调;在缺乏足够优化的情况下,跨多个服务器的数据库 join 操作会变得低效且难以执行。

-## Common Auto-Sharding Architectures
+## 常用的自动分片架构

-Sharding has been around for a long time, and over the years different sharding architectures and implementations have been used to build large scale systems. In this section, we will go over the three most common ones.
+分片由来已久,多年来出现了许多用于构建大规模系统的分片架构和实现。在这一节中,我们会讨论三种最常见的实现方式。

-### Hash-based Sharding
+### 基于哈希的分片

-Hash-based sharding takes a shard key’s value and generates a hash value from it. The hash value is then used to determine in which shard the data should reside. With a uniform hashing algorithm such as ketama, the hash function can evenly distribute data across servers, reducing the risk of hotspots. With this approach, data with close shard keys are unlikely to be placed on the same shard. This architecture is thus great for targeted data operations.
+基于哈希的分片根据分片主键的值计算出一个哈希值,再用这个哈希值来决定这条数据应该存储在哪个分片中。通过使用诸如 ketama 这样的均匀哈希算法,哈希函数能够把数据平均地分摊到各个服务器上,以此降低出现热点的风险。在这种方法里,分片主键相近的数据不太可能被分配到同一个分片中,因此这种架构十分适用于有针对性的数据操作。

![](https://3lr6t13cowm230cj0q42yphj-wpengine.netdna-ssl.com/wp-content/uploads/2019/06/data-sharding-distributed-sql-2.png)

-**Figure 2: Hash-based sharding (Source: MongoDB Docs)**
+**图二 : 基于哈希的分片(来源:MongoDB 文档)**

-### Range-based Sharding
+### 基于范围的分片

-Range-based sharding divides data based on ranges of the data value (aka the keyspace). Shard keys with nearby values are more likely to fall into the same range and onto the same shards. Each shard essentially preserves the same schema from the original database. Sharding becomes as easy as identifying the data’s appropriate range and placing it on the corresponding shard.
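A small sketch of the range lookup this describes; the alphabetical split boundaries below are invented for illustration:

```python
import bisect

# Hypothetical range-based shard map, keyed on an alphabetical shard key.
# Each entry is the (exclusive) upper bound of a shard's key range;
# "~" acts as a sentinel that sorts above plain lowercase letters.
UPPER_BOUNDS = ["g", "n", "t", "~"]
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(key: str) -> str:
    # Identify the first range whose upper bound lies beyond the key.
    return SHARDS[bisect.bisect_right(UPPER_BOUNDS, key)]

print(shard_for("apple"))   # shard-0: keys below "g"
print(shard_for("mango"))   # shard-1: keys in ["g", "n")
print(shard_for("zebra"))   # shard-3
```

Note how nearby keys such as "mango" and "melon" land on the same shard: that locality is what makes range scans cheap, and also what makes monotonically growing keys hotspot-prone.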
+基于范围的分片,参照数据值的范围(即键空间)来分割数据。分片主键值相近的数据更容易落到同一个范围中,因此也更容易落到同一个分片中。每个分片本质上都保留与原数据库相同的模式。这样一来,数据分片就变得十分简单:辨别出数据所属的范围,并把它放到相应的分片中即可。

![](https://3lr6t13cowm230cj0q42yphj-wpengine.netdna-ssl.com/wp-content/uploads/2019/06/Sharding-Image-copy.jpg)

-**Figure 3 : Range-based sharding example**
+**图三 :基于范围的分片示例**

-Range-based sharding allows for efficient queries that read target data within a contiguous range or range queries. However, range-based sharding needs the user to a priori choose the shard keys, and poorly chosen shard keys could result in database hotspots.
+基于范围的分片能让读取连续范围内目标数据的查询(即范围查询)变得高效。然而,这种分片方式需要用户事先选定分片主键,如果分片主键选得不好,可能会产生数据库热点。

-A good rule-of-thumb is to pick shard keys that have large cardinality, low recurring frequency, and that do not increase, or decrease, monotonically. Without proper shard key selections, data could be unevenly distributed across shards, and specific data could be queried more compared to the others, creating potential system bottlenecks in the shards that get a heavier workload.
+一个好的经验法则是,选择基数大、重复率低,并且不会单调递增或递减的键作为分片主键。如果没有正确地选择分片主键,数据会不均等地分配在各个分片中,特定的数据会比其他数据被查询得更频繁,使那些工作量较大的分片形成潜在的系统瓶颈。

-The ideal solution to uneven shard sizes is to perform automatic shard splitting and merging. If the shard becomes too big or hosts a frequently accessed row, then breaking the shard into multiple shards and then rebalancing them across all the available nodes leads to better performance. Similarly, the opposite process can be undertaken when there are too many small shards.
+解决分片大小不均的理想方法是自动地对分片进行分裂与合并。如果某个分片变得过大,或者其中的某一行被频繁访问,那么把这个大分片分裂成多个分片,再将它们重新均衡地分配到所有可用节点上,就能获得更好的性能。同样地,当小分片过多的时候,我们可以执行相反的过程。

-### Geo-based Sharding
+### 基于地理位置的分片

-In geo-based (aka location-aware) sharding, data is partitioned according to a user-specified column that maps range shards to specific regions and the nodes in those regions. For example, a cluster that runs across 3 regions in the US, UK and the EU can rely on the Country_Code column of the User table to map the user’s row to the nearest region that is in conformance with GDPR rules.
+在基于地理位置(又称位置感知)的分片中,数据会按照用户指定的列进行切分,这一列把各个范围分片映射到特定的区域以及区域内的节点上。例如,一个横跨美国、英国和欧盟三个区域的集群,可以依据用户表中 Country_Code 这一列,把用户数据行映射到符合 GDPR(通用数据保护条例)规则的最近区域。

-## Sharding in YugaByte DB
+## YugaByte DB 中的分片

-YugaByte DB is an auto-sharded, ultra-resilient, high-performance, geo-distributed SQL database built with inspiration from Google Spanner. It currently supports hash-based sharding by default. Range-based sharding is an active work-in-progress project while geo-based sharding is on the roadmap for later this year. Each data shard is called a tablet, and it resides on a corresponding tablet server.
+YugaByte DB 是一个具备自动分片功能、高度弹性、高性能、可跨地理区域分布的分布式 SQL 数据库,其设计灵感来自 Google Spanner。它目前默认支持基于哈希的分片方式。基于范围的分片正在积极开发中,而基于地理位置的分片已列入今年晚些时候的路线图。在 YugaByte DB 中,每一个数据分片被称作子表(tablet),它们存放在相应的子表服务器(tablet server)上。

-### Hash-based Sharding
+### 基于哈希的分片

-For hash-based sharding, tables are allocated a hash space between 0x0000 and 0xFFFF (the 2-byte range), accommodating as many as 64K tablets in very large data sets or cluster sizes. Consider a table with 16 tablets as shown in Figure 4. We take the overall hash space \[0x0000 to 0xFFFF), and divide it into 16 segments — one for each tablet.
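A sketch of that 16-way split of the 0x0000 to 0xFFFF space; md5 below is only an illustrative stand-in, not YugaByte DB's actual internal hash function:

```python
import hashlib

NUM_TABLETS = 16
HASH_SPACE = 0x10000                 # 2-byte hash space: 0x0000 .. 0xFFFF
SEGMENT = HASH_SPACE // NUM_TABLETS  # 0x1000 hash values per tablet

def tablet_for(primary_key: str) -> int:
    # Reduce the key to a 16-bit hash value; md5 is an assumption made
    # for this sketch, chosen only because it mixes bits well.
    h = int.from_bytes(hashlib.md5(primary_key.encode()).digest()[:2], "big")
    return h // SEGMENT              # tablet index in [0, 15]

for key in ("user:1", "user:2", "user:3"):
    print(key, "-> tablet", tablet_for(key))
```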
+对于基于哈希的分片,每张表被分配了一个 0x0000 到 0xFFFF 的哈希空间(即 2 字节的范围),在非常大的数据集或集群中最多可以容纳 64K(即 65536)个子表。我们来看看图四中有 16 个子表的表。我们取整个哈希空间 \[0x0000 到 0xFFFF),并将它分成 16 个部分,每个部分对应一个子表。

![](https://3lr6t13cowm230cj0q42yphj-wpengine.netdna-ssl.com/wp-content/uploads/2019/06/data-sharding-distributed-sql-4.png)

-**Figure 4: Hash-based sharding in YugaByte DB**
+**图四 :在 YugaByte DB 中基于哈希的分片**

-In read/write operations, the primary keys are first converted into internal keys and their corresponding hash values. The operation is served by collecting data from the appropriate tablets. (Figure 5)
+在读写操作中,主键首先会被转换成内部键及其对应的哈希值。然后通过从相应的子表中收集数据来完成这次操作。(图五)

![](https://3lr6t13cowm230cj0q42yphj-wpengine.netdna-ssl.com/wp-content/uploads/2019/06/data-sharding-distributed-sql-5.png)

-**Figure 5: Figuring out which tablet to use in Yugabyte DB**
+**图五 :在 Yugabyte DB 中决定使用哪个子表**

-As an example, suppose you want to insert a key k, with a value v into a table as shown in Figure 6. The hash value of k is computed, and then the corresponding tablet is looked up, followed by the relevant tablet server. The request is then sent directly to that tablet server for processing.
+例如,如图六所示,你现在想在表中插入一个键为 k、值为 v 的数据。首先会根据键 k 计算出一个哈希值,之后数据库会查找对应的子表以及相应的子表服务器。最后,这个请求会被直接发送到那台子表服务器进行处理。

![](https://3lr6t13cowm230cj0q42yphj-wpengine.netdna-ssl.com/wp-content/uploads/2019/06/data-sharding-distributed-sql-6.png)

-**Figure 6 : Storing value of k in YugaByte DB**
+**图六 :在 YugaByte DB 中存储 k 值**

-### Range-based Sharding
+### 基于范围的分片

-SQL tables can be created with ASC and DESC directives for the first column of a primary key as well as first of the indexed columns. This will lead to the data getting stored in the chosen order on a single shard (aka tablet). Work is in progress to dynamically [split the tablets](https://github.com/YugaByte/yugabyte-db/issues/1004) (based on various criteria such as range boundary and load) as well as enhance the [SQL syntax](https://github.com/YugaByte/yugabyte-db/issues/1486) to specify the exact ranges.
+创建 SQL 表时,可以对主键的第一列以及索引的第一列指定 ASC(升序)和 DESC(降序)指令。这让数据能够按照选定的顺序存储在单个分片(即子表)中。目前,项目组正在开发[动态分割子表](https://github.com/YugaByte/yugabyte-db/issues/1004)(基于范围边界和负载等多种标准),以及用于明确指定范围的[增强 SQL 语法](https://github.com/YugaByte/yugabyte-db/issues/1486)这些功能。

-## Summary
+## 总结

-Data sharding is a solution for business applications with large data sets and scale needs. There are a variety of sharding architectures to choose from, each of which provides different capabilities. Before settling on a sharding architecture, the needs and workload requirements of your app must be mapped out. Manual sharding should be avoided in most circumstances given significant increase in application logic complexity. [YugaByte DB](https://github.com/YugaByte/yugabyte-db) is an auto-sharded distributed SQL database with support for hash-based sharding today and support for range-based/geo-based sharding coming soon. You can see YugaByte DB’s automatic sharding in action in this [tutorial.](https://docs.yugabyte.com/latest/explore/auto-sharding/)
+数据分片是针对拥有大型数据集和扩展需求的商业应用的一种解决方案。目前有许多数据分片架构供我们选择,每一种都提供了不同的能力。在决定采用哪一种架构之前,需要清晰地梳理出你的应用的需求和预期负载量。由于手动分片会显著增加应用逻辑的复杂度,绝大部分情况下都应该避免手动分片。[YugaByte DB](https://github.com/YugaByte/yugabyte-db) 是一种具备自动分片功能的分布式 SQL 数据库,它目前支持基于哈希的分片,而基于范围和基于地理位置的分片功能也即将推出。你可以通过这个[教程](https://docs.yugabyte.com/latest/explore/auto-sharding/)了解 YugaByte DB 自动分片的实际效果。

-## What’s Next?
+## 下一步?
-* [Compare](https://docs.yugabyte.com/latest/comparisons/) YugaByte DB in depth to databases like [CockroachDB](https://www.yugabyte.com/yugabyte-db-vs-cockroachdb/), Google Cloud Spanner and MongoDB.
-* [Get started](https://docs.yugabyte.com/latest/quick-start/) with YugaByte DB on macOS, Linux, Docker, and Kubernetes.
-* [Contact us](https://www.yugabyte.com/about/contact/) to learn more about licensing, pricing or to schedule a technical overview.
+* [深入比较](https://docs.yugabyte.com/latest/comparisons/) YugaByte DB 与 [CockroachDB](https://www.yugabyte.com/yugabyte-db-vs-cockroachdb/)、Google Cloud Spanner、MongoDB 等数据库的不同之处。
+* [开始](https://docs.yugabyte.com/latest/quick-start/)在 macOS、Linux、Docker 和 Kubernetes 上使用 YugaByte DB。
+* [联系我们](https://www.yugabyte.com/about/contact/)了解许可证与定价等问题,或预约一次技术概览。

> 如果发现译文存在错误或其他需要改进的地方,欢迎到 [掘金翻译计划](https://github.com/xitu/gold-miner) 对译文进行修改并 PR,也可获得相应奖励积分。文章开头的 **本文永久链接** 即为本文在 GitHub 上的 MarkDown 链接。

From 9d913ae2cb617cbd8442816fed154e80959fa365 Mon Sep 17 00:00:00 2001
From: Nebulus <609117264@qq.com>
Date: Sat, 27 Jul 2019 16:47:56 +0800
Subject: [PATCH 2/7] back

---
 ...ing-works-in-a-distributed-sql-database.md | 88 +++++++++----------
 1 file changed, 44 insertions(+), 44 deletions(-)

diff --git a/TODO1/how-data-sharding-works-in-a-distributed-sql-database.md b/TODO1/how-data-sharding-works-in-a-distributed-sql-database.md
index 54d6be63c6c..2e836f274ec 100644
--- a/TODO1/how-data-sharding-works-in-a-distributed-sql-database.md
+++ b/TODO1/how-data-sharding-works-in-a-distributed-sql-database.md
@@ -2,104 +2,104 @@
 > * 原文作者:[Sid Choudhury](https://blog.yugabyte.com/author/sidchoudhury/)
 > * 译文出自:[掘金翻译计划](https://github.com/xitu/gold-miner)
 > * 本文永久链接:[https://github.com/xitu/gold-miner/blob/master/TODO1/how-data-sharding-works-in-a-distributed-sql-database.md](https://github.com/xitu/gold-miner/blob/master/TODO1/how-data-sharding-works-in-a-distributed-sql-database.md)
-> * 译者:[Ultrasteve](https://github.com/Ultrasteve)
+> * 译者:
 > * 校对者:

-# 数据分片是如何在分布式 SQL 数据库中起作用的
+# How Data Sharding Works in a Distributed SQL Database

-如今,各种规模的企业都在快速推进面向用户的应用的现代化,以此作为其更广阔的数字化转型战略的一部分。这些应用所依赖的关系型数据库(RDBMS)基础设施,也因此突然需要支持大得多的数据量和事务量。然而,在这种场景中,单体 RDBMS 往往很快就会过载。让 RDBMS 获得更好性能和更高扩展性的最常见架构之一,就是对数据进行“分片”。在这篇文章中,我们会了解什么是分片,以及如何用分片来扩展数据库;我们还会探讨几种常见分片架构的优劣,并探索像 [YugaByte DB](https://github.com/YugaByte/yugabyte-db) 这样的分布式 SQL 数据库是如何实现数据分片的。
+Enterprises of all sizes are embracing rapid modernization of user-facing applications as part of their broader digital transformation strategy. The relational database (RDBMS) infrastructure that such applications rely on suddenly needs to support much larger data sizes and transaction volumes. However, a monolithic RDBMS tends to quickly get overloaded in such scenarios. One of the most common architectures to get more performance and scalability in an RDBMS is to “shard” the data. In this blog, we will learn what sharding is and how it can be used to scale a database. We will also review the pros and cons of common sharding architectures, plus explore how sharding is implemented in distributed SQL-based RDBMS like [YugaByte DB.](https://github.com/YugaByte/yugabyte-db)

-## 数据分片到底是什么?
+## What is Data Sharding?
-分片是一种把大表切分成多个更小的数据块(称为**数据分片**)的过程,这些分片会分布在多个服务器中。**数据分片**本质上是一种水平的数据切分,每个分片包含整个数据集的一个子集,并相应地负责总体工作量的一部分。这种方法的中心思想,便是把单个节点放不下的数据,分散到一个数据库节点**集群**中。分片也称为**水平切分**,水平切分和垂直切分的区别来自于传统的表式数据库视角:一个数据库可以被垂直切分(把表中不同的列存放在不同的数据库中),也可以被水平切分(把同一张表的不同行分散到多个数据库节点中)。
+Sharding is the process of breaking up large tables into smaller chunks called **shards** that are spread across multiple servers. A **shard** is essentially a horizontal data partition that contains a subset of the total data set, and hence is responsible for serving a portion of the overall workload. The idea is to distribute data that can’t fit on a single node onto a **cluster** of database nodes. Sharding is also referred to as **horizontal partitioning**. The distinction between horizontal and vertical comes from the traditional tabular view of a database. A database can be split vertically — storing different table columns in a separate database, or horizontally — storing rows of the same table in multiple database nodes.

![](https://3lr6t13cowm230cj0q42yphj-wpengine.netdna-ssl.com/wp-content/uploads/2019/06/data-sharding-distributed-sql-1.png)

-**图一 :垂直切分与水平切分(来源:Medium)**
+**Figure 1 : Vertical and Horizontal Data Partitioning (Source: Medium)**

-## 为什么要对数据库进行分片?
+## Why Shard a Database?

-随着业务规模的扩大,依赖单体 RDBMS 的商业应用会碰到性能瓶颈。受限于 CPU、存储容量和内存,数据库的性能必然会受到影响:在一个未分片的数据库中,查询性能和日常维护都会变得极度缓慢。而在为数据库操作追加资源这件事上,垂直扩展(又称向上扩展)有它自身的一系列限制,最终会达到收益递减的地步。
+Business applications that rely on monolithic RDBMS hit bottlenecks as they grow. With limited CPU, storage capacity and memory, database performance is bound to suffer. Query performance and routine maintenance of an unsharded database becomes extremely slow. When it comes to adding resources to support database operations, vertical scaling (aka scaling up) has its own set of limits and eventually reaches a point of diminishing returns.

-从另一方面来看,对表进行水平切分意味着有更多的计算能力来处理查询请求,因此你会得到更快的查询响应和索引构建速度。通过在新增节点上持续地平衡数据和负载,分片还能轻松地通过扩容来容纳更多数据。不仅如此,从长远来看,维护一组更小更廉价的服务器可能比维护一台大型服务器要实惠得多。
+On the other hand, horizontally partitioning a table means more compute capacity to serve incoming queries, and therefore you end up with faster query response times and index builds. By continuously balancing the load and data set over additional nodes, sharding also allows easy expansion to accommodate more capacity. Moreover, a network of smaller, cheaper servers may be more cost effective in the long term than maintaining one big server.

-除了解决扩展性的问题,分片还能减轻意外宕机带来的影响。宕机期间,未分片数据库中的所有数据都不可访问,这可能造成中断,甚至是彻底的灾难。而分片做得好时可以提供高可用性:即使托管少数分片的一两个节点宕掉,只要其余节点(托管剩下的分片)运行在不同的故障域中,数据库的其余部分仍然能够提供读写服务。总的来说,分片可以提升集群的总存储容量,加快处理速度,并以低于垂直扩展的成本提供更高的可用性。
+Besides resolving scaling challenges, sharding can potentially alleviate the impact of unplanned outages. During downtime, all the data in an unsharded database is inaccessible, which can be disruptive or downright disastrous. When done right, sharding can provide high availability: even if one or two nodes hosting a few shards are down, the rest of the database is still available for read/write operations as long as the other nodes (hosting the remaining shards) run in different failure domains. Overall, sharding can increase total cluster storage capacity, speed up processing, and offer higher availability at a lower cost than vertical scaling.
-## 手动分片的隐患
+## The Perils of Manual Sharding

-对于大数据量的应用来说,如果分片(包括初期的分片创建和后期的再平衡)能够完全自动化,将会带来巨大收益。不幸的是,像 Oracle、PostgreSQL、MySQL 这些单体数据库,甚至一些更新的分布式 SQL 数据库,如 Amazon Aurora,并不支持自动分片。这意味着如果你想继续使用这些数据库,就必须在应用层进行手动分片,而这会大大增加开发的复杂度。你的应用需要一套额外的分片逻辑,以确切了解数据是如何分布的,以及如何获取这些数据。你还需要决定采用什么分片方法,需要创建多少分片,以及使用多少个节点。一旦业务需求发生变化,还要考虑分片主键乃至分片方法的调整。
+Sharding, including the day-1 creation and day-2 rebalancing, when completely automated can be a boon to high-volume data apps. Unfortunately, monolithic databases like Oracle, PostgreSQL, MySQL and even newer distributed SQL databases like Amazon Aurora do not support sharding automatically. This means manual sharding at the application layer if you want to continue to use these databases. The net result is a massive increase in development complexity. Your application has to have additional sharding logic to know exactly how your data is distributed, and how to fetch it. You also have to decide what sharding approach to adopt, how many shards to create, and how many nodes to use. You also have to account for shard key changes, and even sharding approach changes, if your business needs change.

-手动分片的其中一个重大挑战便是分片分配不均。不成比例地分配数据会导致分片不平衡,一些分片过载时,其他分片却近乎空置。最好避免在单个分片上积累过多数据,因为热点会导致响应变慢甚至服务器崩溃。分片集过小也会引发这个问题,因为这会迫使数据分散到极少数量的分片中。这在开发和测试环境中是可以接受的,但在生产环境中是不允许的。数据分配不均、热点以及把数据存放在过少的分片中,都会导致分片和服务器资源的枯竭。
+One of the most significant challenges with manual sharding is uneven shard allocation. Disproportionate distribution of data could cause shards to become unbalanced, with some overloaded while others remain relatively empty. It’s best to avoid accruing too much data on a shard, because a hotspot can lead to slowdowns and server crashes. This problem could also arise from a small shard set, which forces data to be spread across too few shards. This is acceptable in development and testing environments, but not in production. Uneven data distribution, hotspots, and storing data on too few shards can all cause shard and server resource exhaustion.

-最后,手动分片会使运维过程复杂化:备份现在需要在多个服务器上分别进行;为了保证所有分片持有相同的模式副本,数据迁移和模式变更需要更小心地协调;在缺乏足够优化的情况下,跨多个服务器的数据库 join 操作会变得低效且难以执行。
+Finally, manual sharding can complicate operational processes. Backups will now have to be performed for multiple servers. Data migration and schema changes must be carefully coordinated to ensure all shards have the same schema copy. Without sufficient optimization, database joins across multiple servers could be highly inefficient and difficult to perform.

-## 常用的自动分片架构
+## Common Auto-Sharding Architectures

-分片由来已久,多年来出现了许多用于构建大规模系统的分片架构和实现。在这一节中,我们会讨论三种最常见的实现方式。
+Sharding has been around for a long time, and over the years different sharding architectures and implementations have been used to build large scale systems. In this section, we will go over the three most common ones.

-### 基于哈希的分片
+### Hash-based Sharding

-基于哈希的分片根据分片主键的值计算出一个哈希值,再用这个哈希值来决定这条数据应该存储在哪个分片中。通过使用诸如 ketama 这样的均匀哈希算法,哈希函数能够把数据平均地分摊到各个服务器上,以此降低出现热点的风险。在这种方法里,分片主键相近的数据不太可能被分配到同一个分片中,因此这种架构十分适用于有针对性的数据操作。
+Hash-based sharding takes a shard key’s value and generates a hash value from it. The hash value is then used to determine in which shard the data should reside. With a uniform hashing algorithm such as ketama, the hash function can evenly distribute data across servers, reducing the risk of hotspots. With this approach, data with close shard keys are unlikely to be placed on the same shard. This architecture is thus great for targeted data operations.
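As a rough sketch of this scheme before the figure below (the shard count and the choice of SHA-1 are illustrative assumptions, not ketama itself), note how evenly even sequential keys spread out:

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8

def shard_for(key: str) -> int:
    # Any well-mixed hash spreads keys near-uniformly; that uniformity
    # is what reduces the risk of hotspots.
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# Close shard keys scatter across shards instead of clustering together.
placement = Counter(shard_for(f"user:{i}") for i in range(10_000))
print(sorted(placement.items()))  # each shard holds roughly 10_000 / 8 keys
```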
![](https://3lr6t13cowm230cj0q42yphj-wpengine.netdna-ssl.com/wp-content/uploads/2019/06/data-sharding-distributed-sql-2.png)

-**图二 : 基于哈希的分片(来源:MongoDB 文档)**
+**Figure 2: Hash-based sharding (Source: MongoDB Docs)**

-### 基于范围的分片
+### Range-based Sharding

-基于范围的分片,参照数据值的范围(即键空间)来分割数据。分片主键值相近的数据更容易落到同一个范围中,因此也更容易落到同一个分片中。每个分片本质上都保留与原数据库相同的模式。这样一来,数据分片就变得十分简单:辨别出数据所属的范围,并把它放到相应的分片中即可。
+Range-based sharding divides data based on ranges of the data value (aka the keyspace). Shard keys with nearby values are more likely to fall into the same range and onto the same shards. Each shard essentially preserves the same schema from the original database. Sharding becomes as easy as identifying the data’s appropriate range and placing it on the corresponding shard.

![](https://3lr6t13cowm230cj0q42yphj-wpengine.netdna-ssl.com/wp-content/uploads/2019/06/Sharding-Image-copy.jpg)

-**图三 :基于范围的分片示例**
+**Figure 3 : Range-based sharding example**

-基于范围的分片能让读取连续范围内目标数据的查询(即范围查询)变得高效。然而,这种分片方式需要用户事先选定分片主键,如果分片主键选得不好,可能会产生数据库热点。
+Range-based sharding allows for efficient queries that read target data within a contiguous range or range queries. However, range-based sharding needs the user to a priori choose the shard keys, and poorly chosen shard keys could result in database hotspots.

-一个好的经验法则是,选择基数大、重复率低,并且不会单调递增或递减的键作为分片主键。如果没有正确地选择分片主键,数据会不均等地分配在各个分片中,特定的数据会比其他数据被查询得更频繁,使那些工作量较大的分片形成潜在的系统瓶颈。
+A good rule-of-thumb is to pick shard keys that have large cardinality, low recurring frequency, and that do not increase, or decrease, monotonically. Without proper shard key selections, data could be unevenly distributed across shards, and specific data could be queried more compared to the others, creating potential system bottlenecks in the shards that get a heavier workload.

-解决分片大小不均的理想方法是自动地对分片进行分裂与合并。如果某个分片变得过大,或者其中的某一行被频繁访问,那么把这个大分片分裂成多个分片,再将它们重新均衡地分配到所有可用节点上,就能获得更好的性能。同样地,当小分片过多的时候,我们可以执行相反的过程。
+The ideal solution to uneven shard sizes is to perform automatic shard splitting and merging. If the shard becomes too big or hosts a frequently accessed row, then breaking the shard into multiple shards and then rebalancing them across all the available nodes leads to better performance. Similarly, the opposite process can be undertaken when there are too many small shards.

-### 基于地理位置的分片
+### Geo-based Sharding

-在基于地理位置(又称位置感知)的分片中,数据会按照用户指定的列进行切分,这一列把各个范围分片映射到特定的区域以及区域内的节点上。例如,一个横跨美国、英国和欧盟三个区域的集群,可以依据用户表中 Country_Code 这一列,把用户数据行映射到符合 GDPR(通用数据保护条例)规则的最近区域。
+In geo-based (aka location-aware) sharding, data is partitioned according to a user-specified column that maps range shards to specific regions and the nodes in those regions. For example, a cluster that runs across 3 regions in the US, UK and the EU can rely on the Country_Code column of the User table to map the user’s row to the nearest region that is in conformance with GDPR rules.

-## YugaByte DB 中的分片
+## Sharding in YugaByte DB

-YugaByte DB 是一个具备自动分片功能、高度弹性、高性能、可跨地理区域分布的分布式 SQL 数据库,其设计灵感来自 Google Spanner。它目前默认支持基于哈希的分片方式。基于范围的分片正在积极开发中,而基于地理位置的分片已列入今年晚些时候的路线图。在 YugaByte DB 中,每一个数据分片被称作子表(tablet),它们存放在相应的子表服务器(tablet server)上。
+YugaByte DB is an auto-sharded, ultra-resilient, high-performance, geo-distributed SQL database built with inspiration from Google Spanner. It currently supports hash-based sharding by default. Range-based sharding is an active work-in-progress project while geo-based sharding is on the roadmap for later this year. Each data shard is called a tablet, and it resides on a corresponding tablet server.
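Before turning to the hash-based details, here is a tiny sketch of the geo-based placement described above; the region names and the Country_Code mapping are made up for illustration:

```python
# Hypothetical location-aware placement keyed on a Country_Code column.
REGION_FOR_COUNTRY = {"US": "us-east", "GB": "eu-west", "DE": "eu-central"}
DEFAULT_REGION = "us-east"

def region_for(row: dict) -> str:
    # Rows are pinned to the region implied by their Country_Code,
    # e.g. so that EU user data stays on EU nodes for GDPR purposes.
    return REGION_FOR_COUNTRY.get(row["Country_Code"], DEFAULT_REGION)

print(region_for({"user_id": 7, "Country_Code": "DE"}))  # eu-central
```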
-### 基于哈希的分片
+### Hash-based Sharding

-对于基于哈希的分片,每张表被分配了一个 0x0000 到 0xFFFF 的哈希空间(即 2 字节的范围),在非常大的数据集或集群中最多可以容纳 64K(即 65536)个子表。我们来看看图四中有 16 个子表的表。我们取整个哈希空间 \[0x0000 到 0xFFFF),并将它分成 16 个部分,每个部分对应一个子表。
+For hash-based sharding, tables are allocated a hash space between 0x0000 and 0xFFFF (the 2-byte range), accommodating as many as 64K tablets in very large data sets or cluster sizes. Consider a table with 16 tablets as shown in Figure 4. We take the overall hash space \[0x0000 to 0xFFFF), and divide it into 16 segments — one for each tablet.

![](https://3lr6t13cowm230cj0q42yphj-wpengine.netdna-ssl.com/wp-content/uploads/2019/06/data-sharding-distributed-sql-4.png)

-**图四 :在 YugaByte DB 中基于哈希的分片**
+**Figure 4: Hash-based sharding in YugaByte DB**

-在读写操作中,主键首先会被转换成内部键及其对应的哈希值。然后通过从相应的子表中收集数据来完成这次操作。(图五)
+In read/write operations, the primary keys are first converted into internal keys and their corresponding hash values. The operation is served by collecting data from the appropriate tablets. (Figure 5)

![](https://3lr6t13cowm230cj0q42yphj-wpengine.netdna-ssl.com/wp-content/uploads/2019/06/data-sharding-distributed-sql-5.png)

-**图五 :在 Yugabyte DB 中决定使用哪个子表**
+**Figure 5: Figuring out which tablet to use in Yugabyte DB**

-例如,如图六所示,你现在想在表中插入一个键为 k、值为 v 的数据。首先会根据键 k 计算出一个哈希值,之后数据库会查找对应的子表以及相应的子表服务器。最后,这个请求会被直接发送到那台子表服务器进行处理。
+As an example, suppose you want to insert a key k, with a value v into a table as shown in Figure 6. The hash value of k is computed, and then the corresponding tablet is looked up, followed by the relevant tablet server. The request is then sent directly to that tablet server for processing.

![](https://3lr6t13cowm230cj0q42yphj-wpengine.netdna-ssl.com/wp-content/uploads/2019/06/data-sharding-distributed-sql-6.png)

-**图六 :在 YugaByte DB 中存储 k 值**
+**Figure 6 : Storing value of k in YugaByte DB**

-### 基于范围的分片
+### Range-based Sharding

-创建 SQL 表时,可以对主键的第一列以及索引的第一列指定 ASC(升序)和 DESC(降序)指令。这让数据能够按照选定的顺序存储在单个分片(即子表)中。目前,项目组正在开发[动态分割子表](https://github.com/YugaByte/yugabyte-db/issues/1004)(基于范围边界和负载等多种标准),以及用于明确指定范围的[增强 SQL 语法](https://github.com/YugaByte/yugabyte-db/issues/1486)这些功能。
+SQL tables can be created with ASC and DESC directives for the first column of a primary key as well as first of the indexed columns. This will lead to the data getting stored in the chosen order on a single shard (aka tablet). Work is in progress to dynamically [split the tablets](https://github.com/YugaByte/yugabyte-db/issues/1004) (based on various criteria such as range boundary and load) as well as enhance the [SQL syntax](https://github.com/YugaByte/yugabyte-db/issues/1486) to specify the exact ranges.

-## 总结
+## Summary

-数据分片是针对拥有大型数据集和扩展需求的商业应用的一种解决方案。目前有许多数据分片架构供我们选择,每一种都提供了不同的能力。在决定采用哪一种架构之前,需要清晰地梳理出你的应用的需求和预期负载量。由于手动分片会显著增加应用逻辑的复杂度,绝大部分情况下都应该避免手动分片。[YugaByte DB](https://github.com/YugaByte/yugabyte-db) 是一种具备自动分片功能的分布式 SQL 数据库,它目前支持基于哈希的分片,而基于范围和基于地理位置的分片功能也即将推出。你可以通过这个[教程](https://docs.yugabyte.com/latest/explore/auto-sharding/)了解 YugaByte DB 自动分片的实际效果。
+Data sharding is a solution for business applications with large data sets and scale needs. There are a variety of sharding architectures to choose from, each of which provides different capabilities. Before settling on a sharding architecture, the needs and workload requirements of your app must be mapped out. Manual sharding should be avoided in most circumstances given significant increase in application logic complexity.
[YugaByte DB](https://github.com/YugaByte/yugabyte-db) is an auto-sharded distributed SQL database with support for hash-based sharding today and support for range-based/geo-based sharding coming soon. You can see YugaByte DB’s automatic sharding in action in this [tutorial.](https://docs.yugabyte.com/latest/explore/auto-sharding/)

-## 下一步?
+## What’s Next?

-* [深入比较](https://docs.yugabyte.com/latest/comparisons/) YugaByte DB 与 [CockroachDB](https://www.yugabyte.com/yugabyte-db-vs-cockroachdb/)、Google Cloud Spanner、MongoDB 等数据库的不同之处。
-* [开始](https://docs.yugabyte.com/latest/quick-start/)在 macOS、Linux、Docker 和 Kubernetes 上使用 YugaByte DB。
-* [联系我们](https://www.yugabyte.com/about/contact/)了解许可证与定价等问题,或预约一次技术概览。
+* [Compare](https://docs.yugabyte.com/latest/comparisons/) YugaByte DB in depth to databases like [CockroachDB](https://www.yugabyte.com/yugabyte-db-vs-cockroachdb/), Google Cloud Spanner and MongoDB.
+* [Get started](https://docs.yugabyte.com/latest/quick-start/) with YugaByte DB on macOS, Linux, Docker, and Kubernetes.
+* [Contact us](https://www.yugabyte.com/about/contact/) to learn more about licensing, pricing or to schedule a technical overview.

> 如果发现译文存在错误或其他需要改进的地方,欢迎到 [掘金翻译计划](https://github.com/xitu/gold-miner) 对译文进行修改并 PR,也可获得相应奖励积分。文章开头的 **本文永久链接** 即为本文在 GitHub 上的 MarkDown 链接。

From cf5f19fb14862048c264db9919ee0a459ea4b1dc Mon Sep 17 00:00:00 2001
From: Nebulus <609117264@qq.com>
Date: Sat, 27 Jul 2019 16:49:34 +0800
Subject: [PATCH 3/7] =?UTF-8?q?=E7=94=B1=E6=B5=85=E5=85=A5=E6=B7=B1?=
 =?UTF-8?q?=E7=90=86=E8=A7=A3=E4=B8=BB=E6=88=90=E5=88=86=E5=88=86=E6=9E=90?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

由浅入深理解主成分分析
---
 ...anation-of-principal-component-analysis.md | 116 +++++++++---------
 1 file changed, 58 insertions(+), 58 deletions(-)

diff --git a/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md b/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md
index 24570e92cba..70ab53f738d 100644
--- a/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md
+++ b/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md
@@ -2,141 +2,141 @@
 > * 原文作者:[Zakaria Jaadi](https://medium.com/@zakaria.jaadi)
 > * 译文出自:[掘金翻译计划](https://github.com/xitu/gold-miner)
 > * 本文永久链接:[https://github.com/xitu/gold-miner/blob/master/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md](https://github.com/xitu/gold-miner/blob/master/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md)
-> * 译者:
+> * 译者:[Ultrasteve](https://github.com/Ultrasteve)
 > * 校对者:

-# A step by step explanation of Principal Component Analysis
+# 由浅入深理解主成分分析

![](https://cdn-images-1.medium.com/max/2360/0*MCObvpuCqWS5-z2m)

-The purpose of this post is to provide a complete and simplified explanation of Principal Component Analysis, and especially to answer how it works step by step, so that everyone can understand it and make use of it, without necessarily having a strong mathematical background.
+这篇文章的目的是对 PCA 做一个完整且简单易懂的介绍,重点会一步一步的讲解它是怎么工作的。看完这篇文章后,相信即使没有很强的数学背景的人,都能理解并使用它。

-PCA is actually a widely covered method on the web, and there are some great articles about it, but only few of them go straight to the point and explain how it works without diving too much into the technicalities and the ‘why’ of things. That’s the reason why I decided to make my own post to present it in a simplified way.
+网上已经有很多介绍 PCA 的文章,其中一些质量也很高,但很少文章会直截了当的去介绍它是怎么工作的,通常它们会过度的拘泥于 PCA 背后的技术及原理。因此,我打算以我自己的方式,简单易懂的来向各位介绍 PCA 。 -Before getting to the explanation, this post provides logical explanations of what PCA is doing in each step and simplifies the mathematical concepts behind it, as standardization, covariance, eigenvectors and eigenvalues without focusing on how to compute them. +在解释 PCA 之前,这篇文章会先富有逻辑性的介绍 PCA 在每一步是做什么的,同时我们会简化其背后的数学概念。我们会讲到标准化,协方差,特征向量和特征值,但我们不会介绍如何去计算它们。 -## So what is Principal Component Analysis ? +## 什么是 PCA ? -Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set. +PCA(主成分分析)是一种降维方法,常用于对那些维度很高的数据集作降维。它会将一个大数据集中的变量转化为维度更小的变量,同时保留这些变量的大部分信息。 -Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity. Because smaller data sets are easier to explore and visualize and make analyzing data much easier and faster for machine learning algorithms without extraneous variables to process. +减少数据的维度天然会牺牲一些精度,但奇妙的是,在降维算法中,精度的损失并不大。这是因为维度更小的数据能更容易被探索和可视化,在数据的分析和机器学习算法中,我们将不用去处理额外的变量,这让整个过程变得高效。 -So to sum up, the idea of PCA is simple — reduce the number of variables of a data set, while preserving as much information as possible. +总的来说,PCA 的中心思想十分简单——减少数据维度的同时尽可能的保留它的大部分信息。 -## Step by step explanation +## 一步一步的解释 -### Step 1: Standardization +### 步骤一:标准化 -The aim of this step is to standardize the range of the continuous initial variables so that each one of them contributes equally to the analysis. +为了让每一个维度对分析的结果造成同样的影响,我们需要对连续的初始变量的范围作标准化。 -More specifically, the reason why it is critical to perform standardization prior to PCA, is that the latter is quite sensitive regarding the variances of the initial variables. That is, if there are large differences between the ranges of initial variables, those variables with larger ranges will dominate over those with small ranges (For example, a variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1), which will lead to biased results. So, transforming the data to comparable scales can prevent this problem. +更具体的说,在 PCA 之前作数据标准化的原因是,后续的结果对数据的方差十分敏感。也就是说,那些方差较大的维度会比方差更小的维度对结果造成更大的影响(例如,一个在 1 到 100 之间变化的维度对结果的影响,比一个 0 到 1 的更大),这会导致一个偏差较大的结果。所以,将数据转化到比较的范围可以预防这个问题。 -Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each value of each variable. +从数学上来讲,我们可以通过减去数据的平均值并除以它的标准差来进行数据标准化。 ![](https://cdn-images-1.medium.com/max/2000/0*AgmY9auxftS9BI73.png) -Once the standardization is done, all the variables will be transformed to the same scale. +一旦我们完成数据标准化,所有的数据会在同一个范围内。 *** -if you want to get an in-depth understanding about standardization, i invite you to read this simple article i wrote about it. +如果你想更深入的了解数据标准化,我推荐你阅读我写的这篇小短文。 -* [**When and why to standardize your data ? 
A simple guide on when to standardize your data and when not to.**](https://github.com/xitu/gold-miner/blob/master/TODO1/when-to-standardize-your-data.md)
+* [**什么时候进行数据标准化?为什么?**](https://github.com/xitu/gold-miner/blob/master/TODO1/when-to-standardize-your-data.md)

-### Step 2: Covariance Matrix computation
+### 步骤二:计算协方差矩阵

-The aim of this step is to understand how the variables of the input data set are varying from the mean with respect to each other, or in other words, to see if there is any relationship between them. Because sometimes, variables are highly correlated in such a way that they contain redundant information. So, in order to identify these correlations, we compute the covariance matrix.
+这一步的目标是理解,数据集中的变量是如何从平均值变化过来的,不同的特征之间又有什么关系。换句话说,我们想要看看特征之间是否存在某种联系。事实上,特征中常常包含着一些冗余信息,这使得特征之间有时候会高度相关。为了了解这一层关系,我们需要计算协方差矩阵。

-The covariance matrix is a **p** × **p** symmetric matrix (where **p** is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables **x**, **y**, and **z**, the covariance matrix is a 3×3 matrix of this form:
+协方差矩阵是一个 **p** × **p**** **的对称矩阵 (** p **是维度的数量)它涵盖了数据集中所有元组对初始值的协方差。**例如,对于一个拥有三个变量** x**,** y** **,z 和三个维度的数据集,协方差矩阵将是一个 3 × 3 的矩阵:

-![Covariance matrix for 3-dimensional data](https://cdn-images-1.medium.com/max/2000/0*xTLQtW2XQY6P3mZf.png)
+![三个维度数据的协方差矩阵](https://cdn-images-1.medium.com/max/2000/0*xTLQtW2XQY6P3mZf.png)

-Since the covariance of a variable with itself is its variance (Cov(a,a)=Var(a)), in the main diagonal (Top left to bottom right) we actually have the variances of each initial variable. And since the covariance is commutative (Cov(a,b)=Cov(b,a)), the entries of the covariance matrix are symmetric with respect to the main diagonal, which means that the upper and the lower triangular portions are equal.
+由于变量与自身的协方差等于它的方差( Cov(a,a)=Var(a) ),在主对角线(左上到右下)上我们已经计算出各个变量初始值的方差。又因为协方差满足交换律( Cov(a,b)=Cov(b,a) ),协方差矩阵的每一个元组关于主对角线对称,这意味着上三角部分和下三角部分是相等的。

-**What do the covariances that we have as entries of the matrix tell us about the correlations between the variables?**
+**作为矩阵元组的协方差是怎么告诉我们变量之间的关系的?**

-It’s actually the sign of the covariance that matters:
+让我们来看看协方差取值的含义:

-* if positive then: the two variables increase or decrease together (correlated)
-* if negative then: one increases when the other decreases (inversely correlated)
+* 如果值为正:那么两个变量呈正相关(同增同减)
+* 如果值为负数:那么两个变量呈负相关(增减相反)

-Now, that we know that the covariance matrix is not more than a table that summaries the correlations between all the possible pairs of variables, let’s move to the next step.
+现在,我们知道了协方差矩阵只不过是一张总结了所有变量两两之间相关性的表格,让我们进入到下一步吧。

-### Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
+### 步骤三:通过计算协方差矩阵的特征向量和特征值来计算出主成分

-Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the covariance matrix in order to determine the **principal components** of the data. Before getting to the explanation of these concepts, let’s first understand what do we mean by principal components.
+特征值和特征向量是线性代数里面的概念,为了计算出数据的**主成分**,我们需要通过协方差矩阵来计算它们。在解释如何计算这两个值之前,让我们来看看主成分的意义是什么。

-Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables.
These combinations are done in such a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components. So, the idea is 10-dimensional data gives you 10 principal components, but PCA tries to put maximum possible information in the first component, then maximum remaining information in the second and so on, until having something like shown in the scree plot below. +主成分是一个新的变量,它是初始变量的线性组合。这些新的变量之间是不相关的。第一主成分中包含了初始变量的大部分信息,是初始变量的压缩和提取。例如,虽然在一个 10 维的数据集中我们算出了 10 个主成分,但大部分的信息都会被压缩在第一主成分中,剩下的大部分信息又被压缩到第二主成分中,以此类推,我们得到了下面这张图: -![Percentage of variance (information) for by each PC](https://cdn-images-1.medium.com/max/2304/1*JLAVaWW5609YZoJ-NYkSOA.png) +![每一个主成分包含着多少信息](https://cdn-images-1.medium.com/max/2304/1*JLAVaWW5609YZoJ-NYkSOA.png) -Organizing information in principal components this way, will allow you to reduce dimensionality without losing much information, and this by discarding the components with low information and considering the remaining components as your new variables. +这种通过主成分来管理信息的方式,能够使我们降维的同时不会损失很多信息,同时还帮我们排除了那些信息量很少的变量。如此一来,我们就只用考虑那些主成分中压缩过的信息就可以了。 -An important thing to realize here is that, the principal components are less interpretable and don’t have any real meaning since they are constructed as linear combinations of the initial variables. +需要注意的一点是,这些主成分是难以解读的,由于它们是原变量的线性组合,通常它们没有实际的意义。 -Geometrically speaking, principal components represent the directions of the data that explain a **maximal amount of variance**, that is to say, the lines that capture most information of the data. The relationship between variance and information here, is that, the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more the information it has. To put all this simply, just think of principal components as new axes that provide the best angle to see and evaluate the data, so that the differences between the observations are better visible. +从理论方面来说 ,主成分代表着蕴含**最多变量信息的方向**。对于主成分来说,变量的方差越大,空间中点就越分散,空间中的点越分散,那么它包含的信息就越多。简单的讲,主成分就是一条更好的阐述数据信息的新坐标轴,因此我们更容易从中观测到差异。 -### How PCA constructs the Principal Components? +### PCA 算法是怎么算出主成分的? -As there are as many principal components as there are variables in the data, principal components are constructed in such a manner that the first principal component accounts for the **largest possible variance** in the data set. For example, let’s assume that the scatter plot of our data set is as shown below, can we guess the first principal component ? Yes, it’s approximately the line that matches the purple marks because it goes through the origin and it’s the line in which the projection of the points (red dots) is the most spread out. Or mathematically speaking, it’s the line that maximizes the variance (the average of the squared distances from the projected points (red dots) to the origin). +有多少个变量就有多少个主成分。对于第一主成分来说沿着对应的坐标轴变化意味着有**最大的方差**。例如,我们将数据集用下列的散点图表示,现在你能够直接猜测出主成分应该是沿着哪一个方向的吗?这很简单,大概是图中紫色线的方向。因为它穿过了原点,而且数据映射在这条线上后,如红点所示,有着最大的方差(各点与原点距离的均方) ![](https://cdn-images-1.medium.com/max/2000/1*UpFltkN-kT9aGqfLhOR9xg.gif) -The second principal component is calculated in the same way, with the condition that it is uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts for the next highest variance. 
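The steps described so far can be condensed into a short numpy sketch; the data here is synthetic and the code is only an illustration of the procedure, not the article's own implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # synthetic data: 100 samples, 3 variables

# Step 1: standardize each variable.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix (3 x 3 for 3 variables).
C = np.cov(Z, rowvar=False)

# Step 3: eigenvectors/eigenvalues, sorted by descending eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(C)   # eigh, since C is symmetric
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Percentage of variance (information) carried by each principal component.
print(eigenvalues / eigenvalues.sum())
```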
+第二主成分也是这样计算的,它与第一主成分互不相关(即互为垂直),表示了下一个方差最大的方向。

-This continues until a total of p principal components have been calculated, equal to the original number of variables.
+我们重复以上步骤直到我们从原始数据中计算出所有主成分。

-Now that we understood what we mean by principal components, let’s go back to eigenvectors and eigenvalues. What you firstly need to know about them is that they always come in pairs, so that every eigenvector has an eigenvalue. And their number is equal to the number of dimensions of the data. For example, for a 3-dimensional data set, there are 3 variables, therefore there are 3 eigenvectors with 3 corresponding eigenvalues.
+现在我们知道了主成分的含义,让我们回到特征值和特征向量。你需要知道的是,它们通常成对出现,每一个特征向量对应一个特征值。它们各自的数量相等,等于原始数据的维度。例如,在一个三维数据集中,我们有三个变量,因此我们会有三个特征向量与三个特征值。

-Without further ado, it is eigenvectors and eigenvalues who are behind all the magic explained above, because the eigenvectors of the Covariance matrix are actually **the directions of the axes where there is the most variance** (most information) and that we call Principal Components. And eigenvalues are simply the coefficients attached to eigenvectors, which give the **amount of variance carried in each Principal Component**.
+开门见山的说,特征矩阵和特征向量就是主成分分析背后的秘密。协方差矩阵的特征向量其实就是一系列的坐标轴,将数据映射到这些坐标轴后,我们将得到**最大的方差**(这意味这更多的信息),它们就是我们要求的主成分。特征值其实就是特征向量的系数,它代表了每个特征向量**包含了多少信息量**。

-By ranking your eigenvectors in order of their eigenvalues, highest to lowest, you get the principal components in order of significance.
+你可以根据特征值的大小对特征向量作排序,你将知道哪一个是最重要的主成分,哪一个不是。

-**Example:**
+**例如:**

-Let’s suppose that our data set is 2-dimensional with 2 variables **x,y** and that the eigenvectors and eigenvalues of the covariance matrix are as follows:
+现在我们有一个数据集,有两个变量两个维度 **x,y** ,它们的特征值与特征向量如下所示:

![](https://cdn-images-1.medium.com/max/2000/1*3OAdlot1vJcK6qzCePlq9Q.png)

-If we rank the eigenvalues in descending order, we get λ1>λ2, which means that the eigenvector that corresponds to the first principal component (PC1) is **v1** and the one that corresponds to the second component (PC2) is **v2.**
+如果我们从大到小的排序特征值,我们得到 λ1>λ2,这意味着我们需要的第一主成分(PC1)是 **v1** ,第二主成分(PC2)是 **v2**。

-After having the principal components, to compute the percentage of variance (information) accounted for by each component, we divide the eigenvalue of each component by the sum of eigenvalues. If we apply this on the example above, we find that PC1 and PC2 carry respectively 96% and 4% of the variance of the data.
+在得到主成分后,我们将每个特征值除以特征值的和,这样我们就得到了一个百分数。在上面的例子中,我们可以看到 PC1 和 PC2 各自携带了 96% 和 4% 的信息。

-### Step 4: Feature vector
+### 步骤四:主成分向量

-As we saw in the previous step, computing the eigenvectors and ordering them by their eigenvalues in descending order, allow us to find the principal components in order of significance. In this step, what we do is, to choose whether to keep all these components or discard those of lesser significance (of low eigenvalues), and form with the remaining ones a matrix of vectors that we call **Feature vector**.
+正如我们在前面步骤所看到的,通过计算出特征向量并让它们根据特征值的降序排列,我们能知道每个主成分的重要性。在这一步中,我们将会讨论我们是应该保留最重要的几个主成分,还是保留所有主成分。在排除那些不需要的主成分后,剩下的我们称作**主成分向量**。

-So, the feature vector is simply a matrix that has as columns the eigenvectors of the components that we decide to keep. This makes it the first step towards dimensionality reduction, because if we choose to keep only **p** eigenvectors (components) out of **n**, the final data set will have only **p** dimensions.
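Continuing the numpy sketch from the previous step, keeping only the top p eigenvectors and recasting the data looks like this (again an illustrative sketch on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Z = (X - X.mean(axis=0)) / X.std(axis=0)         # standardized data, as before

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]

p = 2                                            # keep the 2 strongest components
feature_vector = eigenvectors[:, :p]             # columns = kept eigenvectors

# Final step: FeatureVector^T x StandardizedData^T, written here in the
# equivalent row-major form; each row is now expressed in the new axes.
projected = Z @ feature_vector                   # shape (100, 2)
print(projected.shape)
```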
+主成分向量仅仅是一个矩阵,里面有那些我们决定保留的特征向量。这是数据降维的第一步,因为如果我们只打算在 **n** 个中保留 **p** 个特征向量(成分),那么当我们把数据映射到这些新的坐标轴上时,最后数据将只有 **p** 个维度。

-**Example**:
+**例如:**

-Continuing with the example from the previous step, we can either form a feature vector with both of the eigenvectors **v**1 and **v**2:
+继续看上一步的例子,我们可以只用 **v**1 和 **v**2来形成主成分向量:

![](https://cdn-images-1.medium.com/max/2000/0*DwiYbyXZXvU20DjB.png)

-Or discard the eigenvector **v**2, which is the one of lesser significance, and form a feature vector with **v**1 only:
+因为 **v**2 没那么重要,我们丢弃掉它,只保留 **v**1:

![](https://cdn-images-1.medium.com/max/2000/0*YKNYKGQaNAYf6Iln.png)

-Discarding the eigenvector **v2** will reduce dimensionality by 1, and will consequently cause a loss of information in the final data set. But given that **v**2 was carrying only 4% of the information, the loss will be therefore not important and we will still have 96% of the information that is carried by **v**1.
+丢弃掉 **v2** 会使结果降低一个维度,当然也会造成数据的损失。但由于 **v**2 只保留了 4% 的信息,这个损失时可以忽略不计的。因为我们保留了 **v**1 ,我们仍然有 96% 的信息。

***

-So, as we saw in the example, it’s up to you to choose whether to keep all the components or discard the ones of lesser significance, depending on what you are looking for. Because if you just want to describe your data in terms of new variables (principal components) that are uncorrelated without seeking to reduce dimensionality, leaving out lesser significant components is not needed.
+如我们在例子中所见,是否丢弃没有那么重要的成分完全取决于你。如果你只想根据主成分来重新表示数据,不想进行数据降维,那么丢弃掉不重要的成分是不必要的。

-### Last step : Recast the data along the principal components axes
+### 最后一步:将数据映射到新的主成分坐标系中

-In the previous steps, apart from standardization, you do not make any changes on the data, you just select the principal components and form the feature vector, but the input data set remains always in terms of the original axes (i.e, in terms of the initial variables).
+在前一步中,除了标准化数据,你并没有对数据作任何改变。你仅仅是选取了主成分,形成了主成分向量,但原始数据仍然在用原来的坐标系表示。

-In this step, which is the last one, the aim is to use the feature vector formed using the eigenvectors of the covariance matrix, to reorient the data from the original axes to the ones represented by the principal components (hence the name Principal Components Analysis). This can be done by multiplying the transpose of the original data set by the transpose of the feature vector.
+在这最后一步中,我们将使用那些从协方差矩阵中算出来的特征向量形成主成分矩阵,并将原始数据映射到主成分矩阵对应的坐标轴上 —— 这就叫做主成分分析。具体的做法便是用原数据矩阵的转置乘以主成分矩阵的转置。

![](https://cdn-images-1.medium.com/max/2000/0*D02r0HjB8WtCq3Cj.png)

***

-If you enjoyed this story, please click the 👏 button as many times as you think it deserves. And share to help others find it! Feel free to leave a comment below.
+如果你喜欢这篇文章,请点击 👏 按钮,并转发让更多人看到!你也可以在下面留言。

-### References:
+### 参考文献:

* [**Steven M. Holland**, **Univ.
of Georgia**]: Principal Components Analysis * [**skymind.ai**]: Eigenvectors, Eigenvalues, PCA, Covariance and Entropy From f4c98f10af010833e65a6d366adefd009cac9fb6 Mon Sep 17 00:00:00 2001 From: Nebulus <609117264@qq.com> Date: Tue, 30 Jul 2019 21:38:00 +0800 Subject: [PATCH 4/7] =?UTF-8?q?=E7=94=B1=E6=B5=85=E5=85=A5=E6=B7=B1?= =?UTF-8?q?=E7=90=86=E8=A7=A3=E4=B8=BB=E6=88=90=E5=88=86=E5=88=86=E6=9E=90?= =?UTF-8?q?=20(=E6=A0=A1=E5=AF=B9=E5=AE=8C=E6=AF=95)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 由浅入深理解主成分分析 (校对完毕) --- ...anation-of-principal-component-analysis.md | 24 +++++++++---------- 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md b/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md index 70ab53f738d..51f9723835f 100644 --- a/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md +++ b/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md @@ -3,7 +3,7 @@ > * 译文出自:[掘金翻译计划](https://github.com/xitu/gold-miner) > * 本文永久链接:[https://github.com/xitu/gold-miner/blob/master/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md](https://github.com/xitu/gold-miner/blob/master/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md) > * 译者:[Ultrasteve](https://github.com/Ultrasteve) -> * 校对者: +> * 校对者:[kasheemlew](https://github.com/kasheemlew), [TrWestdoor](https://github.com/TrWestdoor) # 由浅入深理解主成分分析 @@ -11,25 +11,25 @@ 这篇文章的目的是对 PCA 做一个完整且简单易懂的介绍,重点会一步一步的讲解它是怎么工作的。看完这篇文章后,相信即使没有很强的数学背景的人,都能理解并使用它。 -网上已经有很多介绍 PCA 的文章,其中一些质量也很高,但很少文章会直截了当的去介绍它是怎么工作的,通常它们会过度的拘泥于 PCA 背后的技术及原理。因此,我打算以我自己的方式,简单易懂的来向各位介绍 PCA 。 +网上已经有很多介绍 PCA 的文章,其中一些质量也很高,但很少文章会直截了当的去介绍它是怎么工作的,通常它们会过度的拘泥于 PCA 背后的技术及原理。因此,我打算以我自己的方式,来向各位简单易懂的介绍 PCA 。 -在解释 PCA 之前,这篇文章会先富有逻辑性的介绍 PCA 在每一步是做什么的,同时我们会简化其背后的数学概念。我们会讲到标准化,协方差,特征向量和特征值,但我们不会介绍如何去计算它们。 +在解释 PCA 之前,这篇文章会先富有逻辑性的介绍 PCA 在每一步是做什么的,同时我们会简化其背后的数学概念。我们会讲到标准化,协方差,特征向量和特征值,但我们不会专注于如何计算它们。 ## 什么是 PCA ? 
-PCA(主成分分析)是一种降维方法,常用于对那些维度很高的数据集作降维。它会将一个大数据集中的变量转化为维度更小的变量,同时保留这些变量的大部分信息。 +PCA(主成分分析)是一种降维方法,常用于对高维数据集作降维。它会将一个大的变量集合转化为更少的变量集合,同时保留大的变量集合中的大部分信息。 -减少数据的维度天然会牺牲一些精度,但奇妙的是,在降维算法中,精度的损失并不大。这是因为维度更小的数据能更容易被探索和可视化,在数据的分析和机器学习算法中,我们将不用去处理额外的变量,这让整个过程变得高效。 +减少数据的维度天然会牺牲一些精度,但降维算法的诀窍是牺牲很少的精度进行简化。这是因为维度更小的数据能更容易被探索和可视化,在数据的分析和机器学习算法中,我们将不用去处理额外的变量,这让整个过程变得高效。 -总的来说,PCA 的中心思想十分简单——减少数据维度的同时尽可能的保留它的大部分信息。 +总的来说,PCA 的中心思想十分简单——减少数据集的变量数目,同时尽可能保留它的大部分信息。 -## 一步一步的解释 +## 逐步解释 ### 步骤一:标准化 为了让每一个维度对分析的结果造成同样的影响,我们需要对连续的初始变量的范围作标准化。 -更具体的说,在 PCA 之前作数据标准化的原因是,后续的结果对数据的方差十分敏感。也就是说,那些方差较大的维度会比方差更小的维度对结果造成更大的影响(例如,一个在 1 到 100 之间变化的维度对结果的影响,比一个 0 到 1 的更大),这会导致一个偏差较大的结果。所以,将数据转化到比较的范围可以预防这个问题。 +更具体的说,在 PCA 之前作数据标准化的原因是,后续的结果对数据的方差十分敏感。也就是说,那些取值范围较大的维度会比相对较小的维度造成更大的影响(例如,一个在 1 到 100 之间变化的维度对结果的影响,比一个 0 到 1 的更大),这会导致一个偏差较大的结果。所以,将数据转化到比较的范围可以预防这个问题。 从数学上来讲,我们可以通过减去数据的平均值并除以它的标准差来进行数据标准化。 @@ -45,7 +45,7 @@ PCA(主成分分析)是一种降维方法,常用于对那些维度很高 ### 步骤二:计算协方差矩阵 -这一步的目标是理解,数据集中的变量是如何从平均值变化过来的,不同的特征之间又有什么关系。换句话说,我们想要看看特征之间是否存在某种联系。事实上,特征中常常包含着一些冗余信息,这使得特征之间有时候会高度相关。为了了解这一层关系,我们需要计算协方差矩阵。 +这一步的目标是理解数据集中的变量是如何从平均值变化过来的,不同的特征之间又有什么关系。换句话说,我们想要看看特征之间是否存在某种联系。有时特征之间高度相关,因此会有一些冗余的信息。为了了解这一层关系,我们需要计算协方差矩阵。 协方差矩阵是一个 **p** × **p**** **的对称矩阵 (** p **是维度的数量)它涵盖了数据集中所有元组对初始值的协方差。**例如,对于一个拥有三个变量** x**,** y** **,z 和三个维度的数据集,协方差矩阵将是一个 3 × 3 的矩阵: @@ -53,7 +53,7 @@ PCA(主成分分析)是一种降维方法,常用于对那些维度很高 由于变量与自身的协方差等于它的方差( Cov(a,a)=Var(a) ),在主对角线(左上到右下)上我们已经计算出各个变量初始值的方差。又因为协方差满足交换律( Cov(a,b)=Cov(b,a) ),协方差矩阵的每一个元组关于主对角线对称,这意味着上三角部分和下三角部分是相等的。 -**作为矩阵元组的协方差是怎么告诉我们变量之间的关系的?** +**协方差矩阵中的元素告诉了我们变量间什么样的关系呢?** 让我们来看看协方差取值的含义: @@ -74,7 +74,7 @@ PCA(主成分分析)是一种降维方法,常用于对那些维度很高 需要注意的一点是,这些主成分是难以解读的,由于它们是原变量的线性组合,通常它们没有实际的意义。 -从理论方面来说 ,主成分代表着蕴含**最多变量信息的方向**。对于主成分来说,变量的方差越大,空间中点就越分散,空间中的点越分散,那么它包含的信息就越多。简单的讲,主成分就是一条更好的阐述数据信息的新坐标轴,因此我们更容易从中观测到差异。 +从理论方面来说 ,主成分代表着蕴含**最大方差的方向**。对于主成分来说,变量的方差越大,空间中点就越分散,空间中的点越分散,那么它包含的信息就越多。简单的讲,主成分就是一条更好的阐述数据信息的新坐标轴,因此我们更容易从中观测到差异。 ### PCA 算法是怎么算出主成分的? 
@@ -88,7 +88,7 @@ PCA(主成分分析)是一种降维方法,常用于对那些维度很高 现在我们知道了主成分的含义,让我们回到特征值和特征向量。你需要知道的是,它们通常成对出现,每一个特征向量对应一个特征值。它们各自的数量相等,等于原始数据的维度。例如,在一个三维数据集中,我们有三个变量,因此我们会有三个特征向量与三个特征值。 -开门见山的说,特征矩阵和特征向量就是主成分分析背后的秘密。协方差矩阵的特征向量其实就是一系列的坐标轴,将数据映射到这些坐标轴后,我们将得到**最大的方差**(这意味这更多的信息),它们就是我们要求的主成分。特征值其实就是特征向量的系数,它代表了每个特征向量**包含了多少信息量**。 +简单地说,特征矩阵和特征向量就是主成分分析背后的秘密。协方差矩阵的特征向量其实就是一系列的坐标轴,将数据映射到这些坐标轴后,我们将得到**最大的方差**(这意味这更多的信息),它们就是我们要求的主成分。特征值其实就是特征向量的系数,它代表了每个特征向量**包含了多少信息量**。 你可以根据特征值的大小对特征向量作排序,你将知道哪一个是最重要的主成分,哪一个不是。 From 7ae2ce0b33831e29bb3d310a12f6bfdf2f464cda Mon Sep 17 00:00:00 2001 From: sun <776766759@qq.com> Date: Wed, 31 Jul 2019 13:23:31 +0800 Subject: [PATCH 5/7] Update a-step-by-step-explanation-of-principal-component-analysis.md --- ...ep-explanation-of-principal-component-analysis.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md b/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md index 51f9723835f..6ffb5cdcbcf 100644 --- a/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md +++ b/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md @@ -21,7 +21,7 @@ PCA(主成分分析)是一种降维方法,常用于对高维数据集作 减少数据的维度天然会牺牲一些精度,但降维算法的诀窍是牺牲很少的精度进行简化。这是因为维度更小的数据能更容易被探索和可视化,在数据的分析和机器学习算法中,我们将不用去处理额外的变量,这让整个过程变得高效。 -总的来说,PCA 的中心思想十分简单——减少数据集的变量数目,同时尽可能保留它的大部分信息。 +总的来说,PCA 的中心思想十分简单 —— 减少数据集的变量数目,同时尽可能保留它的大部分信息。 ## 逐步解释 @@ -47,7 +47,7 @@ PCA(主成分分析)是一种降维方法,常用于对高维数据集作 这一步的目标是理解数据集中的变量是如何从平均值变化过来的,不同的特征之间又有什么关系。换句话说,我们想要看看特征之间是否存在某种联系。有时特征之间高度相关,因此会有一些冗余的信息。为了了解这一层关系,我们需要计算协方差矩阵。 -协方差矩阵是一个 **p** × **p**** **的对称矩阵 (** p **是维度的数量)它涵盖了数据集中所有元组对初始值的协方差。**例如,对于一个拥有三个变量** x**,** y** **,z 和三个维度的数据集,协方差矩阵将是一个 3 × 3 的矩阵: +协方差矩阵是一个 **p** × **p** 的对称矩阵(**p** 是维度的数量)它涵盖了数据集中所有元组对初始值的协方差。例如,对于一个拥有三个变量 **x**、** y**、**z** 和三个维度的数据集,协方差矩阵将是一个 3 × 3 的矩阵: ![三个维度数据的协方差矩阵](https://cdn-images-1.medium.com/max/2000/0*xTLQtW2XQY6P3mZf.png) @@ -74,7 +74,7 @@ PCA(主成分分析)是一种降维方法,常用于对高维数据集作 需要注意的一点是,这些主成分是难以解读的,由于它们是原变量的线性组合,通常它们没有实际的意义。 -从理论方面来说 ,主成分代表着蕴含**最大方差的方向**。对于主成分来说,变量的方差越大,空间中点就越分散,空间中的点越分散,那么它包含的信息就越多。简单的讲,主成分就是一条更好的阐述数据信息的新坐标轴,因此我们更容易从中观测到差异。 +从理论方面来说,主成分代表着蕴含**最大方差的方向**。对于主成分来说,变量的方差越大,空间中点就越分散,空间中的点越分散,那么它包含的信息就越多。简单的讲,主成分就是一条更好的阐述数据信息的新坐标轴,因此我们更容易从中观测到差异。 ### PCA 算法是怎么算出主成分的? 
@@ -110,15 +110,15 @@ PCA(主成分分析)是一种降维方法,常用于对高维数据集作 **例如:** -继续看上一步的例子,我们可以只用 **v**1 和 **v**2来形成主成分向量: +继续看上一步的例子,我们可以只用 **v1** 和 **v2** 来形成主成分向量: ![](https://cdn-images-1.medium.com/max/2000/0*DwiYbyXZXvU20DjB.png) -因为 **v**2 没那么重要,我们丢弃掉它,只保留 **v**1: +因为 **v2** 没那么重要,我们丢弃掉它,只保留 **v1**: ![](https://cdn-images-1.medium.com/max/2000/0*YKNYKGQaNAYf6Iln.png) -丢弃掉 **v2** 会使结果降低一个维度,当然也会造成数据的损失。但由于 **v**2 只保留了 4% 的信息,这个损失时可以忽略不计的。因为我们保留了 **v**1 ,我们仍然有 96% 的信息。 +丢弃掉 **v2** 会使结果降低一个维度,当然也会造成数据的损失。但由于 **v2** 只保留了 4% 的信息,这个损失时可以忽略不计的。因为我们保留了 **v1** ,我们仍然有 96% 的信息。 *** From c2d4a6388a2d0ec61aa587c1edd927c350ee0f7f Mon Sep 17 00:00:00 2001 From: sun <776766759@qq.com> Date: Wed, 31 Jul 2019 13:30:44 +0800 Subject: [PATCH 6/7] Update a-step-by-step-explanation-of-principal-component-analysis.md --- ...-step-by-step-explanation-of-principal-component-analysis.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md b/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md index 6ffb5cdcbcf..6f28d4168dc 100644 --- a/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md +++ b/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md @@ -47,7 +47,7 @@ PCA(主成分分析)是一种降维方法,常用于对高维数据集作 这一步的目标是理解数据集中的变量是如何从平均值变化过来的,不同的特征之间又有什么关系。换句话说,我们想要看看特征之间是否存在某种联系。有时特征之间高度相关,因此会有一些冗余的信息。为了了解这一层关系,我们需要计算协方差矩阵。 -协方差矩阵是一个 **p** × **p** 的对称矩阵(**p** 是维度的数量)它涵盖了数据集中所有元组对初始值的协方差。例如,对于一个拥有三个变量 **x**、** y**、**z** 和三个维度的数据集,协方差矩阵将是一个 3 × 3 的矩阵: +协方差矩阵是一个 **p** × **p** 的对称矩阵(**p** 是维度的数量)它涵盖了数据集中所有元组对初始值的协方差。例如,对于一个拥有三个变量 **x**、**y**、**z** 和三个维度的数据集,协方差矩阵将是一个 3 × 3 的矩阵: ![三个维度数据的协方差矩阵](https://cdn-images-1.medium.com/max/2000/0*xTLQtW2XQY6P3mZf.png) From 7066a36c13d539a3ecd02a42dc72286f8da96b49 Mon Sep 17 00:00:00 2001 From: LeviDing Date: Wed, 31 Jul 2019 21:28:59 +0800 Subject: [PATCH 7/7] Update a-step-by-step-explanation-of-principal-component-analysis.md --- ...-explanation-of-principal-component-analysis.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md b/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md index 6f28d4168dc..8289643df84 100644 --- a/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md +++ b/TODO1/a-step-by-step-explanation-of-principal-component-analysis.md @@ -9,15 +9,15 @@ ![](https://cdn-images-1.medium.com/max/2360/0*MCObvpuCqWS5-z2m) -这篇文章的目的是对 PCA 做一个完整且简单易懂的介绍,重点会一步一步的讲解它是怎么工作的。看完这篇文章后,相信即使没有很强的数学背景的人,都能理解并使用它。 +这篇文章的目的是对主成分分析(PCA)做一个完整且简单易懂的介绍,重点会一步一步的讲解它是怎么工作的。看完这篇文章后,相信即使没有很强的数学背景的人,都能理解并使用它。 网上已经有很多介绍 PCA 的文章,其中一些质量也很高,但很少文章会直截了当的去介绍它是怎么工作的,通常它们会过度的拘泥于 PCA 背后的技术及原理。因此,我打算以我自己的方式,来向各位简单易懂的介绍 PCA 。 在解释 PCA 之前,这篇文章会先富有逻辑性的介绍 PCA 在每一步是做什么的,同时我们会简化其背后的数学概念。我们会讲到标准化,协方差,特征向量和特征值,但我们不会专注于如何计算它们。 -## 什么是 PCA ? +## 什么是 PCA? 
-PCA(主成分分析)是一种降维方法,常用于对高维数据集作降维。它会将一个大的变量集合转化为更少的变量集合,同时保留大的变量集合中的大部分信息。 +PCA 是一种降维方法,常用于对高维数据集作降维。它会将一个大的变量集合转化为更少的变量集合,同时保留大的变量集合中的大部分信息。 减少数据的维度天然会牺牲一些精度,但降维算法的诀窍是牺牲很少的精度进行简化。这是因为维度更小的数据能更容易被探索和可视化,在数据的分析和机器学习算法中,我们将不用去处理额外的变量,这让整个过程变得高效。 @@ -41,7 +41,7 @@ PCA(主成分分析)是一种降维方法,常用于对高维数据集作 如果你想更深入的了解数据标准化,我推荐你阅读我写的这篇小短文。 -* [**什么时候进行数据标准化?为什么?**](https://github.com/xitu/gold-miner/blob/master/TODO1/when-to-standardize-your-data.md) +* [**什么时候进行数据标准化?为什么?一篇简单的指南教你是否应该标准化你的数据。**](https://github.com/xitu/gold-miner/blob/master/TODO1/when-to-standardize-your-data.md) ### 步骤二:计算协方差矩阵 @@ -51,7 +51,7 @@ PCA(主成分分析)是一种降维方法,常用于对高维数据集作 ![三个维度数据的协方差矩阵](https://cdn-images-1.medium.com/max/2000/0*xTLQtW2XQY6P3mZf.png) -由于变量与自身的协方差等于它的方差( Cov(a,a)=Var(a) ),在主对角线(左上到右下)上我们已经计算出各个变量初始值的方差。又因为协方差满足交换律( Cov(a,b)=Cov(b,a) ),协方差矩阵的每一个元组关于主对角线对称,这意味着上三角部分和下三角部分是相等的。 +由于变量与自身的协方差等于它的方差(Cov(a,a)=Var(a)),在主对角线(左上到右下)上我们已经计算出各个变量初始值的方差。又因为协方差满足交换律(Cov(a,b)=Cov(b,a)),协方差矩阵的每一个元组关于主对角线对称,这意味着上三角部分和下三角部分是相等的。 **协方差矩阵中的元素告诉了我们变量间什么样的关系呢?** @@ -78,7 +78,7 @@ PCA(主成分分析)是一种降维方法,常用于对高维数据集作 ### PCA 算法是怎么算出主成分的? -有多少个变量就有多少个主成分。对于第一主成分来说沿着对应的坐标轴变化意味着有**最大的方差**。例如,我们将数据集用下列的散点图表示,现在你能够直接猜测出主成分应该是沿着哪一个方向的吗?这很简单,大概是图中紫色线的方向。因为它穿过了原点,而且数据映射在这条线上后,如红点所示,有着最大的方差(各点与原点距离的均方) +有多少个变量就有多少个主成分。对于第一主成分来说沿着对应的坐标轴变化意味着有**最大的方差**。例如,我们将数据集用下列的散点图表示,现在你能够直接猜测出主成分应该是沿着哪一个方向的吗?这很简单,大概是图中紫色线的方向。因为它穿过了原点,而且数据映射在这条线上后,如红点所示,有着最大的方差(各点与原点距离的均方)。 ![](https://cdn-images-1.medium.com/max/2000/1*UpFltkN-kT9aGqfLhOR9xg.gif) @@ -94,7 +94,7 @@ PCA(主成分分析)是一种降维方法,常用于对高维数据集作 **例如:** -现在我们有一个数据集,有两个变量两个维度 **x,y** ,它们的特征值与特征向量如下所示: +现在我们有一个数据集,有两个变量两个维度 **x,y**,它们的特征值与特征向量如下所示: ![](https://cdn-images-1.medium.com/max/2000/1*3OAdlot1vJcK6qzCePlq9Q.png)