Update docs. #1208

Merged
merged 26 commits into from
Aug 4, 2021
8 changes: 4 additions & 4 deletions README.md
@@ -76,12 +76,12 @@ Welcome to PaddleSeg! PaddleSeg is an end-to-end image segmentation development

* [Installation](./docs/install.md)
* [Get Started](./docs/quick_start.md)
* Data Processing
* [Data Format Description](./docs/data/marker/marker_c.md)
* [Data Annotation and Transform](./docs/data/transform/transform_c.md)
* Prepare Datasets
* [Preparation of Annotation Data](./docs/data/marker/marker.md)
* [Annotating Tutorial](./docs/data/transform/transform.md)
* [Custom Dataset](./docs/data/custom/data_prepare.md)

* Design Idea of PaddleSeg
* Interpretation of PaddleSeg's Modules
* [Detailed Configuration File](./docs/design/use/use.md)
* [Create Your Own Model](./docs/design/create/add_new_model.md)
* [Model Training](/docs/train/train.md)
16 changes: 8 additions & 8 deletions README_CN.md
@@ -81,14 +81,14 @@ PaddleSeg is developed on [PaddlePaddle](https://www.paddlepaddle.org.cn)

## Tutorials <img src="./docs/images/teach.png" width="30"/>

* [Installation](./docs/install.md)
* [Run the Full PaddleSeg Workflow](./docs/quick_start.md)
* Data Processing
* [Data Format Description](./docs/data/marker/marker_c.md)
* [Data Annotation and Transform](./docs/data/transform/transform_c.md)
* [Custom Dataset](./docs/data/custom/data_prepare.md)

* Design Idea of PaddleSeg
* [Installation](./docs/install_cn.md)
* [Run the Full PaddleSeg Workflow](./docs/quick_start_cn.md)
* Prepare Datasets
* [Preparation of Annotation Data](./docs/data/marker/marker_cn.md)
* [Annotating Tutorial](./docs/data/transform/transform_cn.md)
* [Custom Dataset](./docs/data/custom/data_prepare_cn.md)

* Interpretation of PaddleSeg's Modules
* [Detailed Configuration File](./docs/design/use/use_cn.md)
* [Create Your Own Model](./docs/design/create/add_new_model_cn.md)
* [Model Training](/docs/train/train.md)
109 changes: 102 additions & 7 deletions docs/data/custom/data_prepare.md
@@ -1,9 +1,89 @@
# Custom Dataset
English|[简体中文](data_prepare_cn.md)
# Custom Dataset

If you need to train on a custom dataset, prepare the data with the following steps.
## 1. How PaddleSeg Uses Datasets

1. The following structure is recommended
Image paths are written to the file lists `train.txt`, `val.txt`, and `test.txt`, and category names to `labels.txt`, because PaddleSeg locates images by reading these text files.
Each line of `train.txt`, `val.txt`, and `test.txt` has two columns separated by a space. The first column is the path of the image file relative to the dataset root, and the second column is the path of the annotated image file relative to the dataset root. For example:

```
images/xxx1.tif annotations/xxx1.png
images/xxx2.tif annotations/xxx2.png
...
```
`labels.txt`: each line holds one category, and the line number (starting from 0) is the id of that category, as shown below:
```
labelA
labelB
...
```
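The two file formats above can be read with a few lines of Python. The following is only a minimal sketch, not PaddleSeg's own loader: the function names `parse_file_list` and `parse_labels` are hypothetical, and it assumes paths contain no spaces and that a line with a single column is an image without an annotation.

```python
from typing import Dict, List, Optional, Tuple


def parse_file_list(text: str, separator: str = " ") -> List[Tuple[str, Optional[str]]]:
    """Parse file-list text into (image_path, label_path) pairs.

    Hypothetical helper: a line with only one column is treated as an
    image that has no annotation, so its label path is None.
    """
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(separator)
        image = parts[0]
        label = parts[1] if len(parts) > 1 else None
        pairs.append((image, label))
    return pairs


def parse_labels(text: str) -> Dict[str, int]:
    """Map each category name to its id, i.e. its 0-based line number."""
    names = [line.strip() for line in text.splitlines() if line.strip()]
    return {name: idx for idx, name in enumerate(names)}


pairs = parse_file_list(
    "images/xxx1.tif annotations/xxx1.png\n"
    "images/xxx2.tif annotations/xxx2.png\n"
)
label_ids = parse_labels("labelA\nlabelB\n")
print(pairs)
print(label_ids)
```

Both paths in each pair are relative to the dataset root, so a caller would join them with that root before opening the files.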

## 2. Split a Custom Dataset

Training a neural network model usually requires a training set, a validation set, and a test set. If you are using a custom dataset, PaddleSeg can split it by running a script. If your dataset is already divided into these three sets, you can skip this step.

### 2.1 Original Image Requirements
The shape of the original image data should be (h, w, channel), where h and w are the height and width of the image and channel is the number of channels.

### 2.2 Annotation Requirements
The annotated image must be a single-channel image whose pixel values are the category ids, and the ids must increase from 0.
For example, 0, 1, 2, 3 means there are 4 categories; at most 256 categories can be labeled. A specific pixel value can be designated so that pixels with that value do not participate in training and evaluation (the default is 255).
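One way to check an annotation against these rules is to collect its pixel values and compare them with the allowed ids. This is a minimal sketch under stated assumptions: `check_label_values` is a hypothetical helper, not part of PaddleSeg, and it takes the pixel values as a plain iterable of integers (for example, the flattened contents of a single-channel label image).

```python
from typing import Iterable, List


def check_label_values(values: Iterable[int], num_classes: int,
                       ignore_index: int = 255) -> List[int]:
    """Return the pixel values that are not valid category ids.

    Valid values are 0 .. num_classes - 1, plus the ignore value
    (255 by default, as described above).
    """
    allowed = set(range(num_classes)) | {ignore_index}
    return sorted(set(values) - allowed)


# Four categories (0..3) plus ignored pixels: nothing is reported.
ok = check_label_values([0, 1, 2, 3, 255, 1, 0], num_classes=4)
# The value 5 is outside 0..3 and is not the ignore value.
bad = check_label_values([0, 1, 5], num_classes=4)
print(ok, bad)
```

An empty result means every pixel is either a valid category id or the ignore value.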

### 2.3 Custom Dataset Splitting and File List Generation

For data that has not been divided into training, validation, and test sets, PaddleSeg provides a script that splits the data and generates the file lists.

#### Use the script to randomly split the custom dataset by ratio and generate the file lists
The data file structure is as follows:
```
./dataset/ # Dataset root directory
|--images # Original image directory
| |--xxx1.tif
| |--...
| └--...
|
|--annotations # Annotated image directory
| |--xxx1.png
| |--...
| └--...
```

The file names can be chosen as needed.

The command is as follows; specific features are enabled through different flags.
```
python tools/split_dataset_list.py <dataset_root> <images_dir_name> <labels_dir_name> ${FLAGS}
```
Parameters:
- dataset_root: Dataset root directory
- images_dir_name: Name of the original image directory
- labels_dir_name: Name of the annotated image directory

FLAGS:

|FLAG|Meaning|Default|Number of arguments|
|-|-|-|-|
|--split|Dataset split ratio|0.7 0.3 0|3|
|--separator|File list separator|"&#124;"|1|
|--format|Data format of images and labels|"jpg" "png"|2|
|--label_class|Label categories|'\_\_background\_\_' '\_\_foreground\_\_'|several|
|--postfix|Filter images and labels by whether the file name (without extension) contains the specified suffix|"" "" (two empty strings)|2|


After running, `train.txt`, `val.txt`, `test.txt` and `labels.txt` will be generated in the root directory of the dataset.

**Note:** To generate the file lists, either the number of original images must equal the number of annotated images, or the dataset must contain only original images with no annotated images. If the dataset lacks annotated images, the generated file lists contain neither separators nor annotated image paths.
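The rule in this note can be illustrated with a short sketch. It is not the code of `tools/split_dataset_list.py`; `make_list_lines` is a hypothetical helper that assumes the image and label paths are already sorted so that matching files sit at the same index.

```python
from typing import List, Optional, Sequence


def make_list_lines(images: Sequence[str],
                    labels: Optional[Sequence[str]] = None,
                    separator: str = " ") -> List[str]:
    """Build file-list lines from image paths and optional label paths."""
    if not labels:
        # No annotations: one image path per line, no separator.
        return list(images)
    if len(images) != len(labels):
        raise ValueError("number of images and annotated images must match")
    return [f"{img}{separator}{lab}" for img, lab in zip(images, labels)]


with_labels = make_list_lines(["images/xxx1.tif"], ["annotations/xxx1.png"])
without_labels = make_list_lines(["images/xxx1.tif"])
print(with_labels, without_labels)
```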

#### Example
```
python tools/split_dataset_list.py <dataset_root> images annotations --split 0.6 0.2 0.2 --format tif png
```
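For intuition, a proportional random split like the one requested by `--split 0.6 0.2 0.2` can be sketched as follows. This is an illustration only and may differ from what the script actually does: the function name `split_by_ratio`, the fixed seed, and the rounding rule are all assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_by_ratio(items: Sequence[str],
                   ratios: Tuple[float, float, float],
                   seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle items and cut them into train/val/test parts by ratio."""
    if abs(sum(ratios) - 1.0) > 1e-6:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test


# Hypothetical file names, following the layout shown earlier.
names = [f"images/xxx{i}.tif" for i in range(10)]
train, val, test = split_by_ratio(names, (0.6, 0.2, 0.2))
print(len(train), len(val), len(test))
```

Giving the test set the remainder keeps every file in exactly one of the three lists even when the ratios do not divide the dataset evenly.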



## 3. Dataset File Organization

* If you need to use a custom dataset for training, it is recommended to organize it into the following structure:
custom_dataset
|
|--images
@@ -22,15 +102,30 @@
|
|--test.txt

The contents of train.txt and val.txt are as follows:
The contents of train.txt and val.txt are as follows:

images/image1.jpg labels/label1.png
images/image2.jpg labels/label2.png
...

2. Labels in annotated images take consecutive values starting from 0, 1, with no gaps. Pixels to be ignored are labeled 255.
If your dataset is already split, you can generate the file lists by running the following script:
```
# Generate the file lists; the separator is a space, and both images and labels are in png format
python tools/create_dataset_list.py <your/dataset/dir> --separator " " --format png png
```
```
# Generate the file lists. The image and label folders are named img and gt, and the training and validation folders are named training and validation. No test set list is generated.
python tools/create_dataset_list.py <your/dataset/dir> \
--folder img gt --second_folder training validation
```
**Note:** The custom dataset directory must be specified; flags can be set as needed. There is no need to specify `--type`.
After running, `train.txt`, `val.txt`, `test.txt` and `labels.txt` will be generated in the root directory of the dataset. PaddleSeg locates the image path by reading these text files.



* Labels in annotated images must take consecutive values starting from 0, 1, with no gaps. Pixels to be ignored are labeled 255.

The custom dataset can be configured as follows:
The custom dataset can be configured as follows:
```yaml
train_dataset:
type: Dataset
@@ -48,4 +143,4 @@ train_dataset:
- type: Normalize
mode: train
```
Please note where the **dataset path and training file** are stored; place them as shown by dataset_root and train_path in the example.
Make sure the **dataset directory and training file list** are stored as shown by `dataset_root` and `train_path` in the example above.
143 changes: 143 additions & 0 deletions docs/data/custom/data_prepare_cn.md
@@ -0,0 +1,143 @@
简体中文|[English](data_prepare.md)
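# Custom Dataset

## 1. How PaddleSeg Uses Datasets
Image paths are written to the file lists `train.txt`, `val.txt`, and `test.txt`, and category names to `labels.txt`, because PaddleSeg locates images by reading these text files.
Each line of `train.txt`, `val.txt`, and `test.txt` has two columns separated by a space. The first column is the path of the image file relative to the dataset root, and the second column is the path of the annotated image file relative to the dataset root. For example: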
```
images/xxx1.tif annotations/xxx1.png
images/xxx2.tif annotations/xxx2.png
...
```
`labels.txt`: each line holds one category, and the line number (starting from 0) is the id of that category, as shown below:
```
labelA
labelB
...
```

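## 2. Split a Custom Dataset

Training a neural network model usually requires a training set, a validation set, and a test set. If you are using a custom dataset, PaddleSeg can split it by running a script. If your dataset is already divided into these three sets, you can skip this step.

### 2.1 Original Image Requirements
The shape of the original image data should be (h, w, channel), where h and w are the height and width of the image and channel is the number of channels.

### 2.2 Annotation Requirements
The annotated image must be a single-channel image whose pixel values are the category ids, and the ids must increase from 0.
For example, 0, 1, 2, 3 means there are 4 categories; at most 256 categories can be labeled. A specific pixel value can be designated so that pixels with that value do not participate in training and evaluation (the default is 255).


### 2.3 Custom Dataset Splitting and File List Generation

For data that has not been divided into training, validation, and test sets, PaddleSeg provides a script that splits the data and generates the file lists.

#### Use the script to randomly split the custom dataset by ratio and generate the file lists
The data file structure is as follows: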
```
./dataset/ # Dataset root directory
|--images # Original image directory
| |--xxx1.tif
| |--...
| └--...
|
|--annotations # Annotated image directory
| |--xxx1.png
| |--...
| └--...
```
The file names can be chosen as needed.

The command is as follows; specific features are enabled through different flags.
```
python tools/split_dataset_list.py <dataset_root> <images_dir_name> <labels_dir_name> ${FLAGS}
```
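Parameters:
- dataset_root: Dataset root directory
- images_dir_name: Name of the original image directory
- labels_dir_name: Name of the annotated image directory

FLAGS:

|FLAG|Meaning|Default|Number of arguments|
|-|-|-|-|
|--split|Dataset split ratio|0.7 0.3 0|3|
|--separator|File list separator|" "|1|
|--format|Data format of images and labels|"tif" "png"|2|
|--label_class|Label categories|'\_\_background\_\_' '\_\_foreground\_\_'|several|
|--postfix|Filter images and labels by whether the file name (without extension) contains the specified suffix|"" "" (two empty strings)|2|

After running, `train.txt`, `val.txt`, `test.txt` and `labels.txt` will be generated in the root directory of the dataset.

**Note:** To generate the file lists, either the number of original images must equal the number of annotated images, or the dataset must contain only original images with no annotated images. If the dataset lacks annotated images, the generated file lists contain neither separators nor annotated image paths.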

#### Example
```
python tools/split_dataset_list.py <dataset_root> images annotations --split 0.6 0.2 0.2 --format tif png
```



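## 3. Dataset File Organization

* If you need to use a custom dataset for training, it is recommended to organize it into the following structure: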
custom_dataset
|
|--images
| |--image1.jpg
| |--image2.jpg
| |--...
|
|--labels
| |--label1.png
| |--label2.png
| |--...
|
|--train.txt
|
|--val.txt
|
|--test.txt

The contents of train.txt and val.txt are as follows:

images/image1.jpg labels/label1.png
images/image2.jpg labels/label2.png
...

If your dataset is already split, you can generate the file lists by running the following script:
```
# Generate the file lists; the separator is a space, and both images and labels are in png format
python tools/create_dataset_list.py <your/dataset/dir> --separator " " --format png png
```
```
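# Generate the file lists. The image and label folders are named img and gt, and the training and validation folders are named training and validation. No test set list is generated.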
python tools/create_dataset_list.py <your/dataset/dir> \
--folder img gt --second_folder training validation
```
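**Note:** The custom dataset directory must be specified; flags can be set as needed. There is no need to specify `--type`.
After running, `train.txt`, `val.txt`, `test.txt` and `labels.txt` will be generated in the root directory of the dataset. PaddleSeg locates the image paths by reading these text files.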



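* Labels in annotated images must take consecutive values starting from 0, 1, with no gaps. Pixels to be ignored are labeled 255.

The custom dataset can be configured as follows: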
```yaml
train_dataset:
type: Dataset
dataset_root: custom_dataset
train_path: custom_dataset/train.txt
num_classes: 2
transforms:
- type: ResizeStepScaling
min_scale_factor: 0.5
max_scale_factor: 2.0
scale_step_size: 0.25
- type: RandomPaddingCrop
crop_size: [512, 512]
- type: RandomHorizontalFlip
- type: Normalize
mode: train
```
Make sure the **dataset directory and training file list** are stored as shown by `dataset_root` and `train_path` in the example above.