Sleepychord/cogdata

Cogdata

Install

pip install cogdata
sudo `which install_unrarlib.sh`

Streaming in Cogdata v1.0

Readme

Directory Structure

.
├── cogdata_task_task1
│   ├── cogdata_config.json (indicating a task path)
│   ├── merged.bin
│   ├── dataset1
│   │   ├── dataset1.bin
│   │   └── meta_info.json
│   └── dataset2
│       ├── dataset2.bin
│       └── meta_info.json
├── dataset1
│   ├── cogdata_info.json (indicating a dataset path)
│   ├── dataset1.json
│   └── dataset1.rar
└── dataset2
    ├── cogdata_info.json
    ├── dataset2.json
    └── dataset2.zip
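
The tree above distinguishes the two folder kinds by a marker file: cogdata_config.json for a task folder and cogdata_info.json for a dataset folder. The following is a minimal sketch of that convention, not cogdata's own code. The helper names `classify_folders` and `build_demo_layout` are hypothetical, and the JSON files are written as empty placeholders because their real schema is not described here.

```python
import tempfile
from pathlib import Path

TASK_MARKER = "cogdata_config.json"    # marks a task folder (per the tree above)
DATASET_MARKER = "cogdata_info.json"   # marks a dataset folder (per the tree above)


def classify_folders(root):
    """Return (task_dirs, dataset_dirs) found directly under root."""
    tasks, datasets = [], []
    for child in sorted(Path(root).iterdir()):
        if not child.is_dir():
            continue
        if (child / TASK_MARKER).is_file():
            tasks.append(child.name)
        elif (child / DATASET_MARKER).is_file():
            datasets.append(child.name)
    return tasks, datasets


def build_demo_layout(root):
    """Create a skeleton of the layout shown above; file contents are placeholders."""
    root = Path(root)
    task = root / "cogdata_task_task1"
    task.mkdir()
    (task / TASK_MARKER).write_text("{}")
    for name in ("dataset1", "dataset2"):
        (root / name).mkdir()
        (root / name / DATASET_MARKER).write_text("{}")
        (task / name).mkdir()
        (task / name / "meta_info.json").write_text("{}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_layout(tmp)
        print(classify_folders(tmp))
```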

Pipeline

This project aims to provide lightweight APIs for large-scale, neural-network-based data processing, e.g. image tokenization. The abstraction has three parts:

  • Dataset: A raw dataset from another organization, in any of various formats, e.g. rar, zip, etc. Its information is recorded in cogdata_info.json in the dataset's own folder.
  • Task: A task is a collection of configs, results for different datasets, logs, merged results, and evenly split results. The config of a task is recorded in cogdata_config.json. The state of each dataset in the task (processed, hanging/running, or unprocessed) is recorded in meta_info.json.
  • DataSaver: The format of saved results. The first option is our BinarySaver, which saves plain bytes with a fixed length. Its output can be read directly or memory-mapped very quickly. The DataSaver config is stored together with the task config in cogdata_config.json.
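
To illustrate why fixed-length plain bytes are fast to read, here is a minimal sketch using only the Python standard library. It is not the saver's actual implementation: the int32 layout, the value of `LENGTH_PER_SAMPLE`, and the helpers `write_samples` and `read_sample` are assumptions made for this example. Because every record has the same size, the i-th sample starts at byte offset i * record size and can be fetched from a memory map without scanning the file.

```python
import mmap
import struct

LENGTH_PER_SAMPLE = 4                        # hypothetical: values per sample
ITEM_BYTES = struct.calcsize("<i")           # one little-endian int32 value
RECORD_BYTES = LENGTH_PER_SAMPLE * ITEM_BYTES
RECORD_FORMAT = f"<{LENGTH_PER_SAMPLE}i"


def write_samples(path, samples):
    """Write each sample as LENGTH_PER_SAMPLE raw int32 values, with no header."""
    with open(path, "wb") as f:
        for sample in samples:
            if len(sample) != LENGTH_PER_SAMPLE:
                raise ValueError("every sample must have the same length")
            f.write(struct.pack(RECORD_FORMAT, *sample))


def read_sample(path, index):
    """Read the index-th sample by jumping to its byte offset in a memory map."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        start = index * RECORD_BYTES
        if index < 0 or start + RECORD_BYTES > len(mm):
            raise IndexError(index)
        return list(struct.unpack_from(RECORD_FORMAT, mm, start))
```

With this layout the number of samples is simply the file size divided by `RECORD_BYTES`, so no separate index is needed.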

Commands

cogdata create_dataset  [-h] [--description DESCRIPTION] --data_files DATA_FILES [DATA_FILES ...] --data_format DATA_FORMAT [--text_files TEXT_FILES [TEXT_FILES ...]] [--text_format TEXT_FORMAT] name

Alias: cogdata data .... data_format is chosen from the class names in cogdata.datasets, e.g. StreamingRarDataset. The text-related options are optional and only needed for text-image datasets.

cogdata create_task [-h] [--description DESCRIPTION] --task_type TASK_TYPE --saver_type SAVER_TYPE [--length_per_sample LENGTH_PER_SAMPLE] [--img_sizes IMG_SIZES [IMG_SIZES ...]] [--txt_len TXT_LEN]
                           [--dtype {int32,int64,float32,uint8,bool}] --model_path MODEL_PATH
                           task_id

Alias: cogdata task .... task_type and saver_type are chosen from the class names in cogdata, e.g. ImageTextTokenizationTask or BinarySaver.

cogdata process [-h] --task_id TASK_ID [--nproc NPROC] [--dataloader_num_workers DATALOADER_NUM_WORKERS]
                       [--batch_size BATCH_SIZE] [--ratio RATIO]
                       [datasets [datasets ...]]

The i-th process is bound to the i-th GPU.
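
One common way to obtain such a one-to-one mapping is to give each worker process an environment in which only its own GPU is visible, via the standard CUDA_VISIBLE_DEVICES variable. The sketch below shows that idea only. Whether cogdata binds processes this way or selects the device inside each process is not stated here, and `env_for_proc` and `plan_procs` are hypothetical helper names.

```python
import os


def env_for_proc(rank, base_env=None):
    """Return a copy of the environment in which only GPU `rank` is visible."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(rank)
    return env


def plan_procs(nproc):
    """Map each process rank to the GPU index it would use (rank i -> GPU i)."""
    return {rank: env_for_proc(rank)["CUDA_VISIBLE_DEVICES"] for rank in range(nproc)}
```

Under this scheme, --nproc should not exceed the number of GPUs on the machine.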

cogdata merge [-h] --task_id TASK_ID

Merge all the processed data.

cogdata list [-h] [--task_id TASK_ID]

List all the current datasets in this folder.

cogdata clean [-h] [--task_id TASK_ID]

Clean the unfinished states of the task.
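
Taken together, the commands above suggest a workflow of create_dataset, create_task, process, then merge. The sketch below only assembles those command lines as strings so they can be reviewed before being run. It uses just the flags shown in the usage strings above. The dataset name, archive name, task id, and model path are hypothetical placeholders, and a real task may need further options such as --img_sizes or --txt_len.

```python
import shlex


def pipeline_commands(task_id, dataset, archive, model_path, nproc=1):
    """Return the four shell commands of a typical run, in execution order."""
    cmds = [
        # Register the raw archive as a dataset.
        ["cogdata", "create_dataset",
         "--data_files", archive,
         "--data_format", "StreamingRarDataset",
         dataset],
        # Create the task that holds configs and results.
        ["cogdata", "create_task",
         "--task_type", "ImageTextTokenizationTask",
         "--saver_type", "BinarySaver",
         "--model_path", model_path,
         task_id],
        # Process the dataset with nproc worker processes.
        ["cogdata", "process", "--task_id", task_id, "--nproc", str(nproc), dataset],
        # Merge all processed data of the task.
        ["cogdata", "merge", "--task_id", task_id],
    ]
    return [shlex.join(cmd) for cmd in cmds]


if __name__ == "__main__":
    for line in pipeline_commands("task1", "dataset1", "dataset1.rar", "/path/to/model.pt", nproc=2):
        print(line)
```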

Customized Tasks

Add --extra_code PATH_TO_CODE after cogdata (e.g., cogdata --extra_code ../examples/convert2tar_task.py [task or process]) to execute and register your own task before running the command. See examples/ for details.

TODO List

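  • Support text processing in multiple formats
  • Write more detailed Sphinx docstring documentation
  • Finer-grained parameter management; generalize tokenization
  • Slides & video introduction
  • Merge video processing [Wenyi]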
  • Merge Object detection [Zhuoyi]

About

A light-weight data management system for large-scale pretraining
