Version 0.7.0
OneFlow v0.7.0 Release Notes
OneFlow v0.7.0 is now available. You are welcome to try it out, and we would love to hear your feedback!
The Chinese version of this article:
https://mp.weixin.qq.com/s/dSR-2Xw92eoFhF0c6MtutQ
Highlights
This release has the following highlights:
- Provides Global Tensor, a Tensor that can be executed in multi-node, multi-GPU scenarios. It is an easy-to-use solution for distributed execution: it makes it easier to implement various distributed parallel strategies and enables more flexible and user-friendly distributed programming. Supported models include ResNet50, Wide and Deep, GPT, Bert, Swin-Transformer, InsightFace, etc.
- Continues to improve nn.Graph. Supports advanced features such as ZeRO, GradAcc, Checkpointing, and Pipelining, and enriches the graph.debug mode. Supports random 2D SBP conversion, semi-automatic derivation of 2D SBP, resuming training from the last checkpoint, etc. Adds OneFlow Feature Stages Identifications and labels each feature of nn.Graph. The basic features of nn.Graph are at the Beta Stage, which can meet most user requirements; advanced features are at the Alpha Stage, meeting standard requirements.
- Deeply optimizes the performance of Eager mode. The performance of the Swin-Transformer model is 3 times higher than that of v0.6.0 when tested on the V100.
- Operator-related improvements: In the single-node single-GPU scenario, OneFlow's compatibility with PyTorch is further improved. The interfaces, semantics, and results of the operators supported by OneFlow are consistent with their PyTorch counterparts, and an automatic testing framework is designed to verify this consistency. For common models, you can accomplish the migration simply by running import oneflow as torch. Compared with v0.6.0, OneFlow adds 16 operators, optimizes the performance of 6 operators, and fixes bugs in 16 operators.
- Supports the einsum operator and the view mechanism.
- Compiler-related improvements: OneFlow is officially connected to the MLIR ecosystem.
- Releases OneFlow-Serving v0.1.0: provides an out-of-the-box Docker image for the Triton OneFlow backend.
- Releases LiBai v0.1.0, a toolbox for massively distributed parallel training of Transformer. Compared with customized code bases such as Megatron-LM, LiBai provides a series of models and training components for distributed training based on a modular design, aiming to make models trained in distributed mode as convenient as in single-GPU mode.
- Releases flowvision v0.1.0: adds DeiT, ConvNeXt, ReXNet, and other models, and updates tutorials and documentation.
OneFlow Feature Stages Identifications
OneFlow Feature Stages identify the maturity level of OneFlow features. They provide users with a status description of a feature to indicate its specific level of completeness, API stability, documentation, etc. They also provide OneFlow developers with a standard for feature refinement, which facilitates further improvement.
OneFlow Feature Stages
- Stable Stage
- Purpose: release for production use
- Audience: all users
- Functionality: same as RC
- Testing: same as RC
- Performance: same as RC
- API: same as RC, with stability within long cycles (e.g., 1 year) and large versions (e.g., 1.0)
- Documentation: same as RC
- Release Candidate (RC) Stage
- Purpose: release for deployment evaluation in production environments
- Audience: all users, including those who want to deploy production environments
- Functionality: able to handle exceptional inputs as well as normal inputs.
- Testing: end-to-end deployment validated in external environment with good experience
- Performance: provide evaluation reports and documentation to evaluate performance and scalability in external environments
- API: API for external user evaluation
- Documentation: features in this stage are added to the core-feature-set documentation
- Beta Stage
- Purpose: release to provide a relatively stable, complete, and available version
- Audience: all users, especially those with strong feature demands, little concern for unknown trivial issues, and willingness to provide feedback
- Functionality: complete functionalities addressing the needs of various possible scenarios
- Testing: complete, covering various corner test cases, and various end-to-end integration tests
- Performance: performance evaluation and scalability evaluation
- API: recognized as complete and stable by seed users after full review
- Documentation: tutorials that describe the usage process
- Alpha Stage
- Purpose: release to get early feedback for experimental features
- Audience: developers and expert users
- Functionality: core functionality completed
- Testing: unit testing completed for core requirements of the feature, possibly with unknown bugs
- Performance: evaluated
- API: well-defined but not rigorously reviewed, possibly requiring further changes
- Documentation: API documentation is a must to provide feature definitions
- Pre-alpha Stage
- Purpose: release to validate feature prototypes or address urgent needs
- Audience: feature developers
- Functionality: limited prototype functionalities
- Testing: limited testing, possibly with many bugs
- Performance: unknown
- API: prone to changes
- Documentation: possibly none
OneFlow Framework
1. Distribution
Global Tensor
Global Tensor is a newly released set of distributed computing interfaces. It can easily support any parallelism including data parallelism, model parallelism, and pipeline parallelism. Unlike a normal Tensor (hereafter called Local Tensor), Global Tensor is a Tensor with a global view, whose data is distributed in a specific way across a set of devices in a cluster, and each node stores some or all of the Global Tensor's data. Placement and SBP are the basic properties of the Global Tensor that describe the distribution of the data in clusters.
Global Tensor's data distribution
Global Tensor supports three different ways of data distribution, which we collectively refer to as SBP.
- Split(dim): The data is evenly split along dimension dim and distributed to each device.
- Broadcast: The data is replicated across the devices.
- PartialSum: The global data is the element-wise sum of the local data on each device.
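A minimal sketch of the three distributions (assuming two GPUs visible as ranks 0 and 1, with one process per device): each rank starts from the same local tensor, and the chosen SBP states how that local piece is interpreted globally.

import oneflow as flow

placement = flow.placement("cuda", ranks=[0, 1])
x = flow.tensor([1.0, 2.0])  # local tensor on every rank

# split(0): each rank's piece is a slice along dim 0, so the global shape is [4]
x_split = x.to_global(placement=placement, sbp=flow.sbp.split(0))

# broadcast: each rank's piece is a full replica, so the global shape stays [2]
x_broadcast = x.to_global(placement=placement, sbp=flow.sbp.broadcast)

# partial_sum: the global value is the element-wise sum of the per-rank pieces
x_partial = x.to_global(placement=placement, sbp=flow.sbp.partial_sum)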
Consistent computational interfaces
Global Tensor has basically the same computational interfaces as Local Tensor. With only small changes, you can convert single-GPU code to distributed code.
Local Tensor:

>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0])
>>> y = x * x

Global Tensor:

>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0],
        placement=flow.placement("cuda", ranks=[0, 1]),
        sbp=flow.sbp.split(0))
>>> y = x * x
# This multiplication is performed on both rank 0 and rank 1
Supporting conversion between Local Tensor and Global Tensor
- With the Tensor.to_global interface, you can create a Global Tensor from a Local Tensor; the Local Tensor is regarded as the local component of the Global Tensor on the present device.
- With the Tensor.to_local interface, you can get the local component of the Global Tensor on the present device.
Local Tensor to Global Tensor:

>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0])
>>> y = x.to_global(
        placement=flow.placement("cuda", ranks=[0, 1]),
        sbp=flow.sbp.split(0))
>>> y.size()
oneflow.Size([4])
>>> y
tensor([1., 2., 1., 2.],
       placement=oneflow.placement(type="cuda", ranks=[0, 1]),
       sbp=(oneflow.sbp.split(axis=0),), dtype=oneflow.float32)

Global Tensor to Local Tensor:

>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0],
        placement=flow.placement("cuda", ranks=[0, 1]),
        sbp=flow.sbp.split(0))
>>> y = x.to_local()
>>> y.size()
oneflow.Size([1])
>>> y
tensor([1.], device='cuda:0', dtype=oneflow.float32)
# tensor([2.], device='cuda:0', dtype=oneflow.float32) if rank is 1
Supporting redistribution of Global Tensor in clusters
With the Tensor.to_global interface, you can redistribute the data of a Global Tensor in a cluster. The data can be distributed to another set of nodes, and the way it is distributed across these nodes can also be changed (i.e., the SBP can be changed). Redistribution usually generates inter-process data communication, but the Tensor.to_global interface hides the complicated low-level communication details.
>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0], placement=flow.placement("cuda", ranks=[0, 1]), sbp=flow.sbp.split(0))
>>> y = x.to_global(placement=flow.placement("cuda", ranks=[2, 3]), sbp=flow.sbp.broadcast)
Each OneFlow operator defines a set of SBP signatures for its input and output tensors. Global Tensor supports automatic redistribution to provide the SBP signature required by an operator, as shown in the code below:
>>> import oneflow as flow
>>> x = flow.randn(4, 4,
placement=flow.placement("cuda", ranks=[0, 1]),
sbp=flow.sbp.split(0))
>>> y = flow.randn(4, 4,
placement=flow.placement("cuda", ranks=[0, 1]),
sbp=flow.sbp.split(1))
>>> z = x + y
When x + y is executed, since x is split along dimension 0 while y is split along dimension 1, their local tensors on each device cannot be added directly. Therefore, x's SBP is automatically converted to flow.sbp.split(1), or y's SBP is converted to flow.sbp.split(0), and the SBP of the result z is flow.sbp.split(1) or flow.sbp.split(0) accordingly.
Notes
- Global Tensor does not currently support being mixed with the DDP interface.
- Global Tensor requires all devices to execute simultaneously; code with branches can lead to process deadlock because of divergent execution paths. We will continue to fix this problem.
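Note that the Global Tensor snippets in this section assume one process per device. On a single node with two GPUs, a script containing them (demo.py is a placeholder name) can be started with OneFlow's distributed launcher:

python3 -m oneflow.distributed.launch --nproc_per_node 2 demo.py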
2. Continued improvement of nn.Graph's features
Overview of the development of nn.Graph v0.7.0
- Fundamental features enter the Beta Stage and can meet most user requirements;
- Advanced features enter the Alpha Stage and can meet standard user requirements;
- ResNet50, Wide and Deep, GPT, Bert, Swin-Transformer, InsightFace, and other models are supported.
Features of nn.Graph
- Static and dynamic casting of operators under Static Graph moves from Alpha to Beta Stage.
  - Adds unit tests of static execution for all legal operators under nn.Graph; automated unit testing is ready;
  - Supports more flexible inputs and outputs, including List/Tuple/Dict and their nesting, and fixes the problem with Tuple outputs of size 1;
  - Adds automatic backward tests.
- Optimizer and LR Scheduler under Static Graph move from Alpha to Beta Stage (see the combined usage sketch after this feature list).
  - Adds more built-in LR schedulers, including WarmupLR, CosineAnnealingWarmRestarts, and other common schedulers, and provides SequentialLR and ChainedScheduler to enable combining schedulers;
  - Refactors the scheduler's get_lr function into a pure function. This change allows schedulers to be used in combination by computing the lr analytically instead of iteratively;
  - Adds the "is_sparse" parameter to the add_optimizer interface, supporting sparse updates under graph mode. Optimizers that support sparse updates include Adam and SGD, while optimizers under Eager mode don't support sparse updates yet. A subsequent version will support both sparse updates and sparse tensors. This feature is at Pre-alpha Stage;
  - Adds a debug print feature for LR and Step, which only requires turning on the LR Scheduler's verbose switch.
- state_dict and load_state_dict under Static Graph are newly added, which allow resuming training from the last checkpoint. The feature is at Beta Stage.
- Debug under Static Graph moves from Alpha to Beta Stage.
  - Adds debug(2) and debug(3), which help locate problems in nn.Module by mapping operators at the C++ layer back to their Python code and by tracking forward graph creation and inference for operators;
  - Adds the display of memory overhead.
- ZeRO-DP under Static Graph is newly added, which reduces the memory overhead related to the Optimizer under data parallelism. The feature is at Alpha Stage.
- Global Tensor under Static Graph supports multiple parallel methods; the feature is between Alpha and Beta Stage.
  - It is utilized in LiBai and other model libraries;
  - It is widely utilized in OneFlow's model libraries, while unit test coverage is still in progress;
  - For 1D Global Tensor, you only need to define the input tensor's SBP, and the output tensor's SBP can be derived automatically with good results; this feature is at Beta Stage;
  - For 2D Global Tensor, you only need to define the input tensor's SBP, and the output tensor's SBP can be derived automatically with good results; this feature is at Alpha Stage;
  - Conversion from 1D to ND or ND to 1D SBP is newly supported; the feature is at Alpha Stage;
  - Random conversion of 2D SBP is newly supported; the feature is at Alpha Stage;
  - Testing of single operators with 1D and 2D SBP is still ongoing; the feature is at Pre-alpha Stage;
  - Selecting SBP with semi-automatic derivation is supported; the feature is at Pre-alpha Stage.
- For Gradient Accumulation under Static Graph, we refactor and repair the support for Reshape and add API documentation. The current interface takes mini-batch input; a future version will offer micro-batch input with a better experience. The feature moves from Pre-alpha to Alpha Stage;
- For pipeline parallelism under Static Graph, the tutorial is improved, and pipeline parallelism is available in LiBai and other model libraries. The feature is at Beta Stage;
- For automatic mixed precision (AMP) under Static Graph, API documentation is newly added. The feature moves from Pre-alpha to Alpha Stage;
- For Activation Checkpointing under Static Graph, API documentation is newly added. The feature moves from Pre-alpha to Alpha Stage;
- For Op Fuse optimization under Static Graph, API documentation is newly added. The feature moves from Pre-alpha to Alpha Stage;
- For XLA/TensorRT/OpenVINO execution under Static Graph, API documentation is newly added. The feature moves from Pre-alpha to Alpha Stage.
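Several of the features above can be combined in a single nn.Graph subclass. The following is a minimal training sketch, not a complete script: the model, optimizer, and data are placeholders, and details such as the lr_sch parameter of add_optimizer and the block-level activation_checkpointing switch reflect the 0.7.0 API as we understand it, so please check them against the API documentation.

import oneflow as flow

model = flow.nn.Sequential(flow.nn.Linear(8, 4), flow.nn.ReLU(), flow.nn.Linear(4, 2)).to("cuda")
optimizer = flow.optim.SGD(model.parameters(), lr=0.1)
lr_scheduler = flow.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

class TrainGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.model = model
        self.loss_fn = flow.nn.CrossEntropyLoss()
        # optimizer + LR scheduler (Beta); passing is_sparse=True would enable sparse updates (Pre-alpha)
        self.add_optimizer(optimizer, lr_sch=lr_scheduler)
        self.config.enable_amp(True)                        # automatic mixed precision (Alpha)
        self.config.set_gradient_accumulation_steps(2)      # gradient accumulation over 2 micro-batches (Alpha)
        self.model.config.activation_checkpointing = True   # activation checkpointing on this block (Alpha, assumed switch)

    def build(self, x, y):
        loss = self.loss_fn(self.model(x), y)
        loss.backward()
        return loss

graph = TrainGraph()
graph.debug(1)  # print graph-building information; debug(2)/debug(3) give more detail (Beta)

x = flow.randn(4, 8, device="cuda")
y = flow.randint(0, 2, (4,), device="cuda")
loss = graph(x, y)

# save and later resume training from the last checkpoint (Beta)
flow.save(graph.state_dict(), "./graph_checkpoint")
graph.load_state_dict(flow.load("./graph_checkpoint"))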
Tutorials
- en https://docs.oneflow.org/en/master/basics/08_nn_graph.html
- zh https://docs.oneflow.org/master/basics/08_nn_graph.html
API Documentation
- en https://oneflow.readthedocs.io/en/master/graph.html
- zh https://start.oneflow.org/oneflow-api-cn/graph.html
Tutorials of pipeline parallelism:
- en https://docs.oneflow.org/en/master/parallelism/06_pipeline.html
- zh https://docs.oneflow.org/master/parallelism/06_pipeline.html
Model support under nn.Graph
- Training ResNet50 with single-node single-GPU or single-node multi-GPU is supported: https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/resnet50
- The Wide and Deep model is supported: https://github.com/Oneflow-Inc/models/tree/main/RecommenderSystems/wide_and_deep
- GPT, Bert, and Swin-Transformer in LiBai are supported: https://github.com/Oneflow-Inc/libai
- Functional problems in the support for the above models are resolved.
3. Performance optimization of Eager
- The performance of Eager mode is deeply optimized. Running the Swin-Transformer model on V100 GPUs, a single GPU delivers a 25% speedup over PyTorch, and 8 GPUs deliver a 10% speedup;
- The communication scheduling policy for NCCL in DDP is optimized;
- DDP supports AllReduce fusion, reducing the additional overhead generated by fragmented AllReduce, with a 5% performance speedup when tested on ResNet50 (a minimal DDP sketch follows this list);
- The VM supports instruction fusion, significantly reducing Kernel scheduling overhead;
- Additional memory overhead is reduced when the CPU load is too high;
- Eager DataLoader supports inter-process memory sharing;
- The performance of Clip Grad is optimized.
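As a usage reference for the DDP-related items above, here is a minimal data-parallel sketch (the module and data are placeholders; it assumes the script is launched with one process per GPU, as with the launcher command mentioned in the Global Tensor notes):

import oneflow as flow
from oneflow.nn.parallel import DistributedDataParallel as ddp

# every rank builds the same module; ddp wraps it so that gradients are
# synchronized across ranks with fused AllReduce during backward
model = flow.nn.Linear(8, 2).to("cuda")
model = ddp(model)
optimizer = flow.optim.SGD(model.parameters(), lr=0.1)

x = flow.randn(4, 8, device="cuda")  # each rank feeds its own shard of data
loss = model(x).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()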
4. Improvements of operators
- OneFlow is successfully adapted to oneDNN for CPU operator acceleration. The performance of CPU operators such as unary and binary element-wise operators is improved by 4 times, and the speed of the Swin-Transformer dataloader is improved by 2.5 times. #7319
- Adds inter-process shared memory support to DataLoader, which greatly improves DataLoader performance in DDP.
- Adds Bool type Tensor. #7523
- Implements to_contiguous, which the view mechanism relies on. #7670
- Adds Scalar div operators. #7483
- Adds the Lamb optimizer. #7389
- Adds the Polynomial Learning Rate Scheduler. #7260
- Adds cumprod operators. #7278
- Adds Tensor.T and oneflow.t() operators. #7269
- Adds normalize operators. #7113
- Adds inplace versions of the div and sub operators. #7293
- Adds the Module.zero_grad feature. #7587
- Adds support for using a Scalar Tensor as an index in list indexing. #7597
- Adds half type support for Leaky ReLU operators. #7569
- Adds support for masked select operators. #7492
- Adds non-reduce communication operations such as Bool type Broadcast and AllGather. #7366
- Develops autotest support for eager global, based on the autotest framework. #7204
- Optimizes the performance of the ReduceSum CUDA Kernel. #7684
- Optimizes the CUDA Kernel of gather operators. #7351
- Optimizes the performance of the CUDA Kernels of MaxPool and AvgPool operators in NCHW format. #7426 & #7451
- Optimizes the backward computation of PReLU operators, which saves memory in general. #7600
- Optimizes the backward Kernel of LayerNorm to further save memory. #6996
- Supports passing a single int for stride and dilation in the Conv1D/2D/3D and DeConv1D/2D/3D Kernels. Adds the Tensor.zero_() interface and aligns tensor.norm, torch.max, and torch.min with PyTorch. Supports inplace in flow.nn.functional.dropout. #7593 (A short usage sketch of several of these new interfaces follows at the end of this section.)
- Fixes the bug where the BatchNorm module raises an error when affine=False. #7755
- Fixes the Maximum and Minimum backward bug. #7519
- Fixes the bug where the result of var operators is unexpected in some cases. #7517
- Fixes incorrect behavior of Tensor deepcopy. #7490
- Fixes the bug where the input index is a scalar tensor in slice operators. #7479
- Fixes the bug where BinaryCrossEntropy can produce nan in half precision. #7476
- Fixes the bug where an error is raised when the base and exponent of pow operators are a real number and a Tensor, respectively. #7729
- Fixes the stack operators backward bug. #7363
- Fixes the inefficiency caused by CPU synchronization when clip grad is executed on CUDA with the default configuration. #7304
- Fixes the SBP inference of the Batch Gather and Unsorted Batch Segment Sum operators, and runs the global unittest successfully. #7590
- Fixes the Physical Shape inference of the Affine Grid operators, fixes the unexpected-result bug in some SBP cases, and runs the global unittest successfully. #7578
- Fixes the problem that arange operators don't support generating 0-size tensors, and runs the global unittest successfully. #7576
- Fixes the incorrect SBP inference of flip operators, and runs the global unittest successfully. #7496
- Fixes the SBP bugs of advanced indexing and zeroslike operators. #7238
- Fixes the bug where Eager global inplace might not succeed. #7348
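A quick sketch exercising a few of the newly added interfaces listed above (illustrative only, run on CPU here; the signatures are assumed to follow the PyTorch-aligned forms):

import oneflow as flow
import oneflow.nn.functional as F

x = flow.randn(2, 3)
xt = x.T                                                # Tensor.T (see also oneflow.t())
c = flow.cumprod(flow.tensor([1.0, 2.0, 3.0]), dim=0)   # cumprod -> [1., 2., 6.]
n = F.normalize(x, dim=1)                               # normalize
d = F.dropout(x, p=0.5, inplace=True)                   # inplace dropout

m = flow.nn.Linear(3, 3)
m.zero_grad()                                           # Module.zero_grad
x.zero_()                                               # Tensor.zero_(), aligned with PyTorch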
5. Supporting einsum & view mechanism
Adds the einsum operator. einsum provides a set of concise but elegant rules that can implement tensor operations including, but not limited to, inner product, outer product, tensor multiplication, tensor transposition, and tensor contraction. Proficient use of einsum allows you to implement various complex tensor operations easily and with fewer errors. #7526
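For example, a few common einsum patterns (expected result shapes are noted in the comments):

import oneflow as flow

a = flow.randn(3, 4)
b = flow.randn(4, 5)
mm = flow.einsum("ik,kj->ij", a, b)          # matrix multiplication, shape (3, 5)

x = flow.randn(2, 3, 4)
y = flow.randn(2, 4, 5)
bmm = flow.einsum("bik,bkj->bij", x, y)      # batch matrix multiplication, shape (2, 3, 5)

outer = flow.einsum("i,j->ij", flow.randn(3), flow.randn(4))  # outer product, shape (3, 4)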
Adds the view mechanism. The view mechanism allows common operators to reuse/share a Tensor's memory, which saves memory by reducing the Kernel Launch/Compute work. At present, view operators that do not change the tensor.is_contiguous() property have been added, such as reshape, view, squeeze, and unsqueeze: #7503. More view operators (such as transpose, permute, narrow, expand, and unfold) will be added later.
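A small sketch of the view behavior (the input here is contiguous, so the view operators below share memory with x instead of copying):

import oneflow as flow

x = flow.randn(2, 6)
y = x.view(3, 4)        # shares memory with x, no extra kernel launch for a copy
z = x.unsqueeze(0)      # also a view, shape (1, 2, 6)
print(x.is_contiguous(), y.shape, z.shape)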
6. Improvements of the compiler
- OneFlow is officially connected to the MLIR ecosystem, and the OneFlow Dialect component is complete. The RoundTrip between OneFlow Jobs (the computation graphs of OneFlow nn.Graph) and MLIR is successfully completed, and RoundTrip tests are run on all OneFlow operators in the CI process.
- Implements static graph optimization with a series of automatically fused operators based on MLIR DRR to accelerate OneFlow model training and inference.
7. OneFlow Serving
OneFlow Serving v0.1.0 comes out with the following features:
- Provides a OneFlow C++ API for inference, supporting model loading and static graph inference.
- The model weights and the computation graph in MLIR format can be saved simultaneously by running flow.save(graph) in Python (a short export sketch follows this list). They can be loaded with the C++ API (loading the computation graph is not supported in the Python API at present).
- Supports inference of OneFlow models using TensorRT and OpenVINO automatically without model conversion (based on the OneFlow XRT module), achieving better acceleration on NVIDIA GPUs and Intel CPUs.
- Implements the Triton OneFlow backend.
  - Provides an out-of-the-box Docker image.
  - Supports auto configuration: only the model path needs to be given; no Triton configuration file needs to be written.
- You are welcome to try the project deployed with the Triton OneFlow backend on OneFlow Cloud Platform.
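A hedged sketch of exporting a model for deployment (the InferGraph class, input shape, and the saved_model directory are placeholders; the flow.save call follows the description above and saves the weights together with the MLIR-format computation graph, which can then be loaded from the C++ API or served by the Triton OneFlow backend):

import oneflow as flow

model = flow.nn.Linear(8, 2).to("cuda")
model.eval()

class InferGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.model = model

    def build(self, x):
        return self.model(x)

graph = InferGraph()
graph(flow.randn(1, 8, device="cuda"))  # run once so the computation graph is built
flow.save(graph, "saved_model")         # saves weights and the MLIR-format graph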
8. LiBai
LiBai is a toolbox for massively distributed parallel training of Transformer. Compared with custom code bases such as Megatron-LM, LiBai provides a series of models and training components for distributed training based on a modular design, aiming to make models trained in distributed mode as convenient as in single-GPU mode. The 0.1.0 version mainly supports the following features and models:
Features:
- Data Parallelism
- 1D Tensor Parallelism
- Pipeline Parallelism
- Unified Distributed Layers
- Extensible for new parallelism
- Mixed Precision Training
- Activation Checkpointing
- Gradient Accumulation
- Gradient Clip
- ZeRO
- More flexible "LazyConfig" configuration system
- Easy-to-use Trainer and Evaluator
- Data preprocessing supporting images and texts
Models:
- Bert (3D Parallelism)
- GPT-2 (3D Parallelism)
- ViT (3D Parallelism)
- Swin-Transformer (Data Parallelism)
- Supports fine-tuning tasks in projects/
- Supports text classification tasks in projects/
9. flowvision
The flowvision 0.1.0 stable version comes out with the following improvements over the previous version:
- Adds the initialization method trunc_normal_
- Adds the DeiT model and rebuilds the VisionTransformer model
- Adds the ConvNeXt model
- Adds the ReXNet model
- Supports learning rate scheduling with PolyLRScheduler and TanhLRScheduler
- Fixes the use of F.normalize in the SSD model
- Fixes bugs in EfficientNet and Res2Net
- Fixes the weights problem in the vit_small_patch32_384 and res2net50_48w_2s models
- Rebuilds the model zoo and runs more complete tests on existing models
- Rebuilds the load_state_dict_from_url method to automatically save downloaded weights in the cache folder
- Improves the documentation of Getting Started and flowvision.models
The 0.2.0 version of flowvision is already in progress. A large number of new models will be added based on the 0.1.0 version, and the documentation will be improved, so stay tuned.