Releases: sophgo/tpu-mlir
TPU-MLIR v1.8 Release
Highlights:
Enhancements:
- Added support for dynamic shape inference in various operations.
- Optimized core operations for better performance on specific models.
- Improved backend support for multiple chips such as BM1684X, BM1688, BM1690, and SG2380.
- Introduced new operations and patterns for more efficient model processing.
- Updated documentation for better clarity and user guidance.
Bug Fixes:
- Resolved issues related to input/output handling, kernel configurations, and model-specific bugs.
- Fixed bugs in dynamic compilation, core parallel processing, and various backend operations.
- Addressed errors in post-processing steps of specific models such as YOLOv5 and EfficientNet.
Performance Improvements:
- Optimized cycle calculations for multi-core models.
- Enhanced bandwidth usage statistics for better resource management.
- Accelerated compilation processes for training models using a new layer-group scheme.
New Features:
- Introduced new operations like attention quant block and prelu op, along with various dynamic compile features (a generic PReLU reference sketch follows this list).
- Added support for additional operations, weight location, and dynamic compile enhancements.
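As a reminder of what the new prelu op computes, here is a generic PReLU reference in NumPy. This is only a minimal sketch of the usual PReLU semantics; `prelu_ref` is a hypothetical helper name, not the TPU-MLIR kernel or its API.

```python
import numpy as np

def prelu_ref(x: np.ndarray, alpha) -> np.ndarray:
    """Generic PReLU semantics: y = x where x > 0, y = alpha * x otherwise."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5], dtype=np.float32)
print(prelu_ref(x, np.float32(0.25)))   # [-0.5   -0.125  0.     1.5  ]
```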
Documentation Updates:
- Updated developer manuals, quick start guides, and model-specific documentation for better understanding.
Miscellaneous:
- Streamlined workflows for faster commit checks and improved debugging processes.
- Added new test cases for regression testing and script-based model evaluations.
- Fine-tuned backend operations for improved model performance and accuracy.
TPU-MLIR v1.7 Release
Change Log
New Features
- Added support for new operations including flash attention, custom op dynamic compile, and tpulang ops.
- Enabled AttnReorder and added support for dynamic indices in ops like onehot, scatterelements, and cumsum.
- Added `--dump_dataframe` option for bmodel_checker and support for transpose with order `[1, 2, 3, 0]`.
- Introduced Watchpoint feature to TDB and added support for mixed-precision networks.
- Implemented optimizations for DMA efficiency of flash attention and optimized the backend for various models.
- Added support for local memory dump in pcie mode and added various quantization features like eva quant, swin quant, and detr quant.
- Enhanced multi-core support, including LayerNorm and GroupNorm in coreParallel and multi-core data slice in tensorLocation (a conceptual data-slicing sketch follows this list).
- Added new patterns for Cswin and Einsum operations.
- Improved support for LLM (Large Language Models) in bm1688.
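The multi-core data slicing mentioned above can be pictured with a small NumPy sketch: split a tensor along the batch dimension, run the op on each slice (one slice per core), and concatenate the partial results. This is a conceptual illustration only; `run_on_cores` and `layer_norm` are hypothetical helpers, not coreParallel's actual interface.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Per-row LayerNorm over the last axis (no affine parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def run_on_cores(x: np.ndarray, op, num_cores: int = 8) -> np.ndarray:
    """Conceptual data-parallel slicing: each core processes one batch slice."""
    slices = np.array_split(x, num_cores, axis=0)
    return np.concatenate([op(s) for s in slices], axis=0)

x = np.random.randn(16, 128).astype(np.float32)
assert np.allclose(run_on_cores(x, layer_norm), layer_norm(x), atol=1e-5)
```

Because LayerNorm normalizes each row independently, slicing along the batch axis is exact, which is what makes such ops good candidates for core-parallel execution.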
Bug Fixes
- Fixed various bugs including kernel_module msg_id, SAM-VIT-encoder regression, and attention accuracy problems.
- Addressed logical issues in AddToScale pattern and issues in fp_forward.
- Resolved bugs in model info core dump, op's liveRange in coreParallel, and DevParallel bugs.
- Fixed issues in model combine with io alone and bugs in various ops like interp, RotaryPosEmbPattern, and efficient-lite4 permute.
Performance Improvements
- Improved the performance of TDB and the bmodel_checker for 1684x pcie.
- Optimized facenet and fixed performance issues of 1688 multicore.
- Enabled single-core mode optimizations where necessary.
Documentation and Testing
- Updated documentation, refined custom chapters, and ensured consistency in quick start docs.
- Added test cases for custom tpulang, multi-core with subnets, and custom cpuop.
- Fixed various documentation errors and updated the release note.
Other Changes
- Added restrictions to tpulang ops and net test cases.
- Adjusted descriptions and refined interfaces for better user experience.
- Updated backend .so files and addressed sensitive words in the codebase.
- Added support for int4 dtype in tpu_profile and ensured tool/scripts work in Python virtual environments.
Technical Preview
Features
- Added support for LLM Decoding by utilizing multi-cores to enhance processing efficiency.
- Introduced `fx2mlir`, a new functionality for enhanced MLIR conversion.
- Implemented `nnvlc2.0` and `nnvlc1.0` local activation and weight operations, respectively, for improved neural network performance.
- Enabled `TPULANG` support for operations like sort, argsort, and additional ops, enhancing the language's functionality and flexibility.
- Added `cv186x` support in `run_sensitive_layer.py` and for the TDB, expanding compatibility and debugging capabilities.
- Introduced new ops and features like `Watchpoint` in TDB and `activation ops` support for scale & zero_point, broadening the range of functionalities available in the `tpu-mlir` project.
- Supports `BM1690`.
- L2mem performs intermediate data exchange for active tensors.
Bug Fixes
- Resolved a variety of bugs affecting backend processes, including issues with the `1684x` backend, `permutefuse2`, `permutemulconstswap`, and more, improving overall stability and performance.
- Fixed several critical issues across `tpulang`, including errors in the `sort_by_key`, `reshape`, and `where` operations, and more, enhancing the language's reliability for developers.
- Addressed bugs in model processing, including fixes for `concat` logic, `scale2conv`, `scale2conv3d`, `instance norm`, and several more, ensuring smoother model optimization and execution.
- Corrected errors in the documentation, providing clearer and more accurate information for users and developers.
Documentation Updates
- Updated `tpulang` documentation to include new functionalities and optimizations, making it easier for users to understand and utilize the language effectively.
Performance Improvements
- Optimized TDB and `bmodel_checker` for `1684x pcie` mode, significantly reducing processing times and enhancing efficiency for model analysis.
- Improved the efficiency of DMA in flash attention operations, ensuring faster data handling and processing (a generic tiled-attention sketch follows this list).
- Enabled IO tag mode and refined address mode for better memory management and operational flexibility.
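As a rough intuition for why tiling helps DMA efficiency, attention can be computed block by block so that only one block of queries needs to be resident at a time. The sketch below tiles over queries only, which is mathematically exact because softmax is row-wise; it is a generic NumPy illustration, not the flash-attention kernel or its actual tiling scheme.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def attention_query_tiled(q, k, v, tile: int = 32) -> np.ndarray:
    """Process queries tile by tile; each tile only needs its own rows resident."""
    outs = [attention(q[i:i + tile], k, v) for i in range(0, q.shape[0], tile)]
    return np.concatenate(outs, axis=0)

q, k, v = np.random.randn(128, 64), np.random.randn(256, 64), np.random.randn(256, 64)
assert np.allclose(attention(q, k, v), attention_query_tiled(q, k, v), atol=1e-6)
```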
TPU-MLIR v1.6.1
Full Changelog: v1.6...v1.6.1
TPU-MLIR v1.6 release
Change Log
Bug Fixes
- Fixed documentation errors and added checks for documentation errors during build.
- Set workaround for `ar.copy` cycle issue to 0, avoiding potential data overwriting in inplacing operations.
- Addressed a bug in `Caffe DetectionOutput` and fixed a hang in `cv186x`.
- Corrected `Mul buffer` size alignment issues and various other buffer size corrections.
- Fixed issues with `attention accuracy`, `RotaryPosEmbPattern`, and `op status validation` before the matching process.
- Addressed a series of backend bugs, including daily build errors, performance declines, and incorrect return values.
- Fixed `data_checker` issues, an `api_conv` bug, and a local slice calculation bug.
- Resolved incorrect affineMap for Pooling buffer and fixed reshape bug for inner products.
- Corrected `Mul&Div` dynamic support for local operations and fixed issues with `Conv2d` buffer size calculations.
- Addressed various matmul bugs, including fp8 support issues and quantization inconsistencies.
Features
- Enabled multicore optimizations and added support for multi-core model tests.
- Updated `libbackend_1688.so` along with various backend updates for better performance and compatibility.
- Introduced the `groupParallel` operation and support for dynamic input data generation.
- Added support for new patterns such as the `Permute fuse pattern` and the `splitQuantizedMLP pattern` (a small permute-fusion check follows this list).
- Implemented the `npz compare visualizer` tool and added support for the `bm1688 backend`.
- Added `MatMul weight split case` and improved permute performance.
- Added support for the `img2col pattern`, attention interface, and several dialects for SG2260 operations.
Documentation Updates
- Updated release notes and resolved issues with document formatting.
- Standardized expression terminology and replaced sensitive words in documentation.
Performance Improvements
- Improved local softmax performance and optimized dataFlow checking in coreMatch.
- Enhanced performance for Vit L i8 4 batch operations and refined conv multi-core handling.
- Optimized VIT-B concurrency and addressed performance issues with `MaxPool` buffer sizes.
v1.6-beta.0
New Features
- Implemented SG2260 structureOp interface and structured transform, including a solver for finding transforms.
- Added OneHot converter and support for fp8 in the debugger (a generic one-hot reference follows this list).
- Supported MatMulOp for the special case of broadcast in batch dims and added an interface for attention.
- Provided "decompose linalg op" and "tile+fuse" passes so that MatMul parallelism supports more batch patterns.
- Added a Unet single block test.
- Implemented fp8 support for Matmul and other ops, including addconst, subconst, mul, add, sub, and abs.
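For reference, the usual semantics behind a OneHot converter can be written in a few lines of NumPy. This simplified sketch always inserts the new class axis last and takes scalar on/off values; it is not the converter's actual interface.

```python
import numpy as np

def one_hot(indices: np.ndarray, depth: int, on=1.0, off=0.0) -> np.ndarray:
    """out[..., c] == on where c equals the index, off elsewhere."""
    out = np.full(indices.shape + (depth,), off, dtype=np.float32)
    np.put_along_axis(out, indices[..., None], on, axis=-1)
    return out

print(one_hot(np.array([0, 2, 1]), depth=4))
# [[1. 0. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 1. 0. 0.]]
```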
Performance Improvements
- Improved Matmul fp8 performance with new backend support.
- Enabled distribute MLP and attention with improved performance for cascade_net input/output names and order.
- Refactored tdb to improve disassembler serialization and resolve a BM1688 decoding issue.
- Improved weight reorder for ConvOp and optimized the permute of attention matmul.
Bug Fixes
- Resolved various bugs in MatMul, Conv, and other ops across multiple chipsets including SG2260, BM1688, and CV18xx.
- Fixed bugs related to ReduceOp, ArgOp, SliceOp, and others for better operation and tensor handling.
- Addressed issues in SAM, daily test, and tdb related to core operations and functionality.
- Fixed memory and data handling bugs for more accurate and stable operation of the models.
Documentation Updates
- Updated documentation to remove sensitive words and improve clarity and comprehensiveness.
Miscellaneous
- Enhanced various backend libraries and supported new ops and patterns for more efficient and versatile model handling.
- Improved scatterE and reduce dynamic shape_value handling for better model optimization.
- Refinements in graph optimization, permute parallel indexMapping, and related areas for improved model processing.
Technical Preview
TPU-MLIR Project Update
Bug Fixes and Dependency Updates
- Fix Dependency: Fixed the dependency of MLIRInputConversion.
- SDK Release Workflow: Fixed tpu-mlir tag for building and added workflow file for SDK release.
- Softplus LoweringINT8: Fixed 1684 Softplus LoweringINT8 issue.
- Slice Begin Index: Fixed bm1684 slice begin_index problem.
- Mul Conflict Resolution: Partially fixed the output data sign of mul conflict with chip restriction.
Feature Enhancements and Support
- Subgraph Split Support: Enhanced support for subgraph split.
- Quant IO List Note: Added quant io list note for better quantization handling.
- New Full Operation: Supported the aten::new_full operation (a brief usage reminder follows this list).
- Torch Flip for bm1684x: Added support for torch.flip for bm1684x.
- Weight Input Shape Bind: Supported shape bind for weight input.
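As a reminder of what the converter has to reproduce, `aten::new_full` creates a tensor of a requested shape filled with a constant, inheriting dtype and device from the source tensor. A minimal PyTorch example (assuming torch is available in the environment):

```python
import torch

x = torch.zeros(2, 3, dtype=torch.float16)
y = x.new_full((4, 4), 1.5)          # same dtype/device as x, filled with 1.5
assert y.shape == (4, 4) and y.dtype == torch.float16
```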
Updates and Implementations for Specific Operations
- Backend Update for sg2260: Updated the sg2260 backend for tag31.
- ScatterElements Implementation: Implemented ScatterElements for any axis (a reference-semantics sketch follows this list).
- Unary Indexing Map: Added unary indexing map.
- Binary Indexing Map: Added binary (add/sub/mul/div/min/max) indexing map.
- Dynamic NMS Support: Featured support for dynamic nms for bm1684x.
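For the common case where `indices` and `updates` share a shape, the semantics of ScatterElements along an arbitrary axis match NumPy's `put_along_axis`. The reference below is an illustrative sketch (no reduction mode), not the backend implementation.

```python
import numpy as np

def scatter_elements(data, indices, updates, axis=0):
    """Copy `data`, then write updates[i, j] at the position indices[i, j] along `axis`."""
    out = data.copy()
    np.put_along_axis(out, indices, updates, axis=axis)
    return out

data = np.zeros((3, 3), dtype=np.float32)
indices = np.array([[1, 0, 2], [0, 2, 1], [2, 1, 0]])
updates = np.arange(9, dtype=np.float32).reshape(3, 3)
print(scatter_elements(data, indices, updates, axis=0))
```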
Codebase and Documentation Refinements
- Cleanup: Removed test/sg2260 dialect.
- Documentation Update: Updated nntoolchain README and lib.
- Codegen Documentation: Added documentation for codegen.
- Template Format Update: Updated import mlir file template format.
- Quick Start Docs Modification: Modified quick start docs for tpu-mlir.
Optimizations and Performance Improvements
- Kernel Module Usage: Reverted to using the old kernel module.
- MLIR Conv2D Optimization: Improved 1684 mlir conv2d with 3ic optimization.
- SWINT Quantization: Added swint quant for better performance.
- Opt Parameter Addition: Added an optimization parameter.
- Loop and Fusion Enhancements: Supported interchange of inner loop, padOp transform, tensor op collapse, fusion on linalg-on-tensor, etc.
Technical Preview
🐳 Docker Image Update
Changed required Docker image from sophgo/tpuc_dev:v2.2 to sophgo/tpuc_dev:v3.1, which is based on Ubuntu 22.04.
📖 Documentation
Updated docs to add more parameters in model deployment.
🐛 Bug Fixes
Fixed TPU-MLIR dialect Python binding for DEBUG mode.
Resolved backward training bug.
Addressed average pooling and max pooling issues.
Several other bug fixes related to Winograd inference, training, and more.
🚀 Feature Additions
Added Deconv3D backend support.
Support for dynamic tile added for bm1684x.
Added Winograd feature (a minimal 1-D F(2,3) illustration follows this list).
Several other feature additions, including dual-core support in debugger, MatMulSliceMerge support for int8/int4, and more.
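As background on what a Winograd convolution computes, the 1-D F(2,3) case produces two outputs of a 3-tap filter with four multiplies instead of six, using the standard textbook transform matrices shown below. This NumPy check only verifies the algebra; it says nothing about the backend's actual tiling or data layout.

```python
import numpy as np

# Standard F(2,3) Winograd transforms (textbook values).
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

d = np.random.randn(4)   # input tile of 4 samples
g = np.random.randn(3)   # 3-tap filter
y_winograd = AT @ ((G @ g) * (BT @ d))
y_direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                     d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(y_winograd, y_direct)
```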
🔧 Code Maintenance
Code renaming and cleaning.
Regression adjustments and tests.
⚙️ Backend Updates
Backend updates for various architectures including BM1684 and sg2260.
Technical Preview
New Features and Enhancements
- Support for Various Operations: Added support for exp, erf, gelu, loopop, and other operations for specific platforms.
- Tooling and Visualization: Renamed profile.py, added visual tools for weights, and enhanced debugging capabilities.
- Model Support and Adjustments: Added daily release models, scripts, and support for specific model types like yolov8, yolov4s.
- Distribution and Multicore Support: Implemented distribution steps, multicore support, and group convolution transformation.
Bug Fixes and Resolutions
- Model and Parsing Fixes: Resolved issues in emvd models, parsing errors, slice bugs, and fixed specific issues in bm1684 and bm1686.
- Codegen and Canonicalization Fixes: Addressed type errors, canonicalization failures, and operand kind checks.
- Inference and Optimization Fixes: Fixed inference issues in max, where, and slice operations, and refined canonicalization.
Documentation and Cleanup
- Documentation Updates: Refined tpu-mlir docs, added a supported-ops document, and updated specific documents.
- Code Cleanup and Refactoring: Removed unnecessary code, reconstructed permute move canonicalization, and prepared for LLVM upgrade.
Other Changes
- Testing and Calibration: Added test cases, calibration updates, and support for regression and tag in TDB.
- Backend and Runtime Adjustments: Updated backend, added support for auto-increase op, and fixed minor bugs.
Technical Preview
Features:
BM1686: supported post handle op, provided parallelOp codegen, and added DivOp for f16/bf16.
BM1684: supported dynamic-compilation loading of tensors from L2mem, implemented the GROUP_3D local layer function, and supported more dynamic ops (e.g., MinConst, MaxConst, Lut) as well as some static ops (e.g., deform_conv2d).
CV18XX: supported more ops such as equalOp.
Supported IfOp for f16/bf16/int8 mode.
Implemented the post-process function of the sensitive layer, handled unranked tensors and dynamic tensors at the frontend, and added empty and baddbmm torch converters/interpreters.
Supported weight split during layer group when the op is broadcastbinary, supported parsing the ops of each layer in top.mlir, and supported int32 to i/u8 inference for model_runner.py.
Removed onnx-sim and used unranked_type for all ops.
Implemented more graph optimizations: merging matmul + add into matmul for float types (a small numeric check follows this list), a fuse-same-operation pass, and weight trans when permute+add.
Supported more torch ops, such as rmsnorm, ceil, and remainder.
Other new operations: lowering of GatherElements and multi-input Add.
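The "merge matmul + add" rewrite mentioned in the list above folds the Add's per-channel constant into the MatMul's bias operand, which is exact in float arithmetic; presumably the per-op requantization in integer graphs is what makes the rewrite unsafe there, hence the float-only condition. A small NumPy check of the identity:

```python
import numpy as np

x = np.random.randn(4, 8).astype(np.float32)
W = np.random.randn(8, 16).astype(np.float32)
bias = np.random.randn(16).astype(np.float32)
add_const = np.random.randn(16).astype(np.float32)

before = (x @ W + bias) + add_const     # MatMul (with bias) followed by Add
after = x @ W + (bias + add_const)      # single MatMul, Add folded into the bias
assert np.allclose(before, after, atol=1e-5)
```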
Bug Fixes:
Fixed the chatglm2 rmsnorm untransformed problem, a ScaleOp inference error, bmodel_dis format bin, shape inference of matmul, and a subnet output-order mismatch that caused errors in the dynamic runtime.
Avoided duplicate names for inserted CastOps and distinguished caffe matmul shapes.
Code Refactoring:
Use llvm::md5, llvm::sha256.
Use Clang to speed up code compilation.
Remove some unused header files.
Use rewriter.eraseOp instead of op->erase, and use strings to define padding mode.
Refine disassembler, refactor mix_precision.
Documentation Updates:
Update document version and change some model-zoo requirements.
Modified the English parts and updated the developer_manual doc for the visual.py section.
Testing and Verification:
Updated list of test models supported by BM1684X.