TensorRT-LLM v0.10 update #1734

Merged · 1 commit · Jun 5, 2024
2 changes: 2 additions & 0 deletions .gitattributes
@@ -1,2 +1,4 @@
 *.a filter=lfs diff=lfs merge=lfs -text
 *.lib filter=lfs diff=lfs merge=lfs -text
+*.so filter=lfs diff=lfs merge=lfs -text
+*.dll filter=lfs diff=lfs merge=lfs -text
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug_report.yml
@@ -17,7 +17,7 @@ body:
 - Libraries
 - TensorRT-LLM branch or tag (e.g., main, v0.7.1)
 - TensorRT-LLM commit (if known)
-- Versions of TensorRT, AMMO, CUDA, cuBLAS, etc. used
+- Versions of TensorRT, Modelopt, CUDA, cuBLAS, etc. used
 - Container used (if running TensorRT-LLM in a container)
 - NVIDIA driver version
 - OS (Ubuntu 22.04, CentOS 7, Windows 10)
25 changes: 25 additions & 0 deletions .github/workflows/auto_close_inactive_issues.yml
@@ -0,0 +1,25 @@
+# Ref: https://docs.github.com/en/actions/managing-issues-and-pull-requests/closing-inactive-issues
+name: Close inactive issues
+on:
+  schedule:
+    - cron: "30 1 * * *"
+
+jobs:
+  stale:
+    runs-on: ubuntu-latest
+    permissions:
+      issues: write
+      pull-requests: write
+    steps:
+      - uses: actions/stale@v9
+        with:
+          days-before-issue-stale: 30
+          days-before-issue-close: 15
+          stale-issue-label: "stale"
+          exempt-issue-labels: ""
+          stale-issue-message: "This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days."
+          close-issue-message: "This issue was closed because it has been stalled for 15 days with no activity."
+          days-before-pr-stale: -1
+          days-before-pr-close: -1
+          repo-token: ${{ secrets.GITHUB_TOKEN }}
+          debug-only: false
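
The policy above is driven entirely by the `with:` parameters: an issue untouched for 30 days gets the `stale` label and is closed 15 days later unless someone removes the label or comments, while PRs are exempt (`-1`). As a hedged illustration only (not part of this PR; the helper name, the `REPO` constant, and the use of the `requests` package are assumptions), the sketch below dry-runs the same 30-day rule against the public GitHub REST API:

```python
import datetime as dt

import requests  # assumption: third-party HTTP client, `pip install requests`

REPO = "NVIDIA/TensorRT-LLM"   # assumed target repository
STALE_AFTER_DAYS = 30          # mirrors days-before-issue-stale above


def issues_that_would_go_stale(token: str | None = None) -> list[int]:
    """Return open issue numbers with no updates for 30+ days and no 'stale' label."""
    cutoff = dt.datetime.now(dt.timezone.utc) - dt.timedelta(days=STALE_AFTER_DAYS)
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.get(
        f"https://api.github.com/repos/{REPO}/issues",
        headers=headers,
        # oldest-updated first, so the stalest issues come back in the first page
        params={"state": "open", "sort": "updated", "direction": "asc", "per_page": 100},
        timeout=30,
    )
    resp.raise_for_status()
    stale = []
    for issue in resp.json():
        if "pull_request" in issue:  # the issues endpoint also returns PRs; skip them
            continue
        updated = dt.datetime.fromisoformat(issue["updated_at"].replace("Z", "+00:00"))
        if updated < cutoff and all(label["name"] != "stale" for label in issue["labels"]):
            stale.append(issue["number"])
    return stale


if __name__ == "__main__":
    print(issues_that_would_go_stale())  # inspects only the 100 least recently updated issues
```

Setting `debug-only: true` in the workflow itself achieves a similar dry run inside Actions, logging what would be labeled or closed without modifying any issue.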
4 changes: 3 additions & 1 deletion .gitignore
@@ -6,7 +6,6 @@ __pycache__/
 *.nsys-rep
 .VSCodeCounter
 build*/
-*.so
 *.egg-info/
 .coverage
 *.csv
@@ -29,11 +28,14 @@ config.json
 /*.svg
 cpp/cmake-build-*
 cpp/.ccache/
+tensorrt_llm/bin
 tensorrt_llm/libs
 tensorrt_llm/bindings.*.so
 tensorrt_llm/bindings.pyi
+tensorrt_llm/bindings/*.pyi
 *docs/cpp_docs*
 *docs/source/_cpp_gen*
+*.swp
 
 # Testing
 .coverage.*
2 changes: 1 addition & 1 deletion 3rdparty/cutlass
Submodule cutlass updated 548 files
210 changes: 0 additions & 210 deletions CHANGELOG.md

This file was deleted.

20 changes: 10 additions & 10 deletions README.md
@@ -6,9 +6,9 @@ TensorRT-LLM
 
 [![Documentation](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://nvidia.github.io/TensorRT-LLM/)
 [![python](https://img.shields.io/badge/python-3.10.12-green)](https://www.python.org/downloads/release/python-31012/)
-[![cuda](https://img.shields.io/badge/cuda-12.3-green)](https://developer.nvidia.com/cuda-downloads)
-[![trt](https://img.shields.io/badge/TRT-9.3-green)](https://developer.nvidia.com/tensorrt)
-[![version](https://img.shields.io/badge/release-0.9.0-green)](./setup.py)
+[![cuda](https://img.shields.io/badge/cuda-12.4.0-green)](https://developer.nvidia.com/cuda-downloads)
+[![trt](https://img.shields.io/badge/TRT-10.0.1-green)](https://developer.nvidia.com/tensorrt)
+[![version](https://img.shields.io/badge/release-0.10.0.dev-green)](./setup.py)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)
 
 [Architecture](./docs/source/architecture/overview.md)   |   [Results](./docs/source/performance/perf-overview.md)   |   [Examples](./examples/)   |   [Documentation](./docs/source/)
@@ -57,11 +57,11 @@ like `GPTAttention` or `BertAttention`, can be found in the
 [models](./tensorrt_llm/models) module.
 
 TensorRT-LLM comes with several popular models pre-defined. They can easily be
-modified and extended to fit custom needs. Refer to the [Support Matrix](docs/source/reference/support-matrix.md) for a list of supported models.
+modified and extended to fit custom needs. Refer to the [Support Matrix](https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html) for a list of supported models.
 
 To maximize performance and reduce memory footprint, TensorRT-LLM allows the
 models to be executed using different quantization modes (refer to
-[`support matrix`](docs/source/reference/support-matrix.md#software)). TensorRT-LLM supports
+[`support matrix`](https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html#software)). TensorRT-LLM supports
 INT4 or INT8 weights (and FP16 activations; a.k.a. INT4/INT8 weight-only) as
 well as a complete implementation of the
 [SmoothQuant](https://arxiv.org/abs/2211.10438) technique.
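
To make the two ideas in the hunk above concrete, the sketch below is a minimal NumPy illustration (not TensorRT-LLM code; the function names are invented for this example). It shows INT8 weight-only quantization, where weights are stored as INT8 with per-channel scales while activations stay in FP16, and the SmoothQuant smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1-alpha), which shifts quantization difficulty from activations to weights without changing the product X·W:

```python
import numpy as np


def quantize_weights_int8(w: np.ndarray):
    """Per-output-channel symmetric INT8 quantization (illustrative only)."""
    scale = np.maximum(np.abs(w).max(axis=0), 1e-8) / 127.0  # one FP scale per output channel
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def weight_only_matmul(x_fp16: np.ndarray, q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """INT8 weights are dequantized on the fly; activations remain FP16."""
    w_deq = q.astype(np.float16) * scale.astype(np.float16)
    return x_fp16 @ w_deq


def smoothquant_scales(x: np.ndarray, w: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """SmoothQuant: s_j = max|X_j|^alpha / max|W_j|^(1-alpha), per input channel j."""
    act_max = np.abs(x).max(axis=0).astype(np.float32)
    w_max = np.abs(w).max(axis=1).astype(np.float32)
    return act_max**alpha / np.maximum(w_max, 1e-5) ** (1.0 - alpha)


x = np.random.randn(4, 8).astype(np.float16)    # FP16 activations, shape (tokens, in_features)
w = np.random.randn(8, 16).astype(np.float32)   # weights, shape (in_features, out_features)

# Smooth first: (X / s) @ (diag(s) W) == X @ W, then quantize the smoothed weights.
s = smoothquant_scales(x, w)
x_smooth = (x / s).astype(np.float16)
q, scale = quantize_weights_int8(w * s[:, None])

y = weight_only_matmul(x_smooth, q, scale)
print(y.shape)  # (4, 16)
```

A real deployment would perform the dequantize-and-multiply inside fused GPU kernels; the NumPy version only illustrates the arithmetic.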
@@ -70,8 +70,8 @@ well as a complete implementation of the
 
 To get started with TensorRT-LLM, visit our documentation:
 
-- [Quick Start Guide](docs/source/quick-start-guide.md)
-- [Release Notes](docs/source/release-notes.md)
-- [Installation Guide for Linux](docs/source/installation/linux.md)
-- [Installation Guide for Windows](docs/source/installation/windows.md)
-- [Supported Hardware, Models, and other Software](docs/source/reference/support-matrix.md)
+- [Quick Start Guide](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html)
+- [Release Notes](https://nvidia.github.io/TensorRT-LLM/release-notes.html)
+- [Installation Guide for Linux](https://nvidia.github.io/TensorRT-LLM/installation/linux.html)
+- [Installation Guide for Windows](https://nvidia.github.io/TensorRT-LLM/installation/windows.html)
+- [Supported Hardware, Models, and other Software](https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html)