34 changes: 17 additions & 17 deletions README.md
@@ -5,23 +5,23 @@ This is a blog where I write about research papers and blog posts I read.

## Posts

-- [Swift for Tensorflow](https://github.com/kimbochen/md-blogs/tree/main/swift-for-tensorflow)
-- [How PyTorch Works - A Systems Perspective](https://github.com/kimbochen/md-blogs/tree/main/pytorch-systems-intro)
-- [PaLM - Pathways Language Model](https://github.com/kimbochen/md-blogs/tree/main/palm)
-- [TPU v4 and TPU v5e](https://github.com/kimbochen/md-blogs/tree/main/tpuv4_v5e)
-- [Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts](https://github.com/kimbochen/md-blogs/tree/main/mobile-v-moes)
-- [The Hardware Lottery](https://github.com/kimbochen/md-blogs/tree/main/the-hardware-lottery)
-- [Graph Compilers](https://github.com/kimbochen/md-blogs/tree/main/graph-compilers)
-- [Triton Compiler](https://github.com/kimbochen/md-blogs/tree/main/triton)
-- [Triton GPU IR Analysis](https://github.com/kimbochen/md-blogs/tree/main/triton-gpu-ir-analysis)
-- [Distributed Training in ML](https://github.com/kimbochen/md-blogs/tree/main/ml-distributed-training)
-- [Local Value Canonicalization in Julia](https://github.com/kimbochen/md-blogs/tree/main/local-value-canon-in-julia)
-- [Tesla AI Day 2021 - Vision](https://github.com/kimbochen/md-blogs/tree/main/tesla-ai-day-2021-vision)
-- [What Triton Does in a Matrix Multiplication](https://github.com/kimbochen/md-blogs/tree/main/what-triton-does-in-a-matmul)
+- [Swift for Tensorflow](./swift-for-tensorflow/README.md)
+- [How PyTorch Works - A Systems Perspective](./pytorch-systems-intro/README.md)
+- [PaLM - Pathways Language Model](./palm/README.md)
+- [TPU v4 and TPU v5e](./tpuv4_v5e/README.md)
+- [Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts](./mobile-v-moes/README.md)
+- [The Hardware Lottery](./the-hardware-lottery/README.md)
+- [Graph Compilers](./graph-compilers/README.md)
+- [Triton Compiler](./triton/README.md)
+- [Triton GPU IR Analysis](./triton-gpu-ir-analysis/README.md)
+- [Distributed Training in ML](./ml-distributed-training/README.md)
+- [Local Value Canonicalization in Julia](./local-value-canon-in-julia/README.md)
+- [Tesla AI Day 2021 - Vision](./tesla-ai-day-2021-vision/README.md)
+- [What Triton Does in a Matrix Multiplication](./what-triton-does-in-a-matmul/README.md)

## Others

-- [Reading List Dump](https://github.com/kimbochen/md-blogs/tree/main/reading-list-dump)
-- [Twitter Archive](https://github.com/kimbochen/md-blogs/tree/main/tweets)
-- [Post Archive](https://github.com/kimbochen/md-blogs/tree/main/post-archive)
-- [Threads Archive](https://github.com/kimbochen/md-blogs/tree/main/threads-archive)
+- [Reading List Dump](./reading-list-dump/README.md)
+- [Twitter Archive](./tweets/README.md)
+- [Post Archive](./post-archive/README.md)
+- [Threads Archive](./threads-archive/README.md)
2 changes: 1 addition & 1 deletion graph-compilers/README.md
@@ -57,7 +57,7 @@ This is why PyTorch 2 is pushing to support dynamic shapes ([Documentation](http
When lowering graphs to hardware-specific operators, Glow does not map high-level operator nodes to hardware, e.g. fully-connected layer,
but further lowers the nodes to linear-algebra-level operators, e.g. a matrix multiplication and a broadcast addition.
This gradual lowering technique is also seen in PyTorch 2 ATen IR and Prim IR
-(See my [blog post](https://github.com/kimbochen/md-blogs/tree/main/pytorch-systems-intro#the-pytorch-20-compiling-pipeline) for more).
+(See my [blog post](/pytorch-systems-intro/README.md#the-pytorch-20-compiling-pipeline) for more).
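
To make the gradual-lowering idea concrete, here is a minimal sketch in plain NumPy (illustrative only; these are not Glow's actual IR node names or API): a single fully-connected node decomposes into a matrix multiplication followed by a broadcast addition.

```python
import numpy as np

# High-level node: a fully-connected layer, y = x @ W + b.
def fully_connected(x, W, b):
    return x @ W + b

# After lowering: the same computation as two linear-algebra-level nodes.
def matmul(x, W):
    return x @ W

def broadcast_add(y, b):
    # b has shape (out_features,) and is broadcast across the batch dimension.
    return y + b

x = np.random.randn(4, 8)    # batch of 4, 8 input features
W = np.random.randn(8, 16)   # 16 output features
b = np.random.randn(16)

assert np.allclose(fully_connected(x, W, b), broadcast_add(matmul(x, W), b))
```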

### Quantization
Glow performs model weight quantization, which is something I find quite interesting.
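
As background for this section, here is a minimal sketch of affine (scale and zero-point) int8 quantization, the standard scheme; this is a generic illustration, not Glow's specific profile-guided implementation.

```python
import numpy as np

def quantize(w, num_bits=8):
    """Map floats in [w.min(), w.max()] to signed integers via a scale and zero point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = int(round(qmin - w.min() / scale))  # value that float 0.0 maps near
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(64).astype(np.float32)
q, scale, zero_point = quantize(w)
print(np.abs(w - dequantize(q, scale, zero_point)).max())  # small reconstruction error
```
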
2 changes: 1 addition & 1 deletion swift-for-tensorflow/README.md
@@ -57,5 +57,5 @@ and a [blog](https://pytorch.org/blog/understanding-lazytensor-system-performanc
A natural next step is to look into what compiler features PyTorch 2.0 has.
PyTorch 2.0 has 3 new components: TorchDynamo, AOTAutograd, and TorchInductor.
I wrote briefly about it in my
-[PyTorch Systems Intro](https://github.com/kimbochen/md-blogs/tree/main/pytorch-systems-intro#The-PyTorch-20-compiling-pipeline) post,
+[PyTorch Systems Intro](/pytorch-systems-intro/README.md#The-PyTorch-20-compiling-pipeline) post,
but the 3 features are definitely worth a deep dive in the future.
6 changes: 3 additions & 3 deletions triton/README.md
@@ -8,7 +8,7 @@ Deep learning algorithms are resource-intensive, so researchers need efficient i
This is usually done by implementing specialized GPU kernels,
but GPU programming requires a lot of knowledge about GPU architecture and familiarity with low-level programming.
This increases development time and, in the worst case, limits researchers' ability to explore more unconventional algorithms
-(i.e. [the hardware lottery](https://github.com/kimbochen/md-blogs/tree/main/the-hardware-lottery)).
+(i.e. [the hardware lottery](/the-hardware-lottery/README.md)).
Triton offers a programming model that is simpler than common GPU ones, e.g. CUDA, but has more control than deep learning frameworks,
all the while leveraging a compiler to achieve the performance of highly-tuned low-level GPU kernel implementations.
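
As an illustration of that programming model, here is a minimal Triton vector-addition kernel (following the standard tutorial pattern; the block size is an arbitrary choice): the programmer works with blocks of values and masks rather than individual threads.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device='cuda')
y = torch.randn(4096, device='cuda')
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```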

@@ -97,7 +97,7 @@ Triton is integrated with TorchInductor and is the default codegen for GPUs.
The PyTorch compiler stack leverages Triton to generate generic kernels with function inlining and operator fusion.
Using Triton with TorchInductor offers decent speedups for model training and inference.
Function inlining and operator fusion are the optimizations that provide the most speedup,
-which is in line with what I learned about ML compilers ([My ML compiler post](https://github.com/kimbochen/md-blogs/tree/main/graph-compilers)).
+which is in line with what I learned about ML compilers ([My ML compiler post](/graph-compilers/README.md)).
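
As a sketch of what this looks like from user code (standard `torch.compile` usage; the pointwise function below is just an example), TorchInductor can fuse a chain of elementwise ops into a single generated Triton kernel instead of launching one kernel per op:

```python
import torch

def biased_gelu(x, bias):
    # A chain of pointwise ops (add, mul, erf) -- a natural fusion candidate.
    x = x + bias
    return 0.5 * x * (1.0 + torch.erf(x * 0.7071067811865476))

compiled = torch.compile(biased_gelu)  # TorchInductor generates fused Triton code on GPU

x = torch.randn(1024, 1024, device='cuda')
bias = torch.randn(1024, device='cuda')
out = compiled(x, bias)
```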

| | Inference | Training |
| -: | :- | :- |
@@ -113,7 +113,7 @@ However, highly-tuned libraries like cuBLAS still outperform Triton by a decent
The author explains that cuBLAS is able to apply [3D matmul algorithms](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5389455)
to provide more parallelism.
For more information on the PyTorch compiler stack, see this [terrific slide deck by Keren Zhou](https://www.jokeren.tech/slides/Triton_bsc.pdf)
-or [my blog post](https://github.com/kimbochen/md-blogs/tree/main/pytorch-systems-intro#pytorch).
+or [my blog post](/pytorch-systems-intro/README.md#pytorch).


## Further Readings