34 changes: 17 additions & 17 deletions README.md
@@ -5,23 +5,23 @@ This is a blog where I write about research papers and blog posts I read.

## Posts

-- [Swift for Tensorflow](https://github.com/kimbochen/md-blogs/tree/main/swift-for-tensorflow)
-- [How PyTorch Works - A Systems Perspective](https://github.com/kimbochen/md-blogs/tree/main/pytorch-systems-intro)
-- [PaLM - Pathways Language Model](https://github.com/kimbochen/md-blogs/tree/main/palm)
-- [TPU v4 and TPU v5e](https://github.com/kimbochen/md-blogs/tree/main/tpuv4_v5e)
-- [Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts](https://github.com/kimbochen/md-blogs/tree/main/mobile-v-moes)
-- [The Hardware Lottery](https://github.com/kimbochen/md-blogs/tree/main/the-hardware-lottery)
-- [Graph Compilers](https://github.com/kimbochen/md-blogs/tree/main/graph-compilers)
-- [Triton Compiler](https://github.com/kimbochen/md-blogs/tree/main/triton)
-- [Triton GPU IR Analysis](https://github.com/kimbochen/md-blogs/tree/main/triton-gpu-ir-analysis)
-- [Distributed Training in ML](https://github.com/kimbochen/md-blogs/tree/main/ml-distributed-training)
-- [Local Value Canonicalization in Julia](https://github.com/kimbochen/md-blogs/tree/main/local-value-canon-in-julia)
-- [Tesla AI Day 2021 - Vision](https://github.com/kimbochen/md-blogs/tree/main/tesla-ai-day-2021-vision)
-- [What Triton Does in a Matrix Multiplication](https://github.com/kimbochen/md-blogs/tree/main/what-triton-does-in-a-matmul)
+- [Swift for Tensorflow](./swift-for-tensorflow/README.md)
+- [How PyTorch Works - A Systems Perspective](./pytorch-systems-intro/README.md)
+- [PaLM - Pathways Language Model](./palm/README.md)
+- [TPU v4 and TPU v5e](./tpuv4_v5e/README.md)
+- [Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts](./mobile-v-moes/README.md)
+- [The Hardware Lottery](./the-hardware-lottery/README.md)
+- [Graph Compilers](./graph-compilers/README.md)
+- [Triton Compiler](./triton/README.md)
+- [Triton GPU IR Analysis](./triton-gpu-ir-analysis/README.md)
+- [Distributed Training in ML](./ml-distributed-training/README.md)
+- [Local Value Canonicalization in Julia](./local-value-canon-in-julia/README.md)
+- [Tesla AI Day 2021 - Vision](./tesla-ai-day-2021-vision/README.md)
+- [What Triton Does in a Matrix Multiplication](./what-triton-does-in-a-matmul/README.md)

## Others

-- [Reading List Dump](https://github.com/kimbochen/md-blogs/tree/main/reading-list-dump)
-- [Twitter Archive](https://github.com/kimbochen/md-blogs/tree/main/tweets)
-- [Post Archive](https://github.com/kimbochen/md-blogs/tree/main/post-archive)
-- [Threads Archive](https://github.com/kimbochen/md-blogs/tree/main/threads-archive)
+- [Reading List Dump](./reading-list-dump/README.md)
+- [Twitter Archive](./tweets/README.md)
+- [Post Archive](./post-archive/README.md)
+- [Threads Archive](./threads-archive/README.md)
2 changes: 1 addition & 1 deletion graph-compilers/README.md
@@ -57,7 +57,7 @@ This is why PyTorch 2 is pushing to support dynamic shapes ([Documentation](http
When lowering graphs to hardware-specific operators, Glow does not map high-level operator nodes to hardware, e.g. fully-connected layer,
but further lowers the nodes to linear-algebra-level operators, e.g. a matrix multiplication and a broadcast addition.
This gradual lowering technique is also seen in PyTorch 2 ATen IR and Prim IR
-(See my [blog post](https://github.com/kimbochen/md-blogs/tree/main/pytorch-systems-intro#the-pytorch-20-compiling-pipeline) for more).
+(See my [blog post](/pytorch-systems-intro/README.md#the-pytorch-20-compiling-pipeline) for more).
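
To make the gradual-lowering idea concrete, here is a minimal sketch in plain NumPy (illustrative only; these are not Glow's actual IR node names or API): a single fully-connected node decomposes into a matrix multiplication followed by a broadcast addition.

```python
import numpy as np

# High-level node: a fully-connected layer, y = x @ W + b.
def fully_connected(x, W, b):
    return x @ W + b

# After lowering: the same computation as two linear-algebra-level nodes.
def matmul(x, W):
    return x @ W

def broadcast_add(y, b):
    # b has shape (out_features,) and is broadcast across the batch dimension.
    return y + b

x = np.random.randn(4, 8)    # batch of 4, 8 input features
W = np.random.randn(8, 16)   # 16 output features
b = np.random.randn(16)

assert np.allclose(fully_connected(x, W, b), broadcast_add(matmul(x, W), b))
```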

### Quantization
Glow performs model weight quantization, which is something I find quite interesting.
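
As background for this section, here is a minimal sketch of affine (scale and zero-point) int8 quantization, the standard scheme; this is a generic illustration, not Glow's specific profile-guided implementation.

```python
import numpy as np

def quantize(w, num_bits=8):
    """Map floats in [w.min(), w.max()] to signed integers via a scale and zero point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = int(round(qmin - w.min() / scale))  # value that float 0.0 maps near
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(64).astype(np.float32)
q, scale, zero_point = quantize(w)
print(np.abs(w - dequantize(q, scale, zero_point)).max())  # small reconstruction error
```
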
2 changes: 1 addition & 1 deletion swift-for-tensorflow/README.md
@@ -57,5 +57,5 @@ and a [blog](https://pytorch.org/blog/understanding-lazytensor-system-performanc
A natural next step is to look into what compiler features PyTorch 2.0 has.
PyTorch 2.0 has 3 new components: TorchDynamo, AOTAutograd, and TorchInductor.
I wrote briefly about it in my
-[PyTorch Systems Intro](https://github.com/kimbochen/md-blogs/tree/main/pytorch-systems-intro#The-PyTorch-20-compiling-pipeline) post,
+[PyTorch Systems Intro](/pytorch-systems-intro/README.md#The-PyTorch-20-compiling-pipeline) post,
but the 3 features are definitely worth a deep dive in the future.
6 changes: 3 additions & 3 deletions triton/README.md
@@ -8,7 +8,7 @@ Deep learning algorithms are resource-intensive, so researchers need efficient i
This is usually done by implementing specialized GPU kernels,
but GPU programming requires a lot of knowledge about GPU architecture and familiarity with low-level programming.
This increases development time and, in the worst case, limits researchers' ability to explore more unconventional algorithms
-(i.e. [the hardware lottery](https://github.com/kimbochen/md-blogs/tree/main/the-hardware-lottery)).
+(i.e. [the hardware lottery](/the-hardware-lottery/README.md)).
Triton offers a programming model that is simpler than common GPU ones, e.g. CUDA, but has more control than deep learning frameworks,
all the while leveraging a compiler to achieve the performance of highly-tuned low-level GPU kernel implementations.
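
As an illustration of that programming model, here is a minimal Triton vector-addition kernel (following the standard tutorial pattern; the block size is an arbitrary choice): the programmer works with blocks of values and masks rather than individual threads.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device='cuda')
y = torch.randn(4096, device='cuda')
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```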

@@ -97,7 +97,7 @@ Triton is integrated with TorchInductor and is the default codegen for GPUs.
The PyTorch compiler stack leverages Triton to generate generic kernels with function inlining and operator fusion.
Using Triton with TorchInductor offers decent speedups for model training and inference.
Function inlining and operator fusion are the optimizations that provide the most speedup,
-which is in line with what I learned about ML compilers ([My ML compiler post](https://github.com/kimbochen/md-blogs/tree/main/graph-compilers)).
+which is in line with what I learned about ML compilers ([My ML compiler post](/graph-compilers/README.md)).
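
As a sketch of what this looks like from user code (standard `torch.compile` usage; the pointwise function below is just an example), TorchInductor can fuse a chain of elementwise ops into a single generated Triton kernel instead of launching one kernel per op:

```python
import torch

def biased_gelu(x, bias):
    # A chain of pointwise ops (add, mul, erf) -- a natural fusion candidate.
    x = x + bias
    return 0.5 * x * (1.0 + torch.erf(x * 0.7071067811865476))

compiled = torch.compile(biased_gelu)  # TorchInductor generates fused Triton code on GPU

x = torch.randn(1024, 1024, device='cuda')
bias = torch.randn(1024, device='cuda')
out = compiled(x, bias)
```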

| | Inference | Training |
| -: | :- | :- |
@@ -113,7 +113,7 @@ However, highly-tuned libraries like cuBLAS still outperform Triton by a decent
The author explains that cuBLAS is able to apply [3D matmul algorithms](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5389455)
to provide more parallelism.
For more information on the PyTorch compiler stack, see this [terrific slide deck by Keren Zhou](https://www.jokeren.tech/slides/Triton_bsc.pdf)
-or [my blog post](https://github.com/kimbochen/md-blogs/tree/main/pytorch-systems-intro#pytorch).
+or [my blog post](/pytorch-systems-intro/README.md#pytorch).


## Further Readings