Deep learning frameworks are still evolving, which makes it hard to design custom hardware for them. Reconfigurable devices such as field-programmable gate arrays (FPGAs) make it easier to evolve hardware, frameworks and software alongside each other. Designing robust hardware takes a broad base of knowledge, ranging from undergraduate CS fundamentals to embedded systems and, of course, deep learning. It is becoming harder to find people who understand the full stack from first principles. This micro curriculum will help you understand the system stack, from the building blocks of ICs up to AI accelerators, from a first-principles perspective.
Note: This is in NO way complete. I will keep updating this repo.
- Building blocks of ICs. Learn about transistors: follow the book Microelectronic Circuits by Sedra and Smith. BJTs, FETs, power transistors. Some basic circuit theory (divide it into sub-chapters). IC design wiki.
- Sequential and Combinational Circuits. Synchronous, Asynchronous, Register Transfer level, Introduction to VLSI design, Verilog and VHDL.
- 8051
- 8085, 8086 - Basics, Instruction cycle
- AVR Family:- Arduino, some basic projects like an LCD controller, servo motors, sensors etc.
- PIC Family:- Some background knowledge.
- Talk about how FPGAs are built: an FPGA from 7400-series logic. The most basic building block of an FPGA is the cell, or slice. Talk about programmable LUTs.
- Build Your Own FPGA
- Internal functionality of FPGA look-up tables. Getting Started with FPGAs: https://www.allaboutcircuits.com/technical-articles/getting-started-with-fpgas-look-up-tables-and-flip-flops/
- 5 easy steps to building an embedded processor system inside an FPGA. Designing an FPGA from Scratch (38-part tutorial):- Writing a software device driver and an application program to run on the system. Pick out a suitable development board.
- All about FPGAs
- How to get started with FPGA programming? What is FPGA programming?
- Digital Logic Design: combinational and sequential circuits.
- Verilog/VHDL language
- Simulation - Modelsim
- Synthesis and implementation with the Xilinx ISE Design Suite:- Xilinx ISE
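
The programmable LUT mentioned in the FPGA items above can be modelled in a few lines of Python. This is only a behavioural sketch, not how any particular vendor's slice is built: the class name `LUT` and its methods are made up for illustration. The idea it shows is that a k-input LUT is just 2**k configuration bits, and the inputs act as an address into them.

```python
from itertools import product


class LUT:
    """Behavioural model of a k-input look-up table (hypothetical helper class)."""

    def __init__(self, k, config_bits):
        if len(config_bits) != 2 ** k:
            raise ValueError("need exactly 2**k configuration bits")
        self.k = k
        self.config_bits = list(config_bits)

    @classmethod
    def from_function(cls, k, fn):
        """'Program' the LUT by tabulating fn over every input combination."""
        bits = [int(bool(fn(*inputs))) for inputs in product((0, 1), repeat=k)]
        return cls(k, bits)

    def __call__(self, *inputs):
        # The inputs form an address into the configuration memory (first input = MSB).
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.config_bits[address]


# The same hardware primitive becomes XOR, or one half of a full adder,
# purely by changing the stored bits.
xor_lut = LUT.from_function(2, lambda a, b: a ^ b)
sum_lut = LUT.from_function(3, lambda a, b, cin: a ^ b ^ cin)
carry_lut = LUT.from_function(3, lambda a, b, cin: (a & b) | (cin & (a ^ b)))
```

Reprogramming the FPGA then amounts to loading different `config_bits` into each LUT and changing how the LUTs are wired together.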
- Read about RISC Architectures - RISC Wiki
- Learn about ARM organisation. ARM core dataflow model. 3-stage and 5-stage pipelines. ARM7 and ARM9. Explaining pipelining in ARM processors.
Include material for the RISC-V architecture as well.
- ARM Assembly basics tutorial series:- Writing ARM Assembly. Learn about the assembly language, data types and addressing modes. A good reading source is Computer Organization and Architecture by William Stallings. 32-bit ARM and 16-bit Thumb instruction sets.
- ARM Assembly Language.
- ARM Datatypes
- ARM Addressing Modes
- ARM Instruction Formats.
- ARM processors and cores.
- ARM Cortex M
- ARM Cortex A
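
To get a feel for the 3-stage and 5-stage pipelines listed above, here is a small timing sketch in Python. It assumes an ideal pipeline with no stalls, hazards or branches. The stage names follow the usual textbook description of the ARM7 (fetch, decode, execute) and ARM9 (fetch, decode, execute, memory, write) pipelines, and the function names are made up for illustration.

```python
THREE_STAGE = ["fetch", "decode", "execute"]                    # ARM7-style
FIVE_STAGE = ["fetch", "decode", "execute", "memory", "write"]  # ARM9-style


def pipeline_schedule(instructions, stages):
    """Return {cycle: {stage: instruction}} for an ideal, stall-free pipeline."""
    schedule = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(stages):
            cycle = i + s + 1  # instruction i reaches stage s in cycle i + s + 1
            schedule.setdefault(cycle, {})[stage] = instr
    return schedule


def total_cycles(n_instructions, n_stages):
    """Cycles needed to finish n instructions: fill the pipe, then one per cycle."""
    return n_stages + n_instructions - 1 if n_instructions else 0


if __name__ == "__main__":
    program = ["i1", "i2", "i3", "i4"]
    for cycle, busy in sorted(pipeline_schedule(program, THREE_STAGE).items()):
        print(cycle, busy)
```

Under these assumptions, four instructions take 6 cycles on three stages and 8 cycles on five. The deeper pipeline only pays off because each stage does less work and the clock can run faster, which this sketch does not model.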
- Operating System Overview
- Scheduling.
- Memory Management
- Translation Lookaside Buffers
- ARM Memory Management:- ARM Developer, Learn the Architecture. Download the full tutorial PDF.
- ARM Linux distributions:- Linux ARM distros, [ARM Linux Distributions wiki](https://en.wikipedia.org/wiki/Category:ARM_Linux_distributions)
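
The memory management and TLB items above can be made concrete with a toy model. The sketch below is not the ARM MMU: it assumes a flat, single-level page table, 4 KiB pages and LRU replacement, and the `TLB` class is hypothetical. It only shows the core idea that a TLB caches recent virtual-to-physical page translations so most accesses skip the page-table walk.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumption: 4 KiB pages


class TLB:
    """Toy translation lookaside buffer in front of a flat page table."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table   # virtual page number -> physical frame number
        self.capacity = capacity
        self.entries = OrderedDict()   # cached translations, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)          # mark as most recently used
        else:
            self.misses += 1
            if vpn not in self.page_table:         # no mapping at all: page fault
                raise KeyError(f"page fault at virtual address {vaddr:#x}")
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)   # evict the least recently used entry
            self.entries[vpn] = self.page_table[vpn]  # the "page-table walk"
        return self.entries[vpn] * PAGE_SIZE + offset


if __name__ == "__main__":
    tlb = TLB({0: 5, 1: 9})
    print(hex(tlb.translate(0x0010)), hex(tlb.translate(0x0020)))
    print("hits:", tlb.hits, "misses:", tlb.misses)
```

The second access to the same page hits in the TLB, which is exactly the case a hardware TLB is built to make fast.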
- Building an MMU (Verilog, 1000):- ARM9, explain TLBs and other fun things. Maybe also a memory controller, depending on how the FPGA is, then add the init code to your bootloader.
- Coding an assembler:- Write it in Python. This happens in parallel with the CPU building. Initially it outputs just binary files, but that changes once you write a linker.
- Building an ARM7 CPU (Verilog, 1500):- Break this into subchapters. A simple pipeline to start: fetch, decode, execute.
- Coding a bootrom (Assembler, 40), from geohot.
- Memory Management Unit - wiki. https://developer.arm.com/architectures/learn-the-architecture/memory-management/the-memory-management-unit-mmu
- Writing an x86 "Hello world" boot loader with assembly:- https://medium.com/@g33konaut/writing-an-x86-hello-world-boot-loader-with-assembly-3e4c5bdd96cf
- Bootloader in C:- https://www.codeproject.com/Articles/664165/Writing-a-boot-loader-in-Assembly-and-C-Part
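
As a starting point for the assembler item above, here is a minimal two-pass assembler sketch in Python that emits a flat binary image. Everything about the target is an assumption made for illustration: the instruction set is a made-up toy (16-bit words, 4-bit opcode, 12-bit operand), not ARM, and `OPCODES` and `assemble` are hypothetical names.

```python
import struct

# Hypothetical toy ISA (NOT ARM): 16-bit words = 4-bit opcode + 12-bit operand.
OPCODES = {"NOP": 0x0, "LOAD": 0x1, "ADD": 0x2, "STORE": 0x3, "JMP": 0x4, "HALT": 0xF}


def assemble(source):
    """Two-pass assembly: collect label addresses, then encode instructions."""
    lines = []
    for raw in source.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop comments and surrounding whitespace
        if line:
            lines.append(line)

    # Pass 1: record the word address of every label.
    labels, program = {}, []
    for line in lines:
        if line.endswith(":"):
            labels[line[:-1]] = len(program)
        else:
            program.append(line.split())

    # Pass 2: encode each instruction, resolving label operands.
    words = []
    for mnemonic, *operands in program:
        operand = 0
        if operands:
            token = operands[0]
            operand = labels[token] if token in labels else int(token, 0)
        if not 0 <= operand < (1 << 12):
            raise ValueError(f"operand out of range: {operand}")
        words.append((OPCODES[mnemonic.upper()] << 12) | operand)

    # Flat little-endian binary image (the endianness is an arbitrary choice here).
    return struct.pack(f"<{len(words)}H", *words)


SOURCE = """
start:
    LOAD 0x10    ; read the counter
    ADD 1
    STORE 0x10
    JMP start    ; label resolved in pass 2
    HALT
"""

if __name__ == "__main__":
    print(assemble(SOURCE).hex())
```

Two passes are needed because a jump may refer to a label defined later in the file. A linker generalises the same idea to labels defined in other files, which is why the output format changes once you write one.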
- Read the Compiler Design Tutorial by Tutorials Point. [Tutorials Point Compiler Design](https://www.tutorialspoint.com/compiler_design/compiler_design_overview.htm)
- Write a Compiler in Haskell. Learn Haskell: covers the basics of compilers.
- Tutorial for implementation of functional languages - https://www.microsoft.com/en-us/research/uploads/prod/1992/01/tutor.pdf
- Write a C compiler - https://github.com/nlsandler/write_a_c_compiler, https://norasandler.com/2017/11/29/Write-a-Compiler.html
- Haskell C Compiler - https://github.com/NunoDasNeves/haskell-c-compiler
- Implementing a JIT Compiled Language with Haskell and LLVM (LLVM tutorial). A JIT compiler runs after the program has started and compiles the code (usually bytecode or some kind of VM instructions) on the fly, or just in time, into a form that is usually faster, typically the host CPU's native instruction set. Unlike a standard compiler, a JIT has access to dynamic runtime information, so it can make better optimizations, such as inlining functions that are called frequently. This is in contrast to a traditional compiler, which compiles all the code to machine language before the program is first run.
- Optimizing a Compiler (Hacker News thread):- https://news.ycombinator.com/item?id=15821899
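
The JIT idea described above (interpret first, compile once the code proves to be hot) can be sketched in Python. This is a heavily simplified illustration, not what the Haskell/LLVM tutorial builds. The "native" target here is just a Python function created at runtime with the built-in `compile`, the expression format is a made-up tuple tree, and `HOT_THRESHOLD` and `JitFunction` are hypothetical names.

```python
import operator

HOT_THRESHOLD = 3  # assumption: compile after this many interpreted calls
OPS = {"+": operator.add, "*": operator.mul}


def interpret(expr, env):
    """Slow path: walk the expression tree on every call."""
    if isinstance(expr, (int, float)):
        return expr
    if isinstance(expr, str):
        return env[expr]
    op, left, right = expr
    return OPS[op](interpret(left, env), interpret(right, env))


def to_source(expr):
    """Translate the tree into a Python expression string."""
    if isinstance(expr, (int, float)):
        return repr(expr)
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    if op not in OPS:
        raise ValueError(f"unsupported operator: {op}")
    return f"({to_source(left)} {op} {to_source(right)})"


class JitFunction:
    """Interprets a function until it becomes hot, then compiles it once."""

    def __init__(self, params, expr):
        self.params, self.expr = params, expr
        self.calls = 0
        self.compiled = None

    def __call__(self, *args):
        if self.compiled is not None:
            return self.compiled(*args)        # fast path after compilation
        self.calls += 1
        if self.calls >= HOT_THRESHOLD:
            # Runtime information (the call count) decides when to compile.
            src = f"lambda {', '.join(self.params)}: {to_source(self.expr)}"
            self.compiled = eval(compile(src, "<jit>", "eval"))
        return interpret(self.expr, dict(zip(self.params, args)))


if __name__ == "__main__":
    f = JitFunction(["x", "y"], ("+", ("*", "x", "x"), "y"))  # x*x + y
    print([f(3, 4) for _ in range(5)], "compiled:", f.compiled is not None)
```

A real JIT would emit machine code (for example through LLVM) and use richer profile data such as observed types and branch targets, but the control flow is the same: count, decide, compile, then dispatch to the compiled version.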
Needed for System On Chip design for ASICs.
- SoC wiki
- SoC Design Methodology, Overview of the SoC Design Process.
- Canonical SoC Design, System Design Flow, System Architecture, Components of the System, Hardware & Software, Processor Architectures, System Architecture and Complexity. Parameterized Systems-on-a-Chip, System-on-a-Chip Peripheral Cores.
- Overview of SoC external memory, internal memory, size, cache memory, cache organization, cache data. Types of cache:- Split Level Caches, Multi Level Caches. SoC Memory System.
- SoC Notes
- Buffers and latches, crystal, reset circuit, chip select logic circuit, timers and counters. Universal asynchronous receiver-transmitter (UART), pulse width modulators.
- Building a UART (Verilog, 100):- An intro chapter to Verilog: copy a real UART, introducing the concept of MMIO. Serial test echo program and LED control. Software Serial (arduino.cc).
- Implementing a UART in Verilog and Migen
- UART, Serial Port, RS-232 Interface
- Semiconductor Main Memory:- SRAM, DRAM, Chip Logic. Flash Memory:- NOR and NAND flash memory, External Memory
- DDR DRAM:- DDR SDRAM
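
Before writing the UART in Verilog, it can help to pin down what goes over the wire. The sketch below models the common 8N1 frame format in Python: the line idles high, then carries one start bit (0), eight data bits with the least significant bit first, and one stop bit (1). It assumes exactly one sample per bit and no parity, and the function names are made up for illustration. A hardware receiver would oversample the line instead.

```python
def uart_encode(byte):
    """Frame one byte as 8N1: start bit (0), 8 data bits LSB first, stop bit (1)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte out of range")
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]


def uart_decode(line):
    """Recover bytes from a sampled line (one sample per bit, idle level is 1)."""
    out, i = [], 0
    while i < len(line):
        if line[i] == 1:      # idle level: keep waiting for a start bit
            i += 1
            continue
        frame = line[i:i + 10]
        if len(frame) < 10 or frame[9] != 1:
            raise ValueError("framing error: missing stop bit")
        out.append(sum(bit << n for n, bit in enumerate(frame[1:9])))
        i += 10
    return bytes(out)


if __name__ == "__main__":
    # Serial echo test: idle, two frames back to back, idle again.
    line = [1, 1] + uart_encode(ord("H")) + uart_encode(ord("i")) + [1]
    print(uart_decode(line))
```

The transmit side maps onto a shift register plus a small counter in Verilog, and the receive side onto the same structure driven by a start-bit detector.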
- Tutorial on Hardware accelerators for DNNs
- FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review
The Jetson Nano is basically a Raspberry Pi on steroids.
- Jetson Nano embedded technical specifications:- https://developer.nvidia.com/embedded/develop/hardware
- Jetson Nano DL benchmarks:- https://developer.nvidia.com/embedded/jetson-nano-dl-inference-benchmarks
- Jetson Nano Developer Kit:- https://developer.nvidia.com/embedded/jetson-nano-developer-kit
- What is a Tensor Processing Unit (TPU)? RISC, CISC, the TPU instruction set, the TPU. GPU vs TPU. Matrix Multiply Unit (MXU). Parallel processing on the Matrix Multiply Unit. Why matrix multiplication? Matrix machine. Systolic array: cycle 1 and cycle 2. Use cases of the TPU.
- Edge TPU performance benchmarks
- TensorFlow models on the Edge TPU
- Implement a multilayer perceptron for image classification using the CIFAR dataset (GPU vs TPU).
- A Survey of Accelerator Architectures for Deep Neural Networks
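
The systolic array item above is easier to follow with a cycle-by-cycle simulation. The Python sketch below uses an output-stationary arrangement, which is a simplifying assumption chosen for readability and not a claim about the TPU's actual dataflow. Each processing element (PE) keeps one output value, rows of A flow in from the left, columns of B flow in from the top, and row i and column j are delayed by i and j cycles so matching operands meet in the right PE. The function name is hypothetical.

```python
def systolic_matmul(A, B):
    """Simulate C = A @ B on an M x N grid of multiply-accumulate PEs."""
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0] * N for _ in range(M)]     # one accumulator per PE
    a_reg = [[0] * N for _ in range(M)]   # operands flowing left -> right
    b_reg = [[0] * N for _ in range(M)]   # operands flowing top -> bottom
    cycles = M + N + K - 2                # last operand pair reaches PE(M-1, N-1)
    for t in range(cycles):
        # Shift: every PE takes its operands from the left and upper neighbour.
        for i in range(M):
            for j in range(N - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(N):
            for i in range(M - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Feed the edges with skewed inputs (zeros before and after the data).
        for i in range(M):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < K else 0
        for j in range(N):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < K else 0
        # Every PE performs one multiply-accumulate per cycle, all in parallel.
        for i in range(M):
            for j in range(N):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


if __name__ == "__main__":
    C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(C, "in", cycles, "cycles")
```

Each operand is read from memory once and then reused as it moves from PE to PE. That reuse is the property that makes systolic arrays attractive for matrix multiplication.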
CUDA provides two APIs (application programming interfaces) for developers: the CUDA driver API and the CUDA runtime API. The driver API is lower level and more flexible. The runtime API is built on top of the driver API and is easier to use. We only consider the CUDA runtime API here. CUDA C++ extends C++ by allowing the programmer to define C++ functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C++ functions.
A kernel is defined using the `__global__` declaration specifier, and the number of CUDA threads that execute that kernel for a given kernel call is specified using the new `<<<...>>>` execution configuration syntax. A few examples have been provided in cuda programming.
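
To keep the examples in this README in one language, the sketch below emulates the kernel launch model in plain Python rather than CUDA C++. It is not CUDA and nothing runs in parallel. `launch` is a hypothetical stand-in for `kernel<<<num_blocks, threads_per_block>>>(...)`, and the block index, block size and thread index, which CUDA provides implicitly as `blockIdx`, `blockDim` and `threadIdx`, are passed as explicit arguments.

```python
from types import SimpleNamespace


def launch(kernel, num_blocks, threads_per_block, *args):
    """Emulate kernel<<<num_blocks, threads_per_block>>>(*args), one 'thread' at a time."""
    block_dim = SimpleNamespace(x=threads_per_block)
    for b in range(num_blocks):
        for t in range(threads_per_block):
            kernel(SimpleNamespace(x=b), block_dim, SimpleNamespace(x=t), *args)


def vec_add(block_idx, block_dim, thread_idx, a, b, c, n):
    """Kernel body: each 'thread' handles exactly one element."""
    # Same global-index arithmetic a 1-D CUDA kernel would use.
    i = block_idx.x * block_dim.x + thread_idx.x
    if i < n:  # guard: the grid may contain more threads than elements
        c[i] = a[i] + b[i]


if __name__ == "__main__":
    n = 10
    a, b, c = list(range(n)), [10 * v for v in range(n)], [0] * n
    threads = 4
    blocks = (n + threads - 1) // threads   # round up so every element is covered
    launch(vec_add, blocks, threads, a, b, c, n)
    print(c)
```

With 10 elements and 4 threads per block, 3 blocks are launched, so 12 "threads" run and the `i < n` guard makes the last two do nothing. The same guard appears in most real CUDA kernels.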