HISTORY

### TODO
  * consider data movement in nested loops. 
  * consider "move data from local to kernel"? 
  * move data to different storage on xcel?
  * stop using device scope flag insertion to split the host code from the device code: perform the splitting after the bound inference pass. 1. create a map from stage to device scope. 2. replace each hyper-node (i.e. the stages grouped into the same scope) with a new stage including only a kernel stmt.
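
A minimal sketch of the two-step splitting idea from the last TODO item, assuming stages can be represented as an ordered list of names. The function name `split_by_device_scope`, the `"host"`/`"fpga"` scope labels, and the `kernel_N` naming are hypothetical and used for illustration only; they are not part of the compiler.

```python
from itertools import groupby

def split_by_device_scope(stages, scope_of, host_scope="host"):
    """Collapse each run of adjacent non-host stages into one kernel stage.

    stages   : stage names in program order (hypothetical representation)
    scope_of : map from stage name to device scope (step 1 of the note)
    Returns the rewritten stage list and a map: kernel name -> grouped stages.
    """
    new_stages, kernels = [], {}
    # groupby only merges *adjacent* stages that share the same scope.
    for scope, run in groupby(stages, key=lambda s: scope_of[s]):
        run = list(run)
        if scope == host_scope:
            new_stages.extend(run)          # host stages stay as they are
        else:
            # Step 2: the hyper-node is replaced by a single kernel stage.
            name = "kernel_{}".format(len(kernels))
            kernels[name] = run
            new_stages.append(name)
    return new_stages, kernels

stages = ["load", "conv", "relu", "store"]
scope_of = {"load": "host", "conv": "fpga", "relu": "fpga", "store": "host"}
new_stages, kernels = split_by_device_scope(stages, scope_of)
print(new_stages)
print(kernels)
```

Here `conv` and `relu` form one hyper-node, so the host-side stage list keeps only a single kernel stage in their place.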

# 2020-02-19
  * how to factor channel into intel opencl backend?    

# 2020-02-09
  * support data packing of multiple variables
  * data padding for sequential data access  
  * correct location to insert "attr partition" in SDAccel 

# 2020-02-05
  * array partitioning heterocl module  
  * create multi-kernel program in host code generator
    * 1. How to determine whether a kernel stmt is categorized as a kernel or a sub-function: for the HLSC backend, channels are represented by hls::stream explicitly or as FIFO, which will be annotated by the STREAM pragma. For OpenCL, kernel functions need to be initialized and called by the host.
    * 2. change the stream type to be backend agnostic ()

# 2020-02-04
  * generate the host code & Makefile for simulation
  * added constant local buffer in heterocl module
  * fixed reuse buffer dimension recovery issue

# 2020-02-03
  * 1. One more thing to fix: inter-kernel streaming integrated with memory reuse.

# 2020-02-02
    * 1. memory object allocation + copy data + enqueue tasks before top_function and data fetching at the end of the function call (Important: To handle pipes between kernels, different kernels but sub functions are required)

    * 2. do we still want to reuse the OpenCL utility functions from rosetta? They are not portable or fine-grained enough, and they require extra effort to pass the IR information (e.g. size, data type) to the wrapper code generator. The solution is to push the OpenCL host code to a separate code generator inherited from CodeGenC, and create initialization code for the Kernel Stmt IR.

    * 3. update the simulation flow: we separate the simulation into two stages: a) generate wrapper code and compile. The wrapper code copies the data from the python front-end to the execution engine (e.g. TVM PackedFunc); b) update and execute. The array args are automatically copied using the shared memory initialized in the wrapper code. However, the scalar variables need to be updated. **To summarize**, the simulation enables a one-time compilation, which frees the users from compiling the program every time the function is called.
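
The compile-once, update-and-execute flow in item 3 can be sketched roughly as below. This is an illustration only: `OneTimeSim` and `build_scale_kernel` are hypothetical names, a plain Python list stands in for the shared memory, and an ordinary closure stands in for the compiled wrapper.

```python
class OneTimeSim:
    """Compile on the first call only; later calls just update and execute."""

    def __init__(self, build):
        self._build = build        # hypothetical: returns a callable executable
        self._exe = None
        self.compile_count = 0
        self.scalars = {}

    def __call__(self, arrays, **scalars):
        if self._exe is None:
            # Stage a): generate the wrapper code and compile (done once).
            self._exe = self._build()
            self.compile_count += 1
        # Stage b): scalar arguments must be refreshed on every call, while
        # array arguments are shared by reference and need no explicit copy.
        self.scalars.update(scalars)
        self._exe(arrays, self.scalars)


def build_scale_kernel():
    # Stand-in for the compiled program: scales buffer "A" in place by k.
    def run(arrays, scalars):
        buf = arrays["A"]
        for i in range(len(buf)):
            buf[i] *= scalars["k"]
    return run


sim = OneTimeSim(build_scale_kernel)
A = [1, 2, 3]
sim({"A": A}, k=2)   # compiles, then runs
sim({"A": A}, k=3)   # reuses the compiled program with an updated scalar
print(A, sim.compile_count)
```

The second call skips compilation entirely, which is the point of the one-time compilation described above.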

# 2020-02-01
  * TODO: create opencl host codegen
    * 1. Annotate KernelStmt & KernelDef in stream inference IR pass (if a kernel calls a global variable, then the storage scope of the arg in kernel def should be annotated as global)
    * 2. check the data shape when calling the heterocl module 

# 2020-01-31
  * create reusable conv2d function in hlib
  * fixed clang seg fault (kernel name mis-matching)

# 2020-01-30
  * fixed issues of wrong input buffer in schedule_dataflow
  * update the SDAccel codegen
    * 1. macro for opencl 1.2 legacy support
    * 2. remove `__local` identifier for allocate ir 
    * 3. check the sequence of kernel def string 
    * 4. for variable type arg in top func ()

# 2020-01-28
  * update the HLSC & OpenCL codegen for SDAccel
    * 1. channel read & write on host and top_function on device (ignore the nested for loops to transmit data through channels)
    * 2. channels between kernel functions (create opencl channels in dataflow mode where one kernel function, which is decorated with dataflow attribute, wrapping multiple sub-functions + blocking pipes for multiple kernels)
    * 3. KernelDef IR printer for SDAccel (__kernel attribute + dataflow)

# 2020-01-25
  * Updated streaming ir pass to perform device scope grouping & the compute being offloaded is modelled as a KernelDef and corresponding KernelStmt (How to manage the streaming channel allocation between host & device: create two identical buffers in host & device)

# 2020-01-21
  * Handle Host Device Splitting (How to generate the top function calls from the host. Refer to HostDeviceSplitter in TVM to mutate all body in attr stmts marked with thread_extent or pipeline_exec_scope. These stmts are considered as Call IR nodes and wrapped into LoweredFunc) 
    * 1. modify the stream inference ir pass (first analyze all the device_scope attrs and use a unified attr stmt to group all adjacent stmts. Then split the program at the device scope attr stmt body, and wrap the body with a Call node with all argument information included)
    * 2. modified the `move_to` scheduling in HeteroCL (alloc buffer does not match with var. We need to bind the arg to the tensor in StorageFlatten pass)
  * Fix SDAccel flow and argument partitioning 
    * updated CodeGenHLSC (if no data movement attr is detected, all compute will be placed onto the host C program generated by CodeGenHLSC to avoid the CHECK arguments in the beginning)
    * updated CodeGenSDAccel (StreamStmt & StreamExpr with channel index greater than 0 are considered as OpenCL channels)
    * update GetHost and GetDevice Function in CodeGen (1. do not generate the interface.cpp with no data movement attr. i.e., the whole program running on host x86; 2. update the argument analysis function.) 

# 2020-01-20
  * handle self-feedback streaming with multiple rds/wrs (if the user specifies which load/store to be streamed in a multiple loads/stores kernel, the compiler does not create local buffers)
  * modified ir UseDefAnalysis pass to support separate analysis in kernel def 
  * create common channel buffer for inter-module streaming 
  * use constraint to check index access mismatching: remove the checking by comparing the itervar range map size. Make sure the extracted information is the same (e.g. size, extents). If a mismatch happens, insert a new buffer blocked with the current stmt (better to apply compute_at)

## 2020-01-19
  * updated hlib (support variable size function arguments in hcl module; added hcl function / math library + test cases)
  * fixed reduction issue for declarative stmts in hcl module
  * fixed broadcasting issue in inter-module streaming (local buffer)
  * serialize inner loop containing stream expr & stmts 
  * improve the kernel updater: the current updater replaces every rd/wr. Need to check whether the rd is in a condition or not. Only update the right one (especially for the rd case).

## 2020-01-18
  * updated digitrec KNN example on SDAccel
  * added hlib math related utility functions
  * added auto-scheduling for data reuse etc.
  * updated hlib nn to include activation in conv and dense layer 

## 2020-01-12
  * added vgg16 & alexnet in examples (with reusable conv and dense layers)
  * updated hlib nn utility functions (LRN and padding)
  * updated the runner function of unified sim flow 
  * modified hcl reducer to allow tvm expr range 
  * fully support declarative primitives in hcl module 
  * initial commit for cost model for control flow graph analysis 
  * optimized data structure in unified sim flow

## 2020-01-03
  * updated optical flow example 
  * added streaming example for resnet 18
  * updated the codegen for device & host code splitting (push the top calling into codegen phase instead of ir level): create vector<Array<Var>>, we only push the streamed variable into the list

## 2019-12-29
  * model communication cost & perform schedule before ir level

## 2019-12-24
  * support compute primitive in hcl.def_
  * port optical flow example from rosetta

## 2019-12-22
  * updated digit rec flow with backend test cases
  * updated `stream_to` primitive: when streaming an on-chip buffer from one kernel to another, all related kernel stmts (supposedly only two kernels) will be marked in the annotation. ir passes in a later stage can add a pragma to the alloc buffer stmt (with the buffer info in the annotation)
  * updated vivado codegen backend: in simulation mode we can utilize `hls::stream`, which is an incompatible type in sdsoc. We added a new mode in Vivado codegen to avoid potential issues.

## 2019-12-21
  * updated the python interface (execution & information lookup to avoid passing options)
  * updated the sdsoc backend (sim + impl) 
    * added flag in `VivadoHLS` codegen to generate SDSoc compatible pragmas
    * updated the harness files (Makefile / Headers)
  * support `.to()` to device in llvm simulation ???
  * fixed buffer alloc stmt for channel in `move_to`

### 2019-12-20
  * updated digitrec example for in-kernel schedule
  * hard-fixed the seg fault error of reuse buffer
  * pushed the execution flow to python 
  * updated the shared memory generation function
  * added the implementation flow (c++ & python interface)
    * create typedef alternatives for vivado csim
    * support shared memory generation in csim

### 2019-12-19
  * fixed issue with self loop feedback 
  * fixed the void kernel case (no data movement)
  * update inter-kernel schedule & vivado hls backend
  * cleaned up codegen streaming related data structure (WIP)
  * move internal buffer in kernel (which copy to move ?)

### 2019-12-18
  * updated stream ir pass to remove buffer named after kernel 
  * updated initializer for scalar argument 
  * removed auto buffer replacement for data movement  

### 2019-12-18
  * updated digit recognition example
  * simulation timer for vivado csim flow
  * fixed pattern mismatch issue of buffer with pure reduce axis

### 2019-12-17
  * fixed vivado hls backend casting issue with streaming sender index
  * added graph information for kernel stmt traversal 
  * create itervar array for building stream buffer op node 

### 2019-12-16
  * ir_pass level scheduling for kernel function calls
  * remove nested device scope in stream_inference ir pass
  * check access pattern of stream expr & corresponding stream stmt
    * push access information into stream expr & stmt nodes
    * (deprecated) create local buffer if access index mismatch  
    * mutate stream sender to keep same access pattern 

### 2019-12-11
  * fixed the FIFO depth issue (kernel updater)
  * updated the stream inference ir pass in lower function 
  * added position and channel info to kernel stmt / expr node

### 2019-12-10
  * pushed arg position inference to python frontend
  * allocate var id for StreamStmt out of kernel def 
  * remove TypeCollector and update tuple with struct 

### 2019-12-09 
  * fixed issue of zc706 simulation 
    * (deprecated) remove kernel-name variable allocation before KernelDef
    * change multi-dimension array access to row-major single-dimension access
    * create local buffer for each on-device variable
    * updated the `KernelUpdater` class (using position index instead of name)
    * added `stream_arg_pos` map in `CodeGenC` to facilitate codegen with streaming
  * fixed test cases in python 2/3 
    * changed tvm `build` function to support legacy string type target 
    * fixed opencl aocl data type mismatching issue
    * fixed kernel def data type conversion issue
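
The change from multi-dimensional array access to row-major single-dimension access (2019-12-09) follows the usual row-major index formula. A small sketch, assuming static shapes; `flatten_index` is a hypothetical helper for illustration, not a function in the codegen:

```python
def flatten_index(indices, shape):
    """Map a multi-dimensional index to a row-major flat offset.

    For shape (d0, d1, d2) and index (i, j, k) this returns
    (i * d1 + j) * d2 + k.
    """
    if len(indices) != len(shape):
        raise ValueError("index rank does not match array rank")
    flat = 0
    for idx, extent in zip(indices, shape):
        if not 0 <= idx < extent:
            raise IndexError("index out of bounds")
        # Horner-style accumulation: the last dimension varies fastest.
        flat = flat * extent + idx
    return flat

print(flatten_index((1, 2), (3, 4)))        # A[1][2] in a 3x4 array
print(flatten_index((1, 2, 3), (2, 3, 4)))  # B[1][2][3] in a 2x3x4 array
```

With this mapping, `A[1][2]` in a 3x4 array becomes `A[6]`, and `B[1][2][3]` in a 2x3x4 array becomes `B[23]`.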