Support Tapa HLS backend #269

Merged
merged 18 commits into cornell-zhang:main on Dec 3, 2024
Conversation

@EthanMeng324
Contributor

Description

This PR adds a Tapa HLS (https://github.com/rapidstream-org/rapidstream-tapa/) backend for Allo. It is mainly designed for the Allo dataflow programming interface; the original scheduling interface has not been tested yet. This backend adds new kernel codegen, host codegen, and makefile codegen, and its basic usage also differs from Vitis HLS. The makefile targets are as follows.

make csim: A fast software simulation that relies only on kernel.cpp and tapa_host.cpp.

make fast_hw_emu: A fast hardware emulation similar to hw_emu, but without needing to generate an .xclbin.

make run TARGET=<hw_emu/hw>: sw_emu is no longer supported with make run; use csim instead.

Examples

To use the Tapa HLS backend, simply choose "tapa" as the target when building. For example:

@df.region()
def top():
    @df.kernel(mapping=[P0, P1])
    def gemm(A: int32[M, K], B: int32[K, N], C: int32[M, N]):
        ...

mod = df.build(top, target="tapa", mode="csim")
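
For csim, the built module can then be invoked directly on NumPy arrays, as with Allo's other dataflow backends. A minimal sketch, assuming concrete sizes for M, K, and N and that the elided gemm body computes C = A @ B:

import numpy as np

M, K, N = 32, 32, 32  # hypothetical sizes; reuse the kernel's constants

A = np.random.randint(0, 8, (M, K)).astype(np.int32)
B = np.random.randint(0, 8, (K, N)).astype(np.int32)
C = np.zeros((M, N), dtype=np.int32)

mod(A, B, C)                             # runs the generated kernel under csim
np.testing.assert_array_equal(C, A @ B)  # check against NumPy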

Issues

  1. Due to the GLIBC version incompatibility between our server and Tapa, tapa g++ and tapa compile are currently not runnable on our server. This means we have to generate the Tapa executable and .xo file in a Docker container, copy the generated files, and run the actual testing on our server (you can use the Docker image ethanmeng324/tapa:v3.0). This will be resolved in the future with an alternative option that implicitly goes through this process (launch the Docker container, generate the files, copy the results, continue running).
  2. The current codegen for Tapa does not support multi-dimensional array access because of some tricky issues with tapa::mmap and tapa::vec_t. Our current solution is to flatten the array access, e.g. changing a[1][1] to a[1 * 16 + 1], where 16 is the size of the innermost dimension. Because of this issue, using input and output array buffers as an L3 cache is not supported in Tapa. However, we will change the input and output buffers to stream types in the future, which will solve this problem (see the sketch after this list).
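
For reference, the flattening is ordinary row-major linearization. A small NumPy sketch (hypothetical shapes) of the index mapping the codegen applies:

import numpy as np

rows, cols = 4, 16                  # hypothetical 2-D shape
a = np.arange(rows * cols).reshape(rows, cols)
flat = a.reshape(-1)                # the flattened view the codegen emits

i, j = 1, 1
# a[1][1] becomes flat[1 * 16 + 1]: the multiplier is the inner-dimension size
assert a[i, j] == flat[i * cols + j]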

Checklist

  • PR's title starts with a category (e.g. [Bugfix], [IR], [Builder], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage (It would be better to provide ~2 different test cases to test the robustness of your code)
  • Code is well-documented

@chhzh123
Member

Can you attach a simple program and the generated TAPA code as a comment in this PR?

@chhzh123
Member

Thanks for contributing! This PR is very comprehensive!

  1. Can you provide instructions on how to build and run TAPA programs from Allo? Based on the description, it can only generate the TAPA C++ file and requires users to explicitly invoke Docker, right?
  2. Is it possible to reuse the EmitVivadoHLS pass? It seems most of the facilities are the same, but only the function generation logic needs to be changed. Copying the whole implementation may make it hard to maintain afterwards.

@EthanMeng324
Contributor Author

> Thanks for contributing! This PR is very comprehensive!
>
> 1. Can you provide instructions on how to build and run TAPA programs from Allo? Based on the description, it can only generate the TAPA C++ file and requires users to explicitly invoke Docker, right?
> 2. Is it possible to reuse the EmitVivadoHLS pass? It seems most of the facilities are the same, but only the function generation logic needs to be changed. Copying the whole implementation may make it hard to maintain afterwards.

Thanks for reviewing! For the questions:

  1. That might be a little complicated for now. For csim and fast_hw_emu, we can just use the Docker container, go into the generated top.prj folder, and run make csim or make fast_hw_emu. For hw_emu or hw, we should first run make all TARGET=hw_emu or make all TARGET=hw in the Docker container, then copy the generated files to the server under the top.prj folder, and then run the same command again on the server.
  2. There are actually a lot of small changes in many parts, and there will be more in the future. I guess if we reuse EmitVivadoHLS there might be a lot of if-else statements, which can get pretty messy, like the makefile generation code. I personally think it's better to keep it separate, like the Intel HLS backend.

@chhzh123
Member

> There are actually a lot of small changes in many parts, and there will be more in the future.

Can you be specific about which parts involve many small changes? I thought only the function interfaces are different.

@EthanMeng324
Contributor Author

> Can you attach a simple program and the generated TAPA code as a comment in this PR?

Take this simple 2 × 2 tiled gemm as an example:

@df.region()
def top():
    @df.kernel(mapping=[P0, P1])
    def gemm(A: float32[M, K], B: float32[K, N], C: float32[M, N]):
        pi, pj = df.get_pid()
        for i in range(pi * Mt, (pi + 1) * Mt):
            for j in range(pj * Nt, (pj + 1) * Nt):
                for k in range(K):
                    C[i, j] += A[i, k] * B[k, j]

The generated Tapa HLS code is as follows:

void gemm_0_0(
  tapa::mmap<float> v0,
  tapa::mmap<float> v1,
  tapa::mmap<float> v2
) {	// L2
  l_S_i_0_i: for (int i = 0; i < 16; i++) {	// L3
    l_S_j_0_j: for (int j = 0; j < 16; j++) {	// L4
      l_S_k_0_k: for (int k = 0; k < 32; k++) {	// L5
        float v6 = v0[((i * 32) + k)];	// L6
        float v7 = v1[((k * 32) + j)];	// L7
        float v8 = v6 * v7;	// L8
        float v9 = v2[((i * 32) + j)];	// L9
        float v10 = v9 + v8;	// L10
        v2[((i * 32) + j)] = v10;	// L11
      }
    }
  }
}

void gemm_0_1(
  tapa::mmap<float> v11,
  tapa::mmap<float> v12,
  tapa::mmap<float> v13
) {	// L17
  l_S_i_0_i1: for (int i1 = 0; i1 < 16; i1++) {	// L18
    l_S_j_0_j1: for (int j1 = 0; j1 < 16; j1++) {	// L19
      int v16 = (j1 + 16);	// L19
      l_S_k_0_k1: for (int k1 = 0; k1 < 32; k1++) {	// L20
        float v18 = v11[((i1 * 32) + k1)];	// L21
        float v19 = v12[((k1 * 32) + v16)];	// L22
        float v20 = v18 * v19;	// L23
        float v21 = v13[((i1 * 32) + v16)];	// L24
        float v22 = v21 + v20;	// L25
        v13[((i1 * 32) + v16)] = v22;	// L26
      }
    }
  }
}

void gemm_1_0(
  tapa::mmap<float> v23,
  tapa::mmap<float> v24,
  tapa::mmap<float> v25
) {	// L32
  l_S_i_0_i2: for (int i2 = 0; i2 < 16; i2++) {	// L33
    int v27 = (i2 + 16);	// L33
    l_S_j_0_j2: for (int j2 = 0; j2 < 16; j2++) {	// L34
      l_S_k_0_k2: for (int k2 = 0; k2 < 32; k2++) {	// L35
        float v30 = v23[((v27 * 32) + k2)];	// L36
        float v31 = v24[((k2 * 32) + j2)];	// L37
        float v32 = v30 * v31;	// L38
        float v33 = v25[((v27 * 32) + j2)];	// L39
        float v34 = v33 + v32;	// L40
        v25[((v27 * 32) + j2)] = v34;	// L41
      }
    }
  }
}

void gemm_1_1(
  tapa::mmap<float> v35,
  tapa::mmap<float> v36,
  tapa::mmap<float> v37
) {	// L47
  l_S_i_0_i3: for (int i3 = 0; i3 < 16; i3++) {	// L48
    int v39 = (i3 + 16);	// L48
    l_S_j_0_j3: for (int j3 = 0; j3 < 16; j3++) {	// L49
      int v41 = (j3 + 16);	// L49
      l_S_k_0_k3: for (int k3 = 0; k3 < 32; k3++) {	// L50
        float v43 = v35[((v39 * 32) + k3)];	// L51
        float v44 = v36[((k3 * 32) + v41)];	// L52
        float v45 = v43 * v44;	// L53
        float v46 = v37[((v39 * 32) + v41)];	// L54
        float v47 = v46 + v45;	// L55
        v37[((v39 * 32) + v41)] = v47;	// L56
      }
    }
  }
}

void top(
  tapa::mmap<float> v48,
  tapa::mmap<float> v49,
  tapa::mmap<float> v50
) {	// L62
  tapa::task()
  .invoke(gemm_0_0, v48, v49, v50)	// L63
  .invoke(gemm_0_1, v48, v49, v50)	// L64
  .invoke(gemm_1_0, v48, v49, v50)	// L65
  .invoke(gemm_1_1, v48, v49, v50);	// L66
}
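
The loop bounds above (16 and 32) correspond to M = N = K = 32 with a 2 × 2 mapping, so each kernel instance computes one 16 × 16 output tile, and top launches the four instances as concurrent tasks via tapa::task().invoke. A hedged reconstruction of the driver for this example, reusing the top region defined above (the import path is assumed):

import allo.dataflow as df   # assumed import path for Allo's dataflow interface

M, N, K = 32, 32, 32         # sizes matching the generated loop bounds
P0, P1 = 2, 2                # 2 x 2 kernel mapping
Mt, Nt = M // P0, N // P1    # 16 x 16 output tile per kernel instance

mod = df.build(top, target="tapa", mode="csim")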

@EthanMeng324
Contributor Author

> > There are actually a lot of small changes in many parts, and there will be more in the future.
>
> Can you be specific about which parts involve many small changes? I thought only the function interfaces are different.

Currently, the functions that differ are getTypeName, emitValue, emitArrayDecl, emitAffineLoad, emitAffineStore, emitCall, emitLoopDirectives, emitFunctionDirectives, emitFunction, and emitModule.

@chhzh123
Member

I think a better way to do this is to provide a base class for EmitHLS, from which the Vivado, Intel, and TAPA backends all inherit, so only the backend-specific functions need to be overridden instead of creating if-else branches in the same file.
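
To illustrate the proposed structure (the real emitters are C++ translation passes; this is a Python-style sketch with hypothetical names, not the actual API):

class EmitHLSBase:
    # Shared facilities (type names, expressions, loop emission) live here.
    def emitFunction(self, func):
        raise NotImplementedError  # backend-specific interface generation

class EmitVivadoHLS(EmitHLSBase):
    def emitFunction(self, func):
        ...  # plain pointer arguments plus #pragma HLS interface directives

class EmitTapaHLS(EmitHLSBase):
    def emitFunction(self, func):
        ...  # tapa::mmap arguments; top wires tasks via tapa::task().invoke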

@EthanMeng324
Contributor Author

> I think a better way to do this is to provide a base class for EmitHLS, from which the Vivado, Intel, and TAPA backends all inherit, so only the backend-specific functions need to be overridden instead of creating if-else branches in the same file.

That actually makes great sense. Do you want me to include it in this PR?

@chhzh123
Member

Maybe not in this PR, as it requires lots of code changes, but I think you can annotate the functions that differ from the Vivado HLS backend (with just one line of comment each).

@EthanMeng324
Contributor Author

> Maybe not in this PR, as it requires lots of code changes, but I think you can annotate the functions that differ from the Vivado HLS backend (with just one line of comment each).

Sure, just updated.

@chhzh123 merged commit 9c48a6a into cornell-zhang:main on Dec 3, 2024
1 check passed