cmd/compile: performance of go wasm is very poor

### Go version

go version go1.21.6 linux/arm64

### Output of `go env` in your module/workspace:

```shell
GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='linux'
GOINSECURE=''
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPRIVATE=''
GOSUMDB='off'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOVCS=''
GOVERSION='go1.21.6'
GCCGO='gccgo'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3484327849=/tmp/go-build -gno-record-gcc-switches'
```


### What did you do?

The following example, ```test.go```, is compiled to wasm:

test.go:
```
package main

import (
	"fmt"
	"time"
)

// testFor is kept out of line so that it appears as its own function in the wasm output.
//
//go:noinline
func testFor() int {
	sum := 0
	for i := 0; i < 200; i++ {
		for j := 0; j < 10000000; j++ {
			sum += j
		}
	}
	return sum
}

func main() {
	startTime := time.Now()

	res := testFor()
	elapsed := time.Since(startTime)
	fmt.Println("Done. Cost: ", elapsed, res)
}
```
Command used to compile Go to wasm:
```GOOS=wasip1 GOARCH=wasm go build -o test.wasm test.go```


### What did you see happen?

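As the wat code shows, the nested ```for``` loops are not emitted as nested ```loop``` constructs. Instead, the function has a single outer ```loop``` whose body dispatches to the next basic block through a ```br_table``` operation.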

wat code:
```
  (func $main.testFor (type 0) (param i32) (result i32)
    (local i32 i64 i64 i64 i64)
    global.get 0
    local.set 1
    loop  ;; label = @1
      block  ;; label = @2
        block  ;; label = @3
          block  ;; label = @4
            block  ;; label = @5
              block  ;; label = @6
                block  ;; label = @7
                  block  ;; label = @8
                    block  ;; label = @9
                      local.get 0
                      br_table 0 (;@9;) 1 (;@8;) 2 (;@7;) 3 (;@6;) 4 (;@5;) 5 (;@4;) 6 (;@3;) 7 (;@2;)
                    end
                    i64.const 0
                    local.set 2
                    i64.const 0
                    local.set 3
                    i32.const 2
                    local.set 0
                    br 7 (;@1;)
                  end
                  local.get 2
                  i64.const 1
                  i64.add
                  local.set 2
                end
                local.get 2
                i64.const 200
                i64.lt_s
                i32.eqz
                if  ;; label = @7
                  i32.const 4
                  local.set 0
                  br 6 (;@1;)
                end
              end
              i64.const 0
              local.set 4
              i32.const 6
              local.set 0
              br 4 (;@1;)
            end
            local.get 1
            i64.extend_i32_u
            i64.const 8
            i64.add
            i32.wrap_i64
            local.get 3
            i64.store
            local.get 1
            i32.const 8
            i32.add
            local.tee 1
            global.set 0
            i32.const 0
            return
          end
          local.get 4
          i64.const 1
          i64.add
          local.set 5
          local.get 3
          local.get 4
          i64.add
          local.set 3
          local.get 5
          local.set 4
        end
        local.get 4
        i64.const 10000000
        i64.lt_s
        if  ;; label = @3
          i32.const 5
          local.set 0
          br 2 (;@1;)
        end
        i32.const 1
        local.set 0
        br 1 (;@1;)
      end
    end
    unreachable)
```
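The control flow in the wat above can be paraphrased in Go roughly as follows. This is a hand-written sketch, not compiler output: `testForDispatch`, `testForStructured`, and the `state` variable are hypothetical names, the loop bounds are passed as parameters so the example runs quickly, and the stack store that precedes `return` in the wat is simplified to a plain `return sum`. Where the wat falls through from one block into the next, the sketch assigns the next state explicitly.

```go
package main

import "fmt"

// testForDispatch mirrors the shape of the wat: one outer loop and a switch
// on a block index, which plays the role of the br_table on local 0.
// The names and the parameterized bounds are assumptions for illustration.
func testForDispatch(outer, inner int64) int64 {
	var i, j, sum int64
	state := 0
	for {
		switch state {
		case 0: // entry: i = 0, sum = 0
			i, sum = 0, 0
			state = 2
		case 1: // i++ (the wat falls through into the check below)
			i++
			state = 2
		case 2: // outer loop condition
			if i < outer {
				state = 3
			} else {
				state = 4
			}
		case 3: // j = 0
			j = 0
			state = 6
		case 4: // function exit
			return sum
		case 5: // inner loop body: sum += j; j++
			sum += j
			j++
			state = 6
		case 6: // inner loop condition
			if j < inner {
				state = 5
			} else {
				state = 1
			}
		default:
			panic("unreachable")
		}
	}
}

// testForStructured is the same computation written with ordinary nested loops.
func testForStructured(outer, inner int64) int64 {
	var sum int64
	for i := int64(0); i < outer; i++ {
		for j := int64(0); j < inner; j++ {
			sum += j
		}
	}
	return sum
}

func main() {
	// Both forms compute the same value; only the control-flow shape differs.
	fmt.Println(testForDispatch(3, 5), testForStructured(3, 5))
}
```

Every iteration of the inner loop passes through the dispatch twice (states 5 and 6), which is the pattern an AOT backend would have to see through in order to treat it as a counted loop.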

### What did you expect to see?

When the AOT compiler of a wasm runtime performs backend optimization, it is difficult for it to identify the ```br_table``` dispatch as a ```for``` loop. As a result, this ```for``` loop was not optimized.

I tested several of the most popular wasm runtimes (wasmtime, wamr, and wasmer) and found that Go wasm performs very poorly after AOT compilation: in the best case, runtime performance was only 20% of native Go.
Why does the compiler use so many ```br_table``` operations instead of ```loop``` operations? Will the performance of Go wasm be optimized in the future?

I also found that the wat code of the Go runtime functions uses ```br_table``` heavily; the most extreme function has 417 jump targets in its ```br_table```.

![image](https://github.com/golang/go/assets/10509166/fd215beb-8c22-4b56-95e0-1df9c40a395c)

