Apple Silicon GPU Support Possible? #1545
Comments
You can see the author's previous notes on attempting to use the M1 GPU on GPT-J: https://github.com/ggerganov/ggml/tree/master/examples/gpt-j#attempt-to-use-the-m1-gpu Basically he found that with Apple's unified memory architecture, the bottleneck is memory bandwidth rather than pure compute. |
@evanmiller This info is outdated (and likely wrong) - I am working on offloading the full computation on the M1 GPU with custom kernels and hope to do it properly this time around. |
Can't wait for this, @ggerganov |
I've been trying to understand more about MPS and I found a few resources that helped. Philip Turner has been doing interesting work with Metal:
Also, I learned a lot about the limitations of MPS from this pytorch thread, titled "MPS device appears much slower than CPU on M1 Mac Pro." It's an old thread but it still has current activity: I thought I would leave these links in case it helps someone else. |
Context size 2048, 512 tokens, LLaMa 6.7B (3.9 GB)
Theoretically it should be able to utilize more bandwidth. I think I can make this an order of magnitude faster. It will require a ton of tuning to align memory transactions and utilize ~376 GB/s of the bandwidth. Use triangular FlashAttention for long context, with dynamic work redistribution to keep GPU cores fully utilized.
Also, has anyone considered lane compression, so I can run 65B-q4 on 32 GB RAM with reasonable speed? |
I'm trying to fill in this table. Next is the PyTorch MPS fork of LLaMa.cpp that's slower than CPU. Latency per 512 tokens:
|
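To make the bandwidth argument above concrete, here is a minimal back-of-the-envelope sketch in Swift; the 3.9 GB weight size and the ~376 GB/s figure are taken from the comment above as assumptions, not measurements:

import Foundation

// Back-of-the-envelope decode latency for a memory-bound GEMV workload:
// every generated token has to stream all quantized weights from RAM once.
let weightBytes = 3.9e9        // LLaMA 6.7B at 4-bit, per the comment above
let peakBandwidth = 376e9      // assumed achievable bandwidth in bytes/s

// Lower bound on per-token latency if the GPU saturated the memory bus.
let secondsPerToken = weightBytes / peakBandwidth
let secondsPer512 = secondsPerToken * 512
print(String(format: "%.1f ms/token, %.1f s per 512 tokens",
             secondsPerToken * 1e3, secondsPer512))
// ~10.4 ms/token and ~5.3 s per 512 tokens; anything slower than this is
// leaving memory bandwidth on the table.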
Very happy to see @philipturner in this thread. @ggerganov , philipturner is an expert with MPS, M1 architecture, and GPU computation in general. I believe there is a lot of hard-won esoteric knowledge when it comes to optimizing for this architecture. So excited to see where this leads. |
I want an honest comparison. Please make LLaMa.cpp as fast as possible; I think I can make something faster. An open-source successor to MPS.
Why not AMD and Intel too? |
@philipturner have you taken a look at https://github.com/mlc-ai/web-llm, especially https://github.com/mlc-ai/mlc-llm? |
Can you quantify how much faster? |
@philipturner yes, MLC LLM is as fast as llama.cpp, like I said above |
I suspect it might be slightly slower, just like PyTorch MPS. It's one thing to use the GPU. It's another to use it right. If you use Metal the way it's designed, it should not be 10% slower than the CPU, but 300% faster. This is why I asked specifically for a number. Is it 0% faster? 50% faster? 50% slower? |
It's about the same 😄 I was surprised as well, because I had already read about your work. BTW I'm using an M2 Pro 16GB and the latest Ventura, don't know if that makes a difference. Have you tried it and found the result slower than llama.cpp? Using an M1? |
For llama.cpp I'm now using 13B models, so it's kinda slower compared to the model I use for MLC (7B). Yes, MLC doesn't display any speed, but it really goes brrr 😄 Before, using the web version, I think it was about >20 tokens/s |
LLaMa.cpp command-line or MLC AI command-line? You gave me a number of 50 ms/token.
Can you get a screen recording of it? On your specific computer. |
The strange thing is that for Stable Diffusion it's not "that fast"
It gives 47 ms/token decoding. Compare that to LLaMa.cpp at ctx=512: 38 ms. |
Latency per 512 tokens:
Why does the tokens-per-second rate get monotonically slower as I get farther into the conversation (Web LLM)? |
I don’t really know what I’m doing, but I translated the core of LLaMA_MPS to MPSGraph. The hard part is figuring out how to load the weights. https://gist.github.com/philipturner/23e30121a6a898f501d03f117bfe6f92 |
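For readers unfamiliar with MPSGraph, here is a minimal, self-contained Swift sketch of the kind of graph involved (a single FP16 matrix-vector product); the shapes and names are illustrative only and are not taken from LLaMA_MPS or the gist above.

import Foundation
import Metal
import MetalPerformanceShadersGraph

func mpsGraphGemvSketch() {
    let device = MTLCreateSystemDefaultDevice()!
    let graph = MPSGraph()

    // Placeholders for one weight matrix and an input vector. Shapes are
    // illustrative (LLaMA 6.7B hidden size 4096).
    let w = graph.placeholder(shape: [4096, 4096], dataType: .float16, name: "w")
    let x = graph.placeholder(shape: [4096, 1], dataType: .float16, name: "x")
    let y = graph.matrixMultiplication(primary: w, secondary: x, name: "y")

    // Bind zero-filled host data to the placeholders; real code would load
    // the converted weights here (the "hard part" mentioned above).
    let gDevice = MPSGraphDevice(mtlDevice: device)
    let wData = MPSGraphTensorData(
        device: gDevice,
        data: Data(count: 4096 * 4096 * 2),   // 2 bytes per FP16 element
        shape: [4096, 4096], dataType: .float16)
    let xData = MPSGraphTensorData(
        device: gDevice,
        data: Data(count: 4096 * 2),
        shape: [4096, 1], dataType: .float16)

    let results = graph.run(
        feeds: [w: wData, x: xData], targetTensors: [y], targetOperations: nil)
    print(results[y]!.shape)   // [4096, 1]
}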
I got the neural network to run 3x faster. It's going to be several weeks before I publish the Metal code - can you wait until then? |
@philipturner I'll wait. Is it for your own repo or for llama.cpp? |
I'm making a repo that does a lot more than just optimize quantized GEMV. It's also multi-vendor. Should be easy to integrate into llama.cpp. |
Basically, I'm doing everything I can, so Apple platforms can get properly supported by Modular AI. It's a long ways away, but eventually, we won't need to make custom AI frameworks (e.g. GGML) just to run a language model fast. |
Cool. Edit: what do you think about MLC? Is it as fast as using Metal directly? |
MLC seems to dispatch to TVM, which uses neural networks to guess how to run a neural network the fastest way. There's a much simpler and faster solution to the matrix multiplication problem, which Modular implemented with flying colors. Also TVM only supports AI inference, not AI training. |
@philipturner thanks |
@ggerganov I guess it wouldn't hurt to drop the Q4 shader variant (not the full FlashAttention though). I recommend using metal-cpp instead of the ObjC or Swift bindings. Have fun making the Apple GPU go brrr 😄

CPU code:

import Metal
func testLLaMA() {
let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
let constants = MTLFunctionConstantValues()
// The shader declares `ncols` as a ushort function constant, so pass a
// matching 16-bit value (copying 2 bytes out of an 8-byte Int relies on
// byte order).
var ncols: UInt16 = 4096
constants.setConstantValue(&ncols, type: .ushort, index: 0)
// The LLaMA.cpp code is no longer functioning because I hard-coded the
// different dispatching heuristics for `FeedForward.encode`.
//let functionName = "dequantize_mul_mat_vec_q4_0"
let functionName = "gemv_q4_0"
var function = try! library.makeFunction(
name: functionName, constantValues: constants)
let w13Pipeline = try! device.makeComputePipelineState(function: function)
ncols = 4096 * 4
constants.setConstantValue(&ncols, type: .ushort, index: 0)
function = try! library.makeFunction(
name: functionName, constantValues: constants)
let w2Pipeline = try! device.makeComputePipelineState(function: function)
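// Q4_0 storage: 4 bits per weight plus one half-precision (2-byte) scale per
// block of 32 weights, i.e. elements/2 bytes of quants + elements/16 bytes of
// scales = 9/8 of the packed 4-bit size (matches the assert below).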
let matrixElements = 4096 * 4096 * 4
let matrixSize = matrixElements / 2 + (matrixElements / 32) * 2
let vectorSize = 4096 * 4
assert(matrixSize == 4096 * 16384 / 2 * 9 / 8)
// Run all 32 layers in quick succession to find asymptotic maximum bandwidth.
struct Vectors {
var x: MTLBuffer
var w1Val: MTLBuffer // quadruple the vector size
var w3Val: MTLBuffer // quadruple the vector size
var output: MTLBuffer
init(device: MTLDevice, vectorSize: Int) {
x = device.makeBuffer(length: vectorSize)!
w1Val = device.makeBuffer(length: vectorSize * 4)!
w3Val = device.makeBuffer(length: vectorSize * 4)!
output = device.makeBuffer(length: vectorSize)!
}
}
struct Context {
var vectors: Vectors
var w13Pipeline: MTLComputePipelineState
var w2Pipeline: MTLComputePipelineState
}
// Don't do anything with the vectors written back to RAM.
struct FeedForward {
var matrixSize: Int
var weights1: MTLBuffer
var weights2: MTLBuffer
var weights3: MTLBuffer
init(device: MTLDevice, matrixSize: Int) {
self.matrixSize = matrixSize
weights1 = device.makeBuffer(length: matrixSize)!
weights2 = device.makeBuffer(length: matrixSize)!
weights3 = device.makeBuffer(length: matrixSize)!
}
var totalMemory: Int {
weights1.length + weights2.length + weights3.length
}
func encode(encoder: MTLComputeCommandEncoder, context ctx: Context) {
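// Dispatch geometry for gemv_q4_0: 8 threads cooperate on each output row,
// so a 32-thread SIMD-group covers 4 rows (simdRowStride) and a threadgroup
// of 4 SIMD-groups (simdsPerGroup) covers 16 rows.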
let simdRowStride = 4
let simdsPerGroup = 4
encoder.setComputePipelineState(ctx.w13Pipeline)
encoder.setThreadgroupMemoryLength(4 * 32, index: 0)
encoder.setBuffer(weights1, offset: 0, index: 0)
encoder.setBuffer(weights1, offset: matrixSize * 8 / 9, index: 1)
encoder.setBuffer(ctx.vectors.x, offset: 0, index: 2)
encoder.setBuffer(ctx.vectors.w1Val, offset: 0, index: 3)
var ncols: UInt32 = 4096
var nrows: UInt32 = 4096 * 4
encoder.setBytes(&ncols, length: 4, index: 4)
encoder.dispatchThreadgroups(
MTLSizeMake(Int(nrows) / simdRowStride / simdsPerGroup, 1, 1),
threadsPerThreadgroup: MTLSizeMake(32 * simdsPerGroup, 1, 1))
encoder.setComputePipelineState(ctx.w13Pipeline)
encoder.setThreadgroupMemoryLength(4 * 32, index: 0)
encoder.setBuffer(weights3, offset: 0, index: 0)
encoder.setBuffer(weights3, offset: matrixSize * 8 / 9, index: 1)
encoder.setBuffer(ctx.vectors.x, offset: 0, index: 2)
encoder.setBuffer(ctx.vectors.w3Val, offset: 0, index: 3)
encoder.setBytes(&ncols, length: 4, index: 4)
encoder.dispatchThreadgroups(
MTLSizeMake(Int(nrows) / simdRowStride / simdsPerGroup, 1, 1),
threadsPerThreadgroup: MTLSizeMake(32 * simdsPerGroup, 1, 1))
encoder.setComputePipelineState(ctx.w2Pipeline)
encoder.setThreadgroupMemoryLength(4 * 32, index: 0)
encoder.setBuffer(weights2, offset: 0, index: 0)
encoder.setBuffer(weights2, offset: matrixSize * 8 / 9, index: 1)
encoder.setBuffer(ctx.vectors.w3Val, offset: 0, index: 2)
encoder.setBuffer(ctx.vectors.output, offset: 0, index: 3)
ncols = 4096 * 4
nrows = 4096
encoder.setBytes(&ncols, length: 4, index: 4)
encoder.dispatchThreadgroups(
MTLSizeMake(Int(nrows) / simdRowStride / simdsPerGroup, 1, 1),
threadsPerThreadgroup: MTLSizeMake(32 * simdsPerGroup, 1, 1))
}
}
let numLayers = 32
let vectors = Vectors(device: device, vectorSize: vectorSize)
let context = Context(
vectors: vectors, w13Pipeline: w13Pipeline, w2Pipeline: w2Pipeline)
var layers: [FeedForward] = []
for _ in 0..<numLayers {
let layer = FeedForward(device: device, matrixSize: matrixSize)
layers.append(layer)
}
let numTrials = 10
var maxBandwidth: Float = 0
for _ in 0..<numTrials {
let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
for layer in layers {
layer.encode(encoder: encoder, context: context)
}
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
let time = commandBuffer.gpuEndTime - commandBuffer.gpuStartTime
let bytes = Double(numLayers * layers[0].totalMemory)
let bandwidth = bytes / time / 1e9
maxBandwidth = max(maxBandwidth, Float(bandwidth))
}
print(Float(maxBandwidth), "GB/s")
}

GPU code:

#include <metal_stdlib>
using namespace metal;
// Perform a feedforward layer of LLaMA-6.7B. Ensure you are cycling through
// 32 instances of the layer weights, otherwise they will fall into the
// system-level cache.
//
// Inference from both the LLaMA.cpp format and a bandwidth-optimized format.
#define COMPILE_LEGACY_LLAMA_CPP_OPENCL_SHADER 0
#if COMPILE_LEGACY_LLAMA_CPP_OPENCL_SHADER
// Reference implementation from LLaMA.cpp. The custom implementation stores
// weights in a different format.
#define QK4_0 32
#define QR4_0 2
struct __attribute__ ((packed)) block_q4_0
{
half d;
uint8_t qs[QK4_0 / 2];
};
void dequantize_q4_0
(
const device block_q4_0* x,
const int ib,
const int iqs,
thread float* v0,
thread float* v1)
{
const float d = float(x[ib].d);
const uint8_t vui = x[ib].qs[iqs];
const int8_t vi0 = vui & 0xF;
const int8_t vi1 = vui >> 4;
*v0 = (vi0 - 8)*d;
*v1 = (vi1 - 8)*d;
}
// Original:
// 88.5 GB/s
// Don't read out-of-bounds vector data:
// 96.0 GB/s
kernel void dequantize_mul_mat_vec_q4_0
(
device block_q4_0* x [[buffer(0)]],
threadgroup float* tmp [[threadgroup(0)]],
device float* y [[buffer(2)]],
device float* dst [[buffer(3)]],
constant uint &ncols [[buffer(4)]],
uint block_size [[threads_per_threadgroup]],
uint global_id [[thread_position_in_grid]],
uint local_id [[thread_position_in_threadgroup]])
{
const uint row = global_id / block_size;
const uint qk = QK4_0;
const uint qr = QR4_0;
const int y_offset = qr == 1 ? 1 : qk/2;
tmp[local_id] = 0;
for (uint i = 0; i < ncols/block_size; i += 2) {
const uint col = i*block_size + 2*local_id;
const uint ib = (row*ncols + col)/qk; // block index
const uint iqs = (col%qk)/qr; // quant index
const uint iybs = col - col%qk; // y block start index
// dequantize
float v0, v1;
dequantize_q4_0(x, ib, iqs, &v0, &v1);
// matrix multiplication
tmp[local_id] += v0 * y[iybs + iqs + 0];
tmp[local_id] += v1 * y[iybs + iqs + y_offset];
}
// sum up partial sums and write back result
threadgroup_barrier(mem_flags::mem_threadgroup);
for (uint s=block_size/2; s>0; s>>=1) {
if (local_id < s) {
tmp[local_id] += tmp[local_id + s];
}
threadgroup_barrier(mem_flags::mem_threadgroup);
}
if (local_id == 0) {
dst[row] = tmp[0];
}
}
#endif
// Original:
// 88.5 GB/s
// Switch from threadgroup to simdgroup sum:
// 90.2 GB/s
// Deinterleave the weights and scales:
// 98.5 GB/s
// Hard-code shader parameters and remove `if (local_id == 0)`:
// 139.4 GB/s
// Directly index `uint8_t` instead of a struct:
// 143.6 GB/s
// Don't read out-of-bounds vector data:
// 172.6 GB/s
// Read input vectors as half-precision:
// 193.8 GB/s
// Coalesce the accesses to y and read the correct value from `weights`:
// 199.4 GB/s
// Change threadgroup size from 32 to 64:
// 210.4 GB/s
// Change threadgroup size from 32 to 128:
// 211.7 GB/s
// Two rows per simd:
// 225.4 GB/s
// Four rows per simd:
// 226.7 GB/s
// Unroll two iterations of the loop / un-duplicate scale reads:
// 243.3 GB/s
// Coalesce two Y reads:
// 245.7 GB/s
// Perform both X reads at the same time:
// 253.9 GB/s
// Unroll four iterations of the loop:
// (BAD DATA) 292.4 GB/s
// Coalesce X reads:
// (BAD DATA) 353.6 GB/s
// Coalesce four Y reads and use the correct index within a row:
// (BAD DATA) 374.8 GB/s
// Change how the buffers are indexed:
// (BAD DATA) 406.9 GB/s
// Use the correct value for 'i':
// 304.0 GB/s
// Optimize how 'vui' is stored in registers:
// 306.1 GB/s
// Optimize the generation of the index for scales:
// 319.3 GB/s
constant ushort ncols [[function_constant(0)]];
kernel void gemv_q4_0
(
device uchar4 *weights [[buffer(0)]],
device half2 *scales [[buffer(1)]],
device half *y [[buffer(2)]],
device half *dst [[buffer(3)]],
uint tid [[thread_position_in_grid]])
{
// 8-wide groupings of threads, each thread reads 8 values per iteration.
// 'groupings of threads' != 'threadgroups'
#define WEIGHTS_PER_UINT 8
#define GROUPING_SIZE 8
uint row = tid / GROUPING_SIZE;
ushort local_id = tid % GROUPING_SIZE;
float acc = 0;
// Changing this to a `while` loop harms performance. Perhaps it triggers a
// separate assembly instruction for control flow.
for (uint i = 0; i < ncols;) {
uchar4 vui = weights[i / WEIGHTS_PER_UINT + local_id];
const uint blocks_in_row = ncols / 32;
half2 d = scales[row * blocks_in_row / 2 + i / 32 / 2];
{
half4 y_value = *(device half4*)(y + i + 4 * local_id);
i += 4 * GROUPING_SIZE;
const short vi0 = vui.x & 0xF;
const short vi1 = vui.x >> 4;
float v0 = (vi0 - 8) * d.x;
float v1 = (vi1 - 8) * d.x;
acc += v0 * y_value[0];
acc += v1 * y_value[1];
const short vi2 = vui.y & 0xF;
const short vi3 = vui.y >> 4;
float v2 = (vi2 - 8) * d.x;
float v3 = (vi3 - 8) * d.x;
acc += v2 * y_value[2];
acc += v3 * y_value[3];
}
{
half4 y_value = *(device half4*)(y + i + 4 * local_id);
i += 4 * GROUPING_SIZE;
const short vi0 = vui.z & 0xF;
const short vi1 = vui.z >> 4;
float v0 = (vi0 - 8) * d.y;
float v1 = (vi1 - 8) * d.y;
acc += v0 * y_value[0];
acc += v1 * y_value[1];
const short vi2 = vui.w & 0xF;
const short vi3 = vui.w >> 4;
float v2 = (vi2 - 8) * d.y;
float v3 = (vi3 - 8) * d.y;
acc += v2 * y_value[2];
acc += v3 * y_value[3];
}
}
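// Butterfly reduction across the 8 threads sharing this row (xor strides
// 1, 2, 4); every lane in the grouping ends up with the full dot product.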
acc += quad_shuffle_xor(acc, 1);
acc += quad_shuffle_xor(acc, 2);
acc += simd_shuffle_xor(acc, 4);
dst[row] = acc;
#undef WEIGHTS_PER_UINT
#undef GROUPING_SIZE
} |
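As a cross-check for the shaders above, here is a small CPU-side sketch of the same Q4_0 dequantization in Swift, following the `block_q4_0` layout shown earlier (one half-precision scale per block of 32 weights, low nibbles first, then high nibbles); it is only a reference for validating shader output, not code from llama.cpp itself.

// Dequantize one Q4_0 block: 16 packed bytes hold 32 4-bit weights, scaled by
// a single half-precision factor. Low nibbles map to outputs 0..15, high
// nibbles to outputs 16..31, mirroring dequantize_q4_0 above.
func dequantizeQ4_0Block(d: Float16, qs: [UInt8]) -> [Float] {
    precondition(qs.count == 16)
    var out = [Float](repeating: 0, count: 32)
    let scale = Float(d)
    for i in 0..<16 {
        let lo = Int(qs[i] & 0xF) - 8
        let hi = Int(qs[i] >> 4) - 8
        out[i] = Float(lo) * scale
        out[i + 16] = Float(hi) * scale
    }
    return out
}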
Thanks for the info.
What are the benefits of metal-cpp over the ObjC bindings?
My M1 GPU implementation is here: ggml-org/ggml#108
The path is overall clear, with the only open question being how exactly to support dynamic shapes (i.e. tensors whose size depends on the number of input / processed tokens). The straightforward way seems to be to recreate the command buffer for each generation - not sure about the overhead of this. If the overhead is too much, I would need to think about some alternative approach. |
My code example makes a single command buffer for each token generation. Even a single command buffer per layer would be reasonable, just not a single cmdbuf per elementary operation (which is what PyTorch does). Also, don't break the cmdbuf into multiple encoders (which removes the benefit of one cmdbuf). If you need to copy buffers via blit encoder, I wrote a very fast compute shader with the same functionality.
Regarding dynamic sizes, I highly recommend you look through my MPSGraph Swift code example a few comments above.
I prefer metal-cpp because ObjC is pretty much a deprecated language. It's been replaced by Swift, and I refuse to learn ObjC just to write Metal code. I have a long history with that. So for C stuff, I will use C++ bindings over ObjC any day. My choice is mostly personal preference; however, metal-cpp has the same functionality as the Metal ObjC bindings. As long as you understand the |
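To make the "one command buffer per token" structure concrete, here is a minimal Swift sketch; `encodeLayer` is a hypothetical closure standing in for whatever encodes one transformer layer, not an actual llama.cpp or ggml API:

import Metal

// One command buffer and a single compute encoder per generated token; all
// layers are encoded back-to-back rather than one command buffer (or one
// encoder) per elementary operation.
func generateOneToken(queue: MTLCommandQueue,
                      numLayers: Int,
                      encodeLayer: (MTLComputeCommandEncoder, Int) -> Void) {
    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    for layer in 0..<numLayers {
        encodeLayer(encoder, layer)   // set pipeline, bind buffers, dispatch
    }
    encoder.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()  // or poll asynchronously per token
}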
To remove the dependency on MPS/MPSMatrixMultiplication, use the early-stage SIMD-group matmul here (faster than MPS for FP16). It requires aligned and non-transposed matrices - the latter restriction is trivial to lift. For the former:
First matmul in attention: 32/40/64 are multiples of 8, 52 is not (zero pad to 56)
Second matmul in attention:
LLaMA 6.7B: 32-wide block size
LLaMA 13.0B: 40-wide block size
LLaMA 32.5B: two shader invocations, one with block 32, another with block 24, and modify the code to stride the memory accesses to 56
LLaMA 65.2B: 32-wide block size |
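A tiny sketch of the padding rule above (round a dimension up to the next multiple of 8 and zero-pad the extra columns); the helper name is illustrative only:

// Round a dimension up to the next multiple of 8 so it satisfies the
// SIMD-group matmul's alignment requirement; e.g. 52 -> 56, while 32, 40 and
// 64 are already aligned.
func paddedToMultipleOf8(_ n: Int) -> Int {
    (n + 7) / 8 * 8
}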
I have an M2 Max 96GB at my disposal, should you like me to perform tests. I have some experience in ML using Python and would very much like to help. |
Closing this as the Metal implementation has now officially landed on master. |
The CUDA acceleration is very impressive. Does anyone know of any efforts to run this on the GPU cores of the M processors? I'd be willing to assist but I'd rather not start from scratch if something exists.