Gradio error: "Not implemented yet" #15

Closed
mmealman opened this issue May 29, 2023 · 2 comments

Comments
@mmealman

I'm getting an error when attempting to use generate_simple inside a Gradio UI. I can run test_inference.py just fine; however, when I put that code into a Gradio UI and attempt to redirect the output to a Chatbot component, I get the error below:

Traceback (most recent call last):
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/gradio/routes.py", line 422, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/gradio/blocks.py", line 1323, in process_api
    result = await self.call_function(
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/gradio/blocks.py", line 1051, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/mmealman/src/exllama/webui/Chatbot.py", line 72, in bot
    bot_message = self.predict(history, user_message)
  File "/home/mmealman/src/exllama/webui/Chatbot.py", line 58, in predict
    return self.textgen.test_generate()
  File "/home/mmealman/src/exllama/TextGenerator.py", line 96, in test_generate
    text = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
  File "/home/mmealman/src/exllama/generator.py", line 176, in generate_simple
    self.gen_begin(ids)
  File "/home/mmealman/src/exllama/generator.py", line 103, in gen_begin
    self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True)
  File "/home/mmealman/src/exllama/model.py", line 1153, in forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device])
  File "/home/mmealman/src/exllama/model.py", line 540, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer)
  File "/home/mmealman/src/exllama/model.py", line 447, in forward
    query_states = self.q_proj.forward(hidden_states)
  File "/home/mmealman/src/exllama/model.py", line 314, in forward
    out = cuda_ext.ExAutogradMatmul4bitCuda.apply(x, self.qweight, self.scales, self.qzeros, self.groupsize, self.bits, self.maxq)
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 106, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/mmealman/src/exllama/cuda_ext.py", line 271, in forward
    raise ValueError("Not implemented yet")
ValueError: Not implemented yet

Below is the generation code I'm calling in the Chatbot:

    def test_generate(self):
        tokenizer_model_path = "/home/mmealman/src/models/vicuna-13B-1.1-GPTQ-4bit-128g/tokenizer.model"
        model_config_path = "/home/mmealman/src/models/vicuna-13B-1.1-GPTQ-4bit-128g/config.json"
        model_path = "/home/mmealman/src/models/vicuna-13B-1.1-GPTQ-4bit-128g/vicuna-13B-1.1-GPTQ-4bit-128g.safetensors"
        config = ExLlamaConfig(model_config_path)
        config.model_path = model_path
        config.max_seq_len = 2048
        model = ExLlama(config)
        cache = ExLlamaCache(model)

        tokenizer = ExLlamaTokenizer(tokenizer_model_path)
        generator = ExLlamaGenerator(model, tokenizer, cache)
        generator.settings.token_repetition_penalty_max = 1.2
        generator.settings.token_repetition_penalty_sustain = 20
        generator.settings.token_repetition_penalty_decay = 50

        prompt = \
        "On 19 February 1952, Headlam became senior air staff officer (SASO) at Eastern Area Command in Penrith, New South " \
        "Wales. During his term as SASO, the RAAF began re-equipping with English Electric Canberra jet bombers and CAC " \
        "Sabre jet fighters. The Air Force also underwent a major organisational change, as it transitioned from a " \
        "geographically based command-and-control system to one based on function, resulting in the establishment of Home " \
        "(operational), Training, and Maintenance Commands. Eastern Area Command, considered a de facto operational " \
        "headquarters owing to the preponderance of combat units under its control, was reorganised as Home Command in " \
        "October 1953. Headlam was appointed an Officer of the Order of the British Empire (OBE) in the 1954 New Year " \
        "Honours for his \"exceptional ability and devotion to duty\". He was promoted to acting air commodore in May. His " \
        "appointment as aide-de-camp to Queen Elizabeth II was announced on 7 October 1954."

        gen_tokens = 200
        text = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
        return text

ExLlama generation works fine in all other standalone Python scripts, and the Gradio UI code has worked fine in several other projects.

@turboderp
Owner

The only place that exception is thrown is in the quantized autograd matmul function, after it checks torch.is_grad_enabled() == True. So I would assume you're running without torch.no_grad(), which is currently required since the quantized matmul doesn't support backpropagation yet.

And it might never, actually, since that would require a rewrite (or an alternative version) of all the CUDA functions, and it's not clear it would perform any better than the Transformers/GPTQ implementation for training anyway.
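
For a Gradio callback that just means wrapping the generation call so autograd stays disabled inside the worker thread Gradio spawns. A minimal sketch against the predict method from your traceback (the method and attribute names are taken from the traceback; everything else is assumed):

    import torch

    class Chatbot:
        def predict(self, history, user_message):
            # Gradio runs this in a worker thread with autograd enabled by
            # default, so disable it explicitly around generation.
            with torch.no_grad():
                return self.textgen.test_generate()

Decorating the handler with @torch.no_grad() works just as well.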

@mmealman
Author

Yep, that worked perfectly. Thanks, I was working on this for a couple hours.

        # start timer:
        t0 = time.time()

        with torch.no_grad():
            text = generator.generate_simple(_MESSAGE, max_new_tokens=req.max_new_tokens)

        new_text = text.replace(_MESSAGE,"")
        new_text = new_text.lstrip()
