Gradio error: "Not implemented yet" #15

Closed
mmealman opened this issue May 29, 2023 · 2 comments

Comments
@mmealman

I'm getting an error when attempting to use generate_simple inside a Gradio UI. I can run test_inference.py just fine; however, when I put that code into a Gradio UI and attempt to redirect the output to a Chatbot component, I get the error below:

Traceback (most recent call last):
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/gradio/routes.py", line 422, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/gradio/blocks.py", line 1323, in process_api
    result = await self.call_function(
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/gradio/blocks.py", line 1051, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/mmealman/src/exllama/webui/Chatbot.py", line 72, in bot
    bot_message = self.predict(history, user_message)
  File "/home/mmealman/src/exllama/webui/Chatbot.py", line 58, in predict
    return self.textgen.test_generate()
  File "/home/mmealman/src/exllama/TextGenerator.py", line 96, in test_generate
    text = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
  File "/home/mmealman/src/exllama/generator.py", line 176, in generate_simple
    self.gen_begin(ids)
  File "/home/mmealman/src/exllama/generator.py", line 103, in gen_begin
    self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True)
  File "/home/mmealman/src/exllama/model.py", line 1153, in forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device])
  File "/home/mmealman/src/exllama/model.py", line 540, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer)
  File "/home/mmealman/src/exllama/model.py", line 447, in forward
    query_states = self.q_proj.forward(hidden_states)
  File "/home/mmealman/src/exllama/model.py", line 314, in forward
    out = cuda_ext.ExAutogradMatmul4bitCuda.apply(x, self.qweight, self.scales, self.qzeros, self.groupsize, self.bits, self.maxq)
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 106, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/mmealman/src/exllama/cuda_ext.py", line 271, in forward
    raise ValueError("Not implemented yet")
ValueError: Not implemented yet

Below is the generation code I'm calling in the Chatbot:

    def test_generate(self):
        tokenizer_model_path = "/home/mmealman/src/models/vicuna-13B-1.1-GPTQ-4bit-128g/tokenizer.model"
        model_config_path = "/home/mmealman/src/models/vicuna-13B-1.1-GPTQ-4bit-128g/config.json"
        model_path = "/home/mmealman/src/models/vicuna-13B-1.1-GPTQ-4bit-128g/vicuna-13B-1.1-GPTQ-4bit-128g.safetensors"
        config = ExLlamaConfig(model_config_path)
        config.model_path = model_path
        config.max_seq_len = 2048
        model = ExLlama(config)
        cache = ExLlamaCache(model)

        tokenizer = ExLlamaTokenizer(tokenizer_model_path)
        generator = ExLlamaGenerator(model, tokenizer, cache)
        generator.settings.token_repetition_penalty_max = 1.2
        generator.settings.token_repetition_penalty_sustain = 20
        generator.settings.token_repetition_penalty_decay = 50

        prompt = \
        "On 19 February 1952, Headlam became senior air staff officer (SASO) at Eastern Area Command in Penrith, New South " \
        "Wales. During his term as SASO, the RAAF began re-equipping with English Electric Canberra jet bombers and CAC " \
        "Sabre jet fighters. The Air Force also underwent a major organisational change, as it transitioned from a " \
        "geographically based command-and-control system to one based on function, resulting in the establishment of Home " \
        "(operational), Training, and Maintenance Commands. Eastern Area Command, considered a de facto operational " \
        "headquarters owing to the preponderance of combat units under its control, was reorganised as Home Command in " \
        "October 1953. Headlam was appointed an Officer of the Order of the British Empire (OBE) in the 1954 New Year " \
        "Honours for his \"exceptional ability and devotion to duty\". He was promoted to acting air commodore in May. His " \
        "appointment as aide-de-camp to Queen Elizabeth II was announced on 7 October 1954."

        gen_tokens = 200
        text = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
        return text

ExLlama generation works fine in all other standalone Python scripts, and the Gradio UI code has worked fine in several other projects.

@turboderp
Owner

The only place that exception is thrown is in the quantized autograd matmul function, after it checks torch.is_grad_enabled() == True. So I would assume you're running without torch.no_grad(), which is currently required since the quantized matmul doesn't support backpropagation yet.

And it might never, actually, since that would require a rewrite (or an alternative version) of all the CUDA functions, and it's not clear it would perform any better than the Transformers/GPTQ implementation for training anyway.
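
For a Gradio callback that just means wrapping the generation call so autograd stays disabled inside the worker thread Gradio spawns. A minimal sketch against the predict method from your traceback (the method and attribute names are taken from the traceback; everything else is assumed):

    import torch

    class Chatbot:
        def predict(self, history, user_message):
            # Gradio runs this in a worker thread with autograd enabled by
            # default, so disable it explicitly around generation.
            with torch.no_grad():
                return self.textgen.test_generate()

Decorating the handler with @torch.no_grad() works just as well.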

@mmealman
Author

Yep, that worked perfectly. Thanks, I was working on this for a couple hours.

        # start timer:
        t0 = time.time()

        with torch.no_grad():
            text = generator.generate_simple(_MESSAGE, max_new_tokens=req.max_new_tokens)

        new_text = text.replace(_MESSAGE,"")
        new_text = new_text.lstrip()
