Error running t5 example #131

Closed

tokestermw opened this issue Oct 29, 2022 · 4 comments

tokestermw commented Oct 29, 2022

Hi! Running the end-to-end t5 example, I get this error when running the last cell:

File /usr/local/lib/python3.9/dist-packages/transformers/models/t5/modeling_t5.py:1648, in T5ForConditionalGeneration.forward(self, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1645         decoder_attention_mask = decoder_attention_mask.to(self.decoder.first_device)
   1647 # Decode
-> 1648 decoder_outputs = self.decoder(
   1649     input_ids=decoder_input_ids,
   1650     attention_mask=decoder_attention_mask,
   1651     inputs_embeds=decoder_inputs_embeds,
   1652     past_key_values=past_key_values,
   1653     encoder_hidden_states=hidden_states,
   1654     encoder_attention_mask=attention_mask,
   1655     head_mask=decoder_head_mask,
   1656     cross_attn_head_mask=cross_attn_head_mask,
   1657     use_cache=use_cache,
   1658     output_attentions=output_attentions,
   1659     output_hidden_states=output_hidden_states,
   1660     return_dict=return_dict,
   1661 )
   1663 sequence_output = decoder_outputs[0]
   1665 # Set device for model parallelism

File /usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
   1132 full_backward_hooks, non_full_backward_hooks = [], []

File /kernl/src/kernl/model_optimization.py:50, in optimize_model.<locals>.run(*args, **kwargs)
     48 def run(*args, **kwargs):
     49     with torchdynamo.optimize(compiler):
---> 50         return original_model.forward2(*args, **kwargs)

File /usr/local/lib/python3.9/dist-packages/torchdynamo/eval_frame.py:148, in catch_errors_wrapper.<locals>.catch_errors(frame, cache_size)
    146         return None
    147     with compile_lock:
--> 148         return callback(frame, cache_size)
    149 except Exception:
    150     logging.basicConfig()

File /usr/local/lib/python3.9/dist-packages/torchdynamo/convert_frame.py:347, in convert_frame.<locals>._convert_frame(frame, cache_size)
    345 counters["frames"]["total"] += 1
    346 try:
--> 347     result = inner_convert(frame, cache_size)
    348     counters["frames"]["ok"] += 1
    349     return result

File /usr/local/lib/python3.9/dist-packages/torchdynamo/convert_frame.py:108, in wrap_convert_context.<locals>._fn(*args, **kwargs)
    106 torch.fx.graph_module._forward_from_src = fx_forward_from_src_skip_result
    107 try:
--> 108     return fn(*args, **kwargs)
    109 finally:
    110     torch._C._set_grad_enabled(prior_grad_mode)

File /usr/local/lib/python3.9/dist-packages/torchdynamo/convert_frame.py:288, in convert_frame_assert.<locals>._convert_frame_assert(frame, cache_size)
    286 for attempt in itertools.count():
    287     try:
--> 288         code = transform_code_object(frame.f_code, transform)
    289         orig_code_map[code] = frame.f_code
    290         break

File /usr/local/lib/python3.9/dist-packages/torchdynamo/bytecode_transformation.py:338, in transform_code_object(code, transformations, safe)
    334 assert len(code_options["co_varnames"]) == code_options["co_nlocals"]
    336 instructions = cleaned_instructions(code, safe)
--> 338 transformations(instructions, code_options)
    340 fix_vars(instructions, code_options)
    342 dirty = True

File /usr/local/lib/python3.9/dist-packages/torchdynamo/convert_frame.py:264, in convert_frame_assert.<locals>._convert_frame_assert.<locals>.transform(instructions, code_options)
    253 nonlocal output
    254 tracer = InstructionTranslator(
    255     instructions,
    256     frame.f_code,
   (...)
    262     one_graph,
    263 )
--> 264 tracer.run()
    265 output = tracer.output
    266 assert output.output_instructions

File /usr/local/lib/python3.9/dist-packages/torchdynamo/symbolic_convert.py:312, in InstructionTranslatorBase.run(self)
    307 def run(self):
    308     try:
    309         while (
    310             self.instruction_pointer is not None
    311             and not self.output.should_exit
--> 312             and self.step()
    313         ):
    314             pass
    315     except (
    316         exc.BackendCompilerFailed,
    317         exc.RestartAnalysis,
   (...)
    320         exc.Unsupported,
    321     ):

File /usr/local/lib/python3.9/dist-packages/torchdynamo/symbolic_convert.py:290, in InstructionTranslatorBase.step(self)
    288     if not hasattr(self, inst.opname):
    289         unimplemented(f"missing: {inst.opname}")
--> 290     getattr(self, inst.opname)(inst)
    291     return inst.opname != "RETURN_VALUE"
    292 except Unsupported as exc:

File /usr/local/lib/python3.9/dist-packages/torchdynamo/symbolic_convert.py:1335, in InstructionTranslator.RETURN_VALUE(self, inst)
   1333     raise exc.SkipFrame()
   1334 self.instruction_pointer = None
-> 1335 self.output.compile_subgraph(self)
   1336 self.output.add_output_instructions([create_instruction("RETURN_VALUE")])

File /usr/local/lib/python3.9/dist-packages/torchdynamo/output_graph.py:307, in OutputGraph.compile_subgraph(self, tx, partial_convert)
    304 output = []
    305 if count_calls(self.graph) != 0 or len(pass2.graph_outputs) != 0:
    306     output.extend(
--> 307         self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
    308     )
    310     if len(pass2.graph_outputs) != 0:
    311         output.append(pass2.create_store(graph_output_var))

File /usr/local/lib/python3.9/dist-packages/torchdynamo/output_graph.py:348, in OutputGraph.compile_and_call_fx_graph(self, tx, rv, root)
    346 gm.recompile()
    347 name = unique_id("__compiled_fn")
--> 348 compiled_fn = self.call_user_compiler(gm)
    349 compiled_fn = torchdynamo.disable(compiled_fn)
    350 counters["stats"]["unique_graphs"] += 1

File /usr/local/lib/python3.9/dist-packages/torchdynamo/output_graph.py:371, in OutputGraph.call_user_compiler(self, gm)
    369     compiled_fn = gm.forward
    370     if config.raise_on_backend_error:
--> 371         raise BackendCompilerFailed(self.compiler_fn, e) from e
    372 return compiled_fn

BackendCompilerFailed: compiler raised IndexError: map::at

I am running inside a Docker image from the Dockerfile here.

Output of nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

Installed torch versions:

torch               1.12.1+cu116
torchdynamo         1.13.0.dev0

And the triton version:

triton              2.0.0.dev20221001
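
For context, the failing cell boils down to something like this (a minimal sketch reconstructed from the traceback and the usual Kernl pattern; the model name, inputs and generate() arguments are placeholders, not the tutorial's exact code):

# Rough reconstruction of the failing cell (assumption based on the traceback, not the exact notebook code).
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from kernl.model_optimization import optimize_model

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval().cuda()

optimize_model(model)  # wraps model.forward in the torchdynamo-compiled run() seen in the traceback

inputs = tokenizer("translate English to French: Hello world", return_tensors="pt").to("cuda")
with torch.inference_mode(), torch.cuda.amp.autocast():
    output = model.generate(**inputs, max_length=20)  # BackendCompilerFailed is raised from here
print(tokenizer.decode(output[0], skip_special_tokens=True))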

@pommedeterresautee (Member) commented

Hi, thank you for your report.
I don't know if you had issues with the PyTorch version (1.13 has been released, but Kernl does not yet work with it), etc.
Just pushed a fix here: #132

After building the image with the fix (PyTorch version pinned):

docker run --rm -it --gpus all -v $(pwd):/kernl kernl
pip install jupyter
jupyter nbconvert --execute --clear-output tutorial/t5\ e2e.ipynb

It worked as expected. Let me know if it helps.
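
For anyone reproducing this, a quick environment sanity check could look like the following (a sketch; the expected versions come from this thread, not from an official requirements list):

# Sanity check of the pinned environment (versions taken from this issue, not an official list).
import torch
import triton

print("torch :", torch.__version__)   # expected here: 1.12.1+cu116 (Kernl does not work with 1.13 yet)
print("triton:", triton.__version__)  # expected here: 2.0.0.dev20221001
assert torch.cuda.is_available(), "a CUDA build of PyTorch and a visible GPU are required"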

@tokestermw (Author) commented Oct 31, 2022

Using torch 1.12.1 (and the Docker commands from #132), I get an error, but maybe there are hardware requirements (I am currently using Tesla T4s). #47

TORCHDYNAMO: backend compiler failed
Traceback (most recent call last):
  File "<string>", line 21, in kernel_fma
KeyError: ('2-.-0-.-0-1e8410f206c822547fb50e2ea86e45a6-cfed90d463fc30ccffb0eb2fd26372d3-a357695982511d203a134df772c7b4a1-2121719c12e3ab66746f4a57f276d42e-0f76008a374e725ca29ccb33f1ba668f-dc48432b6b79843e2f9c7ad2e7355f59-f40d73592c2578180d3d8e3f64e3957d-0dd03b0bd512a184b3512b278d9dfa59-d7c2e52f8151bec157e9a17a1ec37dd3', (torch.float16, None, torch.float16, torch.float16, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (32, 8, 64, 32, 1, True, False, False, 'relu'), (True, (False,), True, True, True, (False, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (False, True), (True, False), (False, True), (True, False), (False, True)))

...

  File "/usr/local/lib/python3.9/dist-packages/triton/runtime/autotuner.py", line 62, in kernel_call
    self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **current)
  File "/usr/local/lib/python3.9/dist-packages/triton/runtime/autotuner.py", line 200, in run
    return self.fn.run(*args, **kwargs)
  File "<string>", line 42, in kernel_fma
  File "/usr/local/lib/python3.9/dist-packages/triton/compiler.py", line 1225, in compile
    return CompiledKernel(name, so_cache_manager._make_path(so_name), fn_cache_manager.cache_dir, device)
  File "/usr/local/lib/python3.9/dist-packages/triton/compiler.py", line 1250, in __init__
    mod, func, n_regs, n_spills = _triton.code_gen.load_binary(metadata["name"], self.asm["cubin"], self.shared, device)
RuntimeError: CUDA: Error- illegal address

...

   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
   1132 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.9/dist-packages/torch/nn/modules/sparse.py:158, in Embedding.forward(self, input)
    157 def forward(self, input: Tensor) -> Tensor:
--> 158     return F.embedding(
    159         input, self.weight, self.padding_idx, self.max_norm,
    160         self.norm_type, self.scale_grad_by_freq, self.sparse)

File /usr/local/lib/python3.9/dist-packages/torch/nn/functional.py:2199, in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2193     # Note [embedding_renorm set_grad_enabled]
   2194     # XXX: equivalent to
   2195     # with torch.no_grad():
   2196     #   torch.embedding_renorm_
   2197     # remove once script supports set_grad_enabled
   2198     _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2199 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
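
As the message says, the reported stack location may be off; a standard way to surface the real failing kernel (general CUDA debugging practice, nothing Kernl-specific) is to force synchronous launches before anything touches the GPU:

# Force synchronous CUDA kernel launches so the traceback points at the real failing call.
# Must be set before the first CUDA call (ideally before importing torch in the notebook).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"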

@pommedeterresautee (Member) commented

Yes indeed, only Ampere GPUs are supported. Docker disables the check so that we can build the image, but after that there is no further check. Maybe we should move the check elsewhere.
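
A minimal sketch of such a check (hypothetical, not current Kernl code; it only assumes the standard torch.cuda API, and Ampere corresponds to compute capability 8.x):

# Hypothetical pre-flight check: refuse to run on pre-Ampere GPUs.
import torch

major, minor = torch.cuda.get_device_capability()
if major < 8:  # Ampere starts at 8.0 (A100, A10, RTX 30xx); a Tesla T4 is 7.5
    raise RuntimeError(
        f"Kernl requires an Ampere GPU, got {torch.cuda.get_device_name()} "
        f"with compute capability {major}.{minor}"
    )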

@pommedeterresautee (Member) commented

Closing as the question is answered. Don't hesitate to reopen if you want more info.
