Mistral 7b and Mixtral 8x7b experience degraded performance (using official docs) #1305
Comments
@iibw do you see the same issue with Mistral-7B-v0.1 (https://huggingface.co/mistralai/Mistral-7B-v0.1)? Just trying to rule out some potential factors that might lead to this.
None of the four prompts I provided experienced degraded performance when I tested them with Mistral-7b-Instruct-v0.1. Three out of the four prompts produced exactly the same output between TensorRT-LLM and Transformers, and the last one wasn't the same as the Transformers output, but it was similar. None of them produced repeating outputs, and after experimenting with other prompts, trying to get the same issue to happen, I failed to do so. So, it seems your hunch was correct, and this problem does not affect Mistral v0.1.
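The "three exact matches, one near match" comparison above can be made programmatic. Here is a minimal sketch, assuming a character-level similarity ratio with an arbitrary 0.9 threshold (both the helper name and the threshold are my own choices, not from this thread):

```python
from difflib import SequenceMatcher

def compare_outputs(reference: str, candidate: str, threshold: float = 0.9) -> str:
    """Classify a candidate generation against a reference generation.

    Returns "exact", "similar", or "different" based on a simple
    character-level similarity ratio; the threshold is arbitrary.
    """
    if candidate == reference:
        return "exact"
    ratio = SequenceMatcher(None, reference, candidate).ratio()
    return "similar" if ratio >= threshold else "different"

# Identical outputs classify as an exact match.
print(compare_outputs("The pizza is hot.", "The pizza is hot."))  # exact
# A close-but-not-identical output falls into "similar" or "different"
# depending on how much of the text overlaps.
print(compare_outputs("The pizza is hot.", "The pizza is warm."))
```

A token-level comparison (on decoded token ids) would be more faithful to how the backends diverge, but a character ratio is enough for a quick triage.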
Does one of your prompts have a sequence length (input + output) larger than 4096? Mistral-instruct-v0.2 doesn't have sliding window attention, so you should remove that line.
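The suggestion above is to drop the sliding-window setting before building the engine. A minimal sketch of what that could look like, assuming the setting in question is the `sliding_window` key of the model's config.json (the helper itself is hypothetical, not code from this thread):

```python
import json

def drop_sliding_window(config: dict) -> dict:
    """Return a copy of a model config with the sliding-window setting removed.

    Mistral-7B-v0.1 ships a `sliding_window` entry in its config.json;
    for v0.2, which does not use sliding-window attention, the suggestion
    in this thread is to drop that setting before building the engine.
    """
    cleaned = dict(config)
    cleaned.pop("sliding_window", None)  # no-op if the key is absent
    return cleaned

# Minimal illustration with a made-up config fragment.
cfg = {"max_position_embeddings": 32768, "sliding_window": 4096}
print(json.dumps(drop_sliding_window(cfg)))
```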
No, because the sequence length is never more than around 2000 tokens due to … Just in case that was the issue, I ran Mistral-7B-Instruct-v0.2 without … For example, given the prompt …
I noticed at the start there are some JSON config errors. Maybe they contribute to this? Although I believe they exist with v0.1 as well.
@PerkzZheng it looks like this bug affects Mixtral as well. Using the same system info as my Mistral testing and these commands:
I built Mixtral for my system (I can't do full precision because I don't have enough VRAM) and ran the four prompts above. Three of the four prompts worked without issue, but the prompt … did not. The output of the broken prompt …
I should probably mention, for all these tests with Transformers, I'm using version 4.36.1. If needed, I can provide the Transformers code as well.
So to summarize,
I have the very same issue with Mixtral on Nvidia H100 in version 0.8.0.
Mistral v0.1 is perfectly fine as far as I have seen. Mixtral 8x7b is the one appearing to also have this problem. I can't test Mixtral 8x7b without int8 weight only applied because it's too large for the A100 GPU I have access to, but according to @bprus, it doesn't seem to make a difference.
I haven't tried Mistral v0.2 with int8 weight only so I can't say for sure, but given what @bprus said, int8 weight only doesn't seem to change anything. So, to recap:
Thanks for the summary. I will see if I can reproduce and find the root cause of this.
I think it's the same as: #722
@PerkzZheng I'm not sure if this helps or not, but it looks like there are more changes between Mistral v0.1 and Mistral v0.2 than just the removal of the sliding window.
Also, it doesn't seem like the Nvidia demo at https://build.nvidia.com/mistralai/mixtral-8x7b-instruct has this issue, and it says that it uses Triton Inference Server, which is probably using TensorRT-LLM as its backend.
@iibw can you have a try with the main branch?
I tried with the main branch on Mistral v0.2 and I'm experiencing the same error. The model repeats text after the max output length is reached.
it should throw an error if the output length exceeds the max_output_length.
Sorry, I made a typo in my original response. What I meant was that the model repeats text continually until the max output length is reached. But I made another try today and the issue is now surprisingly gone.
The relevant bugs/issues might have been fixed in the main branch. Let us know if you find other issues, or you can close this. Thanks.
Hi! I've built the image with … Then converted with:
And built with:
Here are 2 examples of outputs I get:
and
I ran more comprehensive tests, and for ~400 requests the average number of output tokens is 985. For other methods of serving (like TGI or vLLM or Transformers) the same requests generate around 550 tokens. If you need any more information I'm glad to provide it.
@bprus have you observed this issue using fp16 weights instead of int8 weight-only?
So I built with
And unfortunately, it didn't change anything and the results are exactly the same. I'll try to re-run the test when I get the chance.
Thanks. I will see if I can reproduce this issue.
Hi @bprus, does the issue persist on the latest main branch for you? I tried following your steps, but was unable to reproduce it. Here is what I tested:
Output:
Hi, @djns99! However, I stumbled upon another minor issue. In …
Just a heads up that you might want to fix it sometime 😉 Once again, thanks for all the help! |
Most of my prompts aren't repeating endlessly now, but there's still one. Passing "What is machine learning?" as the prompt to Mixtral 8x7b continues to loop endlessly. I don't think this happens with Transformers so the root problem hasn't been fixed yet. |
I tested "What is machine learning?" with transformers and it also loops endlessly for that prompt so it seems like this is an expected output. This clears everything up for me so I'll go ahead and close this issue. Thanks for the help! |
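For anyone who wants to check mechanically whether a generation has entered the kind of endless loop discussed above, here is a small heuristic sketch on token ids; the window size and repeat count are arbitrary choices of mine, not from this thread:

```python
def detect_repetition(token_ids, window: int = 16, min_repeats: int = 3) -> bool:
    """Heuristically detect whether the tail of a generation is looping.

    Checks whether the last `window` tokens repeat back-to-back at least
    `min_repeats` times at the end of the sequence. Both thresholds are
    arbitrary and should be tuned for the tokenizer and task.
    """
    if len(token_ids) < window * min_repeats:
        return False
    tail = token_ids[-window:]
    for k in range(2, min_repeats + 1):
        segment = token_ids[-window * k : -window * (k - 1)]
        if segment != tail:
            return False
    return True

# A looping sequence: the same 4-token phrase repeated over and over.
looping = [1, 2, 3, 4] * 20
print(detect_repetition(looping, window=4))  # True
```

This only catches exact periodic loops; near-repeats (paraphrased loops) would need a fuzzier comparison.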
I am facing the same issue on llama3-8b-instruct and phi3-mini-128k-instruct.
@anubhav-agrawal-mu-sigma Which version of TensorRT-LLM do you use? And can you share more details that can help us reproduce the issue?
@lfr-0531 Using TensorRT-LLM v0.11.0, with the Docker image nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
@anubhav-agrawal-mu-sigma can you try with the main branch?
Is this output expected? Please also open another issue with more detailed information, like the GPU architecture.
System Info
pip install tensorrt_llm==0.8.0 --extra-index-url https://pypi.nvidia.com
Who can help?
@kaiyux @byshiue
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
or if you disturb the formatting a bit as with the spaces here
Expected behavior
The LLM will provide a complete response which ends instead of deteriorating into an infinite loop. This problem does not happen with Transformers as far as I can tell. Every one of the above example prompts ends and does not infinitely loop.
For example with Transformers:
prompt:
[INST] Please write an essay on the thermodynamics of pizza. [/INST]
output:
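As an aside on the prompt shape used throughout this issue: the `[INST] ... [/INST]` wrapper is Mistral's single-turn instruct format. A hypothetical helper that applies it might look like the following; in real code the tokenizer's `apply_chat_template` is preferable, since it also inserts the BOS token and handles multi-turn history:

```python
def format_mistral_instruct(user_message: str) -> str:
    """Wrap a single-turn user message in Mistral's [INST] chat format.

    This mirrors the prompt style used in this issue. Production code
    should prefer tokenizer.apply_chat_template from Transformers,
    which also handles special tokens and multi-turn conversations.
    """
    return f"[INST] {user_message} [/INST]"

prompt = format_mistral_instruct(
    "Please write an essay on the thermodynamics of pizza."
)
print(prompt)
```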
actual behavior
Generation does not end until the max_output_len is reached, and the farther it goes, the worse it gets. From what I've seen, it starts repeating itself and then outputting random tokens which decode as random Unicode symbols. For example,
prompt:
[INST] Please write an essay on the thermodynamics of pizza. [/INST]
output:
starts off well with
but after a while it starts a continuous loop until it finally reaches the end
additional notes
Increasing the repetition penalty has an effect, but it doesn't always work and it degrades the output, whereas I've never seen this issue with Mistral 7b using Transformers.
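For context on what the repetition penalty does, here is a pure-Python sketch of the CTRL-style penalty that samplers such as Transformers commonly apply: the logit of every token already generated is divided by the penalty when positive and multiplied by it when negative, so repeats become less likely as the penalty grows (the helper itself is my sketch, not code from this issue):

```python
def apply_repetition_penalty(logits, generated_ids, penalty: float):
    """Apply a CTRL-style repetition penalty to a list of logits.

    For every token id already generated, a positive logit is divided by
    `penalty` and a negative one is multiplied by it, making that token
    less likely. A penalty of 1.0 leaves the distribution unchanged.
    """
    adjusted = list(logits)
    for tid in set(generated_ids):
        if adjusted[tid] > 0:
            adjusted[tid] /= penalty
        else:
            adjusted[tid] *= penalty
    return adjusted

# Token 0 was already generated, so its logit shrinks from 2.0 to 1.0.
print(apply_repetition_penalty([2.0, 1.0, -1.0], [0], penalty=2.0))  # [1.0, 1.0, -1.0]
```

This also illustrates why a large penalty degrades output: it suppresses legitimately repeated tokens (articles, punctuation) along with the looping ones.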