To force the model to respond in JSON format, I am using the ExLlamaV2TokenEnforcerFilter and ExLlamaV2PrefixFilter classes, appending both to a filters list that is passed to the generator. Since my use cases are limited, I thought of caching both filter objects per use case in a dict and reusing them. After doing this, I observed that system RAM utilization keeps increasing, and after a few iterations it leads to an out-of-memory condition. The process usually takes 10-15 GB of system RAM, but over time usage grows beyond 128 GB, causing the OOM.
I am sharing the code snippet:
```python
import json
from typing import List, Tuple

# Import paths assumed from the usual locations of these classes:
from exllamav2.generator.filters import ExLlamaV2PrefixFilter
from lmformatenforcer.integrations.exllamav2 import ExLlamaV2TokenEnforcerFilter


def run_mihup_llm_inference(self, call_transcript: str, prompt_tuples: List[Tuple]) -> List[dict]:
    self.cache.reset()
    common_transcript = format_transcript_text(call_transcript)

    prompts = []
    filters = []
    use_case_ids = []

    for upper_tuple in prompt_tuples:
        use_case_id = upper_tuple[1]
        use_case_ids.append(use_case_id)

        p = upper_tuple[0]
        prompt_str = p[0]
        prompt_question_combined = format_llama3_prompt(mihup_system_prompt, common_transcript + prompt_str)
        prompts.append(prompt_question_combined)

        filter_schema_parser = p[1]
        print_memory_usage()

        if use_case_id not in self.universal_filter_map:
            print("Not found in the cache memory")
            self.universal_filter_map[use_case_id] = [
                ExLlamaV2TokenEnforcerFilter(filter_schema_parser, self.tokenizer),
                ExLlamaV2PrefixFilter(self.model, self.tokenizer, ["{", " {"]),
            ]
        else:
            # Reuse the cached filters after manually resetting their visible
            # state; both classes are stateful, so this may not clear everything.
            self.universal_filter_map[use_case_id][0].token_sequence = []
            self.universal_filter_map[use_case_id][1].current_prefixes = set()
            self.universal_filter_map[use_case_id][1].current_str = ""
            self.universal_filter_map[use_case_id][1].prefix_strings = ["{", " {"]
            print("Found in the cache memory")
            print("length of map : ", len(self.universal_filter_map[use_case_id]))

        filters.append(self.universal_filter_map[use_case_id])

    outputs = self.generator.generate(
        prompt=prompts,
        filters=filters,
        filter_prefer_eos=True,
        max_new_tokens=1536,
        add_bos=True,
        stop_conditions=get_llama3_stop_conditions(self.tokenizer),
        completion_only=True,
        encode_special_tokens=True,
    )

    # Parse each completion, remembering the output index of every successful
    # parse so use_case_ids stays aligned even when some outputs are skipped.
    final_output = []
    parsed_indices = []
    for i, output in enumerate(outputs):
        try:
            final_output.append(json.loads(output))
            parsed_indices.append(i)
        except ValueError:
            print("error: ", output)
    # gc.collect()

    print_memory_usage()
    for output_json, i in zip(final_output, parsed_indices):
        output_json["use_case_id"] = use_case_ids[i]
    return final_output
```
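For context, `print_memory_usage` is not part of the snippet above. A minimal stand-in that reports the resident set size via psutil (an assumption; the original helper is not shown) could look like this:

```python
import os

import psutil  # assumed dependency; any RSS-reporting helper works


def print_memory_usage() -> None:
    # Resident set size of the current process, in GiB, so the growth
    # between successive run_mihup_llm_inference() calls is visible.
    rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"System RAM in use (RSS): {rss_gib:.2f} GiB")
```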
### Reproduction steps
I have shared the code snippet above.
### Expected behavior
Memory usage should stay stable across calls. Previously, when I was not caching these filter classes, this growth did not happen.
### Logs
_No response_
### Additional context
_No response_
### Acknowledgements
- [X] I have looked for similar issues before submitting this one.
- [X] I understand that the developers have lives and my issue will be answered when possible.
- [X] I understand the developers of this program are human, and I will ask my questions politely.
I suppose you could draw filters from a pool, but you'd have to explicitly reset them since they are stateful objects. Support for that is going to be up to LMFE, and I'm not sure if resetting fully clears all the internal state they might have. There's obviously something being retained but you'd have to dig into the source code of LMFE to try and figure out what it is.
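For reference, the non-caching variant described under "Expected behavior" builds fresh filter objects on every call and lets them be garbage-collected afterwards. A minimal sketch, reusing the names from the snippet above:

```python
# Sketch: construct new filter instances per generate() call instead of
# pooling them. No long-lived dict keeps references alive, so each filter
# (and whatever internal state LMFE accumulates inside it) can be freed
# once the call returns.
filters = []
for upper_tuple in prompt_tuples:
    filter_schema_parser = upper_tuple[0][1]
    filters.append([
        ExLlamaV2TokenEnforcerFilter(filter_schema_parser, self.tokenizer),
        ExLlamaV2PrefixFilter(self.model, self.tokenizer, ["{", " {"]),
    ])
```

This trades a small per-call construction cost for not having to reason about what a fully reset LMFE filter looks like.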
### OS
Linux
### GPU Library
CUDA 12.x
### Python version
3.11
### Pytorch version
2.4.1
### Model
_No response_