add base model QLoRA fine-tuning config file and optimize deduplicate.py #128

Merged
merged 8 commits on Mar 23, 2024
Changes from all commits
13 changes: 7 additions & 6 deletions README.md
@@ -38,7 +38,6 @@
<a href="https://github.com/SmartFlowAI/EmoLLM/issues">Request a new feature</a>
</div>


<!-- This README.md is intended for developers -->

**EmoLLM** is a series of mental health large language models supporting an **understand the user - support the user - help the user** counseling chain, instruction-tuned from base `LLM`s. Stars are appreciated~⭐⭐. The `LLM` fine-tuning configurations open-sourced so far are listed below (a minimal QLoRA sketch follows the table):
@@ -49,6 +48,7 @@
| :-------------------: | :--------: |
| InternLM2_7B_chat | QLORA |
| InternLM2_7B_chat | 全量微调 |
| InternLM2_7B_base | QLORA |
| InternLM2_1_8B_chat | 全量微调 |
| InternLM2_20B_chat | LORA |
| Qwen_7b_chat | QLORA |
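
The QLORA entries above denote 4-bit quantized LoRA fine-tuning; this PR adds the config for InternLM2_7B_base. The project's real configs live under `xtuner_config/`, so the snippet below is only a minimal PEFT-style sketch of what a QLoRA setup involves. The model id, rank, and `target_modules` names are illustrative assumptions, not values taken from the added config file.

```python
# Minimal QLoRA sketch (assumption: PEFT-style setup; the project itself uses xtuner configs).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with 4-bit NF4 quantization (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm2-7b",        # illustrative base-model id
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters; rank/alpha/target names are assumptions, not the PR's values.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["wqkv", "wo"],  # assumed InternLM2 attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```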
@@ -110,13 +110,14 @@
</details>

### 🏆 Honors

- The project won a place in the ***Top 50*** of the **2024 浦源 LLM Series Challenge (Spring Round)** hosted by Shanghai AI Laboratory

<p align="center">
<a href="https://github.com/SmartFlowAI/EmoLLM/">
<img src="assets/浦语挑战赛TOP50.jpg" alt="浦语挑战赛TOP50">
</p>

- The project was featured in a [promotional post](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A) by the WeChat official account **NLP工程化**

### 🎯 Roadmap
@@ -151,9 +152,10 @@
- [How to contribute](#如何参与本项目)
- [Authors (in no particular order)](#作者排名不分先后)
- [License](#版权说明)
- [Citation](#引用)
- [Special thanks](#特别鸣谢)
- [Star History](#star-history)
- [🌟Contributors](#-contributors)
- [🌟 Contributors](#-contributors)
- [Communication group](#交流群)

###### Pre-development configuration requirements
@@ -234,7 +236,7 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
| [ZeyuBa](https://github.com/ZeyuBa) | Master's student, Institute of Automation | | |
| [aiyinyuedejustin](https://github.com/aiyinyuedejustin) | Master's student, University of Pennsylvania | | |
| [Nobody-ML](https://github.com/Nobody-ML) | Undergraduate, China University of Petroleum (East China) | | |
| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora/) | Main maintainer of MiniSora | Data cleaning, document translation |
| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora/) | Maintainer and admin of [MiniSora](https://github.com/mini-sora/minisora/) | LLM fine-tuning, data cleaning, document translation |
| [Mxoder](https://github.com/Mxoder) | Undergraduate, Beihang University | | |
| [Anooyman](https://github.com/Anooyman) | Master's student, Nanjing University of Science and Technology | | |
| [Vicky-3021](https://github.com/Vicky-3021) | Master's student, Xidian University (Research Year 0) | | |
@@ -248,8 +250,8 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git

This project is licensed under the MIT License. See [LICENSE](https://github.com/SmartFlowAI/EmoLLM/blob/main/LICENSE) for details.


### Citation

If this project is helpful to your work, please cite it using the following format:

```bibtex
@@ -300,7 +302,6 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
[OpenXLab_App-url]: https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0
[OpenXLab_Model-url]: https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full


## Communication group

- If the QR code is no longer valid, please check the Issues section.
12 changes: 5 additions & 7 deletions README_EN.md
@@ -25,7 +25,7 @@
<h3 align="center">EmoLLM</h3>

<p align="center">
<a href="README.md">简体中文</a> | English
<a href="README.md">简体中文</a> | English
<br />
<br />
<a href="https://github.com/SmartFlowAI/EmoLLM"><strong>Explore the documentation of this project »</strong></a>
@@ -42,7 +42,6 @@

<!-- This README.md is intended for developers -->


**EmoLLM** is a series of large language models designed to understand, support and help customers in mental health counseling. It is fine-tuned from the LLM instructions. We really appreciate it if you could give it a star~⭐⭐. The open-sourced configuration is as follows:

<div align="center">
@@ -51,6 +50,7 @@
| :-------------------: | :------: |
| InternLM2_7B_chat | QLORA |
| InternLM2_7B_chat | full fine-tuning |
| InternLM2_7B_base | QLORA |
| InternLM2_1_8B_chat | full fine-tuning |
| InternLM2_20B_chat | LORA |
| Qwen_7b_chat | QLORA |
@@ -90,7 +90,6 @@ The Model aims to fully understand and promote the mental health of individuals,

- 【2024.2.18】 The full fine-tuned version based on Qwen1_5-0_5B-Chat has been [open-sourced](https://www.modelscope.cn/models/aJupyter/EmoLLM_Qwen1_5-0_5B-Chat_full_sft/summary). Friends with limited computational resources can now dive in and explore it.


<details>
<summary>View More</summary>

@@ -173,8 +172,6 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
- [Deployment Guide](#deployment-guide)
- View More Details



### File Directory Explanation

```
@@ -203,8 +200,8 @@ For details, see the [fine-tuning guide](xtuner_config/README.md)
- Demo deployment: see [deployment guide](./demo/README.md) for details.
- Quantitative deployment based on [LMDeploy](https://github.com/InternLM/lmdeploy/): see [deploy](./deploy/lmdeploy.md)


### RAG (Retrieval Augmented Generation) Pipeline

- See [RAG](./rag/)

<details>
@@ -251,7 +248,7 @@ This project uses Git for version control. You can see the currently available v
| [ZeyuBa](https://github.com/ZeyuBa) | Institute of Automation, Master's student | | |
| [aiyinyuedejustin](https://github.com/aiyinyuedejustin) | University of Pennsylvania, Master's student | | |
| [Nobody-ML](https://github.com/Nobody-ML) | China University of Petroleum (East China), Undergraduate student | | |
| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora) |Maintainer and Admin| Data Cleaning and Docs Translation |
| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora) |Maintainer and Admin of [MiniSora](https://github.com/mini-sora/minisora) | LLM Fine-Tuning, Data Cleaning and Docs Translation |
| [Mxoder](https://github.com/Mxoder) | Beihang University, Undergraduate student | | |
| [Anooyman](https://github.com/Anooyman) | Nanjing University of Science and Technology, Master's student | | |
| [Vicky-3021](https://github.com/Vicky-3021) | Xidian University, Master's student (Research Year 0) | | |
@@ -308,6 +305,7 @@ The project is licensed under the MIT License. Please refer to the details
[OpenXLab_Model-url]: https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full

## Communication group

- If it fails, go to the Issue section.

<p align="center">
49 changes: 44 additions & 5 deletions datasets/deduplicate.py
@@ -5,6 +5,9 @@
from hashlib import md5
from simhash import Simhash

import time
import numpy as np

def extract_text_from_json(obj, content):
# print(content)
if isinstance(obj, dict):
@@ -29,7 +32,7 @@ def is_duplicate_absolutely(d1, d2):
def hash_dict(dict_obj):
content = extract_text_from_json(dict_obj,'')
content = content.replace('\n', '').replace('\t', '').replace(' ', '')
print(content)
# print(content)
# m = get_minhash(content)
m = Simhash(content)
return m
@@ -43,10 +46,19 @@ def get_simhash(dict_obj):
return Simhash(dict_obj)

# Deduplicate a list of dicts using exact matching and SimHash
def deduplicate_json(data_list, threshold=0.8):
def deduplicate_json(data_list, threshold=0.8, time_print=True):
seen_hashes = []
keep = []
duplicate = []

# global start
start = time.time()
last_start_seen_hashes = start
last_start_duplicate = start
stop1 = 0
stop2 = 0
print_interval = 500

for item in data_list:
if not item['conversation']:
continue
@@ -60,15 +72,36 @@
has_similar = False
# for stored_min_hash, stored_text in seen_hashes:
# if stored_min_hash.jaccard(min_hash) > threshold:

for stored_min_hash, stored_text in seen_hashes:
if 1 - (stored_min_hash.distance(sim_hash)/64.0) > threshold:
has_similar = True
duplicate.append(item)

print_len_duplicate = len(duplicate)+1
if print_len_duplicate%print_interval == 0:
if time_print:
stop1 = time.time()
print(f'print_len_duplicate={print_len_duplicate} Time: ', np.round(stop1 - last_start_duplicate, 5), np.round(stop1 - start , 5))
last_start_duplicate = stop1
else:
print(f'print_len_duplicate={print_len_duplicate}')

break
if not has_similar:
# seen_hashes.append((min_hash,item))

seen_hashes.append((sim_hash,item))
keep.append(item)


print_len_seen_hashes = len(seen_hashes)+1
if print_len_seen_hashes%print_interval == 0:
if time_print:
stop2 = time.time()
print(f'print_len_seen_hashes={print_len_seen_hashes} Time: ', str(np.round(stop2 - last_start_seen_hashes,5)), str(np.round(stop2 - start, 5)))
last_start_seen_hashes = stop2
else:
print(f'print_len_seen_hashes={print_len_seen_hashes}')
else:
duplicate.append(item)

@@ -77,7 +110,8 @@

if __name__ == '__main__':
DUP_THRESH = 0.8
data_ai = 'qwen'
data_ai = 'FatherLikeBF'
# root_dir = rf'./datasets/{data_ai}/'
root_dir = rf'./{data_ai}/'
dedup_output_dir = os.path.join(root_dir,'dedup')
if not os.path.exists(dedup_output_dir):
@@ -93,9 +127,14 @@ def deduplicate_json(data_list, threshold=0.8):
if is_json_file(file_path):
with open(file_path, 'r', encoding='utf-8') as f:
data = json.load(f)
dedup_data, duplicate = deduplicate_json(data, DUP_THRESH)
dedup_data, duplicate = deduplicate_json(data, DUP_THRESH)

with open(os.path.join(root_dir, 'dedup','dedup_' + file), 'w', encoding='utf-8') as output_file:
json.dump(dedup_data, output_file, ensure_ascii=False, indent=4)

with open(os.path.join(root_dir, 'dedup','dup_' + file), 'w', encoding='utf-8') as output_file:
json.dump(duplicate, output_file, ensure_ascii=False, indent=4)

for item in dedup_data:
logger.info(f'dedup_data: {item}')
for item in duplicate:
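
The similarity test in `deduplicate_json` treats two items as near-duplicates when at least `threshold` of their 64 SimHash bits agree, which is what `1 - distance/64.0 > threshold` computes. A standalone sketch of that check, with invented sample strings:

```python
# Standalone sketch of the similarity test used in deduplicate_json above.
from simhash import Simhash

def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
    # Simhash fingerprints are 64-bit; distance() is the Hamming distance,
    # so 1 - distance/64 is the fraction of bits that agree.
    similarity = 1 - Simhash(text_a).distance(Simhash(text_b)) / 64.0
    return similarity > threshold

# Invented examples, not taken from the project's datasets.
print(is_near_duplicate("I feel anxious and cannot sleep at night",
                        "I feel anxious and can not sleep at night"))  # expected: True
print(is_near_duplicate("I feel anxious and cannot sleep at night",
                        "What is the weather like today"))             # expected: False
```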
14 changes: 10 additions & 4 deletions demo/cli_internlm2.py
@@ -1,17 +1,23 @@
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from openxlab.model import download
from modelscope import snapshot_download

download(model_repo='jujimeizuo/EmoLLM_Model',
output='model')
# Download the model from OpenXLab
model_name_or_path = download(model_repo='ajupyter/EmoLLM_internlm2_7b_full',
                              output='EmoLLM_internlm2_7b_full')

model_name_or_path = "model"
# Download the model from ModelScope (this assignment overrides the OpenXLab path above)
model_name_or_path = snapshot_download('chg0901/EmoLLM-InternLM7B-base')

# offline model
# model_name_or_path = "/root/StableCascade/emollm2/EmoLLM/xtuner_config/merged"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map='auto')
model = model.eval()

system_prompt = "你是一个由aJupyter、Farewell、jujimeizuo、Smiling&Weeping研发(排名按字母顺序排序,不分先后)、散步提供技术支持、上海人工智能实验室提供支持开发的心理健康大模型。现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。"
system_prompt = '你是心理健康助手EmoLLM,由EmoLLM团队打造。你旨在通过专业心理咨询,协助来访者完成心理诊断。请充分利用专业心理学知识与咨询技术,一步步帮助来访者解决心理问题。'

messages = [(system_prompt, '')]
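
With the model loaded, a reply can be generated through the `chat` helper exposed by InternLM2's `trust_remote_code` modeling code. Below is a hedged sketch of a single turn; the keyword names (`history`, `meta_instruction`) are assumptions based on that remote code rather than something this script guarantees:

```python
# Sketch of one chat turn; chat() comes from InternLM2's remote modeling code,
# and its exact signature here is an assumption.
response, history = model.chat(
    tokenizer,
    "I have been sleeping badly lately. What should I do?",  # invented sample query
    history=[],
    meta_instruction=system_prompt,  # assumption: the system prompt is passed this way
)
print(response)
```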
