add base model QLoRA fine-tuning config file and optimize deduplicate.py #128

Merged
merged 8 commits on Mar 23, 2024
Changes from all commits
13 changes: 7 additions & 6 deletions README.md
@@ -38,7 +38,6 @@
<a href="https://github.com/SmartFlowAI/EmoLLM/issues">Request a new feature</a>
</div>


<!-- This README.md is intended for developers -->

**EmoLLM** is a series of mental health large language models supporting an **understand the user - support the user - help the user** counseling chain, instruction-tuned from base `LLM`s. Stars are appreciated~⭐⭐. The `LLM` fine-tuning configurations open-sourced so far are listed below (a minimal QLoRA sketch follows the table):
@@ -49,6 +48,7 @@
| :-------------------: | :--------: |
| InternLM2_7B_chat | QLORA |
| InternLM2_7B_chat | 全量微调 |
| InternLM2_7B_base | QLORA |
| InternLM2_1_8B_chat | 全量微调 |
| InternLM2_20B_chat | LORA |
| Qwen_7b_chat | QLORA |
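
The QLORA entries above denote 4-bit quantized LoRA fine-tuning; this PR adds the config for InternLM2_7B_base. The project's real configs live under `xtuner_config/`, so the snippet below is only a minimal PEFT-style sketch of what a QLoRA setup involves. The model id, rank, and `target_modules` names are illustrative assumptions, not values taken from the added config file.

```python
# Minimal QLoRA sketch (assumption: PEFT-style setup; the project itself uses xtuner configs).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with 4-bit NF4 quantization (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm2-7b",        # illustrative base-model id
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters; rank/alpha/target names are assumptions, not the PR's values.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["wqkv", "wo"],  # assumed InternLM2 attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```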
@@ -110,13 +110,14 @@
</details>

### 🏆 Honors

- The project won a place in the ***Top 50*** of the **2024 浦源 LLM Series Challenge (Spring Round)** hosted by Shanghai AI Laboratory

<p align="center">
<a href="https://github.com/SmartFlowAI/EmoLLM/">
<img src="assets/浦语挑战赛TOP50.jpg" alt="浦语挑战赛TOP50">
</p>

- The project was featured in a [promotional post](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A) by the WeChat official account **NLP工程化**

### 🎯 Roadmap
@@ -151,9 +152,10 @@
- [How to contribute](#如何参与本项目)
- [Authors (in no particular order)](#作者排名不分先后)
- [License](#版权说明)
- [Citation](#引用)
- [Special thanks](#特别鸣谢)
- [Star History](#star-history)
- [🌟Contributors](#-contributors)
- [🌟 Contributors](#-contributors)
- [Communication group](#交流群)

###### Pre-development configuration requirements
@@ -234,7 +236,7 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
| [ZeyuBa](https://github.com/ZeyuBa) | Master's student, Institute of Automation | | |
| [aiyinyuedejustin](https://github.com/aiyinyuedejustin) | Master's student, University of Pennsylvania | | |
| [Nobody-ML](https://github.com/Nobody-ML) | Undergraduate, China University of Petroleum (East China) | | |
| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora/) | Main maintainer of MiniSora | Data cleaning, document translation |
| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora/) | Maintainer and admin of [MiniSora](https://github.com/mini-sora/minisora/) | LLM fine-tuning, data cleaning, document translation |
| [Mxoder](https://github.com/Mxoder) | Undergraduate, Beihang University | | |
| [Anooyman](https://github.com/Anooyman) | Master's student, Nanjing University of Science and Technology | | |
| [Vicky-3021](https://github.com/Vicky-3021) | Master's student, Xidian University (Research Year 0) | | |
@@ -248,8 +250,8 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git

This project is licensed under the MIT License. See [LICENSE](https://github.com/SmartFlowAI/EmoLLM/blob/main/LICENSE) for details.


### Citation

If this project is helpful to your work, please cite it using the following format:

```bibtex
@@ -300,7 +302,6 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
[OpenXLab_App-url]: https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0
[OpenXLab_Model-url]: https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full


## Communication group

- If the QR code is no longer valid, please check the Issues section.
12 changes: 5 additions & 7 deletions README_EN.md
@@ -25,7 +25,7 @@
<h3 align="center">EmoLLM</h3>

<p align="center">
<a href="README.md">简体中文</a> | English
<a href="README.md">简体中文</a> | English
<br />
<br />
<a href="https://github.com/SmartFlowAI/EmoLLM"><strong>Explore the documentation of this project »</strong></a>
@@ -42,7 +42,6 @@

<!-- This README.md is intended for developers -->


**EmoLLM** is a series of large language models designed to understand, support and help customers in mental health counseling. It is fine-tuned from the LLM instructions. We really appreciate it if you could give it a star~⭐⭐. The open-sourced configuration is as follows:

<div align="center">
@@ -51,6 +50,7 @@
| :-------------------: | :------: |
| InternLM2_7B_chat | QLORA |
| InternLM2_7B_chat | full fine-tuning |
| InternLM2_7B_base | QLORA |
| InternLM2_1_8B_chat | full fine-tuning |
| InternLM2_20B_chat | LORA |
| Qwen_7b_chat | QLORA |
@@ -90,7 +90,6 @@ The Model aims to fully understand and promote the mental health of individuals,

- 【2024.2.18】 The full fine-tuned version based on Qwen1_5-0_5B-Chat has been [open-sourced](https://www.modelscope.cn/models/aJupyter/EmoLLM_Qwen1_5-0_5B-Chat_full_sft/summary). Friends with limited computational resources can now dive in and explore it.


<details>
<summary>View More</summary>

@@ -173,8 +172,6 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
- [Deployment Guide](#deployment-guide)
- View More Details



### File Directory Explanation

```
@@ -203,8 +200,8 @@ For details, see the [fine-tuning guide](xtuner_config/README.md)
- Demo deployment: see [deployment guide](./demo/README.md) for details.
- Quantitative deployment based on [LMDeploy](https://github.com/InternLM/lmdeploy/): see [deploy](./deploy/lmdeploy.md)


### RAG (Retrieval Augmented Generation) Pipeline

- See [RAG](./rag/)

<details>
@@ -251,7 +248,7 @@ This project uses Git for version control. You can see the currently available v
| [ZeyuBa](https://github.com/ZeyuBa) | Institute of Automation, Master's student | | |
| [aiyinyuedejustin](https://github.com/aiyinyuedejustin) | University of Pennsylvania, Master's student | | |
| [Nobody-ML](https://github.com/Nobody-ML) | China University of Petroleum (East China), Undergraduate student | | |
| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora) |Maintainer and Admin| Data Cleaning and Docs Translation |
| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora) |Maintainer and Admin of [MiniSora](https://github.com/mini-sora/minisora) | LLM Fine-Tuning, Data Cleaning and Docs Translation |
| [Mxoder](https://github.com/Mxoder) | Beihang University, Undergraduate student | | |
| [Anooyman](https://github.com/Anooyman) | Nanjing University of Science and Technology, Master's student | | |
| [Vicky-3021](https://github.com/Vicky-3021) | Xidian University, Master's student (Research Year 0) | | |
@@ -308,6 +305,7 @@ The project is licensed under the MIT License. Please refer to the details
[OpenXLab_Model-url]: https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full

## Communication group

- If it fails, go to the Issue section.

<p align="center">
49 changes: 44 additions & 5 deletions datasets/deduplicate.py
@@ -5,6 +5,9 @@
from hashlib import md5
from simhash import Simhash

import time
import numpy as np

def extract_text_from_json(obj, content):
# print(content)
if isinstance(obj, dict):
@@ -29,7 +32,7 @@ def is_duplicate_absolutely(d1, d2):
def hash_dict(dict_obj):
content = extract_text_from_json(dict_obj,'')
content = content.replace('\n', '').replace('\t', '').replace(' ', '')
print(content)
# print(content)
# m = get_minhash(content)
m = Simhash(content)
return m
@@ -43,10 +46,19 @@ def get_simhash(dict_obj):
return Simhash(dict_obj)

# Deduplicate a list of dicts using exact matching and SimHash
def deduplicate_json(data_list, threshold=0.8):
def deduplicate_json(data_list, threshold=0.8, time_print=True):
seen_hashes = []
keep = []
duplicate = []

# global start
start = time.time()
last_start_seen_hashes = start
last_start_duplicate = start
stop1 = 0
stop2 = 0
print_interval = 500

for item in data_list:
if not item['conversation']:
continue
@@ -60,15 +72,36 @@
has_similar = False
# for stored_min_hash, stored_text in seen_hashes:
# if stored_min_hash.jaccard(min_hash) > threshold:

for stored_min_hash, stored_text in seen_hashes:
if 1 - (stored_min_hash.distance(sim_hash)/64.0) > threshold:
has_similar = True
duplicate.append(item)

print_len_duplicate = len(duplicate)+1
if print_len_duplicate%print_interval == 0:
if time_print:
stop1 = time.time()
print(f'print_len_duplicate={print_len_duplicate} Time: ', np.round(stop1 - last_start_duplicate, 5), np.round(stop1 - start , 5))
last_start_duplicate = stop1
else:
print(f'print_len_duplicate={print_len_duplicate}')

break
if not has_similar:
# seen_hashes.append((min_hash,item))

seen_hashes.append((sim_hash,item))
keep.append(item)


print_len_seen_hashes = len(seen_hashes)+1
if print_len_seen_hashes%print_interval == 0:
if time_print:
stop2 = time.time()
print(f'print_len_seen_hashes={print_len_seen_hashes} Time: ', str(np.round(stop2 - last_start_seen_hashes,5)), str(np.round(stop2 - start, 5)))
last_start_seen_hashes = stop2
else:
print(f'print_len_seen_hashes={print_len_seen_hashes}')
else:
duplicate.append(item)

@@ -77,7 +110,8 @@

if __name__ == '__main__':
DUP_THRESH = 0.8
data_ai = 'qwen'
data_ai = 'FatherLikeBF'
# root_dir = rf'./datasets/{data_ai}/'
root_dir = rf'./{data_ai}/'
dedup_output_dir = os.path.join(root_dir,'dedup')
if not os.path.exists(dedup_output_dir):
@@ -93,9 +127,14 @@ def deduplicate_json(data_list, threshold=0.8):
if is_json_file(file_path):
with open(file_path, 'r', encoding='utf-8') as f:
data = json.load(f)
dedup_data, duplicate = deduplicate_json(data, DUP_THRESH)
dedup_data, duplicate = deduplicate_json(data, DUP_THRESH)

with open(os.path.join(root_dir, 'dedup','dedup_' + file), 'w', encoding='utf-8') as output_file:
json.dump(dedup_data, output_file, ensure_ascii=False, indent=4)

with open(os.path.join(root_dir, 'dedup','dup_' + file), 'w', encoding='utf-8') as output_file:
json.dump(duplicate, output_file, ensure_ascii=False, indent=4)

for item in dedup_data:
logger.info(f'dedup_data: {item}')
for item in duplicate:
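
The similarity test in `deduplicate_json` treats two items as near-duplicates when at least `threshold` of their 64 SimHash bits agree, which is what `1 - distance/64.0 > threshold` computes. A standalone sketch of that check, with invented sample strings:

```python
# Standalone sketch of the similarity test used in deduplicate_json above.
from simhash import Simhash

def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
    # Simhash fingerprints are 64-bit; distance() is the Hamming distance,
    # so 1 - distance/64 is the fraction of bits that agree.
    similarity = 1 - Simhash(text_a).distance(Simhash(text_b)) / 64.0
    return similarity > threshold

# Invented examples, not taken from the project's datasets.
print(is_near_duplicate("I feel anxious and cannot sleep at night",
                        "I feel anxious and can not sleep at night"))  # expected: True
print(is_near_duplicate("I feel anxious and cannot sleep at night",
                        "What is the weather like today"))             # expected: False
```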
14 changes: 10 additions & 4 deletions demo/cli_internlm2.py
@@ -1,17 +1,23 @@
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from openxlab.model import download
from modelscope import snapshot_download

download(model_repo='jujimeizuo/EmoLLM_Model',
output='model')
# Download the model from OpenXLab
model_name_or_path = download(model_repo='ajupyter/EmoLLM_internlm2_7b_full',
                              output='EmoLLM_internlm2_7b_full')

model_name_or_path = "model"
# Download the model from ModelScope (this assignment overrides the OpenXLab path above)
model_name_or_path = snapshot_download('chg0901/EmoLLM-InternLM7B-base')

# offline model
# model_name_or_path = "/root/StableCascade/emollm2/EmoLLM/xtuner_config/merged"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map='auto')
model = model.eval()

system_prompt = "你是一个由aJupyter、Farewell、jujimeizuo、Smiling&Weeping研发(排名按字母顺序排序,不分先后)、散步提供技术支持、上海人工智能实验室提供支持开发的心理健康大模型。现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。"
system_prompt = '你是心理健康助手EmoLLM,由EmoLLM团队打造。你旨在通过专业心理咨询,协助来访者完成心理诊断。请充分利用专业心理学知识与咨询技术,一步步帮助来访者解决心理问题。'

messages = [(system_prompt, '')]
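
With the model loaded, a reply can be generated through the `chat` helper exposed by InternLM2's `trust_remote_code` modeling code. Below is a hedged sketch of a single turn; the keyword names (`history`, `meta_instruction`) are assumptions based on that remote code rather than something this script guarantees:

```python
# Sketch of one chat turn; chat() comes from InternLM2's remote modeling code,
# and its exact signature here is an assumption.
response, history = model.chat(
    tokenizer,
    "I have been sleeping badly lately. What should I do?",  # invented sample query
    history=[],
    meta_instruction=system_prompt,  # assumption: the system prompt is passed this way
)
print(response)
```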
