Hello, could you explain this? #4

Open
cristianohello opened this issue Dec 19, 2023 · 1 comment

Comments

@cristianohello
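"To address this problem, we build on existing instruction datasets and optimize the data a second time through two stages: instruction data filtering and instruction data integration."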

How exactly do these two stages, instruction data filtering and instruction data integration, optimize the data a second time?

@Zlasejd
Owner

Zlasejd commented Dec 19, 2023

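Instruction data filtering is done with regular expressions combined with manual review. Instruction data integration uses TF-IDF with cosine similarity, computed pairwise over all dialogue data, to find pairs that may be semantically duplicated. After several rounds of experiments, we kept the pairs whose similarity is above 0.8. ChatGPT then judges how much the two dialogues in each pair overlap semantically: highly duplicated dialogues are deleted, and semantically similar dialogues are merged into a new, more complex dialogue.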
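For readers who want to see the shape of these two stages, here is a minimal sketch in plain Python. It is not the project's actual code. The filter patterns in `BAD_PATTERNS`, the tokenizer, the smoothed IDF formula, and every function name are assumptions made for illustration; only the TF-IDF plus cosine similarity approach and the 0.8 threshold come from the reply above. The ChatGPT judgment and the delete-or-merge step are not shown.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical filter patterns; the thread does not say which ones were used.
BAD_PATTERNS = [re.compile(r"https?://\S+"), re.compile(r"^\s*$")]


def passes_filter(text):
    """Stage 1 (sketch): keep a text only if no hypothetical pattern matches."""
    return not any(p.search(text) for p in BAD_PATTERNS)


def tokenize(text):
    # Assumption: lowercase word tokens. Unsegmented Chinese text would need
    # a word segmenter here, since \w+ treats a run of characters as one token.
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumption), so no term gets a zero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_pairs(docs, threshold=0.8):
    """Stage 2 (sketch): index pairs whose similarity exceeds the threshold."""
    vecs = tfidf_vectors(docs)
    pairs = []
    for i, j in combinations(range(len(docs)), 2):
        score = cosine(vecs[i], vecs[j])
        if score > threshold:
            pairs.append((i, j, score))
    return pairs


dialogues = [
    "How do I reset my password?",
    "How do I reset my password please?",
    "What is the weather today?",
]
kept = [d for d in dialogues if passes_filter(d)]
pairs = candidate_pairs(kept)
```

In this toy input only the first two dialogues form a candidate pair. Each such pair would then go to the ChatGPT judgment step described above. Comparing every pair is quadratic in the number of dialogues, so a large dataset would likely need a vectorized or approximate approach instead.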
