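"To address this problem, we build on existing instruction datasets and refine the data a second time in two stages: instruction data filtering and instruction data integration."
How exactly do these two stages, instruction data filtering and instruction data integration, refine the data?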
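Instruction data filtering is done with regular expressions combined with manual review. Instruction data integration uses TF-IDF with cosine similarity to score every pair of dialogues, which yields a set of pairs that may be semantically duplicated. After several rounds of experiments, we kept the pairs whose similarity score is above 0.8. ChatGPT then judges how much the two dialogues in each pair overlap semantically: highly duplicated dialogues are deleted, and semantically similar dialogues are merged into a new, more complex dialogue.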
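
The pairwise scoring step described above can be sketched roughly as follows. This is a minimal illustration and not the maintainers' actual code: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `candidate_pairs`), the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for the example. Only the 0.8 threshold comes from the reply. Chinese dialogues would need a word segmenter in place of the whitespace split.

```python
import math
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Assumption: whitespace tokenization. Chinese text would need a
    # word segmenter here instead.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Relative term frequency times a smoothed IDF (an assumed variant).
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_pairs(docs, threshold=0.8):
    """Score every pair of documents and keep those above the threshold."""
    vectors = tfidf_vectors(docs)
    pairs = []
    for i, j in combinations(range(len(docs)), 2):
        sim = cosine(vectors[i], vectors[j])
        if sim > threshold:
            pairs.append((i, j, sim))
    return pairs
```

The pairs returned by `candidate_pairs` would then go to the later judgment step (ChatGPT in the reply), which decides whether to delete or merge each pair. Comparing all pairs is quadratic in the number of dialogues, so a large dataset would likely need batching or an approximate nearest-neighbour index.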