- msmarco: open-source deep learning dataset from Microsoft
- openai/human-eval: evaluates code language models
- T2Ranking: A large-scale Chinese Benchmark for Passage Ranking (THUIR/T2Ranking)
- UER: open-source pre-training model framework in PyTorch, with a pre-trained model zoo
Semantic similarity
- CLUEbenchmark/SimCLUE
- shibing624/nli_zh
- liucongg/NLPDataSet
- Qianyan (千言) dataset: text similarity
- DMetaSoul/chinese-semantic-textual-similarity
Large-model datasets
- BEIR: information retrieval benchmark
- CLUEbenchmark: Chinese language understanding evaluation benchmark
- Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation
- MMLU: Massive Multitask Language Understanding, 57 tasks
- C-Eval: Chinese evaluation suite (leaderboard)
- GSM8K: grade-school math word problems
- BBH (BIG-Bench Hard)
Optimization algorithms
Parallelism strategies
- Data Parallelism
- Pipeline Parallelism: splits the input minibatch into multiple microbatches and pipelines the execution of these microbatches across multiple GPUs (see the sketch after this list)
- Tensor Parallelism: Rowwise, Colwise, and Pairwise parallelism
- Sequence Parallelism
- Zero Redundancy Optimizer (ZeRO)
- Auto-Parallelism
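
To make the microbatch idea concrete, here is a minimal single-process sketch of a GPipe-style pipeline schedule in PyTorch. The two stages, layer sizes, and microbatch count are illustrative assumptions; a real implementation places each stage on its own GPU and overlaps compute with communication.

```python
# Single-process sketch of pipeline parallelism (GPipe-style schedule).
# Stages are plain CPU modules here; in practice each lives on its own GPU.
import torch
import torch.nn as nn

stage0 = nn.Linear(16, 32)  # would live on GPU 0
stage1 = nn.Linear(32, 8)   # would live on GPU 1

minibatch = torch.randn(64, 16)
micro = minibatch.chunk(4)  # split the minibatch into 4 microbatches

# Pipelined forward: at step t, stage1 consumes the activation that stage0
# produced at step t-1, so in steady state both stages are busy at once.
acts, outs = [None] * len(micro), []
for t in range(len(micro) + 1):
    if t > 0:               # stage 1 processes the previous microbatch
        outs.append(stage1(acts[t - 1]))
    if t < len(micro):      # stage 0 processes the current microbatch
        acts[t] = stage0(micro[t])

output = torch.cat(outs)    # equals stage1(stage0(minibatch))
print(output.shape)         # torch.Size([64, 8])
```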
- flash-attention: fast and memory-efficient exact attention
- Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
- MLCommons: the mission of MLCommons™ is to make machine learning better for everyone
Large-scale, high-quality, non-duplicated data
- [2019] CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
- n-gram language models with Kneser-Ney and modified Kneser-Ney smoothing (a perplexity-filtering sketch follows this list)
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
- The Pile
- Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Web-crawled data for low-resource languages is often poor quality; before using such a dataset, it is worth sampling ~100 documents and inspecting them.
- CopyCat: Near-Duplicates Within and Between the ClueWeb and the Common Crawl. Their analysis shows that 14-52% of the documents within a crawl, and around 0.7-2.5% between the crawls, are near-duplicates.
- [2021] Extracting Training Data from Large Language Models: extracts training data from an LLM via prompting; larger models are more vulnerable to the attack than smaller ones.
- [2023] Quantifying Memorization Across Neural Language Models: the 6B GPT-J model memorizes at least 1% of the Pile. Larger models, more duplicated data, and longer contexts all make memorization more likely; deduplicating the training data reduces the harm from memorization.
- [2022] DeepMind: Scaling Language Models: Methods, Analysis & Insights from Training Gopher. Tests on 152 tasks show that scale helps reading comprehension, fact checking, and toxicity identification, but contributes little to logical and mathematical reasoning. Uses a battery of heuristic rules to filter low-quality text, and Google SafeSearch to filter web pages.
- ChenghaoMou/text-dedup: compares multiple deduplication methods; MinHash works best
- [2022] Deduplicating Training Data Makes Language Models Better: deduplication reduces memorization; uses exact substring deduplication plus MinHash
- MinHash + LSH (see the dedup sketch after this list)
- hazyresearch/meerkat: data visualization for foundation models
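
To show how the Kneser-Ney n-gram models above are used for quality filtering (as in the CCNet recipe), here is a sketch with the kenlm Python bindings. The model path and perplexity threshold are hypothetical; CCNet trains 5-gram KenLM models on clean text such as Wikipedia and keeps the low-perplexity portion of the crawl.

```python
# Sketch of CCNet-style perplexity filtering with a KenLM n-gram model
# (modified Kneser-Ney smoothing). "wiki.arpa.bin" and max_ppl are
# hypothetical; the model should be trained on clean text (e.g. Wikipedia).
import kenlm

model = kenlm.Model("wiki.arpa.bin")

def keep(paragraph: str, max_ppl: float = 1000.0) -> bool:
    # Low perplexity under the clean-text LM suggests fluent, natural text.
    return model.perplexity(paragraph) <= max_ppl

crawl = [
    "the quick brown fox jumps over the lazy dog .",
    "zxq vvw ppo qqa zz kk",  # gibberish scores a high perplexity
]
head = [p for p in crawl if keep(p)]
```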
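
And a minimal near-duplicate detection sketch with MinHash + LSH using the datasketch library; the Jaccard threshold, `num_perm`, word-level shingling, and toy documents are illustrative choices, not the exact setup of the papers above.

```python
# Near-duplicate detection with MinHash signatures bucketed by LSH.
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():  # word shingles; n-grams also common
        m.update(token.encode("utf8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",  # near-duplicate of a
    "c": "large language models memorize duplicated training data",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)  # approximate Jaccard cutoff
sigs = {key: signature(text) for key, text in docs.items()}
for key, sig in sigs.items():
    lsh.insert(key, sig)

# "a" and "b" share most tokens, so they likely land in the same bucket.
print(lsh.query(sigs["a"]))  # e.g. ['a', 'b']
```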