1 | | -# Dataset Preprocessing Documentation - DeepSeek-R1 |
2 | | - |
3 | | -## Model: DeepSeek-R1 |
4 | | -**Dataset:** Multi-domain Evaluation Ensemble |
5 | | -**Evaluation Task:** Multi-domain Reasoning and Code Generation |
6 | | - |
7 | | -## Data Source |
8 | | -- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2 bucket |
9 | | -- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/` |
10 | | -- **Components:** AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench (code_generation_lite) |
11 | | -- **Licenses:** |
12 | | - - AIME: [CC0](https://creativecommons.org/public-domain/cc0/) |
13 | | - - MATH500: [MIT](https://opensource.org/license/mit) |
14 | | - - GPQA: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) |
15 | | - - MMLU-Pro: [MIT](https://opensource.org/license/mit) |
16 | | - - LiveCodeBench: [CC](https://creativecommons.org/share-your-work/cclicenses/) |
17 | | - |
18 | | -## Current Implementation |
19 | | - |
20 | | -### Files Available |
21 | | -- **Main Dataset:** `mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl` |
22 | | -- **Calibration Set:** `mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl` |
23 | | -- **Format:** Preprocessed pickle files ready for evaluation |
24 | | - |
25 | | -### Download Process |
26 | | -```bash |
27 | | -# Install Rclone |
28 | | -sudo -v ; curl https://rclone.org/install.sh | sudo bash |
29 | | - |
30 | | -# Configure access |
31 | | -rclone config create mlc-inference s3 provider=Cloudflare \ |
32 | | - access_key_id=f65ba5eef400db161ea49967de89f47b \ |
33 | | - secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b \ |
34 | | - endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com |
35 | | - |
36 | | -# Download datasets |
37 | | -rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl ./ -P |
38 | | -rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl ./ -P |
| 1 | +# DeepSeek-R1 Preprocessing |
| 2 | + |
| 3 | +## Model Configuration |
| 4 | +- **Model**: `deepseek-ai/DeepSeek-R1` |
| 5 | +- **Revision**: `56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad` |
| 6 | +- **Max Length**: 32,768 tokens (32K) |
| 7 | + |
| 8 | +## Tokenization |
| 9 | +```python |
| 10 | +from transformers import AutoTokenizer |
| 11 | + |
| 12 | +# From utils/tokenization.py |
| 13 | +tokenizer = AutoTokenizer.from_pretrained( |
| 14 | + "deepseek-ai/DeepSeek-R1", |
| 15 | + revision="56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad" |
| 16 | +) |
39 | 17 | ``` |
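|  |  | +
|  |  | +As a quick sanity check (the prompt and snippet below are illustrative, not part of `utils/tokenization.py`), the loaded tokenizer can be exercised directly:
|  |  | +```python
|  |  | +# Hypothetical sanity check, not from the reference code
|  |  | +prompt = "What is 2 + 2?"  # made-up example prompt
|  |  | +token_ids = tokenizer.encode(prompt, truncation=True, max_length=32768)
|  |  | +print(len(token_ids))               # stays within the 32K cap
|  |  | +print(tokenizer.decode(token_ids))  # round-trips back to text
|  |  | +```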
40 | 18 |
|
41 | | -## Missing Documentation (Addresses Issue #2245) |
42 | | - |
43 | | -The following preprocessing information is **not currently available**, making reproduction and adaptation difficult: |
44 | | - |
45 | | -### 1. Original Data Sources |
46 | | -- **Raw Dataset Locations:** Where each component dataset was obtained |
47 | | -- **Version Information:** Specific versions/commits of source datasets |
48 | | -- **Access Methods:** How to obtain raw data independently |
49 | | - |
50 | | -### 2. Preprocessing Pipeline |
51 | | -- **Tokenization Method:** Which tokenizer was used and configuration |
52 | | -- **Input Formatting:** How different dataset formats were standardized |
53 | | -- **Quality Filtering:** Criteria for sample inclusion/exclusion |
54 | | -- **Ensemble Strategy:** How multiple datasets were combined |
55 | | - |
56 | | -### 3. Dataset Statistics |
57 | | -- **Sample Counts:** Number of samples from each component dataset |
58 | | -- **Distribution:** How samples are balanced across domains |
59 | | -- **Difficulty Levels:** Complexity distribution of included problems |
| 19 | +## Preprocessing Method |
60 | 20 |
|
61 | | -### 4. Validation Process |
62 | | -- **Quality Control:** How preprocessing quality was verified |
63 | | -- **Consistency Checks:** Validation of format standardization |
64 | | -- **Error Handling:** How malformed samples were addressed |
| 21 | +The preprocessing varies by backend: |
65 | 22 |
|
66 | | -## Adaptation Challenges |
67 | | - |
68 | | -**For Different Tokenizers:** |
69 | | -- Cannot modify tokenization without access to raw data |
70 | | -- No documentation of original tokenization parameters |
71 | | -- Unable to test preprocessing consistency |
72 | | - |
73 | | -**For Different Models:** |
74 | | -- Cannot adapt input formatting without preprocessing scripts |
75 | | -- No guidance on prompt template modifications |
76 |  | -- Unable to reproduce the dataset with different filtering criteria
77 | | - |
78 | | -## Recommended Improvements |
79 | | - |
80 | | -To fully address issue #2245 and improve reproducibility: |
81 | | - |
82 | | -### 1. Raw Data Access |
83 | | -- Provide scripts to download original datasets |
84 | | -- Document exact versions and sources used |
85 | | -- Include data licenses and attribution |
86 | | - |
87 | | -### 2. Preprocessing Scripts |
88 |  | -- Create a preprocessing pipeline (similar to `llama2-70b/processorca.py`)
89 | | -- Document tokenization and formatting steps |
90 | | -- Include quality filtering logic |
91 | | - |
92 | | -### 3. Documentation |
93 | | -- Add detailed preprocessing methodology |
94 | | -- Include dataset statistics and composition |
95 | | -- Provide adaptation guidelines |
| 23 | +### PyTorch/vLLM Backends (Chat Template Enabled) |
| 24 | +```python |
| 25 | +# From utils/tokenization.py |
| 26 | +tokens = tokenizer.apply_chat_template( |
| 27 | + [{"role": "user", "content": prompt}], |
| 28 | + add_generation_prompt=True, |
| 29 | + max_length=32768, |
| 30 | + truncation=True |
| 31 | +) |
| 32 | +``` |
96 | 33 |
|
97 | | -### 4. Validation |
98 | | -- Include preprocessing verification scripts |
99 | | -- Document expected outputs and checksums |
100 | | -- Provide quality metrics |
| 34 | +### SGLang Backend (No Chat Template) |
| 35 | +```python |
| 36 | +tokens = tokenizer.encode( |
| 37 | + prompt, |
| 38 | + truncation=True, |
| 39 | + max_length=32768 |
| 40 | +) |
| 41 | +``` |
101 | 42 |
|
102 | | -## Temporary Workaround |
| 43 | +## Backend Configuration |
| 44 | +| Backend | uses_chat_template | input_type | |
| 45 | +|---------|-------------------|------------| |
| 46 | +| PyTorch | True | tokenized | |
| 47 | +| vLLM | True | text | |
| 48 | +| SGLang | False | text | |
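|  |  | +
|  |  | +As an illustration of how this table might drive input preparation, here is a minimal sketch; `BACKEND_CONFIG` and `prepare_input` are hypothetical names, not code from the repository:
|  |  | +```python
|  |  | +# Hypothetical dispatch over the table above; illustrative only.
|  |  | +BACKEND_CONFIG = {
|  |  | +    "pytorch": {"uses_chat_template": True,  "input_type": "tokenized"},
|  |  | +    "vllm":    {"uses_chat_template": True,  "input_type": "text"},
|  |  | +    "sglang":  {"uses_chat_template": False, "input_type": "text"},
|  |  | +}
|  |  | +
|  |  | +def prepare_input(prompt: str, backend: str, tokenizer):
|  |  | +    cfg = BACKEND_CONFIG[backend]
|  |  | +    wants_tokens = cfg["input_type"] == "tokenized"
|  |  | +    if cfg["uses_chat_template"]:
|  |  | +        # tokenize=False returns the templated string instead of token IDs
|  |  | +        return tokenizer.apply_chat_template(
|  |  | +            [{"role": "user", "content": prompt}],
|  |  | +            add_generation_prompt=True,
|  |  | +            tokenize=wants_tokens,
|  |  | +            max_length=32768,
|  |  | +            truncation=True,
|  |  | +        )
|  |  | +    # No chat template: return raw text, or token IDs if required
|  |  | +    if wants_tokens:
|  |  | +        return tokenizer.encode(prompt, truncation=True, max_length=32768)
|  |  | +    return prompt
|  |  | +```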
103 | 49 |
|
104 | | -Until full preprocessing documentation is available: |
105 | | -1. Use provided preprocessed datasets for standard evaluation |
106 | | -2. Contact maintainers for specific adaptation requirements |
107 | | -3. Reference `llama2-70b/processorca.py` for preprocessing patterns |
108 | | -4. Consider contributing preprocessing scripts based on reverse engineering |
| 50 | +## Dataset Format |
| 51 | +Input data should have a `text_input` column containing the prompts. |
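|  |  | +
|  |  | +For example, a quick inspection of the preprocessed pickle; this assumes the `.pkl` files deserialize to a pandas DataFrame, which the documentation does not state explicitly:
|  |  | +```python
|  |  | +# Assumes a pandas DataFrame inside the pickle; treat as a sketch.
|  |  | +import pandas as pd
|  |  | +
|  |  | +df = pd.read_pickle("mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl")
|  |  | +print(df.columns)                # expect a 'text_input' column
|  |  | +print(df["text_input"].iloc[0])  # first prompt in the dataset
|  |  | +```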
109 | 52 |
|
110 | | -## See Also |
111 | | -- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing |
112 | | -- `PREPROCESSING-TEMPLATE.md` - Standard template for future models |
113 | | -- Repository issue #2245 - Discussion of preprocessing documentation gaps |
| 53 | +## Accuracy Target |
| 54 | +``` |
| 55 | +"mean-accuracy": 81.3582 |
| 56 | +``` |