
Regarding repo-level data processing procedure #8

@potato18z

Description


Thank you for your excellent work! I have a few questions regarding your repository-level data processing procedure, specifically concerning sections 2.2.1 and 2.2.5.

Regarding Section 2.2.1 Preprocessing

"... implementing deduplication at both the repository and file levels. For each level, we performed exact-deduplication using SHA256 hashes of contents and near-deduplication via the MinHash algorithm. This two-tier strategy yielded two variants of the code corpus..."

  1. Is this two-tier process performed sequentially or independently in parallel? Concretely, did you first concatenate files to create repo-level samples, perform repo-level exact and near-deduplication, and then split the surviving repos into individual files for file-level exact and near-deduplication? Or was the process structured differently? (A sketch of my current understanding follows after this list.)
  2. For the repo-level exact deduplication to work effectively, did you concatenate the files within each repo in a specific order (e.g., lexical order of file paths) beforehand?
  3. Did you use the same MinHash parameters for both the repo-level and file-level near-deduplication? If possible, could you share the threshold?
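
To make Question 1 concrete, here is a minimal sketch of the sequential interpretation I have in mind (repo-level dedup first, then file-level dedup on the surviving repos). The lexical-path-order concatenation, the token-level shingling, and the MinHash parameters (`NUM_PERM`, `THRESHOLD`) are my own placeholders, not values taken from the paper:

```python
# Hypothetical sketch only -- ordering of stages and all parameters are assumptions.
import hashlib
from datasketch import MinHash, MinHashLSH

NUM_PERM = 256      # assumed; not from the paper
THRESHOLD = 0.85    # assumed Jaccard threshold; not from the paper

def minhash_of(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    for token in text.split():              # token-level shingling is also an assumption
        m.update(token.encode("utf-8"))
    return m

def dedup(samples: dict) -> dict:
    """Exact dedup via SHA256 of content, then near-dedup via MinHash LSH."""
    seen_hashes, exact_unique = set(), {}
    for key, text in samples.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            exact_unique[key] = text
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    kept = {}
    for key, text in exact_unique.items():
        m = minhash_of(text)
        if not lsh.query(m):                # keep only if no near-duplicate was kept already
            lsh.insert(key, m)
            kept[key] = text
    return kept

# repos: {repo_name: {file_path: file_content}} -- toy input for illustration
repos = {
    "org/repo_a": {"a.py": "print('a')\n", "b.py": "print('b')\n"},
    "org/repo_b": {"a.py": "print('a')\n", "b.py": "print('b')\n"},   # exact duplicate repo
}

# Question 2: concatenate files in lexical path order to form repo-level samples?
repo_samples = {
    repo: "".join(text for _, text in sorted(files.items()))
    for repo, files in repos.items()
}
repo_kept = dedup(repo_samples)             # repo-level exact + near dedup

# Question 1: then split the surviving repos back into files for file-level dedup?
file_samples = {
    f"{repo}/{path}": text
    for repo in repo_kept
    for path, text in repos[repo].items()
}
file_kept = dedup(file_samples)             # file-level exact + near dedup
```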

Regarding Section 2.2.1 Quality Filtering and Section 2.2.5 Long-Context Data for Continued Pretraining

"... This corpus supports 89 programming languages, forged into both the repository-level and file-level code data shown in Figure 2..."
"We selected high-quality repositories based on average file quality scores. For mainstream programming languages (e.g., Python, Java, and C), we implemented topological concatenation based on file dependencies. For HTML, SQL, and Shell, we used random concatenation. "

  1. I'd like to confirm my understanding of this process: for the repo-level deduplicated corpus, did you first apply the file-level quality filter to each individual file within every repository? Did you then calculate an average file-quality score for each repository from its constituent files, and apply the topological/random concatenation only to the subset of repositories selected by that score? (A sketch of what I have in mind follows after this item.)
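
For reference, this is roughly what I picture for the average-score filtering plus topological concatenation. The quality scorer, the score cutoff, and the naive Python-import dependency extraction below are invented placeholders; I'm only trying to confirm the overall shape of the procedure:

```python
# Hypothetical sketch -- scorer, threshold, and dependency extraction are placeholders.
import re
import networkx as nx

SCORE_THRESHOLD = 0.7   # assumed cutoff for "high-quality" repositories

def file_quality_score(text: str) -> float:
    # Placeholder for the paper's file-level quality model; trivial stand-in here.
    return min(1.0, len(text) / 1000)

def python_deps(path: str, text: str, all_paths: set) -> list:
    """Very naive intra-repo import resolution (placeholder)."""
    deps = []
    for mod in re.findall(r"^\s*(?:from|import)\s+([\w.]+)", text, flags=re.M):
        candidate = mod.replace(".", "/") + ".py"
        if candidate in all_paths and candidate != path:
            deps.append(candidate)
    return deps

def concat_repo(files: dict) -> str:
    """Concatenate files so that dependencies come before their dependents."""
    g = nx.DiGraph()
    g.add_nodes_from(files)
    for path, text in files.items():
        for dep in python_deps(path, text, set(files)):
            g.add_edge(dep, path)            # edge: dependency -> dependent
    try:
        order = list(nx.topological_sort(g))
    except nx.NetworkXUnfeasible:            # cyclic imports: fall back to path order
        order = sorted(files)
    return "".join(files[p] for p in order)

def select_and_concat(repos: dict) -> dict:
    out = {}
    for repo, files in repos.items():
        scores = [file_quality_score(t) for t in files.values()]
        if scores and sum(scores) / len(scores) >= SCORE_THRESHOLD:  # average file score
            out[repo] = concat_repo(files)
    return out
```

In particular, I'm unsure whether the edge direction (dependency placed before dependent) and the handling of cyclic imports in this sketch match what you actually did.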

"Each repository was mapped to a single string sequence, with exceptionally large repositories (e.g., PyTorch) being decomposed into multiple independent subgraphs to avoid oversized sequences while preserving logical coherence. "

  1. For repositories that decompose into multiple independent subgraphs but whose concatenated total size is still under 32k, how did you concatenate these subgraphs to form the final single string sequence? Simply in random order? (A sketch of what I have in mind follows below.)
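
This is the kind of subgraph handling I'm asking about; the 32k budget, the whitespace "tokenizer", and especially the random inter-subgraph ordering are assumptions I'd like you to confirm or correct:

```python
# Hypothetical sketch -- budget, tokenizer, and ordering are assumptions, not your method.
import random
import networkx as nx

MAX_TOKENS = 32_768           # assumed sequence budget

def n_tokens(text: str) -> int:
    return len(text.split())  # whitespace stand-in for the real tokenizer

def repo_sequences(files: dict, dep_graph: nx.DiGraph) -> list:
    """Split the file dependency graph into independent subgraphs, then either
    merge them into one sequence or emit one sequence per subgraph."""
    components = [sorted(c) for c in nx.weakly_connected_components(dep_graph)]
    # Within a subgraph the ordering would presumably follow the topological
    # concatenation above; sorted path order is just a stand-in here.
    texts = ["".join(files[p] for p in comp) for comp in components]
    if sum(n_tokens(t) for t in texts) <= MAX_TOKENS:
        random.shuffle(texts)     # <- the inter-subgraph ordering I'm asking about
        return ["".join(texts)]   # single string sequence for the whole repository
    # Oversized repositories (e.g., PyTorch): one independent sequence per subgraph
    return texts
```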

Thank you in advance for your time and any clarification you can provide!
