
Regarding repo-level data processing procedure #8

@potato18z

Description


Thank you for your excellent work! I have a few questions regarding your repository-level data processing procedure, specifically concerning sections 2.2.1 and 2.2.5.

Regarding Section 2.2.1 Preprocessing

"... implementing deduplication at both the repository and file levels. For each level, we performed exact-deduplication using SHA256 hashes of contents and near-deduplication via the MinHash algorithm. This two-tier strategy yielded two variants of the code corpus..."

  1. Is this two-tier process performed sequentially or independently in parallel? Concretely, did you first concatenate files to create repo-level samples, perform repo-level exact and near-deduplication, and then split the surviving repos into individual files for file-level exact and near-deduplication? Or was the process structured differently? (A sketch of my current understanding follows after this list.)
  2. For the repo-level exact deduplication to work effectively, did you concatenate the files within each repo in a specific order (e.g., lexical order of file paths) beforehand?
  3. Did you use the same MinHash parameters for both the repo-level and file-level near-deduplication? If possible, could you share the threshold?
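
To make Question 1 concrete, here is a minimal sketch of the sequential interpretation I have in mind (repo-level dedup first, then file-level dedup on the surviving repos). The lexical-path-order concatenation, the token-level shingling, and the MinHash parameters (`NUM_PERM`, `THRESHOLD`) are my own placeholders, not values taken from the paper:

```python
# Hypothetical sketch only -- ordering of stages and all parameters are assumptions.
import hashlib
from datasketch import MinHash, MinHashLSH

NUM_PERM = 256      # assumed; not from the paper
THRESHOLD = 0.85    # assumed Jaccard threshold; not from the paper

def minhash_of(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    for token in text.split():              # token-level shingling is also an assumption
        m.update(token.encode("utf-8"))
    return m

def dedup(samples: dict) -> dict:
    """Exact dedup via SHA256 of content, then near-dedup via MinHash LSH."""
    seen_hashes, exact_unique = set(), {}
    for key, text in samples.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            exact_unique[key] = text
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    kept = {}
    for key, text in exact_unique.items():
        m = minhash_of(text)
        if not lsh.query(m):                # keep only if no near-duplicate was kept already
            lsh.insert(key, m)
            kept[key] = text
    return kept

# repos: {repo_name: {file_path: file_content}} -- toy input for illustration
repos = {
    "org/repo_a": {"a.py": "print('a')\n", "b.py": "print('b')\n"},
    "org/repo_b": {"a.py": "print('a')\n", "b.py": "print('b')\n"},   # exact duplicate repo
}

# Question 2: concatenate files in lexical path order to form repo-level samples?
repo_samples = {
    repo: "".join(text for _, text in sorted(files.items()))
    for repo, files in repos.items()
}
repo_kept = dedup(repo_samples)             # repo-level exact + near dedup

# Question 1: then split the surviving repos back into files for file-level dedup?
file_samples = {
    f"{repo}/{path}": text
    for repo in repo_kept
    for path, text in repos[repo].items()
}
file_kept = dedup(file_samples)             # file-level exact + near dedup
```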

Regarding Section 2.2.1 Quality Filtering and Section 2.2.5 Long-Context Data for Continued Pretraining

"... This corpus supports 89 programming languages, forged into both the repository-level and file-level code data shown in Figure 2..."
"We selected high-quality repositories based on average file quality scores. For mainstream programming languages (e.g., Python, Java, and C), we implemented topological concatenation based on file dependencies. For HTML, SQL, and Shell, we used random concatenation. "

  1. I'd like to confirm my understanding of this process: for the repo-level deduplicated corpus, did you first apply the file-level quality filter to each individual file within every repository? Did you then calculate an average file-quality score for each repository from its constituent files, and apply the topological/random concatenation only to the subset of repositories selected by that score? (A sketch of what I have in mind follows after this item.)
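
For reference, this is roughly what I picture for the average-score filtering plus topological concatenation. The quality scorer, the score cutoff, and the naive Python-import dependency extraction below are invented placeholders; I'm only trying to confirm the overall shape of the procedure:

```python
# Hypothetical sketch -- scorer, threshold, and dependency extraction are placeholders.
import re
import networkx as nx

SCORE_THRESHOLD = 0.7   # assumed cutoff for "high-quality" repositories

def file_quality_score(text: str) -> float:
    # Placeholder for the paper's file-level quality model; trivial stand-in here.
    return min(1.0, len(text) / 1000)

def python_deps(path: str, text: str, all_paths: set) -> list:
    """Very naive intra-repo import resolution (placeholder)."""
    deps = []
    for mod in re.findall(r"^\s*(?:from|import)\s+([\w.]+)", text, flags=re.M):
        candidate = mod.replace(".", "/") + ".py"
        if candidate in all_paths and candidate != path:
            deps.append(candidate)
    return deps

def concat_repo(files: dict) -> str:
    """Concatenate files so that dependencies come before their dependents."""
    g = nx.DiGraph()
    g.add_nodes_from(files)
    for path, text in files.items():
        for dep in python_deps(path, text, set(files)):
            g.add_edge(dep, path)            # edge: dependency -> dependent
    try:
        order = list(nx.topological_sort(g))
    except nx.NetworkXUnfeasible:            # cyclic imports: fall back to path order
        order = sorted(files)
    return "".join(files[p] for p in order)

def select_and_concat(repos: dict) -> dict:
    out = {}
    for repo, files in repos.items():
        scores = [file_quality_score(t) for t in files.values()]
        if scores and sum(scores) / len(scores) >= SCORE_THRESHOLD:  # average file score
            out[repo] = concat_repo(files)
    return out
```

In particular, I'm unsure whether the edge direction (dependency placed before dependent) and the handling of cyclic imports in this sketch match what you actually did.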

"Each repository was mapped to a single string sequence, with exceptionally large repositories (e.g., PyTorch) being decomposed into multiple independent subgraphs to avoid oversized sequences while preserving logical coherence. "

  1. For repositories that decompose into multiple independent subgraphs but whose concatenated total size is still under 32k, how did you concatenate these subgraphs to form the final single string sequence? Simply in random order? (A sketch of what I have in mind follows below.)
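
This is the kind of subgraph handling I'm asking about; the 32k budget, the whitespace "tokenizer", and especially the random inter-subgraph ordering are assumptions I'd like you to confirm or correct:

```python
# Hypothetical sketch -- budget, tokenizer, and ordering are assumptions, not your method.
import random
import networkx as nx

MAX_TOKENS = 32_768           # assumed sequence budget

def n_tokens(text: str) -> int:
    return len(text.split())  # whitespace stand-in for the real tokenizer

def repo_sequences(files: dict, dep_graph: nx.DiGraph) -> list:
    """Split the file dependency graph into independent subgraphs, then either
    merge them into one sequence or emit one sequence per subgraph."""
    components = [sorted(c) for c in nx.weakly_connected_components(dep_graph)]
    # Within a subgraph the ordering would presumably follow the topological
    # concatenation above; sorted path order is just a stand-in here.
    texts = ["".join(files[p] for p in comp) for comp in components]
    if sum(n_tokens(t) for t in texts) <= MAX_TOKENS:
        random.shuffle(texts)     # <- the inter-subgraph ordering I'm asking about
        return ["".join(texts)]   # single string sequence for the whole repository
    # Oversized repositories (e.g., PyTorch): one independent sequence per subgraph
    return texts
```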

Thank you in advance for your time and any clarification you can provide!
