PAIR-code/pretraining-tda

Scalable Influence and Fact Tracing for Large Language Model Pretraining

This is the landing page for the code and data release accompanying Scalable Influence and Fact Tracing for Large Language Model Pretraining (Chang et al. 2024).

Specifically, this includes:

  • Data files in JSON lines (.jsonl) format for:
    • The set of 5.4k prompts (queries) used for fact tracing evaluation, as well as the full set of 1.2M queries these are sampled from.
    • TDA method outputs (retrieved and scored proponents) corresponding to the experiments in Section 5 and Section 6 of the paper.
    • TDA method outputs corresponding to additional evaluation tasks in Appendix A.5 of the paper.
    • The corpus of 19.6M sentences from T-REx Wikipedia abstracts (Section 4.2 and 5 of the paper).
  • A data viewer app to make it easier to look at and analyze sets of retrieved proponents.

TDA Output Viewer

Lists of proponent passages are challenging to work with in spreadsheets or plain text files, due to the amount of text on-screen and the difficulty of quickly looking at scores, identifying string matches, or filtering to specific types of examples. We found it useful to write custom HTML visualizations, and packaged these into a simple viewer app:

https://pair-code.github.io/pretraining-tda/demo

You can load a .jsonl file of TDA results (a set of test examples and their retrieved proponents from the training set) from a URL or by uploading from your computer; see below for links to load the experiments from the paper. The app runs entirely in-browser and doesn't send your data to any server. For more information, see the user guide and app documentation.

Data files

Evaluation queries

The set of 5.4k triples and associated prompts (factual queries) used in the experiments in the paper: https://storage.googleapis.com/tda-resources/2410.17413/public/trex_facts_sample.jsonl

The full set of 1.2M triples which these are sampled from: https://storage.googleapis.com/tda-resources/2410.17413/public/trex_facts.jsonl (1GB file). Note that the set of 5.4k is not sampled uniformly from this; see Section 4.2 of the paper for more details.

Each record has the following fields:

  • fact_id
  • kilt_id
  • entity0, relation, and entity1
  • entity0_uri, predicate_uri, and entity1_uri
  • entity0_alias and entity1_alias - alternative surface forms
  • trex_sentences - mapping to the T-REx sentences, below
  • c4_frequency - annotation, based on string matching, of how frequently this fact appears in the C4 pretraining corpus
  • is_repetition - whether the fact contains repetition between entity0 and entity1
  • prompt0, prompt1, prompt2 - input prompts for this fact, generated using different templates. Unless otherwise specified, we use prompt0 for experiments in the paper.
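As a minimal sketch of how these query files might be read, the snippet below parses JSON lines one record at a time, which avoids loading the full 1GB file into memory. The helper name `iter_jsonl` is hypothetical, and the sample record is made-up illustrative data that uses only a subset of the field names listed above.

```python
import io
import json
from typing import IO, Iterator


def iter_jsonl(stream: IO[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up record for illustration only; real records carry more fields,
# and the actual values (e.g. the format of `relation`) may differ.
sample = io.StringIO(
    json.dumps({
        "fact_id": "example-0",
        "entity0": "Paris",
        "relation": "capital of",
        "entity1": "France",
        "prompt0": "Paris is the capital of",
    }) + "\n"
)

# Collect (fact_id, prompt0) pairs; prompt0 is the default prompt in the paper.
prompts = [(rec["fact_id"], rec["prompt0"]) for rec in iter_jsonl(sample)]
```

With a downloaded file, the same helper could be used as `with open("trex_facts_sample.jsonl", encoding="utf-8") as f:` followed by a loop over `iter_jsonl(f)`.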

TDA method results

The results files as used in the main paper are linked in the tables below. Each record has the following fields:

  • example_id
  • query_set
  • inputs_plaintext - the prompt (query) string; for T-REx facts, this is prompt0 from the query files above
  • targets_plaintext - the target string, generally entity1 from the query files above
  • proponents (as string[]) - proponent passage text
  • proponent_ids (as string[]) - passage IDs (for T-REx or C4)
  • proponent_scores (as number[]) - passage scores from the TDA method (e.g. Equation (1) of the paper)

For TDA methods that support a notion of "opponents" (this includes most gradient-based methods, but not the BM25 or Gecko baselines) we also include fields analogous to the proponents:

  • opponents (as string[])
  • opponent_ids (as string[])
  • opponent_scores (as number[])
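Because the passage text, IDs, and scores are stored as parallel arrays, it can be convenient to zip them into one object per passage. The sketch below does this for either proponents or opponents; `Passage` and `aligned_passages` are hypothetical names, the sample record is made up, and the code assumes only that the three arrays of a given kind have equal length. Records from methods without opponents simply yield an empty list.

```python
from typing import List, NamedTuple


class Passage(NamedTuple):
    passage_id: str
    score: float
    text: str


def aligned_passages(record: dict, kind: str = "proponent") -> List[Passage]:
    """Zip the parallel arrays for `kind` ("proponent" or "opponent").

    Passages are returned in file order; absent fields yield an empty list.
    """
    texts = record.get(f"{kind}s", [])
    ids = record.get(f"{kind}_ids", [])
    scores = record.get(f"{kind}_scores", [])
    return [Passage(i, s, t) for i, s, t in zip(ids, scores, texts)]


# Made-up record for illustration only.
record = {
    "example_id": "example-0",
    "proponents": ["passage A", "passage B"],
    "proponent_ids": ["id-a", "id-b"],
    "proponent_scores": [0.2, 0.7],
}

# Sort proponents by score, highest first, without assuming the file order.
ranked = sorted(aligned_passages(record), key=lambda p: p.score, reverse=True)
```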

T-REx records in Tables 1 and 2 additionally include the following fields (some are optional, as marked):

  • fact_id
  • relation
  • 8b_generations (as string[]) - decoder samples from the 8B model, for estimating confidence in the LLM's answer
  • 8b_confidence (as number) - fraction of samples from the 8B that match the target entity or an alias
  • c4_frequency and c4_frequency_bucket - frequency of the fact in the C4 corpus, based on string matching. c4_frequency_bucket groups this into buckets 0 through 5, with bucket 5 containing the most common facts.
  • has_trex_sentence - for retrievals from T-REx sentences, whether any sentence in T-REx contains this fact (optional, only Table 1)
  • proponent_correct (as boolean[]) - for retrievals from T-REx sentences, whether each proponent contains the fact, according to the T-REx annotations (optional, only Table 1)
  • proponent_ais_scores (as number[]) - for retrievals from C4, scores from the AIS (entailment) model for each proponent (optional, only Table 2)
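The proponent_correct field lends itself to simple ranking summaries. The sketch below computes a mean reciprocal rank and a hit rate at k over a set of Table 1 records. It assumes that proponents are stored best-first, which is not stated above, and it is not necessarily the exact evaluation procedure used in the paper (for example, it does not filter on has_trex_sentence). The function names and sample records are hypothetical.

```python
from typing import Dict, Iterable, List


def reciprocal_rank(correct: List[bool]) -> float:
    """1 / rank of the first correct proponent, or 0.0 if none is correct."""
    for rank, ok in enumerate(correct, start=1):
        if ok:
            return 1.0 / rank
    return 0.0


def hit_at_k(correct: List[bool], k: int = 10) -> float:
    """1.0 if any of the first k proponents is correct, else 0.0."""
    return 1.0 if any(correct[:k]) else 0.0


def summarize(records: Iterable[dict], k: int = 10) -> Dict[str, float]:
    """Average both metrics over records that carry proponent_correct."""
    # ASSUMPTION: array order is rank order (best proponent first).
    usable = [r["proponent_correct"] for r in records if "proponent_correct" in r]
    n = len(usable)
    if n == 0:
        return {"n": 0, "mrr": 0.0, "hit_at_k": 0.0}
    return {
        "n": n,
        "mrr": sum(reciprocal_rank(c) for c in usable) / n,
        "hit_at_k": sum(hit_at_k(c, k) for c in usable) / n,
    }


# Made-up records for illustration; the third lacks the optional field.
stats = summarize([
    {"proponent_correct": [False, True, False]},
    {"proponent_correct": [False, False]},
    {"example_id": "no-annotations"},
])
```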

For all tasks outside of T-REx, we retrieve proponents using TrackStar with the non-task-specific Hessian approximation (see Appendix A.5 in the paper). The additional tasks have the following optional fields:

  • is_8b_correct - for T-REx and arithmetic tasks, whether the 8B model generation matches the ground truth; for PIQA and COPA, whether the 8B model assigns higher probability to the ground truth than to the alternative completion; for story generation, this field is not included (no "ground truth" to compare to).
  • groundtruth - for T-REx incorrect predictions, the ground-truth target (entity); otherwise, this is omitted and targets_plaintext is equal to the ground truth answer.
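Since groundtruth is present only for the T-REx incorrect predictions, code that needs the reference answer has to fall back to targets_plaintext when the field is absent. A minimal sketch of that lookup follows; `ground_truth` is a hypothetical helper and the sample records are made up.

```python
def ground_truth(record: dict) -> str:
    """Return the reference answer for a record.

    Uses `groundtruth` when present (T-REx incorrect predictions);
    otherwise `targets_plaintext` is itself the ground truth.
    """
    if "groundtruth" in record:
        return record["groundtruth"]
    return record["targets_plaintext"]


# Made-up records for illustration only.
incorrect_pred = {"targets_plaintext": "Lyon", "groundtruth": "Paris"}
ordinary = {"targets_plaintext": "Paris", "is_8b_correct": True}

answers = [ground_truth(incorrect_pred), ground_truth(ordinary)]
```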

Table 1: T-REx facts, retrievals from T-REx sentences

Method | Download .jsonl file | Viewer link
BM25 | trex_retrievals_bm25.jsonl | view in app
Gecko | trex_retrievals_gecko.jsonl | view in app
TRAK | trex_retrievals_trak.jsonl | view in app
Exp 1 | trex_retrievals_exp1.jsonl | view in app
Exp 2 | trex_retrievals_exp2.jsonl | view in app
Exp 3 | trex_retrievals_exp3.jsonl | view in app
Exp 4 | trex_retrievals_exp4.jsonl | view in app
Exp 5 | trex_retrievals_exp5.jsonl | view in app
TrackStar | trex_retrievals_trackstar.jsonl | view in app

Table 2: T-REx facts, retrievals from C4

Method | Download .jsonl file | Viewer link
BM25 | c4_trex_retrievals_bm25.jsonl | view in app
Gecko | c4_trex_retrievals_gecko.jsonl | view in app
Gradient dot product | c4_trex_retrievals_grad_dot.jsonl | view in app
Gradient cosine | c4_trex_retrievals_grad_cosine.jsonl | view in app
TrackStar | c4_trex_retrievals_trackstar.jsonl | view in app

Appendix A.5: Additional tasks, retrievals from C4

Task | Download .jsonl file | Viewer link
T-REx incorrect predictions | c4_trex_incorrectpred_retrievals_trackstar.jsonl | view in app
COPA | c4_copa_retrievals_trackstar_nontaskspecific.jsonl | view in app
PIQA | c4_piqa_retrievals_trackstar_nontaskspecific.jsonl | view in app
Arithmetic word problems | c4_arithmeticwordproblem_retrievals_trackstar_nontaskspecific.jsonl | view in app
Simple arithmetic | c4_arithmetic_retrievals_trackstar_nontaskspecific.jsonl | view in app
Story generation | c4_storygeneration_retrievals_trackstar_nontaskspecific.jsonl | view in app

T-REx sentences

This is the corpus of 19.6M sentences described in Section 4.2 of the paper and used for the experiments in Section 5. The data is approximately 6GB, split across 20 shards:

https://storage.googleapis.com/tda-resources/2410.17413/public/trex_sentences.jsonl-000[XY]-of-00020, where [XY] stands for the two-digit shard index.

Or fetch all using gsutil: gsutil -m cp 'gs://tda-resources/2410.17413/public/trex_sentences.jsonl-*' /path/to/local/dir
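If gsutil is not available, the individual shard URLs can be generated from the pattern above. The sketch below does so under one loudly flagged assumption: that shard indices are zero-based (00000 through 00019), as is common for this naming scheme. The text above does not state the index range, so check one URL before relying on it; `shard_urls` and `first_index` are hypothetical names.

```python
from typing import List

BASE_URL = (
    "https://storage.googleapis.com/tda-resources/2410.17413/public/"
    "trex_sentences.jsonl"
)


def shard_urls(num_shards: int = 20, first_index: int = 0) -> List[str]:
    """Build one URL per shard, following the -000XY-of-00020 pattern.

    ASSUMPTION: indices start at `first_index` = 0; pass first_index=1
    if the shards turn out to be one-based.
    """
    return [
        f"{BASE_URL}-{i:05d}-of-{num_shards:05d}"
        for i in range(first_index, first_index + num_shards)
    ]


urls = shard_urls()
```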

Each record has the following fields:

  • sentence_id
  • text
  • abstract_uri
  • sent_idx_in_abst - index of this sentence in the original abstract
  • fact_triples - relevant fact triples which are found in this sentence
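As one example of working with these fields, the sketch below groups sentence records by abstract_uri and orders them by sent_idx_in_abst, which recovers the sentences of each abstract in their original order. It assumes that sent_idx_in_abst is an integer; `group_by_abstract` is a hypothetical helper and the sample records are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_abstract(records: Iterable[dict]) -> Dict[str, List[str]]:
    """Map each abstract_uri to its sentence texts in original order."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["abstract_uri"]].append(rec)
    return {
        uri: [r["text"] for r in sorted(recs, key=lambda r: r["sent_idx_in_abst"])]
        for uri, recs in grouped.items()
    }


# Made-up records for illustration only, deliberately out of order.
sentences = [
    {"sentence_id": "s2", "text": "Second sentence.",
     "abstract_uri": "uri-A", "sent_idx_in_abst": 1},
    {"sentence_id": "s1", "text": "First sentence.",
     "abstract_uri": "uri-A", "sent_idx_in_abst": 0},
    {"sentence_id": "s3", "text": "Another abstract.",
     "abstract_uri": "uri-B", "sent_idx_in_abst": 0},
]

abstracts = group_by_abstract(sentences)
```

Note that this holds every record in memory at once, so for the full 6GB corpus it would be more practical to process one shard at a time.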

Citing this work

If you use this data or find the data viewer useful, please cite our paper:

@article{chang2024scalable,
  title={Scalable Influence and Fact Tracing for Large Language Model Pretraining},
  author={Chang, Tyler A. and Rajagopal, Dheeraj and Bolukbasi, Tolga and Dixon, Lucas and Tenney, Ian},
  journal={arXiv preprint arXiv:2410.17413},
  year={2024}
}

License and disclaimer

Copyright 2024 DeepMind Technologies Limited

All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0

All other materials, except as set out below, are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode. This dataset contains passages from:

Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.

This is not an official Google product.
