Edge probing data stats don't match #936

Closed
shtoshni opened this issue Oct 17, 2019 · 6 comments
Labels
jiant-v1-legacy Relevant to versions <= v1.3.2

Comments

@shtoshni

Hi guys,

I'm working on the edge probing tasks and used the edge_data_stats script to verify the data statistics. While the total numbers of tokens and instances match the stats presented in Appendix B of the paper, the total number of targets is significantly off. For ontonotes-coref's train split, I get a target count of 207830 (~208K), which is almost 40K lower than the 248K reported in the paper. The target counts for the development and test splits are off as well. I also checked the stats for const (constituency), and those seem to be quite off too. Any idea why this might be happening? Also, could you double-check these numbers?
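
For reference, here's roughly the kind of count I'm computing (a minimal sketch, not the edge_data_stats script itself; it assumes the jiant edge-probing JSON-lines format, where each record has a `text` field and a `targets` list, and the path is just a placeholder):

```python
import json

def count_stats(path):
    """Count examples, whitespace tokens, and targets in an
    edge-probing JSON-lines file (one JSON record per line)."""
    n_examples = n_tokens = n_targets = 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            n_examples += 1
            n_tokens += len(record["text"].split())
            n_targets += len(record["targets"])
    return n_examples, n_tokens, n_targets

# Placeholder path; point this at the split being checked.
print(count_stats("edges/ontonotes-coref/train.json"))
```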

Thanks,
Shubham

@sleepinyourhat
Contributor

@iftenney, any guesses?

@iftenney
Collaborator

@pitrack can you confirm how you calculated the numbers in the table?

@pitrack
Contributor

pitrack commented Oct 22, 2019

Thanks for bringing this up. There are a couple errors in the data table. Here is what I found:

  1. POS is correct
  2. Constituents are correct
  3. Dependencies are correct
  4. Entities are correct.
  5. SRL train/dev are correct. Test* should be 26715 examples, 711746 tokens, and 61716 targets.
    a. ... of which 41735 are core role targets
    b. ... of which 18290 are non-core role targets
  6. In Coreference, the targets should be 207830/26333/27800, which looks like what you're getting.
  7. In SPR1, there are 518 dev examples, not 513.
  8. In SPR2, the token counts should be 47K/5.6K/4.9K (specifically 5592/4929)
  9. For Winograd, there are 958/223/518 examples, 14K/4.3K/8.0K tokens (specifically 14384/4331/7952), and 1787/379/949 targets
  10. Relations is correct. (At first glance something looked off, since 85%/15% of 8000 training examples would be 6.8K/1.2K rather than the reported 6.9K/1.1K, but the exact split is 6851/1149, i.e. 85.64%/14.36%, which rounds to the reported values; see the quick check below the footnote. The sampling rate was set to 0.85, but we ended up with slightly more.)

* Originally reported SRL test stats were on the CoNLL-2012 test set, but all OntoNotes-derived data for this paper are split according to the official train/dev/test sets.
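
For what it's worth, the rounding in item 10 checks out directly (a quick sanity check, not the script that generated the table):

```python
# Sanity check of the Relations split rounding from item 10.
train, dev = 6851, 1149
total = train + dev                                                 # 8000
print(round(100 * train / total, 2), round(100 * dev / total, 2))   # 85.64 14.36
print(round(train / 1000, 1), round(dev / 1000, 1))                 # 6.9 1.1
```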

I apologize for the errors in this table; I should have double-checked them against the final version of the data we used in the final draft.

What numbers are you getting for constituents/part-of-speech? For const/*.json, you should be getting 3.9M/545K/403K targets, but this includes both Part-of-Speech (terminals) and Constituents (non-terminals). When split using split_constituent_data.py, you should get the numbers in the table.
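
If it helps, one way to sanity-check the split is to compare the combined target count against the sum of the two split outputs (a minimal sketch; the split output paths are placeholders for wherever split_constituent_data.py writes its files, and the JSON-lines record format is assumed as above):

```python
import json

def n_targets(path):
    # Sum the length of the "targets" list over all records in a JSON-lines file.
    with open(path) as f:
        return sum(len(json.loads(line)["targets"]) for line in f)

combined = n_targets("edges/const/train.json")                  # terminals + non-terminals
pos = n_targets("edges/const/pos/train.json")                   # placeholder path for the POS split
nonterminal = n_targets("edges/const/nonterminal/train.json")   # placeholder path for the constituent split
assert combined == pos + nonterminal, (combined, pos, nonterminal)
print(combined, pos, nonterminal)
```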

@shtoshni
Author

shtoshni commented Oct 22, 2019

Thanks for the quick check!
Right now we're only working on the coreference and constituent tasks, but it's good to know the corrected stats for the others as well. The numbers for coreference match exactly :)
For constituents, the numbers also match after splitting the data with the split script.
BTW, regarding the model performance reported on the constituent task: are the reported numbers just on the non-terminals, or do they include terminals as well?

@pitrack
Contributor

pitrack commented Oct 22, 2019

Since we split the "consts" data into non-terminals ("Constituents") and terminals ("Part of Speech"), the number reported for "Constituents" is just the non-terminals.

@sleepinyourhat
Contributor

Thanks @pitrack – would you mind putting together a readme somewhere with these updated numbers?

pitrack added a commit to pitrack/jiant that referenced this issue Oct 22, 2019
Thanks to nyu-mll#936, we've discovered errors in our data statistics reporting in the edge probing paper. This table contains the corrected values. As there is more space here, the full (unrounded) values are reported instead. This was generated by a script that read the stats.tsv file and the diff vs. the paper should match my comment on the issue yesterday.
pruksmhc pushed a commit that referenced this issue Oct 26, 2019
Kelina added a commit that referenced this issue Dec 18, 2019
* Add corrected data stastistics (#941)
phu-pmh pushed a commit that referenced this issue Apr 17, 2020
jeswan added the jiant-v1-legacy label Sep 17, 2020