
QA-SRL #716

Merged 15 commits into nyu-mll:master on Oct 1, 2019

Conversation

zphang
Collaborator

@zphang zphang commented Jun 21, 2019

Re: #571

Starting with a basic implementation of QA-SRL, seeking feedback.

  • There are quite a lot of repeated/branching input sets in QA-SRL. Each sentence has multiple "verbs", each verb corresponds to multiple questions, each question has multiple answers from different annotators, and each answer can contain more than one span. Currently I'm doing a naive implementation that enumerates all of these. For a sense of scale, there are about 44k sentences in the training set, but 600k+ total training examples once expanded.
  • The QA-SRL / Large-Scale QA-SRL papers did not say much about scoring, and where they did, I'm not sure their approach works best for pretraining. I'm currently implementing the following:
  • Training loss: I am following the BERT formulation for SQuAD of straightforwardly predicting the span start and end via a softmax over the tokens in the input, with a cross-entropy loss. There is no correction to ensure start < end. (A rough sketch of this setup follows the list.)
  • Evaluation: I am using the SQuAD-style macro-F1, which treats the predicted/gold tokens as individual observations, computes the F1 for each example, and then averages across examples.
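A rough sketch of the span-prediction setup above (illustrative only; the module and argument names are not the actual jiant code):

import torch.nn as nn

class SpanHead(nn.Module):
    # SQuAD-style span prediction: one start logit and one end logit per token,
    # trained with cross-entropy over token positions.
    def __init__(self, hidden_dim):
        super().__init__()
        self.span_logits = nn.Linear(hidden_dim, 2)

    def forward(self, token_reprs, start_positions=None, end_positions=None):
        # token_reprs: [batch, seq_len, hidden_dim]
        start_logits, end_logits = self.span_logits(token_reprs).unbind(dim=-1)
        if start_positions is None:
            return start_logits, end_logits
        # Cross-entropy = softmax over token positions; note there is no
        # constraint that the predicted start precedes the predicted end.
        loss_fct = nn.CrossEntropyLoss()
        return (loss_fct(start_logits, start_positions)
                + loss_fct(end_logits, end_positions)) / 2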

@sleepinyourhat
Contributor

This all sounds reasonable. You might get some odd behavior with SQuAD-style prediction, since it's not really well defined under uncertainty: when the model predicts the end of a span, it doesn't know which span it predicted the start for. But I think this is a reasonable setup to start with.

@zphang
Collaborator Author

zphang commented Jun 29, 2019

Updated with pred writing logic.

One (maybe minor) issue is that, because the data instances are constructed by expanding the input examples, the instances will be heavily clustered. This means that for the periodic validation calculations, where only the first val_data_limit instances are used, there is much less diversity in examples than would be ideal.

Otherwise, this should be ready to merge.

@zphang zphang closed this Jun 29, 2019
@zphang zphang reopened this Jun 29, 2019
@zphang zphang changed the title from [WIP] QA-SRL to QA-SRL Jun 29, 2019
@sleepinyourhat
Contributor

Hrm. Do you see an easy way to shuffle the examples when loading them?

I know the loading code is a bit hairy, so if not, mind just adding a warning in the loader that says not to use val_data_limit?

@zphang
Collaborator Author

zphang commented Jun 29, 2019

Using val_data_limit is probably still recommended, since the validation set is quite large otherwise.

I could simply add a fixed-seed random shuffle while loading the validation set. That way the order will always be consistent.

This issue may affect other tasks that create instances by expanding inputs (e.g. MultiRCTask).

@sleepinyourhat
Contributor

Fixed shuffle sounds good to me!

@sleepinyourhat
Contributor

(Reviewing remaining changes now.)

@zphang
Collaborator Author

zphang commented Jun 29, 2019

  • Or actually, any task whose validation data has an inherent order, if the examples aren't shuffled somewhere else along the way.

jiant/src/trainer.py

Lines 832 to 838 in 3cab32d

if self._val_data_limit >= 0:
    max_data_points = min(task.n_val_examples, self._val_data_limit)
else:
    max_data_points = task.n_val_examples
val_generator = BasicIterator(batch_size, instances_per_epoch=max_data_points)(
    task.val_data, num_epochs=1, shuffle=False
)

Contributor

@sleepinyourhat sleepinyourhat left a comment

Neat, thanks for doing this! Mostly looks sane, with a couple of issues, but I'm not sure I can evaluate the actual model-building code. @W4ngatang or @pruksmhc, mind quickly skimming that part?

src/allennlp_mods/span_metrics.py
src/evaluate.py
src/modules/modules.py
src/tasks/qa.py
src/tasks/qa.py
@pep8speaks

pep8speaks commented Jun 29, 2019

Hello @zphang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 177:101: E501 line too long (103 > 100 characters)

You can repair most issues by installing black and running: black -l 100 ./*. If you contribute often, have a look at the 'Contributing' section of the README for instructions on doing this automatically.

Comment last updated at 2019-10-01 05:44:32 UTC

@zphang
Collaborator Author

zphang commented Jun 30, 2019

Updated after feedback.

Contributor

@sleepinyourhat sleepinyourhat left a comment

Looks good to me, but it'd be good to have someone review the model code, so not merging yet.

@sleepinyourhat
Contributor

@zphang Are you still planning to merge this? It'd definitely be helpful, but there's more to do to get it up to date. Otherwise, please close it.

@zphang
Collaborator Author

zphang commented Aug 12, 2019

I'll take a look to get it up to date.

@sleepinyourhat
Contributor

@pruksmhc - Do you have time to take a look at the modeling side?

Contributor

@sleepinyourhat sleepinyourhat left a comment

Looks sane. Don't merge until:

  • You've verified that you get performance that's not too far from external numbers.
  • You get an approval for the modeling code.

Thanks for taking this on!

@sleepinyourhat
Contributor

Bump—there are lots of code changes going on around now, so the sooner you merge, the easier it'll be.

@zphang
Collaborator Author

zphang commented Aug 27, 2019

Still keeping an eye on this: as far as I could see, there weren't any BERT-based scores besides https://arxiv.org/pdf/1901.11373.pdf, and that experimental setup isn't quite the same. More importantly, I found that the performance of BERT-Base and BERT-Large was suspiciously close, so I am still investigating.

@sleepinyourhat
Contributor

Could you update or close this this week?

@zphang
Collaborator Author

zphang commented Sep 16, 2019

Will do.



@Metric.register("spanf1")
class SpanF1Measure(Metric):
Collaborator

Collaborator Author

It looks like AllenNLP has two options: the one you linked, which is based on CoNLL SRL and uses BIO-style (+ variants) labels, and https://github.com/allenai/allennlp/blob/master/allennlp/training/metrics/squad_em_and_f1.py for SQuAD, which is based on raw strings. Neither is quite what we are looking for (token-based EM/F1 for spans). Arguably we could reformulate the labels to match the BIO variant (I'm not too familiar with this), but let me know what you think.
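For reference, a minimal sketch of the token-based span F1 we have in mind (illustrative only; spans are inclusive (start, end) token-index pairs):

def token_span_f1(pred_span, gold_span):
    # Token-level F1 between a predicted and a gold span, treating each token
    # index covered by a span as one observation.
    pred_tokens = set(range(pred_span[0], pred_span[1] + 1))
    gold_tokens = set(range(gold_span[0], gold_span[1] + 1))
    overlap = len(pred_tokens & gold_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# e.g. pred (3, 5) vs. gold (4, 6): precision = recall = 2/3, so F1 = 2/3.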

        yield from self.sentences

    def process_split(
        self, split, indexers, model_preprocessing_interface
Collaborator

what is model_preprocessing_interface?

Collaborator Author

It seems to be a new addition from the recent integration of pytorch_transformers, for handling the different model types.

def _shuffle_data(cls, data, seed=1234):
    # Shuffle validation data to ensure diversity in periodic validation with val_data_limit
    indices = list(range(len(data[0])))
    random.Random(seed).shuffle(indices)
Collaborator

What's with the seed here? Is it not enough to seed random in main?

Collaborator Author

See: #716 (comment)

The goal is to have a fixed-order shuffle of the validation data that is consistent across runs. Hence, we want it to be independent of the usual random seed.
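A minimal sketch of what I have in mind (the data layout here is an assumption; the real loader reorders whatever parallel columns make up the split):

import random

def shuffle_in_fixed_order(columns, seed=1234):
    # columns: a list of parallel lists (sentences, questions, answers, ...).
    # A dedicated Random instance keeps the shuffle independent of the global
    # seed, so the validation order is identical across runs.
    indices = list(range(len(columns[0])))
    random.Random(seed).shuffle(indices)
    return [[col[i] for i in indices] for col in columns]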

@@ -473,6 +487,11 @@ def _write_rte_preds(
        preds_fh.write("{0}\n".format(json.dumps(out_d)))


def _write_simple_csv_preds(task, preds_df: pd.DataFrame, pred_dir: str, split_name: str):
Collaborator

tsv ?

Collaborator Author

Fixed!

@@ -234,3 +234,30 @@ def forward(self, s1, s2, mask1, mask2, idx1=[], idx2=[]):
        pair_emb = torch.cat([emb1, emb2, torch.abs(emb1 - emb2), emb1 * emb2], 1)
        logits = self.classifier(pair_emb)
        return logits


class TokenProjectionEncoder(nn.Module):
Collaborator

Just checking my understanding: TokenProjectionEncoder applies the same linear projection to each vector in a sequence of vectors, and TokenMultiProjectionEncoder applies multiple linear projections to each vector (like multi-head attention)? See the sketch below for my reading.

Are you normally setting d_proj << d_inp? They're the same by default.
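For concreteness, my reading is roughly the following (a sketch of my interpretation, not the code in this PR):

import torch.nn as nn

class TokenProjection(nn.Module):
    # One linear projection applied position-wise to every vector in the sequence.
    def __init__(self, d_inp, d_proj):
        super().__init__()
        self.proj = nn.Linear(d_inp, d_proj)

    def forward(self, seq):  # seq: [batch, seq_len, d_inp]
        return self.proj(seq)

class TokenMultiProjection(nn.Module):
    # Several named projections, each applied position-wise to the same sequence.
    def __init__(self, d_inp, d_proj, names=("start", "end")):
        super().__init__()
        self.projs = nn.ModuleDict({name: nn.Linear(d_inp, d_proj) for name in names})

    def forward(self, seq):
        return {name: proj(seq) for name, proj in self.projs.items()}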

Collaborator Author

After taking a look, I've refactored this to make it based on a dict of classifiers.

Collaborator Author

Actually, I decided to just remove the unnecessary projection layer. Pushed an update.

Collaborator

@W4ngatang W4ngatang left a comment

lgtm, but do look into whether we can use allennlp's span F1 instead of adding our own.

@zphang
Collaborator Author

zphang commented Sep 20, 2019

In that case, should we merge and I'll add/assign an issue to myself to revisit the evaluation?

@zphang
Collaborator Author

zphang commented Sep 20, 2019

Wait, sorry, just saw the comments on the commit. Let me address them.

@zphang
Collaborator Author

zphang commented Oct 1, 2019

I think this is good to merge.

@sleepinyourhat
Contributor

Cool, do add the issue, though, if this wasn't fixed.

@sleepinyourhat sleepinyourhat merged commit b19ca78 into nyu-mll:master Oct 1, 2019
Kelina added a commit that referenced this pull request Dec 18, 2019
* Readme update for bert npi paper (#915)

* Update README.md

* minor fix

* Typo fix

* typo fix

* Fixing index problem & minor pytorch_transformers_interface cleanup (#916)

* update boundry func with offsets

* update tasks that use indexes

* remove outdated temporary fix

* Prepare for 1.2.1 release.

* QA-SRL (#716)

* Initial QASRL

* Updated pred writing for QASRL

* Add validation shuffle to QASRL

* Remove tqdm, modify class check in preds

* qasrl rebase cleanup

* Update QA-SRL to new repo changes

* Removing src

* QASRL Cleanup

* updating to new model format

* csv to tsv

* QASRL update

*  Implementing Data Parallel (#873)

* implemented data parallel

* black style

* Resolve last of merge marks

* deleting irrelevant logs

* adding new way to get attribute

* updating to master

* torch.Tensor -> torch.tensor for n_exs

* black style

* black style

* Merge master

* adapting other tasks to multiple GPU"

* adding helper function for model attributes

* adding get_model_attribute to main.py

* deleting unecessary n_inbput for span_module

* black style

* revert comment change

* fixing batch size keys

* opt_params -> optimizer_params

* Remove extraneous cahnges

* changed n_exs to one-liner

* adapting args.cuda to multi-GPU setting

* adding use_cuda variable

* Fixing parsing for case of args.cuda=subset

* fixing tests

* fixing nits, cleaning up parse_cuda function

* additional nit

* deleted extra space

* Revert nit

* refactoring into get_batch_size

* removing use_cuda

* adding options.py

* removing use_cuda in tests, deleting extra changes

* change cuda default

* change parse_cuda_list_args import

* jiant.options -> jiant.utils.options

* change demo.conf cuda setting

* fix bug -> make parse_cuda return int if only one gpu

* fix bug

* fixed tests

* revert test_retokenize change

* cleaning up code

* adding addiitonal jiant.options

* Separating cuda_device = int case with multiple cuda_device case

* deleting remains of uses_cuda

* remove time logging

* remove use_cuda from evaluate

* val_interval -> validation_interval

* adding cuda comment to tutorial

* fixed typo

* replace correct_sent_indexing with non inplace version (#921)

* replace correct_sent_indexing with non inplace version

* Update modules.py

* Update modules.py

* Abductive NLI (aNLI) (#922)

* anli

* anli fix

* Adding aNLI link, additional test/dev warning

* SocialIQA (#924)

* black style

* adding SocialQA

* black style

* black style

* fixed socialQA task

* black style

* Update citation

* Nit

* senteval

* socialIQA naming

* reverse unnecessary add

* Fixing bug with restoring checkpoint with two gpus + cleaning CUDA parsing related code (#928)

* black style

* remove

* cleaning up code around cuda-parsing

* adding defaulting to -1 if there is no cuda devices detected

* fixing nits, throw error instead of log warning for cuda not found

* Updating CoLA inference script (#931)

* Adding Senteval Tasks (#926)

* black style

* adding initial senteval, senteval preprocessing script

* black

* adding senteval to registry

* fixing bigram-shift

* adding label_namespace arg, fixing the ksenteval tasks

* revert extra changes

* black style

* change name -> senteval-probing

* fixing senteval-probing tasks

* renamed senteval -> sentevalprobing

* delete extra imports

* black style

* renaming files and cleaning up preprocessing code

* nit

* black

* deleting pdb

* Senteval -> SE shorthand

* fixing code style

* Speed up retokenization (#935)

* black style

* pre-loading tokenizer before retokenization function

* Scitail (#943)

* scitail

* Scitail

* Scitail

* update Scitail, removed config

* update Scitail, removed config

* Add corrected data stastistics (#941)

Thanks to #936, we've discovered errors in our data statistics reporting in the edge probing paper. This table contains the corrected values. As there is more space here, the full (unrounded) values are reported instead. This was generated by a script that read the stats.tsv file and the diff vs. the paper should match my comment on the issue yesterday.

* CommonsenseQA+hellaswag (#942)

* add commonsenseqa task

* add hellaswag task

* dabug

* from #928

* add special tokens to CommensenseQA input

* format

* revert irrelevant change

* Typo fix

* delete

* rename stuff

* Update qa.py

* black

* fix name (#945)

* CCG update (#948)

* generalize ccg to other transformer models

* debug

* I don't know who broke this at what time, but let's just fix it here now

* Fixing senteval-probing preprocessing (#951)

* Copying configs from superglue

* adding senteval probing config commands

* adding meta-script for transfer and probing exps

* Adding meta bash script fixed

* give_permissions script

* small fix transfer_analysis.sh (#946)

model_*.th might indicate several models; fixed to model_*.best.th

* lr_patience fix

* target_task training -> pretrain training

* adding edgeprobing configs and command

* adding edge probing conf

* fix load_target_train bug

* add hyperparameter sweeping

* val_interval change

* adding sweep function

* Task specific val_intervals

* add reload_vocab to hyperparameter sweep

* adding batch_size specification

* fixing senteval-word-content

* fixing senteval preprocess script

* revert extra delete

* remove extra files

* black format

* black formatting trainer.py

* remove load_data()

* removing extra changes

* Adding tokenizer alignment function (#953)

* Copying configs from superglue

* adding senteval probing config commands

* adding meta-script for transfer and probing exps

* Adding meta bash script fixed

* give_permissions script

* small fix transfer_analysis.sh (#946)

model_*.th might indicate several models; fixed to model_*.best.th

* lr_patience fix

* target_task training -> pretrain training

* adding edgeprobing configs and command

* adding edge probing conf

* fix load_target_train bug

* add hyperparameter sweeping

* val_interval change

* adding sweep function

* Task specific val_intervals

* add reload_vocab to hyperparameter sweep

* adding batch_size specification

* fixing senteval-word-content

* fixing senteval preprocess script

* revert extra delete

* remove extra files

* black format

* black formatting trainer.py

* remove load_data()

* removing extra changes

* adding alignment mapping function

* fix comment nits

* comment nit

* adding example of token_alignment

* Function words probing (#949)

* add nli prob task template

* Create acceptablity_probing.py

* specify nli probing tasks

* port acceptablity probing tasks

* add directory name

* debug

* debug

* format

* black

* revert unintended change

* CosmosQA (#952)

* misc run scripts

* cosmosqa

* cosmosqa

* cosmosqa

* cosmosqa run

* cleaned up repo

* cleaned up repo

* reformatted

* qqp fix (#956)

* QAMR + QA-SRL Update (#932)

* qamr

* tokenization

* temp qamr

* qamr

* QASRL

* Undo slicing

* quick hack to bypass bad qasrl examples

* f1 em fix

* tokenization fixes

* average

* New tokenization aligner

* update example counts

* Cleanup

* Typography

* Set _unk_id in Roberta module (#959)

Currently the `_unk_id` for Roberta is not set correctly, which triggers the assertion error on line 118.

* Fixing load_target_train_checkpoint with mixing setting (#960)

* adding loading for mix

* black style

* update pytorch and numpy version requirements (#965)

* CCG update (#955)

* generalize ccg to other transformer models

* debug

* I don't know who broke this at what time, but let's just fix it here now

* ccg lazy iterator

* debug

* clean up

* debug

* debug ccg, minor cleanup

* add adversarial_nli tasks (#966)

* Update README.md

* Citation fix
phu-pmh pushed a commit that referenced this pull request Apr 17, 2020
* Initial QASRL

* Updated pred writing for QASRL

* Add validation shuffle to QASRL

* Remove tqdm, modify class check in preds

* qasrl rebase cleanup

* Update QA-SRL to new repo changes

* Removing src

* QASRL Cleanup

* updating to new model format

* csv to tsv

* QASRL update
@jeswan jeswan added the jiant-v1-legacy Relevant to versions <= v1.3.2 label Sep 17, 2020