v1.3.0 Changelog

Major Features

Fully Sharded Data Parallel

An implementation of ZeRO-2 sharding, as in DeepSpeed and FairScale. It improves training speed and reduces memory usage over vanilla DistributedDataParallel. Switch to the new mode with --ddp-backend zero2 to see free improvements in your training! (#3740)
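
As a rough illustration (not taken from the release itself: the task, model, and batch size below are placeholders, and the `ddp_backend` keyword is assumed to mirror the `--ddp-backend` CLI flag in the Python API), switching backends is a one-flag change; the speed and memory benefits apply when training across multiple GPUs, e.g. when launching through `parlai multiprocessing_train`:

```python
# Minimal sketch: enable ZeRO-2 sharding by changing only the DDP backend flag.
# The task, model, and batchsize values are placeholders, not from this release.
from parlai.scripts.train_model import TrainModel

TrainModel.main(
    task='convai2',                 # placeholder task
    model='transformer/generator',  # placeholder model
    batchsize=16,
    ddp_backend='zero2',            # previously 'ddp' (vanilla DistributedDataParallel)
)
```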

Swappable Transformer Components

We've added support for overriding internal components within Transformers. It is now easy to swap only an attention module, or customize your layers, without having to fully override all classes. (#3567, #3703, #3708, #3715, #3638)
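As a rough sketch of the pattern (the `with_components` hooks and keyword names here are assumptions based on the PRs above, so check them against the current docs), you might override only the decoder's self-attention while keeping every other component at its default:

```python
# Rough sketch (names assumed, see PRs above): swap only the decoder's
# self-attention module, leaving every other component at its default.
from parlai.agents.transformer.modules import (
    MultiHeadAttention,
    TransformerDecoder,
    TransformerDecoderLayer,
    TransformerGeneratorModel,
)
from parlai.agents.transformer.transformer import TransformerGeneratorAgent


class MyAttention(MultiHeadAttention):
    """A drop-in attention variant; customize forward() as needed."""


class SwappedAttentionAgent(TransformerGeneratorAgent):
    def build_model(self):
        # Compose a model class with just one subcomponent overridden.
        model_class = TransformerGeneratorModel.with_components(
            decoder=TransformerDecoder.with_components(
                layer=TransformerDecoderLayer.with_components(
                    self_attention=MyAttention,
                )
            )
        )
        return model_class(self.opt, self.dict)
```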

ChunkTeacher no longer requires num_examples/num_episodes to be correct

For as long as we've had ChunkTeacher, the values of num_examples/num_episodes had to be exactly correct, or your training would hang. Furthermore, that calculation had to be done outside of ParlAI. We've relaxed this restriction: these methods can now return arbitrary values, and you will still correctly iterate through all of your data. However, using the wrong value of num_examples can cause the "epoch" counter (used in parlai train) to be wrong relative to your dataset. (#3681, #3745)
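
As a minimal sketch of the relaxed contract (hook names follow ChunkTeacher's documented interface; the chunk count, paths, and file format are hypothetical), the example count can now be an estimate:

```python
# Minimal sketch under the relaxed contract. Hook names follow ChunkTeacher's
# documented interface; the chunk count, file paths, and format are hypothetical.
from parlai.core.message import Message
from parlai.core.teachers import ChunkTeacher


class MyChunkTeacher(ChunkTeacher):
    def get_num_samples(self, opt):
        # (num_examples, num_episodes): an estimate is now fine, but a wrong
        # value skews the epoch counter reported by `parlai train`.
        return 1_000_000, 1_000_000

    def get_fold_chunks(self, opt):
        # Hypothetical layout: 100 chunk files for train, 1 for valid/test.
        return list(range(100)) if 'train' in opt['datatype'] else [0]

    def load_from_chunk(self, chunk_idx):
        # Read one chunk file into a list of (text, label) samples.
        with open(f'/data/mychunks/chunk_{chunk_idx}.tsv') as f:
            return [tuple(line.rstrip('\n').split('\t')) for line in f]

    def create_message(self, sample, entry_idx=0):
        text, label = sample
        return Message({'text': text, 'labels': [label], 'episode_done': True})
```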

Eliminate dummy batches and init_cuda_buffer

You are no longer required to implement dummy batches in your Generator agents when using custom batch formats. Additionally, you will no longer see a dummy batch as the first batch when debugging. Instead, the first real batch your agent sees will be saved and reused as the dummy batch from then on. (#3732, #3744)

Paper Releases

Reducing Hallucination

Exploratory architectures that add retrieval mechanisms to dialogue models, reducing hallucination while maintaining conversational ability. (#3611, #3657, #3693, #3688, #3668)

Hash Layers & Ladder Transformers

More Parameters or More Compute? Answer: Both! Two new methods that explore this question: Hash Layers for more parameters, and Staircase Attention for more power per parameter. (#3697, #3699, #3700, #3746, #3747)

Minor Features

  • [TGA] Substantial speedups during generation on GPUs (#3730, #3729, #3669)
  • [Datasets] Add GLUE teachers, and support for HuggingFace datasets (#3570, #3624)
  • [Datasets] [Safety] Release the Non Adversarial Data (#3684)
  • [TA] Support temp history via special field in observation (#3617)
  • [TGA] Allow setting prefix tokens (#3760)
  • [TCA] Classifier on generator for TGA (#3716)
  • [ChunkTeacher] Remove exception for specifying non-streaming data (#3653)
  • [Transformer] Better initialization for segment embeddings (#3680)
  • [Message] Add a new json_safe_payload method for serialization (#3643, #3726, #3686)
  • [JIT] Support special tokens in torchscript module. (#3644)
  • [JIT] Fix a parsing error with parlai torchscript in Python 3.8 (#3641)
  • [ACUTE] Support randomize_conversations (#3636, #3642)

Bugfixes

  • [train] Fix bugs with loading validation impatience. (#3713)
  • [train] Fix LR scheduler cooldown (#3719)
  • [train] Dynamic batching no longer chokes on very small datasets (#3721)
  • [Logging] Fix a bug with world logging and multitasking (#3718)
  • [Mutators] Ensure mutations do not persist across epochs (#3649)
  • [BART] Do not add start/end tokens multiple times (#3714)
  • [TCA] weighted_f1 no longer assumes binary classification (#3728)
  • [Safety] Fix a Static Task bug and Safety README (#3612)
  • [logging] Fix an issue where --loglevel debug was ignored (#3658)
  • [Tensorboard] Fix an exception in some versions of Tensorboard (#3637)
  • [vacuum] Add support for PathManager in vacuum (#3635)
  • [Crowdsourcing] Slightly improve the analysis script to make it more robust (#3683, #3629)
  • Various locations where the change to is_padding caused issues (#3704, #3634, #3674)
  • Various typos/lint (#3621, #3622, #3646)

Developer changes