v1.3.0 Changelog
Major Features
Fully Sharded Data Parallel
Implementation of DeepSpeed/FairScale's Zero2 sharding. Improves training speed and reduces memory usage over vanilla DistributedDataParallel. Switch to the new mode with `--ddp-backend zero2` to see free improvements in your training! (#3740)
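For illustration, here is a minimal sketch of switching a training run to the new backend. The task, model, and paths are placeholders, and the keyword argument is just the keyword form of the `--ddp-backend zero2` flag; multi-GPU runs are typically launched with `parlai multiprocessing_train` using the same flag.

```python
# Illustrative sketch only: placeholder task/model/paths.
from parlai.scripts.train_model import TrainModel

TrainModel.main(
    task='convai2',                 # placeholder task
    model='transformer/generator',  # placeholder model
    model_file='/tmp/zero2_demo',   # placeholder checkpoint path
    ddp_backend='zero2',            # keyword form of --ddp-backend zero2
    batchsize=16,
)
```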
Swappable Transformer Components
We've added support for overriding internal components within Transformers. It is now easy to swap out just an attention module, or to customize your layers, without having to override every class. (#3567, #3703, #3708, #3715, #3638)
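As a rough sketch of the pattern, adapted from the swappable-components tutorial, here is how one might replace only the encoder's feedforward sublayer. Treat the exact `with_components` wiring and class names below as illustrative rather than canonical; check the tutorial for the precise interface.

```python
# Rough sketch: the with_components() wiring below follows our reading of the
# swappable-components tutorial and should be checked against it.
from parlai.agents.transformer.modules import (
    TransformerEncoder,
    TransformerEncoderLayer,
    TransformerFFN,
    TransformerGeneratorModel,
)
from parlai.agents.transformer.transformer import TransformerGeneratorAgent


class MyCustomFFN(TransformerFFN):
    """A drop-in replacement for the feedforward sublayer."""

    def forward(self, x, **kwargs):
        # ... custom behavior goes here ...
        return super().forward(x, **kwargs)


class MyAgent(TransformerGeneratorAgent):
    def build_model(self, states=None):
        # Swap only the encoder layers' FFN; everything else stays stock.
        model_class = TransformerGeneratorModel.with_components(
            encoder=TransformerEncoder.with_components(
                layer=TransformerEncoderLayer.with_components(
                    feedforward=MyCustomFFN
                )
            )
        )
        return model_class(opt=self.opt, dictionary=self.dict)
```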
ChunkTeacher no longer requires num_examples/num_episodes to be correct
For as long as we've had ChunkTeacher, the values returned by `num_examples`/`num_episodes` had to be exactly correct, or your training would hang. Furthermore, that calculation had to be done outside of ParlAI. We've relaxed this restriction: these methods can now return arbitrary values, and you will still correctly iterate through all of your data. However, using the wrong value of `num_examples` can cause the "epoch" counter (used in `parlai train`) to be wrong relative to your dataset. (#3681, #3745)
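A rough sketch of what this permits, using ChunkTeacher's abstract hooks; the data layout, chunk counts, and sample counts below are invented for illustration.

```python
# Illustrative sketch: counts and chunk layout are made up.
from typing import List, Tuple

from parlai.core.message import Message
from parlai.core.teachers import ChunkTeacher


class MyChunkTeacher(ChunkTeacher):
    def get_num_samples(self, opt) -> Tuple[int, int]:
        # Previously this had to be exactly (num_examples, num_episodes);
        # an approximation is now acceptable, at the cost of a skewed
        # "epoch" counter in parlai train.
        return 1_000_000, 1_000_000

    def get_fold_chunks(self, opt) -> List[int]:
        # Hypothetical: 100 chunk files for train, 1 for valid/test.
        return list(range(100)) if 'train' in opt['datatype'] else [0]

    def load_from_chunk(self, chunk_idx: int):
        # Hypothetical on-disk layout: each chunk yields (text, label) pairs.
        return [
            (f'question {chunk_idx}-{i}', f'answer {chunk_idx}-{i}')
            for i in range(10)
        ]

    def create_message(self, sample, entry_idx=0) -> Message:
        text, label = sample
        return Message({'text': text, 'labels': [label], 'episode_done': True})
```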
Eliminate dummy batches and init_cuda_buffer
You are no longer required to implement dummy batches in your Generator agents when using custom batch formats. Additionally, you will no longer see a dummy batch as the first batch when debugging. Instead, the first batch your agent sees will be reserved as the future dummy batch. (#3732, #3744)
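A minimal sketch of a generator agent with a custom batch field (the field name is made up): no `_dummy_batch` override is needed any more, since the first real batch is cached and reused for memory reservation.

```python
# Illustrative sketch: my_extra_feature is a hypothetical custom field.
from parlai.core.torch_generator_agent import TorchGeneratorAgent


class MyGeneratorAgent(TorchGeneratorAgent):
    def batchify(self, obs_batch, sort=False):
        batch = super().batchify(obs_batch, sort=sort)
        # Attach a hypothetical extra field used by a custom model.
        batch.my_extra_feature = [o.get('my_extra_feature') for o in obs_batch]
        return batch

    # Previously, a custom batch format like this also required overriding
    # _dummy_batch(); that is no longer necessary.
```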
Paper Releases
Reducing Hallucination
Exploratory architectures that add retrieval mechanisms to dialogue models, reducing hallucination while maintaining conversational ability. (#3611, #3657, #3693, #3688, #3668)
Hash Layers & Ladder Transformers
More Parameters or More Compute? Answer: Both! Two new methods that explore this question: Hash Layers for more parameters, and Staircase Attention for more power per parameter. (#3697, #3699, #3700, #3746, #3747)
Minor Features
- [TGA] Substantial speedups during generation on GPUs (#3730, #3729, #3669)
- [Datasets] Add GLUE teachers, and support for HuggingFace datasets (#3570, #3624)
- [Datasets] [Safety] Release the Non Adversarial Data (#3684)
- [TA] Support temp history via special field in observation (#3617)
- [TGA] Allow setting prefix tokens (#3760)
- [TCA] Classifier on generator for TGA (#3716)
- [ChunkTeacher] Remove exception for specifying non-streaming data (#3653)
- [Transformer] Better initialization for segment embeddings (#3680)
- [Message] Add a new `json_safe_payload` method for serialization (#3643, #3726, #3686)
- [JIT] Support special tokens in torchscript module (#3644)
- [JIT] Fix a parsing error with `parlai torchscript` in Python 3.8 (#3641)
- [ACUTE] Support `randomize_conversations` (#3636, #3642)
Bugfixes
- [train] Fix bugs with loading validation impatience. (#3713)
- [train] Fix LR scheduler cooldown (#3719)
- [train] Dynamic batching no longer chokes on very small datasets (#3721)
- [Logging] Fix a bug with world logging and multitasking (#3718)
- [Mutators] Ensure mutations do not persist across epochs (#3649)
- [BART] Do not add start/end tokens multiple times (#3714)
- [TCA] weighted_f1 no longer assumes binary classification (#3728)
- [Safety] Fix a Static Task bug and Safety README (#3612)
- [logging] Fix an issue where `--loglevel debug` was ignored (#3658)
- [Tensorboard] Fix an exception in some versions of Tensorboard (#3637)
- [vacuum] Add support for PathManager in vacuum (#3635)
- [Crowdsourcing] Slightly improve the analysis script to make it more robust (#3683, #3629)
- Fix various locations where the change to `is_padding` caused issues (#3704, #3634, #3674)
- Various typo/lint fixes (#3621, #3622, #3646)
Developer changes
- Helper functions for building deterministic data splits (#3676)
- Teacher URL updates (#3749, #3627, #3678)
- CI bugfixes & version bumps (#3754, #3724, #3672, #3652, #3710, #3628, #3452, #3720)
- Documentation updates (#3748, #3690, #3742, #3671)
- Mutators and Scripts support for parlai_internal (#3623, #3625)
- [Crowdsourcing] Small refactor in Model-Chat