Evaluation plays a critical role in deep learning as a fundamental building block of any prediction-based system. However, the vast number of Natural Language Processing (NLP) tasks and the development of various metrics have led to challenges in evaluating different systems with different metrics. To address these challenges, we introduce jury, a toolkit that provides a unified evaluation framework with standardized structures for performing evaluation across different tasks and metrics. The objective of jury is to standardize and improve metric evaluation for all systems and to help the community overcome the challenges in evaluation. Since its open-source release, jury has reached a wide audience and is publicly available.
NLP tasks possess inherent complexity, requiring a comprehensive evaluation of model performance beyond a single metric comparison. Established benchmarks such as WMT (
Although employing multiple metrics for evaluation is common, it remains challenging in practice because widely used metric libraries lack support for combined and/or concurrent metric computations. Consequently, researchers face the burden of evaluating their models one metric at a time, a process exacerbated by the scale and complexity of recent models and by limited hardware capabilities. This bottleneck impedes the efficient assessment of NLP models and highlights the need for better tooling for convenient metric computation. Moreover, for concurrency to be maximally beneficial, the system may require appropriate hardware, which raises the question of hardware availability.
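To illustrate this per-metric burden, the sketch below evaluates the same outputs with two metrics using the Hugging Face evaluate library; each metric must be loaded and computed in a separate, sequential call, and any combination or parallelization is left entirely to the user (the example data is purely illustrative):

    import evaluate

    predictions = ["the cat sat on the mat", "a quick brown fox"]
    references = [["the cat is on the mat"], ["the quick brown fox"]]

    # Each metric is loaded and computed independently; there is no unified
    # interface for combined or concurrent computation of several metrics.
    bleu = evaluate.load("bleu")
    rouge = evaluate.load("rouge")

    bleu_score = bleu.compute(predictions=predictions, references=references)
    rouge_score = rouge.compute(predictions=predictions, references=references)

    print({"bleu": bleu_score["bleu"], "rougeL": rouge_score["rougeL"]})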
The extent of achievable concurrency in NLP research has traditionally depended on the hardware resources available to researchers. However, significant advancements in recent years have notably reduced the cost of high-end hardware, including multi-core CPUs and GPUs. This progress has transformed high-performance computing resources, once prohibitively expensive and predominantly confined to specific institutions or research labs, into more accessible and affordable assets. For instance, in BERT (
To ease the use of automatic metrics in NLG research, several hands-on libraries have been developed, such as
We designed a system that enables the creation of user-defined metrics within a unified structure and the use of multiple metrics in the evaluation process. Our library also utilizes the
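As an illustrative sketch of this unified, multi-metric evaluation (not a definitive reference for the library's API; the metric names, the run_concurrent flag, and the example data below are assumptions based on the project's public README and may differ across versions), evaluating a set of predictions with several metrics at once could look like this:

    from jury import Jury

    # Hypothetical example data; nested lists allow multiple references
    # per prediction.
    predictions = ["the cat sat on the mat", "a quick brown fox"]
    references = [["the cat is on the mat"], ["the quick brown fox"]]

    # A single scorer object combines several metrics; run_concurrent is
    # assumed here to enable concurrent metric computation.
    scorer = Jury(metrics=["bleu", "meteor", "rouge"], run_concurrent=True)
    scores = scorer(predictions=predictions, references=references)

    print(scores)  # one dictionary containing the results of all metrics

In contrast to the per-metric workflow shown earlier, all metrics are declared once and computed through a single call, which is the kind of standardized structure the system is designed to provide.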
We would also like to express our appreciation to Cemil Cengiz for fruitful discussions.