diff --git a/joss.06452/10.21105.joss.06452.crossref.xml b/joss.06452/10.21105.joss.06452.crossref.xml
new file mode 100644
index 0000000000..e0c3b40e97
--- /dev/null
+++ b/joss.06452/10.21105.joss.06452.crossref.xml
@@ -0,0 +1,214 @@

Crossref deposit metadata for 10.21105/joss.06452:

Batch ID: 20240520T154923-4a4e413a1a67cb30a2cb88f7f631936219ebd363
Timestamp: 20240520154923
Depositor: JOSS Admin <admin@theoj.org>
Registrant: The Open Journal

Journal: Journal of Open Source Software (JOSS), ISSN 2475-9066
Journal DOI: 10.21105/joss (https://joss.theoj.org)
Issue: Volume 9, Issue 97, May 2024

Article title: Jury: A Comprehensive Evaluation Toolkit
Authors: Devrim Cavusoglu, Secil Sen, Ulas Sert, Sinan Altinuc
Publication date: 2024-05-20
First page: 6452
Article DOI: 10.21105/joss.06452 (https://joss.theoj.org/papers/10.21105/joss.06452)
Full text: https://joss.theoj.org/papers/10.21105/joss.06452.pdf
License: CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/)
Software archive: 10.5281/zenodo.11170894
GitHub review issue: https://github.com/openjournals/joss-reviews/issues/6452

Citation list:

Sharma, S., El Asri, L., Schulz, H., & Zumer, J. (2017). Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. CoRR, abs/1706.09799. http://arxiv.org/abs/1706.09799

Barrault, L., Biesialska, M., Bojar, O., Costa-jussà, M. R., Federmann, C., Graham, Y., Grundkiewicz, R., Haddow, B., Huck, M., Joanis, E., Kocmi, T., Koehn, P., Lo, C., Ljubešić, N., Monz, C., Morishita, M., Nagata, M., Nakazawa, T., Pal, S., … Zampieri, M. (2020). Findings of the 2020 conference on machine translation (WMT20). Proceedings of the Fifth Conference on Machine Translation, 1–55. https://aclanthology.org/2020.wmt-1.1

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 353–355. https://doi.org/10.18653/v1/W18-5446

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. https://doi.org/10.18653/v1/N19-1423

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 32). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf

Lhoest, Q., Villanova del Moral, A., Jernite, Y., Thakur, A., von Platen, P., Patil, S., Chaumond, J., Drame, M., Plu, J., Tunstall, L., Davison, J., Šaško, M., Chhablani, G., Malik, B., Brandeis, S., Le Scao, T., Sanh, V., Xu, C., Patry, N., … Wolf, T. (2021). Datasets: A community library for natural language processing. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 175–184. https://doi.org/10.18653/v1/2021.emnlp-demo.21

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318. https://doi.org/10.3115/1073083.1073135

diff --git a/joss.06452/10.21105.joss.06452.pdf b/joss.06452/10.21105.joss.06452.pdf
new file mode 100644
index 0000000000..b312e4ec83
Binary files /dev/null and b/joss.06452/10.21105.joss.06452.pdf differ

diff --git a/joss.06452/paper.jats/10.21105.joss.06452.jats b/joss.06452/paper.jats/10.21105.joss.06452.jats
new file mode 100644
index 0000000000..95e323f262
--- /dev/null
+++ b/joss.06452/paper.jats/10.21105.joss.06452.jats
@@ -0,0 +1,391 @@
Journal of Open Source Software (JOSS)
ISSN 2475-9066, published by Open Journals

Article 6452, DOI 10.21105/joss.06452

Jury: A Comprehensive Evaluation Toolkit

Authors: Devrim Cavusoglu*, Secil Sen, Ulas Sert, Sinan Altinuc
Affiliations: OBSS AI; Middle East Technical University; Bogazici University
* E-mail:

Published 23 January 2024, Volume 9, Issue 97, Page 6452

Authors of papers retain copyright and release the work under a Creative Commons Attribution 4.0 International License (CC BY 4.0). © 2022 The article authors.

Keywords: Python, natural-language-generation, evaluation, metrics, natural-language-processing

Summary

Evaluation plays a critical role in deep learning as a fundamental building block of any prediction-based system. However, the vast number of Natural Language Processing (NLP) tasks and the steady growth in the number of metrics make it challenging to evaluate different systems with different metrics. To address these challenges, we introduce jury, a toolkit that provides a unified evaluation framework with standardized structures for performing evaluation across different tasks and metrics. The objective of jury is to standardize and improve metric evaluation for all systems and to aid the community in overcoming these challenges. Since its open-source release, jury has reached a wide audience and remains publicly available.
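To make this concrete, here is a minimal usage sketch following the project's README (the default metric set and exact score keys may vary between versions):

```python
from jury import Jury

# Inputs are aligned lists; multiple candidate predictions and multiple
# references per instance are both supported.
predictions = [
    ["the cat is on the mat", "There is cat playing on the mat"],
    ["Look! a wonderful day."],
]
references = [
    ["the cat is playing on the mat.", "The cat plays on the mat."],
    ["Today is a wonderful day", "The weather outside is wonderful."],
]

scorer = Jury()  # uses jury's default set of NLG metrics
scores = scorer(predictions=predictions, references=references)
print(scores)  # a single dictionary holding all metric results
```

A single call produces all metric scores at once instead of one compute call per metric.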

+
+ + Statement of need +

NLP tasks possess inherent complexity, requiring a comprehensive evaluation of model performance that goes beyond a single-metric comparison. Established benchmarks such as WMT (Barrault et al., 2020) and GLUE (Wang et al., 2018) rely on multiple metrics to evaluate models on standardized datasets. This practice promotes fair comparisons across different models and drives advances in the field. Embracing multiple-metric evaluation also provides valuable insight into a model's generalization capabilities: by considering diverse metrics such as accuracy, F1 score, BLEU, and ROUGE, researchers gain a holistic understanding of how a model responds to unseen inputs and how well it generalizes. Furthermore, task-specific NLP metrics enable the assessment of additional dimensions such as readability, fluency, and coherence. The comprehensive evaluation facilitated by multi-metric analysis allows for trade-off studies and aids in assessing generalization for task-independent models. Given these advantages, NLP specialists tend to employ multiple-metric evaluation.
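For instance, scoring the same outputs against several metrics at once might look like the following sketch (this assumes, per jury's documentation, that metric names can be passed as plain strings; the exact set of supported names is version-dependent):

```python
from jury import Jury

# Request several metrics in one evaluation for a more holistic view.
scorer = Jury(metrics=["bleu", "meteor", "rouge"])
scores = scorer(
    predictions=[["the cat is on the mat"]],
    references=[["the cat is playing on the mat."]],
)
# `scores` now holds BLEU, METEOR, and ROUGE results side by side,
# which makes trade-off analysis across metrics straightforward.
```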

+

Although multiple-metric evaluation is common, it remains awkward in practice because widely used metric libraries lack support for combined and/or concurrent metric computation. Consequently, researchers face the burden of evaluating their models one metric at a time, a process exacerbated by the scale and complexity of recent models and by limited hardware capabilities. This bottleneck impedes the efficient assessment of NLP models and highlights the need for better tooling for metric computation. For concurrency to be maximally beneficial, suitable hardware is required, which raises the question of hardware availability.
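The concurrency idea itself is conceptually simple; what is missing is library support. Below is a generic sketch (illustrative only, not jury's internal implementation) that parallelizes independent metric computations with a process pool, using the evaluate package:

```python
from concurrent.futures import ProcessPoolExecutor

import evaluate


def compute_metric(name, predictions, references):
    # Load inside the worker so each process owns its own metric instance.
    metric = evaluate.load(name)
    return name, metric.compute(predictions=predictions, references=references)


if __name__ == "__main__":
    predictions = ["the cat sat on the mat"]
    references = [["the cat is on the mat"]]

    # Independent metrics can be computed in parallel on a multi-core CPU.
    with ProcessPoolExecutor() as pool:
        futures = [
            pool.submit(compute_metric, name, predictions, references)
            for name in ("bleu", "meteor", "rouge")
        ]
        results = dict(future.result() for future in futures)
    print(results)
```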

+

The extent of achievable concurrency in NLP research has traditionally depended on the hardware resources available to researchers. However, significant advancements in recent years have notably reduced the cost of high-end hardware, including multi-core CPUs and GPUs. This progress has transformed high-performance computing resources, once prohibitively expensive and largely confined to particular institutions or research labs, into more accessible and affordable assets. For instance, the BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019) papers report that training leveraged powerful yet cost-effective hardware. These advancements show that hardware accessibility, previously a constraining factor, has been largely mitigated, allowing researchers to overcome the limitations associated with concurrent processing in NLP research.

+

To ease the use of automatic metrics in NLG research, several hands-on libraries have been developed, such as nlg-eval (Sharma et al., 2017) and datasets/metrics (Lhoest et al., 2021), now continued as evaluate. Although these libraries cover widely used NLG metrics, they either do not allow using multiple metrics in one go (i.e., combined evaluation) or provide only a crude way of doing so. They thus force users who want to evaluate a model with multiple metrics to compute each metric sequentially, which is time-consuming. Moreover, the libraries that do support combined evaluation suffer from problems such as requiring individual metric construction and offering no clean way to pass compute-time arguments (e.g., the n-gram order for BLEU (Papineni et al., 2002)). Our system provides an effective computation framework that overcomes these challenges.
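jury tackles the metric-construction and compute-time-argument problems by letting each metric be configured once, up front, along the lines of the README example below (the `load_metric` helper and its `resulting_name`/`compute_kwargs` parameters follow the project documentation at the time of writing and may change between versions):

```python
from jury import Jury
from jury.metrics import load_metric

# Construct metrics once, with their compute-time arguments attached,
# e.g. different n-gram orders for BLEU under distinct result names.
metrics = [
    load_metric("bleu", resulting_name="bleu_1", compute_kwargs={"max_order": 1}),
    load_metric("bleu", resulting_name="bleu_2", compute_kwargs={"max_order": 2}),
    load_metric("rouge"),
]

scorer = Jury(metrics=metrics)
scores = scorer(
    predictions=[["the cat is on the mat"]],
    references=[["the cat is playing on the mat."]],
)
```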

+

We designed a system that enables users to define their own metrics within a unified structure and to use multiple metrics in the evaluation process. Our library also builds on the datasets package to promote open-source contribution: when users implement a metric, the implementation can be contributed back to the datasets package, and any new metric released by the datasets package is readily available in our library as well.
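To illustrate what a unified metric structure affords, the sketch below defines a toy user-defined metric; the `CustomMetric` class and its `compute` signature are hypothetical stand-ins for illustration, not jury's actual base-class API:

```python
from typing import Dict, List


class CustomMetric:
    """Toy metric: rate of instances where any prediction exactly
    matches any of its references. Illustrative only; see the jury
    documentation for the real base class to subclass."""

    name = "exact_match"

    def compute(
        self,
        predictions: List[List[str]],
        references: List[List[str]],
    ) -> Dict[str, float]:
        hits = sum(
            any(p == r for p in preds for r in refs)
            for preds, refs in zip(predictions, references)
        )
        return {self.name: hits / max(len(predictions), 1)}


metric = CustomMetric()
print(metric.compute(
    predictions=[["the cat is on the mat"]],
    references=[["the cat is on the mat", "a cat sits on the mat"]],
))  # {'exact_match': 1.0}
```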

+
+ + Acknowledgements +

We would like to express our appreciation to Cemil Cengiz for fruitful discussions.

+
References

Sharma, S., El Asri, L., Schulz, H., & Zumer, J. (2017). Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. CoRR, abs/1706.09799. http://arxiv.org/abs/1706.09799

Barrault, L., Biesialska, M., Bojar, O., Costa-jussà, M. R., Federmann, C., Graham, Y., Grundkiewicz, R., Haddow, B., Huck, M., Joanis, E., Kocmi, T., Koehn, P., Lo, C.-k., Ljubešić, N., Monz, C., Morishita, M., Nagata, M., Nakazawa, T., Pal, S., Post, M., & Zampieri, M. (2020). Findings of the 2020 conference on machine translation (WMT20). Proceedings of the Fifth Conference on Machine Translation, 1–55. Association for Computational Linguistics. https://aclanthology.org/2020.wmt-1.1

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 353–355. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-5446

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 32). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf

Lhoest, Q., Villanova del Moral, A., Jernite, Y., Thakur, A., von Platen, P., Patil, S., Chaumond, J., Drame, M., Plu, J., Tunstall, L., Davison, J., Šaško, M., Chhablani, G., Malik, B., Brandeis, S., Le Scao, T., Sanh, V., Xu, C., Patry, N., McMillan-Major, A., Schmid, P., Gugger, S., Delangue, C., Matussière, T., Debut, L., Bekman, S., Cistac, P., Goehringer, T., Mustar, V., Lagunas, F., Rush, A., & Wolf, T. (2021). Datasets: A community library for natural language processing. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 175–184. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-demo.21

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318. Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135