
LLM-BioMed-NER-RE

This repository contains the LLM evaluation code for the npj Digital Medicine paper "An In-Depth Evaluation of Federated Learning on Biomedical Natural Language Processing for Information Extraction". The datasets used in this paper were downloaded from the [FedNLP Repo]. In particular, NCBI-Disease and 2018 n2c2 are used for named entity recognition (NER), and GAD and 2018 n2c2 are used for relation extraction (RE).
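
To make the two task types concrete, here is a small, hypothetical illustration of what NER and RE ask a model to do (the field names and example sentences are ours for illustration, not the repo's data schema):

```python
# Hypothetical illustration of the two tasks (not the repo's data schema).

# Named entity recognition (NER): locate disease mentions in a sentence.
ner_example = {
    "text": "Mutations in BRCA1 are associated with hereditary breast cancer.",
    "entities": [{"span": "hereditary breast cancer", "type": "Disease"}],
}

# Relation extraction (RE): classify whether the marked gene-disease pair is related.
re_example = {
    "text": "Mutations in @GENE$ are associated with @DISEASE$.",
    "label": 1,  # 1 = related, 0 = not related (GAD-style binary label)
}
```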

Table 1: Results of LLMs with the best score among 1/5/10/20-shot prompting on the NER and RE tasks, compared with BlueBERT and GPT-2 trained with federated learning.

| Model | NER: NCBI (Strict) | NER: NCBI (Lenient) | NER: 2018 n2c2 (Strict) | NER: 2018 n2c2 (Lenient) | RE: 2018 n2c2 (F1) | RE: GAD (F1) |
|---|---|---|---|---|---|---|
| Mistral 8x7B Instruct | 0.409 | 0.587 | 0.514 | 0.648 | 0.314 | 0.459 |
| GPT 3.5 | 0.575 | 0.719 | 0.565 | 0.705 | 0.290 | 0.485 |
| GPT 4 | 0.722 | 0.834 | 0.616 | 0.751 | 0.882 | 0.543 |
| PaLM 2 Bison | 0.640 | 0.756 | 0.544 | 0.653 | 0.407 | 0.468 |
| PaLM 2 Unicorn | 0.726 | 0.848 | 0.621 | 0.749 | 0.888 | 0.549 |
| Gemini 1.0 Pro | 0.654 | 0.779 | 0.566 | 0.694 | 0.411 | 0.541 |
| Llama 3 70B Instruct | 0.685 | 0.786 | 0.551 | 0.695 | 0.319 | 0.458 |
| Claude 3 Opus | 0.788 | 0.879 | 0.680 | 0.787 | 0.832 | 0.569 |
| BlueBERT (FL) | 0.824 | 0.899 | 0.954 | 0.986 | 0.950 | 0.714 |
| GPT-2 (FL) | 0.784 | 0.840 | 0.830 | 0.868 | 0.946 | 0.721 |
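
In Table 1, "Strict" and "Lenient" refer to the usual NER span-matching criteria: strict requires the predicted span to match the gold span exactly, while lenient also credits partial overlaps. A minimal sketch of the two rules (our own illustrative helper, not the paper's evaluation code):

```python
def strict_match(pred_span, gold_span):
    """Exact match: identical start and end character offsets."""
    return pred_span == gold_span

def lenient_match(pred_span, gold_span):
    """Overlap match: the two spans share at least one character."""
    p_start, p_end = pred_span
    g_start, g_end = gold_span
    return p_start < g_end and g_start < p_end

# Example: (0, 5) vs (2, 8) fails the strict check but passes the lenient one.
assert not strict_match((0, 5), (2, 8))
assert lenient_match((0, 5), (2, 8))
```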

NOTE:

- The GPT checkpoints are gpt-4-1106-preview and gpt-3.5-turbo-1106.
- Mistral 8x7B Instruct was run in half precision (~85 GB), and Llama 3 70B Instruct was run with 4-bit quantization (~45 GB); see the loading sketch below.
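
For reference, a sketch of how such memory footprints are typically obtained with Hugging Face `transformers` (the checkpoint IDs and loading options are illustrative assumptions, not necessarily what the notebooks use):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Half precision (~85 GB total) for the 8x7B instruct model.
mixtral = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",  # illustrative checkpoint ID
    torch_dtype=torch.float16,
    device_map="auto",  # shard across available GPUs
)

# 4-bit quantization (~45 GB) for Llama 3 70B Instruct via bitsandbytes.
llama = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",  # illustrative checkpoint ID
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```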

Data

More details of the datasets can be found in the `data` directory.

Models

| Model | RLHF-Tuned | Instruction-Tuned | Max Input Tokens |
|---|---|---|---|
| Mistral 8x7B | No | Yes | 32K |
| GPT 3.5 (Chat) | Yes | No | 16K |
| GPT 4 (Chat) | Yes | No | 128K |
| PaLM 2 Bison (Chat) | No | No | 8K |
| PaLM 2 Unicorn (Text) | No | No | 8K |
| Gemini Pro (Chat) | No | No | 32K |
| Claude 3 (Chat) | Yes | No | 200K |
| Llama 3 70B | Yes | Yes | 8K |

The models used in this paper are mostly chat models, plus one text-completion model, none of them specifically tuned for NER and RE tasks. We applied in-context learning by providing examples in the prompt. Even with 20-shot prompting, the input length stays within 8K tokens, which every model can handle within its context window.
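
As an illustration, a minimal k-shot NER prompt using the OpenAI chat API (the instruction wording, examples, and output format are simplified placeholders, not the exact prompts from the paper):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative few-shot (here 2-shot) examples; the paper's prompts may differ.
shots = [
    ("Patients with cystic fibrosis often develop chronic lung infections.",
     "cystic fibrosis"),
    ("The risk of colorectal cancer rises with age.",
     "colorectal cancer"),
]

messages = [{
    "role": "system",
    "content": "Extract all disease mentions from the sentence. "
               "Return them as a comma-separated list.",
}]
for sentence, answer in shots:
    messages.append({"role": "user", "content": sentence})
    messages.append({"role": "assistant", "content": answer})
messages.append({
    "role": "user",
    "content": "Hereditary breast cancer is linked to BRCA1 mutations.",
})

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",  # checkpoint noted above
    messages=messages,
    temperature=0,
)
print(response.choices[0].message.content)
```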

The example notebooks are in the root folder.
