# Data and Code for the ACL 2024 Paper "Evaluating Very Long-Term Conversational Memory of LLM Agents"

Authors: Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang

Paper: [arXiv:2402.17753](https://arxiv.org/abs/2402.17753)
We release LoCoMo, a high-quality evaluation benchmark of very long-term conversational data. The benchmark consists of ten conversations, each annotated for the question-answering and event-summarization tasks; the dialogs in each conversation can additionally be used for the multimodal-dialog-generation task. Dataset statistics are shown in the table below.
The dataset can be found in the `./data/locomo10.json` file in this repository. Each sample represents a single conversation and its corresponding annotations:
- `sample_id`: identifier for the sample
- `conversation`: list of sessions (`session_<num>`) and their timestamps (`session_<num>_date_time`), where `<num>` represents the chronological order of the sessions. It also includes the names of the two speakers, i.e., `speaker_a` and `speaker_b`.
  - A turn within each session contains the name of the `speaker`, the dialog id `dia_id`, and the content of the dialog `text`.
  - If a turn contains images, it also includes a link to the image (`img_url`), a caption generated by the BLIP model for the image (`blip_caption`), and the search `query` used by the third-party module `icrawler` to retrieve the image.
- `observation` (generated): observations for each of the sessions in `conversation` (`session_<num>_observation`). See below for the code to regenerate observations. These observations are used as one of the databases for evaluating retrieval-augmented generation (RAG) models in our paper.
- `session_summary` (generated): session-level summaries for each session in `conversation` (`session_<num>_summary`). See below for the code to regenerate session-level summaries. These summaries are also used as one of the databases for evaluating RAG models in our paper.
- `event_summary` (annotated): list of significant events for each speaker within each session in `conversation` (`events_session_<num>`). These are the ground-truth annotations for the event summarization task in the LoCoMo dataset.
- `qa` (annotated): question-answer annotations for the question-answering task in the LoCoMo dataset. Each sample contains a `question`, an `answer`, a `category` label, and, when available, a list of dialog ids that contain the answer, i.e., `evidence`.
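For quick orientation, here is a minimal Python sketch that walks the fields listed above. It assumes the top-level JSON is a list of samples and uses only the key names documented here; treat the exact nesting as an assumption to verify against the file.

```python
import json

# Load the benchmark; assumed to be a JSON list with one entry per conversation.
with open("./data/locomo10.json") as f:
    samples = json.load(f)

for sample in samples:
    conv = sample["conversation"]
    print(sample["sample_id"], "-", conv["speaker_a"], "&", conv["speaker_b"])

    # Sessions are numbered chronologically: session_1, session_2, ...
    num = 1
    while f"session_{num}" in conv:
        print(f"  session_{num} @", conv[f"session_{num}_date_time"])
        for turn in conv[f"session_{num}"]:
            line = f"    {turn['speaker']} ({turn['dia_id']}): {turn['text']}"
            if turn.get("img_url"):  # image turns also carry blip_caption and query
                line += f" [image: {turn.get('blip_caption')}]"
            print(line)
        num += 1

    # QA annotations: question, answer, category label, optional evidence dialog ids.
    for qa in sample["qa"]:
        print("  Q:", qa["question"], "| A:", qa.get("answer"), "| evidence:", qa.get("evidence"))
```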
Note 1: This release is a subset of the conversations released with the first arXiv version of our paper in March 2024. That initial release contained 50 conversations. We sampled a subset of the data to retain the longest conversations with high-quality annotations and to keep evaluation of closed-source LLMs cost-effective.

Note 2: We do not release the images. However, the web URLs, captions, and search queries for the images are included in the dataset.
Configuration variables such as API keys and output directories are set in `scripts/env.sh`, which is run at the beginning of all other scripts.
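The Python entry points can then pick these values up from the environment. Here is a minimal sketch; the variable names below are illustrative assumptions, so check `scripts/env.sh` for the names the repository actually exports.

```python
import os

# Hypothetical variable names; scripts/env.sh defines the real ones.
api_key = os.environ["OPENAI_API_KEY"]            # key for a closed-source LLM API
out_dir = os.environ.get("OUT_DIR", "./outputs")  # where generated files are written
```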
## Generate very long-term conversations between two LLM agents with pre-assigned personalities using our LLM-based generative framework

The code to generate conversations is available in `scripts/generate_conversations.sh` and can be run as follows:

```bash
bash scripts/generate_conversations.sh
```
This code can be run under two settings:

- Generate conversations between agents assigned custom personas. To enable this setting, point `--out-dir` to a directory containing the files `agent_a.json` and `agent_b.json`. These files should contain the `name` and `persona_summary` of the speaker represented by the agent; see an example at `data/multimodal_dialog/example` and the sample below (a sketch for creating these files programmatically follows this list).

```json
{
    "name": "Angela",
    "persona_summary": "Angela is a 31 year old woman who works as the manager of a gift shop in Chapel Hill. She curates interesting pieces from local artists and has maintained a beautiful gallery in the form of the gift shop. She also makes her own art sometimes, in the form of oil paintings."
}
```

- Create personalities using prompts from the MSC dataset. To enable this setting, point `--out-dir` to an empty directory. This makes the script sample a pair of personalities from `data/msc_personas_all.json`.
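To prepare the custom-persona setting programmatically, here is a short sketch that writes the two files the script looks for. The output directory and the second persona are hypothetical; the file names and keys come from the description above.

```python
import json
import os

out_dir = "data/my_custom_agents"  # hypothetical; pass the same path via --out-dir
os.makedirs(out_dir, exist_ok=True)

agents = {
    "agent_a.json": {
        "name": "Angela",
        "persona_summary": "Angela is a 31 year old woman who manages a gift shop in Chapel Hill.",
    },
    "agent_b.json": {  # hypothetical second persona, for illustration only
        "name": "Derek",
        "persona_summary": "Derek is a 40 year old carpenter who restores antique furniture.",
    },
}
for fname, agent in agents.items():
    with open(os.path.join(out_dir, fname), "w") as f:
        json.dump(agent, f, indent=2)
```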
See `scripts/generate_conversations.py` for details on the various parameters that can be tweaked for generating the conversations. For example, `--num-days` can be changed to specify the temporal span of the conversations.
## Evaluate open-source and closed-source LLMs on the LoCoMo question-answering task with the (truncated) conversation as context

- Evaluate OpenAI models:

```bash
bash scripts/evaluate_gpts.sh
```

- Evaluate Anthropic models:

```bash
bash scripts/evaluate_claude.sh
```

- Evaluate Gemini models:

```bash
bash scripts/evaluate_gemini.sh
```

- Evaluate models available on Hugging Face:

```bash
bash scripts/evaluate_hf_llm.sh
```
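For reference, a minimal sketch of this setup for an OpenAI model: flatten the conversation into a transcript, truncate it to fit the context window, and ask the model each annotated question. This is an illustration rather than the repository's actual code; the prompt wording and the naive character-based truncation are assumptions.

```python
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("./data/locomo10.json") as f:
    sample = json.load(f)[0]
conv = sample["conversation"]

# Flatten all sessions into a single transcript, oldest first.
lines, num = [], 1
while f"session_{num}" in conv:
    lines.append(f"Session {num} ({conv[f'session_{num}_date_time']}):")
    lines += [f"{t['speaker']}: {t['text']}" for t in conv[f"session_{num}"]]
    num += 1
transcript = "\n".join(lines)[-48000:]  # naive truncation to the most recent text

for qa in sample["qa"]:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the conversation provided."},
            {"role": "user", "content": f"{transcript}\n\nQuestion: {qa['question']}"},
        ],
    )
    print(qa["question"], "->", response.choices[0].message.content)
```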
## Generate observations and session summaries from LoCoMo conversations using `gpt-3.5-turbo` for evaluating RAG-based models

We provide the observations and summaries with our release of the LoCoMo dataset. Follow these instructions to regenerate them, or to generate them for a different set of conversations.

- Generate observations from all sessions:

```bash
bash scripts/generate_observations.sh
```

- Generate a summary of each session:

```bash
bash scripts/generate_session_summaries.sh
```
Note 3: Session summaries are different from the event summaries of the event summarization task. The former summarize only a single session, whereas event summaries are specific to each speaker and contain causal, temporal connections across sessions.
## Evaluate retrieval-augmented `gpt-3.5-turbo` on the LoCoMo question-answering task using (a) dialogs, (b) observations, and (c) session summaries as databases

- Evaluate `gpt-3.5-turbo` using retrieval-based augmentation:

```bash
bash scripts/evaluate_rag_gpts.sh
```
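Here is a rough sketch of the retrieval step, using TF-IDF as a stand-in retriever (the retriever used in the paper and the exact database construction may differ; the flattening of `observation` below is an assumption based on the schema described earlier):

```python
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

with open("./data/locomo10.json") as f:
    sample = json.load(f)[0]

def flatten(x):
    # Observations may be nested (e.g., per session, per speaker); reduce to strings.
    if isinstance(x, dict):
        for v in x.values():
            yield from flatten(v)
    elif isinstance(x, list):
        for v in x:
            yield from flatten(v)
    else:
        yield str(x)

database = list(flatten(sample["observation"]))  # one retrievable unit per observation

vectorizer = TfidfVectorizer().fit(database)
db_vectors = vectorizer.transform(database)

def retrieve(question: str, k: int = 5) -> list[str]:
    """Return the k database entries most similar to the question."""
    scores = cosine_similarity(vectorizer.transform([question]), db_vectors)[0]
    return [database[i] for i in scores.argsort()[::-1][:k]]

question = sample["qa"][0]["question"]
context = "\n".join(retrieve(question))
# `context` would then replace the full conversation in the question-answering prompt.
print(context)
```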
## Evaluate models on the LoCoMo event summarization task

Coming soon!

## Generate multimodal dialogs from LoCoMo conversations

Coming soon!
## Citation

Please cite our paper if you use LoCoMo in your work:

```bibtex
@article{maharana2024evaluating,
  title={Evaluating very long-term conversational memory of llm agents},
  author={Maharana, Adyasha and Lee, Dong-Ho and Tulyakov, Sergey and Bansal, Mohit and Barbieri, Francesco and Fang, Yuwei},
  journal={arXiv preprint arXiv:2402.17753},
  year={2024}
}
```