CornellMovieDialogsCorpus.jl is a Julia package that provides a thin wrapper for the Cornell Movie Dialogs Corpus.
Exported functions:
movie_conversations
movie_lines
movie_title_metadata
movie_character_metadata
movie_script_urls
Each of these loads the corresponding corpus database file.
Suppose you want to train a simple chatbot on "call-and-response" dialog pairs, as in this PyTorch tutorial.
using CornellMovieDialogsCorpus
First, create a Dict that maps line IDs to the raw text.
id2text = Dict(l.line_id => l.text for l in movie_lines())
Now, create a dataset of (utterance, response) pairs from the movie conversations.
utterance_pairs = [(id2text[id], id2text[conv.lines[i+1]])
                   for conv in movie_conversations()
                   for (i, id) in enumerate(conv.lines[1:end-1])]
julia> utterance_pairs[1:5]
5-element Array{Tuple{Any,Any},1}:
("Can we make this quick? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad. Again.", "Well, I thought we'd start with pronunciation, if that's okay with you.")
("Well, I thought we'd start with pronunciation, if that's okay with you.", "Not the hacking and gagging and spitting part. Please.")
("Not the hacking and gagging and spitting part. Please.", "Okay... then how 'bout we try out some French cuisine. Saturday? Night?")
("You're asking me out. That's so cute. What's your name again?", "Forget it.")
("No, no, it's my fault -- we didn't have a proper introduction ---", "Cameron.")
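To try the pairing logic without loading the corpus, here is a minimal sketch on toy data. The records are hypothetical named tuples that only mimic the `line_id`, `text`, and `lines` fields used above; the IDs and texts are invented, and the `toy_` names are not part of the package. It uses `zip` to pair each line with its successor, which is one possible alternative to indexing with `enumerate`.

```julia
# Hypothetical stand-ins for the records returned by movie_lines() and
# movie_conversations(). Only the field names come from the example above;
# the IDs and texts are made up for illustration.
toy_lines = [
    (line_id = "L1", text = "Hi."),
    (line_id = "L2", text = "Hello."),
    (line_id = "L3", text = "How are you?"),
    (line_id = "L4", text = "Bye."),
    (line_id = "L5", text = "See you."),
]
toy_conversations = [
    (lines = ["L1", "L2", "L3"],),
    (lines = ["L4", "L5"],),
]

# Map line IDs to text. Spelling out the key and value types is optional here.
toy_id2text = Dict{String,String}(l.line_id => l.text for l in toy_lines)

# Pair each line with the one that follows it in the same conversation.
# zip stops at the shorter iterator, so the last line never starts a pair.
toy_pairs = [(toy_id2text[a], toy_id2text[b])
             for conv in toy_conversations
             for (a, b) in zip(conv.lines, conv.lines[2:end])]
```

Because pairs are built per conversation, the last line of one conversation is never paired with the first line of the next, which matches the gap visible in the output above.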