Skip to content

Latest commit

 

History

History
9 lines (5 loc) · 1.12 KB

README.md

File metadata and controls

9 lines (5 loc) · 1.12 KB

This code is a Python script that processes a text dataset and creates a new dataset of prompt-completion pairs.

The script assumes that the input dataset is in the form of a pandas DataFrame with columns 'Timestamp', 'Speaker', and 'Text'. It first filters out any rows where the speaker is not 'Jaime', and then groups the remaining rows by speaker and concatenates the text from consecutive rows by the same speaker.

Next, the script creates a new DataFrame where each row corresponds to a prompt-completion pair. The prompt is the text spoken by 'Joe Rogan' and the completion is the text spoken by 'Elon Musk' in the following exchange. The script assumes that the speaker order is as follows: 'Joe Rogan', 'Elon Musk', 'Joe Rogan', 'Elon Musk', etc.

The prompt-completion pairs are stored in a list of dictionaries, where each dictionary contains two keys: 'prompt' and 'completion'. Finally, the script creates a new pandas DataFrame from this list and writes it to disk.

The code could be improved by adding more error handling and making it more modular, so that it could be used with different input datasets and speaker orders.