xtuner #5

Merged
merged 4 commits into from
Dec 19, 2023
1,874 changes: 1,874 additions & 0 deletions xtuner/MedQA2019-structured-test.jsonl

Large diffs are not rendered by default.

4,340 changes: 4,340 additions & 0 deletions xtuner/MedQA2019-structured-train.jsonl

Large diffs are not rendered by default.

6,212 changes: 6,212 additions & 0 deletions xtuner/MedQA2019-structured.jsonl

Large diffs are not rendered by default.

Binary file added xtuner/MedQA2019.xlsx
Binary file not shown.
666 changes: 666 additions & 0 deletions xtuner/README.md

Large diffs are not rendered by default.

Binary file added xtuner/imgs/afterFT.png
Binary file added xtuner/imgs/beforeFT.png
Binary file added xtuner/imgs/bugfix1.png
Binary file added xtuner/imgs/bugfix2.png
Binary file added xtuner/imgs/cat_fly.png
Binary file added xtuner/imgs/cfgs.png
Binary file added xtuner/imgs/dataProcessed.png
Binary file added xtuner/imgs/head.png
Binary file added xtuner/imgs/medqa2019samples.png
Binary file added xtuner/imgs/msagent_data.png
Binary file added xtuner/imgs/serper.png
Binary file added xtuner/imgs/ysqd.png
35 changes: 35 additions & 0 deletions xtuner/split2train_and_test.py
@@ -0,0 +1,35 @@
import json
import random


def split_conversations(input_file, train_output_file, test_output_file):
    # Read the input file. Despite the .jsonl extension, it holds a single JSON array.
    with open(input_file, 'r', encoding='utf-8') as jsonl_file:
        data = json.load(jsonl_file)

    # Count the number of conversation elements
    num_conversations = len(data)

    # Shuffle the data randomly (one shuffle already gives a uniform permutation)
    random.shuffle(data)

    # Calculate the split point: 70% train, 30% test
    split_point = int(num_conversations * 0.7)

    # Split the data into train and test
    train_data = data[:split_point]
    test_data = data[split_point:]

    # Write the train data as a JSON array
    with open(train_output_file, 'w', encoding='utf-8') as train_jsonl_file:
        json.dump(train_data, train_jsonl_file, indent=4)

    # Write the test data as a JSON array
    with open(test_output_file, 'w', encoding='utf-8') as test_jsonl_file:
        json.dump(test_data, test_jsonl_file, indent=4)

    print(f"Split complete. Train data written to {train_output_file}, Test data written to {test_output_file}")

# Replace the file names below with your actual input and output file names
if __name__ == "__main__":
    split_conversations('MedQA2019-structured.jsonl', 'MedQA2019-structured-train.jsonl', 'MedQA2019-structured-test.jsonl')
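Review note: the split above depends on the global random state, so each run produces a different train/test partition. A minimal sketch of a reproducible variant follows, assuming a fixed seed is acceptable for this dataset. The names `split_records`, `train_ratio`, and `seed` are hypothetical and not part of this PR, and the sample records are placeholders.

```python
import random


def split_records(records, train_ratio=0.7, seed=0):
    """Return (train, test) lists without mutating `records`.

    Hypothetical helper: uses a private Random instance so the
    partition is the same on every run for a given seed.
    """
    shuffled = list(records)               # copy, so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)  # seeded shuffle, independent of global state
    split_point = int(len(shuffled) * train_ratio)
    return shuffled[:split_point], shuffled[split_point:]


# Placeholder records in the same shape the conversion script emits.
records = [
    {"conversation": [{"system": "s", "input": f"q{i}", "output": f"a{i}"}]}
    for i in range(10)
]
train, test = split_records(records)
print(len(train), len(test))
```

Calling `split_records` twice with the same seed returns the same partition, which makes the committed train and test files regenerable from the source file.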
35 changes: 35 additions & 0 deletions xtuner/xlsx2jsonl.py
@@ -0,0 +1,35 @@
import openpyxl
import json

def process_excel_to_json(input_file, output_file):
    # Load the workbook
    wb = openpyxl.load_workbook(input_file)

    # Select the "DrugQA" sheet
    sheet = wb["DrugQA"]

    # Initialize the output data structure
    output_data = []

    # Iterate through each row in column A and D
    for row in sheet.iter_rows(min_row=2, max_col=4, values_only=True):
        system_value = "You are a professional, highly experienced doctor professor. You always provide accurate, comprehensive, and detailed answers based on the patients' questions."

        # Create the conversation dictionary
        conversation = {
            "system": system_value,
            "input": row[0],
            "output": row[3]
        }

        # Append the conversation to the output data
        output_data.append({"conversation": [conversation]})

    # Write the output data to a JSON file
    with open(output_file, 'w', encoding='utf-8') as json_file:
        json.dump(output_data, json_file, indent=4)

    print(f"Conversion complete. Output written to {output_file}")

# Replace 'MedQA2019.xlsx' and 'output.jsonl' with your actual input and output file names
process_excel_to_json('MedQA2019.xlsx', 'output.jsonl')
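Review note: with `values_only=True`, `iter_rows` yields plain tuples, and empty cells come back as `None`, so a blank spreadsheet row would end up as a record with `null` input and output. A minimal sketch of the row-to-record step with such rows skipped follows. It runs on plain tuples so it does not need the workbook; `rows_to_records` is a hypothetical helper not present in this PR, and the sample rows are placeholders rather than real dataset entries.

```python
import json

SYSTEM_PROMPT = (
    "You are a professional, highly experienced doctor professor. "
    "You always provide accurate, comprehensive, and detailed answers "
    "based on the patients' questions."
)


def rows_to_records(rows, system_prompt=SYSTEM_PROMPT):
    """Build conversation records from (A, B, C, D) row tuples.

    Hypothetical helper: column A is the question, column D the answer,
    matching the indices used in the script above.
    """
    records = []
    for row in rows:
        question, answer = row[0], row[3]
        if question is None or answer is None:
            continue  # skip rows with an empty question or answer cell
        records.append({
            "conversation": [{
                "system": system_prompt,
                "input": str(question).strip(),
                "output": str(answer).strip(),
            }]
        })
    return records


# Placeholder rows shaped like iter_rows(..., values_only=True) output.
rows = [
    ("example question", None, None, "example answer"),
    (None, None, None, None),  # blank row, should be dropped
]
records = rows_to_records(rows)
# ensure_ascii=False keeps any non-ASCII text readable instead of \u-escaped.
print(json.dumps(records, ensure_ascii=False, indent=4))
```

Whether blank rows actually occur in the `DrugQA` sheet is not visible from this diff, so the filter is a precaution rather than a confirmed fix.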