asgaardlab/hf-question-answer

Official repository of the paper "Answering User Questions about Machine Learning Models through Standardized Model Cards"

Install packages

pip install -r requirements.txt

We used Python 3.11 for this project.

Prepare Data

Collect Data

To collect the list of models and their discussions from the Hugging Face Hub, run the following command from the data_collector directory.

python main.py
  • List of models will be saved in data/all_models.csv file.
  • Discussions along with pull requests will be saved inside the data/discussions directory. The directory structure is as follows:
├── data: all the data generated after running the scripts are saved in this directory
│   ├── discussions: directory to save all discussions and pull requests
│   │   ├── <model_id>: model repository to save discussions and pull requests. the `/` in the `model_id` is replaced with '@'. an empty directory means there are no discussions and pull requests in the repository.
│   │   │   ├── discussion_<discussion_number>.yaml: a discussion file containing the discussion details
│   │   │   ├── pull_request_<pull_request_number>.yaml: a pull request file containing the pull request details
  • A list of the downloaded discussions will be created in data/all_discussions.csv file
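The naming convention above can be sketched in a few lines of Python. This is only an illustration of the layout described in this README, not the code in data_collector; the helper names `model_dir` and `discussion_file` are hypothetical.

```python
from pathlib import Path

# Root of the layout described above (relative to the repository).
DISCUSSIONS_ROOT = Path("data") / "discussions"


def model_dir(model_id: str) -> Path:
    """Return the directory for a model; '/' in the model id becomes '@'."""
    return DISCUSSIONS_ROOT / model_id.replace("/", "@")


def discussion_file(model_id: str, number: int, is_pull_request: bool = False) -> Path:
    """Return the YAML path for a discussion or a pull request of a model."""
    prefix = "pull_request" if is_pull_request else "discussion"
    return model_dir(model_id) / f"{prefix}_{number}.yaml"
```

For example, `discussion_file("some-org/some-model", 3)` points to `discussion_3.yaml` inside the `some-org@some-model` directory (the model id here is a made-up placeholder).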

Select Sample Data for Analysis

To select sample data for manual analysis, run the following command from the data_analyzer directory.

python random_discussion_selector.py
  • A list of 378 randomly selected discussions will be created in data/all_random_discussions.csv file.
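As a rough illustration of this step, the sketch below draws a fixed-size random sample without replacement. It is not the code in random_discussion_selector.py; the function name and the fixed seed are assumptions made here so that the example is reproducible.

```python
import random

SAMPLE_SIZE = 378  # sample size stated in this README


def select_random_discussions(rows, sample_size=SAMPLE_SIZE, seed=42):
    """Return `sample_size` distinct rows chosen uniformly at random.

    The seed is an assumption for reproducibility, not a value from the repository.
    """
    rows = list(rows)
    if sample_size > len(rows):
        raise ValueError("sample size exceeds the number of available rows")
    return random.Random(seed).sample(rows, sample_size)
```

Using a dedicated `random.Random` instance keeps the sampling independent of any other code that touches the global random state.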

Filter Data

Filter Sample Discussions

To filter the random discussions in data/all_random_discussions.csv file, run the following command from the data_cleaner directory.

python random_discussion_cleaner.py
  • Filtered list of random discussions will be saved in data/cleaned_random_discussions.csv file.

Filter Models and All Discussions

To filter the models and all the discussions, run the following command from the data_cleaner directory.

python main.py
  • Filtered list of models will be saved in data/quality_models.csv.
  • Discussion list of the filtered models will be saved in data/quality_models_discussions.csv.
  • Filtered list of discussions will be saved in data/cleaned_discussions.csv file.

Analyze GPT's Performance in Classifying Discussions

Classify Sample Discussions

To classify the filtered random discussion posts using gpt-3.5-turbo-0125, run the following command from the discussion_classifier directory

python random_discussion_classifier.py

Please note that you need to have an OpenAI API key to run the classification. The key should be saved in the OPENAI_API_KEY variable of the util/constants.py file.

  • Classification will run 3 times, saving the results in data/random_discussion_classification directory. The result generated by GPT for each discussion will be saved in an md file in format <index>_<model_id>_<discussion_number>_result_gpt-3-5.md. The 3 runs' results will be saved in run_1, run_2, and run_3 directories.
  • Classification results will also be saved in the contains_question_run_<run_number> columns of the data/cleaned_random_discussions.csv file.
  • The final decision about the class will be saved in the contains_question_final_class column of the data/cleaned_random_discussions.csv file.
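This README does not spell out how the final class is derived from the three runs. One plausible reading is a majority vote, sketched below under that assumption; the function name `final_class` is hypothetical and the actual rule in random_discussion_classifier.py may differ.

```python
from collections import Counter


def final_class(run_results):
    """Return the label that a strict majority of runs agree on.

    Majority voting is an assumption made for this sketch.
    """
    label, count = Counter(run_results).most_common(1)[0]
    if count <= len(run_results) / 2:
        raise ValueError("no label has a strict majority")
    return label
```

With an odd number of runs and a binary label, a strict majority always exists, so the error branch only matters for other configurations.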

Prepare Ground Truth

Two authors independently and manually identified whether the sample discussions contain questions. The ground truth is available in the data/gpt_sample_discussion_classification.xlsx file. 1st_author_classes and 2nd_author_classes contain the classes assigned by the two authors, and agreed_classes contains the classes they agreed on. Their agreement is calculated using Cohen's Kappa and saved in the cohens_kappa sheet. The disagreement resolution is saved in the disagreement_resolution sheet.
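For readers unfamiliar with Cohen's Kappa, the following is a minimal standard-library sketch of the statistic for two raters. It is not the calculation stored in the spreadsheet; the function name and the handling of the degenerate case (both raters always use one identical label) are choices made here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: a single label used by both raters
    return (observed - expected) / (1 - expected)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.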

Evaluate Performance

The evaluation of GPT's performance in classifying the sample discussion posts as question-containing posts is available in the gpt_classification_evaluation sheet of the data/gpt_sample_discussion_classification.xlsx file.

Identify Question-Containing Discussions

To classify all the filtered discussion posts using gpt-3.5-turbo-0125, run the following command from the discussion_classifier directory

python all_discussion_classifier.py

Please note that you need to have an OpenAI API key to run the classification. The key should be saved in the OPENAI_API_KEY variable of the util/constants.py file.

  • Classification will run 3 times, saving the results in data/all_discussion_classification directory. The result generated by GPT for each discussion will be saved in an md file in format <index>_<model_id>_<discussion_number>_result_gpt-3-5.md. The 3 runs' results will be saved in run_1, run_2, and tie_breakers directories.
  • Classification results will also be saved in the contains_question_run_1, contains_question_run_2, and contains_question_tie_breaker columns of the data/cleaned_discussions.csv file, respectively.
  • The final decision about the class will be saved in the contains_question_final_class column of the data/cleaned_discussions.csv file.
  • List of question-containing discussions will be saved in data/all_questions.csv file.
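The column names suggest that the tie breaker decides only when the first two runs disagree. The sketch below encodes that reading; it is an assumption about the logic, not the code in all_discussion_classifier.py, and the function name `resolve_final_class` is hypothetical.

```python
def resolve_final_class(row):
    """Derive the final class for one discussion from its per-run columns.

    Assumed rule: if run 1 and run 2 agree, use their label;
    otherwise use the tie breaker's label.
    """
    run_1 = row["contains_question_run_1"]
    run_2 = row["contains_question_run_2"]
    if run_1 == run_2:
        return run_1
    return row["contains_question_tie_breaker"]


def question_rows(rows):
    """Keep only the rows whose final class marks them as containing a question."""
    return [row for row in rows if resolve_final_class(row)]
```

Under this reading, `question_rows` corresponds to the selection that ends up in data/all_questions.csv, although how labels are encoded in the CSV files is not stated here.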

Prepare Results

Generate Plots

To generate all the plots, run the following command from the plot_generator directory

python main.py
  • Plots will be generated in data/plots directory in pdf and png format.

Topic Modeling Discussion Posts

To train a BERTopic model on the discussion posts, first run the following command from the repository root

python -m spacy download en_core_web_sm

Then run the following command from the discussion_topic_modeller directory

python bertopic_topic_modeller.py
  • Trained BERTopic model file model_min_cluster_size_60 will be saved in data/bertopic_model.
  • Our trained model is available here.

To save the representative documents and keywords for each topic, run the following command from the discussion_topic_modeller directory

python topic_analyzer.py
  • Representative documents and keywords of the topics will be saved in data/bertopic_model/topics/<topic_id>.md file.

To visualize the topics, run the discussion_topic_modeller/bertopic_topic_visualizer.ipynb notebook.

To visualize the clusters of the topics, run the following command from the discussion_topic_modeller directory

python topic_cluster_visualizer.py
  • Topic ids of the same clusters will be printed in the console.
  • Cluster visualization will be saved in data/bertopic_model/model_min_cluster_size_60_hierarchy_plot.pdf file. The GPT-generated labels for the topics are used in the visualization.
  • Cluster visualization with our own labels will be generated in the data/bertopic_model/custom_label_hierarchy_plot.pdf file. The labels are available in the data/bertopic_model/topic_custom_label.csv file.

Result of Question Mapping

Two authors independently and manually mapped the questions to the model cards. The mapping result is available in the data/manual_question_mapping.xlsx file. The 1st and 2nd authors' mapping results are saved in the author1_labels and author2_labels sheets, respectively. The disagreement resolution is saved in the resolution column of the author1_labels sheet. To calculate the inter-rater agreement, run the following command from the data_analyzer directory

python irr_calculator.py
  • Kappa score of the 2 rounds of mapping will be printed in the console.
