irthomasthomas opened this issue on Mar 16, 2024 · 1 comment
Labels: dataset (public datasets and embeddings), Git-Repo (Source code repository like gitlab or gh), Papers (Research papers), python (Python code, tools, info), Sqlite (Sqlite DB and tools)
irthomasthomas changed the title from "PeaTMOSS-Demos/README.md at main · PurdueDualityLab/PeaTMOSS-Demos" to "PeaTMOSS-Demos - database of real-world uses of Pre-Trained Models" on Mar 16, 2024.
PeaTMOSS-Demos
This repository contains information about the Pre-Trained Models in Open-Source Software (PeaTMOSS) dataset.
Table of Contents
- About
- Globus
- Metadata Description
- Dependencies
- How To Install
- How to Run
- Tutorial
About
This repository contains a zipped sample of the PeaTMOSS dataset, as well as a script that demonstrates possible interactions with the SQLite database used to store the metadata dataset. The complete PeaTMOSS dataset contains snapshots of Pre-Trained machine learning Model (PTM) repositories and the downstream open-source GitHub repositories that reuse the PTMs, metadata about the PTMs, the pull requests and issues of the GitHub repositories, and links between the downstream GitHub repositories and the PTM models. The schema of the SQLite database is specified by PeaTMOSS.py and PeaTMOSS.sql. The sample of the database is PeaTMOSS_sample.db. The full database, as well as all captured repository snapshots, are available through the Globus Share described below.
- Note: When unzipping .tar.gz snapshots, include the --strip-components=4 flag in the tar invocation, like so:

  tar --strip-components=4 -xvzf {name}.tar.gz

  If you do not do this, you will have 4 extraneous parent directories encasing the repository.
Globus
Globus Share
All zipped repos and the full metadata dataset are available through Globus Share.
If you do not have an account, follow the Globus docs on how to sign up. You may create an account through a partnered organization, if you belong to one, or through a Google or ORCID account.
Globus Connect Personal
To access the metadata dataset using the globus.py script provided in the repository:
1. Create your own private Globus collection on Mac, Windows, or Linux.
2. Once this is created, make sure Globus Connect Personal is running before executing globus.py.

NOTE: In some cases, you may run into permission issues on Globus when running the script. If this happens, you will need to change local_endpoint.endpoint_id, located on line 29, to your private collection's UUID:

local_endpoint_id = local_endpoint.endpoint_id

To locate your private collection's UUID, click the Globus icon in your taskbar and select "Web: Collection Details". On this page, scroll down to the bottom, where the UUID field for your collection should be visible, and replace the variable with your collection's UUID expressed as a string. Then use the Activity tab to terminate the existing transfer and rerun globus.py.
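If you need to hard-code the UUID, the edited assignment on line 29 might look like the following sketch (the UUID shown is a made-up placeholder; substitute your own collection's UUID):

```python
# Replace the dynamic endpoint lookup with your private collection's UUID,
# expressed as a string (the value below is a placeholder, not a real UUID).
local_endpoint_id = "0123abcd-4567-89ef-0123-456789abcdef"
```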
Metadata Description
The following model hubs are captured in our database:
- Hugging Face
- PyTorch Hub

The content for each model hub is listed in the table below:

| Model hub | #PTMs | #Snapshotted Repos | #Discussions (PRs, issues) | #Links | Size of Zipped Snapshots |
| --- | --- | --- | --- | --- | --- |
| Hugging Face | 281,276 | 14,899 | 59,011 | 30,514 | 44TB |
| PyTorch Hub | 362 | 361 | 52,161 | 13,823 | 1.3GB |
We also offer two formats of our dataset to facilitate the mining challenge for participants. An overview of the two formats can be found in the table below:

| Formats | Description | Size |
| --- | --- | --- |
| Metadata | Contains only the metadata of the PTM packages and a subset of the GitHub project metadata. | 7.12GB |
| Full | Contains all metadata, plus the PTM package contents of each published version and the git history of the main branches of the GitHub projects. | 48.2TB |
Dependencies
The scripts in the project depend upon the following software:
- Python 3.11
- SQLAlchemy 2.0
How To Install
To run the scripts in this project, you must install Python 3.11 and SQLAlchemy v2.0 or greater. These packages can be installed using the anaconda environment manager:

conda env create -f environment.yml

creates the PeaTMOSS anaconda environment, and

conda activate PeaTMOSS

activates it. Alternatively, you can navigate to each package's respective page and install them individually.
How to Run
After installing the anaconda environment, each demo script can be run using
python3 script_name.py
Tutorial
This section will explain how to use SQL and SQLAlchemy to interact with the database to answer the research questions outlined in the proposal.
Using SQL to query the database
One option users have to interact with the metadata dataset is to use plain SQL. The metadata dataset is stored in a SQLite database file called PeaTMOSS.db, which can be found in the Globus Share. This file can be queried through standard SQL queries, and this can be done from a terminal using sqlite3: SQLite CLI. Single queries can be executed like
$ sqlite3 PeaTMOSS.db '{query statement}'
Alternatively, you can start an SQLite instance by simply executing

$ sqlite3 PeaTMOSS.db

which can be terminated with CTRL + D or .quit. To output query results to a file, the .output command can be used:

sqlite> .output (unknown).txt
Research Question Example (SQL)
The following example relates to research question GH2: "What do developers on GitHub discuss related to PTM use, e.g., in issues and pull requests? What are developers’ sentiments regarding PTM use? Do the people doing pull requests of PTMs have the right expertise?"
If someone wants to observe what developers on GitHub are currently discussing related to PTM usage, they can look at discussions in GitHub issues and pull requests. The following SQLite example shows queries that would help accomplish this task.
First, we will create an sqlite3 instance:

$ sqlite3 PeaTMOSS.db

Then, we will create an output file for our issues query and execute that query:

sqlite> .output issues.txt
sqlite> SELECT id, title FROM github_issue WHERE state = 'OPEN' ORDER BY updated_at DESC LIMIT 100;

Output:

The above query selects the id and title fields from the github_issue table, and chooses the 100 most recently updated issues that are still open.

Next, we will create an output file for our pull requests query and execute that query:

sqlite> .output pull_requests.txt
sqlite> SELECT id, title FROM github_pull_request WHERE state = 'OPEN' OR state = 'MERGED' ORDER BY updated_at DESC LIMIT 100;

Output:

Notice that this query is very similar to the issues query, as we are looking for similar information. It selects the id and title fields from the github_pull_request table, and chooses the 100 most recently updated pull requests that are either open or merged.
Querying this data can assist when beginning to observe current/recent discussions in GitHub about PTMs. From here, you may adjust these queries to include more/less entries by changing the LIMIT value, or you may adjust which fields the queries return. For example, if you want more detailed information you could select the "body" field in either table.
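The same two queries can also be issued programmatically. Below is a minimal sketch using Python's built-in sqlite3 module, assuming the github_issue and github_pull_request tables expose the id, title, state, and updated_at columns used above (the db_path argument is the path to PeaTMOSS.db):

```python
import sqlite3


def recent_open_issues(db_path, limit=100):
    """Return (id, title) pairs for the most recently updated open issues."""
    con = sqlite3.connect(db_path)
    try:
        cur = con.execute(
            "SELECT id, title FROM github_issue "
            "WHERE state = 'OPEN' "
            "ORDER BY updated_at DESC LIMIT ?",
            (limit,),
        )
        return cur.fetchall()
    finally:
        con.close()


def recent_open_or_merged_prs(db_path, limit=100):
    """Return (id, title) pairs for the most recently updated open or merged PRs."""
    con = sqlite3.connect(db_path)
    try:
        cur = con.execute(
            "SELECT id, title FROM github_pull_request "
            "WHERE state = 'OPEN' OR state = 'MERGED' "
            "ORDER BY updated_at DESC LIMIT ?",
            (limit,),
        )
        return cur.fetchall()
    finally:
        con.close()
```

Parameterizing the LIMIT makes it easy to widen or narrow the result set without editing the SQL itself.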
Using ORMs to query the database
This section will include more details about the demo provided in the repository, PeaTMOSS_demo.py. Once again, this method requires the PeaTMOSS.db file, which can be found in the Globus Share. Prior to running this demo, ensure that the conda environment has been created and activated, or you may run into errors.
The purpose of the demo, as described by the comment at the top of its file, is to demonstrate how one may use SQLAlchemy to address one of the research questions. The question addressed in the demo is I1: "It can be difficult to interpret model popularity numbers by download rates. To what extent does a PTM’s download rates correlate with the number of GitHub projects that rely on it, or the popularity of the GitHub projects?" The demo accomplishes this by looking at two main fields: the number of times a model is downloaded from its model hub, and the number of times a model is reused in a GitHub repository. It finds the 100 most downloaded models, then finds how many times each of those models is reused. Users can take this information and attempt to find a correlation.
Research Question Example (ORM)
PeaTMOSS_demo.py utilizes PeaTMOSS.py, which describes the structure of the database so that we may interact with it using SQLAlchemy. To begin, you must create an SQLAlchemy engine using the database file, where path is a string describing the filepath to the database file. Both relative and absolute file paths can be used. To find the 100 most downloaded models, we query the model table and execute the query.
For each of these models, we want to know how many times they are reused. The model_to_reuse_repository table contains fields for model IDs and reuse repository IDs, effectively linking them together. If a model is reused in multiple repositories, its ID will show up multiple times in the model_to_reuse_repository table. Therefore, to see whether these highly downloaded models are also highly reused, we can query the model_to_reuse_repository table and select only entries where the model_id field is equivalent to the current model's ID:
This query selects every instance where the current model's ID appears in the model_to_reuse_repository table. If we execute this query and count the number of elements in the result, we have the number of times that model has been reused:
In each iteration of the loop we can store this information in dictionaries, where the keys can be the names of the models:
And then at the end, we can simply print the results. From there, users may observe a level of correlation using a method they see fit.
Download Results:
Reuse Results: