Import model checkpoints from public URLs (CCRI-POPROX#106)
This updates our model checkpoint `.dvc` files to reference public(-ish)
URLs to allow external researchers to more easily obtain the data to use
`poprox-recommender`.

- HuggingFace models are directly imported from HuggingFace.
- Our model checkpoints are now imported from an S3 bucket for public
checkpoints (which is not yet actually publicly accessible, but we will
fix that).
- MIND data must still be obtained from Microsoft and manually added.

It also removes the unused `word2int.tsv` and `ckpt-30000.pth` files.

Eventually, we can update the model checkpoints to import from the
training DVC repository.
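
For readers who obtain a file outside DVC (for example the manually added MIND data), the following is a minimal sketch of checking a local copy against the `md5` and `size` values a `.dvc` file records under `outs`. The helper name `matches_dvc_record` and the chunk size are illustrative assumptions, not part of `poprox-recommender` or DVC. The top-level `md5:` field in a `.dvc` file is assumed to describe the `.dvc` file itself, so the value to pass is the one inside the `outs` entry.

```python
import hashlib
from pathlib import Path


def matches_dvc_record(path: Path, md5: str, size: int, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the given size and MD5 digest.

    Hypothetical helper: `md5` and `size` are copied by hand from the
    `outs` entry of the corresponding `.dvc` file.
    """
    # Cheap check first: a size mismatch means the file cannot match.
    if path.stat().st_size != size:
        return False
    digest = hashlib.md5()
    with path.open("rb") as f:
        # Hash in chunks so large checkpoints are not read into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == md5.lower()
```

Checking the size before hashing avoids reading a multi-hundred-megabyte checkpoint when the download is obviously truncated.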
mdekstrand authored Oct 2, 2024
1 parent 167123a commit 114bc34
Showing 9 changed files with 49 additions and 24 deletions.
12 changes: 0 additions & 12 deletions models/ckpt-30000.pth.dvc

This file was deleted.

7 changes: 7 additions & 0 deletions models/distilbert-base-uncased/model.safetensors.dvc
@@ -3,3 +3,10 @@ outs:
   size: 267954768
   hash: md5
   path: model.safetensors
+md5: dd731a6d839cbfdd52ba4e065b61f265
+frozen: true
+deps:
+- path: model.safetensors
+  repo:
+    url: https://huggingface.co/distilbert/distilbert-base-uncased
+    rev_lock: 12040accade4e8a0f71eabdb258fecc2e7e948be
7 changes: 7 additions & 0 deletions models/distilbert-base-uncased/tokenizer.json.dvc
@@ -3,3 +3,10 @@ outs:
   size: 466062
   hash: md5
   path: tokenizer.json
+md5: d2bcada8b5513c9210ba8928bafd2124
+frozen: true
+deps:
+- path: tokenizer.json
+  repo:
+    url: https://huggingface.co/distilbert/distilbert-base-uncased
+    rev_lock: 12040accade4e8a0f71eabdb258fecc2e7e948be
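
The `url`, `rev_lock`, and `path` fields in these HuggingFace imports pin one file at one revision. As a rough sketch, they can be combined into a direct-download URL, assuming the Hub's `resolve/<revision>/<filename>` URL layout. The function name `hf_resolve_url` is hypothetical and this is not how DVC itself fetches the file.

```python
from urllib.parse import quote


def hf_resolve_url(repo_url: str, rev: str, path: str) -> str:
    """Build a direct-download URL for one file at a pinned revision.

    Assumes the HuggingFace Hub layout <repo>/resolve/<revision>/<filename>.
    """
    # Percent-encode the file path but keep "/" so nested paths still work.
    return f"{repo_url.rstrip('/')}/resolve/{rev}/{quote(path)}"


url = hf_resolve_url(
    "https://huggingface.co/distilbert/distilbert-base-uncased",
    "12040accade4e8a0f71eabdb258fecc2e7e948be",
    "tokenizer.json",
)
print(url)
```

Pinning the full commit hash rather than a branch name means the URL keeps pointing at the same bytes even if the upstream repository is updated.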
7 changes: 7 additions & 0 deletions models/model.safetensors.dvc
@@ -3,3 +3,10 @@ outs:
   size: 142805680
   hash: md5
   path: model.safetensors
+md5: 7e0fb2f50423e59a2a62a4d660297f80
+frozen: true
+deps:
+- etag: 851304fad87aa595b26edf79b1a606e0-18
+  size: 142805680
+  hash: md5
+  path: s3://poprox-public-models/nrms-baseline/model.safetensors
Original file line number Diff line number Diff line change
@@ -3,3 +3,10 @@ outs:
   size: 438081688
   hash: md5
   path: model.safetensors
+md5: b7c02937f341f2e16506b29425e280be
+frozen: true
+deps:
+- path: model.safetensors
+  repo:
+    url: https://huggingface.co/dima806/news-category-classifier-distilbert
+    rev_lock: e6d8809b863f80b7cd2bbaf241112b0b926e0da8
7 changes: 7 additions & 0 deletions models/news-category-classifier-distilbert/vocab.txt.dvc
@@ -3,3 +3,10 @@ outs:
   size: 231508
   hash: md5
   path: vocab.txt
+md5: 721f828a6f53b9b76a86fbdcbdea2617
+frozen: true
+deps:
+- path: vocab.txt
+  repo:
+    url: https://huggingface.co/dima806/news-category-classifier-distilbert
+    rev_lock: e6d8809b863f80b7cd2bbaf241112b0b926e0da8
7 changes: 7 additions & 0 deletions models/news_encoder.safetensors.dvc
@@ -3,3 +3,10 @@ outs:
   size: 275529184
   hash: md5
   path: news_encoder.safetensors
+md5: 0c2738f5293159e8e95013eb32c8024a
+frozen: true
+deps:
+- etag: 3de466cec52fde652d1642f55d3bb2f5-33
+  size: 275529184
+  hash: md5
+  path: s3://poprox-public-models/nrms-baseline/news_encoder.safetensors
7 changes: 7 additions & 0 deletions models/user_encoder.safetensors.dvc
@@ -3,3 +3,10 @@ outs:
   size: 10066184
   hash: md5
   path: user_encoder.safetensors
+md5: cab8a0941111aa717a2a69011851d401
+frozen: true
+deps:
+- etag: c8dcb260de4589573a8aa693466fff79-2
+  size: 10066184
+  hash: md5
+  path: s3://poprox-public-models/nrms-baseline/user_encoder.safetensors
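
The `etag` values on the three S3 imports end in `-18`, `-33`, and `-2`, which looks like the commonly observed S3 multipart form: the MD5 of the concatenated per-part MD5 digests, followed by the part count. The sketch below rests on that convention and on an 8 MiB part size. Both are assumptions rather than anything stated in this commit, and the function names are hypothetical. Under those assumptions the recorded sizes give the same part counts as the suffixes.

```python
import hashlib

PART_SIZE = 8 * 1024 * 1024  # assumed 8 MiB upload part size


def expected_part_count(size: int, part_size: int = PART_SIZE) -> int:
    """Number of parts a multipart upload of `size` bytes would use."""
    # Integer ceiling division; an empty object still counts as one part.
    return max(1, -(-size // part_size))


def multipart_etag(data: bytes, part_size: int = PART_SIZE) -> str:
    """Multipart-style ETag: MD5 over the per-part MD5 digests, plus "-<parts>"."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)] or [b""]
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"


# Sizes copied from the three S3-backed .dvc files in this commit.
for name, size in [
    ("model.safetensors", 142805680),
    ("news_encoder.safetensors", 275529184),
    ("user_encoder.safetensors", 10066184),
]:
    print(name, expected_part_count(size))
```

An ETag of this form is not a plain MD5 of the file, so it cannot be compared directly with the `md5` recorded under `outs`.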
12 changes: 0 additions & 12 deletions models/word2int.tsv.dvc

This file was deleted.
