@BradyBonnette
Contributor

@BradyBonnette BradyBonnette commented Jan 26, 2025

This adds Microsoft's Deberta V2 and V3 models to the candle ecosystem. It also includes an example file demonstrating how to use them, as well as a README with more information.

At the time of this commit, the model can only do Named Entity Recognition and Text Classification. There are other modes such as Question Answering, Multiple Choice and Masked Input that could still be developed for this at a later point in time.

pub type Label2Id = HashMap<String, u32>;

#[derive(Debug, Clone, PartialEq, Deserialize)]
pub struct Config {

This looks a lot like the normal BERT configuration, but in going through the Python Transformers code I realized there were a lot of other configuration tidbits that didn't exist in other BERT models. So, I just started over with a new one just for Deberta.

@BradyBonnette
Contributor Author

BradyBonnette commented Jan 26, 2025

Using a simple Python example, such as:

from transformers import pipeline
pipe = pipeline("token-classification", model="Clinical-AI-Apollo/Medical-NER")
result = pipe('45 year old woman diagnosed with CAD')
print(f"{result}")

This produces the following results:

[{'entity': 'B-AGE', 'score': 0.44746688, 'index': 1, 'word': '▁45', 'start': 0, 'end': 2},
 {'entity': 'I-AGE', 'score': 0.5699233, 'index': 2, 'word': '▁year', 'start': 2, 'end': 7},
 {'entity': 'I-AGE', 'score': 0.6126752, 'index': 3, 'word': '▁old', 'start': 7, 'end': 11},
 {'entity': 'B-SEX', 'score': 0.40775478, 'index': 4, 'word': '▁woman', 'start': 11, 'end': 17},
 {'entity': 'B-DISEASE_DISORDER', 'score': 0.34644428, 'index': 7, 'word': '▁CAD', 'start': 32, 'end': 36}]

For comparison, running the example.rs file provided by this PR:

cargo run \
  --example debertav2 \
  --release \
  --features=cuda -- \
  --model-id=Clinical-AI-Apollo/Medical-NER \
  --revision=main \
  --sentence='45 year old woman diagnosed with CAD'

results in:

[[NERItem { entity: "B-AGE", word: "▁45", score: 0.44766188, start: 0, end: 2, index: 1 },
 NERItem { entity: "I-AGE", word: "▁year", score: 0.5694605, start: 2, end: 7, index: 2 },
 NERItem { entity: "I-AGE", word: "▁old", score: 0.6122175, start: 7, end: 11, index: 3 },
 NERItem { entity: "B-SEX", word: "▁woman", score: 0.40786597, start: 11, end: 17, index: 4 },
 NERItem { entity: "B-DISEASE_DISORDER", word: "▁CAD", score: 0.34624502, start: 32, end: 36, index: 7 }]]

There's a tiny precision difference between the Python and Rust versions (the scores differ by less than 0.0005), but from my understanding it's so insignificant that it affects neither accuracy nor performance.
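To put a number on that difference, here is a minimal sketch that compares the five scores from the Python output with the five from the Rust output. The helper `max_abs_diff` and the `TOLERANCE` threshold are illustrative names and values, not part of the PR.

```rust
/// Largest absolute difference between two equally long score slices.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "score lists must have the same length");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0, f32::max)
}

// Scores copied from the Python and Rust outputs quoted in this comment, in token order.
const PYTHON_SCORES: [f32; 5] = [0.44746688, 0.5699233, 0.6126752, 0.40775478, 0.34644428];
const RUST_SCORES: [f32; 5] = [0.44766188, 0.5694605, 0.6122175, 0.40786597, 0.34624502];

// Illustrative threshold only; the PR does not define a tolerance.
const TOLERANCE: f32 = 1e-3;

fn main() {
    let diff = max_abs_diff(&PYTHON_SCORES, &RUST_SCORES);
    println!("max abs diff = {:e}", diff);
    assert!(diff < TOLERANCE, "scores diverge by more than the tolerance");
}
```

The largest gap is on the second token (0.5699233 vs 0.5694605), and every predicted entity label is identical between the two runs.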

@BradyBonnette BradyBonnette marked this pull request as ready for review January 27, 2025 14:26
Collaborator

@LaurentMazare LaurentMazare left a comment

Looks pretty good. I've put some mostly cosmetic comments in line, would be great if you can apply them to the whole file as I didn't bother repeating them. Mostly we should avoid everything that can panic, i.e. make actual errors rather than unwrap, use bail! to shorten error generation, etc.
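As a rough illustration of that pattern (not code from the PR), the sketch below turns a lookup that could panic into one that returns an error. The `bail!` macro here is a local stand-in written for this example so that it compiles with the standard library alone; crates such as candle and anyhow ship their own `bail!` with their own error types. The function `label_for` and its `String` error type are hypothetical.

```rust
use std::collections::HashMap;

/// Local stand-in for a `bail!`-style macro: format a message and return early with `Err`.
macro_rules! bail {
    ($($arg:tt)*) => {
        return Err(format!($($arg)*))
    };
}

/// Hypothetical lookup: map a predicted class id to its label.
fn label_for(id2label: &HashMap<u32, String>, id: u32) -> Result<&str, String> {
    // Panicking version to avoid: id2label.get(&id).unwrap()
    match id2label.get(&id) {
        Some(label) => Ok(label.as_str()),
        None => bail!("no label for class id {}", id),
    }
}

fn main() {
    let id2label = HashMap::from([(0, "O".to_string()), (1, "B-AGE".to_string())]);
    // A known id yields Ok, an unknown id yields Err instead of a panic.
    println!("{:?}", label_for(&id2label, 1));
    println!("{:?}", label_for(&id2label, 7));
}
```

The caller decides what to do with the `Err`, which is the point of the review comment: a bad config or unexpected class id becomes a reportable error rather than an abort.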

@BradyBonnette
Contributor Author

No problem, I can do that later today. Thanks for reviewing it, and if you see any other things that should change please let me know.

@BradyBonnette
Contributor Author

Ok I pushed up some updates. Feel free to review it at your leisure and let me know about anything else you find!

@LaurentMazare
Collaborator

Thanks for the PR! I've made some small tweaks to avoid some uses of foo.unwrap_or(bar): in that form the value bar is evaluated even if foo is Some(_) (unwrap_or_else gets around this). In particular, some of the bar expressions contained an unwrap that could likely have failed.
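The eager-versus-lazy behaviour is easy to check with the standard library alone. In this small sketch, `fallback` and `count_fallback_calls` are hypothetical helpers that count how often the default expression is evaluated; they are not taken from the PR.

```rust
use std::cell::Cell;

/// Hypothetical fallback that records each time it is evaluated.
fn fallback(calls: &Cell<u32>) -> usize {
    calls.set(calls.get() + 1);
    0
}

/// Returns how many times the fallback ran for the eager and the lazy form.
fn count_fallback_calls(foo: Option<usize>) -> (u32, u32) {
    let eager = Cell::new(0);
    let lazy = Cell::new(0);
    // Eager: the argument expression is evaluated before `unwrap_or` is called.
    let _ = foo.unwrap_or(fallback(&eager));
    // Lazy: the closure runs only when `foo` is `None`.
    let _ = foo.unwrap_or_else(|| fallback(&lazy));
    (eager.get(), lazy.get())
}

fn main() {
    println!("{:?}", count_fallback_calls(Some(3))); // prints (1, 0)
    println!("{:?}", count_fallback_calls(None)); // prints (1, 1)
}
```

With `Some(3)` the eager form still evaluates the fallback once, so if that expression contained an `unwrap` on a missing value it would panic even though the default is never used.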

@LaurentMazare LaurentMazare merged commit 4301753 into huggingface:main Jan 29, 2025
10 checks passed
@BradyBonnette
Contributor Author

@LaurentMazare Thanks! I appreciate the merge. This was a fairly complicated model, but it seems like there could be use for it in other projects.

@Stealthwriter

Deberta v3 large is the best model out there for text classification; it's top-rated in Kaggle competitions.
