🚨 Support updating template processors #1652

Merged
merged 22 commits into from
Jan 28, 2025
Commits
394eff4
current updates
ArthurZucker Oct 14, 2024
8b55e53
simplify
ArthurZucker Oct 14, 2024
59f6c60
set_item works, but `tokenizer._tokenizer.post_processor[1].single = …
ArthurZucker Oct 14, 2024
71d68ec
fix: `normalizers` deserialization and other refactoring
McPatate Jan 9, 2025
7c58995
fix: `pre_tokenizer` deserialization
McPatate Jan 10, 2025
d69cba8
feat: add `__len__` implementation for `normalizer::PySequence`
McPatate Jan 13, 2025
dfbbc52
feat: add `__setitem__` impl for `normalizers::PySequence`
McPatate Jan 13, 2025
1a31fc9
feat: add `__setitem__` impl to `pre_tokenizer::PySequence`
McPatate Jan 14, 2025
4bb595b
feat: add `__setitem__` impl to `post_processor::PySequence`
McPatate Jan 14, 2025
9e22b9e
test: add normalizer sequence setter check
McPatate Jan 15, 2025
adfaace
refactor: allow unused `processors::setter` macro
McPatate Jan 15, 2025
519b009
test: add `__setitem__` test for processors & pretok
McPatate Jan 16, 2025
3100401
refactor: `unwrap` -> `PyException::new_err()?`
McPatate Jan 16, 2025
2c6f83f
refactor: fmt
McPatate Jan 16, 2025
4aeee0b
refactor: remove unnecessary `pub`
McPatate Jan 16, 2025
d81c107
feat(bindings): add missing getters & setters for pretoks
McPatate Jan 20, 2025
547338b
feat(bindings): add missing getters & setters for processors
McPatate Jan 21, 2025
47eb857
refactor(bindings): rewrite RwLock poison error msg
McPatate Jan 21, 2025
b64e390
refactor: remove debug print
McPatate Jan 21, 2025
c541463
feat(bindings): add description as to why custom deser is needed
McPatate Jan 21, 2025
ff80e9f
feat: make post proc sequence elements mutable
McPatate Jan 27, 2025
3037074
fix(binding): serialization
McPatate Jan 28, 2025
fix: pre_tokenizer deserialization
McPatate committed Jan 14, 2025
commit 7c58995f3f6c914004a23418b97dfea8650de788
26 changes: 23 additions & 3 deletions bindings/python/src/pre_tokenizers.rs
@@ -704,13 +704,23 @@ impl Serialize for PyPreTokenizerWrapper {
     }
 }
 
-#[derive(Clone, Deserialize)]
-#[serde(untagged)]
+#[derive(Clone)]
 pub(crate) enum PyPreTokenizerTypeWrapper {
     Sequence(Vec<Arc<RwLock<PyPreTokenizerWrapper>>>),
     Single(Arc<RwLock<PyPreTokenizerWrapper>>),
 }
 
+impl<'de> Deserialize<'de> for PyPreTokenizerTypeWrapper {
+    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
+    where
+        D: Deserializer<'de>,
+    {
+        let wrapper = PreTokenizerWrapper::deserialize(deserializer)?;
+        let py_wrapper: PyPreTokenizerWrapper = wrapper.into();
+        Ok(py_wrapper.into())
+    }
+}
+
 impl Serialize for PyPreTokenizerTypeWrapper {
     fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
     where
@@ -742,7 +752,17 @@ where
     I: Into<PyPreTokenizerWrapper>,
 {
     fn from(pretok: I) -> Self {
-        PyPreTokenizerTypeWrapper::Single(Arc::new(RwLock::new(pretok.into())))
+        let pretok = pretok.into();
+        match pretok {
+            PyPreTokenizerWrapper::Wrapped(PreTokenizerWrapper::Sequence(seq)) => {
+                PyPreTokenizerTypeWrapper::Sequence(
+                    seq.into_iter()
+                        .map(|e| Arc::new(RwLock::new(PyPreTokenizerWrapper::Wrapped(e.clone()))))
+                        .collect(),
+                )
+            }
+            _ => PyPreTokenizerTypeWrapper::Single(Arc::new(RwLock::new(pretok))),
+        }
     }
 }
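
Note: the new `From` impl flattens a wrapped sequence into one `Arc<RwLock<_>>` per element instead of wrapping the whole sequence in a single lock. The standalone sketch below illustrates that shape using only `std`. The types `PreTok`, `Seq`, and `TypeWrapper` are hypothetical, simplified stand-ins for `PreTokenizerWrapper`, `Sequence`, and `PyPreTokenizerTypeWrapper`; they are not the crate's actual definitions, and serde is left out entirely.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for `PreTokenizerWrapper` (assumption, not the real enum).
#[derive(Clone, Debug, PartialEq)]
enum PreTok {
    Whitespace,
    Digits,
    Sequence(Seq),
}

// Hypothetical stand-in for the core `Sequence` pre-tokenizer.
#[derive(Clone, Debug, PartialEq)]
struct Seq {
    pretokenizers: Vec<PreTok>,
}

// Consuming iteration, similar in spirit to the `IntoIterator` impl in `sequence.rs`.
impl IntoIterator for Seq {
    type Item = PreTok;
    type IntoIter = std::vec::IntoIter<PreTok>;

    fn into_iter(self) -> Self::IntoIter {
        self.pretokenizers.into_iter()
    }
}

// Hypothetical stand-in for `PyPreTokenizerTypeWrapper`.
enum TypeWrapper {
    Sequence(Vec<Arc<RwLock<PreTok>>>),
    Single(Arc<RwLock<PreTok>>),
}

impl From<PreTok> for TypeWrapper {
    fn from(pretok: PreTok) -> Self {
        match pretok {
            // A sequence is split so that every element gets its own lock.
            PreTok::Sequence(seq) => TypeWrapper::Sequence(
                seq.into_iter().map(|e| Arc::new(RwLock::new(e))).collect(),
            ),
            // Anything else stays a single shared element.
            other => TypeWrapper::Single(Arc::new(RwLock::new(other))),
        }
    }
}

fn main() {
    let seq = PreTok::Sequence(Seq {
        pretokenizers: vec![PreTok::Whitespace, PreTok::Digits],
    });

    match TypeWrapper::from(seq) {
        TypeWrapper::Sequence(items) => {
            // Replace one element in place through its own lock.
            *items[1].write().unwrap() = PreTok::Whitespace;
            assert_eq!(*items[1].read().unwrap(), PreTok::Whitespace);
            println!("sequence with {} independently locked elements", items.len());
        }
        TypeWrapper::Single(item) => {
            println!("single element: {:?}", item.read().unwrap());
        }
    }
}
```

Per-element locks are presumably what lets a `__setitem__`-style setter swap one entry without rebuilding the whole sequence; that reading is inferred from the commit titles in this PR, not stated in the diff itself.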

17 changes: 15 additions & 2 deletions tokenizers/src/pre_tokenizers/sequence.rs
@@ -13,16 +13,29 @@ impl Sequence {
     pub fn new(pretokenizers: Vec<PreTokenizerWrapper>) -> Self {
         Self { pretokenizers }
     }
+}
 
-    pub fn get_pre_tokenizers(&self) -> &[PreTokenizerWrapper] {
+impl AsRef<[PreTokenizerWrapper]> for Sequence {
+    fn as_ref(&self) -> &[PreTokenizerWrapper] {
         &self.pretokenizers
     }
+}
 
-    pub fn get_pre_tokenizers_mut(&mut self) -> &mut [PreTokenizerWrapper] {
+impl AsMut<[PreTokenizerWrapper]> for Sequence {
+    fn as_mut(&mut self) -> &mut [PreTokenizerWrapper] {
         &mut self.pretokenizers
     }
 }
 
+impl IntoIterator for Sequence {
+    type Item = PreTokenizerWrapper;
+    type IntoIter = std::vec::IntoIter<Self::Item>;
+
+    fn into_iter(self) -> Self::IntoIter {
+        self.pretokenizers.into_iter()
+    }
+}
+
 impl PreTokenizer for Sequence {
     fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()> {
         for pretokenizer in &self.pretokenizers {