Skip to content

Commit

Permalink
v0.0.4 release for Malayalam normalization
Browse files Browse the repository at this point in the history
  • Loading branch information
kurianbenoy committed Feb 3, 2024
1 parent 0ded881 commit 2d46103
Show file tree
Hide file tree
Showing 9 changed files with 98 additions and 84 deletions.
9 changes: 7 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,4 @@
whisper_normalizer
================
# whisper_normalizer

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

Expand Down Expand Up @@ -37,6 +36,12 @@ as normalization step when evaluating competitive models like
[AssemblyAI Conformer-1
model](https://www.assemblyai.com/blog/conformer-1/).

## Models evaluated using Whisper normalization

- Massively Multilingual Speech (MMS) models by Meta
- Conformer 1 by AssemblyAI
- Conformer 2 by AssemblyAI

## How to use

OpenAI open source approach of text normalization/standardization is
Expand Down
40 changes: 14 additions & 26 deletions nbs/00_basic.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
Expand All @@ -38,7 +38,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
Expand All @@ -48,7 +48,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
Expand Down Expand Up @@ -109,7 +109,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
Expand Down Expand Up @@ -150,7 +150,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"metadata": {},
"outputs": [
{
Expand Down Expand Up @@ -198,7 +198,7 @@
"Note: It's not recommended to use this function for non-english languages because it removes vowels in languages like Malayalam which was identified by [kavya in this tweet](https://twitter.com/kavya_manohar/status/1752574864618365059)."
]
},
"execution_count": 5,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
Expand All @@ -217,7 +217,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"metadata": {},
"outputs": [
{
Expand All @@ -226,7 +226,7 @@
"'എന റ കമ പ യ ട ടറ ന എന റ ഭ ഷ'"
]
},
"execution_count": 6,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
Expand All @@ -238,7 +238,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"metadata": {},
"outputs": [
{
Expand All @@ -247,7 +247,7 @@
"'language is like a map that we use to navigate the world but it s also like a prison that keeps us from seeing what s beyond the walls but what if there was a way to break out of this prison to expand our map to explore new worlds with new words this is the possibility and the challenge offered by instruction tuned language models like gpt 4 a cutting edge technology that uses artificial neural networks to generate natural language texts based on user inputs gpt 4 can write anything from essays to novels to poems to tweets to code to recipes to jokes to lyrics to whatever you want it can even write things that don t exist yet things that no human has ever thought of or said before as wittgenstein s quote suggests language is a source of limitation and liberation gpt 4 pushes this idea to the extreme by giving us access to unlimited language this could be the most significant new technology in modern history because it has the potential to change many domains and industries from education to entertainment from journalism to justice from science to art these models could enable new forms of learning storytelling reporting reasoning discovery and creation they could also create new ethical social and cultural challenges that require careful reflection and regulation how we use this technology will depend on how we recognize its implications for ourselves and others this technology is a form of artificial intelligence the word intelligence derives from inter and legere to be intelligent then is to be able to choose between things to pick out what matters to read what is written intelligence is not just a quantity or a quality it is an activity a process a practice it is something that we do with our minds and our words but when we let gpt 4 do this for us are we not abdicating our intelligence are we not letting go of our ability to choose to pick out to read are we not becoming passive consumers of language instead of active producers '"
]
},
"execution_count": 7,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
Expand All @@ -274,7 +274,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"metadata": {},
"outputs": [
{
Expand All @@ -283,7 +283,7 @@
"'language is like a map that we use to navigate the world but it s also like a prison that keeps us from seeing what s beyond the walls but what if there was a way to break out of this prison to expand our map to explore new worlds with new words this is the possibility and the challenge offered by instruction tuned language models like gpt 4 a cutting edge technology that uses artificial neural networks to generate natural language texts based on user inputs gpt 4 can write anything from essays to novels to poems to tweets to code to recipes to jokes to lyrics to whatever you want it can even write things that don t exist yet things that no human has ever thought of or said before as wittgenstein s quote suggests language is a source of limitation and liberation gpt 4 pushes this idea to the extreme by giving us access to unlimited language this could be the most significant new technology in modern history because it has the potential to change many domains and industries from education to entertainment from journalism to justice from science to art these models could enable new forms of learning storytelling reporting reasoning discovery and creation they could also create new ethical social and cultural challenges that require careful reflection and regulation how we use this technology will depend on how we recognize its implications for ourselves and others this technology is a form of artificial intelligence the word intelligence derives from inter and legere to be intelligent then is to be able to choose between things to pick out what matters to read what is written intelligence is not just a quantity or a quality it is an activity a process a practice it is something that we do with our minds and our words but when we let gpt 4 do this for us are we not abdicating our intelligence are we not letting go of our ability to choose to pick out to read are we not becoming passive consumers of language instead of active producers '"
]
},
"execution_count": 8,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
Expand Down Expand Up @@ -312,7 +312,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
Expand All @@ -330,21 +330,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "python3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
Expand Down
42 changes: 15 additions & 27 deletions nbs/01_english.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
Expand All @@ -46,7 +46,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
Expand All @@ -56,7 +56,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
Expand Down Expand Up @@ -515,7 +515,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
Expand All @@ -538,7 +538,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"metadata": {},
"outputs": [
{
Expand All @@ -547,7 +547,7 @@
"'accessorize'"
]
},
"execution_count": 5,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
Expand All @@ -559,7 +559,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
Expand Down Expand Up @@ -671,7 +671,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"metadata": {},
"outputs": [
{
Expand Down Expand Up @@ -729,7 +729,7 @@
"12. Replace any successive whitespace characters with a space."
]
},
"execution_count": 7,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
Expand All @@ -748,7 +748,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"metadata": {},
"outputs": [
{
Expand All @@ -757,7 +757,7 @@
"'i am a little teapot short and stout tip me over and pour me out'"
]
},
"execution_count": 8,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
Expand All @@ -769,7 +769,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": null,
"metadata": {},
"outputs": [
{
Expand All @@ -778,7 +778,7 @@
"'language is like a map that we use to navigate the world but it s also like a prison that keeps us from seeing what s beyond the walls but what if there was a way to break out of this prison to expand our map to explore new worlds with new words this is the possibility and the challenge offered by instruction tuned language models like gpt 4 a cutting edge technology that uses artificial neural networks to generate natural language texts based on user inputs gpt 4 can write anything from essays to novels to poems to tweets to code to recipes to jokes to lyrics to whatever you want it can even write things that don t exist yet things that no human has ever thought of or said before as wittgenstein s quote suggests language is a source of limitation and liberation gpt 4 pushes this idea to the extreme by giving us access to unlimited language this could be the most significant new technology in modern history because it has the potential to change many domains and industries from education to entertainment from journalism to justice from science to art these models could enable new forms of learning storytelling reporting reasoning discovery and creation they could also create new ethical social and cultural challenges that require careful reflection and regulation how we use this technology will depend on how we recognize its implications for ourselves and others this technology is a form of artificial intelligence the word intelligence derives from inter and legere to be intelligent then is to be able to choose between things to pick out what matters to read what is written intelligence is not just a quantity or a quality it is an activity a process a practice it is something that we do with our minds and our words but when we let gpt 4 do this for us are we not abdicating our intelligence are we not letting go of our ability to choose to pick out to read are we not becoming passive consumers of language instead of active producers'"
]
},
"execution_count": 9,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
Expand All @@ -805,7 +805,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
Expand All @@ -823,21 +823,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "python3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
Expand Down
Loading

0 comments on commit 2d46103

Please sign in to comment.