diff --git a/lessons/01_preprocessing.ipynb b/lessons/01_preprocessing.ipynb
index de33786..54c3836 100644
--- a/lessons/01_preprocessing.ipynb
+++ b/lessons/01_preprocessing.ipynb
@@ -5,17 +5,24 @@
"id": "d3e7ea21-6437-48e8-a9e4-3bdc05f709c9",
"metadata": {},
"source": [
- "# Python Text Analysis: Preprocessing\n",
+ "# Python Text Analysis: Preprocessing\n",
"\n",
"* * * \n",
"\n",
+ "## Group 4\n",
+ "### Members\n",
+ "* Carlos Chicaiza\n",
+ "* Emilio Mayorga\n",
+ "* Juan Vizuete\n",
+ "* Jessica Llumiguano\n",
+ "\n",
"
\n",
" \n",
- "### Learning Objectives \n",
+ "### Learning Objectives\n",
" \n",
- "* Learn common steps for preprocessing text data, as well as specific operations for preprocessing Twitter data.\n",
- "* Know commonly used NLP packages and what they are capable of.\n",
- "* Understand tokenizers, and how they have changed since the advent of Large Language Models.\n",
+ "* Learn the common steps for preprocessing text data, as well as the specific operations used to preprocess Twitter data.\n",
+ "* Get to know the most widely used natural language processing (NLP) packages and their capabilities.\n",
+ "* Understand tokenizers and how they have changed since the advent of Large Language Models.\n",
"
\n",
"\n",
"### Icons Used in This Notebook\n",
@@ -25,26 +32,223 @@
"π¬ **Demo**: Showing off something more advanced β so you know what Python can be used for!
\n",
"\n",
"### Sections\n",
- "1. [Preprocessing](#section1)\n",
- "2. [Tokenization](#section2)\n",
+ "1. [Preprocessing](#section1)\n",
"\n",
- "In this three-part workshop series, we'll learn the building blocks for performing text analysis in Python. These techniques lie in the domain of Natural Language Processing (NLP). NLP is a field that deals with identifying and extracting patterns of language, primarily in written texts. Throughout the workshop series, we'll interact with various packages for performing text analysis: starting from simple string methods to specific NLP packages, such as `nltk`, `spaCy`, and more recent ones on Large Language Models (`BERT`).\n",
+ "In these three parts of the workshop, we'll learn the building blocks for performing text analysis in Python. These techniques belong to the domain of Natural Language Processing (NLP). NLP is a field focused on identifying and extracting patterns of language, primarily in written texts. Throughout the workshop, we'll interact with various packages for performing text analysis: from simple string methods to specific NLP packages, such as `nltk`, `spaCy`, and more recent Large Language Models such as `BERT`.\n",
"\n",
- "Now, let's have these packages properly installed before diving into the materials."
+ "Now, before we begin, the following packages need to be installed:"
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 1,
"id": "d442e4c7-e926-493d-a64e-516616ad915a",
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Successfully installed NLTK-3.9.1 click-8.1.8 joblib-1.4.2 regex-2024.11.6 tqdm-4.67.1\n",
+ "Successfully installed certifi-2025.1.31 charset-normalizer-3.4.1 filelock-3.18.0 fsspec-2025.3.0 huggingface-hub-0.29.3 idna-3.10 numpy-2.2.4 pyyaml-6.0.2 requests-2.32.3 safetensors-0.5.3 tokenizers-0.21.1 transformers-4.50.1 typing-extensions-4.12.2 urllib3-2.3.0\n",
+ "Successfully installed MarkupSafe-3.0.2 annotated-types-0.7.0 blis-1.2.0 catalogue-2.0.10 cloudpathlib-0.21.0 confection-0.1.5 cymem-2.0.11 jinja2-3.1.6 langcodes-3.5.0 language-data-1.3.0 marisa-trie-1.2.1 markdown-it-py-3.0.0 mdurl-0.1.2 murmurhash-1.0.12 preshed-3.0.9 pydantic-2.10.6 pydantic-core-2.27.2 rich-13.9.4 setuptools-78.1.0 shellingham-1.5.4 smart-open-7.1.0 spaCy-3.8.4 spacy-legacy-3.0.12 spacy-loggers-1.0.5 srsly-2.5.1 thinc-8.3.4 typer-0.15.2 wasabi-1.1.3 weasel-0.4.1 wrapt-1.17.2\n",
+ "Successfully installed en-core-web-sm-3.8.0\n",
+ "✔ Download and installation successful\n",
+ "You can now load the package via spacy.load('en_core_web_sm')\n",
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
"source": [
- "# Uncomment the following lines to install packages/model\n",
+ "# Install the required packages and the spaCy English model (run once)\n",
- "# %pip install NLTK\n",
- "# %pip install transformers\n",
- "# %pip install spaCy\n",
- "# !python -m spacy download en_core_web_sm"
+ "%pip install NLTK\n",
+ "%pip install transformers\n",
+ "%pip install spaCy\n",
+ "!python -m spacy download en_core_web_sm"
]
},
{
@@ -54,16 +258,22 @@
"source": [
"\n",
"\n",
- "# Preprocessing\n",
+ "# Preprocessing\n",
+ "\n",
+ "In the first part of this workshop, we'll address the first step of text analysis. Our goal will be to convert messy, raw data into a consistent format. This process is known as **preprocessing**, **text cleaning**, or **text normalization**.\n",
+ "\n",
+ "At the end of preprocessing, the data will still be in a readable format. In the second and third parts, we'll begin converting the text data into a numerical representation, a format better suited to computational processing.\n",
"\n",
- "In Part 1 of this workshop, we'll address the first step of text analysis. Our goal is to convert the raw, messy text data into a consistent format. This process is often called **preprocessing**, **text cleaning**, or **text normalization**.\n",
+ "🔔 **Question**: Take a minute to reflect on your past experiences working with text data:\n",
+ "- What is the format of the text data you have worked with (plain text, CSV, XML)?\n",
"\n",
- "You'll notice that at the end of preprocessing, our data is still in a format that we can read and understand. In Parts 2 and 3, we will begin our foray into converting the text data into a numerical representationβa format that can be more readily handled by computers. \n",
+ "We have worked with text data in CSV, XML, and TXT formats, for data cleaning, analysis, and training neural networks.\n",
+ "- Where did it come from (structured corpus, web scraping, surveys)?\n",
"\n",
- "π **Question**: Let's pause for a minute to reflect on **your** previous experiences working on text data. \n",
- "- What is the format of the text data you have interacted with (plain text, CSV, or XML)?\n",
- "- Where does it come from (structured corpus, scraped from the web, survey data)?\n",
- "- Is it messy (i.e., is the data formatted consistently)?"
+ "The data was obtained from Kaggle, since it hosts a large bank of datasets of all kinds.\n",
+ "- Was the data messy or inconsistent?\n",
+ "\n",
+ "In some cases the data was messy; in other cases we discarded inconsistent data, since we needed to move forward quickly with the project."
]
},
{
@@ -71,21 +281,21 @@
"id": "4b35911a-3b3f-4a48-a7d1-9882aab04851",
"metadata": {},
"source": [
- "## Common Processes\n",
+ "## Common Processes\n",
"\n",
- "Preprocessing is not something we can accomplish with a single line of code. We often start by familiarizing ourselves with the data, and along the way, we gain a clearer understanding of the granularity of preprocessing we want to apply.\n",
+ "Preprocessing cannot be accomplished with a single line of code. We often start by familiarizing ourselves with the data to better understand the level of granularity at which preprocessing should be applied.\n",
"\n",
- "Typically, we begin by applying a set of commonly used processes to clean the data. These operations don't substantially alter the form or meaning of the data; they serve as a standardized procedure to reshape the data into a consistent format.\n",
+ "Typically, we begin by applying a set of commonly used processes to clean the data. These operations don't substantially alter the form or meaning of the data; they serve as a standardized procedure to reshape the data into a consistent format.\n",
"\n",
- "The following processes, for examples, are commonly applied to preprocess English texts of various genres. These operations can be done using built-in Python functions, such as `string` methods, and Regular Expressions. \n",
- "- Lowercase the text\n",
- "- Remove punctuation marks\n",
- "- Remove extra whitespace characters\n",
- "- Remove stop words\n",
+ "The following processes, for example, are commonly applied to preprocess English texts of various genres. These operations can be done using built-in Python functions, such as `string` methods, and regular expressions.\n",
+ "- Lowercase the text\n",
+ "- Remove punctuation marks\n",
+ "- Remove extra whitespace characters\n",
+ "- Remove stop words\n",
"\n",
- "After the initial processing, we may choose to perform task-specific processes, the specifics of which often depend on the downstream task we want to perform and the nature of the text data (i.e., its stylistic and linguistic features). \n",
+ "After the initial processing, we may choose to perform task-specific processes; their specifics often depend on the downstream task we want to carry out and the nature of the text data (i.e., its stylistic and linguistic features).\n",
"\n",
- "Before we jump into these operations, let's take a look at our data!"
+ "Before we dive into these operations, let's take a look at our data!"
]
},
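The four common steps listed in this cell can be sketched in plain Python. This is a minimal illustration, not the notebook's own code; the tiny stop word list is an assumption for the example, since real analyses typically use the lists shipped with `nltk` or `spaCy`:

```python
import re
import string

# Toy stop word list for illustration only (assumption);
# use nltk.corpus.stopwords or spaCy's list in practice.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "is", "was"}

def preprocess(text):
    """Lowercase, strip punctuation, collapse whitespace, drop stop words."""
    text = text.lower()                                               # 1. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. punctuation
    text = re.sub(r"\s+", " ", text).strip()                          # 3. extra whitespace
    tokens = [w for w in text.split() if w not in STOP_WORDS]         # 4. stop words
    return " ".join(tokens)

print(preprocess("The  flight was   GREAT, thanks!"))  # -> flight great thanks
```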
{
@@ -93,16 +303,49 @@
"id": "ec5d7350-9a1e-4db9-b828-a87fe1676d8d",
"metadata": {},
"source": [
- "### Import the Text Data\n",
+ "### Importing the Text Data\n",
"\n",
- "The text data we'll be working with is a CSV file. It contains tweets about U.S. airlines, scrapped from Feb 2015. \n",
+ "We'll be working with a CSV file. It contains tweets about U.S. airlines, collected in February 2015.\n",
"\n",
- "Let's read the file `airline_tweets.csv` into dataframe with `pandas`."
+ "Let's read the file `airline_tweets.csv` into a `pandas` dataframe."
]
},
{
"cell_type": "code",
- "execution_count": 1,
+ "execution_count": 3,
+ "id": "6bda2022",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Successfully installed pandas-2.2.3 pytz-2025.2 tzdata-2025.2\n",
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
+ "source": [
+ "%pip install pandas"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
"id": "3d1ff64b-53ad-4eca-b846-3fda20085c43",
"metadata": {},
"outputs": [],
@@ -119,7 +362,7 @@
},
{
"cell_type": "code",
- "execution_count": 2,
+ "execution_count": 5,
"id": "e397ac6a-c2ba-4cce-8700-b36b38026c9d",
"metadata": {},
"outputs": [
@@ -293,7 +536,7 @@
"4 2015-02-24 11:14:45 -0800 NaN Pacific Time (US & Canada) "
]
},
- "execution_count": 2,
+ "execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
@@ -308,13 +551,13 @@
"id": "ae3b339f-45cf-465d-931c-05f9096fd510",
"metadata": {},
"source": [
- "The dataframe has one row per tweet. The text of tweet is shown in the `text` column.\n",
- "- `text` (`str`): the text of the tweet.\n",
+ "The dataframe has one row per tweet. The text of the tweet is shown in the `text` column.\n",
+ "- `text` (`str`): the text of the tweet.\n",
"\n",
- "Other metadata we are interested in include: \n",
- "- `airline_sentiment` (`str`): the sentiment of the tweet, labeled as \"neutral,\" \"positive,\" or \"negative.\"\n",
- "- `airline` (`str`): the airline that is tweeted about.\n",
- "- `retweet count` (`int`): how many times the tweet was retweeted."
+ "Other relevant metadata we are interested in includes:\n",
+ "- `airline_sentiment` (`str`): the sentiment of the tweet, labeled as \"neutral\", \"positive\", or \"negative\".\n",
+ "- `airline` (`str`): the airline being tweeted about.\n",
+ "- `retweet count` (`int`): the number of times the tweet was retweeted."
]
},
{
@@ -322,12 +565,12 @@
"id": "302c695b-4bd1-4151-9cb9-ef5253eb16df",
"metadata": {},
"source": [
- "Let's take a look at some of the tweets:"
+ "Let's take a look at some of the tweets:"
]
},
{
"cell_type": "code",
- "execution_count": 3,
+ "execution_count": 7,
"id": "b690daab-7be5-4b8f-8af0-a91fdec4ec4f",
"metadata": {},
"outputs": [
@@ -352,7 +595,9 @@
"id": "8adc05fa-ad30-4402-ab56-086bcb09a166",
"metadata": {},
"source": [
- "π **Question**: What have you noticed? What are the stylistic features of tweets?"
+ "🔔 **Question**: What have you noticed? What are the stylistic features of tweets?\n",
+ "\n",
+ "The tweets are informal and direct about the airlines' services; through them, the sentiment of the users can be identified."
]
},
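The stylistic features noted above (mentions, hashtags, links) invite Twitter-specific cleanup steps. A hedged sketch using Python's `re` module; the exact rules are a judgment call that depends on the downstream task, and this is not the notebook's own code:

```python
import re

def clean_tweet(tweet):
    """Remove Twitter-specific markup: @mentions and URLs are dropped,
    and the '#' is stripped from hashtags while keeping the word."""
    tweet = re.sub(r"@\w+", "", tweet)               # drop @mentions
    tweet = re.sub(r"http\S+|www\.\S+", "", tweet)   # drop URLs
    tweet = tweet.replace("#", "")                   # keep hashtag words
    return re.sub(r"\s+", " ", tweet).strip()        # tidy whitespace

print(clean_tweet("@united my flight was delayed #frustrated http://t.co/abc"))
# -> my flight was delayed frustrated
```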
{
@@ -360,20 +605,20 @@
"id": "c3460393-00a6-461c-b02a-9e98f9b5d1af",
"metadata": {},
"source": [
- "### Lowercasing\n",
+ "### Lowercasing\n",
"\n",
- "While we acknowledge that a word's casing is informative, we often don't work in contexts where we can properly utilize this information.\n",
+ "While we acknowledge that a word's casing carries information, we often don't work in contexts where we can properly take advantage of it.\n",
"\n",
- "More often, the subsequent analysis we perform is **case-insensitive**. For instance, in frequency analysis, we want to account for various forms of the same word. Lowercasing the text data aids in this process and simplifies our analysis.\n",
+ "More often, the subsequent analysis we perform is case-insensitive. For instance, in frequency analysis, we usually want to account for the various forms of the same word. Lowercasing the text data aids this process and simplifies our analysis.\n",
"\n",
- "We can easily achieve lowercasing with the string method [`.lower()`](https://docs.python.org/3/library/stdtypes.html#str.lower); see [documentation](https://docs.python.org/3/library/stdtypes.html#string-methods) for more useful functions.\n",
+ "We can easily achieve lowercasing with the string method [`.lower()`](https://docs.python.org/3/library/stdtypes.html#str.lower); see the [documentation](https://docs.python.org/3/library/stdtypes.html#string-methods) for more useful functions.\n",
"\n",
- "Let's apply it to the following example:"
+ "Let's apply it to the following example:"
]
},
{
"cell_type": "code",
- "execution_count": 4,
+ "execution_count": 8,
"id": "58a95d90-3ef1-4bff-9cfe-d447ed99f252",
"metadata": {},
"outputs": [
@@ -393,7 +638,7 @@
},
{
"cell_type": "code",
- "execution_count": 5,
+ "execution_count": 9,
"id": "c66d91c0-6eed-4591-95fc-cd2eae2e0d41",
"metadata": {},
"outputs": [
@@ -2151,7 +2396,7 @@
],
"metadata": {
"kernelspec": {
- "display_name": "Python 3 (ipykernel)",
+ "display_name": ".venv",
"language": "python",
"name": "python3"
},
@@ -2165,7 +2410,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.11.4"
+ "version": "3.12.1"
}
},
"nbformat": 4,