caue-paiva/README.md

Hello! My name is Cauê and I am a computer science student at USP.

  • 🔭 Currently, my main interest is developing data-related projects, such as ETL (Extract, Transform and Load) pipelines on the cloud, web scraping with Python, and Data Warehouses for analytics.
  • 🌱 I am learning how to use various technologies, including Python, Pandas, Airflow, AWS, SQL, Postgres, Selenium and LangChain.
  • 📫 You can reach me by email at cauepaivalira@outlook.com.

Caue's GitHub stats

My projects

Projects I develop as part of a São Paulo State Research Foundation (FAPESP) R&D grant program

Data Warehouse and automatic ETL pipeline for extracting and analyzing public Brazilian government data

This project aims to develop a Data Warehouse (DW) that consolidates multiple public government data points over several years, focusing on socio-economic indicators. The DW will support analytical queries and time-series analysis, providing decision-makers with deeper insights into areas such as Economic Activity, Environmental Policies and Damage, and Public Health. Additionally, the project features an ETL pipeline to automate the collection, transformation, and loading of data from public sources into the DW.

Modules of the Project

Automatic ETL pipeline for extracting, cleaning and processing public Brazilian government data using APIs and web scraping

Python and SQL scripts related to the Data Warehouse, its schema, and the insertion and retrieval of data

Projects I develop as part of a Brazilian government R&D grant program (PIBIT CNPq)
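As a rough illustration of how a warehouse like this can serve time-series queries, here is a minimal sketch of a star schema with one fact table and two dimension tables. It uses the standard-library `sqlite3` module as a stand-in for the real database, and every table name, column name and value below is a hypothetical placeholder, not the project's actual schema or data.

```python
import sqlite3

# Hypothetical star schema: all names here are illustrative assumptions.
SCHEMA = """
CREATE TABLE dim_city (
    city_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE dim_indicator (
    indicator_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    topic        TEXT NOT NULL
);
CREATE TABLE fact_indicator_value (
    city_id      INTEGER NOT NULL REFERENCES dim_city(city_id),
    indicator_id INTEGER NOT NULL REFERENCES dim_indicator(indicator_id),
    year         INTEGER NOT NULL,
    value        REAL NOT NULL,
    PRIMARY KEY (city_id, indicator_id, year)
);
"""


def build_demo_warehouse() -> sqlite3.Connection:
    """Create an in-memory warehouse filled with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO dim_city VALUES (1, 'City A')")
    conn.execute(
        "INSERT INTO dim_indicator VALUES (1, 'GDP per capita', 'Economic Activity')"
    )
    conn.executemany(
        "INSERT INTO fact_indicator_value VALUES (?, ?, ?, ?)",
        [(1, 1, 2019, 100.0), (1, 1, 2020, 90.0), (1, 1, 2021, 108.0)],
    )
    return conn


def yearly_series(conn: sqlite3.Connection, city: str, indicator: str) -> list:
    """Return (year, value) pairs for one indicator in one city, oldest first."""
    return conn.execute(
        """
        SELECT f.year, f.value
        FROM fact_indicator_value AS f
        JOIN dim_city AS c ON c.city_id = f.city_id
        JOIN dim_indicator AS i ON i.indicator_id = f.indicator_id
        WHERE c.name = ? AND i.name = ?
        ORDER BY f.year
        """,
        (city, indicator),
    ).fetchall()
```

Calling `yearly_series(build_demo_warehouse(), "City A", "GDP per capita")` returns one row per year, which is the shape a time-series analysis would start from.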

The project builds on the educational capabilities of Large Language Models (e.g., GPT-3.5 and GPT-4) while mitigating weaknesses such as hallucination and limited knowledge of certain subjects and tests within ENEM, the Brazilian standardized university admission exam.

To achieve these results, an LLM application was developed using OpenAI models (gpt-3.5 turbo or gpt-4) along with additional modules, such as internet search and retrieval-augmented generation (RAG), for extra functionality.

According to feedback, over 60% of users said our solution gives better and more accurate answers than ChatGPT.

Implementation of the educational chatbot described above, but using the new OpenAI custom GPTs service.

Helpful prompts and data extracted from official sources about the ENEM test were used for better results.

For RAG over ENEM test questions, a GPT action and its associated API were used. The API is hosted on AWS API Gateway and uses a Lambda function that takes the user input, embeds it with OpenAI embeddings, and then queries the Qdrant vector database for the N questions most similar to that input, where N is the number of questions the user asked for.
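The retrieval step described above can be sketched without any external service. This is only an illustration of the idea (embed the query, rank stored questions by similarity, return the top N): `toy_embed` is a hypothetical bag-of-words stand-in for the OpenAI embeddings, the in-memory list stands in for the Qdrant collection, and the `handler` signature and event keys are assumptions rather than the real Lambda code.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n_questions(query: str, questions: list[str], n: int) -> list[str]:
    """Rank the stored questions by similarity to the query and keep the best n."""
    query_vector = toy_embed(query)
    ranked = sorted(
        questions,
        key=lambda text: cosine(query_vector, toy_embed(text)),
        reverse=True,
    )
    return ranked[:n]


def handler(event: dict, questions: list[str]) -> dict:
    """Hypothetical Lambda-style entry point: reads 'query' and 'n' from the event."""
    matches = top_n_questions(event["query"], questions, int(event.get("n", 1)))
    return {"statusCode": 200, "questions": matches}
```

In a real deployment the ranking would be delegated to the vector database, which is what makes the lookup practical over a large question set; the sketch only shows the contract between the user input, N, and the returned questions.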

For the educational chatbots, both the website and the custom GPT version, I needed a large dataset of ENEM questions and their correct answers to support RAG and reduce LLM hallucinations (such as giving the wrong answer to a question), but no such large-scale dataset was available online.

In that context I created this project, which combines PDF/data mining through libraries like PyMuPDF2 to transform the ENEM PDFs into either plain text or JSON files (the Extract and Transform parts) with a Qdrant vector database loader that loads the data into the vector store (the Load part). The combination can process either single test PDFs (and their associated answer PDFs) or entire folders with multiple tests, loading hundreds of questions at once, while writing metadata and stats about the extraction process (number of extracted questions per year and subject) to a CSV file through a Pandas DataFrame.

Projects I developed to learn new technologies and concepts!
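A minimal sketch of the transform and stats steps, under stated assumptions: the text has already been extracted from the PDF, each question is assumed to start with a heading such as `QUESTÃO 12`, and the record fields and function names are hypothetical. To stay dependency-free, the stats file is written with the standard-library `csv` module instead of the Pandas DataFrame the project uses.

```python
import csv
import io
import re
from collections import Counter

# Assumption: every question begins with a heading like "QUESTÃO 12".
QUESTION_HEADING = re.compile(r"QUESTÃO\s+(\d+)")


def split_questions(text: str, year: int, subject: str) -> list[dict]:
    """Split already-extracted test text into one record per question."""
    # With one capture group, re.split returns
    # [preamble, number, body, number, body, ...].
    parts = QUESTION_HEADING.split(text)
    records = []
    for number, body in zip(parts[1::2], parts[2::2]):
        records.append(
            {
                "year": year,
                "subject": subject,
                "number": int(number),
                "text": body.strip(),
            }
        )
    return records


def stats_csv(records: list[dict]) -> str:
    """Count extracted questions per (year, subject) and render them as CSV."""
    counts = Counter((r["year"], r["subject"]) for r in records)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["year", "subject", "extracted_questions"])
    for (year, subject), total in sorted(counts.items()):
        writer.writerow([year, subject, total])
    return buffer.getvalue()
```

The records returned by `split_questions` are plain dictionaries, so they can be dumped to JSON with `json.dumps` or handed to a vector store loader without further conversion.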

This project aims to collect and update data on cryptocurrencies like Bitcoin and Ethereum, storing the information in CSV files. These files cover extensive periods of trading data collected from the Binance US API.

The main technologies used are AWS Cloud (Lambda, API Gateway, EC2 and S3), Apache Airflow for data pipeline orchestration, and Python and Pandas for manipulating the data.

Here's the architecture of the project/pipeline:

Caue-airflow
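The "collect and update" step of a pipeline like this usually comes down to merging freshly fetched rows into an existing CSV without duplicating periods that are already stored. Below is a minimal sketch of that merge using only the standard library. The column names, the function name and the sample values are hypothetical placeholders, and the real project uses Pandas for this kind of manipulation.

```python
import csv
import io

# Hypothetical column layout for one candle (trading period) per row.
FIELDS = ["open_time", "open", "high", "low", "close", "volume"]


def merge_candles(existing_csv: str, new_rows: list[dict]) -> str:
    """Merge new candles into existing CSV text, keyed by open_time.

    A new row replaces a stored row with the same open_time, so re-running
    the update for an overlapping period does not create duplicates.
    """
    rows = {
        row["open_time"]: row
        for row in csv.DictReader(io.StringIO(existing_csv))
    }
    for row in new_rows:
        rows[str(row["open_time"])] = {field: str(row[field]) for field in FIELDS}

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for key in sorted(rows, key=int):  # keep the file in chronological order
        writer.writerow(rows[key])
    return buffer.getvalue()
```

Keying on the period's opening timestamp makes the update idempotent, which matters when an orchestrator such as Airflow retries or re-runs a task.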

Projects I developed as part of the University of São Paulo Scientific Initiation Symposium (SIICUSP 2023)

Project developed in a group for an electronics class at university

The goal of this effort was to integrate Machine Learning models, such as computer vision and text classification, with a robot powered by a microcontroller (ESP-32).

My main contribution was with software development for the ESP-32 embedded systems, using C++ and modules such as Wi-Fi HTTP request handlers.

Here's the certificate for the Symposium


Technologies I am familiar with:

JavaScript, HTML, Python, GPT, Qdrant, AWS

My social networks:

Popular repositories

  1. ENEM_PDF_PARSER Public

    This project provides a tool to extract ENEM (Brazilian SAT) tests into parsed .txt and JSON files

    Python 2

  2. intelli.gente_data_extraction Public

    Python 2 1

  3. airflow_project Public

    Python 1

  4. Projetos_LangChain Public

    College projects, mainly in Python

    Python

  5. Eletro_usp Public

    Forked from EnzoTM/Eletro_usp

  6. caue-paiva Public