
INFO-RETRIEVAL

Streamlining Data to Insightful Retrieval

Developed with the software and tools below: Python, Jupyter, pandas, NumPy, and tqdm.


Table of Contents

  • Overview
  • Repository Structure
  • Data
  • Modules
  • Getting Started
  • About

Overview

The info-retrieval project is designed to facilitate efficient information retrieval from large textual datasets. It encompasses modules for data preprocessing, tokenization, and the application of both traditional and machine-learning-based retrieval models. Core functionalities include term frequency analysis, text normalization, inverted indexing, and relevance scoring, supplemented by visualization tools for analyzing frequency distributions.

The system processes, analyzes, and retrieves textual data efficiently: it handles data processing and normalization, visualization of term distribution patterns, and both traditional and machine-learning-based retrieval techniques. By leveraging packages like nltk, Gensim, pandas, and NumPy, as well as purpose-built data structures like inverted indices, it supports robust data manipulation, retrieval, and effective feature extraction. In addition to traditional models, namely the TF-IDF based model, the BM25 model, and the likelihood model, the project also introduces simple learning-to-rank models: logistic regression, LambdaMART, and a neural network (MLP).

Note: Throughout this repository, some pairs of terms are used interchangeably: passages and documents (or docs), and tokens and terms.
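As a rough illustration of the inverted index idea mentioned above, here is a minimal Python sketch. The helper `build_inverted_index` is hypothetical and written for this example only; it does not mirror the actual data loader classes in data.py.

```python
from collections import defaultdict

def build_inverted_index(passages: dict[str, list[str]]) -> dict[str, dict[str, int]]:
    """Map each term to {pid: term frequency} over pre-tokenized passages."""
    index: dict[str, dict[str, int]] = defaultdict(dict)
    for pid, tokens in passages.items():
        for token in tokens:
            index[token][pid] = index[token].get(pid, 0) + 1
    return dict(index)

# Two toy passages, already tokenized and normalized.
index = build_inverted_index({
    "p1": ["information", "retrieval", "system"],
    "p2": ["retrieval", "of", "information", "retrieval"],
})
print(index["retrieval"])  # {'p1': 1, 'p2': 2}
```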


Repository Structure

└── info-retrieval/
    ├── data.py
    ├── display_tools.py
    ├── main.ipynb
    ├── README.md
    ├── requirements.txt
    ├── retrieve
    │   ├── learning.py
    │   └── tradition.py
    └── utils.py

Data

The example data used for this project are two .tsv files with the columns (qid, pid, query, passage, relevancy).
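For orientation, one way to load such a file with pandas is sketched below. The filename is a placeholder, and whether the files carry a header row is an assumption; adjust `header`/`names` to match your data.

```python
import pandas as pd

# "validation_data.tsv" is a placeholder path; the data itself is not
# shipped with the repository. header=None assumes no header row.
df = pd.read_csv(
    "validation_data.tsv",
    sep="\t",
    header=None,
    names=["qid", "pid", "query", "passage", "relevancy"],
)
print(df.head())
```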


Modules

.

| File | Summary |
| --- | --- |
| data.py | The backbone for data operations within the information retrieval system; it manages loading, preprocessing, and storing the data used by the retrieval and analysis tasks. (1) Data loading: data loader classes load the data into structured dicts with collection-level and document-level statistics, transforming raw input into formats suitable for processing. (2) Data preprocessing: utilities for cleaning and preparing data. (3) Integration hooks: classes exposing views of the structured dicts that behave like plain dicts, ensuring smooth data flow and consistency across the modules in the retrieval pipeline. |
| display_tools.py | The visualize_frequency_zipfian function generates visual comparisons between normalized frequency data and Zipfian distributions in both linear and logarithmic scales, aiding the evaluation of term distribution patterns. |
| main.ipynb | Shows some simple information retrieval workflows for this project. |
| requirements.txt | Outlines the dependencies essential for the project. |
| utils.py | Provides utility functions and a lemmatizer class with a cached method for text processing. Implements normalized frequency calculation, token generation, and evaluation metrics such as average precision and NDCG for information retrieval tasks. |
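As a sketch of the kind of metric utils.py describes, here is a minimal NDCG@k in NumPy. It uses the linear-gain variant (gain equals the relevance label); the repository's implementation may use a different gain or signature.

```python
import numpy as np

def ndcg_at_k(relevancies: list[float], k: int) -> float:
    """NDCG@k for one query, given relevance labels in ranked order."""
    rels = np.asarray(relevancies, dtype=float)[:k]
    # Discount the label at 0-based rank i by 1 / log2(i + 2).
    dcg = float(np.sum(rels / np.log2(np.arange(2, rels.size + 2))))
    ideal = np.sort(np.asarray(relevancies, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal / np.log2(np.arange(2, ideal.size + 2))))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([1, 0, 1, 0], k=3))  # ~0.92
```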
retrieve

| File | Summary |
| --- | --- |
| tradition.py | (1) Scorer class: the core component, responsible for calculating retrieval scores. It implements scoring functions including TF-IDF, BM25, and log likelihood with smoothing methods, evaluating the relevance of document-query pairs by using the DataLoader from data.py to access the required data structures. (2) Traditional Retrieval class: performs retrieval based on the Scorer class. (3) Score functions: standalone scoring functions for these traditional retrieval models that can be used more generally. Overall, this file enables the traditional information retrieval methodologies in the repository. |
| learning.py | Facilitates machine-learning-based information retrieval by defining models such as logistic regression, LambdaMART, and MLP, along with a Trainer to manage the training process. Integrates with the parent repository's architecture to improve retrieval performance using model-based techniques for predicting and ranking relevancy. |
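To make the traditional scoring concrete, below is a minimal Okapi BM25 scoring function. The function name, argument layout, and the defaults k1 = 1.2 and b = 0.75 are illustrative assumptions; they do not reproduce the Scorer class API in tradition.py.

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, doc_freq, n_docs,
               k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a query.

    doc_tf: term -> frequency in this document.
    doc_freq: term -> number of documents containing the term.
    """
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        df = doc_freq.get(term, 0)
        if tf == 0 or df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score

# Toy example: score one 3-token document in a 100-document collection.
print(bm25_score(["information", "retrieval"],
                 doc_tf={"information": 1, "retrieval": 2}, doc_len=3,
                 avg_doc_len=5.0, doc_freq={"information": 30, "retrieval": 10},
                 n_docs=100))
```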

Getting Started

Installation

From source

  1. Clone the repository:

```sh
$ git clone https://github.com/kangchengX/info-retrieval.git
```

  2. Change to the project directory:

```sh
$ cd info-retrieval
```

  3. Install the dependencies:

```sh
$ pip install -r requirements.txt
```

System Requirements:

  • Python: version 3.12.2

Data

The training data must have the columns (qid, pid, query, passage). The validation data must have the columns (qid, pid, query, passage, relevancy).
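For a rough sense of how labeled (query, passage) pairs feed a pointwise learning-to-rank model like the logistic regression in learning.py, here is a minimal sketch. scikit-learn and the hand-made feature vectors are assumptions for this example only; the repository defines its own models and Trainer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One feature vector per (query, passage) pair plus a binary relevancy
# label; feature extraction (e.g. TF-IDF or embedding similarity) is
# assumed to have happened already.
X_train = np.array([[0.9, 1.2], [0.1, 0.3], [0.7, 0.8], [0.2, 0.1]])
y_train = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X_train, y_train)

# Rank candidate passages for a query by predicted relevance probability.
X_candidates = np.array([[0.8, 1.0], [0.15, 0.2]])
scores = model.predict_proba(X_candidates)[:, 1]
print(np.argsort(-scores))  # candidate indices, best first
```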

Usage

See the examples in main.ipynb.

About

This project implements three traditional information retrieval systems (TF-IDF vector space based, BM25, and likelihood based) as well as three learning-to-rank systems (logistic regression, LambdaMART, and MLP).
