Project Proposal

Authors:

Till Grutschus
Jurek Sander
Ricardo Lammert Zepeda

Research Question

As the number of scientific publications grows, keeping up with the latest advances becomes increasingly difficult. Actors across the scientific community therefore need ways to identify relevant publications quickly. Researchers, publishers and readers all have an interest in the impact of scientific literature: researchers may want to optimize their own writing to increase visibility, publishers may want to gauge the potential impact of papers they are considering for publication, and readers may want to filter new publications for relevance.

In this project, we address this problem by investigating the following question:

Can we predict the (future) impact of a scientific publication based on its metadata?

By answering this question, we hope to provide a tool that can help researchers, publishers and readers to quickly identify relevant publications.

Dataset

To address this question, we will use metadata from the arXiv repository, available here: https://www.kaggle.com/Cornell-University/arxiv.

Additionally, we will use citation data extracted from crossref.org.

In the following we will briefly describe the two datasets.

arXiv Dataset

The arXiv dataset contains metadata of 1.7+ million scientific publications from the arXiv repository. A detailed description of the dataset can be found on the dataset's Kaggle page.

We expect the most relevant attributes for our project to be:

title: The title of the publication
abstract: The abstract of the publication
authors: The authors of the publication
categories: The categories of the publication
comments: Free-text comments, e.g. number of pages, figures, tables

A detailed feature analysis will be part of the project.
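
As a rough illustration of how these attributes could be pulled out of the metadata, the sketch below assumes the Kaggle snapshot stores one JSON object per line and that the field names match the attribute names listed above. The helper name `iter_records` is hypothetical and not part of the project code.

```python
import json

# Attributes named in the proposal; assumed to match the JSON keys of the dataset.
FIELDS = ("title", "abstract", "authors", "categories", "comments")


def iter_records(lines, fields=FIELDS):
    """Yield one dict per non-empty JSON line, restricted to the given fields.

    Missing fields are returned as None so that they can be handled later.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}
```

Because `lines` can be any iterable of strings, an open file handle can be passed directly, which avoids loading the whole snapshot into memory at once.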

crossref.org

Crossref offers a publicly available API to access citation data. Additionally, data dumps of the Crossref database are available for download.

We will use the number of citations to derive a ground truth for the impact of a publication.
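
A minimal sketch of how a citation count might be read from the Crossref REST API is shown below. It assumes that a publication's DOI is known and relies on the `is-referenced-by-count` field that Crossref returns inside the `message` object of a `/works/{doi}` response. The function names are illustrative and not taken from the project code.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

CROSSREF_WORKS_URL = "https://api.crossref.org/works/"


def works_url(doi: str) -> str:
    """Build the Crossref works URL for a DOI, escaping unsafe characters."""
    return CROSSREF_WORKS_URL + urllib.parse.quote(doi, safe="/")


def citation_count(payload: dict) -> Optional[int]:
    """Extract the citation count from a decoded Crossref response, if present."""
    return payload.get("message", {}).get("is-referenced-by-count")


def fetch_citation_count(doi: str, timeout: float = 10.0) -> Optional[int]:
    """Query Crossref for one DOI (performs a network request)."""
    with urllib.request.urlopen(works_url(doi), timeout=timeout) as response:
        return citation_count(json.load(response))
```

Publications without a DOI cannot be looked up this way. For a dataset of this size, the downloadable data dumps are likely a more practical source than one request per publication.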

Methodology

Dataset acquisition

As described above, we will use the Crossref API to enrich the arXiv dataset with citation data.

Data preprocessing

Textual data will be preprocessed, tokenized and vectorized as learned in the assignments. Missing data will be handled appropriately. Subsequently, we will perform a feature analysis to identify the most relevant features for our project and potentially reduce the dimensionality of the dataset.
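
The tokenization and vectorization steps could look roughly like the following bag-of-words sketch. The proposal does not specify which methods will be used, so the lowercase alphanumeric tokenizer and the raw term counts here are assumptions chosen for illustration only.

```python
import re
from collections import Counter

# Assumed tokenization rule: lowercase runs of letters and digits.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def build_vocabulary(documents):
    """Map every distinct token in the corpus to a column index (sorted order)."""
    terms = sorted({token for doc in documents for token in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def vectorize(text, vocabulary):
    """Return a term-count vector; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for term, count in Counter(tokenize(text)).items():
        index = vocabulary.get(term)
        if index is not None:
            vector[index] = count
    return vector
```

A weighting scheme such as TF-IDF or a library vectorizer could replace the raw counts without changing the overall flow.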

Data mining

We will use the citation count to devise a discrete ordinal target variable for the impact of a publication. We will then evaluate different classification algorithms to predict the impact of a publication based on its metadata.
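
One way to turn a citation count into a discrete ordinal target is to bin it with fixed thresholds, as sketched below. The thresholds and class names are hypothetical placeholders. The proposal does not fix them, and in practice they would likely be chosen from the observed citation distribution.

```python
from bisect import bisect_right

# Hypothetical class boundaries, not taken from the proposal:
# 0 citations -> "uncited", 1-9 -> "low", 10-99 -> "medium", 100+ -> "high".
IMPACT_THRESHOLDS = (1, 10, 100)
IMPACT_LABELS = ("uncited", "low", "medium", "high")


def impact_class(citation_count, thresholds=IMPACT_THRESHOLDS):
    """Return the ordinal class index (0 = lowest impact) for a citation count."""
    if citation_count < 0:
        raise ValueError("citation_count must be non-negative")
    # bisect_right counts how many thresholds are <= citation_count.
    return bisect_right(thresholds, citation_count)
```

The returned index preserves the ordering of the classes, so it can serve as a target for ordinary classifiers as well as for ordinal methods.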
