Skip to content

crcresearch/FUNSD

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Table of contents

This repository's content is intended to provide a guide on the original FUNSD dataset and its versions. Alongside the original dataset, there are two additional datasets derived from the it: the FUNSD Revised Dataset and the FUNSD+ Dataset.

The FUNSD (Form Understanding in Noisy Scanned Documents) dataset is commonly used to understand documents and extract information from scanned documents. It is particularly useful for tasks involving Text Detection, Optical Character Recognition, Spatial Layout Analysis, and Form Understanding.

s

Figure 1: Example of forms from the FUNSD dataset from https://guillaumejaume.github.io/FUNSD/img/two_forms.png. The labels indicate different types of information: questions (blue), answers (green), headers (yellow), and other elements (orange).

FUNSD Dataset (Original) [1]

The FUNSD dataset was published by Jaume et al. (2019) for form understanding in noisy scanned documents. It is available on Guillaume Jaume's homepage and is licensed for non-commercial, research, and educational purposes.

The FUNSD dataset consists of 199 document images, which is a subset sampled from the RVL-CDIP dataset introduced by Harley et al. (2015)[4]. The RVL-CDIP dataset comprises 400,000 grayscale images of various documents from the 1980s and 1990s. These scanned documents have a low resolution of 100 dpi and suffer from low quality due to various types of noise introduced by successive scanning and printing procedures. The RVL-CDIP dataset categorizes its images into four classes: letter, email, magazine, and form.

The authors of the FUNSD dataset manually reviewed 25,000 images from the "form" category, discarding unreadable and duplicate images. This process resulted in a refined set of 3,200 images, from which 199 images were randomly sampled for annotation to create the FUNSD dataset. The RVL-CDIP dataset is itself a subset of another dataset called the Truth Tobacco Industry Document (TTID)[5], an archive collection of scientific research, marketing, and advertising documents from the largest tobacco companies in the US.

In the original FUNSD dataset, the metadata of each image is contained in a separate JSON file, in which the form is represented as a list of interlinked semantic entities. Each entity consists of a group of words that belong together both semantically and spatially.

Each semantic entity is associated with:

  • a unique identifier,
  • a label which can be header, question, answer, or other,
  • a bounding box,
  • a list of words belonging to said entity,
  • and a list of links with other entities.

Each word is described by its contextual content: a OCR label and its bounding box.

Please keep in mind the following information:

  • the question and answer labels are similar to key and value;
  • the bounding boxes are represented as [left, top, right, bottom];
  • and the links are formatted as a pair of entity identifiers, with the identifier of the current entity being the first element.

The dataset contains 199 images, with over 30,000 word-level annotations and approximately 10,000 entities, and it was pre-divided into a training set of 149 images and a testing set of 50 images.

  • The FUNSD dataset can be downloaded from the original source.

  • We generated annotations from Azure AI Document Intelligence service for the images in the FUNSD dataset.
    You also can use git with dvc to download the dataset containing the Azure annotations.:

    # in your virtual environment
    pip install dvc[ssh]
    dvc get https://github.com/crcresearch/FUNSD datasets/FUNSD

    Note: After running the last command, you may be prompted to enter your password to SSH into the CRC Cluster.

  • You can also download it from the following link: https://www.crc.nd.edu/~pmoreira/funsd.zip

  • To plot the annotations on the image, you can use the following notebook: FUNSD notebook Instructions

  • Find the code used to generate the Azure annotations at here.

FUNSD Revised Dataset [2]

Vu et al. (2020) in [2] reported several inconsistencies in labeling, which could hinder the applicability of FUNSD to the key-value extraction problem. They revised the dataset by correcting the labels and adding new annotations.

FUNSD+ Dataset [3]

The FUNSD+ dataset is an enhanced version of the original FUNSD (Form Understanding in Noisy Scanned Documents) dataset, designed for more comprehensive document understanding tasks.

  • FUNSD+ notebook Instructions

  • Get the Dataset via DVC

    # in your virtual environment
    pip install dvc[ssh]
    dvc get https://github.com/crcresearch/FUNSD datasets/FUNSD_plus

Note: After running the last command, you may be prompted to enter your password to SSH into the CRC Cluster.

[1] Jaume, Guillaume, Ekenel, Hazim Kemal, and Thiran, Jean-Philippe. "FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents," 2019. https://arxiv.org/pdf/1905.13538

[2] Vu, Hieu M., and Nguyen, Diep Thi-Ngoc. "Revising FUNSD Dataset for Key-Value Detection in Document Images," 2020. https://arxiv.org/pdf/2010.05322

[3] Zagami, Davide, and Helm, Christopher. "FUNSD+: A Larger and Revised FUNSD Dataset," 2022. https://konfuzio.com/de/funsd-plus/

[4] A. W. Harley, A. Ufkes, K. G. Derpanis, "Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval," in ICDAR, 2015. https://arxiv.org/abs/1502.07058

[5] Truth Tobacco Industry Documents Library. https://www.industrydocuments.ucsf.edu/tobacco