Update M3DocVQA download (#7)
j-min authored Feb 15, 2025
1 parent 27aac8d commit 29e6ac2
Showing 5 changed files with 144 additions and 229 deletions.
134 changes: 89 additions & 45 deletions m3docvqa/README.md
@@ -33,8 +33,8 @@ The scripts allow users to:
## Installation

```
git clone <url-tbd>
cd <repo-name-tbd>/m3docvqa
git clone https://github.com/bloomberg/m3docrag
cd m3docrag/m3docvqa
```

### Install Python Package
@@ -111,90 +111,134 @@ Output:

A JSONL file `id_url_mapping.jsonl` containing the ID and corresponding URL mappings.
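Judging from how `main.py` consumes this mapping, each line is a JSON object with at least an `id` and a `url` field, along these lines (placeholder values):

```json
{"id": "<wikipedia_doc_id>", "url": "https://en.wikipedia.org/wiki/<article_title>"}
```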

### Step 3: Download Wikipedia Articles as PDFs
Use the `download_pdfs` action to download Wikipedia articles in a PDF format based on the generated mapping.
### Step 3: Create Split Files
Use the `create_splits` action to create the per-split document ID files.

```bash
python main.py download_pdfs --metadata_path=./id_url_mapping.jsonl --pdf_dir=./pdfs --result_log_path=./download_results.jsonl --first_n=10 --supporting_doc_ids_per_split=./supporting_doc_ids_per_split.json --split=dev
python main.py create_splits --split_metadata_file=./multimodalqa/MMQA_dev.jsonl --split=dev
python main.py create_splits --split_metadata_file=./multimodalqa/MMQA_train.jsonl --split=train
```

Options:
- `--metadata_path`: Path to the id_url_mapping.jsonl file.
- `--pdf_dir`: Directory to save the downloaded PDFs.
- `--result_log_path`: Path to log the download results.
- `--first_n`: Downloads the first N PDFs for testing. **Do not use this option for downloading all the PDFs.**
- `--supporting_doc_ids_per_split`: Path to the JSON file containing document IDs for each split. `dev` is the default split, as all of the experimental results in the `M3DocRAG` paper were reported on the `dev` split. To download the PDFs in the `train` split, provide `--supporting_doc_ids_per_split=train`; to download all of the PDFs, provide `--supporting_doc_ids_per_split=all`.
**Note** - In the [M3DocRAG](https://arxiv.org/abs/2411.04952) paper, we only use the `dev` split for our experiments.

Output:

- PDF files for Wikipedia articles, saved in the `./pdfs/` directory.
- A `download_results.jsonl` file logging the status of each download.
- Files that store document IDs of each split: `./dev_doc_ids.json` and `./train_doc_ids.json`.
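Based on how `download_pdfs` later loads these files (a plain `json.load` into a collection of IDs), each split file is expected to be a flat JSON list of document IDs, schematically:

```json
["<doc_id_1>", "<doc_id_2>", "<doc_id_3>"]
```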

### Step 4: Check PDF Integrity
Use the `check_pdfs` action to verify the integrity of the downloaded PDFs.

### Step 4: Download Wikipedia Articles as PDFs
Use the `download_pdfs` action to download Wikipedia articles in PDF format based on the generated mapping.

```bash
python main.py check_pdfs --pdf_dir=./pdfs
python main.py download_pdfs --metadata_path=./id_url_mapping.jsonl --pdf_dir=./pdfs_dev --result_log_dir=./download_logs/ --first_n=10 --per_split_doc_ids=./dev_doc_ids.json
```

Options:
- `--metadata_path`: Path to the id_url_mapping.jsonl file.
- `--pdf_dir`: Directory to save the downloaded PDFs.
- `--result_log_dir`: Directory to log the download results.
- `--first_n`: Downloads the first N PDFs for testing (default is -1, which means all the PDFs).
- `--per_split_doc_ids`: Path to the JSON file containing the document IDs of a split. `dev_doc_ids.json` is the default, as all of the experimental results in the `M3DocRAG` paper were reported on the `dev` split. To download the PDFs in the `train` split, pass `--per_split_doc_ids=./train_doc_ids.json` instead.

Output:

Identifies and logs corrupted or unreadable PDFs.
- PDF files for Wikipedia articles, saved in the `./pdfs_dev/` directory.
- Per-process JSONL log files in the `./download_logs/` directory (as set by `--result_log_dir`), recording the status of each download, as illustrated below.
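Each record in these per-process logs pairs the download status with the arguments of that download attempt, as written by `downloader.py` in this commit. Schematically (all values are placeholders; numeric fields are shown quoted only for illustration):

```json
{"downloaded": true, "args": ["<order_i>", "<total>", "<url>", "<save_path>", "pdf", "<proc_id>"]}
```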

### Step 5: Organize Files into Splits
Use the `organize_files` action to organize the downloaded PDFs into specific splits (e.g., `train`, `dev`) based on a split information file.
If you want to download PDFs in parallel, you can use the following commands with the arguments `proc_id` and `n_proc`. `proc_id` is the process ID (default is 0), and `n_proc` is the total number of processes (default is 1).

```bash
python main.py organize_files --all_pdf_dir=./pdfs --target_dir_base=./splits --split=dev --split_metadata_file=./multimodalqa/MMQA_dev.jsonl
```

If the train split is needed:

```bash
# e.g., distributed in 4 parallel jobs on the first 20 PDFs
N_total_processes=4

for i in $(seq 0 $((N_total_processes - 1)));
do
echo $i
python main.py \
download_pdfs \
--metadata_path './id_url_mapping.jsonl' \
--pdf_dir './pdfs_dev' \
--result_log_dir './download_logs/' \
--per_split_doc_ids './dev_doc_ids.json' \
--first_n=20 \
--proc_id=$i \
--n_proc=$N_total_processes &
done


# e.g., distributed in 16 parallel jobs on all dev PDFs
N_total_processes=16

for i in $(seq 0 $((N_total_processes - 1)));
do
echo $i
python main.py \
download_pdfs \
--metadata_path './id_url_mapping.jsonl' \
--pdf_dir './pdfs_dev' \
--result_log_dir './download_logs/' \
--per_split_doc_ids './dev_doc_ids.json' \
--first_n=-1 \
--proc_id=$i \
--n_proc=$N_total_processes &
done
```
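Each parallel job downloads a disjoint subset of the URLs: as the updated `main.py` shows, process `proc_id` takes every `n_proc`-th entry of the prepared URL list (`urls[proc_id::n_proc]`), so no two jobs fetch the same article. A minimal Python illustration of that striding:

```python
# Strided work split used by download_pdfs: process i of n takes urls[i::n].
urls = [f"url_{k}" for k in range(10)]  # stand-in for the prepared download list

n_proc = 4
chunks = [urls[proc_id::n_proc] for proc_id in range(n_proc)]

# Every URL is assigned to exactly one process, and none is duplicated.
assert sorted(sum(chunks, [])) == sorted(urls)
print(chunks[0])  # ['url_0', 'url_4', 'url_8']
```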



### Step 5: Check PDF Integrity
Use the `check_pdfs` action to verify the integrity of the downloaded PDFs.

```bash
python main.py organize_files --all_pdf_dir=./pdfs --target_dir_base=./splits --split=train --split_metadata_file=./multimodalqa/MMQA_train.jsonl
python main.py check_pdfs --pdf_dir=./pdfs_dev
```

Output:

- Organized PDFs into directories in `./splits/pdfs_train/` and `./splits/pdfs_dev/`.
- Files that store document IDs of each split `./train_doc_ids.json` and `./dev_doc_ids.json`.
Identifies and logs corrupted or unreadable PDFs.

**Note** - In the [M3DocRAG](https://arxiv.org/abs/2411.04952) paper, we only use the `dev` split for our experiments.
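The integrity check itself is implemented by `is_pdf_clean` in `m3docvqa.pdf_utils`, which is not shown in this diff. As a rough mental model only, a check of this kind could look like the sketch below, assuming the `pypdf` package; this is a hypothetical stand-in, not the repository's actual helper:

```python
# Hypothetical sketch of a PDF integrity check; the real is_pdf_clean in
# m3docvqa.pdf_utils may use a different library and different criteria.
from pathlib import Path
from pypdf import PdfReader


def looks_clean(pdf_path: str | Path) -> bool:
    """Return True if the PDF opens and its first page parses."""
    try:
        reader = PdfReader(str(pdf_path))
        if len(reader.pages) == 0:
            return False
        reader.pages[0].extract_text()  # force-parse the first page
        return True
    except Exception:
        return False


corrupted = [p for p in Path("./pdfs_dev").glob("*.pdf") if not looks_clean(p)]
print(f"{len(corrupted)} corrupted or unreadable PDFs found")
```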

### Step 6: Extract Images from PDFs
Use the `extract_images` action to extract images from the downloaded PDFs. A PNG image is extracted for each page of the PDFs. These images are used both for `retrieval` with `ColPali/ColQwen` and for `question answering` with the LLMs mentioned in the [M3DocRAG](https://arxiv.org/abs/2411.04952) paper.
### (Optional) Step 6: Extract Images from PDFs
When creating embeddings in the [M3DocRAG](https://arxiv.org/abs/2411.04952) experiments, we extract images from the downloaded PDFs on the fly. But if users want to extract images from the downloaded PDFs and save them for future use, they can use the `extract_images` action.

```bash
python main.py extract_images --pdf_dir=./splits/pdfs_dev/ --image_dir=./images/images_dev
python main.py extract_images --pdf_dir=./pdfs_dev/ --image_dir=./images_dev
```

Output:

Extracted images from the PDFs in the dev split are saved in the `./images/images_dev` directory.
Extracted images from the PDFs in the dev split are saved in the `./images_dev` directory.
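If you need page images outside of this pipeline, the same result can be approximated with PyMuPDF. The sketch below is only illustrative and assumes the `pymupdf` package; it is not the repository's `get_images_from_pdf` implementation, which may differ in naming and rendering settings:

```python
# Illustrative page-image extraction with PyMuPDF; the repository's
# get_images_from_pdf helper may use different settings or another library.
from pathlib import Path
import fitz  # PyMuPDF

pdf_dir = Path("./pdfs_dev")
image_dir = Path("./images_dev")
image_dir.mkdir(parents=True, exist_ok=True)

for pdf_path in sorted(pdf_dir.glob("*.pdf")):
    with fitz.open(pdf_path) as doc:
        for page_index, page in enumerate(doc, start=1):
            pix = page.get_pixmap(dpi=144)  # render the page as a raster image
            pix.save(str(image_dir / f"{pdf_path.stem}_page_{page_index}.png"))
```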

After following these steps, your dataset directory structure will look like this:

```bash
./
# original MMQA files
|-- multimodalqa/
|   |-- MMQA_train.jsonl
|   |-- MMQA_dev.jsonl
|   |-- MMQA_texts.jsonl
|   |-- MMQA_images.jsonl
|   |-- MMQA_tables.jsonl
# generated files
|-- id_url_mapping.jsonl
|-- dev_doc_ids.json
|-- train_doc_ids.json
|-- supporting_doc_ids_per_split.json
|-- download_results.jsonl
|-- pdfs/
|   |-- <article_1>.pdf
|   |-- <article_2>.pdf
|-- images/
|   |-- images_dev/
|   |   |-- <doc_id_1_page_1>.png
|   |   |-- <doc_id_2_page_2>.png
|-- splits/
|   |-- pdfs_dev/
|   |   |-- <doc_id_1>.pdf
|   |   |-- <doc_id_2>.pdf
# download logs
|-- download_logs/
|   |-- <process_id>_<first_n>.jsonl
# downloaded PDFs
|-- pdfs_dev/
|   |-- <article_dev_1>.pdf
|   |-- <article_dev_2>.pdf
# (Below are optional outputs)
# |-- pdfs_train/
# |   |-- <article_train_1>.pdf
# |   |-- <article_train_2>.pdf
# |-- images_dev/
# |   |-- <doc_id_dev_1_page_1>.png
# |   |-- <doc_id_dev_2_page_2>.png
# |-- images_train/
# |   |-- <doc_id_train_1_page_1>.png
# |   |-- <doc_id_train_2_page_2>.png
```
66 changes: 33 additions & 33 deletions m3docvqa/main.py
@@ -40,7 +40,7 @@
from pathlib import Path
from m3docvqa.downloader import download_wiki_page
from m3docvqa.pdf_utils import is_pdf_downloaded, is_pdf_clean, get_images_from_pdf
from m3docvqa.split_utils import create_split_dirs
from m3docvqa.split_utils import create_split_files
from m3docvqa.mmqa_downloader import download_and_decompress_mmqa
from m3docvqa.wiki_mapper import generate_wiki_links_mapping
from loguru import logger
@@ -52,6 +52,7 @@ def _prepare_download(
output_dir: Path | str,
first_n: int,
doc_ids: set,
check_downloaded: bool = False,
) -> tuple[list[str], list[Path]]:
"""Prepare URLs and save paths for downloading.
@@ -74,47 +75,48 @@
break

doc_id = line.get("id")
url = line.get("url")
if doc_ids and doc_id not in doc_ids:
continue

url = line.get("url")
save_path = output_dir / f"{doc_id}.pdf"
if not is_pdf_downloaded(save_path):
urls.append(url)
save_paths.append(save_path)
if check_downloaded and is_pdf_downloaded(save_path):
continue

urls.append(url)
save_paths.append(save_path)

return urls, save_paths


def download_pdfs(
metadata_path: Path | str,
pdf_dir: Path | str,
result_log_path: Path | str,
supporting_doc_ids_per_split: Path | str,
result_log_dir: Path | str,
per_split_doc_ids: Path | str,
first_n: int = -1,
proc_id: int = 0,
n_proc: int = 1,
split: str = 'dev',
check_downloaded: bool = False,
):
"""Download Wikipedia pages as PDFs."""
# Load document ids for the specified split
if supporting_doc_ids_per_split:
with open(supporting_doc_ids_per_split, "r") as f:
doc_ids_per_split = json.load(f)
split_doc_ids = {
"train": set(doc_ids_per_split.get("train", [])),
"dev": set(doc_ids_per_split.get("dev", [])),
"all": set(doc_ids_per_split.get("train", []) + doc_ids_per_split.get("dev", []))
}
if split not in split_doc_ids:
raise ValueError(f"Invalid or missing split. Expected one of {split_doc_ids.keys()}")
doc_ids = split_doc_ids.get(split, split_doc_ids.get("all"))
logger.info(f"Downloading documents for split: {split} with {len(doc_ids)} document IDs.")

urls, save_paths = _prepare_download(metadata_path, pdf_dir, first_n, doc_ids)
logger.info(f"Starting download of {len(urls)} PDFs to {pdf_dir}")
download_results = download_wiki_page(urls, save_paths, "pdf", result_log_path, proc_id, n_proc)
logger.info(f"Download completed with {sum(download_results)} successful downloads out of {len(urls)}")
if per_split_doc_ids:
with open(per_split_doc_ids, "r") as f:
doc_ids = json.load(f)
logger.info(f"Downloading documents with {len(doc_ids)} document IDs from {metadata_path}.")

urls, save_paths = _prepare_download(metadata_path, pdf_dir, first_n, doc_ids, check_downloaded)

# split urls and save_paths (both are lists) into n_proc chunks
if n_proc > 1:
logger.info(f"[{proc_id}/{n_proc}] Splitting {len(urls)} URLs into {n_proc} chunks")
urls = urls[proc_id::n_proc]
save_paths = save_paths[proc_id::n_proc]

logger.info(f"[{proc_id}/{n_proc}] Starting download of {len(urls)} PDFs to {pdf_dir}")
download_results = download_wiki_page(urls, save_paths, "pdf", result_log_dir, proc_id, n_proc)
logger.info(f"[{proc_id}/{n_proc}] Download completed with {sum(download_results)} successful downloads out of {len(urls)}")


def check_pdfs(pdf_dir: str, proc_id: int = 0, n_proc: int = 1):
@@ -147,18 +149,16 @@ def extract_images(pdf_dir: str, image_dir: str, save_type='png'):

for pdf_path in tqdm(pdf_files, desc="Extracting images", unit="PDF"):
get_images_from_pdf(pdf_path, save_dir=image_dir, save_type=save_type)
logger.info(f"Images extracted from PDFs in {pdf_dir}")
logger.info(f"Images extracted from {pdf_dir} and saved to {image_dir}")


def organize_files(all_pdf_dir: Path | str, target_dir_base: Path | str, split_metadata_file: str | Path, split: str):
"""Organizes PDFs into directory splits based on split information file."""
create_split_dirs(
all_pdf_dir=all_pdf_dir,
target_dir_base=target_dir_base,
def create_splits(split_metadata_file: str | Path, split: str):
"""Create the per-split doc ids."""
create_split_files(
split_metadata_file=split_metadata_file,
split=split,
)
logger.info(f"Files organized for {split} split: in {target_dir_base}")
logger.info(f"Doc Ids Files created for {split} split")


def download_mmqa(output_dir: str):
@@ -193,7 +193,7 @@ def main():
"download_pdfs": download_pdfs,
"check_pdfs": check_pdfs,
"extract_images": extract_images,
"organize_files": organize_files,
"create_splits": create_splits,
})


33 changes: 17 additions & 16 deletions m3docvqa/src/m3docvqa/downloader.py
@@ -47,11 +47,6 @@ def _download_wiki_page(args: tuple[int, int, str, str, str, int]) -> tuple[bool
"""
order_i, total, url, save_path, save_type, proc_id = args

if is_pdf_downloaded(save_path):
if proc_id == 0:
logger.info(f"{order_i} / {total} - {save_path} already downloaded")
return True, None

try:
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
@@ -78,7 +73,7 @@ def download_wiki_page(
urls: list[str],
save_paths: list[str],
save_type: str,
result_jsonl_path: str,
result_log_dir: str,
proc_id: int = 0,
n_proc: int = 1
) -> list[bool]:
@@ -88,7 +83,7 @@
urls (List[str]): List of Wikipedia URLs to download.
save_paths (List[str]): List of paths where each downloaded file will be saved.
save_type (str): File type to save each page as ('pdf' or 'png').
result_jsonl_path (str): Path to the JSONL file where download results will be logged.
result_log_dir (str): Path to the directory where the download results will be logged.
proc_id (int, optional): Process ID for parallel processing. Defaults to 0.
n_proc (int, optional): Total number of processes running in parallel. Defaults to 1.
@@ -99,23 +94,29 @@
all_args = [(i, total, url, str(save_path), save_type, proc_id)
for i, (url, save_path) in enumerate(zip(urls, save_paths))]

# create log directory if it doesn't exist
log_dir = Path(result_log_dir)
log_dir.mkdir(parents=True, exist_ok=True)

pbar = tqdm(total=len(all_args), ncols=100, disable=not (proc_id == 0))

results = []
n_downloaded = 0

# Log results to a JSONL file
with jsonlines.open(result_jsonl_path, 'w') as writer:
for args in all_args:
downloaded, error = _download_wiki_page(args)
for args in all_args:
downloaded, error = _download_wiki_page(args)

if downloaded:
n_downloaded += 1
if downloaded:
n_downloaded += 1

pbar.set_description(f"Process: {proc_id}/{n_proc} - Downloaded: {n_downloaded}/{total}")
pbar.update(1)
pbar.set_description(f"Process: {proc_id}/{n_proc} - Downloaded: {n_downloaded}/{total}")
pbar.update(1)

results.append(downloaded)
results.append(downloaded)

# Write to process-specific log file
proc_result_path = log_dir / f'process_{proc_id}_{n_proc}.jsonl'
with jsonlines.open(proc_result_path, mode='a') as writer:
writer.write({
'downloaded': downloaded,
'args': [arg if not isinstance(arg, Path) else str(arg) for arg in args],
