Commit
Updated description
Showing 1 changed file with 164 additions and 53 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,95 +1,206 @@ | ||
# Image Analysis Script for Web Accessibility | ||
|
||
# Alt-Text Scan Tool | ||
|
||
A Python script for scanning websites to evaluate the quality of `alt` text in images and generate actionable accessibility suggestions. | ||
|
||
--- | ||
|
||
## Overview | ||
|
||
This script analyzes images on a website for accessibility compliance. It identifies issues with alt text and other metadata, providing suggestions to improve accessibility. The script can parse sitemaps or crawl the website manually if a sitemap is unavailable or invalid. | ||
This tool crawls websites or parses their sitemap to collect images and analyze their `alt` attributes for accessibility compliance. It generates a CSV file summarizing issues, suggestions, and metadata for each image. | ||
|
||
--- | ||
|
||
## Features | ||
|
||
• Crawl websites for image data using sitemaps or manual crawling. | ||
• Analyze image metadata, including alt text, title, and size. | ||
• Generate detailed suggestions for improving alt text. | ||
• Exclude non-HTML content (e.g., PDFs, videos). | ||
• Output results to a CSV file with a summary of findings and recommendations. | ||
- **Crawl Websites**: Analyze images from websites either by crawling pages directly or parsing their sitemap. | ||
- **Accessibility Checks**: Detect missing, meaningless, or excessively long `alt` text. | ||
- **Readability Analysis**: Assess readability for `alt` text over 25 characters. | ||
- **Rate Limiting**: Throttle requests to avoid overloading servers. | ||
- **CSV Reports**: Save analysis results to a CSV file. | ||
- **New Features**: | ||
- Added support for crawling without relying on `sitemap.xml` using the `--crawl_only` option. | ||
- Readability analysis is now performed only on `alt` text longer than 25 characters. | ||
- Improved handling of nested sitemaps with recursive parsing. | ||
- Enhanced suggestions for WCAG compliance, including identifying decorative images and overly verbose `alt` text. | ||
|
||
--- | ||
|
||
## Installation | ||
|
||
### Prerequisites | ||
|
||
Ensure you have Python 3.10 or later installed. Install the following Python libraries: | ||
1. Python 3.10 or later. | ||
2. Install the required Python libraries: | ||
|
||
```bash | ||
pip install -r requirements.txt | ||
``` | ||
|
||
pip install requests beautifulsoup4 pandas tqdm textblob readability-lxml textstat | ||
**Required Libraries**: | ||
- `requests` | ||
- `beautifulsoup4` (imported as `bs4`) | ||
- `pandas` | ||
- `tqdm` | ||
- `textstat` | ||
- `textblob` | ||
- `readability-lxml` (imported as `readability`) | ||
|
||
--- | ||
|
||
## Usage | ||
|
||
Running the Script | ||
### Command-Line Arguments | ||
|
||
| Argument | Description | | ||
|-----------------------|-----------------------------------------------------------------------------| | ||
| `domain` | The base domain to analyze (e.g., `https://example.com`). | | ||
| `--sample_size` | Number of URLs to sample from the sitemap (default: 100). | | ||
| `--throttle` | Throttle delay in seconds between requests (default: 1). | | ||
| `--crawl_only` | Skip sitemap parsing and start crawling directly (default: `False`). | | ||
|
||
To run the script, use the following command: | ||
--- | ||
|
||
python3.10 alt_scan.py <domain> --sample_size <number> | ||
### Examples | ||
|
||
## Parameters | ||
#### 1. Analyze a Site Using the Sitemap | ||
```bash | ||
python alt_text_scan.py https://example.com --sample_size 200 --throttle 2 | ||
``` | ||
|
||
• <domain>: The starting URL for the website (e.g., https://example.com). | ||
• --sample_size: Maximum number of unique URLs to crawl (default: 100). | ||
This will: | ||
- Parse `https://example.com/sitemap.xml` to find URLs. | ||
- Sample up to 200 URLs for analysis. | ||
- Throttle requests with a 2-second delay. | ||
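The sampling and throttling steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `sample_urls` and `throttled` are hypothetical helper names, and the URLs are made up.

```python
import random
import time


def sample_urls(urls, sample_size=100, seed=None):
    """Return up to `sample_size` distinct URLs chosen at random (hypothetical helper)."""
    unique = list(dict.fromkeys(urls))  # de-duplicate while keeping first-seen order
    rng = random.Random(seed)
    return rng.sample(unique, min(sample_size, len(unique)))


def throttled(items, delay=1.0, sleep=time.sleep):
    """Yield items one at a time, pausing `delay` seconds between consecutive items."""
    for index, item in enumerate(items):
        if index > 0 and delay > 0:
            sleep(delay)
        yield item


if __name__ == "__main__":
    # Made-up URLs standing in for the entries found in a sitemap.
    sitemap_urls = [f"https://example.com/page-{n}" for n in range(500)]
    for url in throttled(sample_urls(sitemap_urls, sample_size=3, seed=0), delay=0.1):
        print("would fetch", url)
```

Passing the sleep function in as a parameter keeps the delay easy to stub out in tests.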
|
||
Example | ||
#### 2. Crawl a Site Directly | ||
```bash | ||
python alt_text_scan.py https://example.com --sample_size 200 --throttle 2 --crawl_only | ||
``` | ||
|
||
python3.10 alt_scan.py https://www.whitehouse.gov --sample_size 1000 | ||
This will: | ||
- Bypass `sitemap.xml`. | ||
- Crawl the site starting from the homepage. | ||
- Analyze up to 200 pages. | ||
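A direct crawl of this kind is commonly a breadth-first walk over same-host links. The sketch below assumes that approach and may differ from the script's own `crawl_site`. The `crawl` function and its `fetch` parameter (a callable that maps a URL to HTML text, or `None` on failure) are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of same-host pages, visiting at most `max_pages`.

    `fetch` is a hypothetical callable: URL -> HTML string, or None on failure.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # skip pages that could not be retrieved
        visited.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop #fragments before de-duplicating.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

In practice `fetch` could wrap `requests.get` with a timeout and the throttle delay. Injecting it lets the traversal be exercised against an in-memory dictionary of pages.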
|
||
This command crawls up to 1,000 unique pages on the specified domain and analyzes the images found. | ||
--- | ||
|
||
## Output | ||
|
||
The script generates two files: | ||
1. CSV File: <domain>_images.csv | ||
Contains detailed image metadata and suggestions for improving accessibility. | ||
2. Console Output: | ||
Provides progress updates and a summary of findings. | ||
The script generates a CSV file named after the domain being analyzed, e.g., `example.com_images.csv`. Each row corresponds to an image and contains: | ||
|
||
| Column | Description | | ||
|--------------------|----------------------------------------------------------------------------------| | ||
| `Image_url` | The URL of the image. | | ||
| `Alt_text` | The `alt` attribute of the image (if available). | | ||
| `Title` | The `title` attribute of the image (if available). | | ||
| `Count` | The number of times the image appears. | | ||
| `Source_URLs` | Pages where the image was found. | | ||
| `Size (KB)` | The size of the image in kilobytes. | | ||
| `Suggestions` | Recommendations for improving the `alt` text based on WCAG standards. | | ||
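As a rough illustration of the report layout, the sketch below writes rows with the columns listed above using the standard `csv` module. The script itself imports `pandas` and may build the file differently. `report_filename` and `write_report` are hypothetical names.

```python
import csv
import io
from urllib.parse import urlparse

# Column names taken from the table above.
COLUMNS = ["Image_url", "Alt_text", "Title", "Count", "Source_URLs", "Size (KB)", "Suggestions"]


def report_filename(domain):
    """Derive a report name such as example.com_images.csv from a domain URL."""
    host = urlparse(domain).netloc or domain  # fall back when no scheme is given
    return f"{host}_images.csv"


def write_report(rows, fileobj):
    """Write one CSV row per image; missing fields are left blank."""
    writer = csv.DictWriter(fileobj, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow({column: row.get(column, "") for column in COLUMNS})


if __name__ == "__main__":
    buffer = io.StringIO()
    write_report([{"Image_url": "https://example.com/logo.png", "Count": 2}], buffer)
    print(report_filename("https://example.com"))
    print(buffer.getvalue())
```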
|
||
--- | ||
|
||
## Key Accessibility Checks | ||
|
||
1. **Missing or Empty `alt` Text**: | ||
- Detects images with no `alt` attribute or empty `alt` values. | ||
- Suggests adding meaningful descriptions. | ||
|
||
CSV Columns | ||
2. **Readability Analysis**: | ||
- Evaluates readability for `alt` text over 25 characters. | ||
- Suggests simplifying overly complex text. | ||
|
||
• Image_name: The file name of the image. | ||
• Image_url: The full URL of the image. | ||
• Alt_text: The alt text associated with the image. | ||
• Title: The title attribute of the image (if any). | ||
• Count: Number of occurrences of the image. | ||
• Source_URLs: Pages where the image is found. | ||
• Size (KB): Approximate size of the image in kilobytes. | ||
• Load_Time (s): Time taken to fetch the image. | ||
• Suggestions: Accessibility improvement recommendations. | ||
3. **Text Length**: | ||
- Flags `alt` text under 25 characters as too short. | ||
- Flags `alt` text over 250 characters as too verbose. | ||
|
||
## Features of Analysis | ||
4. **Meaningless `alt` Text**: | ||
- Identifies generic or placeholder `alt` text (e.g., "image of", "placeholder"). | ||
|
||
The script provides actionable suggestions, including: | ||
• “Image hidden with no semantic value” if an image is marked with aria-hidden or hidden attributes. | ||
• “No alt text provided” for images without alt attributes. | ||
• “Check if the SVG file includes a title” for SVGs without meaningful descriptions. | ||
• “Decorative image” for images with empty alt attributes. | ||
• Suggestions to avoid unnecessary phrases like “A picture of” in alt text. | ||
• Readability checks using a customizable threshold. | ||
5. **Large Image Files**: | ||
- Highlights images over 250 KB as candidates for optimization. | ||
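The length, placeholder, and file-size checks above can be sketched as a single function. This is an assumption-laden illustration rather than the script's `analyze_alt_text`: the `suggest` function, the phrase list, and the message wording are hypothetical, and the readability check is left out to keep the example free of third-party dependencies.

```python
# Thresholds taken from the checks described above.
MIN_ALT_CHARS = 25
MAX_ALT_CHARS = 250
MAX_IMAGE_KB = 250

# Illustrative phrase list; the script's actual list may differ.
MEANINGLESS_PHRASES = ("image of", "picture of", "placeholder")


def suggest(alt_text, size_kb=0.0):
    """Return a list of suggestion strings for one image (hypothetical helper)."""
    suggestions = []
    text = (alt_text or "").strip()
    if not text:
        suggestions.append("No alt text provided; add a description or confirm the image is decorative.")
    else:
        lowered = text.lower()
        if any(phrase in lowered for phrase in MEANINGLESS_PHRASES):
            suggestions.append("Alt text contains a generic or placeholder phrase.")
        if len(text) < MIN_ALT_CHARS:
            suggestions.append("Alt text may be too short to be descriptive.")
        elif len(text) > MAX_ALT_CHARS:
            suggestions.append("Alt text is too verbose; consider shortening it.")
    if size_kb > MAX_IMAGE_KB:
        suggestions.append("Image is over 250 KB; consider optimizing it.")
    return suggestions


if __name__ == "__main__":
    for alt, size in [(None, 10), ("Image of a dog", 40), ("x" * 300, 300)]:
        print(repr(alt)[:30], size, suggest(alt, size))
```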
|
||
## Troubleshooting | ||
--- | ||
|
||
Invalid or Missing Sitemap | ||
## Known Limitations | ||
|
||
If the sitemap cannot be parsed or is invalid, the script falls back to crawling the website starting from the homepage. | ||
1. **403 Forbidden Errors**: Some servers may block automated requests. Use `--throttle` to reduce request frequency or adjust headers in the script. | ||
2. **Large Sitemaps**: Parsing deeply nested sitemaps may exceed the recursion depth limit. Use the `--crawl_only` option if necessary. | ||
3. **CAPTCHA Restrictions**: Servers using CAPTCHAs or aggressive rate-limiting may block requests. | ||
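To illustrate the nested-sitemap limitation, here is a minimal sketch of depth-limited recursive parsing. It assumes the standard sitemap XML namespace and a hypothetical `fetch` callable (URL to XML text, or `None` on failure). `collect_urls` is an illustrative name, not the script's `parse_sitemap`.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_urls(sitemap_url, fetch, depth=3):
    """Return page URLs, following nested sitemap indexes up to `depth` levels."""
    if depth <= 0:
        return []  # depth budget exhausted: deeper sitemaps are ignored
    xml_text = fetch(sitemap_url)
    if xml_text is None:
        return []
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []  # treat an invalid sitemap as empty
    locations = [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
    if root.tag == f"{SITEMAP_NS}sitemapindex":
        # Each <loc> in an index points at another sitemap; recurse with less depth.
        urls = []
        for child_url in locations:
            urls.extend(collect_urls(child_url, fetch, depth - 1))
        return urls
    return locations
```

With a limit like this, anything nested deeper than `depth` levels is silently skipped, which is one situation where `--crawl_only` can be the more practical route.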
|
||
Excluded Files | ||
--- | ||
|
||
The script excludes non-HTML content, such as: | ||
• Documents (.pdf, .docx, etc.) | ||
• Media files (.jpg, .mp4, etc.) | ||
• Archives (.zip, .rar, etc.) | ||
## Script | ||
|
||
## Logging Issues | ||
Below is an outline of the Python script, with function bodies elided: | ||
|
||
The script outputs warnings for any URLs it fails to process. | ||
```python | ||
import os | ||
import requests | ||
from bs4 import BeautifulSoup | ||
import pandas as pd | ||
from urllib.parse import urljoin, urlparse, urlunparse | ||
import argparse | ||
from tqdm import tqdm | ||
import xml.etree.ElementTree as ET | ||
import random | ||
import time | ||
from collections import defaultdict | ||
import re | ||
from textblob import TextBlob | ||
from readability.readability import Document | ||
from textstat import text_standard | ||
from datetime import datetime | ||
|
||
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif', '.svg', '.tiff', '.avif', '.webp') | ||
|
||
# Function definitions | ||
def is_valid_image(url): | ||
    ... | ||
|
||
def parse_sitemap(sitemap_url, base_domain, headers=None, depth=3): | ||
    ... | ||
|
||
def crawl_site(start_url, max_pages=100, throttle=0): | ||
    ... | ||
|
||
def get_relative_url(url, base_domain): | ||
    ... | ||
|
||
def get_images(domain, sample_size=100, throttle=0, crawl_only=False): | ||
    ... | ||
|
||
def analyze_alt_text(images_df, domain, readability_threshold=8): | ||
    ... | ||
|
||
def process_image(img_url, img, page_url, domain, images_data): | ||
    ... | ||
|
||
def crawl_page(url, images_data, url_progress, domain, throttle, consecutive_errors): | ||
    ... | ||
|
||
# Main function | ||
def main(domain, sample_size=100, throttle=0, crawl_only=False): | ||
    ... | ||
|
||
if __name__ == '__main__': | ||
    parser = argparse.ArgumentParser(description="Crawl a website and collect image data with alt text.") | ||
    parser.add_argument('domain', type=str, help='The domain to crawl (e.g., https://example.com)') | ||
    parser.add_argument('--sample_size', type=int, default=100, help='Number of URLs to sample from the sitemap') | ||
    parser.add_argument('--throttle', type=int, default=1, help='Throttle delay (in seconds) between requests') | ||
    parser.add_argument('--crawl_only', action='store_true', help='Start crawling directly without using the sitemap') | ||
    args = parser.parse_args() | ||
    main(args.domain, args.sample_size, throttle=args.throttle, crawl_only=args.crawl_only) | ||
``` | ||
|
||
--- | ||
|
||
## Contributing | ||
|
||
Feel free to submit issues or pull requests to improve this script. | ||
Contributions are welcome! Please open an issue or submit a pull request on [GitHub](https://github.com/CivicActions/site-evaluation-tools). | ||
|
||
--- | ||
|
||
## License | ||
|
||
This project is open-source and available under the MIT License. | ||
This project is licensed under the MIT License. |