diff --git a/alt-text-scan.py.md b/alt-text-scan.py.md
index 1820ac6..9247912 100644
--- a/alt-text-scan.py.md
+++ b/alt-text-scan.py.md
@@ -1,95 +1,206 @@
-# Image Analysis Script for Web Accessibility
+# Alt-Text Scan Tool
+
+A Python script that scans a website, evaluates the quality of the `alt` text on its images, and generates actionable accessibility suggestions.
+
+---
 
 ## Overview
 
-This script analyzes images on a website for accessibility compliance. It identifies issues with alt text and other metadata, providing suggestions to improve accessibility. The script can parse sitemaps or crawl the website manually if a sitemap is unavailable or invalid.
+This tool crawls a website, or parses its sitemap, to collect images and analyze their `alt` attributes for accessibility compliance. It generates a CSV file summarizing issues, suggestions, and metadata for each image.
+
+---
 
 ## Features
 
- • Crawl websites for image data using sitemaps or manual crawling.
- • Analyze image metadata, including alt text, title, and size.
- • Generate detailed suggestions for improving alt text.
- • Exclude non-HTML content (e.g., PDFs, videos).
- • Output results to a CSV file with a summary of findings and recommendations.
+- **Crawl Websites**: Analyze images from websites either by crawling pages directly or parsing their sitemap.
+- **Accessibility Checks**: Detect missing, meaningless, or excessively long `alt` text.
+- **Readability Analysis**: Assess readability for `alt` text over 25 characters.
+- **Rate Limiting**: Throttle requests to avoid overloading servers.
+- **CSV Reports**: Save analysis results to a CSV file.
+- **New Features**:
+  - Added support for crawling without relying on `sitemap.xml` via the `--crawl_only` option.
+  - Readability analysis is now performed only on `alt` text longer than 25 characters.
+  - Improved handling of nested sitemaps with recursive parsing.
+  - Enhanced suggestions for WCAG compliance, including identifying decorative images and overly verbose `alt` text.
+
+---
 
 ## Installation
 
 ### Prerequisites
 
-Ensure you have Python 3.10 or later installed. Install the following Python libraries:
+1. Python 3.10 or later.
+2. Install the required Python libraries:
+
+   ```bash
+   pip install -r requirements.txt
+   ```
 
-pip install requests beautifulsoup4 pandas tqdm textblob readability-lxml textstat
+   **Required Libraries**:
+   - `requests`
+   - `bs4` (BeautifulSoup)
+   - `pandas`
+   - `tqdm`
+   - `textstat`
+   - `textblob`
+   - `readability-lxml`
+
+---
 
 ## Usage
 
-Running the Script
+### Command-Line Arguments
+
+| Argument        | Description                                                            |
+|-----------------|------------------------------------------------------------------------|
+| `domain`        | The base domain to analyze (e.g., `https://example.com`).              |
+| `--sample_size` | Number of URLs to sample from the sitemap (default: 100).              |
+| `--throttle`    | Throttle delay in seconds between requests (default: 1).               |
+| `--crawl_only`  | Skip sitemap parsing and start crawling directly (default: `False`).   |
 
-To run the script, use the following command:
+---
 
-python3.10 alt_scan.py <domain> --sample_size <number>
+### Examples
 
-## Parameters
+#### 1. Analyze a Site Using the Sitemap
+```bash
+python alt-text-scan.py https://example.com --sample_size 200 --throttle 2
+```
 
- • <domain>: The starting URL for the website (e.g., https://example.com).
- • --sample_size: Maximum number of unique URLs to crawl (default: 100).
+This will:
+- Parse `https://example.com/sitemap.xml` to find URLs.
+- Sample up to 200 URLs for analysis.
+- Throttle requests with a 2-second delay.
 
-Example
+#### 2. Crawl a Site Directly
+```bash
+python alt-text-scan.py https://example.com --sample_size 200 --throttle 2 --crawl_only
+```
 
-python3.10 alt_scan.py https://www.whitehouse.gov --sample_size 1000
+This will:
+- Bypass `sitemap.xml`.
+- Crawl the site starting from the homepage.
+- Analyze up to 200 pages.
 
-This command crawls up to 1,000 unique pages on the specified domain and analyzes the images found.
+---
 
 ## Output
 
-The script generates two files:
- 1. CSV File: <domain>_images.csv
-Contains detailed image metadata and suggestions for improving accessibility.
- 2. Console Output:
-Provides progress updates and a summary of findings.
+The script generates a CSV file named after the domain being analyzed, e.g., `example.com_images.csv`. Each row corresponds to an image and contains:
+
+| Column             | Description                                                                        |
+|--------------------|------------------------------------------------------------------------------------|
+| `Image_url`        | The URL of the image.                                                              |
+| `Alt_text`         | The `alt` attribute of the image (if available).                                   |
+| `Title`            | The `title` attribute of the image (if available).                                 |
+| `Count`            | The number of times the image appears.                                             |
+| `Source_URLs`      | Pages where the image was found.                                                   |
+| `Size (KB)`        | The size of the image in kilobytes.                                                |
+| `Suggestions`      | Recommendations for improving the `alt` text based on WCAG standards.              |
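+
+Because the report is plain CSV, it is easy to post-process. For example, the rows that received suggestions can be pulled out with pandas (a minimal sketch — the file and column names follow the table above):
+
+```python
+import pandas as pd
+
+# Load the report produced by the scan (named <domain>_images.csv).
+df = pd.read_csv("example.com_images.csv")
+
+# Keep only images that received suggestions, largest files first.
+flagged = df[df["Suggestions"].notna() & (df["Suggestions"].str.strip() != "")]
+flagged = flagged.sort_values("Size (KB)", ascending=False)
+
+print(f"{len(flagged)} of {len(df)} images need attention")
+print(flagged[["Image_url", "Alt_text", "Suggestions"]].head(10).to_string(index=False))
+```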
+
+---
+
+## Key Accessibility Checks
+
+1. **Missing or Empty `alt` Text**:
+   - Detects images with no `alt` attribute or empty `alt` values.
+   - Suggests adding meaningful descriptions.
 
-CSV Columns
+2. **Readability Analysis**:
+   - Evaluates readability for `alt` text over 25 characters.
+   - Suggests simplifying overly complex text.
 
- • Image_name: The file name of the image.
- • Image_url: The full URL of the image.
- • Alt_text: The alt text associated with the image.
- • Title: The title attribute of the image (if any).
- • Count: Number of occurrences of the image.
- • Source_URLs: Pages where the image is found.
- • Size (KB): Approximate size of the image in kilobytes.
- • Load_Time (s): Time taken to fetch the image.
- • Suggestions: Accessibility improvement recommendations.
+3. **Text Length**:
+   - Flags `alt` text under 25 characters as too short.
+   - Flags `alt` text over 250 characters as too verbose.
 
-## Features of Analysis
+4. **Meaningless `alt` Text**:
+   - Identifies generic or placeholder `alt` text (e.g., "image of", "placeholder").
 
-The script provides actionable suggestions, including:
- • “Image hidden with no semantic value” if an image is marked with aria-hidden or hidden attributes.
- • “No alt text provided” for images without alt attributes.
- • “Check if the SVG file includes a title” for SVGs without meaningful descriptions.
- • “Decorative image” for images with empty alt attributes.
- • Suggestions to avoid unnecessary phrases like “A picture of” in alt text.
- • Readability checks using a customizable threshold.
+5. **Large Image Files**:
+   - Highlights images over 250 KB as candidates for optimization.
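+
+Together these checks come down to a few string and readability heuristics. The sketch below shows one way to express them in Python; the thresholds (25/250 characters, grade level 8) come from this README, while the function name and phrase list are illustrative assumptions, not the script's exact code:
+
+```python
+import textstat
+
+# Hypothetical phrase list; the script's own list may differ.
+GENERIC_PHRASES = ("image of", "picture of", "placeholder", "graphic")
+
+def suggest_alt_fixes(alt, readability_threshold=8):
+    """Return suggestions for one alt value (None means the attribute is absent)."""
+    if alt is None or not alt.strip():
+        return ["Missing or empty alt text: add a meaningful description."]
+    text = alt.strip()
+    suggestions = []
+    if any(phrase in text.lower() for phrase in GENERIC_PHRASES):
+        suggestions.append("Alt text looks generic; describe the image's purpose instead.")
+    if len(text) < 25:
+        suggestions.append("Alt text may be too short to be descriptive.")
+    elif len(text) > 250:
+        suggestions.append("Alt text is too verbose; consider shortening it.")
+    # Readability is only checked for alt text longer than 25 characters.
+    if len(text) > 25 and textstat.text_standard(text, float_output=True) > readability_threshold:
+        suggestions.append("Alt text reads as complex; consider simplifying.")
+    return suggestions
+```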
 
-## Troubleshooting
+---
 
-Invalid or Missing Sitemap
+## Known Limitations
 
-If the sitemap cannot be parsed or is invalid, the script falls back to crawling the website starting from the homepage.
+1. **403 Forbidden Errors**: Some servers block automated requests. Use `--throttle` to reduce request frequency, or adjust the request headers in the script.
+2. **Large Sitemaps**: Parsing deeply nested sitemaps may exceed the recursion depth limit. Use the `--crawl_only` option if necessary.
+3. **CAPTCHA Restrictions**: Servers that use CAPTCHAs or aggressive rate limiting may block requests.
 
-Excluded Files
+---
 
-The script excludes non-HTML content, such as:
- • Documents (.pdf, .docx, etc.)
- • Media files (.jpg, .mp4, etc.)
- • Archives (.zip, .rar, etc.)
+## Script
 
-## Logging Issues
+Below is the structure of the Python script (function bodies elided):
 
-The script outputs warnings for any URLs it fails to process.
+```python
+import os
+import requests
+from bs4 import BeautifulSoup
+import pandas as pd
+from urllib.parse import urljoin, urlparse, urlunparse
+import argparse
+from tqdm import tqdm
+import xml.etree.ElementTree as ET
+import random
+import time
+from collections import defaultdict
+import re
+from textblob import TextBlob
+from readability.readability import Document
+from textstat import text_standard
+from datetime import datetime
+
+IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif', '.svg', '.tiff', '.avif', '.webp')
+
+# Function definitions
+def is_valid_image(url):
+    # Filter out URLs that do not point to a supported image type.
+    ...
+
+def parse_sitemap(sitemap_url, base_domain, headers=None, depth=3):
+    # Collect page URLs from the sitemap, recursing into nested sitemaps up to `depth`.
+    ...
+
+def crawl_site(start_url, max_pages=100, throttle=0):
+    # Discover pages by following same-domain links from the start URL.
+    ...
+
+def get_relative_url(url, base_domain):
+    # Normalize a URL relative to the base domain.
+    ...
+
+def get_images(domain, sample_size=100, throttle=0, crawl_only=False):
+    # Gather image data from sampled pages (sitemap-based or crawl-only).
+    ...
+
+def analyze_alt_text(images_df, domain, readability_threshold=8):
+    # Apply the accessibility checks and write the <domain>_images.csv report.
+    ...
+
+def process_image(img_url, img, page_url, domain, images_data):
+    # Record one image occurrence: alt, title, size, and source page.
+    ...
+
+def crawl_page(url, images_data, url_progress, domain, throttle, consecutive_errors):
+    # Fetch a single page and extract its images, honoring the throttle delay.
+    ...
+
+# Main function
+def main(domain, sample_size=100, throttle=0, crawl_only=False):
+    ...
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description="Crawl a website and collect image data with alt text.")
+    parser.add_argument('domain', type=str, help='The domain to crawl (e.g., https://example.com)')
+    parser.add_argument('--sample_size', type=int, default=100, help='Number of URLs to sample from the sitemap')
+    parser.add_argument('--throttle', type=int, default=1, help='Throttle delay (in seconds) between requests')
+    parser.add_argument('--crawl_only', action='store_true', help='Start crawling directly without using the sitemap')
+    args = parser.parse_args()
+    main(args.domain, args.sample_size, throttle=args.throttle, crawl_only=args.crawl_only)
+```
+
+---
 
 ## Contributing
 
-Feel free to submit issues or pull requests to improve this script.
+Contributions are welcome! Please open an issue or submit a pull request on [GitHub](https://github.com/CivicActions/site-evaluation-tools).
+
+---
 
 ## License
 
-This project is open-source and available under the MIT License.
+This project is licensed under the MIT License.