This is a collection of Python scripts that will:
- Clean your bookmark files of non-ASCII characters with cleanr.py. I use this because my collection of exported bookmarks always seems to contain some bad Unicode characters, and this script fixes that. If you don't have these issues then don't use it, but if you do, make sure to run it first.
- Dedupe, fetch url descriptions and url images for CSV files using csvrl.py
- Dedupe and fetch url descriptions for bookmark.html files using htmrl.py
- Dedupe and fetch url descriptions for Raindrop.io exported html files using rdurl.py
- Dedupe and fetch url descriptions for any markdown file using mdurl.py
The File Cleanup Utility is a Python script that removes non-ASCII characters from various types of files including CSV, HTML, and Markdown. It normalizes non-ASCII characters to their closest ASCII equivalents or removes them entirely.
- Support for multiple file types: CSV, HTML, and Markdown.
- Logging: Detailed logging to a `.log` file.
- User-Friendly: Interactive command-line interface for easy usage.
- Statistics: Shows the total number of changes made during the process.
- Python 3.x
- BeautifulSoup4 (`pip install beautifulsoup4`)
- Clone the repository or download the script to your local machine.
- Make sure Python is installed.
- Install the required Python packages: `pip install beautifulsoup4`
Run the script in your terminal:
`python file_cleanup.py`
You will be prompted to enter the file path and specify the file type (csv, html, md).
- `remove_non_ascii(text: str) -> Tuple[str, int]` - Replaces non-ASCII characters in a given text string with their closest ASCII equivalent or removes them. Returns a tuple with the new text and the number of changes made.
- `process_csv(file_path: str) -> int` - Processes a CSV file to remove non-ASCII characters. Returns the total number of changes made.
- `process_html(file_path: str) -> int` - Processes an HTML file to remove non-ASCII characters using BeautifulSoup. Returns the total number of changes made.
- `process_md(file_path: str) -> int` - Processes a Markdown file to remove non-ASCII characters. Returns the total number of changes made.
- `main()` - The main function, which orchestrates the file cleanup process based on user input.
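As a rough illustration of the behavior `remove_non_ascii` describes (a sketch, not the script's actual implementation), the normalize-or-drop step can be done with the standard library's `unicodedata`:

```python
import unicodedata
from typing import Tuple

def remove_non_ascii(text: str) -> Tuple[str, int]:
    """Map each non-ASCII character to its closest ASCII equivalent,
    or drop it entirely, counting every change made."""
    out = []
    changes = 0
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        changes += 1
        # NFKD splits e.g. 'é' into 'e' plus a combining accent;
        # keeping only the ASCII parts yields the closest equivalent.
        out.append("".join(c for c in unicodedata.normalize("NFKD", ch)
                           if ord(c) < 128))
    return "".join(out), changes
```

Characters with no ASCII decomposition (e.g. arrows or emoji) are simply removed.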
Logs are saved in a file named `file_cleanup.log` in the same directory as the script.
This script reads a CSV file containing bookmark information, processes the URLs, and writes the updated data to a new CSV file. It performs the following tasks:
- Normalizes the URLs to remove redundant protocols.
- Removes duplicate URLs.
- Fetches book descriptions from the URLs using BeautifulSoup.
- Fetches book cover images using BeautifulSoup.
- Writes the updated data to a new CSV file.
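The normalize-and-dedupe steps above can be sketched as follows. This is a simplified stand-in for what csvrl.py does, and the column name `url` is an assumption:

```python
import csv
import io

def dedupe_csv_rows(csv_text: str, url_column: str = "url"):
    """Keep the first row for each URL, treating protocol and
    trailing-slash variants of the same address as duplicates."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    rows = []
    for row in reader:
        key = row[url_column].strip().lower()
        for prefix in ("https://", "http://"):
            if key.startswith(prefix):
                key = key[len(prefix):]
        key = key.rstrip("/")
        if key not in seen:
            seen.add(key)
            rows.append(row)
    return rows
```

For example, `http://example.com/` and `https://example.com` collapse to the same key, so only the first row survives.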
To use the script, follow these steps:
- Save the script wherever you wish.
- Open a terminal or command prompt.
- Navigate to the directory where the script is saved.
- Run the script with the following command:
`python csvrl.py`
- The script will prompt you to enter the path to the CSV file.
- After providing the file path, the script will process the URLs, fetch descriptions and cover images, and write the updated data to a new CSV file.
- Real-time logging appears in the console as the script is running.
- It also creates a log file called `url_processing.log`.
This script processes an HTML bookmark file to deduplicate URLs and fetch missing descriptions.
To use this script:
- Ensure you have Python 3 and the required modules installed:
  - BeautifulSoup
  - Requests
  - Logging
- Save the `htmrl.py` script and run: `python htmrl.py`
- Enter the path to your HTML bookmark file when prompted.
- The script will process the file, removing duplicates and fetching descriptions.
- Updated output will be saved to a new HTML file.
- Progress and statistics will be logged to the console and `bookmark_processing.log`.
- Normalizes URLs to remove duplicate protocols
- Removes duplicate bookmark URLs
- Fetches missing descriptions using Requests
- Writes updated bookmarks to a new HTML file
- Provides statistics and logging
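The script itself fetches pages with Requests and parses them with BeautifulSoup; purely to illustrate the parsing step, here is a dependency-free sketch using the standard library's `html.parser`:

```python
from html.parser import HTMLParser

class MetaDescriptionParser(HTMLParser):
    """Collects the content of a <meta name="description" content="..."> tag."""
    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # Only record the first description meta tag encountered.
        if tag == "meta" and self.description is None:
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description":
                self.description = attr.get("content")

def extract_description(html: str):
    parser = MetaDescriptionParser()
    parser.feed(html)
    return parser.description  # None if the page has no description tag
```

In the real script, the HTML string would come from `requests.get(url).text`, with errors logged rather than raised.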
- Python 3
- BeautifulSoup
- Requests
- Logging
`mdurl.py` is a Python script that helps you manage URLs in your markdown files. It can fetch the description of a URL and normalize the URL for consistency.
- Fetch Description - This script can fetch the description of a URL by sending a GET request to the URL and parsing the HTML response to find the meta description tag.
- Normalize URL - This script can normalize a URL by converting it to lowercase and removing the 'http://' or 'https://' prefix and trailing slashes.
- `fetch_description(url)` - Sends a GET request to the provided URL and parses the HTML response to find the meta description tag. If the description is found, it is returned. If any error occurs during this process, it is logged and None is returned.
- `normalize_url(url)` - Normalizes the provided URL by converting it to lowercase, removing the 'http://' or 'https://' prefix, and removing any trailing slashes.
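A minimal sketch of the normalization described above (the actual `normalize_url` in mdurl.py may differ in detail):

```python
def normalize_url(url: str) -> str:
    """Lowercase the URL, strip the http(s) scheme, and drop trailing slashes."""
    url = url.strip().lower()
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    return url.rstrip("/")
```

With this, `HTTPS://Example.com/Path/` and `http://example.com/path` both normalize to `example.com/path`, so they compare as equal.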
This script depends on the following Python libraries:
- `requests` - For sending HTTP requests.
- `beautifulsoup4` (BeautifulSoup) - For parsing HTML responses.

Make sure to install these dependencies using pip: `pip install requests beautifulsoup4`
This Python script is designed to process an HTML bookmark file, removing duplicate bookmarks and fetching missing descriptions for bookmarks. The script uses the following Python modules:
- `bs4` (BeautifulSoup) - For parsing the HTML bookmark file
- `requests` - For fetching descriptions for bookmarks
- `logging` - For logging errors and information
- `os` - For file handling
To use this script, follow these steps:
- Ensure that the required Python modules are installed (`bs4`, `requests`, `logging`, and `os`).
- Run the script.
- When prompted, enter the path to the HTML bookmark file to be processed.
- The script will remove duplicate bookmarks and fetch missing descriptions, and save the updated HTML to a new file in the same directory as the original file.
- The script begins by importing the required Python modules (`bs4`, `requests`, `logging`, and `os`).
- Next, the script initializes logging, statistics counters, and a function to normalize URLs.
- The user is prompted for the path to the HTML bookmark file to be processed.
- The script then reads the HTML file using `BeautifulSoup`, and initializes a dictionary to hold unique URLs.
- The script iterates through all `<DT>` tags containing bookmarks, and extracts the URL and description (if available) for each bookmark.
- Duplicate URLs are removed, and missing descriptions are fetched using `requests`, if possible.
- Finally, the updated HTML is saved to a new file in the same directory as the original file, and statistics are displayed to the user.
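The iterate-and-dedupe steps above can be sketched as follows. This sketch uses the standard library's `html.parser` rather than BeautifulSoup, and keeps the first occurrence of each normalized URL:

```python
from html.parser import HTMLParser

class BookmarkParser(HTMLParser):
    """Collects (url, title) pairs from <A HREF=...> anchors in a bookmark file."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def dedupe_bookmarks(html: str):
    """Return one (url, title) pair per normalized URL, first occurrence wins."""
    parser = BookmarkParser()
    parser.feed(html)
    unique = {}
    for url, title in parser.links:
        key = url.lower().rstrip("/")
        unique.setdefault(key, (url, title))
    return list(unique.values())
```

The real script additionally fetches a missing description for each surviving bookmark before writing the new file.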
This Python script is designed to fetch the description and title of a list of URLs and save them in a markdown file. The script uses the following Python modules:
- `requests` - For sending HTTP requests and fetching the raw HTML content of the URLs
- `re` - For extracting URLs from user input using regular expressions
- `bs4` (BeautifulSoup) - For parsing the HTML content and extracting the description and title
To use this script, follow these steps:
- Ensure that the required Python modules are installed (`requests`, `re`, and `bs4`).
- Run the script.
- When prompted, enter the URLs you want to fetch descriptions and titles for. URLs can be provided in plain format or in markdown format.
- The script will extract the URLs using regular expressions and fetch the description and title for each URL.
- The fetched information will be saved in a markdown file named `link_descriptions.md` in the same directory as the script.
The script begins by importing the required Python modules (`requests`, `re`, and `bs4`).
Next, the script defines a function `fetch_description_and_title(url)` to fetch the description and title of a given URL. This function sends a GET request to fetch the raw HTML content, parses the HTML using `BeautifulSoup`, and extracts the description and title using specific meta tags and title tags.
The script then defines the `main()` function. This function prompts the user for a bulk list of URLs, extracts the URLs using regular expressions, and iterates through each URL to fetch the description and title using the `fetch_description_and_title()` function. The fetched information is then written to a markdown file.
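The regular-expression extraction and markdown output might look something like this; the pattern and entry format here are illustrative, not the script's exact ones:

```python
import re

# Matches http(s) URLs in plain text or inside markdown links like [title](url).
URL_RE = re.compile(r"https?://[^\s\)\]>\"']+")

def extract_urls(text: str):
    """Pull all URLs out of a bulk block of pasted text."""
    return URL_RE.findall(text)

def to_markdown_entry(url: str, title: str, description: str) -> str:
    # One bullet per link, with the description indented beneath it.
    return f"- [{title}]({url})\n  {description}\n"
```

Each entry returned by `to_markdown_entry` would then be appended to `link_descriptions.md`.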
Finally, the `main()` function is called if the script is run directly, executing the entire process.