selective-link-checker

This program recursively scans a website and outputs all found sub-pages and their broken links into two different CSV files.

Introduction

This project is a JavaScript + Node.js console script that finds broken links in a website and its sub-pages and outputs them to a CSV file. I worked on this project as part of my internship with the Harvard T.H. Chan School of Public Health, in order to find outdated links and information. The script uses Steven Vachon's broken-link-checker as a dependency to scrape pages, navigate URLs, and add links to the queue.
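
As a rough illustration of how that dependency can be driven, here is a minimal sketch, not this repository's actual code: the start URL and keywords are placeholders, and the prompt and CSV-export logic is omitted, while the option and event names follow broken-link-checker's documented API.

```js
// Sketch: crawl a site with broken-link-checker's SiteChecker and collect broken links per page.
const { SiteChecker } = require("broken-link-checker");

const startUrl = "https://www.example.com/";   // assumption: starting URL supplied by the user
const brokenByPage = new Map();                // page URL -> array of broken link descriptions

const siteChecker = new SiteChecker(
  {
    excludedKeywords: ["logout", "calendar"],  // assumption: keywords entered by the user
    excludeExternalLinks: false
  },
  {
    link(result) {
      // Called once per checked link; record it if it is broken.
      if (result.broken) {
        const page = result.base.resolved;
        if (!brokenByPage.has(page)) brokenByPage.set(page, []);
        brokenByPage.get(page).push(`${result.url.resolved} (${result.brokenReason})`);
      }
    },
    end() {
      // In the real script, these rows are written to the two CSV files instead.
      for (const [page, links] of brokenByPage) {
        console.log(page, links.join("; "));
      }
    }
  }
);

siteChecker.enqueue(startUrl);
```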

Instructions

You do not need an IDE to use this application, but you do need Node.js and npm. You can find the latest installers on the Node.js website. Once you have installed Node.js and npm, follow these steps:

  1. Download or clone this repository into a folder on your computer.
  2. Navigate to the folder that contains the repository's files.
  3. Open a terminal on that folder:
    • Windows: Shift + Right-click in File Explorer and select "Open command window here" or "Open PowerShell window here".
    • macOS: Open System Preferences and go to Keyboard > Shortcuts > Services. Find "New Terminal at Folder" and check its box. Then right-click the folder and select "New Terminal at Folder".
  4. In the terminal, type npm ci and press Enter to install dependencies. Do not use npm install, as that would modify the package-lock.json file and could install the wrong dependency versions.
  5. In the terminal, type npm run slc and press Enter to start the application.
  6. Type or paste the starting URL for the website you would like to check, e.g. https://www.korg.com/us/products/software/kc_triton/, and press Enter. NOTE: Only URLs that contain this starting URL are added to the queue.
  7. Type the excluded keywords, pressing Enter after each keyword. NOTE: The program checks each URL for these keywords as substrings and will not enqueue URLs that contain them (see the sketch after this list).
  8. After the last keyword, press Enter a second time to start the search.
  9. Once the search is complete, type or paste the file path for the CSV files to be exported. NOTE: This should include both the path and the filename and should end in '.csv', e.g. C:\Users\xyzes\Documents\BrokenLinks.csv. The first file will contain the broken links found on each sub-page, and the second file will contain all the pages that were in the queue during the search.
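
The queueing rules from steps 6 and 7 amount to a simple substring filter. The helper below is a hypothetical illustration of that behavior; the function name and signature are not taken from the repository.

```js
// Hypothetical helper: a URL is only queued if it contains the starting URL,
// and never if it contains one of the excluded keywords.
function shouldEnqueue(url, startUrl, excludedKeywords) {
  if (!url.includes(startUrl)) return false;                 // stay within the original site section
  return !excludedKeywords.some((kw) => url.includes(kw));   // skip URLs matching any keyword
}

// Example:
// shouldEnqueue("https://www.korg.com/us/products/software/kc_triton/specifications.php",
//               "https://www.korg.com/us/products/software/kc_triton/", ["pdf", "login"]);
// -> true
```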

Roadmap

  1. A user-friendly web interface to interact with the application.
  2. WordPress integration to output information about title, authors, editors, and timelines.
