Handle sites that use absolute links and sites that require the final slash in the URL #121

Merged 5 commits on Sep 29, 2023

Commits on Sep 3, 2023

  1. Do the right thing with sites that require the final slash

    Some web sites will return 404 if you fetch a directory without the
    final slash. For example, https://archive.mozilla.org/pub/ works,
    https://archive.mozilla.org/pub does not. We need to do two things to
    accommodate this:
    
    * When processing the root URL of the filesystem, instead of stripping
      off the final slash, just set the offset to ignore it.
    * In the link structure, store the actual URL tail of the link
      separately from its name, final slash and all if there is one, and
      append that instead of the name when constructing the URL for curl.
    jikamens committed Sep 3, 2023 (fc857d6)
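
The two changes in this commit message can be illustrated with a minimal Python sketch. It is not the project's code: the `Link` class, its field names, and the helpers `make_link`, `child_url`, and `root_name_offset` are hypothetical, and the `firefox/` href is only an illustrative value.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str      # display name, without a trailing slash (hypothetical field)
    url_tail: str  # exact tail taken from the href, trailing slash preserved


def make_link(href: str) -> Link:
    """Keep the href as the URL tail and derive the name by dropping one final slash."""
    name = href[:-1] if href.endswith("/") else href
    return Link(name=name, url_tail=href)


def child_url(parent_url: str, link: Link) -> str:
    """Build the URL to fetch by appending the stored tail, not the name."""
    sep = "" if parent_url.endswith("/") else "/"
    return parent_url + sep + link.url_tail


def root_name_offset(root_url: str) -> int:
    """Length of the root prefix, ignoring one final slash without modifying the URL."""
    return len(root_url) - 1 if root_url.endswith("/") else len(root_url)


root = "https://archive.mozilla.org/pub/"
link = make_link("firefox/")
print(link.name)              # name shown to the user
print(child_url(root, link))  # URL that keeps the final slash
```

Keeping the tail separate from the name means the displayed name can stay slash-free while the fetched URL stays exactly as the server expects it.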
  2. Do the right thing with sites that use absolute links

    On some sites, the link to each subfolder is an absolute link rather
    than a relative one. To accommodate this, convert the links from
    absolute to relative before storing them in the link table.
    jikamens committed Sep 3, 2023 (c2a0283)
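
One possible way to convert absolute links to relative ones is sketched below in Python using the standard `urllib.parse` module. The function name `to_relative` and the `example.com` URLs are assumptions made for illustration; the commit itself may do the conversion differently.

```python
from typing import Optional
from urllib.parse import urljoin


def to_relative(page_url: str, href: str) -> Optional[str]:
    """Return href relative to the directory page_url, or None if it lies outside it."""
    # Treat the page as a directory so relative hrefs resolve beneath it.
    base = page_url if page_url.endswith("/") else page_url + "/"
    # urljoin resolves relative hrefs, absolute paths, and full URLs alike.
    absolute = urljoin(base, href)
    if not absolute.startswith(base):
        return None  # link points somewhere other than a child of this directory
    relative = absolute[len(base):]
    return relative or None  # a link back to the directory itself has no tail


page = "https://example.com/pub/"
for href in ("/pub/firefox/", "https://example.com/pub/firefox/", "firefox/", "/other/"):
    print(href, "->", to_relative(page, href))
```

Resolving every href to an absolute URL first and then stripping the directory prefix handles relative and absolute links through the same code path.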
  3. Enabling debugging on command line should enable debug logging

    I believe an appropriate expectation is that if the user enables
    debugging with a command-line flag, then that should also enable
    messages designated as debug messages in the code to be printed.
    jikamens committed Sep 3, 2023 (97b9273)
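
The expectation described in this commit message, that a command-line debug flag also turns on debug-level log messages, might look like the following in Python. The flag names `-d`/`--debug`, the logger name, and the `configure` function are hypothetical and not taken from the project.

```python
import argparse
import logging


def configure(argv):
    """Parse the command line and set the log level from the debug flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--debug", action="store_true",
                        help="enable debugging (hypothetical flag)")
    args = parser.parse_args(argv)

    logger = logging.getLogger("example")
    # The debug flag lowers the threshold so debug messages are emitted too.
    logger.setLevel(logging.DEBUG if args.debug else logging.INFO)
    return logger


logging.basicConfig()
log = configure(["--debug"])
log.debug("printed only because --debug was given")
```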
  4. Commit 53d1c9c
  5. Handle sites that put unencoded characters in URLs that curl dislikes

    Some sites put unencoded characters in their href attributes that
    really should be encoded, most notably spaces. Curl won't accept a URL
    with a space in it, and perhaps other such characters as well. Address
    this by percent-encoding such characters in URLs before feeding them
    to curl.
    jikamens committed Sep 3, 2023 (5f61aac)
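
As a rough illustration of encoding a URL before handing it to an HTTP client, here is a short Python sketch using `urllib.parse.quote`. The function name `encode_url` and the choice of characters left untouched are assumptions; the commit may select characters differently.

```python
from urllib.parse import quote

# Characters left alone: URL delimiters and sub-delimiters, plus '%' so that
# escapes already present in the href are not encoded a second time.
SAFE_CHARS = ":/?#[]@!$&'()*+,;=%"


def encode_url(url: str) -> str:
    """Percent-encode characters such as spaces that may not appear raw in a URL."""
    return quote(url, safe=SAFE_CHARS)


print(encode_url("https://example.com/pub/My File.txt"))
# prints https://example.com/pub/My%20File.txt
```

Leaving `%` in the safe set is a deliberate trade-off: it keeps already-encoded links intact, at the cost of not escaping a literal percent sign that a site failed to encode.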