
Feature request: Add ability to crawl links from the browser extension #104

Open
andrewdbate opened this issue Sep 19, 2021 · 3 comments

@andrewdbate

Thank you for making such a wonderful extension. I have been a user of SingleFile for a long time (ever since saving as MHTML was no longer well supported) and have recently been experimenting with SingleFileZ.

It would be extremely useful to be able to crawl links (as is currently possible from the CLI) from the browser extension.

This is because sometimes the website to crawl requires me to be logged in (or have a cookie set).

So my workflow would be to:

  1. log into the website, then
  2. start the crawl from the extension.

This would allow me to crawl websites that I cannot crawl from the CLI.

I do not know how difficult this would be to implement. It might just be a case of exposing the CLI functionality in the web extension version, or the CLI version may be doing something that is not possible with the WebExtensions API.

I appreciate that the browser may become slow or unresponsive while the crawl is happening (although ideally it would happen in the background), but I would be okay with this if it allowed me to crawl a site that I otherwise wouldn't be able to save with SingleFileZ.

The alternatives often suggested for crawling websites when authentication is required or a cookie needs to be set are not very easy to use and take some time to set up. And nothing is as good at saving webpages as SingleFile / SingleFileZ. Thus I would prefer to use SingleFileZ for this.

Many thanks!

@gildas-lormeau
Owner

If you can provide a list of URLs to crawl, then you can achieve this by selecting SingleFileZ > View pending saves... in the context menu and clicking the 'Add URLs' button at the top right of the page. I'm still hesitant to implement a crawler in the extension itself because it's an advanced feature that, I think, would deserve to exist as a separate extension.

@andrewdbate
Author

Ah, I didn't know about that... I didn't look for options in the context menu, I only looked at the menu shown when right-clicking the extension icon in the toolbar. (The View pending saves... option is only available from the context menu, as I assume you know.)

Do you have any suggestions for how I can obtain a list of URLs to save?

I agree that it would be useful to view the list of pages to save before beginning the saving process, so it might make sense to have it as a separate extension. However, you could argue that the list would probably only ever be used to save pages, so the two functionalities could exist together in one extension. In addition, the crawler functionality already exists in the CLI version.

@gildas-lormeau
Owner

The entry is available in the context menu of the button in Firefox, but not in Chrome, because Chrome removes the entries it cannot display.

Depending on the website you're trying to crawl, you might find the list of URLs in the sitemap.xml file, usually located in the root folder of the website. Otherwise, you'll need a crawler like the one available with the CLI tool, or HTTrack, for example.
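As a rough illustration of the sitemap approach, here is a minimal Node/TypeScript sketch that collects the page URLs so they can be pasted into the 'Add URLs' field. The site URL is a placeholder, and a real tool should use a proper XML parser and also follow `<sitemapindex>` entries:

```typescript
// Minimal sketch: list the page URLs from a site's sitemap.xml.
// Assumes Node 18+ (global fetch); SITE is a placeholder to replace.

const SITE = "https://example.com";

async function listSitemapUrls(site: string): Promise<string[]> {
  const res = await fetch(`${site}/sitemap.xml`);
  if (!res.ok) throw new Error(`sitemap.xml not found (HTTP ${res.status})`);
  const xml = await res.text();
  // Each page URL sits inside a <loc> element; a regex is enough for a
  // sketch, but a real tool should parse the XML and follow sub-sitemaps.
  return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map(m => m[1]);
}

listSitemapUrls(SITE).then(urls => console.log(urls.join("\n")));
```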

Implementing a crawler would require UI elements for at least all the --crawl- options of the CLI tool. It would also require persisting the list of URLs to crawl, plus a UI for that, because you might need to pause the crawl, for example to restart the computer. From a UI point of view, I think this complex feature is out of scope for SingleFile/SingleFileZ as they exist today. It might also lead to the extensions being rejected because of this policy: https://developer.chrome.com/docs/webstore/program_policies/?csw=1#extensions.

I think the simplest way to fix your issue would be to use the CLI tool's options for passing cookies, or to create a userscript that fills in the credentials on the login page when it is displayed.
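For the userscript approach, a minimal sketch might look like the following. The `#username`/`#password` selectors and the credential values are assumptions and depend entirely on the actual login page, so adjust them after inspecting it:

```typescript
// Minimal userscript sketch: auto-fill a login form so the pages saved
// afterwards are authenticated. Selectors and credentials are placeholders.

(function autoLogin(): void {
  const user = document.querySelector<HTMLInputElement>("#username");
  const pass = document.querySelector<HTMLInputElement>("#password");
  const form = user?.closest("form");
  if (!user || !pass || !form) return; // not on the login page
  user.value = "my-username";  // placeholder credentials
  pass.value = "my-password";
  form.submit();               // submit so subsequent navigation is authenticated
})();
```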
