
Feature request: Add ability to crawl links from the browser extension #104

Open
andrewdbate opened this issue Sep 19, 2021 · 3 comments

@andrewdbate

Thank you for making such a wonderful extension. I have been a user of SingleFile for a long time (ever since saving as MHTML was no longer well supported) and have recently been experimenting with SingleFileZ.

It would be extremely useful to be able to crawl links (as is currently possible from the CLI) from the browser extension.

This is because sometimes the website to crawl requires me to be logged in (or have a cookie set).

So my workflow would be to:

  1. log into the website, then
  2. start the crawl from the extension.

This would allow me to crawl websites that I cannot crawl from the CLI.

I do not know how difficult this would be to implement. It might just be a case of exposing the CLI functionality in the web extension version, or the CLI version may be doing something that is not possible with the WebExtensions API.

I appreciate that the browser may become slow or unresponsive while the crawl is happening (although ideally it would happen in the background), but I would be okay with this if it allowed me to crawl a site that I otherwise wouldn't be able to save with SingleFileZ.

The alternatives often suggested for crawling websites when authentication is required or a cookie needs to be set are not very easy to use and take some time to set up. And nothing is as good at saving webpages as SingleFile / SingleFileZ. Thus I would prefer to use SingleFileZ for this.

Many thanks!

@gildas-lormeau
Owner

If you can provide a list of URLs to crawl, then you can achieve this by selecting SingleFileZ > View pending saves... in the context menu and clicking the 'Add URLs' button at the top right of the page. I'm still hesitant to implement a crawler in the extension itself because it's an advanced feature that, I think, would deserve to exist as a separate extension.

@andrewdbate
Author

Ah, I didn't know about that... I didn't look for options in the context menu, I only looked at the menu shown when right-clicking the extension icon in the toolbar. (The View pending saves... option is only available from the context menu, as I assume you know.)

Do you have any suggestions for how I can obtain a list of URLs to save?

I agree that it would be useful to view the list of pages to save before beginning the saving process, so it might make sense to have it as a separate extension. However, you could argue that the list would probably only ever be used to save pages, so the two functionalities could exist together in one extension. In addition, the crawler functionality already exists in the CLI version.

@gildas-lormeau
Owner

The entry is available in the context menu of the button in Firefox, but not in Chrome, because Chrome removes the entries it cannot display.

Depending on the website you're trying to crawl, you might find the list of URLs in the sitemap.xml file, usually located in the root folder of the website. Otherwise, you'll need a crawler like the one available with the CLI tool, or HTTrack, for example.
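As a rough illustration of the sitemap approach, here is a minimal Node/TypeScript sketch that collects the page URLs so they can be pasted into the 'Add URLs' field. The site URL is a placeholder, and a real tool should use a proper XML parser and also follow `<sitemapindex>` entries:

```typescript
// Minimal sketch: list the page URLs from a site's sitemap.xml.
// Assumes Node 18+ (global fetch); SITE is a placeholder to replace.

const SITE = "https://example.com";

async function listSitemapUrls(site: string): Promise<string[]> {
  const res = await fetch(`${site}/sitemap.xml`);
  if (!res.ok) throw new Error(`sitemap.xml not found (HTTP ${res.status})`);
  const xml = await res.text();
  // Each page URL sits inside a <loc> element; a regex is enough for a
  // sketch, but a real tool should parse the XML and follow sub-sitemaps.
  return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map(m => m[1]);
}

listSitemapUrls(SITE).then(urls => console.log(urls.join("\n")));
```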

Implementing a crawler would require UI elements for at least all the --crawl- options of the CLI tool. It would also require persisting the list of URLs to crawl, plus a UI for that, because you might need to pause the crawl, for example to restart the computer. From a UI point of view, I think this complex feature is out of scope for SingleFile/SingleFileZ as they exist today. It might also lead to the extensions being rejected because of this policy: https://developer.chrome.com/docs/webstore/program_policies/?csw=1#extensions.

I think the simplest way to fix your issue would be to use the CLI tool's options for passing cookies, or to create a userscript that fills in the credentials on the login page when it is displayed.
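For the userscript approach, a minimal sketch might look like the following. The `#username`/`#password` selectors and the credential values are assumptions and depend entirely on the actual login page, so adjust them after inspecting it:

```typescript
// Minimal userscript sketch: auto-fill a login form so the pages saved
// afterwards are authenticated. Selectors and credentials are placeholders.

(function autoLogin(): void {
  const user = document.querySelector<HTMLInputElement>("#username");
  const pass = document.querySelector<HTMLInputElement>("#password");
  const form = user?.closest("form");
  if (!user || !pass || !form) return; // not on the login page
  user.value = "my-username";  // placeholder credentials
  pass.value = "my-password";
  form.submit();               // submit so subsequent navigation is authenticated
})();
```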
