[FEATURE] How do I set up a skip file for excluding specific file types? #23

Open
Tiptop4792 opened this issue Aug 29, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@Tiptop4792

Tiptop4792 commented Aug 29, 2024

The --skip parameter works great for interrupted downloads.

However, the other day I wanted to download only specific files and exclude others. I couldn't figure out how to set up a CSV file on my own.

Also, it didn't work when I tried to amend waybackup_<sanitized_url>.csv, created by the downloader. I tried to add the links I didn't want to download to the url_origin column, but the downloader didn't skip them.

Any advice? Thanks!!

Tiptop4792 added the enhancement label Aug 29, 2024
@bitdruid
Owner

hey :) thank you for your issue. the skipset does filter by url_archive (the concatenated url for the snapshot)

one approach for you could be to remove the unwanted snapshots from the .cdx file. for this you have to keep that file by using --cdxbackup or --auto

@Tiptop4792
Author

Awesome! Thanks!

Just to get this right:

I'd download the cdxbackup file, remove the snapshots I don't want and then reinsert the cdx file via --cdxinject <filepath>. Right?

@bitdruid
Owner

yes, that's right. the cdx file contains the pure json response from the server, so only the snapshots it contains will be downloaded.

if you use the --auto option, the downloader will handle the cdxbackup, its injection and the skipping on its own. that makes it easier if you just want the process to be "failsafe" in case a crash occurs.

in the long term maybe it would be an idea to add some kind of filter... is there a specific type or path you want to be removed?
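
The edit-and-reinject workflow discussed above could be sketched roughly like this. It assumes the backed-up CDX file is a single JSON array of rows whose first row is a header containing an `original` field, which is how the Wayback CDX server answers with `output=json`; the file written by the downloader may be laid out differently, so check it first. The helper names and the file names in the usage note are hypothetical.

```python
import json
from pathlib import Path


def drop_snapshots(rows, unwanted_substrings):
    """Return the header row plus every snapshot whose URL contains none of the substrings."""
    if not rows:
        return rows
    header, snapshots = rows[0], rows[1:]
    url_col = header.index("original")  # assumed column name, taken from the header row
    kept = [
        row for row in snapshots
        if not any(part in row[url_col] for part in unwanted_substrings)
    ]
    return [header] + kept


def filter_cdx_file(src, dst, unwanted_substrings):
    """Read a CDX JSON file, drop unwanted snapshots, write the result, return the number removed."""
    rows = json.loads(Path(src).read_text(encoding="utf-8"))
    filtered = drop_snapshots(rows, unwanted_substrings)
    Path(dst).write_text(json.dumps(filtered), encoding="utf-8")
    return len(rows) - len(filtered)
```

Something like `filter_cdx_file("backup.cdx", "filtered.cdx", ["/images/"])` would then produce a file that can be passed to `--cdxinject filtered.cdx`. Writing to a separate file keeps the original backup untouched in case the filter removes too much.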

@Tiptop4792
Author

I came by two occasions where this can be super useful:

  1. The other day I had an issue with file names that were too long. I wanted to remove those files from the CSV manually, but didn't manage to reinject them (I didn't know about --cdxinject before). But maybe --auto would do a better job in that particular situation?
  2. I do a lot of bulk downloading and then searching for stuff, so I'm basically interested in text files and HTML. I don't need pictures, videos, etc. Removing those would speed up downloading and take pressure off the Archive.
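
Until a built-in filter exists, both cases could be approximated by filtering the CDX rows before reinjecting them. The sketch below is only an illustration: it assumes the rows are a JSON-style list with a header row containing `original` and `mimetype` fields, it uses the last URL path segment as a stand-in for the local file name (the downloader may name files differently), and the 255-character limit is an assumed value that depends on the file system. All function names are hypothetical.

```python
from urllib.parse import urlsplit


def is_text_snapshot(row, header):
    """True if the snapshot's recorded MIME type starts with 'text/' (e.g. text/html, text/plain)."""
    return row[header.index("mimetype")].startswith("text/")


def filename_too_long(row, header, max_len=255):
    """True if the last path segment of the snapshot URL exceeds max_len characters."""
    path = urlsplit(row[header.index("original")]).path
    name = path.rsplit("/", 1)[-1]
    return len(name) > max_len


def keep_text_with_short_names(rows, max_len=255):
    """Keep the header plus text snapshots whose URL file name is not too long."""
    if not rows:
        return rows
    header, snapshots = rows[0], rows[1:]
    kept = [
        row for row in snapshots
        if is_text_snapshot(row, header) and not filename_too_long(row, header, max_len)
    ]
    return [header] + kept
```

Note that a strict `text/` prefix check also drops snapshots whose MIME type is recorded as something else, such as `application/xhtml+xml` or an unknown type, so the predicate may need widening for real data.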
