newspaper-bulk

A command-line script to apply Newspaper3k's article extraction feature in bulk, made with large lists (>1K) of URLs in mind. The script is made to be as forgiving as possible, accepting any and all URLs insofar as they are in the first column of a .csv, .xlsx, .xls, or .csv. file. It uses Python's standard threading library to divvy up the work (default is for 100 threads; see Shane Lynn's tutorial). In addition to the requirements, make sure you have nltk's punkt package installed (via nlkt.download() in interactive Python) for Newspaper3k's article.nlp() to work properly.

usage

usage: newspaperbulk.py [-h] [-t THREADS] [-r] [-u] [-m MAX_RETRIES]
                        [-b BACKOFF]
                        filepath
                        
positional arguments:
  filepath              Enter the path of the .csv, .txt, .xlsx, or .xls file
                        containing the URLs. If the file is not in your
                        current directory, you must enter the absolute path.

optional arguments:
  -h, --help            show this help message and exit
  -t THREADS, --threads THREADS
                        Number of threads to launch (default 100).
  -r, --redirects       Select to allow redirects.
  -u, --unverified      Select to allow unverified SSL certificates.
  -m MAX_RETRIES, --max_retries MAX_RETRIES
                        Set the max number of retries (default 0 to fail on
                        first retry).
  -b BACKOFF, --backoff BACKOFF
                        Set the backoff factor (default 0).

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.gitignore		.gitignore
README.md		README.md
newspaperbulk.py		newspaperbulk.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

newspaper-bulk

usage

About

Releases

Packages

Languages

dsynkov/newspaper-bulk

Folders and files

Latest commit

History

Repository files navigation

newspaper-bulk

usage

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages