
Enable custom filename when downloading #6

Open
jorgepiloto opened this issue Sep 9, 2020 · 3 comments

Comments

@jorgepiloto

Thank you so much for this tool! It is very powerful when you need to download a bunch of articles or papers.

I usually collect article names and DOIs in a CSV file. The problem is that SciDownl creates a new folder named with the desired filename for each article. After taking a look at the download method of the SciHub class, I was able to confirm that this is the expected behavior. The fact that you can only control the folder name, but not the downloaded filename, is a bit annoying.

We could let the user pass outputs like my_paper/filename.pdf or just filename.pdf, and create a new directory only when one is needed. I could implement this feature if you are interested 🚀
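For illustration, a minimal sketch of that behaviour, assuming a hypothetical helper (resolve_output is not part of SciDownl):

import os

def resolve_output(out):
    # Split an output like 'my_paper/filename.pdf' into a directory and a
    # filename, creating the directory only when one is given.
    directory, filename = os.path.split(out)
    directory = directory or '.'
    os.makedirs(directory, exist_ok=True)
    return directory, filename

# 'my_paper/filename.pdf' -> ('my_paper', 'filename.pdf'), creates my_paper/
# 'filename.pdf'          -> ('.', 'filename.pdf'), no new directory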

@khanfarhan10

@Tishacy please look into this issue!

@khanfarhan10

Also, returning the saved_filepath instead of None would make more sense when downloading files:

from scidownl.scihub import *

DOI = "10.1021/ol9910114"
out = 'paper'
# Currently download() returns None, so `sci` holds nothing useful.
sci = SciHub(DOI, out).download(choose_scihub_url_index=3)
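Reusing the snippet above, this is what the suggestion would look like from the caller's side (a sketch of the proposed behaviour, not what SciDownl currently does):

# Proposed: download() returns the saved file path instead of None, so batch
# scripts driven by a CSV of DOIs can record where each PDF went.
saved_filepath = SciHub(DOI, out).download(choose_scihub_url_index=3)
print(saved_filepath)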

@fridrichmrtn

fridrichmrtn commented Sep 21, 2021

You can just use this hotfix:

import requests
from bs4 import BeautifulSoup
from termcolor import colored
from scidownl.scihub import *

class SciHub(SciHub):
    def __init__(self, doi, out='.', title=None):
        # Extra `title` argument: when given, it is used as the downloaded filename.
        self.doi = doi
        self.out = out
        self.title = title
        self.sess = requests.Session()
        self.check_out_path()
        self.read_available_links()

    def find_pdf_in_html(self, html):
        # Locate the PDF url in the Sci-Hub page and decide the output title.
        pdf = {}
        soup = BeautifulSoup(html, 'html.parser')
        pdf_url = soup.find('embed', {'id': 'pdf'}).attrs['src'].split('#')[0]
        pdf['pdf_url'] = pdf_url.replace('https', 'http') if 'http' in pdf_url else 'http:' + pdf_url
        # Keep the custom title if one was passed, otherwise fall back to the
        # last path component of the PDF url (the original behaviour).
        self.title = self.title if self.title else pdf['pdf_url'].split('/')[-1].split('.pdf')[0]
        pdf['title'] = self.check_title(self.title)
        print(STD_INFO + colored('PDF url', attrs=['bold']) + " -> \n\t%s" % (pdf['pdf_url']))
        print(STD_INFO + colored('Article title', attrs=['bold']) + " -> \n\t%s" % (pdf['title']))
        return pdf
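With this subclass in place, a custom filename can be passed through the title argument, for example (the DOI, folder, and title below are just placeholders; the call pattern follows the snippet earlier in the thread):

# The PDF should be saved under my_paper/ using 'my_custom_name' as its filename.
SciHub("10.1021/ol9910114", out='my_paper', title='my_custom_name').download(choose_scihub_url_index=3)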

This is also fixed in PR #16.
