This repository has been archived by the owner on Mar 30, 2023. It is now read-only.

[QUESTION] A question about 'Resume' #677

Closed
AnthonyChouGit opened this issue Mar 5, 2020 · 7 comments

Comments

@AnthonyChouGit

AnthonyChouGit commented Mar 5, 2020

Initial Check

Make sure you've checked the following:

  • [ ] Python version is 3.6;
  • [ ] Updated Twint with pip3 install --user --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint;
  • [ ] I have searched the issues and there are no duplicates of this issue/question/request.

Command Ran

import twint

config = twint.Config()
config.Limit = 100000
config.Store_csv = True
config.Search = 'China'
config.Since = '2019-12-1'
config.Until = '2020-1-1'
config.Lang = 'en'
config.Output = '/root/datasets/unprocessed/China12.csv'
# config.Min_likes = 20
twint.run.Search(config)

I got this message:
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)

Description of Issue

I'm having the same problem as #670, so I won't repeat the details. I guess it is because Twitter has updated its anti-crawler system. I'm doing research in data science and looking for a way to retrieve a large number of tweets efficiently. When I set Limit to 100,000, the retrieval process always stops at around 20,000.
So I am considering 'Resume' as a way to deal with this situation, but I can hardly find any information on your page about how to use it. The docs only mention providing the path of a file containing the scroll id. But what is that file? Is it the csv file created by twint in the last run? And what is that scroll id? How can I get this so-called scroll id from the last run and store it in a file? I suggest providing a more specific explanation of it in the docs.
I really appreciate your work. Thank you.

Environment Details

CentOS

@pielco11
Member

pielco11 commented Mar 5, 2020

import twint

config = twint.Config()
config.Limit = 100000
config.Store_csv = True
config.Search = 'China'
config.Since = '2019-12-1'
config.Until = '2020-1-1'
config.Lang = 'en'
config.Output = '/root/datasets/unprocessed/China12.csv'

config.Resume = "my_search_id_.txt"  # file that stores the scroll id

# config.Min_likes = 20
twint.run.Search(config)

To resume the scrape, you have to provide the scroll id. This ID is contained in the requests twint makes, so either you use config.Debug and extract the latest scroll id from twint-request_urls.log, or you specify a custom file (in the example, my_search_id_.txt) and let twint handle it for you.
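
If you go the config.Debug route, here is a minimal sketch of pulling the newest scroll id out of the log, assuming the log holds one request URL per line and the scroll id shows up as a query parameter (the "cursor" parameter name below is an assumption and may vary between twint versions):

from urllib.parse import urlparse, parse_qs

# Scan twint-request_urls.log (written when config.Debug is enabled) and
# keep the last scroll id seen. "cursor" is an assumed parameter name.
latest_scroll_id = None
with open("twint-request_urls.log") as log:
    for line in log:
        params = parse_qs(urlparse(line.strip()).query)
        if "cursor" in params:
            latest_scroll_id = params["cursor"][0]

# Save it to the file passed to config.Resume so the next run continues there.
if latest_scroll_id:
    with open("my_search_id_.txt", "w") as f:
        f.write(latest_scroll_id)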

config.Limit is enforced only on twint's side; Twitter is not aware of such a parameter.
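
Since Limit can't prevent Twitter from cutting a run short, one way to work toward a large total is to chain several runs through the Resume file. A sketch, assuming twint keeps the scroll id in that file up to date while scraping so each run picks up where the previous one stopped (the pass count and file names are illustrative):

import twint

RESUME_FILE = 'my_search_id_.txt'  # scroll-id file, the name is arbitrary

def run_chunk():
    # Each call resumes from the scroll id saved by the previous call.
    config = twint.Config()
    config.Search = 'China'
    config.Lang = 'en'
    config.Store_csv = True
    config.Output = 'China12.csv'
    config.Resume = RESUME_FILE
    twint.run.Search(config)

# Illustrative: a few passes appending to the same CSV.
for _ in range(5):
    run_chunk()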

#604

@dfreelon

May I request that the "Resume" entry on the Configuration wiki page be edited to reflect the information above? It currently reads:

Resume (string) - Resume from the latest scraped tweet ID, specify the filename that contains the ID.

This made me think Resume was looking for a file full of tweet IDs, not scroll IDs. It wasn't until I read this question that I realized what the issue was. This is an incredibly important feature of which more Twint users should be aware! Thanks for all your work.

@pielco11
Member

My bad, that part of the wiki is outdated. 🙏

@pielco11
Member

Updated

@vidyap-xgboost

vidyap-xgboost commented Jul 7, 2020

Updated

@pielco11

Hi,
Can you link the wiki page to this issue? That'd be really helpful! TIA.

@jonatapaulino

I wanted to get tweets from different regions here in Brazil. Would geo be the parameter I could use to do this search? Thanks.

@leul12

leul12 commented Oct 11, 2021

I wanted to get tweets from different regions here in Brazil. Would geo be the parameter I could use to do this search? Thanks.

You might check the near argument too.
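
For illustration, a minimal sketch using both options, under the assumption that config.Geo takes a "latitude,longitude,radius" string and config.Near takes a place name; the query, coordinates, and output path are made-up example values:

import twint

config = twint.Config()
config.Search = 'eleições'  # example query
# Restrict results to a radius around a point (illustrative Sao Paulo coordinates):
config.Geo = '-23.5505,-46.6333,50km'
# Or search near a named place instead:
# config.Near = 'Sao Paulo'
config.Store_csv = True
config.Output = 'brazil_tweets.csv'
twint.run.Search(config)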
