
Resuming collection after interruption #589

Open
igorbrigadir opened this issue Jan 27, 2022 · 8 comments
Comments

@igorbrigadir
Contributor

It should be possible to pick up where you left off for a long-running search, if you read the last request and use the pagination token. As far as I can tell, these pagination tokens don't expire. Will try and see how long they last (in case they do). This could also be used to resume searches across months - if you run out of your monthly quota or something.

Ideally it will be a --resume option - with a @resumable decorator on the command line options that can read an existing file, and continue appending and paginating given an existing pagination token. The client code should already support this, so these changes are mostly in the command line tool.

@Michael-Gauthier

Hello @igorbrigadir, and everyone else! This is pretty funny, because I came here to post the exact same question, due to disconnection problems I randomly have in my office... So I do not really understand: is it a feature you are planning to add, or is it already possible to resume collections after interruptions? Thanks a lot in advance, as usual! : )

@edsu
Member

edsu commented Jan 28, 2022

I agree it would be a nice feature to have (it doesn't exist yet), and shouldn't be too tricky given next_token is already persisted to the output?

@Michael-Gauthier

Gods, when it is implemented, I'll send you guys chocolates or whatever, that will be so useful! ^^

Thanks again for your hard work and efforts to keep improving the tool by the way! : )

@igorbrigadir
Contributor Author

is it a feature you are planning to add, or is it already possible to resume collections after interruptions?

Kinda both: it doesn't exist in the command line twarc yet, but it's possible to do this with the library, https://twittercommunity.com/t/pulling-a-large-data-set-with-twarc2-client/165685/6?u=igorbrigadir - all you need to do is extract a next_token and pass it as a parameter to the function, and it should continue. But I need to double check that, and make sure that using old next_token values actually works like this.
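The resume loop itself is just ordinary pagination that starts from a saved token rather than from scratch. A minimal sketch of that principle, assuming a `search_page` callable that stands in for one request to the API (a hypothetical stand-in; the real client methods would play this role once they accept the token as a keyword argument, which is what this issue is about):

```python
def paginate(search_page, query, next_token=None):
    """Yield response pages for a query, starting from next_token
    if one is supplied (i.e. resuming an earlier collection)."""
    while True:
        page = search_page(query, next_token=next_token)
        yield page
        # Follow the pagination chain until a page has no next_token.
        next_token = page.get("meta", {}).get("next_token")
        if next_token is None:
            break
```

Starting with `next_token=None` is a fresh search; starting with a token recovered from an earlier run continues where that run stopped.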

@Michael-Gauthier

Michael-Gauthier commented Jan 28, 2022

Thanks for the feedback, and for the link to the thread! So, if I understand correctly, I would need to find the next_token, add this to my "normal" twarc2 query program, and it should resume where it stopped?

@igorbrigadir
Contributor Author

Yes, except there's no way to pass the next_token to twarc2 yet - I'll have to add that part in.

@Michael-Gauthier

Ok, thanks a lot for your reactivity and your clarification! : )

@SamHames
Contributor

Thinking about this because of #656 - I think there are two layers to this:

  1. The client methods need to consistently take the pagination token as a keyword argument everywhere (this is the easy bit).
  2. For the command line, it's a bit harder when we consider the bulk commands, which are actually the biggest targets for resuming an interrupted operation.
    • At a minimum, to resume in these cases we'd have to read the last page of results, confirm whether the last collected search needs to be resumed based on the presence/absence of a next_token, then find the appropriate point in the input file to resume from.
    • We'd also have to think about the actual file writing workflow: do we resume from and continue appending to the same file, or do something else? My suggestion is that if the --resume argument is passed, it makes sense to read from the output file to sync the state, then continue appending to that file.
