Proposal: twarc2 extract [thing] command for driving further collection #562
Comments
What you've suggested is fine in itself. My only question is whether you (plural) want that functionality to be part of (core) twarc or not - where is it that you're drawing the lines around the purpose of (core) twarc? That's not to say don't do it or that you're at the danger point yet, just make sure you know how it fits more broadly and keep aware of the dangers of scope creep and starting to just chuck features in.
I mean, I like (1) on its own - line-delimited ID files as input are great, and fit in with the new searches etc. My comment above is specifically about the extract subcommand.
I like this idea! Even if we don't end up implementing extra bits, we have to consolidate some validation code to share it between For me, the decision of what should be core twarc vs what should be a plugin involves: whether it has extra dependencies (pandas, youtube-dl, networkx etc.), how commonly useful it is, and whether it sticks to the original API or not. So I think (Also, maybe #542 should be fixed by this, if we separate out the commands and #561 is related too, if
I do worry that the core twarc command is getting quite heavy. I'd like to see work on extract as a plugin first and let it prove its usefulness.
Yeah, I'm happy to prototype in a plugin first.
I'm going to separately synthesise all of the things somewhere so we can have the "core" twarc discussion somewhere durable/to act as guidance for the future.
As noted in #561, there's a confusing aspect when using tweet JSON as input to the timelines and conversations commands.
Now that we have timelines, conversations and searches taking input files, I propose we:
Add a twarc2 extract command that is focused on extracting interesting things from tweet JSON and creating a file that can be fed back into those data-driven commands. Each option would focus on building a deduplicated list of entries to drive another twarc2 command.
The signature for this command could look like:
My hope is that by explicitly representing the intermediates as files, rather than just magic processing, it makes workflows more repeatable and consistent, and it also allows for incorporating human judgment more easily into the process by editing a file.
Example 1
This would allow you to do a hashtag snowball around a keyword of interest by something like the following:
Example 2
Extracting the conversations from the same collection (this replaces the conversations command reading from JSON)
Example 3
Extracting the timelines for a set of users tweeting about something (this replaces the timelines command reading from JSON)
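To make the "deduplicated list of entries" idea concrete, here is a minimal Python sketch of the extraction step on its own. It is not the proposed CLI and not part of twarc: the function names (extract_unique, hashtags, conversation_ids, author_ids) are hypothetical, and it assumes flattened, line-delimited tweet JSON (one tweet object per line) using the Twitter API v2 field names entities.hashtags[].tag, conversation_id and author_id.

```python
import json


def extract_unique(lines, getter):
    """Return the values produced by `getter` for each tweet, deduplicated,
    in order of first appearance. `lines` is any iterable of JSON strings,
    one tweet per line (assumption: already flattened)."""
    seen = set()
    ordered = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in hand-edited files
        tweet = json.loads(line)
        for value in getter(tweet):
            if value not in seen:
                seen.add(value)
                ordered.append(value)
    return ordered


def hashtags(tweet):
    # Hashtags are lowercased so "#AusPol" and "#auspol" count once.
    tags = tweet.get("entities", {}).get("hashtags", [])
    return [h["tag"].lower() for h in tags]


def conversation_ids(tweet):
    cid = tweet.get("conversation_id")
    return [cid] if cid else []


def author_ids(tweet):
    uid = tweet.get("author_id")
    return [uid] if uid else []
```

Writing the result out with one entry per line (for example "\n".join(extract_unique(f, hashtags))) gives the kind of intermediate file described above, which can be edited by hand before it drives the next collection step.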