Proposal: twarc2 extract [thing] command for driving further collection #562

Open
SamHames opened this issue Oct 21, 2021 · 5 comments

@SamHames
Contributor

As noted in #561, using tweet JSON as input to the timelines and conversations commands is confusing.

Now that we have timelines, conversations and searches taking input files, I propose we:

  1. Simplify timelines and conversations so they only take an input file of IDs (not tweet JSON)
  2. Provide a twarc2 extract command that focuses on extracting interesting things from tweet JSON and writing a file that can be fed back into those data-driven commands. Each option would build a deduplicated list of entries to drive another twarc2 command.

The signature for this command could look like:

twarc2 extract [hashtags|mentions|authors|urls|conversation_ids] input.json output.txt --options-of-some-kind

My hope is that explicitly representing the intermediates as files, rather than relying on magic processing, makes workflows more repeatable and consistent. It also makes it easier to incorporate human judgment into the process by editing a file.

Example 1

This would allow you to do a hashtag snowball around a keyword of interest by something like the following:

twarc2 search "#auspol" auspol.json
twarc2 extract hashtags auspol.json auspol_hashtags.txt
# In between these steps, it would also be possible to edit the hashtags file first, for example, to identify a specific thematic
# subset of hashtags, or to remove generic hashtags that aren't relevant to the topic of interest.
twarc2 searches auspol_hashtags.txt auspol_snowball.json
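
To make the idea concrete, here is a rough Python sketch of what the hashtag extraction step might do. Nothing here is an existing twarc function: `iter_tweets` and `extract_hashtags` are hypothetical names, and the sketch assumes the input is line-delimited JSON where each line is either a single tweet object or an API response page with a `data` list, with hashtags stored under `entities.hashtags[].tag`.

```python
import json
from collections import Counter


def iter_tweets(lines):
    """Yield tweet dicts from line-delimited JSON (assumed input layout)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        obj = json.loads(line)
        data = obj.get("data") if isinstance(obj, dict) else None
        if isinstance(data, list):
            # Assumed: a response page holding many tweets under "data"
            yield from data
        elif isinstance(data, dict):
            yield data
        elif isinstance(obj, dict):
            # Assumed: an already flattened, single tweet per line
            yield obj


def extract_hashtags(lines):
    """Return deduplicated, lowercased hashtags, most frequent first."""
    counts = Counter()
    for tweet in iter_tweets(lines):
        entities = tweet.get("entities") or {}
        for hashtag in entities.get("hashtags", []):
            tag = hashtag.get("tag")
            if tag:
                counts[tag.lower()] += 1
    # Prefix with "#" so each line can be used directly as a search query
    return [f"#{tag}" for tag, _ in counts.most_common()]
```

Writing one entry per line to a text file would give the editable intermediate described above. Ordering by frequency is only one possible choice; it would make it easier to spot generic hashtags when reviewing the file by hand.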

Example 2

Extracting the conversations from the same collection (this replaces the conversations command reading from JSON)

twarc2 extract conversation_ids auspol.json auspol_threads.txt
twarc2 conversations auspol_threads.txt auspol_threads.json

Example 3

Extracting the timelines for a set of users tweeting about something (this replaces the timelines command reading from JSON)

twarc2 extract authors auspol.json auspol_authors.txt
twarc2 timelines auspol_authors.txt auspol_authors.json
@betsybookwyrm
Contributor

What you've suggested is fine in itself. My only question is whether you (plural) want that functionality to be part of (core) twarc or not - where is it that you're drawing the lines around the purpose of (core) twarc? That's not to say don't do it or that you're at the danger point yet, just make sure you know how it fits more broadly and keep aware of the dangers of scope creep and starting to just chuck features in.

@betsybookwyrm
Contributor

I mean, I like (1) on its own: line-delimited ID files as input are great, and fit in with the new searches etc. My comment above is specifically about the extract subcommand.

@igorbrigadir
Contributor

igorbrigadir commented Oct 22, 2021

I like this idea!

Even if we don't end up implementing the extra bits, we have to consolidate some validation code so it can be shared between searches, timelines, and conversations (the commands that read files of IDs or attempt to read things from JSON): skipping blank lines, parsing IDs with or without quotes, and so on.
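
As an illustration of the kind of shared validation meant here, a minimal Python sketch follows. `read_ids` is a hypothetical helper, not something that exists in twarc, and it assumes every entry is a purely numeric ID (so it would not cover usernames or search queries).

```python
def read_ids(lines):
    """Return unique numeric IDs from an iterable of lines, in first-seen order.

    Blank lines are skipped and surrounding single or double quotes are
    removed. Assumption: every remaining entry must be an ASCII digit string.
    """
    seen = set()
    ids = []
    for line_number, line in enumerate(lines, start=1):
        value = line.strip().strip("\"'").strip()
        if not value:
            continue  # skip blank lines
        if not (value.isascii() and value.isdigit()):
            raise ValueError(f"line {line_number}: not a numeric ID: {value!r}")
        if value not in seen:
            seen.add(value)
            ids.append(value)
    return ids
```

Because it accepts any iterable of strings, the same function could be called with an open file handle or with a list of lines in tests. Whether a bad line should raise or just be skipped with a warning is a design choice left open here.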

For me, the decision of what should be core twarc versus a plugin comes down to whether it has extra dependencies (pandas, youtube-dl, networkx, etc.), how commonly useful it is, and whether it sticks to the original API. So I think twarc2 extract should be in core twarc: it's too useful not to be, and it doesn't have any extra dependencies.

(Also, maybe twarc2 dehydrate should be an alias for twarc2 extract tweets?)

#542 should be fixed by this if we separate out the commands. #561 is related too: if twarc2 extract takes care of extracting things, the other commands (timelines etc.) can accept only IDs as input (right now they take datasets too).

@edsu
Member

edsu commented Oct 22, 2021

I do worry that the core twarc command is getting quite heavy. I'd like to see work on extract as a plugin first and let it prove its usefulness.

@SamHames
Contributor Author

Yeah, I'm happy to prototype in a plugin first.

> I do worry that the core twarc command is getting quite heavy.

Do you think part 1 of that proposal makes sense along those lines, at least once there's an equivalent plugin?

I'm going to separately synthesise all of this so we can have the "core" twarc discussion somewhere durable, to act as guidance for the future.


4 participants