Proposal: twarc2 extract [thing] command for driving further collection #562

SamHames · 2021-10-21T23:24:12Z

As noted in #561, using tweet JSON as input to the timelines and conversations commands is confusing.

Now that we have timelines, conversations and searches taking input files, I propose we:

1. Simplify timelines and conversations so they only take an input file of IDs (not tweet JSON).
2. Provide a twarc2 extract command that focuses on extracting interesting things from tweet JSON and creating a file that can be fed back into those data-driven commands. Each option would build a deduplicated list of entries to drive another twarc2 command.

The signature for this command could look like:

twarc2 extract [hashtags|mentions|authors|urls|conversation_ids] input.json output.txt --options-of-some-kind

My hope is that explicitly representing the intermediates as files, rather than relying on magic processing, makes workflows more repeatable and consistent. It also makes it easier to incorporate human judgment into the process by editing a file.
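As a rough illustration of what the hashtags option might do internally, here is a minimal Python sketch. It is not an existing twarc implementation: the function names extract_hashtags and extract_to_file are hypothetical, and it assumes the input file holds one tweet object per line with hashtags under entities.hashtags[].tag. Lowercasing tags before deduplicating is also an assumption.

```python
import json
from typing import Iterable, Iterator


def extract_hashtags(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct hashtag once, in first-seen order.

    Assumes one tweet JSON object per line, shaped like
    {"entities": {"hashtags": [{"tag": "auspol"}, ...]}}.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the input
        tweet = json.loads(line)
        for hashtag in tweet.get("entities", {}).get("hashtags", []):
            tag = hashtag["tag"].lower()  # assumption: treat tags case-insensitively
            if tag not in seen:
                seen.add(tag)
                yield tag


def extract_to_file(infile: str, outfile: str) -> None:
    """Write one '#tag' per line so the file can be edited by hand."""
    with open(infile, encoding="utf-8") as src, \
            open(outfile, "w", encoding="utf-8") as dst:
        for tag in extract_hashtags(src):
            dst.write(f"#{tag}\n")
```

Writing one entry per line keeps the intermediate file easy to inspect and prune before it drives the next command.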

Example 1

This would allow you to do a hashtag snowball around a keyword of interest with something like the following:

twarc2 search "#auspol" auspol.json
twarc2 extract hashtags auspol.json auspol_hashtags.txt
# In between these steps, it would also be possible to edit the hashtags file first, for example, to identify a specific thematic
# subset of hashtags, or to remove generic hashtags that aren't relevant to the topic of interest.
twarc2 searches auspol_hashtags.txt auspol_snowball.json

Example 2

Extracting the conversations from the same collection (this replaces the conversations command reading from JSON)

twarc2 extract conversations auspol.json auspol_threads.txt
twarc2 conversations auspol_threads.txt auspol_threads.json

Example 3

Extracting the timelines for a set of users tweeting about something (this replaces the timelines command reading from JSON)

twarc2 extract authors auspol.json auspol_authors.txt
twarc2 timelines auspol_authors.txt auspol_authors.json


betsybookwyrm · 2021-10-22T00:21:39Z

What you've suggested is fine in itself. My only question is whether you (plural) want that functionality to be part of (core) twarc or not - where is it that you're drawing the lines around the purpose of (core) twarc? That's not to say don't do it or that you're at the danger point yet, just make sure you know how it fits more broadly and keep aware of the dangers of scope creep and starting to just chuck features in.

betsybookwyrm · 2021-10-22T00:24:10Z

I mean, I like (1) on its own - line-delimited ID files as input are great and fit in with the new searches etc. My comment above is specifically about the extract subcommand.

igorbrigadir · 2021-10-22T00:43:09Z

I like this idea!

Even if we don't end up implementing the extra bits, we have to consolidate some validation code so it can be shared between searches, timelines and conversations (the commands that read files of IDs or attempt to read things from JSON): skipping blank lines, parsing IDs with or without quotes, etc.
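The kind of shared validation described here could look roughly like the following Python sketch. The helper name read_ids is hypothetical and not part of twarc. It assumes every entry is a numeric ID; a command that also accepts usernames would need a looser check.

```python
from typing import Iterable, Iterator


def read_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield cleaned numeric IDs from a line-delimited input.

    Skips blank lines and accepts IDs with or without surrounding
    single or double quotes. Raises ValueError on anything else.
    """
    for line_number, raw in enumerate(lines, start=1):
        # Drop whitespace, then optional quotes, then any inner padding.
        value = raw.strip().strip("\"'").strip()
        if not value:
            continue  # blank line (or an empty pair of quotes)
        if not (value.isascii() and value.isdigit()):
            raise ValueError(
                f"line {line_number}: expected a numeric ID, got {raw.strip()!r}"
            )
        yield value
```

Keeping IDs as strings avoids any precision problems if the values are later serialised back to JSON.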

For me, the decision of what should be core twarc versus a plugin involves whether it has extra dependencies (pandas, youtube-dl, networkx, etc.), how commonly useful it is, and whether it sticks to the original API. So I think twarc2 extract should be in core twarc, because it's too useful not to be and doesn't have any extra dependencies.

(Also, maybe twarc2 dehydrate should be an alias for twarc2 extract tweets?)

#542 should be fixed by this if we separate out the commands. #561 is related too: if twarc2 extract takes care of extracting things, the other commands (timelines etc.) can accept only IDs as input (right now they take datasets too).

edsu · 2021-10-22T01:31:53Z

I do worry that the core twarc command is getting quite heavy. I'd like to see work on extract as a plugin first and let it prove its usefulness.

SamHames · 2021-10-22T04:40:30Z

Yeah, I'm happy to prototype in a plugin first.

> I do worry that the core twarc command is getting quite heavy.
Do you think part 1 of that proposal makes sense along those lines, at least once there's an equivalent plugin?

I'm going to separately synthesise all of this so we can have the "core" twarc discussion somewhere durable, where it can act as guidance for the future.

SamHames mentioned this issue Oct 22, 2021

Add some notes about the purpose of core twarc and principles about w… #563

Merged

igorbrigadir mentioned this issue Jan 26, 2022

Miss-matching counts DocNow/twarc-hashtags#1

Open

igorbrigadir mentioned this issue Mar 16, 2022

Add new data type for places DocNow/twarc-csv#40

Closed
