Progress Bars #490

Merged
merged 52 commits into from
Jul 5, 2021

igorbrigadir
Contributor

@igorbrigadir igorbrigadir commented Jun 21, 2021

fixes #437, fixes #289

I think I got it working the way I wanted, but what's still missing is a better display of the items/sec rate and the running total. There's probably a better way of doing that, which I'm still trying to figure out.

This only adds a progress bar to the search endpoint as an example - i'll add them to other commands if this one looks good.

Commands with progress bars:

  • search (snowflake timestamp)
  • flatten (input file)
  • users (input file)
  • hydrate (input file)
  • mentions (manual count - API limit of 800)
  • conversation (snowflake timestamp)
  • conversations (input file)
  • timeline (timestamp or manual count)
  • timelines (input file)
  • following (count using user lookup)
  • followers (count using user lookup)

Commands with --limit terminate the progress bar early - I think this is a good visual cue as to what's happening.
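To illustrate the count-based bars and the early stop with --limit, here is a minimal sketch. It is not the code in this PR: `pages_with_progress` and the fake data are hypothetical, and the only assumption about the API is that a user lookup returns `public_metrics.followers_count`, which serves as the total.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def pages_with_progress(
    pages: Iterable[List[dict]], total: int, limit: Optional[int] = None
) -> Iterator[Tuple[List[dict], int, int]]:
    """Yield (page, items_seen, total); stop once `limit` items were seen."""
    seen = 0
    for page in pages:
        seen += len(page)
        yield page, seen, total
        if limit is not None and seen >= limit:
            # Stopping here leaves the bar at seen/total, visibly short of 100%.
            break


# Hypothetical user lookup result; only public_metrics is used here.
user = {"username": "example", "public_metrics": {"followers_count": 20}}
total = user["public_metrics"]["followers_count"]

# Four fake pages of five followers each, standing in for API responses.
fake_pages = [[{"id": str(i)} for i in range(p * 5, p * 5 + 5)] for p in range(4)]

for _page, seen, known_total in pages_with_progress(fake_pages, total, limit=10):
    print(f"{seen}/{known_total}")  # prints 5/20, then 10/20, then stops
```

With `limit=None` the loop runs through all four pages and ends at 20/20.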

@edsu
Member

edsu commented Jun 21, 2021

Did you want this to be a draft PR? It cannot be merged while it's in that state.

@edsu
Member

edsu commented Jun 21, 2021

By the way, I'm not sure I fully understood what you were proposing to do with the snowflake ids and time. That is really very cool.

@igorbrigadir
Contributor Author

Yeah - I'm still working on making the progress bar output more interpretable, and adding progress bars to other commands (like followers, which takes ages) - they don't need the snowflake tricks, but they do need a user lookup for totals.

The snowflake -> timestamp trick gets around the issue of not knowing how many tweets there are: we do know the time span being processed, so progress can be measured against that instead. Those functions will also come in handy for #459
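For reference, a minimal sketch of the snowflake -> timestamp conversion. The function names are made up for illustration and are not necessarily the ones added in this PR; the bit layout (millisecond timestamp in the bits above the low 22, offset by the Twitter epoch) is the standard snowflake format.

```python
from datetime import datetime, timezone

# Twitter's snowflake epoch in milliseconds (2010-11-04T01:42:54.657Z).
TWITTER_EPOCH_MS = 1288834974657


def snowflake_to_ms(snowflake_id: int) -> int:
    """Return the millisecond UTC timestamp embedded in a snowflake id."""
    # The low 22 bits hold machine and sequence numbers; drop them.
    return (snowflake_id >> 22) + TWITTER_EPOCH_MS


def ms_to_snowflake(timestamp_ms: int) -> int:
    """Return the smallest snowflake id that could carry this timestamp."""
    return (timestamp_ms - TWITTER_EPOCH_MS) << 22


def snowflake_to_datetime(snowflake_id: int) -> datetime:
    """Return the embedded timestamp as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(snowflake_to_ms(snowflake_id) / 1000, tz=timezone.utc)


# One of the ids used in the test commands in this thread.
print(snowflake_to_datetime(1367874294206889985).isoformat())
```

Going the other way with `ms_to_snowflake` gives an id that can stand in for a start or end time, which is what makes --since-id and --start-time interchangeable for progress purposes.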

@igorbrigadir igorbrigadir mentioned this pull request Jun 22, 2021
@igorbrigadir
Contributor Author

So, there's now a FileSizeProgressBar that uses the input file to measure progress. For API endpoints, it also checks the errors and matches up the parameters.
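The idea of measuring progress against the input file can be sketched roughly like this. It is a standalone illustration using only the standard library, not the FileSizeProgressBar from this PR; `lines_with_progress` is a hypothetical helper.

```python
import os
from typing import Iterator, Tuple


def lines_with_progress(path: str) -> Iterator[Tuple[str, float]]:
    """Yield (line, fraction_of_file_read) for each line of a text file."""
    total_bytes = os.path.getsize(path)
    bytes_read = 0
    # Binary mode so len() counts bytes on disk, not decoded characters.
    with open(path, "rb") as infile:
        for raw in infile:
            bytes_read += len(raw)
            fraction = bytes_read / total_bytes if total_bytes else 1.0
            yield raw.decode("utf-8").rstrip("\r\n"), fraction
```

The fraction reaches 1.0 on the last line without counting lines up front, so large id files don't need to be read twice.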

The TimestampProgressBar works on snowflake ids and time ranges.
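A rough sketch of how progress over a time range can be derived from snowflake ids. This is an illustration under stated assumptions, not the TimestampProgressBar itself: `time_progress` is a hypothetical name, and it assumes results arrive newest first, so the covered span grows backwards from the end of the range.

```python
# Twitter's snowflake epoch in milliseconds.
TWITTER_EPOCH_MS = 1288834974657


def snowflake_to_ms(snowflake_id: int) -> int:
    """Return the millisecond UTC timestamp embedded in a snowflake id."""
    return (snowflake_id >> 22) + TWITTER_EPOCH_MS


def time_progress(since_id: int, until_id: int, oldest_seen_id: int) -> float:
    """Fraction of the [since_id, until_id] time span already covered.

    Assumes newest-first results: everything between until_id and the
    oldest tweet seen so far counts as done.
    """
    start_ms = snowflake_to_ms(since_id)
    end_ms = snowflake_to_ms(until_id)
    span = end_ms - start_ms
    if span <= 0:
        return 1.0
    covered = end_ms - snowflake_to_ms(oldest_seen_id)
    # Clamp so ids slightly outside the range don't over- or undershoot.
    return min(max(covered / span, 0.0), 1.0)


# The id range from one of the search test commands in this thread.
since_id, until_id = 1408930651768542849, 1411024756485328896
halfway_id = (since_id + until_id) // 2
print(f"{time_progress(since_id, until_id, halfway_id):.0%}")
```

Because the bar tracks elapsed time rather than tweet counts, an account with few tweets spread over a long period advances in large jumps, which is expected.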

@igorbrigadir
Contributor Author

The fancier output for search progress bar also introduces a new dependency, humanize - but i think it's worth it.

@igorbrigadir igorbrigadir marked this pull request as ready for review July 3, 2021 00:44
@igorbrigadir igorbrigadir marked this pull request as draft July 3, 2021 01:42
@igorbrigadir igorbrigadir marked this pull request as ready for review July 3, 2021 03:35
@edsu
Member

edsu commented Jul 3, 2021

The new progress bar is BEAUTIFUL -- much more readable. I like how twarc search now uses a time based estimate rather than the counts results.

@igorbrigadir
Contributor Author

igorbrigadir commented Jul 3, 2021

Just fixing a few more things - disambiguating and filling in defaults when using --since-id vs --start-time etc.

@edsu
Member

edsu commented Jul 3, 2021

It seems to me like this is ready to be merged/deployed?

@igorbrigadir
Contributor Author

Not yet, unfortunately - I found some more weirdness with the twarc2 timeline commands that I want to fix (they fail to do the right thing right now), and I'm pretty sure I can simplify things significantly.

@igorbrigadir
Contributor Author

Slightly better again - here are the test cases i'm using:

(DarpaDan is good because it's an account with few tweets, but spans a long time)

search (snowflake timestamp)

twarc2 search "from:DarpaDan" out.jsonl
twarc2 search --archive "from:DarpaDan" out.jsonl
twarc2 search --limit 10 --archive "from:DarpaDan" out.jsonl
twarc2 search --since-id 1367874294206889985 "from:DarpaDan" out.jsonl
twarc2 search --until-id 1367874294206889985 "from:DarpaDan" out.jsonl
twarc2 search --since-id 1408930651768542849 --until-id 1411024756485328896 "from:igorbrigadir" out.jsonl

flatten (input file)

twarc2 flatten results.jsonl out.jsonl

users (input file)

twarc2 users --usernames usernames.txt out.json
twarc2 users user_ids.txt out.json

hydrate (input file)

twarc2 hydrate ids.txt out.jsonl

mentions (manual count - API limit of 800)

twarc2 mentions igorbrigadir out.jsonl
twarc2 mentions --since-id 1411118294359318538 igorbrigadir out.jsonl

conversation (snowflake timestamp)

# calls search with a fixed `conversation_id` query so exactly the same as `twarc2 search`

conversations (input file)

twarc2 conversations ids.txt out.jsonl

timeline (timestamp or manual count)

twarc2 timeline --since-id 1367874294206889985 DarpaDan out.jsonl
twarc2 timeline --until-id 1367874294206889985 DarpaDan out.jsonl
twarc2 timeline --since-id 1367874294206889985 --until-id 1411361536728207362 DarpaDan out.jsonl

twarc2 timeline --use-search --since-id 1367874294206889985 DarpaDan out.jsonl
twarc2 timeline --use-search --until-id 1367874294206889985 DarpaDan out.jsonl
twarc2 timeline --use-search --since-id 1367874294206889985 --until-id 1411361536728207362 DarpaDan out.jsonl

twarc2 timeline igorbrigadir out.jsonl #max out 3200 limit

timelines (input file)

twarc2 timelines user_ids.txt out.json

following (count using user lookup)

twarc2 following igorbrigadir --limit 10 out.jsonl
twarc2 following DarpaDan out.jsonl

followers (count using user lookup)

twarc2 followers igorbrigadir --limit 10 out.jsonl
twarc2 followers DarpaDan out.jsonl

@igorbrigadir
Contributor Author

Ok, maybe now it's good to go. This turned out to be way bigger than i expected.

Successfully merging this pull request may close these issues.

Progress bar Tqdm library while scraping tweets
2 participants