Refactor command line and add manually set fields and expansions #549

Merged
edsu merged 30 commits into main from manual-fields-expansions on Oct 23, 2021

Conversation

igorbrigadir
Contributor

@igorbrigadir igorbrigadir commented Sep 28, 2021

Fix #493
will also Fix #550

This also refactors the command2.py click command line options.

Needs a good bit of testing to make sure all the old commands still work.

@igorbrigadir
Contributor Author

For example, twarc2 search --help now looks like:

Usage: twarc2 search [OPTIONS] QUERY [OUTFILE]

  Search for tweets.

Options:
  --end-time [%Y-%m-%d|%Y-%m-%dT%H:%M:%S]
                                  Match tweets sent before UTC time (ISO
                                  8601/RFC 3339)
  --start-time [%Y-%m-%d|%Y-%m-%dT%H:%M:%S]
                                  Match tweets created after UTC time (ISO
                                  8601/RFC 3339), e.g.  2021-01-01T12:31:04
  --until-id INTEGER              Match tweets sent prior to tweet id
  --since-id INTEGER              Match tweets sent after tweet id
  --max-results INTEGER           Maximum number of tweets per API response
  --limit INTEGER                 Maximum number of tweets to save
  --archive                       Use the full archive (requires Academic
                                  Research track)
  --place-fields TEXT             Retrieve specified place fields. Default is
                                  all available.
  --poll-fields TEXT              Retrieve specified poll fields. Default is
                                  all available.
  --media-fields TEXT             Retrieve specified media fields. Default is
                                  all available.
  --user-fields TEXT              Retrieve specified user fields. Default is
                                  all available.
  --tweet-fields TEXT             Retrieve specified tweet fields. Default is
                                  all available.
  --expansions TEXT               Request a specific set of expansions.
                                  Default is all available.
  --hide-progress                 Hide the Progress bar. Default: show
                                  progress, unless using pipes.
  --help                          Show this message and exit.

So you can run:

twarc2 search --max-results 500 --tweet-fields "id,text,lang,public_metrics"

to get only id, text, lang, and public_metrics, and it will work with 500 results per page because context_annotations is not there to limit it to 100.

Maybe this will be common enough to warrant a --no-context-annotations or --fast switch that automatically implies:

--max-results 500 --tweet-fields "id,text,...everything except context annotations..."

Thoughts? Right now i'm inclined to leave it as is - to be aligned with the API.
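
As a rough sketch of what such a switch could imply, the field list can be built by dropping context_annotations from the full set. The field names below are an illustrative subset of the Twitter API v2 tweet fields, and `ALL_TWEET_FIELDS` and `tweet_fields_without` are hypothetical names, not twarc's actual code:

```python
# Illustrative subset of Twitter API v2 tweet fields (assumption: this is
# NOT twarc's actual list, just enough names to show the idea).
ALL_TWEET_FIELDS = [
    "attachments",
    "author_id",
    "context_annotations",
    "conversation_id",
    "created_at",
    "entities",
    "id",
    "lang",
    "public_metrics",
    "text",
]


def tweet_fields_without(excluded, fields=ALL_TWEET_FIELDS):
    """Return a comma-separated field string with the `excluded` names dropped."""
    excluded = set(excluded)
    return ",".join(name for name in fields if name not in excluded)


# Hypothetical helper: the value a --no-context-annotations switch might
# pass along as --tweet-fields.
fast_fields = tweet_fields_without(["context_annotations"])
print(fast_fields)
```

The resulting string has the same comma-separated shape as the value passed to --tweet-fields in the example above.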

@igorbrigadir
Contributor Author

Gonna finish this so it works in a bit, and I'm also gonna try to use **kwargs in the methods for all the parameters. Other things to decide on: move the command line decorators to decorators2.py, or keep them in command2.py? I put them in with the command line because it made sense to me.

@edsu
Member

edsu commented Sep 29, 2021

It looks like it removes 100 lines of code, which seems like a good direction to head in.

@SamHames
Contributor

For the record I'm still opposed to customising fields, apart from the --no-context-annotations option. If people need that level of control I think other packages might be more suitable. Since there are also no downsides to the defaults (apart from the context annotations affecting the page size), I'm not in favour of customisation for the sake of customisation.

I would suggest that if we do go down this path, I'd prefer that we actively discourage using it - if people are frequently dropping bits and pieces, it means that downstream tools consuming Twitter V2 JSON suddenly have a lot more variation to contend with.

Putting the grumbling aside the kwargs refactor and standardised decorators are great!

Suggestion:

  • Instead of exposing max_results as an additional parameter, set it to 100 or 500 depending on whether the context annotations are included, so that data collection is as fast as possible by default. I don't think the CLI needs to be as configurable as the library.
  • The decorators can stay in this file - they only really make sense in the context of the click CLI.

@igorbrigadir
Contributor Author

Ah i thought i replied to this earlier... but yeah - i also think we should try to minimize people being able to shoot themselves in the foot and make things simple. But i'm also constantly finding the need to be able to make arbitrary API call parameters and having the command line have the same capability as the API really helps.

On the bright side - If we document a way of using it, people won't mess up - nobody seems to read --help strings anyway (lol), and if someone does and uses these "undocumented" features, they are also likely to know what they're doing.

With that in mind, and another related idea, i've a suggestion: a --data-coverage or --data-fidelity parameter for users:

twarc2 search "foobar" results.jsonl

The default is still 100 per page with full expansions; specifying --data-coverage will give you a preset of expansions and fields:

twarc2 search --data-coverage max "foobar" results.jsonl

(get everything, default)

twarc2 search --data-coverage high "foobar" results.jsonl

(everything except context annotations, sets max results 500.)

twarc2 search --data-coverage low "foobar" results.jsonl

e.g. output for low: https://gist.github.com/igorbrigadir/e3c1b11f7c12cbb4656ddfd0be97b1ea (I wanted IDs only, but it seems it's mandatory to get text and user names). This is the minimal representation of tweets that still has all the references to everything - may be useful for large crawls that are aiming to analyze networks, for example.

What do you think?

@SamHames
Contributor

SamHames commented Sep 30, 2021 via email

@igorbrigadir
Contributor Author

I added some more things and validation, here's how it behaves now:

max_results is 100 by default. If --no-context-annotations or --minimal-fields is specified together with --archive, it is set to 500, unless it is set manually, in which case it can be any value between 10 and 500 when using --archive, or between 10 and 100 when using the standard endpoint.
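
A minimal sketch of that rule in plain Python, assuming the behaviour is exactly as described in this comment. `resolve_max_results` is a hypothetical name for illustration and is not taken from command2.py:

```python
def resolve_max_results(archive=False, no_context_annotations=False,
                        minimal_fields=False, max_results=None):
    """Pick the page size using the rules described above.

    Hypothetical helper: twarc's real implementation may differ.
    """
    # The full-archive endpoint allows up to 500 per page, standard up to 100.
    upper = 500 if archive else 100

    # A manually set value wins, as long as it is within the allowed range.
    if max_results is not None:
        if not 10 <= max_results <= upper:
            raise ValueError(f"max_results must be between 10 and {upper}")
        return max_results

    # Without context annotations, archive searches can use the larger page.
    if archive and (no_context_annotations or minimal_fields):
        return 500

    return 100


print(resolve_max_results())                                   # default
print(resolve_max_results(archive=True, minimal_fields=True))  # larger page
```

Note that --minimal-fields without --archive stays at 100 in this sketch, since the larger page size only applies to the full-archive endpoint.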

Setting any --expansions or --*-fields option is mutually exclusive with --no-context-annotations or --minimal-fields (thanks to this really neat solution: pallets/click#257 (comment)).
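
The mutual exclusion check itself can be sketched without click. This is a plain-Python illustration of the idea, not the click-based solution linked above; `check_mutually_exclusive` is a hypothetical name, and the parameter names are assumed from the option names in the --help output:

```python
# Option names as they would appear after parsing (dashes become underscores).
FIELD_OPTIONS = (
    "expansions",
    "tweet_fields",
    "user_fields",
    "media_fields",
    "poll_fields",
    "place_fields",
)
PRESET_OPTIONS = ("no_context_annotations", "minimal_fields")


def check_mutually_exclusive(params):
    """Raise ValueError if a preset flag is combined with manual field options."""
    manual = [name for name in FIELD_OPTIONS if params.get(name)]
    presets = [name for name in PRESET_OPTIONS if params.get(name)]
    if manual and presets:
        raise ValueError(
            f"{', '.join(presets)} cannot be combined with {', '.join(manual)}"
        )
    return params


# A preset on its own is fine; combining it with --tweet-fields is rejected.
check_mutually_exclusive({"minimal_fields": True})
try:
    check_mutually_exclusive({"minimal_fields": True, "tweet_fields": "id,text"})
except ValueError as error:
    print(error)
```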

I think that covers it well enough!

@igorbrigadir
Contributor Author

I fell behind a bit with this but will fix it up to include the latest changes in main

@igorbrigadir
Contributor Author

I think that's all of them, but i haven't tested all the commands yet.

@SamHames
Contributor

SamHames commented Oct 15, 2021 via email

@igorbrigadir
Contributor Author

I think I did break a few things after all, so I'll have another go at fixing it up, but yeah - the searches and timelines ones I haven't even tried yet 😅

@igorbrigadir
Contributor Author

Almost there i think! I still have a feeling something somewhere is wrong though.

@edsu
Member

edsu commented Oct 15, 2021

I guess this would be where having a test of the CLI would be useful eh? Why don't we merge this work, since it's a lot, and put it out there and see what breaks?

@igorbrigadir
Contributor Author

Yeah why not! Since the tests pass, maybe we can merge, and I'm going to attempt to add proper click tests in another PR? https://click.palletsprojects.com/en/8.0.x/testing/ or wait a bit more for me to add these to this PR? I'm fine with either way.

@igorbrigadir igorbrigadir marked this pull request as ready for review October 15, 2021 18:59
@edsu
Member

edsu commented Oct 15, 2021

Yeah, even some basic click tests would be an awesome thing to add. But I don't think they are needed for this PR.

@edsu
Member

edsu commented Oct 20, 2021

I was testing with twarc2 search --minimal-fields politics data.jsonl and noticed in the progress bar that the results were going up by 100 instead of 500. Wasn't one of the ideas behind introducing control over the fields that the page size could be increased automatically to 500?

@SamHames
Contributor

@edsu - I think you need --archive set as well - I don't think the 500 tweets/page is available on the standard access endpoint.

I did just do some very basic checking myself and it seems to be mostly working - just a couple of minor things I'll push up in a minute.

@igorbrigadir
Contributor Author

Had another pass through and found a few more things missing or not working. I think I've tried all the commands now with --minimal-fields and it all worked, so if that works correctly, all the others should too.

@edsu edsu merged commit 1f22b86 into main Oct 23, 2021
@edsu edsu deleted the manual-fields-expansions branch October 23, 2021 20:55
@edsu
Member

edsu commented Oct 23, 2021

This was just released as part of v2.8.0! Thanks for all the work @igorbrigadir and spot checking @SamHames.
