Splitting a very large twarc2-generated jsonl file #641
Replies: 1 comment
-
Hi, didn't get round to answering you earlier, but you generally have a few options. All of them do require some coding, though; there's unfortunately no way around that, unless you're willing to wait a while for things to be updated later.

1. Modify or write your own plugin code for the network handling so that it streams results. networkx can support large graphs, but getting around this requires some coding: the issue is storing everything in memory and then writing it all out at once, versus writing things out incrementally. I may revisit the existing plugin and see if I can fix this, but it may not be possible; it has been a while since I looked at networkx.
2. Try to split the results up at the source yourself, which is relatively straightforward (without pandas) and should maybe also be a plugin. See https://github.com/DocNow/twarc-ids/blob/main/twarc_ids.py for example: it iterates efficiently over a large file and writes out IDs, and you can build on it and extend it to write out whole tweets instead.
3. Gephi can also handle large graphs, but it's easier to import these as CSVs of edges / nodes, which can be extracted from the source JSON and written directly line by line. This is the direction I'd recommend. You can use the existing twarc-csv code as a library to process the large result file and then turn things into Gephi-friendly CSVs from that: https://twarc-project.readthedocs.io/en/latest/api/library/#twarc-csv and see here for the Gephi formats: https://gephi.org/users/supported-graph-formats/csv-format/

So the idea is that you use twarc-csv to iterate over a large file, process each tweet yourself, and write out your edge list to import into Gephi. To take advantage of twarc-csv, you can extend the …
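For options 2 and 3, here are two rough sketches of what I mean (untested, and the file names are just placeholders; both assume you've run your results through `twarc2 flatten` first, so each line of the jsonl file is a single tweet object rather than a whole API response page). First, splitting the big file into one jsonl file per calendar year, line by line, without pandas:

```python
import json

INFILE = "results.jsonl"   # placeholder name for your flattened twarc2 output

handles = {}               # one open output file per calendar year
with open(INFILE, encoding="utf-8") as infile:
    for line in infile:
        if not line.strip():
            continue
        tweet = json.loads(line)
        # created_at looks like "2022-05-04T12:34:56.000Z", so the first 4 characters are the year
        created = tweet.get("created_at", "")
        year = created[:4] if created else "unknown"
        if year not in handles:
            handles[year] = open(f"results_{year}.jsonl", "w", encoding="utf-8")
        handles[year].write(line)

for fh in handles.values():
    fh.close()
```

The per-year files are still ordinary flattened jsonl, so they should go straight back through twarc-network. And this is roughly what the Gephi-friendly edge list could look like, using a simple author-to-mentioned-user edge (swap in replies, retweets, or whatever relationship you actually want as the edge definition):

```python
import csv
import json

INFILE = "results.jsonl"   # placeholder: flattened twarc2 output, one tweet per line
OUTFILE = "edges.csv"      # Source,Target columns for Gephi's spreadsheet importer

with open(INFILE, encoding="utf-8") as infile, \
     open(OUTFILE, "w", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["Source", "Target"])
    for line in infile:
        if not line.strip():
            continue
        tweet = json.loads(line)
        # flatten normally inlines the author user object; fall back to the raw author_id if not
        author = tweet.get("author", {}).get("username") or tweet.get("author_id")
        if not author:
            continue
        # one edge per @-mention: author -> mentioned user
        for mention in tweet.get("entities", {}).get("mentions", []):
            if mention.get("username"):
                writer.writerow([author, mention["username"]])
```

Neither script ever holds more than one tweet in memory, so the 40 GB file isn't a problem, and the edge CSV can then be loaded in Gephi via Import Spreadsheet in the Data Laboratory.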
-
Hello, and apologies for this question, which stems from my ignorance in coding-related matters.
In the past I had been accessing the Twitter database with API v1 tools that no longer work. I now have access to API v2 through an academic access account that allows retrieval of up to 10 million tweets per calendar month, but, due to my coding ignorance, I was unable to take advantage of it until I discovered your wonderful twarc2 scripts, for which I am extremely grateful and which I have been using and enjoying. I have been using twarc2 to access the archive and collect tweets, and also using the twarc-network plugin to generate gexf files whose networks I have been analyzing with Gephi.

I now have the problem that some of the users I am following generate jsonl files of 8-9 million tweets, close to 40 GB, and I cannot handle these with my hardware. One avenue is to analyze them in chunks, for instance slicing by calendar year. While I can, with difficulty, import the file into a pandas dataframe and split it there, I find it impossible to turn the resulting smaller files back into jsonl files that the twarc-network plugin will recognize. I have tried to find a way to split the jsonl files into smaller jsonl files, but the information I have found is confusing and I have not succeeded. I could try to pass the nodes and edges information to Gephi "by hand", but, with my current state of python-pandas ignorance, I find the task daunting. If everything else fails, I will wait one month and consume next month's Twitter quota downloading the same tweets again from the archive in smaller time intervals, but this seems such a waste.

Could you please suggest a way out of this tight spot? Any help would be really appreciated.
Thank you again for twarc2 and for your help.
With apologies,
Juan