
How to export content to csv? #1

Open
anatoliivanov opened this issue Oct 31, 2021 · 6 comments

Comments

@anatoliivanov

Hey @Watchful1, I ran the script to iterate over the contents of the zst dumps, but the output only shows the number of lines it has iterated. How do I export the contents to a CSV file so that I can start using it for analysis and model building?

@Watchful1
Owner

That depends on what you're trying to do. Which script did you use? Which dump files do you have?

@aryashah2k

I think I have a similar question to @anatoliivanov's. What he is trying to say, and what I am also trying to achieve, is to export all of the lines to a comma-separated values (CSV) file, so that I can view the data as a spreadsheet and then use it for data analysis, etc.
@Watchful1, I would appreciate your help with this.
While running your script "single_file.py", all we get is the number of lines it has iterated over.
How should we use this data for further analysis?

@Watchful1
Owner

It's not a question with a single answer. It varies depending on what files you're processing, what filtering you want to do, what fields you want to output, etc.

But generally speaking, this code is just intended as an example of reading the compressed files; actually doing something with the data once it's read would have to be done by editing the script yourself.
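As a starting point, a CSV export along the lines discussed above might look like the sketch below. It assumes the third-party `zstandard` package (`pip install zstandard`); the field list and the file names (`relationships_submissions.zst`, `relationships_submissions.csv`) are hypothetical placeholders you would replace with your own.

```python
import csv
import io
import json

# Hypothetical field list -- adjust to whichever columns your analysis needs.
FIELDS = ["author", "created_utc", "score", "title", "selftext"]

def json_line_to_row(line, fields=FIELDS):
    """Parse one NDJSON line and pull out the chosen fields."""
    obj = json.loads(line)
    return [obj.get(f, "") for f in fields]

def write_csv(lines, out_file, fields=FIELDS):
    """Write an iterable of NDJSON lines to an open CSV file handle."""
    writer = csv.writer(out_file)
    writer.writerow(fields)
    for line in lines:
        if line.strip():
            writer.writerow(json_line_to_row(line, fields))

if __name__ == "__main__":
    # Decompression needs the third-party zstandard package; the dumps
    # are compressed with a large window, hence max_window_size.
    import zstandard
    with open("relationships_submissions.zst", "rb") as fh, \
         open("relationships_submissions.csv", "w",
              newline="", encoding="utf-8") as out:
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        stream = io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")
        write_csv(stream, out)
```

Because the decompressed stream is consumed line by line, this never holds the whole dump in memory; but, as noted below, the resulting CSV can still be far too large for Excel.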

@aryashah2k

aryashah2k commented Nov 28, 2021

Indeed, I am trying to figure that out. Just out of curiosity, are these files in NDJSON format (the files from the Academic Torrents Pushshift dumps)?
I am using the r/relationships data for my analysis.
Source: https://academictorrents.com/details/cbe9a74749406433ca5c7b29d0c003dafb91d02b

@Watchful1
Owner

Yes, these files are NDJSON compressed with ZStandard. But uncompressed, all together it's something like 30 gigabytes. Even if you put it all in a single CSV file, Excel couldn't open it. That's more RAM than most computers have, so unless you're using a program specifically suited to analyzing large amounts of data, it will struggle if it works at all.

With large amounts of data like this, it's important to have a specific plan for what analysis you want to do, then do it directly from the compressed files rather than trying to change it into some alternative format first.
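As one illustration of working directly from the compressed files, the sketch below streams the NDJSON lines and tallies a single field (for example, posts per author) without ever writing an intermediate CSV. The input file name is a hypothetical placeholder, and decompression again assumes the third-party `zstandard` package.

```python
import json
from collections import Counter

def count_by_field(json_lines, field="author"):
    """Stream NDJSON lines and tally occurrences of one field's values."""
    counts = Counter()
    for line in json_lines:
        if not line.strip():
            continue
        obj = json.loads(line)
        counts[obj.get(field, "[missing]")] += 1
    return counts

if __name__ == "__main__":
    import io
    import zstandard  # pip install zstandard
    # Hypothetical file name -- substitute your own dump file.
    with open("relationships_submissions.zst", "rb") as fh:
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        stream = io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")
        for author, n in count_by_field(stream, "author").most_common(10):
            print(author, n)
```

The same pattern extends to any streaming aggregation (counts per month, score histograms, keyword filters): parse one line, update a small accumulator, discard the line.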

@aryashah2k

Yes buddy, I realised that now, thanks. I will probably work out a way to analyze the data directly from the compressed files :)
