
SorrTask762_start_downloading_data #771

Merged
gpsaggese merged 5 commits into master from SorrTask762_start_downloading_data on May 29, 2024

Conversation

tkpratardan
Collaborator

tkpratardan commented Apr 8, 2024

@sonaalKant @gpsaggese
I have developed an initial version of the code to download data from the sources specified in #762. However, we have a few bottlenecks with the free APIs:

  1. goperigon.com -> limited to 150 API requests.
  2. newsdata.io -> daily limit of 10 articles.
  3. mediastack.com -> I haven't implemented a pipeline for this data source because the free
     version requires payment info.
  4. thenewsapi.com -> limited API requests per day, and access to some endpoints (e.g., crypto)
     is blocked.
  5. marketaux.com -> limited API requests, and access to some endpoints is blocked.

Review:

  1. Is this in line with expectations?
  2. Do we need to implement data formatting and processing here, like converting the JSON responses to a dataframe and cleaning the texts (see the sketch below)? Or is the data processing implemented in another script?
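
For question 2, a minimal sketch of the kind of conversion being asked about; the field names ("title", "summary", "source") are made up for illustration and are not the actual schema of any of the APIs above:

```python
import pandas as pd

# Hypothetical response payload with illustrative field names.
articles = [
    {"title": "BTC rallies ", "summary": "Bitcoin is up\n5% today.", "source": {"name": "ExampleWire"}},
    {"title": "ETH update", "summary": "  Ethereum devs ship a fix.", "source": {"name": "ExampleWire"}},
]

# Flatten the nested dicts into columns like "source.name".
df = pd.json_normalize(articles)

# Minimal text cleaning: collapse whitespace in the free-text columns.
for col in ["title", "summary"]:
    df[col] = df[col].str.replace(r"\s+", " ", regex=True).str.strip()

print(df)
```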

@sonaalKant
Collaborator

Good start. Let's get some example data points from each API and log them in a doc so that we can quickly review them and analyze the next steps.

# ## goperigon.com pipeline

# %%
class goperigonPipeline:
Collaborator


No need for OOP-style classes inside notebooks. Here, you can try interacting with the API directly: download the data and get a sense of it.
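
For what it's worth, a rough sketch of interacting with one of the APIs directly from a notebook cell; the goperigon endpoint path and query parameters below are assumptions, not verified against the provider's docs:

```python
import os

import requests

# Assumed endpoint and query parameters for goperigon.com -- check the
# provider's docs for the real paths/params before relying on this.
API_KEY = os.environ["GOPERIGON_API_KEY"]
url = "https://api.goperigon.com/v1/all"
params = {"apiKey": API_KEY, "q": "bitcoin", "size": 5}

response = requests.get(url, params=params, timeout=30)
response.raise_for_status()
data = response.json()

# Peek at the raw payload to get a sense of the schema.
for article in data.get("articles", []):
    print(article.get("title"))
```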

Collaborator


But let's follow what @sonaalKant says. This is just my opinion.

Contributor


My take is to do whatever is faster. This is exploration; we want a dataframe with data to look at.

Collaborator Author


I kept the OOP approach because I feel it gives us more flexibility by controlling the attributes of class instances. That could come in handy when extracting data with different params/endpoints, making it faster, IMO. Let me know if this is okay.
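
For concreteness, the kind of class-based setup described above might look like the sketch below; the class name, endpoint, and parameter names are illustrative, not the actual notebook code:

```python
import requests


class NewsApiPipeline:
    """Illustrative pipeline: one instance per data source / set of defaults."""

    def __init__(self, base_url: str, api_key: str, endpoint: str, **default_params):
        self.base_url = base_url
        self.api_key = api_key
        self.endpoint = endpoint
        self.default_params = default_params

    def download(self, **params) -> dict:
        # Per-call params override the instance defaults.
        query = {"apiKey": self.api_key, **self.default_params, **params}
        url = f"{self.base_url}/{self.endpoint}"
        response = requests.get(url, params=query, timeout=30)
        response.raise_for_status()
        return response.json()


# Different endpoints/params are just different instances, e.g.:
# crypto_news = NewsApiPipeline("https://api.goperigon.com/v1", "<key>", "all", q="crypto")
# payload = crypto_news.download(size=10)
```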

tkpratardan requested review from gpsaggese and sonaalKant and removed the review request for sonaalKant and gpsaggese on April 11, 2024 at 11:20
@tkpratardan
Collaborator Author

tkpratardan commented Apr 11, 2024

@gpsaggese @sonaalKant @samarth9008
I have made changes to the notebook, including some preliminary preprocessing and text cleaning for better readability.
Since the JSON responses were nested, I unpacked them while keeping the hierarchy intact. For example, the different keys under keywords become separate columns, and the same index of the list across those columns forms the corresponding values for each key. Let me know if this makes sense.
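
A minimal sketch of that flattening, with made-up field values under an assumed keywords structure:

```python
import pandas as pd

# Illustrative nested record: "keywords" maps each key to a list of values.
records = [
    {
        "title": "BTC rallies",
        "keywords": {"name": ["bitcoin", "crypto"], "weight": [0.9, 0.7]},
    }
]

# json_normalize keeps the hierarchy in the column names, e.g. "keywords.name";
# the same list index across those columns pairs up the values (bitcoin <-> 0.9).
df = pd.json_normalize(records)
print(df.columns.tolist())  # ['title', 'keywords.name', 'keywords.weight']
print(df)
```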

tkpratardan requested a review from samarth9008 on April 11, 2024 at 15:18
gpsaggese merged commit 3eef900 into master on May 29, 2024
gpsaggese deleted the SorrTask762_start_downloading_data branch on May 29, 2024 at 22:29