
SorrTask762_start_downloading_data #771

Merged
gpsaggese merged 5 commits into master from SorrTask762_start_downloading_data on May 29, 2024

Conversation

tkpratardan
Collaborator

tkpratardan commented Apr 8, 2024

@sonaalKant @gpsaggese
I have developed an initial version of the code to download data from the sources specified in #762. However, we have a few bottlenecks with the free APIs:

  1. goperigon.com -> limited to 150 API requests.
  2. newsdata.io -> daily limit of 10 articles.
  3. mediastack.com -> I haven't implemented a pipeline for this data source because the free
     version requires payment info.
  4. thenewsapi.com -> limited API requests per day, and access to some endpoints (e.g., crypto)
     is blocked.
  5. marketaux.com -> limited API requests, and access to some endpoints is blocked.

Review:

  1. Is this in line with expectations?
  2. Do we need to implement data formatting and processing here, like converting the JSON responses to a dataframe and cleaning the texts (see the sketch below)? Or is the data processing implemented in another script?
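
For question 2, a minimal sketch of the kind of conversion being asked about; the field names ("title", "summary", "source") are made up for illustration and are not the actual schema of any of the APIs above:

```python
import pandas as pd

# Hypothetical response payload with illustrative field names.
articles = [
    {"title": "BTC rallies ", "summary": "Bitcoin is up\n5% today.", "source": {"name": "ExampleWire"}},
    {"title": "ETH update", "summary": "  Ethereum devs ship a fix.", "source": {"name": "ExampleWire"}},
]

# Flatten the nested dicts into columns like "source.name".
df = pd.json_normalize(articles)

# Minimal text cleaning: collapse whitespace in the free-text columns.
for col in ["title", "summary"]:
    df[col] = df[col].str.replace(r"\s+", " ", regex=True).str.strip()

print(df)
```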

@sonaalKant
Collaborator

Good start. Let's get some example data points from each API and log them in a doc so that we can quickly review them and analyze the next steps.

# ## goperigon.com pipeline

# %%
class goperigonPipeline:
Collaborator


No need for OOP-style classes inside notebooks. Here, you can try interacting with the API directly: download the data and get a sense of it.
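
For what it's worth, a rough sketch of interacting with one of the APIs directly from a notebook cell; the goperigon endpoint path and query parameters below are assumptions, not verified against the provider's docs:

```python
import os

import requests

# Assumed endpoint and query parameters for goperigon.com -- check the
# provider's docs for the real paths/params before relying on this.
API_KEY = os.environ["GOPERIGON_API_KEY"]
url = "https://api.goperigon.com/v1/all"
params = {"apiKey": API_KEY, "q": "bitcoin", "size": 5}

response = requests.get(url, params=params, timeout=30)
response.raise_for_status()
data = response.json()

# Peek at the raw payload to get a sense of the schema.
for article in data.get("articles", []):
    print(article.get("title"))
```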

Collaborator


But let's follow what @sonaalKant says. This is just my opinion.

Contributor


My take is to do whatever is faster. This is exploration; we want a dataframe with data to look at.

Collaborator Author


I kept the OOP approach because I feel it gives us more flexibility by controlling the attributes of class instances. That could come in handy when extracting data with different params/endpoints, making it faster, IMO. Let me know if this is okay.
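
For concreteness, the kind of class-based setup described above might look like the sketch below; the class name, endpoint, and parameter names are illustrative, not the actual notebook code:

```python
import requests


class NewsApiPipeline:
    """Illustrative pipeline: one instance per data source / set of defaults."""

    def __init__(self, base_url: str, api_key: str, endpoint: str, **default_params):
        self.base_url = base_url
        self.api_key = api_key
        self.endpoint = endpoint
        self.default_params = default_params

    def download(self, **params) -> dict:
        # Per-call params override the instance defaults.
        query = {"apiKey": self.api_key, **self.default_params, **params}
        url = f"{self.base_url}/{self.endpoint}"
        response = requests.get(url, params=query, timeout=30)
        response.raise_for_status()
        return response.json()


# Different endpoints/params are just different instances, e.g.:
# crypto_news = NewsApiPipeline("https://api.goperigon.com/v1", "<key>", "all", q="crypto")
# payload = crypto_news.download(size=10)
```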

tkpratardan requested review from gpsaggese and sonaalKant and removed the review request for sonaalKant and gpsaggese on April 11, 2024 at 11:20
@tkpratardan
Collaborator Author

tkpratardan commented Apr 11, 2024

@gpsaggese @sonaalKant @samarth9008
I have made changes to the notebook, including some preliminary preprocessing and text cleaning for better readability.
Since the JSON responses were nested, I unpacked them while keeping the hierarchy intact. For example, the different keys under keywords become separate columns, and the same index of the list across those columns forms the corresponding values for each key. Let me know if this makes sense.
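
A minimal sketch of that flattening, with made-up field values under an assumed keywords structure:

```python
import pandas as pd

# Illustrative nested record: "keywords" maps each key to a list of values.
records = [
    {
        "title": "BTC rallies",
        "keywords": {"name": ["bitcoin", "crypto"], "weight": [0.9, 0.7]},
    }
]

# json_normalize keeps the hierarchy in the column names, e.g. "keywords.name";
# the same list index across those columns pairs up the values (bitcoin <-> 0.9).
df = pd.json_normalize(records)
print(df.columns.tolist())  # ['title', 'keywords.name', 'keywords.weight']
print(df)
```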

tkpratardan requested a review from samarth9008 on April 11, 2024 at 15:18
gpsaggese merged commit 3eef900 into master on May 29, 2024
gpsaggese deleted the SorrTask762_start_downloading_data branch on May 29, 2024 at 22:29