Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance Scraper #64

Merged
merged 9 commits into from
Jun 12, 2024
Merged

Enhance Scraper #64

merged 9 commits into from
Jun 12, 2024

Conversation

urvishp80
Copy link
Contributor

@urvishp80 urvishp80 commented Apr 30, 2024

Bitcoin Transcripts:

This update will account for edits to transcripts and will handle this issue. Handling Edits and AI-Generated Transcripts in Bitcoin Transcripts Repository

Includes:

  • Unique custom 'id' (based on file path and filename) for each document to avoid duplication of data
  • Script that updates the document with the provided fields, keeping the other fields as it is, ensuring any changes in the file are updated in the ES index.
  • Convert code from JavaScript to Python
  • Remove redundant code

Mailing Lists:

  • update custom URL for domain field
  • change cron job to be run manually for an inactive mailing lists

Bitcoin Talk:

  • migrate from Node.js to Python

Bitcoin Optech:

  • migrate from Node.js to Python

- custom 'id' for each document
- Update the document based on custom 'id' in ES with provided fields and keep the existing fields as it is
- convert code from JavaScript to Python
- remove redundant code
- update custom url for domain field
- changes cron job to be run manually for inactive mailing lists
- migrate from Node.js to Python
- migrate from Node.js to Python
@urvishp80 urvishp80 changed the title Enhance BitcoinTranscripts Scraper Enhance Scraper May 15, 2024
Copy link
Contributor

@kouloumos kouloumos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the new logic for Bitcoin Transcripts, and it looks good overall! I left a couple of comments inline.

Unfortunately, I still can't test this locally due to the "ModuleNotFoundError: No module named 'common'" error I mentioned in our DMs.

Feature idea:
It would be extremely useful if all the scripts had a test mode to execute the logic without involving the Elasticsearch index. Perhaps exporting the docs in JSON could make it easier to test without the Elasticsearch dependency. What do you think?

bitcointranscripts/main.py Outdated Show resolved Hide resolved
bitcointranscripts/main.py Show resolved Hide resolved
bitcointranscripts/main.py Outdated Show resolved Hide resolved
bitcointranscripts/main.py Outdated Show resolved Hide resolved
bitcointranscripts/main.py Show resolved Hide resolved
bitcointranscripts/main.py Outdated Show resolved Hide resolved
bitcointranscripts/main.py Outdated Show resolved Hide resolved
bitcoinops/main.py Outdated Show resolved Hide resolved
@kouloumos kouloumos mentioned this pull request May 30, 2024
- optimizing the script
- migrate from Node.js to Python
- resolve: Empty document body #68
- removed typo "$" from url
- optimize the code
Copy link
Contributor

@kouloumos kouloumos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me and I think it can be merged iff it has been tested by @urvishp80.

As I previously said, I have difficulty testing this locally. The difficulty of testing locally because of coupling it so tightly with ES index has also been raised by @Emmanuel-Develops on discord. It's okay to merge this, and I will open an issue for dealing with locally running the scraper, what do you think @urvishp80 ?

Also, there are additional changes that needs to be done on the bitcoin transcripts scraper to allow filtering for AI-generated transcripts. As this PR contains a lot of changes and refactors, I think it's better to do that later alongside some other changes that I have in mind (I'll open an issue for these also).

bitcoinops/main.py Show resolved Hide resolved
bitcointranscripts/main.py Outdated Show resolved Hide resolved
Copy link
Contributor

@kouloumos kouloumos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice additions!

@urvishp80
Copy link
Contributor Author

closes #67

@urvishp80
Copy link
Contributor Author

closes #68

@urvishp80 urvishp80 merged commit 53d0f2a into master Jun 12, 2024
@urvishp80 urvishp80 deleted the enhance-bitcointranscripts branch June 13, 2024 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants