Topics Index: streamlining Topic Extraction across bitcoin resources #83

Open
kouloumos opened this issue Oct 21, 2024 · 3 comments

@kouloumos
Contributor

Why it’s Important:
The goal is to define a comprehensive list of topics that can be used across all resources in our infrastructure for consistent topic extraction. By creating a centralized, scalable source of topic definitions, we ensure that Large Language Models (LLMs) have access to a predefined list for accurate extraction and tagging. This enables users to explore specific topics efficiently across a wide range of Bitcoin-related sources.

Current Limitations:
Right now, our method for managing topics relies heavily on fetching topics from Optech, which is only integrated with Bitcoin Transcripts. This approach isn’t scalable across our broader data infrastructure. The current setup involves fetching and merging Optech Topics directly within the Bitcoin Transcripts repository. While useful, it limits broader adoption across our other products.

Here’s an overview of what we currently do:
In the Bitcoin Transcripts registry, we fetch Optech Topics and extract the following information for each topic using the CategoryInfo type:

type CategoryInfo = {
  title: string;
  slug: string;
  optech_url: string;
  categories: string[];
  aliases?: string[];
  excerpt: string;
};

We also have topic tags that aren’t included in the original Optech Topics index, which are currently listed under Misc.
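For illustration, here is a minimal sketch of how Optech topics and the extra, non-Optech tags could be combined into one list. This is not the code in the Bitcoin Transcripts registry: `ExtraTopic`, `TopicEntry`, `mergeTopics`, and the choice to make `optech_url` optional for non-Optech tags are all assumptions made for the example. Only `CategoryInfo` and the "Misc" bucket come from the description above.

```typescript
// Shape taken from the issue text.
type CategoryInfo = {
  title: string;
  slug: string;
  optech_url: string;
  categories: string[];
  aliases?: string[];
  excerpt: string;
};

// Hypothetical shape for tags that have no Optech page.
type ExtraTopic = {
  title: string;
  slug: string;
  excerpt: string;
  aliases?: string[];
};

// Assumption: in the merged index, optech_url becomes optional.
type TopicEntry = Omit<CategoryInfo, "optech_url"> & { optech_url?: string };

// Tags without an Optech entry are filed under "Misc", as described above.
const MISC_CATEGORY = "Misc";

function mergeTopics(optech: CategoryInfo[], extra: ExtraTopic[]): TopicEntry[] {
  const bySlug: Record<string, TopicEntry> = Object.create(null);

  optech.forEach((topic) => {
    bySlug[topic.slug] = topic;
  });

  extra.forEach((topic) => {
    // Assumption: on a slug collision the Optech entry wins.
    if (bySlug[topic.slug] !== undefined) {
      return;
    }
    const entry: TopicEntry = {
      title: topic.title,
      slug: topic.slug,
      categories: [MISC_CATEGORY],
      excerpt: topic.excerpt,
    };
    if (topic.aliases) {
      entry.aliases = topic.aliases;
    }
    bySlug[topic.slug] = entry;
  });

  // Sort by slug so the output is stable between runs.
  return Object.keys(bySlug)
    .sort()
    .map((slug) => bySlug[slug]);
}
```

Keying on `slug` keeps the merge idempotent: running it again with the same inputs yields the same list.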

Additionally, we have a cron job called "Topic Modeling generation" that runs in the llm-fine-tuning repo. This job queries Elasticsearch for documents that don’t yet have topic modeling applied across specified sources. Using GPT and a predefined list of Bitcoin-related topics, it generates primary and secondary topics for each document. However, the list of topics used in this process is now outdated and doesn’t align with our new centralized topic list. We also no longer need both primary and secondary topics—one set of topics will suffice. Moreover, afaik the quality of the tags generated by this cron job has never been formally evaluated, and we never utilized them in any product.

New Approach:
To build a scalable infrastructure for topic extraction and tagging across all products, we need a centralized repository that defines topic tags and categories. This repository will replace the current Optech-only approach and be used by multiple applications, including Bitcoin Transcripts. Our new topic extraction code will fetch these predefined topics, ensuring consistent application across all resources.

Important Clarification:
It’s important to emphasize that this topics index is not competing with our "Decoding Bitcoin" product. While "Decoding Bitcoin" started as an index of topics, it has since evolved into a broader resource. The topics index described here focuses solely on providing concise excerpts for each topic, not full definitions or in-depth content. Its purpose is to help us better define topic tags for efficient extraction and tagging across our infrastructure.

Implementation Plan:

  1. Centralized Topic Repository:
    We will create a new repository to house all topic tags and categories. This will replace the current process of fetching and merging topics within Bitcoin Transcripts. The repository will be accessible to multiple projects, ensuring consistency in tagging and extraction.

  2. Standardized Topic Information:

    • We’ll start by using topics from Optech, following the CategoryInfo schema used in the Bitcoin Transcripts Registry.
    • We’ll introduce additional topic tags that aren’t found in Optech, covering a wider range of topics from sources like Stack Exchange and Delving Bitcoin.
    • We'll also align the "Topic Modeling generation" cron job with this new repository to ensure it uses an up-to-date list of topics. The current process of generating primary and secondary topics will be simplified to generate only one set of topics. Finally, we need to evaluate the quality of the existing tags generated by this process to ensure consistency and accuracy.
  3. Integration Across Products:

    • The new topics list will be consumed not only by Bitcoin Transcripts but by other parts of our ecosystem, such as Bitcoin Search and the scraper.
    • This integration will provide a unified experience for tagging and exploring content across all platforms.
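As a rough sketch of what "one set of topics" could look like for a consumer of the index: candidate labels (for example, strings returned by an LLM) are mapped to canonical slugs through each topic's title, slug, and aliases, and anything not in the index is dropped. The names `TopicEntry`, `normalizeLabel`, `buildLookup`, and `tagWithIndex` are hypothetical, and the matching rule (case-insensitive, whitespace-collapsed exact match) is an assumption, not a decided design.

```typescript
// Minimal subset of the topic schema needed for matching.
type TopicEntry = { title: string; slug: string; aliases?: string[] };

// Lowercase, trim, and collapse whitespace so near-identical labels compare equal.
function normalizeLabel(label: string): string {
  return label.trim().toLowerCase().replace(/\s+/g, " ");
}

// Map every known label (title, slug, aliases) to its canonical slug.
function buildLookup(index: TopicEntry[]): Record<string, string> {
  const lookup: Record<string, string> = Object.create(null);
  index.forEach((topic) => {
    const labels = [topic.title, topic.slug].concat(topic.aliases || []);
    labels.forEach((label) => {
      lookup[normalizeLabel(label)] = topic.slug;
    });
  });
  return lookup;
}

// Return the de-duplicated canonical slugs for the candidates found in the index.
function tagWithIndex(candidates: string[], index: TopicEntry[]): string[] {
  const lookup = buildLookup(index);
  const slugs: string[] = [];
  candidates.forEach((candidate) => {
    const slug = lookup[normalizeLabel(candidate)];
    if (slug !== undefined && slugs.indexOf(slug) === -1) {
      slugs.push(slug);
    }
  });
  return slugs;
}
```

Because every product would resolve labels through the same lookup, two sources that mention a topic by different aliases would end up with the same tag.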

Future Vision:
As this repository evolves, it could eventually become a standalone page, similar to a glossary, on the main Bitcoin Dev Project website. This would provide an additional resource for users to explore key Bitcoin topics in a more structured and comprehensive way.

@elraphty elraphty self-assigned this Oct 25, 2024
@jrakibi

jrakibi commented Oct 28, 2024

I think this could greatly improve our data infrastructure.

Maybe having this in its own UI would be a good idea. We already have the data, so it seems like we just need to focus on consuming it and presenting it in a user-friendly interface.

Here are a few thoughts and questions I had:

  • What do you mean by primary and secondary topics?
  • Can you clarify the difference between tags and categories?
  • Is there a way to generate a curriculum for each topic simultaneously, or would that require a separate approach? It could be interesting if we could streamline this part of the process too.
  • Including links to resources for each topic would also be a great idea if we can make that happen.

cc @kouloumos

@urvishp80
Contributor

> What do you mean by primary and secondary topics?

@jrakibi Currently we generate topics from the summary text of a given document. We pass the summary text to an LLM and receive a list of topics as output.

We then check each topic in this generated list against our predefined list of topics (btc_topics.csv).

The Primary Topics are the topics that are present in the predefined list; the rest are considered Secondary Topics.
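To make the split concrete, here is a small sketch of that filtering step. It is not the code from the llm-fine-tuning repo: `splitTopics` is a hypothetical name, the predefined list is passed in as an array rather than read from btc_topics.csv, and the case-insensitive comparison is an assumption.

```typescript
type TopicSplit = { primary: string[]; secondary: string[] };

// Partition LLM-generated topics by membership in the predefined list.
function splitTopics(generated: string[], predefined: string[]): TopicSplit {
  // Assumption: topics are compared case-insensitively after trimming.
  const known: Record<string, boolean> = Object.create(null);
  predefined.forEach((topic) => {
    known[topic.trim().toLowerCase()] = true;
  });

  const result: TopicSplit = { primary: [], secondary: [] };
  generated.forEach((topic) => {
    if (known[topic.trim().toLowerCase()]) {
      result.primary.push(topic); // present in the predefined list
    } else {
      result.secondary.push(topic); // everything else
    }
  });
  return result;
}
```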

@kouloumos
Contributor Author

@jrakibi

> I think this could greatly improve our data infrastructure.
>
> Maybe having this in its own UI would be a good idea. We already have the data, so it seems like we just need to focus on consuming it and presenting it in a user-friendly interface.

Before we can focus on UI presentation, we need to establish a well-structured Topics Index. The initial effort should go into finding the right format for this index. Once that’s in place, there’s a lot of potential for visualization, and we already have some foundational ideas for this—like in the original Bitcoin Search designs and the Explore page on Bitcoin Transcripts.

> Here are a few thoughts and questions I had:
>
> * What do you mean by primary and secondary topics?

Urvish provided a good overview of how we derive these topics, and there’s more on this in the llm-fine-tuning repo. However, the current approach isn’t fully optimized for production, so we’ll build a more effective process that integrates the new Topics Index.

> * Can you clarify the difference between tags and categories?

Topics belong to Categories. Same idea as the structure on the Explore page of Bitcoin Transcripts.
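A quick sketch of that relationship, assuming the `categories: string[]` field from the `CategoryInfo` schema (so a topic may sit in more than one category). `Topic` and `groupByCategory` are hypothetical names used only for this example.

```typescript
// Minimal subset of the topic schema needed for grouping.
type Topic = { title: string; slug: string; categories: string[] };

// Build a category -> topic slugs mapping, similar in spirit to an Explore page.
function groupByCategory(topics: Topic[]): Record<string, string[]> {
  const groups: Record<string, string[]> = Object.create(null);
  topics.forEach((topic) => {
    topic.categories.forEach((category) => {
      if (groups[category] === undefined) {
        groups[category] = [];
      }
      groups[category].push(topic.slug);
    });
  });
  return groups;
}
```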

> * Is there a way to generate a curriculum for each topic simultaneously, or would that require a separate approach? It could be interesting if we could streamline this part of the process too.

All those ideas are taking us closer to curriculum generation. The first step here is establishing Topic tags for Resources in our Knowledge Base.

> * Including links to resources for each topic would also be a great idea if we can make that happen.

Of course we'll make it happen.


4 participants