Topics Index: streamlining Topic Extraction across bitcoin resources #83
Comments
I think this could greatly improve our data infrastructure. Maybe having this in its own UI would be a good idea. We already have the data, so it seems like we just need to focus on consuming it and presenting it in a user-friendly interface. Here are a few thoughts and questions I had:
cc @kouloumos
@jrakibi Currently we generate topics from the summary text of a given document. An LLM takes the summary text as input and returns a list of topics. We then check each generated topic against our predefined list of topics (btc_topics.csv): topics present in the predefined list are the Primary Topics, and the rest are considered Secondary Topics.
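The primary/secondary split described above can be sketched roughly as follows. This is only an illustration, not the code in the llm-fine-tuning repo: the function name `split_topics` is hypothetical, and the case-insensitive, whitespace-trimmed matching is an assumption about how generated topics would be compared against btc_topics.csv.

```python
from __future__ import annotations

from typing import Iterable


def split_topics(
    generated: Iterable[str], predefined: Iterable[str]
) -> tuple[list[str], list[str]]:
    """Split LLM-generated topics into (primary, secondary).

    Primary topics appear in the predefined list; everything else is secondary.
    Matching ignores case and surrounding whitespace (an assumption).
    """
    known = {topic.strip().lower() for topic in predefined}
    primary: list[str] = []
    secondary: list[str] = []
    for topic in generated:
        cleaned = topic.strip()
        if not cleaned:
            continue  # skip empty strings the LLM may return
        if cleaned.lower() in known:
            primary.append(cleaned)
        else:
            secondary.append(cleaned)
    return primary, secondary
```

In practice `predefined` would be read from btc_topics.csv and `generated` would be the parsed LLM response.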
Before we can focus on UI presentation, we need to establish a well-structured Topics Index. The initial effort should go into finding the right format for this index. Once that’s in place, there’s a lot of potential for visualization, and we already have some foundational ideas for this—like in the original Bitcoin Search designs and the Explore page on Bitcoin Transcripts.
Urvish provided a good overview of how we derive these topics, and there’s more on this in the llm-fine-tuning repo. However, the current approach isn’t fully optimized for production, so we’ll build a more effective process that integrates the new Topics Index.
Topics belong to Categories. Same idea as the structure on the Explore page of Bitcoin Transcripts.
All those ideas are taking us closer to curriculum generation. The first step here is establishing Topic tags for Resources in our Knowledge Base.
Of course we'll make it happen.
Why it’s Important:
The goal is to define a comprehensive list of topics that can be used across all resources in our infrastructure for consistent topic extraction. By creating a centralized, scalable source of topic definitions, we ensure that Large Language Models (LLMs) have access to a predefined list for accurate extraction and tagging. This enables users to explore specific topics efficiently across a wide range of Bitcoin-related sources.
Current Limitations:
Right now, our method for managing topics relies heavily on fetching topics from Optech, which is only integrated with Bitcoin Transcripts. This approach isn’t scalable across our broader data infrastructure. The current setup involves fetching and merging Optech Topics directly within the Bitcoin Transcripts repository. While useful, it limits broader adoption across our other products.
Here’s an overview of what we currently do:
In the Bitcoin Transcripts registry, we fetch Optech Topics and extract the following information for each topic using the CategoryInfo type:
We also have topic tags that aren’t included in the original Optech Topics index, which are currently listed under Misc.
Additionally, we have a cron job called "Topic Modeling generation" that runs in the llm-fine-tuning repo. This job queries Elasticsearch for documents that don’t yet have topic modeling applied across specified sources. Using GPT and a predefined list of Bitcoin-related topics, it generates primary and secondary topics for each document. However, the list of topics used in this process is now outdated and doesn’t align with our new centralized topic list. We also no longer need both primary and secondary topics—one set of topics will suffice. Moreover, afaik the quality of the tags generated by this cron job has never been formally evaluated, and we never utilized them in any product.
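For reference, the "documents that don’t yet have topic modeling applied" lookup could look something like the query body below. It is a sketch using standard Elasticsearch query DSL (`bool`, `terms`, `exists`); the field names `topics` and `domain` and the helper name `build_untagged_query` are assumptions, not the names used by the actual cron job.

```python
from __future__ import annotations

from typing import Any, Iterable


def build_untagged_query(
    sources: Iterable[str],
    topics_field: str = "topics",  # hypothetical field holding extracted topics
    source_field: str = "domain",  # hypothetical field identifying the source
    size: int = 100,
) -> dict[str, Any]:
    """Build a search body matching documents from the given sources
    that have no value in the topics field yet."""
    return {
        "size": size,
        "query": {
            "bool": {
                # restrict to the sources we want to process
                "filter": [{"terms": {source_field: list(sources)}}],
                # keep only documents where the topics field is absent
                "must_not": [{"exists": {"field": topics_field}}],
            }
        },
    }
```

The resulting dictionary would be passed as the search body to whichever Elasticsearch client the job uses.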
New Approach:
To build a scalable infrastructure for topic extraction and tagging across all products, we need a centralized repository that defines topic tags and categories. This repository will replace the current Optech-only approach and be used by multiple applications, including Bitcoin Transcripts. Our new topic extraction code will fetch these predefined topics, ensuring consistent application across all resources.
Important Clarification:
It’s important to emphasize that this topics index is not competing with our "Decoding Bitcoin" product. While "Decoding Bitcoin" started as an index of topics, it has since evolved into a broader resource. The topics index described here focuses solely on providing concise excerpts for each topic, not full definitions or in-depth content. Its purpose is to help us better define topic tags for efficient extraction and tagging across our infrastructure.
Implementation Plan:
Centralized Topic Repository:
We will create a new repository to house all topic tags and categories. This will replace the current process of fetching and merging topics within Bitcoin Transcripts. The repository will be accessible to multiple projects, ensuring consistency in tagging and extraction.
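To make the idea concrete, here is a rough sketch of how a consumer might load such an index and group topics by category. Everything about the shape is an assumption for discussion: the JSON layout, the field names (`slug`, `title`, `categories`, `excerpt`), the possibility of a topic having several categories, and the helper names are all hypothetical and are not the existing CategoryInfo type.

```python
from __future__ import annotations

import json
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    slug: str                    # hypothetical stable identifier used as the tag
    title: str
    categories: tuple[str, ...]  # categories this topic belongs to
    excerpt: str = ""            # concise excerpt, not a full definition


def load_topics(raw_json: str) -> dict[str, Topic]:
    """Parse a JSON array of topic objects into a slug -> Topic mapping."""
    topics: dict[str, Topic] = {}
    for entry in json.loads(raw_json):
        topic = Topic(
            slug=entry["slug"],
            title=entry["title"],
            categories=tuple(entry["categories"]),
            excerpt=entry.get("excerpt", ""),
        )
        if topic.slug in topics:
            raise ValueError(f"duplicate topic slug: {topic.slug}")
        topics[topic.slug] = topic
    return topics


def group_by_category(topics: dict[str, Topic]) -> dict[str, list[str]]:
    """Return category -> sorted topic slugs, e.g. for an Explore-style view."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for topic in topics.values():
        for category in topic.categories:
            grouped[category].append(topic.slug)
    return {category: sorted(slugs) for category, slugs in grouped.items()}
```

Rejecting duplicate slugs at load time is one way to keep tags unambiguous across products; the real repository may enforce this differently.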
Standardized Topic Information:
CategoryInfo schema used in the Bitcoin Transcripts Registry.
Integration Across Products:
Future Vision:
As this repository evolves, it could eventually become a standalone page, similar to a glossary, on the main Bitcoin Dev Project website. This would provide an additional resource for users to explore key Bitcoin topics in a more structured and comprehensive way.