Commit 2787217

jjasghar, mingxzhao, and bjhargrave committed
InstructLab and Deepsearch
This is the proposal to start integrating the document conversion system Deepsearch from IBM Research with InstructLab

Co-authored-by: Ming Zhao <mingzhao@ibm.com>
Co-authored-by: BJ Hargrave <hargrave@us.ibm.com>
Signed-off-by: JJ Asghar <awesome@ibm.com>
1 parent b4e8df2 commit 2787217

1 file changed: +53 -0 lines changed

@@ -0,0 +1,53 @@
# DeepSearch + InstructLab Integration Proposal

<https://github.com/DS4SD>

## Why is a Conversion Tool Necessary?

Managing submissions for the open-source InstructLab project has revealed a significant bottleneck in processing
knowledge documents. For the InstructLab backend to use these documents effectively, they must be in markdown
format. Currently, we accept only Wikipedia articles, but the built-in conversion tool is inadequate. Inside IBM
and at other companies, knowledge submissions arrive in many document formats, including PDF, and must be
converted to markdown before they can be used in InstructLab.

Existing open-source methods, such as Pandoc, are inconsistent. While they preserve text, they struggle with parsing
tables and special symbols, as evidenced by issues in PR #1154 of the taxonomy repo in the InstructLab project. Other
open-source solutions have similar shortcomings.

## Why DeepSearch?

IBM's DeepSearch software excels in document conversion, outperforming traditional open-source methods. Using a
computer vision model layer, it accurately parses the content of files, including titles, headers, and tables.
It also automatically implements RAG (retrieval-augmented generation) layers for models, which could benefit the
InstructLab process in the future.

## Integration Proposal

To maintain the open-source nature of the project while leveraging the strengths of DeepSearch, we propose a
two-pronged approach:

### Open-Source Conversion:

- Implement a basic document conversion tool in the UI using an open-source method such as Pandoc. This tool will be
  lightweight and easily hosted, ensuring it can be used and improved by the community.

### DeepSearch Integration:

- Enable the UI to switch the conversion endpoint to DeepSearch, allowing high-fidelity markdown conversions for
  backend use. This approach maintains an open-source version while benefiting from DeepSearch's superior
  conversion capabilities.

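To make the two prongs concrete, the following is a minimal sketch of how the UI might choose where to send a document
for conversion. It is illustrative only and rests on assumptions this proposal does not settle: the endpoint URL is a
placeholder, and the names `ConversionTarget`, `select_conversion_target`, and `pandoc_command` are hypothetical. The
Pandoc arguments assume the standard `--from` and `--to` options with `gfm` (GitHub-flavored markdown) as the output
format. The DeepSearch request itself is left out because its API is not described here.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical placeholder: the proposal does not name the hosted endpoint.
DEEPSEARCH_ENDPOINT = "https://deepsearch.example.invalid/convert"


@dataclass(frozen=True)
class ConversionTarget:
    """Describes where the UI sends a document for markdown conversion."""
    backend: str              # "pandoc" (open-source default) or "deepsearch"
    endpoint: Optional[str]   # remote URL, or None when converting locally


def select_conversion_target(use_deepsearch: bool,
                             endpoint: str = DEEPSEARCH_ENDPOINT) -> ConversionTarget:
    """Return the open-source default unless the DeepSearch switch is on."""
    if use_deepsearch:
        return ConversionTarget(backend="deepsearch", endpoint=endpoint)
    return ConversionTarget(backend="pandoc", endpoint=None)


def pandoc_command(input_path: str, input_format: str) -> List[str]:
    """Build (but do not run) a Pandoc invocation that emits markdown."""
    return ["pandoc", "--from", input_format, "--to", "gfm", input_path]


if __name__ == "__main__":
    # Default path: local, open-source conversion.
    print(select_conversion_target(use_deepsearch=False))
    print(pandoc_command("knowledge.html", "html"))
    # Opt-in path: route the same document to the hosted endpoint.
    print(select_conversion_target(use_deepsearch=True))
```

Keeping the choice in a single function means the open-source path stays the default and the DeepSearch path is an
opt-in setting, so the UI keeps working if the hosted endpoint is ever withdrawn.
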
IBM Research and the DeepSearch team will host the DeepSearch endpoint for the open-source community. This
arrangement benefits the community by streamlining contributions, and it provides data and exposure for the
DeepSearch project. IBM's contribution underscores its commitment to supporting and improving open-source projects.

This integration will showcase the value of DeepSearch and its potential for those integrating InstructLab into
their workflows. If the volume of community requests becomes unsustainable for the DeepSearch team, we hope for
ample notice so that the community can find alternative solutions. By then, we anticipate that the open-source
versions will have improved sufficiently, or that the value of the integration will justify continued support.

By adopting this two-pronged approach, we preserve the integrity of the open-source project while leveraging IBM's
advanced DeepSearch capabilities. This strategy balances community collaboration with innovative technology,
fostering improvement in document processing for the InstructLab project.
