| 1 | + |
| 2 | +# DeepSearch + InstructLab Integration Proposal |
| 3 | + |
| 4 | +<https://github.com/DS4SD> |
| 5 | + |
| 6 | +## Why is a Conversion Tool Necessary? |
| 7 | + |
| 8 | +Managing submissions for the open-source InstructLab project has revealed a significant bottleneck in processing |
| 9 | +knowledge documents. For the InstructLab backend to effectively utilize these documents, they must be in markdown |
| 10 | +format. Currently, we only accept Wikipedia articles, but the built-in conversion tool is inadequate. At IBM |
| 11 | +and other companies, many knowledge submissions arrive in other document formats, including PDF, and must be |
| 12 | +converted to markdown before they can be used in InstructLab. |
| 13 | + |
| 14 | +Existing open-source methods, such as Pandoc, are inconsistent. While they preserve text, they struggle with parsing |
| 15 | +tables and special symbols, as evidenced by issues in PR #1154 of the taxonomy repo in the InstructLab project. Other |
| 16 | +open-source solutions have similar shortcomings. |
| 17 | + |
| 18 | +## Why DeepSearch? |
| 19 | + |
| 20 | +IBM's DeepSearch software excels in document conversion, outperforming traditional open-source methods. Utilizing a |
| 21 | +computer vision model layer, it accurately parses content in the files, including titles, headers, and tables. |
| 22 | +Additionally, it automatically implements RAG layers for models, which could benefit the InstructLab process in |
| 23 | +the future. |
| 24 | + |
| 25 | +## Integration Proposal |
| 26 | + |
| 27 | +To maintain the open-source nature of the project while leveraging the strengths of DeepSearch, we propose a |
| 28 | +two-pronged approach: |
| 29 | + |
| 30 | +### Open-Source Conversion: |
| 31 | + |
| 32 | +- Implement a basic document conversion tool in the UI using an open-source method such as Pandoc. This tool will be |
| 33 | +lightweight and easily hosted, ensuring it can be used and improved by the community. |
| 34 | + |
| 35 | +### DeepSearch Integration: |
| 36 | + |
| 37 | +- Enable the UI to switch the conversion endpoint to DeepSearch, allowing high-fidelity markdown conversions for |
| 38 | +backend use. This approach maintains an open-source version while benefiting from DeepSearch's superior |
| 39 | +conversion capabilities. |
| 40 | + |
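The switch between the two conversion backends could look roughly like the sketch below. It is illustrative only: the class names, `select_converter`, and the shape of the remote request are assumptions made for this example, not the actual UI code or the DeepSearch API. The Pandoc backend only relies on the standard `pandoc` command-line flags.

```python
import subprocess
import urllib.request
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Protocol


class Converter(Protocol):
    """Any backend that turns a document on disk into markdown text."""

    def convert(self, source: Path) -> str: ...


@dataclass(frozen=True)
class PandocConverter:
    """Default open-source backend: shells out to a locally installed pandoc."""

    def command(self, source: Path) -> List[str]:
        # "-t gfm" requests GitHub-Flavored Markdown. Without "-o", pandoc
        # writes to stdout and guesses the input format from the extension.
        return ["pandoc", str(source), "-t", "gfm"]

    def convert(self, source: Path) -> str:
        result = subprocess.run(
            self.command(source), capture_output=True, text=True, check=True
        )
        return result.stdout


@dataclass(frozen=True)
class RemoteConverter:
    """Hosted backend. The URL and request/response shape are hypothetical."""

    endpoint: str

    def convert(self, source: Path) -> str:
        # Assumption: the service accepts raw document bytes via POST and
        # replies with UTF-8 markdown. The real DeepSearch API may differ.
        request = urllib.request.Request(
            self.endpoint, data=source.read_bytes(), method="POST"
        )
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8")


def select_converter(remote_endpoint: Optional[str] = None) -> Converter:
    """Use the hosted endpoint when one is configured, else fall back to pandoc."""
    if remote_endpoint:
        return RemoteConverter(endpoint=remote_endpoint)
    return PandocConverter()
```

With no endpoint configured, `select_converter()` returns the Pandoc backend, so a self-hosted deployment keeps working without any hosted service. Passing a URL switches every conversion to the remote backend without changing the calling code.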
| 41 | +IBM Research and the DeepSearch team will host the DeepSearch endpoint for the open-source community. This |
| 42 | +arrangement benefits the community by streamlining contributions and provides data and exposure for the DeepSearch |
| 43 | +project. IBM's contribution underscores its commitment to supporting and improving open-source projects. |
| 44 | + |
| 45 | +This integration will showcase the value of DeepSearch and its potential for those integrating |
| 46 | +InstructLab into their workflows. If the volume of community requests becomes unsustainable for the DeepSearch team, |
| 47 | +we hope for ample notification to allow the community to find alternative solutions. By then, we anticipate that the |
| 48 | +open-source versions will have improved sufficiently, or the value of the integration will justify continued support. |
| 49 | + |
| 50 | +By adopting this two-pronged approach, we ensure the integrity of the open-source project while leveraging IBM's |
| 51 | +advanced DeepSearch capabilities. This strategy balances community collaboration with high-fidelity conversion, |
| 52 | +fostering continued improvement in document processing for the InstructLab project. |
| 53 | + |