Commit d2bf1a6

import wiki documents to SDG
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
1 parent 8fde6f7 commit d2bf1a6

1 file changed: +38 -0 lines changed

docs/sdg/wiki-doc-source.md

@@ -0,0 +1,38 @@
# Wiki document source

Fetching information from wikis is an essential
feature for fine-tuning LLMs on public knowledge.

## Interfaces

The `document` section of the qna.yaml file takes:

- Wiki host: the base URL of a wiki host.
- Page titles: the titles of the wiki pages to fetch.
- oldid: the revision IDs of specific page versions.

The qna.yaml file can define a single host and multiple spaces and pages,
each with an optional version.

Example of fetch URL:
18+
19+
- https://en.wikipedia.org/w/index.php?title=IBM_Granite&oldid=1235007056&action=raw
20+
21+
Note that oldid is sufficient to reterieve a page:
22+
23+
- https://en.wikipedia.org/w/index.php?oldid=1235007056&action=raw
24+
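The two URL forms above can be produced with a small helper. This is a minimal sketch, assuming the wiki exposes `index.php` under `/w/` as Wikipedia does (other wikis may use a different path); the function name `build_fetch_url` is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def build_fetch_url(host: str, title: Optional[str] = None,
                    oldid: Optional[int] = None) -> str:
    """Build an index.php URL that requests the raw text of a page."""
    if title is None and oldid is None:
        raise ValueError("either title or oldid is required")
    params = {}
    if title is not None:
        params["title"] = title
    if oldid is not None:
        params["oldid"] = oldid
    params["action"] = "raw"
    # Assumption: the script path is /w/index.php, as in the examples above.
    return f"{host.rstrip('/')}/w/index.php?{urlencode(params)}"


# Both example URLs from this document:
with_title = build_fetch_url("https://en.wikipedia.org", "IBM_Granite", 1235007056)
oldid_only = build_fetch_url("https://en.wikipedia.org", oldid=1235007056)
```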
The page title is used for validation.

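This document does not spell out how the title is validated. One possible check, sketched here as an assumption, compares the requested title with the title the wiki reports for the fetched revision, treating underscores and spaces as equivalent (as MediaWiki does in page titles). The helper names are hypothetical, and obtaining the reported title is left out.

```python
def normalize_title(title: str) -> str:
    """Map underscores to spaces and collapse repeated whitespace."""
    return " ".join(title.replace("_", " ").split())


def title_matches(requested: str, reported: str) -> bool:
    """Return True when both titles refer to the same page name."""
    return normalize_title(requested) == normalize_title(reported)


# "IBM_Granite" in a URL and "IBM Granite" as a display title are the same page.
same_page = title_matches("IBM_Granite", "IBM Granite")
```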
## Changes across modules

- [Schema module](https://github.com/instructlab/schema) defines the structure and validation rules for
the qna.yaml file.
- [SDG taxonomy module](https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/utils/taxonomy.py)
fetches documents.
- [SDG unit tests](https://github.com/instructlab/sdg/tree/main/tests)

## Additional External Packages

- urllib (part of the Python standard library)
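
A fetch built on urllib alone might look like the sketch below. The function name `fetch_raw`, the timeout, and the User-Agent string are assumptions made for this example, not part of the SDG code.

```python
import urllib.request


def fetch_raw(url: str, timeout: float = 10.0) -> str:
    """Download the resource at `url` and decode it as UTF-8 text."""
    request = urllib.request.Request(
        url,
        # Assumption: an explicit User-Agent is sent; the value is illustrative.
        headers={"User-Agent": "wiki-doc-source-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (requires network access):
# text = fetch_raw("https://en.wikipedia.org/w/index.php?oldid=1235007056&action=raw")
```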