Description
This is my code:
Resource resource = new FileSystemResource(filePath);
List<Document> documents = new TikaDocumentReader(resource).read();
return new TokenTextSplitter(
        knowledgeBaseFileSlice.getDefaultChunkSize(),
        knowledgeBaseFileSlice.getMinChunkSizeChars(),
        knowledgeBaseFileSlice.getMinChunkLengthToEmbed(),
        knowledgeBaseFileSlice.getMaxNumChunks(),
        knowledgeBaseFileSlice.isKeepSeparator())
        .apply(documents);
After TikaDocumentReader reads a Word document, the content it returns includes not only the text of the document but also the names and contents of the XML files inside the .docx file. If I call getText() on the resulting documents, the output includes content like this:
docProps/app.xml
Normal.dotm 1 0 0 0 0 0 false false 0 WPS Office_10.1.0.7698_F1E327BC-269C-435d-A152-05C5408002CA 0
docProps/core.xml
2023-08-26T18:18:00Z admin admin 2023-08-26T18:18:41Z 1
docProps/custom.xml
2052-10.1.0.7698
word/styles.xml
word/settings.xml
word/theme/theme1.xml
word/document.xml
What is going wrong? I expected only the text of the document.
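For context, the names in the output above (docProps/app.xml, word/document.xml, and so on) match the entry names of the ZIP container that every .docx file is. The sketch below is only an illustration using the JDK's java.util.zip, not Tika or Spring AI: it builds a tiny in-memory ZIP with the same entry names (the class name, method names, and placeholder contents are hypothetical) and lists them. If the reader's output looks like this listing, one possible explanation, which is an assumption on my part, is that the file is being handled as a generic archive rather than as a Word document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; not part of Tika or Spring AI.
public class DocxEntryLister {

    // Builds a small in-memory ZIP whose entry names mimic a .docx package.
    // The entry contents are placeholders, so this is NOT a valid Word file.
    static byte[] buildSamplePackage() {
        String[] names = {
            "docProps/app.xml", "docProps/core.xml",
            "word/styles.xml", "word/document.xml"
        };
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<placeholder/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Returns the entry names of a ZIP archive in the order they are stored.
    static List<String> listEntryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Print one entry name per line, similar in shape to the output above.
        for (String name : listEntryNames(buildSamplePackage())) {
            System.out.println(name);
        }
    }
}
```

Running listEntryNames on the bytes of the real Word file (for example, read with Files.readAllBytes) would show whether its entries match the names that appear in the getText() output.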