// Read the file with Tika, then split the extracted text into token-based chunks.
Resource resource = new FileSystemResource(filePath);
List<Document> documents = new TikaDocumentReader(resource).read();
return new TokenTextSplitter(
        knowledgeBaseFileSlice.getDefaultChunkSize(),
        knowledgeBaseFileSlice.getMinChunkSizeChars(),
        knowledgeBaseFileSlice.getMinChunkLengthToEmbed(),
        knowledgeBaseFileSlice.getMaxNumChunks(),
        knowledgeBaseFileSlice.isKeepSeparator()).apply(documents);
The code above is what I use. After TikaDocumentReader reads a Word document, the content it returns includes not only the document's text but also the contents of the XML parts packaged inside the file. Calling getText() on the result produces output like this:
docProps/app.xml
Normal.dotm 1 0 0 0 0 0 false false 0 WPS Office_10.1.0.7698_F1E327BC-269C-435d-A152-05C5408002CA 0
docProps/core.xml
2023-08-26T18:18:00Z admin admin 2023-08-26T18:18:41Z 1
docProps/custom.xml
2052-10.1.0.7698
word/styles.xml
word/settings.xml
word/theme/theme1.xml
word/document.xml
What went wrong???
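For context on the listing above: a .docx file is a ZIP package, and names such as docProps/app.xml and word/document.xml are the standard part names inside that package. One possible reading, not confirmed here, is that the file was handled as a generic archive rather than as a Word document. The sketch below is only a diagnostic aid that uses the JDK's java.util.zip to print the entry names of a file so they can be compared with the listing. The class name DocxEntryLister and the path argument are hypothetical and not part of Spring AI or Tika.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class DocxEntryLister {

    // Returns the names of all entries in the ZIP container at the given path.
    static List<String> listEntries(Path docx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java DocxEntryLister <path-to-docx>");
            return;
        }
        // args[0] is a hypothetical path to the .docx file being investigated.
        for (String name : listEntries(Path.of(args[0]))) {
            System.out.println(name);
        }
    }
}
```

If the names printed for the file match the headings in the output above, the extra text comes from the package parts themselves and not from the visible document body.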