Manipulate documents before sending to elasticsearch #1156

ddanewitz · 2021-05-26T14:35:35Z


Hello,

I am trying to use FSCrawler with Elasticsearch to query documents in a file system. In addition, I need to flag files whose text contains personal information before they are sent to Elasticsearch.

I found a way to do this with an ingest pipeline and the Painless scripting language, but it would be easier for me to manipulate the FSCrawler results myself (using third-party natural language processing tools) before they are sent to Elasticsearch.

To achieve this, I tried to route the results from FSCrawler to a REST endpoint on my local machine by changing the elasticsearch.nodes.url parameter in _settings.yaml to point to that endpoint:

elasticsearch:
  nodes:
  - url: "http://localhost:8180/FSCrawlerRestTest"

But this does not seem to be allowed: when I run FSCrawler, I get the following exception:

java.lang.IllegalArgumentException: Invalid HTTP host: localhost:8180/FSCrawlerRestTest
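
The message is consistent with the HTTP client expecting only a scheme, host, and port in the node URL, so the /FSCrawlerRestTest path is what gets rejected (this is my reading of the error, not something confirmed by the FSCrawler documentation). A small Python sketch of that split, where split_node_url is a hypothetical helper name:

```python
from urllib.parse import urlsplit


def split_node_url(url: str) -> tuple[str, str]:
    """Split a URL into its scheme://host:port part and its path part.

    Hypothetical helper for illustration only; it is not part of FSCrawler.
    """
    parts = urlsplit(url)
    base = f"{parts.scheme}://{parts.netloc}"
    return base, parts.path


base, path = split_node_url("http://localhost:8180/FSCrawlerRestTest")
# base is the part that looks like a plain node address,
# path is the extra component the exception complains about.
print(base, path)
```

Under that assumption, an endpoint that listens at the root of http://localhost:8180 would at least pass the URL check, although it would still have to answer every request FSCrawler sends to Elasticsearch.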

Is there a way to route the results from FSCrawler through another REST endpoint before they are eventually sent to the Elasticsearch REST endpoint, or is the ingest pipeline my only option for manipulating the results?

dadoonet · 2021-07-21T15:55:23Z

Maintainer

You can try the Workplace Search output instead of the Elasticsearch one.
IMO it is more flexible than the Elasticsearch output.

Otherwise, there is the very nice PR #1004, which might help you implement what you are looking for.
Sadly, I have not yet had time to rebase it and see whether I can merge it.
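
If you still want to experiment with an intermediate endpoint in the meantime, the core of it would be rewriting the bulk request body before forwarding it to Elasticsearch. Below is a rough Python sketch of that step only, not a tested integration. It assumes the documents arrive as an Elasticsearch _bulk NDJSON body and that the extracted text is in the content field; the contains_personal_info field, the looks_personal detector (a naive email regex standing in for a real NLP tool), and the function names are all hypothetical. The proxy server itself and the forwarding of non-bulk requests are left out.

```python
import json
import re
from typing import Callable

# Hypothetical stand-in for a real NLP-based detector:
# flags text containing something shaped like an email address.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def looks_personal(text: str) -> bool:
    return bool(EMAIL_RE.search(text))


def annotate_bulk_body(body: str,
                       detector: Callable[[str], bool] = looks_personal) -> str:
    """Add a boolean flag to every indexed document in a _bulk NDJSON body."""
    lines = [ln for ln in body.split("\n") if ln.strip()]
    out = []
    i = 0
    while i < len(lines):
        action = json.loads(lines[i])
        out.append(lines[i])          # action line is passed through unchanged
        i += 1
        op = next(iter(action))       # "index", "create", "update" or "delete"
        if op == "delete":
            continue                  # delete actions have no source line
        if i >= len(lines):
            break                     # malformed body: action without a source
        source = json.loads(lines[i])
        i += 1
        if op in ("index", "create"):
            text = source.get("content") or ""
            # Hypothetical field name; choose whatever fits your mapping.
            source["contains_personal_info"] = detector(text)
        out.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(out) + "\n"
```

Keeping the detector as a plain callable means the regex can be swapped for a third-party NLP library without touching the bulk-parsing logic.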
