
Support for noob_ FSCrawler on Docker compose #1843

Open
shouari opened this issue Mar 26, 2024 · 12 comments

@shouari

shouari commented Mar 26, 2024

While running FSCrawler via docker compose, I get this error:
2024-03-25 17:47:19 21:47:19,717 ERROR [f.p.e.c.f.c.FsCrawlerCli] job [doc_idx] does not exist. Exiting as we are in silent mode or no input available.

Here is my _settings.yaml:

name: "doc_idx"
fs:
  indexed_chars: 100%
  lang_detect: true
  continue_on_error: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
    - url: "https://elasticsearch:9200"
  username: "elastic"
  password: "a123456"
  ssl_verification: false
rest:
  url: "http://fscrawler:8080"

and here is the fscrawler section of the docker-compose file:

fscrawler:
    image: dadoonet/fscrawler:$FSCRAWLER_VERSION
    container_name: fscrawler
    restart: always
    volumes:
      - ../../test-documents/src/main/resources/documents/:/tmp/es:ro
      - ${PWD}/config:/root/.fscrawler
      - ${PWD}/logs:/usr/share/fscrawler/logs
      - ${PWD}/external:/usr/share/fscrawler/external
    depends_on:
      elasticsearch:
         condition: service_healthy
    ports: 
      - ${FSCRAWLER_PORT}:8080
    command: fscrawler doc_idx --restart --rest

Can you please help with this?

@dadoonet
Owner

Where exactly did you put the job settings?

@shouari
Author

shouari commented Mar 26, 2024

Here is the structure I used
(the _settings.yaml is inside the documents_search folder):

.
├── config
│   └── documents_search
│       └── _settings.yaml
├── data
│   └── (files to index)
├── logs
│   └── (empty so far)
├── docker-compose.yml
└── .env

@dadoonet
Owner

You need to change this line:

command: fscrawler doc_idx --restart --rest

To

command: fscrawler documents_search --restart --rest

Also note that you might have to change the name setting:

name: "doc_idx"

To

name: "documents_search"

@shouari
Author

shouari commented Mar 26, 2024

I made the changes above, but I still get the same error:

2024-03-26 16:06:59 20:06:59,762 ERROR [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist. Exiting as we are in silent mode or no input available.

This is the debug:

2024-03-26 16:28:12 20:28:12,416 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
2024-03-26 16:28:12 20:28:12,416 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
2024-03-26 16:28:12 20:28:12,417 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Cleaning existing status for job [documents_search]...
2024-03-26 16:28:12 20:28:12,419 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [documents_search]...
2024-03-26 16:28:12 20:28:12,421 DEBUG [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist.
2024-03-26 16:28:12 20:28:12,421 ERROR [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist. Exiting as we are in silent mode or no input available.

@dadoonet
Owner

Could you share the full logs and switch to trace mode?

@shouari
Author

shouari commented Mar 26, 2024

How do I start trace mode?
Here are all the logs I could find:

2024-03-26 16:47:22 20:47:22,025 WARN  [f.p.e.c.f.c.FsCrawlerCli] --debug option has been deprecated. Use FS_JAVA_OPTS="-DLOG_LEVEL=debug" instead.
2024-03-26 16:47:22 20:47:22,118 INFO  [f.console] ,----------------------------------------------------------------------------------------------------.
2024-03-26 16:47:22 |       ,---,.  .--.--.     ,----..                                     ,--,           2.10-SNAPSHOT |
2024-03-26 16:47:22 |     ,'  .' | /  /    '.  /   /   \                                  ,--.'|                         |
2024-03-26 16:47:22 |   ,---.'   ||  :  /`. / |   :     :  __  ,-.                   .---.|  | :               __  ,-.   |
2024-03-26 16:47:22 |   |   |   .';  |  |--`  .   |  ;. /,' ,'/ /|                  /. ./|:  : '             ,' ,'/ /|   |
2024-03-26 16:47:22 |   :   :  :  |  :  ;_    .   ; /--` '  | |' | ,--.--.       .-'-. ' ||  ' |      ,---.  '  | |' |   |
2024-03-26 16:47:22 |   :   |  |-, \  \    `. ;   | ;    |  |   ,'/       \     /___/ \: |'  | |     /     \ |  |   ,'   |
2024-03-26 16:47:22 |   |   :  ;/|  `----.   \|   : |    '  :  / .--.  .-. | .-'.. '   ' .|  | :    /    /  |'  :  /     |
2024-03-26 16:47:22 |   |   |   .'  __ \  \  |.   | '___ |  | '   \__\/: . ./___/ \:     ''  : |__ .    ' / ||  | '      |
2024-03-26 16:47:22 |   '   :  '   /  /`--'  /'   ; : .'|;  : |   ," .--.; |.   \  ' .\   |  | '.'|'   ;   /|;  : |      |
2024-03-26 16:47:22 |   |   |  |  '--'.     / '   | '/  :|  , ;  /  /  ,.  | \   \   ' \ |;  :    ;'   |  / ||  , ;      |
2024-03-26 16:47:22 |   |   :  \    `--'---'  |   :    /  ---'  ;  :   .'   \ \   \  |--" |  ,   / |   :    | ---'       |
2024-03-26 16:47:22 |   |   | ,'               \   \ .'         |  ,     .-./  \   \ |     ---`-'   \   \  /             |
2024-03-26 16:47:22 |   `----'                  `---`            `--`---'       '---"                `----'              |
2024-03-26 16:47:22 +----------------------------------------------------------------------------------------------------+
2024-03-26 16:47:22 |                                        You know, for Files!                                        |
2024-03-26 16:47:22 |                                     Made from France with Love                                     |
2024-03-26 16:47:22 |                           Source: https://github.com/dadoonet/fscrawler/                           |
2024-03-26 16:47:22 |                          Documentation: https://fscrawler.readthedocs.io/                          |
2024-03-26 16:47:22 `----------------------------------------------------------------------------------------------------'
2024-03-26 16:47:22 
2024-03-26 16:47:22 20:47:22,142 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [94mb/1.9gb=4.81%], RAM [165.1mb/7.6gb=2.11%], Swap [1.7gb/2gb=87.61%].
2024-03-26 16:47:22 20:47:22,144 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
2024-03-26 16:47:22 20:47:22,157 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
2024-03-26 16:47:22 20:47:22,157 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Cleaning existing status for job [documents_search]...
2024-03-26 16:47:22 20:47:22,159 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [documents_search]...
2024-03-26 16:47:22 20:47:22,160 DEBUG [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist.
2024-03-26 16:47:22 20:47:22,160 ERROR [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist. Exiting as we are in silent mode or no input available.

@dadoonet
Owner

Maybe try this:

command: fscrawler documents_search --trace --restart --rest
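(Per the deprecation warning in the log above, the --trace flag is deprecated; the log level can instead be raised through FS_JAVA_OPTS. A minimal compose-level sketch, assuming the image passes that variable through to the JVM:)

fscrawler:
    environment:
      # raise the log level without the deprecated CLI flag
      - FS_JAVA_OPTS=-DLOG_LEVEL=trace
    command: fscrawler documents_search --restart --rest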

@shouari
Author

shouari commented Mar 27, 2024

Here is the output for the trace command:

2024-03-27 08:56:31 12:56:31,820 WARN  [f.p.e.c.f.c.FsCrawlerCli] --trace option has been deprecated. Use FS_JAVA_OPTS="-DLOG_LEVEL=trace" instead.
2024-03-27 08:56:31 12:56:31,852 INFO  [f.console] ,----------------------------------------------------------------------------------------------------.
2024-03-27 08:56:31 |       ,---,.  .--.--.     ,----..                                     ,--,           2.10-SNAPSHOT |
2024-03-27 08:56:31 |     ,'  .' | /  /    '.  /   /   \                                  ,--.'|                         |
2024-03-27 08:56:31 |   ,---.'   ||  :  /`. / |   :     :  __  ,-.                   .---.|  | :               __  ,-.   |
2024-03-27 08:56:31 |   |   |   .';  |  |--`  .   |  ;. /,' ,'/ /|                  /. ./|:  : '             ,' ,'/ /|   |
2024-03-27 08:56:31 |   :   :  :  |  :  ;_    .   ; /--` '  | |' | ,--.--.       .-'-. ' ||  ' |      ,---.  '  | |' |   |
2024-03-27 08:56:31 |   :   |  |-, \  \    `. ;   | ;    |  |   ,'/       \     /___/ \: |'  | |     /     \ |  |   ,'   |
2024-03-27 08:56:31 |   |   :  ;/|  `----.   \|   : |    '  :  / .--.  .-. | .-'.. '   ' .|  | :    /    /  |'  :  /     |
2024-03-27 08:56:31 |   |   |   .'  __ \  \  |.   | '___ |  | '   \__\/: . ./___/ \:     ''  : |__ .    ' / ||  | '      |
2024-03-27 08:56:31 |   '   :  '   /  /`--'  /'   ; : .'|;  : |   ," .--.; |.   \  ' .\   |  | '.'|'   ;   /|;  : |      |
2024-03-27 08:56:31 |   |   |  |  '--'.     / '   | '/  :|  , ;  /  /  ,.  | \   \   ' \ |;  :    ;'   |  / ||  , ;      |
2024-03-27 08:56:31 |   |   :  \    `--'---'  |   :    /  ---'  ;  :   .'   \ \   \  |--" |  ,   / |   :    | ---'       |
2024-03-27 08:56:31 |   |   | ,'               \   \ .'         |  ,     .-./  \   \ |     ---`-'   \   \  /             |
2024-03-27 08:56:31 |   `----'                  `---`            `--`---'       '---"                `----'              |
2024-03-27 08:56:31 +----------------------------------------------------------------------------------------------------+
2024-03-27 08:56:31 |                                        You know, for Files!                                        |
2024-03-27 08:56:31 |                                     Made from France with Love                                     |
2024-03-27 08:56:31 |                           Source: https://github.com/dadoonet/fscrawler/                           |
2024-03-27 08:56:31 |                          Documentation: https://fscrawler.readthedocs.io/                          |
2024-03-27 08:56:31 `----------------------------------------------------------------------------------------------------'
2024-03-27 08:56:31 
2024-03-27 08:56:31 12:56:31,868 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [93.8mb/1.9gb=4.8%], RAM [874.1mb/7.6gb=11.18%], Swap [1.9gb/2gb=99.46%].
2024-03-27 08:56:31 12:56:31,870 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
2024-03-27 08:56:31 12:56:31,871 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
2024-03-27 08:56:31 12:56:31,871 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Cleaning existing status for job [documents_search]...
2024-03-27 08:56:31 12:56:31,872 TRACE [f.p.e.c.f.f.MetaFileHandler] Removing file _status.json from /root/.fscrawler/documents_search if exists
2024-03-27 08:56:31 12:56:31,872 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [documents_search]...
2024-03-27 08:56:31 12:56:31,873 TRACE [f.p.e.c.f.f.MetaFileHandler] Reading file _settings.yaml from /root/.fscrawler/documents_search
2024-03-27 08:56:31 12:56:31,874 TRACE [f.p.e.c.f.f.MetaFileHandler] Reading file _settings.json from /root/.fscrawler/documents_search
2024-03-27 08:56:31 12:56:31,874 DEBUG [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist.
2024-03-27 08:56:31 12:56:31,874 ERROR [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist. Exiting as we are in silent mode or no input available.

@dadoonet
Owner

Could you share again:

  • The ./config/documents_search/_settings.yaml file
  • The ./docker-compose.yml file

Thanks

@shouari
Author

shouari commented Mar 28, 2024

Sure, here they are:

_settings.yaml

---
name: "documents_search"
fs:
  url: "C:\Users\shouari\Documents\Documentation_es\data"
  indexed_chars: 100%
  lang_detect: true
  continue_on_error: true
  ocr:
    language: "eng+fra"
    enabled: true
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
    - url: "https://elasticsearch:9200"
  username: "elastic"
  password: "a123456"
  ssl_verification: false
rest:
  url: "http://fscrawler:8080"

docker-compose.yaml

'version: "2.2"
services:
  
  setup:
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
    user: "0"
    command: >
      bash -c '
        if [ x${ELASTIC_PASSWORD} == x ]; then
          echo "Set the ELASTIC_PASSWORD environment variable in the .env file";
          exit 1;
        elif [ x${KIBANA_PASSWORD} == x ]; then
          echo "Set the KIBANA_PASSWORD environment variable in the .env file";
          exit 1;
        fi;
        if [ ! -f certs/ca.zip ]; then
          echo "Creating CA";
          bin/elasticsearch-certutil ca --silent --pem -out config/certs/ca.zip;
          unzip config/certs/ca.zip -d config/certs;
        fi;
        if [ ! -f certs/certs.zip ]; then
          echo "Creating certs";
          echo -ne \
          "instances:\n"\
          "  - name: elasticsearch\n"\
          "    dns:\n"\
          "      - elasticsearch\n"\
          "      - localhost\n"\
          "    ip:\n"\
          "      - 127.0.0.1\n"\
          > config/certs/instances.yml;
          bin/elasticsearch-certutil cert --silent --pem -out config/certs/certs.zip --in config/certs/instances.yml --ca-cert config/certs/ca/ca.crt --ca-key config/certs/ca/ca.key;
          unzip config/certs/certs.zip -d config/certs;
        fi;
        echo "Setting file permissions"
        chown -R root:root config/certs;
        find . -type d -exec chmod 750 \{\} \;;
        find . -type f -exec chmod 640 \{\} \;;
        echo "Waiting for Elasticsearch availability";
        until curl -s --cacert config/certs/ca/ca.crt https://elasticsearch:9200 | grep -q "missing authentication credentials"; do sleep 30; done;
        echo "Setting kibana_system password";
        until curl -s -X POST --cacert config/certs/ca/ca.crt -u elastic:${ELASTIC_PASSWORD} -H "Content-Type: application/json" https://elasticsearch:9200/_security/user/kibana_system/_password -d "{\"password\":\"${KIBANA_PASSWORD}\"}" | grep -q "^{}"; do sleep 10; done;
        echo "All done!";
      '
    healthcheck:
      test: ["CMD-SHELL", "[ -f config/certs/elasticsearch/elasticsearch.crt ]"]
      interval: 1s
      timeout: 5s
      retries: 120

  elasticsearch:
    depends_on:
      setup:
        condition: service_healthy
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
      - esdata:/usr/share/elasticsearch/data
    ports:
      - ${ES_PORT}:9200
    environment:
      - node.name=elasticsearch
      - cluster.name=${CLUSTER_NAME}
      - cluster.initial_master_nodes=elasticsearch
      - ELASTIC_PASSWORD=${ELASTIC_PASSWORD}
      - bootstrap.memory_lock=true
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=true
      - xpack.security.http.ssl.key=certs/elasticsearch/elasticsearch.key
      - xpack.security.http.ssl.certificate=certs/elasticsearch/elasticsearch.crt
      - xpack.security.http.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.http.ssl.verification_mode=certificate
      - xpack.security.transport.ssl.enabled=true
      - xpack.security.transport.ssl.key=certs/elasticsearch/elasticsearch.key
      - xpack.security.transport.ssl.certificate=certs/elasticsearch/elasticsearch.crt
      - xpack.security.transport.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.transport.ssl.verification_mode=certificate
      - xpack.license.self_generated.type=${LICENSE}
    mem_limit: ${MEM_LIMIT}
    ulimits:
      memlock:
        soft: -1
        hard: -1
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s --cacert config/certs/ca/ca.crt https://localhost:9200 | grep -q 'missing authentication credentials'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120

  kibana:
    depends_on:
      elasticsearch:
        condition: service_healthy
    image: docker.elastic.co/kibana/kibana:${STACK_VERSION}
    volumes:
      - certs:/usr/share/kibana/config/certs
      - kibanadata:/usr/share/kibana/data
    ports:
      - ${KIBANA_PORT}:5601
    environment:
      - SERVERNAME=kibana
      - ELASTICSEARCH_HOSTS=https://elasticsearch:9200
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=${KIBANA_PASSWORD}
      - ELASTICSEARCH_SSL_CERTIFICATEAUTHORITIES=config/certs/ca/ca.crt
      - ENTERPRISESEARCH_HOST=http://localhost:${ENTERPRISE_SEARCH_PORT}
    mem_limit: ${MEM_LIMIT}
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s -I http://localhost:5601 | grep -q 'HTTP/1.1 302 Found'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120


  # FSCrawler
  fscrawler:
    image: dadoonet/fscrawler:$FSCRAWLER_VERSION
    container_name: fscrawler
    restart: always
    volumes:
      - ../../test-documents/src/main/resources/documents/:/tmp/es:ro
      - ${PWD}/config:/root/.fscrawler
      - ${PWD}/logs:/usr/share/fscrawler/logs
      - ${PWD}/external:/usr/share/fscrawler/external
    depends_on:
      elasticsearch:
         condition: service_healthy
    ports: 
      - ${FSCRAWLER_PORT}:8080
    command: fscrawler documents_search --trace --restart --rest
volumes:
  certs:
    driver: local
  # enterprisesearchdata:
  #   driver: local
  esdata:
    driver: local
  kibanadata:
    driver: local

@dadoonet
Owner

dadoonet commented Mar 28, 2024

Try this:

---
name: "documents_search"
fs:
  url: "/tmp/es"
  indexed_chars: 100%
  lang_detect: true
  continue_on_error: true
  ocr:
    language: "eng+fra"
    enabled: true
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
    - url: "https://elasticsearch:9200"
  username: "elastic"
  password: "a123456"
  ssl_verification: false
rest:
  url: "http://fscrawler:8080"

And change this volume to mount your document folder instead:

- ../../test-documents/src/main/resources/documents/:/tmp/es:ro
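For example, based on the Windows path in the _settings.yaml above, the mount might look like this (a sketch; with Docker Desktop on Windows, forward slashes are the safer form for the host path):

- C:/Users/shouari/Documents/Documentation_es/data:/tmp/es:ro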

If it does not work, please inspect your container and check that /root/.fscrawler contains the documents_search dir, which in turn contains the _settings.yaml file.
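A quick way to check that from the host, assuming the container is up (standard Docker CLI, nothing FSCrawler-specific):

docker compose exec fscrawler ls -la /root/.fscrawler/documents_search
docker compose exec fscrawler cat /root/.fscrawler/documents_search/_settings.yaml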

@shouari
Author

shouari commented Apr 1, 2024

I still face the same issue.

I'll have a look at /root/.fscrawler as suggested and check for the documents_search folder.

If it does not exist, what might be the cause?
