fscrawler ignores exclusion folder for subdirectories #1974

Open
TonySoderbergRMT opened this issue Nov 20, 2024 · 0 comments
Labels
check_for_bug Needs to be reproduced

Describe the bug

I have a directory structure in which only files in folders named "publicerat" should be indexed, so I want to exclude the other folders (arbets, original, historik, attachments). These folders appear in multiple locations, including subfolders.
Despite the excludes in the job settings below, everything inside /arbets/, /original/, /historik/ and /attachments/ is getting indexed.

Job Settings

---
name: "rmt_view_doc"
fs:
  url: "G:\\dokument"
  update_rate: "15m"
  includes:
  - "*.docx"
  - "*.xlsx"
  - "*.pptx"
  - "*.pdf"
  excludes:
  - "*/historik/*"
  - "*/attachments/*"
  - "*/arbets/*"
  - "*/original/*"
  json_support: false
  filename_as_id: true
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: false
  lang_detect: false
  continue_on_error: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: true
  username: elastic
  password: xxx

Logs

18:27:43,059 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from G:\dokument
18:27:43,105 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 33 local files found
18:27:43,106 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(G:\dokument, G:\dokument\arbets) = \arbets
18:27:43,108 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [true], filename = [\arbets], includes = [[*.docx, *.xlsx, *.pptx, *.pdf]], excludes = [[*/historik/*, */attachments/*, */arbets/*, */original/*]]
18:27:43,110 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [\arbets], excludes = [[*/historik/*, */attachments/*, */arbets/*, */original/*]]
18:27:43,111 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] checking exclusion for filename = [\arbets], matches = [[*/historik/*, */attachments/*, */arbets/*, */original/*]]
18:27:43,113 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [\arbets], includes = [[*.docx, *.xlsx, *.pptx, *.pdf]]
18:27:43,114 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] checking inclusion for filename = [\arbets], matches = [[*.docx, *.xlsx, *.pptx, *.pdf]]
18:27:43,119 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [\arbets], excludes = [[*/historik/*, */attachments/*, */arbets/*, */original/*]]
18:27:43,120 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] checking exclusion for filename = [\arbets], matches = [[*/historik/*, */attachments/*, */arbets/*, */original/*]]
18:27:43,123 DEBUG [f.p.e.c.f.FsParserAbstract] [\arbets] can be indexed: [true]
18:27:43,126 DEBUG [f.p.e.c.f.FsParserAbstract]   - folder: arbets
18:27:43,129 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [G:\dokument\arbets] content
18:27:43,130 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from G:\dokument\arbets
18:27:44,147 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 1512 local files found
18:27:44,149 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(G:\dokument, G:\dokument\arbets\0001.ppt) = \arbets\0001.ppt
18:27:44,150 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [\arbets\0001.ppt], includes = [[*.docx, *.xlsx, *.pptx, *.pdf]], excludes = [[*/historik/*, */attachments/*, */arbets/*, */original/*]]
18:27:44,151 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [\arbets\0001.ppt], excludes = [[*/historik/*, */attachments/*, */arbets/*, */original/*]]
18:27:44,151 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] checking exclusion for filename = [\arbets\0001.ppt], matches = [[*/historik/*, */attachments/*, */arbets/*, */original/*]]
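
One thing that stands out in the log above: the virtual paths are computed with backslashes (\arbets, \arbets\0001.ppt), while the exclude patterns use forward slashes. Whether that is the actual cause is not confirmed. The sketch below only illustrates how such a mismatch would behave, assuming the matcher turns each `*` into "any characters" and requires the whole path to match. The helper names `glob_to_regex` and `is_excluded` are hypothetical and are not FSCrawler's API.

```python
import re

# Exclude patterns copied from the job settings in this issue.
EXCLUDES = ["*/historik/*", "*/attachments/*", "*/arbets/*", "*/original/*"]

def glob_to_regex(pattern):
    # Assumed translation (not taken from FSCrawler's source):
    # '*' -> any run of characters, '?' -> one character, everything else literal.
    return "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )

def is_excluded(virtual_path, excludes):
    # A path is excluded when any pattern matches the whole path, ignoring case.
    return any(
        re.fullmatch(glob_to_regex(p), virtual_path, re.IGNORECASE)
        for p in excludes
    )

# Paths as they appear in the log (backslash separators).
print(is_excluded("\\arbets", EXCLUDES))            # False
print(is_excluded("\\arbets\\0001.ppt", EXCLUDES))  # False

# The same file with forward slashes would match "*/arbets/*".
print(is_excluded("/arbets/0001.ppt", EXCLUDES))    # True

# The bare folder has no trailing separator, so "*/arbets/*" does not match it.
print(is_excluded("/arbets", EXCLUDES))             # False
```

Under these assumptions, none of the patterns can match a path that contains only backslashes, which would be consistent with the "can be indexed: [true]" line for \arbets.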

Expected behavior

fscrawler should not index folders that match an exclusion pattern, or any files inside them, at any depth below the root.
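
To make the expected outcome concrete, here is a minimal sketch of the selection rule described in this issue, written independently of FSCrawler. It assumes a file should be indexed only when its extension is in the include list and none of its parent folders is one of the excluded names. The function `should_index` and the sample paths (`projekt`, `plan.docx`) are hypothetical and exist only for illustration.

```python
from pathlib import PureWindowsPath

# Folder names and extensions taken from the job settings in this issue.
EXCLUDED_FOLDERS = {"historik", "attachments", "arbets", "original"}
INCLUDED_SUFFIXES = {".docx", ".xlsx", ".pptx", ".pdf"}

def should_index(virtual_path):
    path = PureWindowsPath(virtual_path)
    # Names of every folder above the file, lower-cased for comparison.
    folders = {part.lower() for part in path.parent.parts}
    has_included_suffix = path.suffix.lower() in INCLUDED_SUFFIXES
    in_excluded_folder = bool(folders & EXCLUDED_FOLDERS)
    return has_included_suffix and not in_excluded_folder

# Hypothetical paths, relative to the crawl root G:\dokument.
print(should_index("\\projekt\\publicerat\\plan.docx"))  # True
print(should_index("\\projekt\\arbets\\plan.docx"))      # False
print(should_index("\\arbets\\0001.ppt"))                # False
```

Comparing folder names rather than raw path strings keeps the rule independent of the path separator, so it gives the same answer for backslash and forward-slash paths.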

Versions:

  • OS: Windows Server 2022
  • fscrawler: fscrawler-distribution-2.10-20241120.045907-436

TonySoderbergRMT added the check_for_bug (Needs to be reproduced) label on Nov 20, 2024