Add additional tags when crawling files #1984

Closed
4islam opened this issue Dec 7, 2024 · 6 comments · Fixed by #2017

4islam commented Dec 7, 2024

Is your feature request related to a problem? Please describe.

I'm always frustrated when I can't categorize or add context to the files being indexed by FSCrawler. The current plugin lacks the ability to add custom tags during the crawling process, which limits the flexibility and searchability of the indexed files.

Describe the solution you'd like

I would like to have the capability to specify additional tags when crawling files with FSCrawler. These tags should be included in the metadata of each indexed file, allowing for better categorization, context, and enhanced search capabilities. This feature should be configurable through the fscrawler.yml file.

Describe alternatives you've considered

I've considered manually adding tags to the files after they are indexed, but this approach is time-consuming and inefficient. Another alternative is to use a different tool that supports tagging, but I prefer to continue using FSCrawler due to its other features and integration with Elasticsearch.


Feature Request: Add Additional Tags When Crawling Files

Summary:
Enhance the FSCrawler Elasticsearch plugin by adding the capability to include additional tags when crawling files. This feature will allow users to specify custom tags that can be associated with the files being indexed, providing more flexibility and improving searchability.

Description:
The current FSCrawler plugin for Elasticsearch efficiently indexes files and extracts metadata. However, it lacks the ability to add custom tags during the crawling process. By introducing a feature that allows users to specify additional tags, we can significantly enhance the plugin's functionality. These tags can be used to categorize files, add context, and improve the overall search experience.

Use Cases:

  1. Categorization: Users can categorize files based on project, department, or any other custom criteria, making it easier to organize and retrieve relevant documents.
  2. Contextual Information: Adding tags that provide context, such as "confidential," "urgent," or "archived," can help users quickly identify the nature of the files. In addition, when crawling folders, a meta.inf file could hold metadata for the files in that folder. For a book folder whose pages are indexed individually, the meta.inf file could describe the book name, author, ISBN, etc., and that information could be added to each indexed page as metadata.
  3. Enhanced Search: Custom tags can be used as search filters, allowing users to perform more precise and targeted searches within the indexed files.

Implementation:

  • Configuration: Introduce a new configuration option in the fscrawler.yml file where users can define custom tags.
  • Tagging Mechanism: Modify the crawling process to include the specified tags in the metadata of each indexed file.
  • Search Integration: Ensure that the custom tags are indexed and searchable within Elasticsearch, allowing users to filter search results based on these tags.

Benefits:

  • Improved file organization and retrieval.
  • Enhanced search capabilities with custom filters.
  • Greater flexibility in managing and categorizing indexed files.

Conclusion:
Adding the ability to specify additional tags when crawling files will greatly enhance the FSCrawler plugin's usability and functionality. This feature will provide users with more control over their indexed data and improve the overall search experience.


Let me know if you need any more help!

4islam added the feature_request label Dec 7, 2024
dadoonet (Owner) commented Dec 7, 2024

It's similar to #884 right? Note that FSCrawler supports tags when using the REST API.

4islam (Author) commented Dec 7, 2024

Yes. The feature is available via REST API but it would be really valuable while crawling files.

dadoonet (Owner) commented

Could you describe an example of fscrawler.yml as you would like to see it?

4islam (Author) commented Feb 10, 2025

Following the example of the REST API (see https://fscrawler.readthedocs.io/en/latest/admin/fs/rest.html#additional-tags), we should be able to specify not only additional global tags, but also the path to a file containing tags for each folder branch.

In my case, I am indexing a library of books under /library. The folder structure looks like this:

library
├── .meta.yml                        #potentially to define tags applicable to all books (good option to allow nesting)
├── book1
│   ├── .meta.yml
│   └── pages
│       ├── page1.txt
│       ├── page2.txt
│       └── page3.txt
├── book2
│   ├── .meta.yml
│   └── pages
│       ├── page1.txt
│       ├── page2.txt
│       └── page3.txt
└── book3
    ├── .meta.yml
    └── pages
        ├── page1.txt
        ├── page2.txt
        └── page3.txt

So fscrawler.yml file could look like this:

name: "my_fscrawler_job"
fs:
  url: "/library"
  update_rate: "15m"
  tags:                                           # Specify global tags
    - project: "ProjectName"
    - department: "DepartmentName"
  meta_file:
    path: "/.meta.yml"                            # Specify the path to the meta.yml file(s)
elasticsearch:
  nodes:
    - url: "http://localhost:9200"
  index: "my_fscrawler_index"
  bulk_size: 100
  flush_interval: "5s"

If a folder contains a .meta.yml file, FSCrawler should read the tag information from it and index it alongside each page in that folder. Each file could look like this:

author: "John Doe"
language: "English"
otherlanguages:
  - Arabic
  - Persian
isbn: "978-3-16-148410-0"
tags:
  - "fiction"
  - "adventure"
  - "bestseller"
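
The nesting idea above (a root-level file whose values apply to every book, refined by each book's own file) could be sketched as follows. This is only an illustration of the request, not FSCrawler's behaviour: `META_FILES` and `tags_for` are hypothetical names, the parsed files are represented as plain dictionaries instead of being read from disk, and putting `project` and `department` in the root folder is an assumption made for the example.

```python
from pathlib import PurePosixPath

# Hypothetical stand-in for parsed .meta.yml files, keyed by folder.
META_FILES = {
    "/library": {"project": "ProjectName", "department": "DepartmentName"},
    "/library/book1": {
        "author": "John Doe",
        "language": "English",
        "isbn": "978-3-16-148410-0",
        "tags": ["fiction", "adventure", "bestseller"],
    },
}


def tags_for(file_path: str, root: str = "/library") -> dict:
    """Merge metadata from the root folder down to the file's own folder."""
    root_path = PurePosixPath(root)
    folder = PurePosixPath(file_path).parent
    # Walk upwards from the file's folder until the root folder is reached.
    chain = [folder]
    while chain[-1] != root_path and chain[-1] != chain[-1].parent:
        chain.append(chain[-1].parent)
    merged: dict = {}
    # Apply the outermost folder first so that deeper folders win on conflicts.
    for directory in reversed(chain):
        merged.update(META_FILES.get(str(directory), {}))
    return merged


page_tags = tags_for("/library/book1/pages/page1.txt")
print(page_tags)
```

With this layout, a page under book1 would carry both the library-wide values and the book's own values, while a page under a folder without its own file would carry only the library-wide ones.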

dadoonet (Owner) commented

I started a first implementation. It's not exactly what you asked for but it's a start at least.

Have a look at #2017 for the current status of the feature.

dadoonet added a commit that referenced this issue Feb 13, 2025
Add support for external tags

The goal of this feature is to allow users to provide additional metadata when crawling files. Whenever a directory is crawled, FSCrawler checks if a file named `.meta.yml` is present in the directory. If it is, the content of this file is used to enrich the document.

## Example

For example, if you have a file named `.meta.yml` in the directory `/path/to/data/dir`:

```yaml
external:
  myTitle: "My document title"
```

Then the document indexed will have a new field named `external.myTitle` with the value `My document title`.
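
Once the field is indexed, it can be used in searches. The sketch below only builds the body of a standard Elasticsearch `match` query on `external.myTitle`; which index to send it to, and how the `external` object is mapped, depend on your job and are not specified here.

```python
import json

# Query DSL body for a match query on the field added by .meta.yml.
search_body = {
    "query": {
        "match": {
            "external.myTitle": "My document title"
        }
    }
}

print(json.dumps(search_body, indent=2))
```

The printed JSON could be sent as the body of a `_search` request against the index the job writes to.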

## Supported Fields

Only supported fields can be added to the document. If you try to add a field which is not supported, it will be ignored.

For example, if the `.meta.yml` file contains:

```yaml
foo: "bar"
external:
  myTitle: "My document title"
```

The document indexed will have a new field named `external.myTitle` with the value `My document title`. The field `foo` will be ignored.

If you really want to add a field named `foo`, you need to add it first as an external tag:

```yaml
external:
  foo: "bar"
  myTitle: "My document title"
```

and then use an ingest pipeline to rename the `external.foo` field to `foo`.
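
As a rough sketch of that last step, the following builds the body of an Elasticsearch ingest pipeline that uses the `rename` processor. The description text is illustrative and the pipeline id is left to you; the job also needs to be configured to send documents through the pipeline (FSCrawler documents an `elasticsearch.pipeline` setting for this, so check the documentation for your version).

```python
import json

# Body for a PUT _ingest/pipeline/<pipeline id> request; the id is your choice.
pipeline_body = {
    "description": "Move external.foo to a top-level foo field",
    "processors": [
        {
            "rename": {
                "field": "external.foo",
                "target_field": "foo",
                # Skip documents that have no external.foo instead of failing.
                "ignore_missing": True,
            }
        }
    ],
}

print(json.dumps(pipeline_body, indent=2))
```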

## Overwriting Fields

The `.meta.yml` file can also overwrite existing fields. For example, if you have the following `.meta.yml` file:

```yaml
content: "HIDDEN"
```

Then the `content` field will be replaced by `HIDDEN` even though something else is extracted.

> **Note:** The `.meta.yml` file is not indexed. It is only used to enrich the document.
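
The rules from the two sections above (unsupported fields are ignored, supported fields are added or overwritten) can be illustrated with a small sketch. It is not FSCrawler's implementation: `SUPPORTED_FIELDS` and `enrich` are hypothetical names, and the set of supported fields is assumed to contain only the two keys that appear in the examples.

```python
# Assumption: only the two top-level keys shown in the examples are supported.
SUPPORTED_FIELDS = {"content", "external"}


def enrich(document: dict, meta: dict) -> dict:
    """Return a copy of document enriched with the supported keys of meta."""
    enriched = dict(document)
    for key, value in meta.items():
        if key not in SUPPORTED_FIELDS:
            continue  # unsupported fields such as "foo" are ignored
        enriched[key] = value  # supported fields are added or overwritten
    return enriched


extracted = {"content": "Text extracted from the file"}
meta = {
    "foo": "bar",
    "content": "HIDDEN",
    "external": {"myTitle": "My document title"},
}
result = enrich(extracted, meta)
print(result)
```

Here `foo` is dropped, `content` is replaced by `HIDDEN`, and `external.myTitle` is added.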

## Tags Settings

Here is a list of Tags settings (under `tags.` prefix):

| Name                  | Default value   | Documentation       |
|-----------------------|-----------------|---------------------|
| `tags.metaFilename`   | `.meta.yml`     | [Meta Filename](#meta-filename) |

### Meta Filename

You can use another filename for the external tags file. For example, if you want to use `meta_tags.json` instead of `.meta.yml`, you can set:

```yaml
tags:
  metaFilename: "meta_tags.json"
```

> **Note:** Only JSON and YAML files are supported.
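
Assuming a JSON meta file mirrors the structure of the YAML examples (which this description does not spell out), the content of a `meta_tags.json` equivalent to the first example could be produced like this:

```python
import json

# Same structure as the first .meta.yml example, expressed as a dictionary.
meta = {"external": {"myTitle": "My document title"}}

meta_json = json.dumps(meta, indent=2)
print(meta_json)
```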

Closes #1984.
4islam (Author) commented Feb 14, 2025

Looks great. Thanks a lot. Appreciated.
