Mappings and raw meta data #1689
-
Hi,
This is working perfectly, however I require the page number field `xmpTPg:NPages`, which is held within `raw`, so I set `raw_metadata` to true. Although initially this was fine for my test set of PDFs, I've noticed that once I started scaling I end up with a mapping explosion from obscure metadata in the PDF files, because the default mapping is dynamic.

What I've currently done is disable dynamic mapping and add the specific raw field:

```json
"mappings": {
  "dynamic": false,
  "raw": {
    "properties": {
      "xmpTPg:NPages": {
        "type": "integer"
      }
    }
  }
}
```

Although the fields are still all created in Elasticsearch, so to speak, they are not indexed/searchable. I'm still new to Elasticsearch and understand that controlling mapping explosion is a best practice. So my question is: what would be the best practice here to extract just `xmpTPg:NPages` from `raw` without the rest of the mappings being created dynamically?
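To illustrate what I mean, here is a minimal sketch (the index name `test_dynamic` and the field name are just examples I made up): with `"dynamic": false`, unmapped fields stay visible in `_source` but are never indexed, so searching on them should return no hits:

```
PUT test_dynamic
{ "mappings": { "dynamic": false } }

PUT test_dynamic/_doc/1
{ "meta": { "raw": { "SomeObscureKey": "value" } } }

GET test_dynamic/_search
{ "query": { "match": { "meta.raw.SomeObscureKey": "value" } } }
```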
-
I'm afraid that there is no way to do that directly in FSCrawler, which is probably something we should support as an option: something where we enable `raw_metadata` but also provide a list of properties to keep, defaulting to `*`.

The only thing I can imagine for now would be to disable `meta.raw` in the Elasticsearch mapping. You can change the `8/_settings.json` file:

```json
{
  "settings": {
    "number_of_shards": 1,
    "index.mapping.total_fields.limit": 2000,
    "analysis": {
      "analyzer": {
        "fscrawler_path": {
          "tokenizer": "fscrawler_path"
        }
      },
      "tokenizer": {
        "fscrawler_path": {
          "type": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "attachment": {
        "type": "binary",
        "doc_values": false
      },
      "attributes": {
        "properties": {
          "group": {
            "type": "keyword"
          },
          "owner": {
            "type": "keyword"
          }
        }
      },
      "content": {
        "type": "text"
      },
      "file": {
        "properties": {
          "content_type": {
            "type": "keyword"
          },
          "filename": {
            "type": "keyword",
            "store": true
          },
          "extension": {
            "type": "keyword"
          },
          "filesize": {
            "type": "long"
          },
          "indexed_chars": {
            "type": "long"
          },
          "indexing_date": {
            "type": "date",
            "format": "date_optional_time"
          },
          "created": {
            "type": "date",
            "format": "date_optional_time"
          },
          "last_modified": {
            "type": "date",
            "format": "date_optional_time"
          },
          "last_accessed": {
            "type": "date",
            "format": "date_optional_time"
          },
          "checksum": {
            "type": "keyword"
          },
          "url": {
            "type": "keyword",
            "index": false
          }
        }
      },
      "meta": {
        "properties": {
          "raw": {
            "type": "object",
            "enabled": false
          },
          "author": {
            "type": "text"
          },
          "date": {
            "type": "date",
            "format": "date_optional_time"
          },
          "keywords": {
            "type": "text"
          },
          "title": {
            "type": "text"
          },
          "language": {
            "type": "keyword"
          },
          "format": {
            "type": "text"
          },
          "identifier": {
            "type": "text"
          },
          "contributor": {
            "type": "text"
          },
          "coverage": {
            "type": "text"
          },
          "modifier": {
            "type": "text"
          },
          "creator_tool": {
            "type": "keyword"
          },
          "publisher": {
            "type": "text"
          },
          "relation": {
            "type": "text"
          },
          "rights": {
            "type": "text"
          },
          "source": {
            "type": "text"
          },
          "type": {
            "type": "text"
          },
          "description": {
            "type": "text"
          },
          "created": {
            "type": "date",
            "format": "date_optional_time"
          },
          "print_date": {
            "type": "date",
            "format": "date_optional_time"
          },
          "metadata_date": {
            "type": "date",
            "format": "date_optional_time"
          },
          "latitude": {
            "type": "text"
          },
          "longitude": {
            "type": "text"
          },
          "altitude": {
            "type": "text"
          },
          "rating": {
            "type": "byte"
          },
          "comments": {
            "type": "text"
          }
        }
      },
      "path": {
        "properties": {
          "real": {
            "type": "keyword",
            "fields": {
              "tree": {
                "type": "text",
                "analyzer": "fscrawler_path",
                "fielddata": true
              },
              "fulltext": {
                "type": "text"
              }
            }
          },
          "root": {
            "type": "keyword"
          },
          "virtual": {
            "type": "keyword",
            "fields": {
              "tree": {
                "type": "text",
                "analyzer": "fscrawler_path",
                "fielddata": true
              },
              "fulltext": {
                "type": "text"
              }
            }
          }
        }
      }
    }
  }
}
```

This will still generate the raw data, but nothing will be indexed, which should be OK (not tested). Then you can define an ingest pipeline in Elasticsearch:

```json
PUT _ingest/pipeline/set_pagenum
{
  "description": "sets the page number",
  "processors": [
    {
      "set": {
        "field": "meta.pagenum",
        "value": "{{{meta.raw.xmpTPg:NPages}}}",
        "ignore_failure": true
      }
    }
  ]
}
```

And declare this pipeline in your job settings:

```yaml
name: "test"
elasticsearch:
  pipeline: "set_pagenum"
```

See https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#using-ingest-node-pipeline

I did not test this, but I would be interested to know if it works for you. In that case, I think we should turn it into documentation.
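If you want to check the pipeline before wiring it into FSCrawler, you could run it through the simulate API (untested sketch; the sample page count is made up):

```
POST _ingest/pipeline/set_pagenum/_simulate
{
  "docs": [
    {
      "_source": {
        "meta": {
          "raw": { "xmpTPg:NPages": "12" }
        }
      }
    }
  ]
}
```

Note that the `set` processor with a Mustache template produces a string value; if you map `meta.pagenum` as `integer` you might want to follow the `set` with a `convert` processor (`"convert": { "field": "meta.pagenum", "type": "integer", "ignore_failure": true }`).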
-
Thank you! I tried the above, and the ingest pipeline did create `meta.pagenum` as required, but setting the `meta.raw` object to `"enabled": false` didn't work, as all the raw fields are still shown in Elasticsearch. So I guess they are still stored in `_source`, but not mapped and not searchable, similar to my workaround above. However, what I've done now is call another pipeline to remove `meta.raw`, and this seems to have done the trick: now I only have the fields I want, and the raw metadata is no longer stored in `_source`.

`PUT _ingest/pipeline/remove-meta-raw`
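The body of that pipeline wasn't quoted above; a minimal version, assuming the only goal is to drop `meta.raw`, could look like this (sketch, not the poster's exact definition):

```
PUT _ingest/pipeline/remove-meta-raw
{
  "description": "drops raw metadata once the page count has been extracted",
  "processors": [
    {
      "remove": {
        "field": "meta.raw",
        "ignore_missing": true
      }
    }
  ]
}
```

Since FSCrawler's job settings take a single `pipeline` name, the two steps can either be combined into one pipeline or chained with Elasticsearch's `pipeline` processor.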