Mappings and raw meta data #1689
-
Hi,
This is working perfectly, however I require the page number field `xmpTPg:NPages`, which is held within `raw`, so I set `raw_metadata` to true. Although initially this was fine for my test set of PDFs, I've noticed that once I started scaling I end up with a mapping explosion from obscure metadata in the PDF files, because the default mapping is dynamic.

What I've currently done is disable dynamic mapping and add the specific raw field:

```json
"mappings": {
  "dynamic": false,
  "raw": {
    "properties": {
      "xmpTPg:NPages": {
        "type": "integer"
      }
    }
  }
}
```

Although the fields are still all created in Elasticsearch, so to speak, they are not indexed/searchable. I'm still new to Elasticsearch and understand that controlling mapping explosion is a best practice. So my question is: what would be the best practice here to extract just `xmpTPg:NPages` from `raw` without the rest of the mappings being created dynamically?
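To illustrate what I mean, here is a minimal sketch (the index name `test_dynamic` and the field name are just examples I made up): with `"dynamic": false`, unmapped fields stay visible in `_source` but are never indexed, so searching on them should return no hits:

```
PUT test_dynamic
{ "mappings": { "dynamic": false } }

PUT test_dynamic/_doc/1
{ "meta": { "raw": { "SomeObscureKey": "value" } } }

GET test_dynamic/_search
{ "query": { "match": { "meta.raw.SomeObscureKey": "value" } } }
```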
-
I'm afraid that there is no way to do that directly in FSCrawler, which is probably something we should support as an option: something where we enable `raw_metadata` but also provide a list of properties to keep, defaulting to `*`.

The only thing I can imagine for now would be to disable `meta.raw` in the Elasticsearch mapping. You can change the `8/_settings.json` file:

```json
{
  "settings": {
    "number_of_shards": 1,
    "index.mapping.total_fields.limit": 2000,
    "analysis": {
      "analyzer": {
        "fscrawler_path": {
          "tokenizer": "fscrawler_path"
        }
      },
      "tokenizer": {
        "fscrawler_path": {
          "type": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "attachment": {
        "type": "binary",
        "doc_values": false
      },
      "attributes": {
        "properties": {
          "group": {
            "type": "keyword"
          },
          "owner": {
            "type": "keyword"
          }
        }
      },
      "content": {
        "type": "text"
      },
      "file": {
        "properties": {
          "content_type": {
            "type": "keyword"
          },
          "filename": {
            "type": "keyword",
            "store": true
          },
          "extension": {
            "type": "keyword"
          },
          "filesize": {
            "type": "long"
          },
          "indexed_chars": {
            "type": "long"
          },
          "indexing_date": {
            "type": "date",
            "format": "date_optional_time"
          },
          "created": {
            "type": "date",
            "format": "date_optional_time"
          },
          "last_modified": {
            "type": "date",
            "format": "date_optional_time"
          },
          "last_accessed": {
            "type": "date",
            "format": "date_optional_time"
          },
          "checksum": {
            "type": "keyword"
          },
          "url": {
            "type": "keyword",
            "index": false
          }
        }
      },
      "meta": {
        "properties": {
          "raw": {
            "type": "object",
            "enabled": false
          },
          "author": {
            "type": "text"
          },
          "date": {
            "type": "date",
            "format": "date_optional_time"
          },
          "keywords": {
            "type": "text"
          },
          "title": {
            "type": "text"
          },
          "language": {
            "type": "keyword"
          },
          "format": {
            "type": "text"
          },
          "identifier": {
            "type": "text"
          },
          "contributor": {
            "type": "text"
          },
          "coverage": {
            "type": "text"
          },
          "modifier": {
            "type": "text"
          },
          "creator_tool": {
            "type": "keyword"
          },
          "publisher": {
            "type": "text"
          },
          "relation": {
            "type": "text"
          },
          "rights": {
            "type": "text"
          },
          "source": {
            "type": "text"
          },
          "type": {
            "type": "text"
          },
          "description": {
            "type": "text"
          },
          "created": {
            "type": "date",
            "format": "date_optional_time"
          },
          "print_date": {
            "type": "date",
            "format": "date_optional_time"
          },
          "metadata_date": {
            "type": "date",
            "format": "date_optional_time"
          },
          "latitude": {
            "type": "text"
          },
          "longitude": {
            "type": "text"
          },
          "altitude": {
            "type": "text"
          },
          "rating": {
            "type": "byte"
          },
          "comments": {
            "type": "text"
          }
        }
      },
      "path": {
        "properties": {
          "real": {
            "type": "keyword",
            "fields": {
              "tree": {
                "type": "text",
                "analyzer": "fscrawler_path",
                "fielddata": true
              },
              "fulltext": {
                "type": "text"
              }
            }
          },
          "root": {
            "type": "keyword"
          },
          "virtual": {
            "type": "keyword",
            "fields": {
              "tree": {
                "type": "text",
                "analyzer": "fscrawler_path",
                "fielddata": true
              },
              "fulltext": {
                "type": "text"
              }
            }
          }
        }
      }
    }
  }
}
```

This will still generate the raw data, but nothing will be indexed, which should be OK (not tested). Then you can define an ingest pipeline in Elasticsearch:

```json
PUT _ingest/pipeline/set_pagenum
{
  "description": "sets the page number",
  "processors": [
    {
      "set": {
        "field": "meta.pagenum",
        "value": "{{{meta.raw.xmpTPg:NPages}}}",
        "ignore_failure": true
      }
    }
  ]
}
```

And declare this pipeline in your job settings:

```yaml
name: "test"
elasticsearch:
  pipeline: "set_pagenum"
```

See https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#using-ingest-node-pipeline

I did not test this, but I would be interested to know if it works for you. In that case, I think we should turn it into documentation.
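If you want to check the pipeline before wiring it into FSCrawler, you could run it through the simulate API (untested sketch; the sample page count is made up):

```
POST _ingest/pipeline/set_pagenum/_simulate
{
  "docs": [
    {
      "_source": {
        "meta": {
          "raw": { "xmpTPg:NPages": "12" }
        }
      }
    }
  ]
}
```

Note that the `set` processor with a Mustache template produces a string value; if you map `meta.pagenum` as `integer` you might want to follow the `set` with a `convert` processor (`"convert": { "field": "meta.pagenum", "type": "integer", "ignore_failure": true }`).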
-
Thank you! I tried the above, and the ingest pipeline did create `meta.pagenum` as required, but setting the `meta.raw` object to `"enabled": false` didn't work, as all the raw fields are still shown in Elasticsearch. So I guess they are still stored in `_source`, but not mapped and not searchable, similar to my workaround above. However, what I've done now is call another pipeline to remove `meta.raw`, and this seems to have done the trick: now I only have the fields I want, and the raw metadata is no longer stored in `_source`.

`PUT _ingest/pipeline/remove-meta-raw`
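The body of that pipeline wasn't quoted above; a minimal version, assuming the only goal is to drop `meta.raw`, could look like this (sketch, not the poster's exact definition):

```
PUT _ingest/pipeline/remove-meta-raw
{
  "description": "drops raw metadata once the page count has been extracted",
  "processors": [
    {
      "remove": {
        "field": "meta.raw",
        "ignore_missing": true
      }
    }
  ]
}
```

Since FSCrawler's job settings take a single `pipeline` name, the two steps can either be combined into one pipeline or chained with Elasticsearch's `pipeline` processor.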