Use Case: Mass Ingestion of Electronic Documents
This directory contains the source code used to build the generator application that feeds the pipeline for this demo.
This format contains the original content of the article. Each article is broken down into an article entry (contains the text) and image entries, all stored in the Ingest collection in Cosmos DB.
{
"id" : "GUID",
"asset_hash" : "hash of the item",
"artifact_type" : "article|image",
"properties" :
{
Dependent on artifact_type
}
}
Property | Type | Required | Article | Image |
---|---|---|---|---|
original_uri | String | Y | X | X |
retrieval_datetime | DateTime | Y | X | X |
post_date | DateTime | N | X | |
body | String | N | X | |
title | String | N | X | |
author | String | N | X | |
hero_image | String | N | X | |
child_images | Array(object) | N | X | |
internal_uri | String | N | X | X |
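As an illustration of the record layout above, here is a minimal Python sketch that assembles an Ingest document and checks the two required properties. The function name `build_ingest_record`, the use of SHA-256 for `asset_hash`, and the ISO 8601 timestamp are assumptions made for the example; the README does not specify the hash algorithm or the DateTime serialization.

```python
import hashlib
import uuid
from datetime import datetime, timezone

# Properties marked "Y" in the Required column of the table above.
REQUIRED_PROPERTIES = ("original_uri", "retrieval_datetime")
ARTIFACT_TYPES = ("article", "image")


def build_ingest_record(artifact_type, content, properties):
    """Return an Ingest-collection document for one article or image (hypothetical helper)."""
    if artifact_type not in ARTIFACT_TYPES:
        raise ValueError(f"unknown artifact_type: {artifact_type!r}")
    missing = [name for name in REQUIRED_PROPERTIES if name not in properties]
    if missing:
        raise ValueError(f"missing required properties: {missing}")
    return {
        "id": str(uuid.uuid4()),
        # Assumption: SHA-256 over the raw bytes; the hash algorithm is not documented.
        "asset_hash": hashlib.sha256(content).hexdigest(),
        "artifact_type": artifact_type,
        "properties": dict(properties),
    }


record = build_ingest_record(
    "article",
    b"<html>...</html>",
    {
        "original_uri": "https://dummy/story.html",
        # Assumption: timestamps are stored as ISO 8601 strings.
        "retrieval_datetime": datetime.now(timezone.utc).isoformat(),
        "title": "Example title",
    },
)
```

Optional properties such as `title` or `body` are passed through unchanged, so the same helper covers both article and image entries.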
The media object is used for each entry in child_images. The field mediaId is the Document ID of the media document in the Articles table.
{
"mediaId": "9d30724f5b8043e49552f4b8eb02f010",
"origUri": "https://dummy/thirdgrade.jpg",
"internalUri": "https://dangtestrepo.blob.core.windows.net/scraped/thirdgrade.jpg"
}
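To show how a media object relates to an image entry, the sketch below derives one from an image document. The helper name `to_media_object` is hypothetical, and the mapping from the snake_case image properties to the camelCase media fields is an assumption based on the field names shown in this README.

```python
def to_media_object(image_record):
    """Build a child_images entry from an image document (hypothetical helper)."""
    props = image_record["properties"]
    media = {
        "mediaId": image_record["id"],
        "origUri": props["original_uri"],
    }
    # internal_uri is optional in the property table, so only copy it when present.
    if "internal_uri" in props:
        media["internalUri"] = props["internal_uri"]
    return media


image_record = {
    "id": "9d30724f5b8043e49552f4b8eb02f010",
    "artifact_type": "image",
    "properties": {
        "original_uri": "https://dummy/thirdgrade.jpg",
        "internal_uri": "https://dangtestrepo.blob.core.windows.net/scraped/thirdgrade.jpg",
    },
}
child_images = [to_media_object(image_record)]
```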
This format contains the results of analyzing a portion of the ingested article. There is one record for the main article and one for each image. These records are kept in the Processed collection in Cosmos DB.
{
"id" : "GUID",
"artifact_type" : "article|image",
"parent" : "parent id",
"properties" : {
.... dependent on artifact type ......
},
"tags" :[interesting/need alerting/dealers choice!]
}
Property | Type | Required | Article | Image |
---|---|---|---|---|
processed_datetime | DateTime | Y | X | X |
processed_time* | Int | Y | X | X |
title** | object | N | X | |
body** | object | N | X | |
vision*** | object | N | X | |
face**** | object | N | X | |
tags | Array(string) | N | X | X |
* Total processing time (ms)
** Text Field Analytics objects
*** Vision Analytics object
**** Face Analytics object
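The required columns in the table above can be checked with a small validator. This is a sketch only: `validate_processed_record` is a hypothetical name, and it checks just the fields the README marks as required, without assuming anything about `parent` or the optional analytics objects.

```python
def validate_processed_record(record):
    """Return a list of problems found in a Processed document (empty if none)."""
    problems = []
    if record.get("artifact_type") not in ("article", "image"):
        problems.append("artifact_type must be 'article' or 'image'")
    props = record.get("properties")
    if not isinstance(props, dict):
        problems.append("properties must be an object")
        return problems
    if "processed_datetime" not in props:
        problems.append("properties.processed_datetime is required")
    # processed_time is the total processing time in milliseconds.
    if not isinstance(props.get("processed_time"), int):
        problems.append("properties.processed_time must be an integer (ms)")
    return problems


ok = validate_processed_record({
    "id": "GUID",
    "artifact_type": "image",
    "parent": "parent id",
    "properties": {"processed_datetime": "2019-01-01T00:00:00Z", "processed_time": 412},
})
bad = validate_processed_record({"artifact_type": "video", "properties": {}})
```

Returning a list of problems rather than raising on the first one makes it easier to log every issue with a record in a single pass.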
"body|title": {
"type": "Body|Title",
"orig_lang_code": "language detected",
"lang_code": "requested language",
"value": "Translated text content",
"key_phrases": [
"Array of strings, key phrases found"
],
"sentiment": 0.5,
"entities": [
{
"OriginalText": "(array of items found) British premier",
"Name": "Prime Minister of the United Kingdom",
"BingId": "2570ebea-8c42-048a-3350-57c9e4169167",
"WikipediaUrl": "https://en.wikipedia.org/wiki/Prime_Minister_of_the_United...."
}
....
]
}
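A consumer of the Processed collection might reduce a text analytics object to a few summary fields. The sketch below is illustrative: `summarize_text_field` is a hypothetical helper, and treating a difference between `orig_lang_code` and `lang_code` as "the text was translated" is an assumption, not documented behavior.

```python
def summarize_text_field(field):
    """Summarize a body or title analytics object (hypothetical helper)."""
    return {
        "type": field["type"],
        # Assumption: differing codes mean "value" holds a translation.
        "was_translated": field["orig_lang_code"] != field["lang_code"],
        "sentiment": field.get("sentiment"),
        "key_phrase_count": len(field.get("key_phrases", [])),
        "entity_names": [entity["Name"] for entity in field.get("entities", [])],
    }


title_field = {
    "type": "Title",
    "orig_lang_code": "fr",
    "lang_code": "en",
    "value": "Translated text content",
    "key_phrases": ["British premier"],
    "sentiment": 0.5,
    "entities": [
        {
            "OriginalText": "British premier",
            "Name": "Prime Minister of the United Kingdom",
        }
    ],
}
summary = summarize_text_field(title_field)
```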
"vision": {
"object_categories": ["array of strings of object categories found"],
"objects": ["array of strings of objects"],
"text": ["array of strings of text found in images"]
}
The face object is a list of People with gender and age.
"face": {
"people": [
{
"gender" : "gender of person found",
"age" : "age of person found"
}
]
}
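The vision and face objects can be combined into a simple summary of what was found in an image. The following sketch assumes both objects sit under the record's `properties`, as the property table suggests; `summarize_image_analysis` is a hypothetical name.

```python
def summarize_image_analysis(properties):
    """Return (unique vision terms, number of people) for one processed record."""
    vision = properties.get("vision", {})
    terms = []
    for key in ("object_categories", "objects", "text"):
        terms.extend(vision.get(key, []))
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique_terms = list(dict.fromkeys(terms))
    people = properties.get("face", {}).get("people", [])
    return unique_terms, len(people)


terms, people_count = summarize_image_analysis({
    "vision": {
        "object_categories": ["people", "indoor"],
        "objects": ["person", "desk"],
        "text": ["indoor"],
    },
    "face": {"people": [{"gender": "female", "age": "8"}]},
})
```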