Update DynamoDB table and Lambda functions for Document index #3

Open
xinli-cai opened this issue Feb 13, 2024 · 1 comment
xinli-cai commented Feb 13, 2024

Steps:

  1. Update the DynamoDB table with the following attributes:
  • Construct a DynamoDB table in which each item is a key-value pair: the key is features-properties-id and the value is documents, a JSON array.
    Here's a sample 'documents' JSON structure:
    "documents": [
    {
    "doc_id": "1" (type: string),
    "type": " " (type: string),
    "url": " " (type: string),
    "text": " " (type: string)
    },
    {
    "doc_id": "2" (type: string),
    "type": " " (type: string),
    "url": " " (type: string),
    "text": " " (type: string)
    },
    ....
    {
    "doc_id": "n" (type: string),
    "type": " " (type: string),
    "url": " " (type: string),
    "text": " " (type: string)
    }
    ]
  2. Create Lambda functions to extract information from docx, PDF, txt, and HTML files:
  • Create Lambda functions, triggered on a schedule, one for each document type
  • Normalize and process the text: remove extra whitespace, stopwords, and punctuation; apply stemming; convert to lowercase; etc.
  • Store the extracted information as a new attribute in the DynamoDB table, ensuring the text is stored as a JSON array
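
The normalization step above could look roughly like the sketch below. It is only an illustration using the Python standard library: the stopword list is a tiny hypothetical one, `naive_stem` is a crude suffix stripper standing in for a real stemmer (e.g. Porter), and the function names are made up for this example rather than taken from the repository.

```python
import json
import string

# Hypothetical, deliberately tiny stopword list; a real pipeline would
# likely use a fuller list from an NLP library.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Crude suffix stripping, used here only as a stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

# Map every punctuation character to a space so "a,b" does not fuse into "ab".
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def naive_stem(token: str) -> str:
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and extra whitespace, remove stopwords, stem."""
    cleaned = text.lower().translate(PUNCT_TO_SPACE)
    tokens = cleaned.split()  # split() also collapses runs of whitespace
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


def to_json_array(tokens: list[str]) -> str:
    """Serialize the token list as a JSON array string."""
    return json.dumps(tokens)


if __name__ == "__main__":
    sample = "The  maps are indexed, and the layers LOADING!"
    print(to_json_array(normalize(sample)))
```

Extracting the raw text from docx, PDF, or HTML would happen before `normalize` is called and depends on whichever parsing libraries the Lambda functions end up using.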
sajjadGG commented Feb 20, 2024

Updated the table with the following schema:
features_properties_id:
documents:{
url:
title:
type:
text:
}
The code is in this branch:
https://github.com/Canadian-Geospatial-Platform/document-index/tree/feature/document_index_v2
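
For anyone reading along, here is a rough sketch of what one item with the schema above might look like in DynamoDB's typed attribute-value JSON (`S` for string, `L` for list, `M` for map). It is not taken from the branch: `build_item` and `to_attribute_value` are hypothetical helpers, the sample values are placeholders, and it assumes `features_properties_id` is the partition key and `text` holds a list of string tokens.

```python
import json


def to_attribute_value(value):
    """Convert str, list, and dict values to DynamoDB's typed JSON form."""
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, list):
        return {"L": [to_attribute_value(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_attribute_value(v) for k, v in value.items()}}
    raise TypeError(f"unsupported type: {type(value).__name__}")


def build_item(features_properties_id, url, title, doc_type, text_tokens):
    """Build one table item; field names follow the schema in this comment."""
    return {
        # Assumption: this attribute is the table's partition key.
        "features_properties_id": to_attribute_value(features_properties_id),
        "documents": to_attribute_value(
            {
                "url": url,
                "title": title,
                "type": doc_type,
                # Assumption: text is stored as a list of normalized tokens.
                "text": text_tokens,
            }
        ),
    }


if __name__ == "__main__":
    item = build_item(
        "example-id",  # placeholder value
        "https://example.com/doc.pdf",  # placeholder URL
        "Example title",
        "pdf",
        ["map", "index"],
    )
    print(json.dumps(item, indent=2))
```

A dict in this shape is what the low-level DynamoDB `PutItem` operation expects as its `Item` parameter; the higher-level boto3 `Table` resource accepts plain Python values instead and does the conversion itself.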
