Special Topic Data Engineering (SECP3843): Alternative Assessment

Name: Qaisara binti Rohzan

Matric No.: A20EC0133

Dataset: 04 - Companies

Question 1 (a)

The answer to the question is divided into the following segments:

Requirements

As an IT consultant, I need to review the critical stages and concerns involved in developing a portal that integrates several hardware and software components. Deploying five servers for this project is possible, with the servers assigned as follows:

  1. Main Web Server: Apache (Used by Django Web Framework)
  2. Database Server 1: MySQL
  3. Database Server 2: MongoDB
  4. Additional Application Server: Power BI Report Server
  5. Backup Server

While deploying five servers is tempting, there are several questions I must answer before settling on the final requirements for this project:

  • Scalability: Will the project's traffic or data volume significantly increase in the future? If the project has the potential for expansion, having numerous servers can provide scalability and properly share the workload.

Answer: No, the project's data volume is expected to remain constant. The project is based on a single dataset that contains 9,500 records of companies listed on Crunchbase, so additional servers would only waste resources. Idle servers still consume CPU, memory, storage, and electricity, which leads to inefficient resource utilisation and unnecessary maintenance and infrastructure costs.

  • Performance: Consider the project's estimated load and performance requirements. Having numerous servers can aid in load distribution, assuring optimal performance and responsiveness. It enables the use of dedicated resources for specific activities like web hosting, database management, and redundancy.

Answer: Database management activity is minimal because the project revolves around a single existing dataset. Moreover, the project does not require an on-premises Power BI Report Server that generates daily reports. The Power BI Report Server and the Backup Server would therefore be underused, resulting in inefficient resource allocation. Those resources could instead be directed to other vital components or used to strengthen the existing infrastructure where it is most needed.

  • Cost and Resources: Examine the project's budget and resources. Additional servers result in greater hardware, software licence, and maintenance costs. Determine whether the benefits of having many servers exceed the costs.

Answer: This is a small-scale project in which all the software can be installed locally on our own devices and is freely available (no paid software licence is required). Additional servers would also increase maintenance and support requirements: they need updates, troubleshooting, and administrative work, which would draw resources and attention away from other project objectives.

  • Complexity and Management: Consider the difficulty of managing and maintaining several servers. More servers necessitate more effort for configuration, monitoring, and troubleshooting. Determine whether the project team has the competence and resources to efficiently manage several servers.

Answer: The project team lacks the experience and resources to manage several servers efficiently. The project's aim, integrating the Django web framework, the JSON dataset, and the MySQL and MongoDB databases, is straightforward. Adding unnecessary servers that are not actively used would only increase the infrastructure's complexity, which can raise maintenance costs and introduce additional points of failure.

To mitigate these effects, it is best to re-evaluate the server architecture and adjust it to the portal's actual usage and requirements. After careful consideration, a three-server approach appears to be a viable option for this project, with each server serving a specific purpose. This architecture distributes tasks sensibly and keeps the infrastructure manageable while still supporting integration and efficient data storage and retrieval.

No Server Purpose Description
1 Main Web Server: Apache (Used by Django Web Framework)
    The main web server, which is powered by Apache, manages incoming HTTP requests and serves the Django web application. Apache is a popular web server that offers reliable performance and works well with the Django framework. The main web server handles web traffic and delivers dynamic web pages to users. It serves as the point of contact for user interactions, passing requests to the appropriate components for processing and response generation.
2 Database Server 1: MySQL
    This server is dedicated to hosting the MySQL database and handles the storage and retrieval of structured data. MySQL, a dependable relational database management system, works well with Django and allows project-related data to be stored and managed effectively. Its proven track record and substantial community support help to keep the stored data consistent and reliable. This server gives the project the benefits of a relational database for the application features that depend on structured data.
3 Database Server 2: MongoDB
    The MongoDB database is hosted on the second dedicated database server. MongoDB, a versatile NoSQL database, excels at managing unstructured or semi-structured data such as the JSON dataset provided. It suits applications that need scalability and flexibility because it can scale horizontally and accommodate large amounts of data. Integrating MongoDB with Django allows the JSON data to be stored and retrieved within the project's portal. MongoDB's query capabilities and document-oriented approach make it practical to manage varied and evolving data structures.

Integrating Django with JSON dataset

This segment of Question 1 (a) is divided into several parts:

Prerequisites

To carry out this segment of the question, it is crucial to do the following:

  1. Install Python
  2. Install Visual Studio Code

Setting Up A Django Project

Django Installation

  1. Create a Django project folder. For this project I created a folder named AA on my Desktop, where it is easily accessible.

image


  2. Set up a virtual environment. Run in your current working directory, this command creates a new virtual environment called environment.
python -m venv environment

  3. Activate the virtual environment. Once the environment has been created, activate it (the command below is for Windows).
environment\Scripts\activate

  4. Install Django.
pip install django

image


  5. Install the necessary packages. These packages allow the Django app to work with MySQL and MongoDB: mysqlclient (the MySQL driver), PyMongo (the MongoDB driver), and Djongo, which maps Python objects to MongoDB documents so that they can be queried through the Django ORM.
pip install django mysqlclient pymongo
pip install djongo

Below is the output of the commands:

image image



Create A Django Project

You can now create a project after setting up, activating your virtual environment and installing Django. To start a new Django project, open a new terminal in Visual Studio Code and run the following command. The project is named Companies in the code below, but it can be changed to any name you like.

python -m django startproject Companies

Navigate to the project directory by entering the command below:

cd Companies



Create A Django Application

The startapp command generates a default folder structure for a Django app. This tutorial uses CompaniesApp as the name for the app:

python manage.py startapp CompaniesApp

image
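
One step that is easy to miss at this point: Django only discovers an app's models for migrations once the app is listed in INSTALLED_APPS in settings.py. A minimal sketch, assuming the default app list that startproject generates and the app name CompaniesApp used above:

```python
# settings.py (fragment): register the new app so that makemigrations
# and migrate can discover its models.
INSTALLED_APPS = [
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django.contrib.messages",
    "django.contrib.staticfiles",
    "CompaniesApp",  # the app created with startapp above
]
```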



Configure Database Connection

Set up the MySQL and MongoDB connections by editing the DATABASES setting in the Django project's settings.py file as shown below.

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'db_companies',
        'USER': 'root',
        'PASSWORD': '',
        'HOST': 'localhost',
        'PORT': '3306',
    },
    'mongodb': {
        'ENGINE': 'djongo',
        'NAME': 'AA',
        'ENFORCE_SCHEMA': False,
        'CLIENT': {
            'host': 'localhost',
            'port': 27017,
            'username': 'qaisara',
            'password': '8301',
            'authSource': 'admin',
            'authMechanism': 'SCRAM-SHA-1',
        }
    }
}
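
The settings above keep the database credentials directly in settings.py. One possible alternative is to read them from environment variables so that they stay out of version control. The sketch below is only an illustration: the variable names MYSQL_USER, MYSQL_PASSWORD, MONGO_USER and MONGO_PASSWORD and the helper build_databases are hypothetical, while the remaining values mirror the configuration shown above.

```python
import os


def build_databases(env=os.environ):
    """Return a DATABASES dict with credentials taken from `env`.

    The environment variable names used here are hypothetical; use
    whatever names your own deployment defines.
    """
    return {
        "default": {
            "ENGINE": "django.db.backends.mysql",
            "NAME": "db_companies",
            "USER": env.get("MYSQL_USER", "root"),
            "PASSWORD": env.get("MYSQL_PASSWORD", ""),
            "HOST": "localhost",
            "PORT": "3306",
        },
        "mongodb": {
            "ENGINE": "djongo",
            "NAME": "AA",
            "ENFORCE_SCHEMA": False,
            "CLIENT": {
                "host": "localhost",
                "port": 27017,
                "username": env.get("MONGO_USER", ""),
                "password": env.get("MONGO_PASSWORD", ""),
                "authSource": "admin",
                "authMechanism": "SCRAM-SHA-1",
            },
        },
    }


# In settings.py this would replace the literal dictionary.
DATABASES = build_databases()
```

Passing the mapping in as a parameter keeps the function easy to check with a plain dictionary instead of the real environment.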



Create MySQL and MongoDB Models

  1. Defining models for MySQL. In the models.py file, define the models that represent the data and correspond to the structure of the JSON dataset.
from django.db import models

class Company(models.Model):
    _id = models.CharField(max_length=255, primary_key=True)
    acquisition = models.JSONField(null=True)
    acquisitions = models.JSONField(null=True)
    alias_list = models.JSONField(null=True)
    blog_feed_url = models.URLField(null=True)
    blog_url = models.URLField(null=True)
    category_code = models.CharField(max_length=255, null=True)
    competitions = models.JSONField(null=True)
    created_at = models.DateTimeField(null=True)
    crunchbase_url = models.URLField(null=True)
    deadpooled_day = models.IntegerField(null=True)
    deadpooled_month = models.IntegerField(null=True)
    deadpooled_url = models.URLField(null=True)
    deadpooled_year = models.IntegerField(null=True)
    description = models.TextField(null=True)
    email_address = models.EmailField(null=True)
    external_links = models.JSONField(null=True)
    founded_day = models.IntegerField(null=True)
    founded_month = models.IntegerField(null=True)
    founded_year = models.IntegerField(null=True)
    funding_rounds = models.JSONField(null=True)
    homepage_url = models.URLField(null=True)
    image = models.JSONField(null=True)
    investments = models.JSONField(null=True)
    ipo = models.JSONField(null=True)
    milestones = models.JSONField(null=True)
    name = models.CharField(max_length=255)
    number_of_employees = models.IntegerField(null=True)
    offices = models.JSONField(null=True)
    overview = models.TextField(null=True)
    partners = models.JSONField(null=True)
    permalink = models.CharField(max_length=255)
    phone_number = models.CharField(max_length=20, null=True)
    products = models.JSONField(null=True)
    providerships = models.JSONField(null=True)
    relationships = models.JSONField(null=True)
    screenshots = models.JSONField(null=True)
    tag_list = models.JSONField(null=True)
    total_money_raised = models.CharField(max_length=255, null=True)
    twitter_username = models.CharField(max_length=255, null=True)
    updated_at = models.DateTimeField(null=True)
    video_embeds = models.JSONField(null=True)
    class Meta:
        db_table = 'tb_companies'
        app_label = 'CompaniesApp'

class Users(models.Model):
    _id = models.CharField(max_length=100)
    name = models.CharField(max_length=100)
    email = models.CharField(max_length=100)
    password = models.CharField(max_length=100)
    class Meta:
        db_table = 'tb_users'
        app_label = 'CompaniesApp'
  2. Defining models for MongoDB. Create a new Python file named models_mongodb.py in the CompaniesApp folder. Define the models that represent the data and correspond to the structure of the JSON dataset.
from djongo import models

class Company(models.Model):
    _id = models.CharField(max_length=255, primary_key=True)
    acquisition = models.JSONField(null=True)
    acquisitions = models.JSONField(null=True)
    alias_list = models.JSONField(null=True)
    blog_feed_url = models.URLField(null=True)
    blog_url = models.URLField(null=True)
    category_code = models.CharField(max_length=255, null=True)
    competitions = models.JSONField(null=True)
    created_at = models.DateTimeField(null=True)
    crunchbase_url = models.URLField(null=True)
    deadpooled_day = models.IntegerField(null=True)
    deadpooled_month = models.IntegerField(null=True)
    deadpooled_url = models.URLField(null=True)
    deadpooled_year = models.IntegerField(null=True)
    description = models.TextField(null=True)
    email_address = models.EmailField(null=True)
    external_links = models.JSONField(null=True)
    founded_day = models.IntegerField(null=True)
    founded_month = models.IntegerField(null=True)
    founded_year = models.IntegerField(null=True)
    funding_rounds = models.JSONField(null=True)
    homepage_url = models.URLField(null=True)
    image = models.JSONField(null=True)
    investments = models.JSONField(null=True)
    ipo = models.JSONField(null=True)
    milestones = models.JSONField(null=True)
    name = models.CharField(max_length=255)
    number_of_employees = models.IntegerField(null=True)
    offices = models.JSONField(null=True)
    overview = models.TextField(null=True)
    partners = models.JSONField(null=True)
    permalink = models.CharField(max_length=255)
    phone_number = models.CharField(max_length=20, null=True)
    products = models.JSONField(null=True)
    providerships = models.JSONField(null=True)
    relationships = models.JSONField(null=True)
    screenshots = models.JSONField(null=True)
    tag_list = models.JSONField(null=True)
    total_money_raised = models.CharField(max_length=255, null=True)
    twitter_username = models.CharField(max_length=255, null=True)
    updated_at = models.DateTimeField(null=True)
    video_embeds = models.JSONField(null=True)

    class Meta:
        abstract = True
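
With two database aliases configured, Django sends every query to the default alias (MySQL) unless told otherwise, either per query with .using('mongodb') or through a database router. The sketch below shows one way such a router could look, assuming MongoDB-backed models live in models_mongodb.py as in this tutorial. The class name CompaniesRouter and the file CompaniesApp/routers.py are hypothetical. Note that a router only matters for concrete models: a model declared with abstract = True has no table or collection of its own, so a concrete subclass would be needed before it can be queried.

```python
class CompaniesRouter:
    """Send models defined in models_mongodb.py to MongoDB, the rest to MySQL."""

    mongo_module_suffix = "models_mongodb"  # module name used in this tutorial

    def _alias_for(self, model):
        # model.__module__ looks like "CompaniesApp.models_mongodb"
        if model.__module__.endswith(self.mongo_module_suffix):
            return "mongodb"
        return "default"

    def db_for_read(self, model, **hints):
        return self._alias_for(model)

    def db_for_write(self, model, **hints):
        return self._alias_for(model)

    def allow_relation(self, obj1, obj2, **hints):
        # Only allow relations between objects stored in the same database.
        return self._alias_for(type(obj1)) == self._alias_for(type(obj2))

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Assumption for this sketch: schema migrations run against MySQL only.
        return db == "default"


# settings.py would then point at the router by dotted path
# (hypothetical location: CompaniesApp/routers.py):
# DATABASE_ROUTERS = ["CompaniesApp.routers.CompaniesRouter"]
```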



Perform Database Migration

Database migration takes two steps. makemigrations generates migration files that describe the schema changes for the models in CompaniesApp, and migrate then applies those migrations, together with the ones shipped with Django's built-in apps, to the database. After running migrate, all of the CompaniesApp tables exist in the MySQL database. Create an empty MySQL database named db_companies beforehand so that the migration can succeed.

 python manage.py makemigrations
 python manage.py migrate

Check your MySQL database (XAMPP > MySQL > Start > Admin) to confirm that the migration succeeded. Below is the output of the commands:

image image



Load Data from JSON

  1. In Django, create a script or management command that reads the Companies JSON dataset and saves it to the database. This task uses the Django Object-Relational Mapper (ORM), which is the usual way to perform CRUD operations on database objects.
  2. In MongoDB, create a new database and load the JSON file. Ensure that your connection with MongoDB is perfectly established.
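
The reading step for both targets can be sketched as follows. This is only an illustration under stated assumptions: it assumes the dataset is a JSON Lines file (one company per line) that may use MongoDB extended-JSON wrappers such as {"$oid": ...} and {"$date": ...}, and the helper names flatten_extended_json, read_companies and batched, as well as the file name companies.json, are hypothetical.

```python
import json
from itertools import islice


def flatten_extended_json(value):
    """Unwrap MongoDB extended-JSON values.

    {"$oid": "52cd..."} becomes "52cd..." and {"$date": x} becomes x.
    Other dicts and lists are processed recursively.
    """
    if isinstance(value, dict):
        if len(value) == 1:
            key, inner = next(iter(value.items()))
            if key in ("$oid", "$date"):
                return inner
        return {k: flatten_extended_json(v) for k, v in value.items()}
    if isinstance(value, list):
        return [flatten_extended_json(v) for v in value]
    return value


def read_companies(path):
    """Yield one cleaned record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield flatten_extended_json(json.loads(line))


def batched(records, size=500):
    """Group an iterable of records into lists of at most `size` items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical usage in a Django shell or management command (MySQL):
#   from CompaniesApp.models import Company
#   for batch in batched(read_companies("companies.json")):
#       Company.objects.bulk_create([Company(**row) for row in batch])
#
# Hypothetical usage with a PyMongo collection object (MongoDB):
#   for batch in batched(read_companies("companies.json")):
#       collection.insert_many(batch)
```

Writing in batches keeps memory use bounded. Date fields such as created_at may still need converting to datetime objects before they are saved through the ORM, depending on how they are encoded in the file.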

Question 1 (b)

The system architecture will consist of the following components:

image

No Layer Description
1 Client Layer
    This is the layer that interacts directly with the user. It contains the portal's dashboard and user interface, which support easy navigation, and it handles all user input. The client layer in this example is a web dashboard built with HTML, CSS, and JavaScript.
2 Application Layer
    The layer that houses the system's business logic. It connects the client layer to the database layer. The application layer in this scenario is built with Django, a Python web framework. Django includes a number of capabilities that make it simple to create web applications, including:
  • A Model-View-Template (MVT) architecture that separates the data model, views, and templates.
  • A strong ORM (Object-Relational Mapping) that simplifies database interaction.
3 Database Layer
    The layer that stores the system's data. As stated in the question, the system works with the Companies dataset. The database layer in this example is made up of two databases:
  • MongoDB is well suited to storing large volumes of unstructured or semi-structured data, in this case the JSON dataset downloaded earlier. Djongo is the connector that allows Django to communicate with MongoDB.
  • MySQL is used for storing structured data, such as the table that holds the different types of users. The Django ORM, together with the mysqlclient driver, allows Django to communicate with MySQL.

Contribution 🛠️

Please create an Issue for any improvements, suggestions or errors in the content.

You can also contact me on LinkedIn with any other queries or feedback.
