mattdark/json-mysql-importer

A Python app that imports data from a dataset (JSON) into a MySQL DB

Table of Contents
  1. Importing from JSON to MySQL
  2. Getting Started
  3. Roadmap
  4. Contributing
  5. License
  6. Contact

Importing from JSON to MySQL

To read and analyze a dataset like the one provided by the MovieNet project, you can use Pandas. This Python data-analysis library supports reading the following file types:

  • CSV Files
  • JSON Files
  • HTML Files
  • Excel Files
  • SQL Files
  • Pickle Files
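Each of these formats has a dedicated pandas reader (read_csv, read_json, read_html, read_excel, read_sql, read_pickle). As a minimal sketch, here is read_json on an in-memory document; the fields are illustrative, not taken from the dataset:

```python
import io

import pandas as pd

# A tiny JSON array of objects, in the same spirit as a MovieNet entry.
data = io.StringIO('[{"imdb_id": "tt0000001", "title": "Carmencita (1894)", "country": "USA"}]')

# read_json turns each object into a row of the resulting DataFrame.
df = pd.read_json(data)

print(df.shape)  # (1, 3)
```

The other readers follow the same pattern: one call, one DataFrame.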

The MovieNet dataset was created for movie understanding. It contains meta information for movies from IMDB and TMDB, including:

  • Title
  • Genre
  • Country
  • Director
  • Writer
  • Cast

Every entry in this dataset is a JSON file with the following content:

{
  "imdb_id": "tt0000001",
  "tmdb_id": null,
  "douban_id": null,
  "title": "Carmencita (1894)",
  "genres": [
    "Documentary",
    "Short"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "1 min",
      "description": ""
    }
  ],
  "imdb_rating": null,
  "director": [
    {
      "id": "nm0005690",
      "name": "William K.L. Dickson"
    }
  ],
  "writer": null,
  "cast": [
    {
      "id": "nm1588970",
      "name": "Carmencita",
      "character": "Herself"
    }
  ],
  "overview": null,
  "storyline": "Presumably, the first woman ever to appear in a Kinetoscope film and possibly the first woman to take part in a motion picture in the United States, the Spaniard dancer, Carmencita, performs her appealing high-kick dance in front of the camera of William K.L. Dickson and William Heise, for Thomas Edison. In this segment of her New York music-hall act, she spins and twirls, exhibiting an admirable talent, a fashionable dress, and a really charming smile.",
  "plot": null,
  "synopsis": null
}

Using Pandas, a DataFrame is created from the more than 375 thousand JSON files that hold the movies' meta information, so that the data can be read and analyzed.

import json
import os
from glob import glob

import pandas as pd
from tqdm import tqdm

def create_dataframe(filepath):
    # Collect every .json file in the dataset directory
    json_pattern = os.path.join(filepath, '*.json')
    file_list = glob(json_pattern)

    json_list = []
    for file in tqdm(file_list, desc='Creating DataFrame'):
        with open(file) as f:
            json_list.append(json.load(f))

    df = pd.DataFrame(json_list)
    return df

After reading and analyzing the dataset, the information is imported into a MySQL DB using SQLAlchemy.
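A minimal sketch of that import step, using DataFrame.to_sql with a SQLAlchemy engine. For a self-contained example an in-memory SQLite engine stands in for MySQL; the table name and rows below are illustrative:

```python
import pandas as pd
from sqlalchemy import create_engine

# For MySQL the URL would look like
# "mysql+pymysql://user:password@host/movienet"; an in-memory SQLite
# engine is used here so the sketch runs without a server.
engine = create_engine("sqlite://")

df = pd.DataFrame([
    {"imdb_id": "tt0000001", "title": "Carmencita (1894)", "country": "USA"},
])

# Write the DataFrame to a table; SQLAlchemy handles the SQL dialect.
df.to_sql("movies", con=engine, if_exists="replace", index=False)

rows = pd.read_sql("SELECT imdb_id, title FROM movies", con=engine)
print(len(rows))  # 1
```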

The schema definition is specified in the schema.py module. SQLAlchemy uses this information to create the tables in the movienet database.
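The project's actual schema lives in schema.py; as a general illustration, a SQLAlchemy declarative definition looks like the sketch below. The table and column names here are assumptions for the example, not the repository's real schema:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Movie(Base):
    __tablename__ = "movies"
    id = Column(Integer, primary_key=True)
    imdb_id = Column(String(10), unique=True)
    title = Column(String(255))
    country = Column(String(64))

# SQLAlchemy creates the tables from these class definitions.
engine = create_engine("sqlite://")  # stand-in for the MySQL engine
Base.metadata.create_all(engine)
```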

Execution time of the read_load_data.py script will depend on your CPU.

(back to top)


Prerequisites

MySQL

While this script could work with any MySQL variant, tests were run on Percona Server for MySQL. If you already have MySQL running on your system, skip this step.

To install Percona Server for MySQL from the repositories on Debian and Ubuntu, follow the instructions below. For other operating systems, check the documentation.

Before installing Percona Server for MySQL, make sure curl and gnupg2 are installed.

$ sudo apt install gnupg2 curl

Then install percona-release, a tool that lets users configure which Percona software repositories are enabled or disabled.

Get the repository package:

$ wget https://repo.percona.com/apt/percona-release_latest.$(lsb_release -sc)_all.deb

Install the downloaded package with dpkg:

$ sudo dpkg -i percona-release_latest.$(lsb_release -sc)_all.deb

Enable the ps80 repository:

$ sudo percona-release setup ps80

Install percona-server-server, the package that provides the Percona Server for MySQL:

$ sudo apt install percona-server-server

Installing MySQL Shell is also recommended:

$ sudo apt install percona-mysql-shell

After installing, make sure MySQL is running.

During installation, a root password is assigned. You can log into MySQL with this user or create a new one. The movienet database must also be created.

Log into MySQL using MySQL Shell:

$ mysqlsh root@localhost

Replace root with your user if necessary, and replace localhost with the IP address or URL for your MySQL server instance if needed.

Change to SQL mode and create the movienet database:

\sql
create database movienet;
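If you would rather not run the importer as root, you can create a dedicated user and grant it access to the new database. The user name and password below are placeholders:

```sql
CREATE USER 'movienet_user'@'localhost' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON movienet.* TO 'movienet_user'@'localhost';
FLUSH PRIVILEGES;
```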

(back to top)

Conda

Before running the Python script that imports the data into a MySQL DB, you must install Conda through Anaconda or Miniconda. The Conda documentation has more information on which one is better for you.

To install Anaconda, go to anaconda.com/products/distribution and download the installer for your operating system.

If you're on Linux, run the following command after getting the installer:

$ bash Anaconda3-2022.05-Linux-x86_64.sh

Replace the filename according to the version you're installing.

It will prompt you to confirm some configuration details:

  • Accepting the license
  • Confirming the installation folder
  • Initializing Anaconda (done by the installer)

(back to top)

Getting Started

Clone this repository

git clone https://github.com/mattdark/json-mysql-importer.git

Configure your Python environment


Conda

When you clone this repository to your local development environment, you will find an environment.yml file that contains information about the Python version and dependencies required by this project. This file is used by Conda to configure the virtual environment and install dependencies.
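The actual file ships with the repository. In general, a Conda environment file of this shape looks like the following; the dependency list is an assumption based on the libraries mentioned in this README, not the repository's exact file:

```yaml
name: percona
dependencies:
  - python=3.10
  - pandas
  - sqlalchemy
  - tqdm
```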

After installing Anaconda, change to the json-mysql-importer directory:

$ cd json-mysql-importer

Create the environment:

$ conda env create -f environment.yml

This command will create a virtual environment for your project and install Python 3.10 plus the dependencies specified in the environment.yml file.

Once the environment is created, you can activate it by running:

$ conda activate percona

(back to top)

Download Dataset

A copy of the MovieNet dataset is required to run this script. Sign in to OpenDataLab using your Google or GitHub account, go to the download section, and get the .zip file inside the Meta directory. Every entry in this dataset is a JSON file that contains information about a movie. Unzip the file and copy the JSON files to a datasets directory inside json-mysql-importer.

(back to top)

Running the script

Before running the script, don't forget to set the user, password, and host in the base.py module; these are required to connect to your MySQL DB.
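Those three settings typically feed into a SQLAlchemy connection URL. A hedged sketch of that idea follows; the variable names and the pymysql driver are assumptions, not necessarily what base.py actually uses:

```python
# Placeholder credentials -- replace with your own before running the importer.
user = "root"
password = "your_password"
host = "localhost"

# base.py presumably builds a SQLAlchemy connection URL along these lines;
# create_engine(url) would then open the connection to the movienet database.
url = f"mysql+pymysql://{user}:{password}@{host}/movienet"
print(url)
```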

After setting up authentication details in the base.py module, run the Python script:

$ python read_load_data.py

This script will do the following:

  • Generate database schema from schema.py module.
  • Insert Country_Dict and Genre_Dict values into the countries and genres tables of the movienet database.
  • Create a DataFrame from the JSON files in the MovieNet dataset.
  • Insert values from the DataFrame into the movienet database.

A progress bar is displayed using tqdm.

(back to top)

Roadmap

  • Set up a Python Env with Anaconda
    • Create Env Config File (environment.yml)
      • Python Runtime
      • Dependencies
  • Create requirements.txt
  • Read JSON files
  • Import data to a dataframe
  • Create data catalogs
  • DB schema definition
  • Set up database connection
  • Import data to MySQL
  • Write a tutorial / blog post

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

Mario García - @mariogmd - hi@mariog.xyz

Project Link: https://github.com/mattdark/json-mysql-importer

(back to top)
