The Lecture Notebook and associated Homework use a LinkedIn dataset that was originally available on Kaggle but has since been withdrawn due to privacy concerns. We have therefore created a synthetic dataset of 10k records for you to use in its place. The files are ‘test_data_10000.zip’, ‘linkedin.nodes.zip’, and ‘linkedin.edges.zip’, located in OpenDS4All/assets/data.
There are two approaches to using this data:
- Instructors make it available to students: The instructor can download the zip files, unzip them (unzipping is optional, since it can also be done in the notebooks), put ‘linkedin_data_10000.json’, ‘linkedin.nodes’, and ‘linkedin_edges’ (or the zip files) on a server, and make the URL available to students. Students can then use the Python package ‘urllib’ to fetch the data.
- Provide zip files to students: Make the zip files available to students, who can then download them to their local computer. If students run the notebook in Google Colab, the instructor can share the files with them through Google Drive; students then add the shared folder to their own Google Drive and mount it in their Colab instance. From there the steps are the same as when running the notebook locally: the data sits in a folder (a mounted Google Drive looks like a local folder to the Colab machine), and the notebook reads it using the folder path.
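For the first approach, fetching a zip file with ‘urllib’ and reading a file out of it without unzipping to disk might look like the sketch below. The helper name `fetch_zip_member` and the URL in the usage comment are hypothetical; the assumption that ‘linkedin_data_10000.json’ is the file inside ‘test_data_10000.zip’ should be checked against the actual archive.

```python
import io
import urllib.request
import zipfile


def fetch_zip_member(url, member):
    """Download the zip archive at `url` and return the raw bytes of `member`.

    `fetch_zip_member` is a hypothetical helper, not part of the notebooks.
    """
    # Read the whole archive into memory; the dataset is small (10k records).
    with urllib.request.urlopen(url) as response:
        archive_bytes = response.read()
    # ZipFile needs a seekable file object, so wrap the bytes in BytesIO.
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.read(member)


# Example usage (hypothetical URL; replace with the one your instructor gives):
# raw = fetch_zip_member(
#     "https://example.edu/data/test_data_10000.zip",
#     "linkedin_data_10000.json",  # assumed name of the file inside the zip
# )
```

The same function also accepts a `file://` URL, so it can be tried against a locally downloaded zip before the server is set up.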
There are two ways of setting up the MongoDB instance used in the Lecture Notebook and Homework.
- The instructor (or each student) sets up a remote MongoDB instance by going to mongodb.com. Click on "Get started", sign up, agree to terms of service, and create a cluster. Use this location as 'Y' in the associated Lecture Notebook and Homework. This option is required if students are working in Colab.
- Students create a local instance by installing MongoDB (see instructions here and here for more details). The default server address is: mongodb://localhost:27017 . This option cannot be used if students are working in Colab.
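The choice between the two options can be sketched in code as follows. This is a minimal illustration, not the notebooks' own code: the helper names `mongo_uri` and `connect` are hypothetical, the remote URI in the usage comment is a placeholder, and `connect` assumes the third-party `pymongo` package is installed.

```python
DEFAULT_LOCAL_URI = "mongodb://localhost:27017"  # default local server address


def mongo_uri(in_colab, remote_uri=None):
    """Pick the MongoDB address to use (hypothetical helper).

    In Colab a remote instance is required; locally, fall back to the
    default address unless a remote URI is supplied.
    """
    if in_colab and not remote_uri:
        raise ValueError("Colab cannot reach a local MongoDB; pass a remote URI.")
    return remote_uri or DEFAULT_LOCAL_URI


def connect(uri):
    """Open a client and verify the server responds (assumes pymongo is installed)."""
    from pymongo import MongoClient  # third-party: pip install pymongo

    # Fail after 5 seconds instead of the default 30 if the server is unreachable.
    client = MongoClient(uri, serverSelectionTimeoutMS=5000)
    client.admin.command("ping")  # raises if no server can be selected
    return client


# Example usage:
# client = connect(mongo_uri(in_colab=False))
# client = connect(mongo_uri(in_colab=True, remote_uri="<your remote connection string>"))
```

Keeping the address selection separate from the connection makes it possible to check the configuration without a running server.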
Instructors who want more depth on these topics can refer to:
- Knowledge graphs: ontologies
- Entity-Relationship (ER) model: the Wikipedia page and relevant portions of the textbook "Database Management Systems" by Ramakrishnan and Gehrke.
- NoSQL: MongoDB Tutorial
- Transactions and concurrency: relevant portions of a basic database textbook, e.g. "Database Management Systems" by Ramakrishnan and Gehrke.