801 Rewrite Webscraper to use SIS (#849)
bnavac authored Dec 8, 2023
1 parent da80d9b commit 0f00a6f
Showing 14 changed files with 898 additions and 7 deletions.
90 changes: 90 additions & 0 deletions rpi_data/modules/READEME.md
@@ -0,0 +1,90 @@
# Package requirements
pandas
bs4
requests
lxml
regex
pyyaml
selenium

# Course Parser
Hopefully this will be the last one.
The relevant files in the folder are csv_to_course.py, course.py, headless_login.py, new_parse.py, and parse_runner.py.
The other files in the folder are legacy code that was used in the old web scraper. For now they will remain here, as some pieces of that code could be useful if future edge cases pop up.

# How to run
Run parse_runner.py with a term as the command-line argument. The term is formatted as termYEAR. If the csv for that term doesn't exist, the script does a full parse; if it does exist, the script immediately starts updating it.
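
The full-parse versus update decision described above could be sketched roughly as follows. This is not the code in parse_runner.py: the function name `choose_mode`, the `<term>.csv` file naming, and the example term `spring2024` are all assumptions made for illustration.

```python
import os


def choose_mode(term: str, data_dir: str = ".") -> str:
    """Return "update" if the term's csv already exists, otherwise "full"."""
    # Assumption: the csv is named after the term, e.g. spring2024.csv.
    csv_path = os.path.join(data_dir, term + ".csv")
    return "update" if os.path.exists(csv_path) else "full"


# Hypothetical invocation this mirrors: python parse_runner.py spring2024
print(choose_mode("spring2024"))
```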

# Common issues with SIS scraping

------------------------------------------------------------------------------------------------------------------------------------------
Sel | CRN | Maj | Cod | Sec | Cmp | Crd | Nme | Dys | Tme | Cap | Act | Rem | WLC | WLA | WLR | XLC | XLA | XLR | Prof | Date | Loc | Attr
------------------------------------------------------------------------------------------------------------------------------------------
SR |99341|CSCI |1100 |01 | T | 4.00| CSI | MR |12-150| 24 | 3 | 21 | 0 | 0 | 0 | 0 | 0 | 0 |Stur |01-04 |TBA |Intro
------------------------------------------------------------------------------------------------------------------------------------------
While some details have been truncated to fit, this is an example of what we expect a course to look like in SIS, and most courses in SIS follow this format.
But SIS is not perfect, and courses often contain mistakes.
The first main one (though this is more a design decision than a mistake) is that many parts of a course may be empty, for example
------------------------------------------------------------------------------------------------------------------------------------------
| | | | | | | | T |10-1150| | | | | | | | | |TBA |01-04 |TBA |Intro
------------------------------------------------------------------------------------------------------------------------------------------
This is the lab block for the CS1 course above. As you may notice, most of the details are missing, so it is impossible to build a course from this information alone.
However, since this block appears directly below the CS1 lecture block in SIS, we parse it immediately after parsing the lecture block.
So we keep a copy of the previously parsed course and use it to fill in the missing information for lab and test blocks.
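
A minimal sketch of that fill-in step, assuming each row has already been turned into a list of cell strings. The helper name `fill_from_previous` and the shortened example rows are made up for illustration and are not taken from the scraper.

```python
def fill_from_previous(row: list[str], previous: list[str]) -> list[str]:
    """Return a copy of row where blank cells are copied from previous."""
    if len(row) != len(previous):
        raise ValueError("rows must have the same number of cells")
    return [cell if cell.strip() else prev for cell, prev in zip(row, previous)]


# Shortened, illustrative rows: CRN, major, code, section, days, time, professor.
lecture = ["99341", "CSCI", "1100", "01", "MR", "12-150", "Stur"]
lab = ["", "", "", "", "T", "10-1150", "TBA"]
print(fill_from_previous(lab, lecture))
# ['99341', 'CSCI', '1100', '01', 'T', '10-1150', 'TBA']
```

Cells that the lab block does provide (days, time, professor) win over the lecture's values; only blank cells are inherited.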

Another common issue is the use of colspan. For example, Biomed 6940 in Spring 2024 looks like this in SIS
------------------------------------------------------------------------------------------------------------------------------------------
SR |90453|BMED |6940 |01 | T | 1-9 | REB | TBA | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |TBA |01-04 |TBA |
------------------------------------------------------------------------------------------------------------------------------------------
However, it is parsed as
------------------------------------------------------------------------------------------------------------------------------------------
SR |90453|BMED |6940 |01 | T | 1-9 | REB | TBA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |TBA |01-04 |TBA |
-----------------------------------------------------------------------------------------------------------------------------------------
because of the use of colspan in the days column. This means that our indices are off when we start formatting and processing the row, which crashes the web scraper. We get around this by inserting a TBA for the value of the colspan that we see.
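
One way to picture the colspan workaround is the sketch below. It uses only the standard library's `html.parser` so it runs without extra packages; the real scraper may do this differently (for example with bs4). The class `RowParser`, the function `parse_row`, and the choice to add one "TBA" per extra covered column are assumptions.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect the text of each <td> in a row, padding cells that use colspan."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._span = 1     # colspan of the <td> that is currently open
        self._text = None  # text collected for the open <td>, None when outside

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._span = int(dict(attrs).get("colspan") or 1)
            self._text = ""

    def handle_data(self, data):
        if self._text is not None:
            self._text += data

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            self.cells.append(self._text.strip())
            # Add one placeholder for every extra column the cell covered.
            self.cells.extend(["TBA"] * (self._span - 1))
            self._text = None


def parse_row(html: str) -> list[str]:
    parser = RowParser()
    parser.feed(html)
    parser.close()
    return parser.cells


print(parse_row('<tr><td>REB</td><td colspan="2">TBA</td><td>0</td></tr>'))
# ['REB', 'TBA', 'TBA', '0']
```

With the padding in place, every later cell keeps the index it would have had in a row without colspan.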

We also split some fields into two different fields, namely the start and end times of a class (e.g. 2:00pm - 3:50pm) and the start and end dates of a class (e.g. 01/08-04/24).
However, courses often have these fields as TBA or blank, e.g. ADMN 1030, which looks like this

------------------------------------------------------------------------------------------------------------------------------------------
SR |93972|ADMN |1030 |01 | T | 0.00 | AXPA | | TBA |1000| 443| 557| 0 | 0 | 0 | 0 | 0 | 0 |Cary |01-04 |TBA |
------------------------------------------------------------------------------------------------------------------------------------------

As you can see, the time field is TBA, which we can't easily divide into two separate fields. At the moment we just add two TBAs in this case. However, if another web scraper is needed or this one breaks, this may be a source of failure.
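
The split with its TBA fallback might look like the sketch below. The helper name `split_range` is hypothetical, and treating a blank field the same as TBA is an assumption.

```python
def split_range(value: str) -> tuple[str, str]:
    """Split "start-end" into (start, end); fall back to ("TBA", "TBA")."""
    parts = [part.strip() for part in value.split("-")]
    if len(parts) == 2 and all(parts):
        return parts[0], parts[1]
    # TBA, blank, or anything that is not exactly two non-empty halves.
    return "TBA", "TBA"


print(split_range("2:00pm - 3:50pm"))  # ('2:00pm', '3:50pm')
print(split_range("01/08-04/24"))      # ('01/08', '04/24')
print(split_range("TBA"))              # ('TBA', 'TBA')
```

Always returning two values keeps the row length stable no matter what SIS puts in the field.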

These are some of the more common offenders, but other issues can pop up, so in general a row should always have exactly 21 things in it before we begin processing. Many issues with the web scraper come from rows that do not have a length of 21.
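
A guard like the following makes that failure mode loud instead of letting it surface later as an index error. The names `EXPECTED_CELLS` and `check_row` are hypothetical; only the number 21 comes from the text above.

```python
EXPECTED_CELLS = 21  # the row length stated above


def check_row(row: list[str], expected: int = EXPECTED_CELLS) -> None:
    """Raise ValueError if the row does not have the expected number of cells."""
    if len(row) != expected:
        raise ValueError(f"expected {expected} cells, got {len(row)}: {row!r}")


check_row([""] * 21)  # passes silently
```

Including the offending row in the error message makes it easier to see which SIS quirk caused the mismatch.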

# Common Issues with catalog scraping

Unfortunately, SIS scraping is relatively simple compared to catalog scraping, which has many issues.
Most of these issues will probably (hopefully) disappear when the catalog API is implemented.
However, I am under the assumption that a speedup is all we can expect from the API.
SIS will not give us everything that we want, in particular the prerequisites and corequisites of a course. To get those we need to scrape the catalog, in particular this link
https://sis.rpi.edu/rss/bwckctlg.p_disp_course_detail?cat_term_in=?&subj_code_in=?&crse_numb_in=?
where you replace the ?'s with a basevalue (the integer representation of a semester, e.g. Spring 2024 -> 202401), the major, and the course code.
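
Filling in that link could be sketched as below. Only the Spring -> 01 suffix is given in the text; the summer (05) and fall (09) suffixes, and the helper names `term_code` and `catalog_url`, are assumptions that should be checked against SIS.

```python
from urllib.parse import urlencode

BASE_URL = "https://sis.rpi.edu/rss/bwckctlg.p_disp_course_detail"

# Only "spring" -> "01" comes from the README; the other two are assumed.
TERM_SUFFIX = {"spring": "01", "summer": "05", "fall": "09"}


def term_code(season: str, year: int) -> str:
    """Build the basevalue, e.g. ("spring", 2024) -> "202401"."""
    return f"{year}{TERM_SUFFIX[season.lower()]}"


def catalog_url(season: str, year: int, major: str, code: str) -> str:
    query = urlencode({
        "cat_term_in": term_code(season, year),
        "subj_code_in": major.upper(),
        "crse_numb_in": code,
    })
    return f"{BASE_URL}?{query}"


print(catalog_url("spring", 2024, "CSCI", "1200"))
# https://sis.rpi.edu/rss/bwckctlg.p_disp_course_detail?cat_term_in=202401&subj_code_in=CSCI&crse_numb_in=1200
```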

Because there is notably less information to parse, there are fewer issues with the catalog at present, though some of the issues are more severe.
![Alt text](image-1.png)

This is as close to the ideal course as one can find: there is a clear list of prerequisites and corequisites, as well as a description. (There is a slight issue where it's listed as "Prerequisites/Corequisites: Prerequisite:" instead of "Prerequisites/Corequisites: Prerequisites:" like other courses, but it's pretty good beyond that.)

However, there are many courses that do not follow this, for example,

![Alt text](image-3.png)

Even though capstone is listed as having prerequisites or corequisites, it only has prerequisites, and that is difficult for a computer to distinguish, namely because the text is missing the "Prerequisites:" or "Corequisites:" label that other courses have. For reasons that will be mentioned below, this is not too big of an issue with prerequisites, as there is a consistent way to get those, but getting corequisites consistently is difficult.

![Alt text](image-2.png)

This is another case of weird prerequisite and corequisite formatting, where it is difficult to tell the two apart.
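
To make the labelling problem concrete, here is a rough regex sketch. It is not the scraper's parser: the function names, the "four capital letters plus four digits" course pattern, and the made-up `ABCD` example text are all assumptions. Codes that appear with no "Prerequisites:" or "Corequisites:" label end up in a separate bucket, because nothing in the text says which kind they are.

```python
import re

HEADER = re.compile(r"Prerequisites/Corequisites:\s*", re.IGNORECASE)
LABEL = re.compile(r"\b(Prerequisites?|Corequisites?):\s*", re.IGNORECASE)
COURSE = re.compile(r"\b([A-Z]{4})\s+(\d{4})\b")  # assumed code shape, e.g. CSCI 1200


def course_codes(text: str) -> list[str]:
    """Return the distinct course codes in text, in order of appearance."""
    found = [f"{major} {number}" for major, number in COURSE.findall(text)]
    return list(dict.fromkeys(found))


def split_requisites(text: str) -> dict[str, list[str]]:
    body = HEADER.sub("", text.strip(), count=1)
    # With one capture group, split() yields [before, label, chunk, label, chunk, ...].
    pieces = LABEL.split(body)
    result = {"pre": [], "co": [], "unlabelled": course_codes(pieces[0])}
    for label, chunk in zip(pieces[1::2], pieces[2::2]):
        key = "pre" if label.lower().startswith("pre") else "co"
        result[key].extend(course_codes(chunk))
    return result


labelled = ("Prerequisites/Corequisites: Prerequisites: CSCI 1200 and Introduction to "
            "Calculus (MATH 1010 or MATH 1500 or MATH 1020 or MATH 2010); "
            "MATH 1020 is strongly recommended.")
print(split_requisites(labelled)["pre"])
# ['CSCI 1200', 'MATH 1010', 'MATH 1500', 'MATH 1020', 'MATH 2010']

# Made-up text with no inner label: the codes cannot be classified.
print(split_requisites("Prerequisites/Corequisites: ABCD 1000 and ABCD 2000."))
# {'pre': [], 'co': [], 'unlabelled': ['ABCD 1000', 'ABCD 2000']}
```

The sketch flattens the and/or structure of the requirement, so it only illustrates the labelling issue, not a complete parse.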

![Alt text](image-4.png)

This is RCOS for next semester. However, if you did not already know the course code, it would be impossible to tell that this was RCOS. So, when parsed, there will be no prerequisites, corequisites, or description for the course, even though this is not actually true.

It is worth mentioning that there are two prerequisite fields: one called prerequisites, i.e. "Prerequisites/Corequisites: Prerequisites: CSCI 1200 and Introduction to Calculus (MATH 1010 or MATH 1500 or MATH 1020 or MATH 2010); MATH 1020 is strongly recommended.", and another called raw, or raw prerequisites, in the database. Raw is

![Alt text](image-5.png)

in the webpage. When you click on a course in the explore page, this is the information that is displayed as the prerequisites of a course. Notably, raw is actually reliable, so the issues with prerequisites and corequisites mentioned here mostly concern corequisites.

However, aside from raw, all of the other situations are unique problems that have no solution, or only a limited one, in the web scraper. This is especially true of the corequisite problem.
160 changes: 160 additions & 0 deletions rpi_data/modules/course.py
@@ -0,0 +1,160 @@
import copy


class Course:
    name: str
    credits: int
    days: str
    stime: int
    etime: int
    profs: str
    loc: str
    max: int
    curr: int
    rem: int
    dept: str
    sdate: str
    enddate: str
    sem: str
    crn: int
    code: int
    section: int
    short: str
    long: str
    frequency: str
    desc: str
    raw: str
    pre: list
    co: list
    major: str
    school: str
    lec: str

    # info is a list of strings:
    # [crn, major, code, section, credits, name, days, stime, etime, max, curr, rem, profs, sdate, enddate, loc]
    def __init__(self, info):
        self.crn = info[0]
        self.major = info[1]
        self.code = info[2]
        self.section = info[3]
        self.credits = info[4]
        self.name = info[5]
        self.days = info[6]
        self.stime = info[7]
        self.etime = info[8]
        self.max = info[9]
        self.curr = info[10]
        self.rem = info[11]
        self.profs = info[12]
        self.sdate = info[13]
        self.enddate = info[14]
        self.loc = info[15]
        self.long = self.processName(self.name)
        self.frequency = ""
        self.short = self.major + '-' + self.code
        self.lec = "LEC"
        self.desc = ""
        self.raw = ""
        self.pre = list()
        self.co = list()
        self.school = ""
        self.sem = ""

    def processName(self, name: str) -> str:
        # Capitalize purely alphabetic words; leave tokens with digits or punctuation as they are.
        tmp = name.split()
        for i, word in enumerate(tmp):
            if word.isalpha():
                tmp[i] = word[:1].upper() + word[1:].lower()
        return ' '.join(tmp)

    def addSemester(self, semester):
        self.sem = semester.upper()

    def addReqs(self, pre: list = None, co: list = None, raw: str = "", desc: str = ""):
        # None instead of [] avoids sharing one mutable default list between calls.
        self.desc = desc
        self.raw = raw
        self.pre = copy.deepcopy(pre) if pre is not None else []
        self.co = copy.deepcopy(co) if co is not None else []

    def addReqsFromList(self, info: list):
        # info is [pre, co, raw, desc]
        self.pre = info[0]
        self.co = info[1]
        self.raw = info[2]
        self.desc = info[3]

    def print(self):
        for attr, value in self.__dict__.items():
            print(attr, " : ", value)

    # Turn the class back into a list.
    # Because of the differences between how we store the data and how we want it laid out,
    # a lot of reordering is needed. Maybe there's a different way of doing this, hopefully there is.
    def decompose(self) -> list:
        return [
            self.name,       # 0
            self.lec,        # 1
            self.credits,    # 2
            self.days,       # 3
            self.stime,      # 4
            self.etime,      # 5
            self.profs,      # 6
            self.loc,        # 7
            self.max,        # 8
            self.curr,       # 9
            self.rem,        # 10
            self.major,      # 11
            self.sdate,      # 12
            self.enddate,    # 13
            self.sem,        # 14
            self.crn,        # 15
            self.code,       # 16
            self.section,    # 17
            self.short,      # 18
            self.long,       # 19
            self.desc,       # 20
            self.raw,        # 21
            self.frequency,  # 22
            self.pre,        # 23
            self.co,         # 24
            self.school,     # 25
        ]

    def list_to_class(self, row):
        self.name = row[0]
        self.lec = row[1]
        self.credits = row[2]
        self.days = row[3]
        self.stime = row[4]
        self.etime = row[5]
        self.profs = row[6]
        self.loc = row[7]
        self.max = row[8]
        self.curr = row[9]
        self.rem = row[10]
        self.major = row[11]
        self.sdate = row[12]
        self.enddate = row[13]
        self.sem = row[14]
        self.crn = row[15]
        self.code = row[16]
        self.section = row[17]
        self.short = row[18]
        self.long = row[19]
        self.desc = row[20]
        self.raw = row[21]
        self.frequency = row[22]  # decompose() stores frequency here; it was previously skipped
        self.pre = row[23]
        self.co = row[24]
        self.school = row[25]

    def addSchool(self, school):
        self.school = school

    def __lt__(self, other):
        # Note that we will maybe need to compare times? Idk how to handle the case where the classes
        # are the same (ie lab, lecture, test) so at the moment the lecture appears last.
        # So far we just sort in reverse order. Comparing tuples keeps the ordering consistent:
        # code only matters when the majors are equal, section only when the codes are equal.
        return (self.major, self.code, self.section) > (other.major, other.code, other.section)

    def __str__(self):
        return self.name
46 changes: 46 additions & 0 deletions rpi_data/modules/csv_to_course.py
@@ -0,0 +1,46 @@
import csv
import os

from course import Course

__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))

# This file takes our csv formatting and turns it into a Course object.
# If something goes wrong, it's because one of those two was changed.


def parse_csv(filename):
    courses = list()
    with open(os.path.join(__location__, filename), 'r', encoding="utf8", newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        for row in reader:
            temp = Course(["" for _ in range(16)])
            temp.name = row[0]
            temp.lec = row[1]
            temp.credits = row[2]
            temp.days = row[3]
            temp.stime = row[4]
            temp.etime = row[5]
            temp.profs = row[6]
            temp.loc = row[7]
            temp.max = row[8]
            temp.curr = row[9]
            temp.rem = row[10]
            temp.major = row[11]
            temp.sdate = row[12]
            temp.enddate = row[13]
            temp.sem = row[14]
            temp.crn = row[15]
            temp.code = row[16]
            temp.section = row[17]
            temp.short = row[18]
            temp.long = row[19]
            temp.desc = row[20]
            temp.raw = row[21]
            # empty column here (index 22)
            temp.pre = row[23]
            temp.co = row[24]
            temp.school = row[25]
            courses.append(temp)
    return courses
72 changes: 72 additions & 0 deletions rpi_data/modules/headless_login.py
@@ -0,0 +1,72 @@
import selenium as sel
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os
import sys

# Remember to add environment variables named rcsid and rcspw with your account info!!!
#
# THINGS THAT CAN POTENTIALLY GO WRONG HERE AND HOW TO FIX THEM:
#
# - If the RPI login website changes at all, it's very likely that the login will break. Fixing might involve changing what element selenium looks for.
# - DUO likes to change things. If they implement another 2FA type or add extra buttons for some reason you'll have to add more checks and button presses
# - Selenium errors can occur if your internet is slow or if you have multiple browser instances open, so try to avoid this
#
# - You need to install firefox (I hate Google Chrome, and you should too). If you change it to be a Chrome instance instead it probably won't work from my experience
# - To fix these things you can comment this line: "options.add_argument("--headless")" in the parse_runner file to see what goes wrong if python doesn't throw anything
# - Try restarting python/vscode or even your computer if it's throwing something really weird for no reason
# - You can try sending me a message on discord @gcm as a last resort


def wait_for(driver, xpath):
    # Poll until the element at xpath shows up, then return it.
    while len(driver.find_elements(By.XPATH, xpath)) == 0:
        time.sleep(.1)
    return driver.find_element(By.XPATH, xpath)


def login(driver):
    URL = "http://sis.rpi.edu"
    driver.get(URL)  # uses a selenium webdriver to go to the sis website, which then redirects to the rcs auth website
    username_box = driver.find_element(by=By.NAME, value="j_username")  # j_username is the username text box
    password_box = driver.find_element(by=By.NAME, value="j_password")  # j_password is the password box
    submit = driver.find_element(by=By.NAME, value="_eventId_proceed")  # _eventId_proceed is the submit button
    username = os.environ.get("rcsid", "NONEFOUND")
    password = os.environ.get("rcspw", "NONEFOUND")
    if username == "NONEFOUND" or password == "NONEFOUND":
        print("username or password not found, check environment variables or input them manually")
        username = input("Enter username: ")
        password = input("Enter password: ")
    username_box.send_keys(username)  # enters the username
    password_box.send_keys(password)  # enters the password
    submit.click()  # click the submit button
    # If the details were entered incorrectly we are not on the duo verification website yet, so ask again.
    while "duosecurity" not in driver.current_url:
        print("User or Password Incorrect.")
        username_box = driver.find_element(by=By.NAME, value="j_username")  # we have to redefine the variables because the webpage reloads
        password_box = driver.find_element(by=By.NAME, value="j_password")
        submit = driver.find_element(by=By.NAME, value="_eventId_proceed")
        username = input("Enter Username: ")
        password = input("Enter Password: ")
        username_box.clear()  # the username box by default has your previous username entered, so we clear it
        username_box.send_keys(username)
        password_box.send_keys(password)
        submit.click()
    # Open the other-options link, then pick the first DUO option.
    wait_for(driver, '/html/body/div/div/div[1]/div/div[2]/div[7]/a').click()
    wait_for(driver, '/html/body/div/div/div[1]/div/div[1]/ul/li[1]/a').click()
    print("Your DUO code: " + wait_for(driver, '/html/body/div/div/div[1]/div/div[2]/div[3]').text)  # print the duo code
    # We need to press the trust browser button, so we wait until that shows up, then click it.
    wait_for(driver, '//*[@id="trust-browser-button"]').click()
    time.sleep(3)
    if driver.current_url == "https://sis.rpi.edu/rss/twbkwbis.P_GenMenu?name=bmenu.P_MainMnu":  # check if we're in the right place
        return "Success"
    else:
        print("login failed")
        return "Failure"
Binary file added rpi_data/modules/image-1.png
Binary file added rpi_data/modules/image-2.png
Binary file added rpi_data/modules/image-3.png
Binary file added rpi_data/modules/image-4.png
Binary file added rpi_data/modules/image-5.png
Binary file added rpi_data/modules/image.png