801 Rewrite Webscraper to use SIS (#849)

YACS-RCOS · Dec 8, 2023 · 0f00a6f
1 parent da80d9b
commit 0f00a6f
Showing 14 changed files with 898 additions and 7 deletions.
diff --git a/rpi_data/modules/READEME.md b/rpi_data/modules/READEME.md
@@ -0,0 +1,90 @@
+# Package requirements
+pandas
+bs4
+requests
+lxml
+regex
+pyyaml
+selenium
+
+# Course Parser
+Hopefully this will be the last one.
+The relevant files in the folder are csv_to_course.py, course.py, headless_login.py, new_parse.py, and parse_runner.py
+The other files in the folder are legacy code from the old web scraper. For now they will remain here, as some pieces of that code could be useful if future edge cases pop up.
+
+# How to run
+Run parse_runner.py with a term as the command-line argument. The term is formatted as termYEAR. If the specified csv doesn't exist, it does a full parse; if it does exist, it immediately starts updating.
+
+# Common issues with SIS scraping
+
+------------------------------------------------------------------------------------------------------------------------------------------
+Sel | CRN | Maj | Cod | Sec | Cmp | Crd | Nme | Dys | Tme | Cap | Act | Rem | WLC | WLA | WLR | XLC | XLA | XLR | Prof | Date | Loc | Attr
+------------------------------------------------------------------------------------------------------------------------------------------
+SR  |99341|CSCI |1100 |01   | T   | 4.00| CSI | MR  |12-150| 24 | 3   | 21  | 0   | 0   | 0   | 0   | 0   | 0   |Stur  |01-04 |TBA  |Intro
+------------------------------------------------------------------------------------------------------------------------------------------
+While some details have been truncated to fit, this is an example of what we expect a course to look like in SIS, and most courses in SIS follow this format.
+But SIS is not perfect, and there are often mistakes in courses.
+The first main one (though this is more a design decision than a mistake) is that many parts of a course may be empty, for example
+------------------------------------------------------------------------------------------------------------------------------------------
+    |     |     |     |     |     |     |     | T  |10-1150|   |     |    |     |     |     |     |     |     |TBA  |01-04 |TBA  |Intro
+------------------------------------------------------------------------------------------------------------------------------------------
+This is the lab block for the above CS1 course. As you may notice, most of the details are missing, so it is impossible to build out a course from this information alone.
+However, since this appears directly below the CS1 lecture block in SIS, we parse it immediately after parsing the lecture block.
+So, we keep a copy of the previous course that we parsed, and use that to fill in information about lab and test blocks.
+
+Another common issue is the use of colspan, for example, Biomed 6940 in spring 2024, which looks like this in SIS
+------------------------------------------------------------------------------------------------------------------------------------------
+SR  |90453|BMED |6940 |01   | T   | 1-9 | REB | TBA  |     | 0 | 0   | 0  | 0   | 0   | 0   | 0   | 0   | 0   |TBA  |01-04 |TBA  |
+------------------------------------------------------------------------------------------------------------------------------------------
+However, it is parsed as
+------------------------------------------------------------------------------------------------------------------------------------------
+SR  |90453|BMED |6940 |01   | T   | 1-9 | REB | TBA       | 0 | 0   | 0  | 0   | 0   | 0   | 0   | 0   | 0   |TBA  |01-04 |TBA  |
+-----------------------------------------------------------------------------------------------------------------------------------------
+because of the use of colspan in the days column. This means that our indices are off when we start formatting and processing the row, which crashes the web scraper. We get around this by inserting a TBA for the value of the colspan that we see.
+
+We also split some fields into two different fields, namely the start and end times of a class (e.g. 2:00pm - 3:50pm) and the start and end dates of the class (e.g. 01/08-04/24).
+However, courses often have these fields as TBA or blank, e.g. ADMN 1030, which looks like this
+
+------------------------------------------------------------------------------------------------------------------------------------------
+SR  |93972|ADMN |1030 |01   | T   | 0.00 | AXPA |   |  TBA |1000| 443| 557| 0  | 0   | 0   | 0   | 0  | 0   |Cary  |01-04 |TBA  |
+------------------------------------------------------------------------------------------------------------------------------------------
+
+As you can see, the time field is TBA, which we can't easily divide into two separate fields. At the moment we just add two TBAs in this case. However, if another web scraper is needed or this one breaks, this may be a source of failure.
+
+These are some of the more common offenders, but other issues can pop up, so generally a row should always have exactly 21 items in it before we begin processing. Many issues that pop up with the webscraper are related to rows not having a length of 21.
+
+# Common Issues with catalog scraping
+
+Unfortunately, SIS scraping is relatively simple compared to catalog scraping, which has many issues.
+Most of these issues will probably (hopefully) disappear when the catalog API is implemented.
+However, I am under the assumption that a speedup is all we can expect from the API.
+SIS will not give us everything that we want, in particular the prerequisites and corequisites of a course. To get those we need to scrape the catalog, in particular this link
+https://sis.rpi.edu/rss/bwckctlg.p_disp_course_detail?cat_term_in=?&subj_code_in=?&crse_numb_in=?
+Where you would replace the ?'s with a basevalue (the integer representation of a semester - Spring 2024 -> 202401), Major, and course code.
+
+Because there is notably less information to parse, there are fewer issues with the catalog at present, though some of the issues are more severe.
+![Alt text](image-1.png)
+
+This is as close to the ideal course as one can find: there is a clear list of prerequisites and corequisites, as well as a description. (There is a slight issue where it's listed as "Prerequisites/Corequisites: Prerequisite:" instead of "Prerequisites/Corequisites: Prerequisites:" like other courses, but it's pretty good beyond that.)
+
+However, there are many courses that do not follow this, for example, 
+
+![Alt text](image-3.png)
+
+Even though capstone is listed as having prerequisites or corequisites, it only has prerequisites, and that is difficult for a computer to distinguish, namely because it is missing the "Prerequisites:" or "Corequisites:" label that other courses have. For reasons mentioned below, this is not too big of an issue for prerequisites, as there is a consistent way to get those, but getting corequisites consistently is difficult.
+
+![Alt text](image-2.png)
+
+This is another case of weird prerequisite and corequisite formatting, where it is difficult to separate the two.
+
+![Alt text](image-4.png)
+
+This is RCOS for next semester; however, if you did not already know the course code, it would be impossible to tell that this was RCOS. So, when parsed, there will be no prerequisites, corequisites, or description for the course, even though this is not actually true.
+
+It is worth mentioning that there are two prerequisite fields: one called prerequisites, i.e. "Prerequisites/Corequisites: Prerequisites: CSCI 1200 and Introduction to Calculus (MATH 1010 or MATH 1500 or MATH 1020 or MATH 2010); MATH 1020 is strongly recommended.", and another called raw, or raw prerequisites, in the database. Raw is
+
+![Alt text](image-5.png)
+
+on the webpage. When you click on a course in the explore page, this is the information that is displayed as the prerequisites of a course. Notably, raw is actually reliable, so the issues with prerequisites and corequisites mentioned here mostly concern corequisites.
+
+However, aside from raw, all of the other situations are unique problems that have no solution, or only a limited one, in the webscraper. This is especially true of the corequisite problem.
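The SIS row fixes described in the README (padding colspan cells with TBA, splitting time and date ranges, and filling lab or test blocks from the previous course) can be sketched as below. This is a minimal illustration, not the code in new_parse.py: the function names and the `(text, colspan)` input shape are assumptions made for the example, and filling every blank cell from the previous row is a simplification.

```python
from typing import List, Optional, Tuple


def expand_colspan(cells: List[Tuple[str, int]]) -> List[str]:
    """Expand (text, colspan) pairs so every logical column gets an entry.

    A cell spanning N columns yields its own text plus N-1 "TBA" fillers,
    which keeps later column indices aligned. (Hypothetical helper.)
    """
    row: List[str] = []
    for text, span in cells:
        row.append(text.strip())
        row.extend(["TBA"] * (span - 1))
    return row


def split_range(value: str, sep: str = "-") -> Tuple[str, str]:
    """Split "start-end" into two fields; TBA or blank becomes two TBAs."""
    value = value.strip()
    if not value or value.upper() == "TBA" or sep not in value:
        return ("TBA", "TBA")
    start, end = value.split(sep, 1)
    return (start.strip(), end.strip())


def fill_from_previous(row: List[str], previous: Optional[List[str]]) -> List[str]:
    """Fill blank cells of a lab/test row from the previously parsed row.

    Assumes both rows have the same length; zip() would silently truncate
    otherwise, so callers should check lengths first.
    """
    if previous is None:
        return list(row)
    return [cur if cur.strip() else prev for cur, prev in zip(row, previous)]
```

Checking that a row has the expected length after these steps, before building a course from it, would catch most of the misalignment problems listed above.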
diff --git a/rpi_data/modules/course.py b/rpi_data/modules/course.py
@@ -0,0 +1,160 @@
+import time
+import pdb
+import copy
+from typing import overload
+class Course:
+    name:str
+    credits:int
+    days:str
+    stime:int
+    etime:int
+    profs:str
+    loc:str
+    max:int
+    curr:int
+    rem:int
+    dept:str
+    sdate:str
+    enddate:str
+    sem:str
+    crn:int 
+    code:int
+    section:int
+    short:str
+    long:str
+    frequency:str
+    desc:str 
+    raw:str
+    pre:list
+    co:list
+    major:str
+    school:str
+    lec:str
+    #Info will be an array of strings: 
+    # [crn, major, code, section, credits, name, days, stime, etime, max, curr, rem, profs, sdate, enddate, loc]
+
+    def __init__(self, info):
+        self.crn = info[0]
+        self.major = info[1]
+        self.code = info[2]
+        self.section = info[3]
+        self.credits = info[4]
+        self.name = info[5]
+        self.days = info[6]
+        self.stime = info[7]
+        self.etime = info[8]
+        self.max = info[9]
+        self.curr = info[10]
+        self.rem = info[11]
+        self.profs = info[12]
+        self.sdate = info[13]
+        self.enddate = info[14]
+        self.loc = info[15]
+        self.long = self.processName(self.name)
+        self.frequency = ""
+        self.short = self.major + '-' + self.code
+        self.lec = "LEC"
+        self.desc = ""
+        self.raw = ""
+        self.pre = list()
+        self.co = list()
+        self.school = ""
+        self.sem = ""
+
+    def processName(self, name:str) -> str:
+        tmp = name.split()
+        for i in range(0, len(tmp), 1):
+            if not tmp[i].isalpha():
+                continue 
+            tmp[i]= tmp[i][:1].upper() + tmp[i][1:].lower()
+        return ' '.join(tmp)
+    def addSemester(self, semester):
+        self.sem = semester.upper()
+    def addReqs(self, pre:list=[], co:list=[], raw:str="", desc: str=""):
+        self.desc = desc
+        self.raw = raw
+        self.pre = copy.deepcopy(pre)
+        self.co = copy.deepcopy(co)
+
+    def addReqsFromList(self, info: list=[]):
+        self.pre = info[0]
+        self.co = info[1]
+        self.raw = info[2]
+        self.desc = info[3]
+    def print(self):
+        for attr, value in self.__dict__.items():
+            print(attr, " : ", value)
+    #Turn the class back into a list. 
+    #Because of the diffs in how we store vs how we want it to be, need to do a lot of swapping
+    #Maybe there's a diff way than doing this, hopefully there is
+    def decompose(self) -> list[str]:
+        retList = []
+        retList.append(self.name)
+        retList.append(self.lec)
+        retList.append(self.credits)
+        retList.append(self.days)
+        retList.append(self.stime)
+        retList.append(self.etime)
+        retList.append(self.profs)
+        retList.append(self.loc)
+        retList.append(self.max)
+        retList.append(self.curr)
+        retList.append(self.rem)
+        retList.append(self.major)
+        retList.append(self.sdate)
+        retList.append(self.enddate)
+        retList.append(self.sem)
+        retList.append(self.crn)
+        retList.append(self.code)
+        retList.append(self.section)
+        retList.append(self.short)
+        retList.append(self.long)
+        retList.append(self.desc)
+        retList.append(self.raw)
+        retList.append(self.frequency)
+        retList.append(self.pre)
+        retList.append(self.co)
+        retList.append(self.school)
+        return retList
+
+    def list_to_class(self, row):
+        self.name = row[0]
+        self.lec = row[1]
+        self.credits = row[2]
+        self.days = row[3]
+        self.stime = row[4]
+        self.etime = row[5]
+        self.profs = row[6]
+        self.loc = row[7]
+        self.max = row[8]
+        self.curr = row[9]
+        self.rem = row[10]
+        self.major = row[11]
+        self.sdate = row[12]
+        self.enddate = row[13]
+        self.sem = row[14]
+        self.crn = row[15]
+        self.code = row[16]
+        self.section = row[17]
+        self.short = row[18]
+        self.long = row[19]
+        self.desc = row[20]
+        self.raw = row[21]
+        #decompose() writes frequency at index 22, so read it back from there
+        self.frequency = row[22]
+        self.pre = row[23]
+        self.co = row[24]
+        self.school = row[25]
+
+    def addSchool(self, school):
+        self.school = school
+    def __lt__(self, other):
+        #Note that we will maybe need to compare times? Idk how to handle the case where the classes
+        #are the same (ie lab, lecture, test) so at the moment the lecture appears last.
+        #  So far we just sort in reverse order.
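+        #Only fall through to the next field when the current one ties; otherwise
+        #a larger code could outrank a smaller major and break the ordering.
+        if self.major != other.major:
+            return self.major > other.major
+        if self.code != other.code:
+            return self.code > other.code
+        return self.section > other.section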
+
+    def __str__(self):
+        return self.name
diff --git a/rpi_data/modules/csv_to_course.py b/rpi_data/modules/csv_to_course.py
@@ -0,0 +1,46 @@
+import csv
+from course import Course
+import os
+import pdb
+__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
+
+# This file takes our csv formatting and turns it into a course class type. If something goes wrong it's because you changed one of those two.
+
+def parse_csv(filename):
+    courses = list()
+    # newline="" is what the csv module recommends so quoted newlines parse correctly
+    with open(os.path.join(__location__, filename), 'r', encoding="utf8", newline="") as f:
+        reader = csv.reader(f)
+        next(reader, None)  # skip the header row
+        for row in reader:
+            temp = Course(["" for _ in range(16)])
+            temp.name = row[0]
+            temp.lec = row[1]
+            temp.credits = row[2]
+            temp.days = row[3]
+            temp.stime = row[4]
+            temp.etime = row[5]
+            temp.profs = row[6]
+            temp.loc = row[7]
+            temp.max = row[8]
+            temp.curr = row[9]
+            temp.rem = row[10]
+            temp.major = row[11]
+            temp.sdate = row[12]
+            temp.enddate = row[13]
+            temp.sem = row[14]
+            temp.crn = row[15]
+            temp.code = row[16]
+            temp.section = row[17]
+            temp.short = row[18]
+            temp.long = row[19]
+            temp.desc = row[20]
+            temp.raw = row[21]
+            #empty column here
+            temp.pre = row[23]
+            temp.co = row[24]
+            temp.school = row[25]
+            courses.append(temp)
+    return courses
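Because the csv layout and the `Course` attributes are coupled purely by position, one way to keep them in step is a single ordered list of field names that both the reader and the writer consult. The sketch below takes the column order from the assignments above; the names `FIELDS`, `row_to_fields`, and `read_rows` are hypothetical, and index 22 is treated as the empty column noted in the code.

```python
import csv
import io
from typing import Dict, List, Optional, TextIO

# Column order taken from parse_csv; None marks the unused column at index 22.
FIELDS: List[Optional[str]] = [
    "name", "lec", "credits", "days", "stime", "etime", "profs", "loc",
    "max", "curr", "rem", "major", "sdate", "enddate", "sem", "crn",
    "code", "section", "short", "long", "desc", "raw", None,
    "pre", "co", "school",
]


def row_to_fields(row: List[str]) -> Dict[str, str]:
    """Map one csv row to {attribute name: value}, skipping unused columns."""
    if len(row) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(row)}")
    return {name: row[i] for i, name in enumerate(FIELDS) if name is not None}


def read_rows(f: TextIO) -> List[Dict[str, str]]:
    """Read every data row from an open csv file, skipping the header."""
    reader = csv.reader(f)
    next(reader, None)  # header row
    return [row_to_fields(row) for row in reader]


# Usage with an in-memory file: one header line and one data row.
sample = io.StringIO(
    ",".join(f"h{i}" for i in range(26)) + "\n"
    + ",".join(str(i) for i in range(26)) + "\n"
)
parsed = read_rows(sample)
```

Each resulting dict could then be applied to a course with `setattr`, and a length mismatch raises early instead of shifting every later field.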
diff --git a/rpi_data/modules/headless_login.py b/rpi_data/modules/headless_login.py
@@ -0,0 +1,72 @@
+import selenium as sel
+from selenium import webdriver
+from selenium.webdriver.common.by import By
+from selenium.webdriver.support.ui import Select
+from selenium.webdriver.firefox.options import Options
+from selenium.webdriver.support.ui import WebDriverWait 
+from selenium.webdriver.support import expected_conditions as EC 
+import time
+import os
+import sys
+
+# Remember to add environment variables named rcsid and rcspw with your account info!!!
+#
+# THINGS THAT CAN POTENTIALLY GO WRONG HERE AND HOW TO FIX THEM:
+# 
+# - If the RPI login website changes at all, it's very likely that the login will break. Fixing might involve changing what element selenium looks for.
+# - DUO likes to change things. If they implement another 2FA type or add extra buttons for some reason you'll have to add more checks and button presses
+# - Selenium errors can occur if your internet is slow or if you have multiple browser instances open, so try to avoid this
+# 
+# - You need to install firefox (I hate Google Chrome, and you should too). If you change it to be a Chrome instance instead it probably won't work from my experience
+# - To fix these things you can comment this line: "options.add_argument("--headless")" in the parse_runner file to see what goes wrong if python doesn't throw anything
+# - Try restarting python/vscode or even your computer if it's throwing something really weird for no reason
+# - You can try sending me a message on discord @gcm as a last resort
+
+
+def login(driver):
+    URL = "http://sis.rpi.edu"
+    driver.get(URL) # uses a selenium webdriver to go to the sis website, which then redirects to the rcs auth website
+    username_box = driver.find_element(by=By.NAME, value = "j_username") # creates a variable which contains an element type, so that we can interact with it, j_username is the username text box
+    password_box = driver.find_element(by=By.NAME, value = "j_password") # j_password is the password box
+    submit = driver.find_element(by=By.NAME, value = "_eventId_proceed") # _eventId_proceed is the submit button
+    username = os.environ.get("rcsid", "NONEFOUND")
+    password = os.environ.get("rcspw", "NONEFOUND")
+    if (username == "NONEFOUND" or password == "NONEFOUND"):
+        print("username or password not found, check environment variables or input them manually")
+        username = input("Enter username: ")
+        password = input("Enter password: ")
+    username_box.send_keys(username) # enters the username
+    password_box.send_keys(password) # enters the password
+    submit.click() # click the submit button
+    while ("duosecurity" not in driver.current_url): # if you entered details incorrectly, the loop will be entered as you aren't on the duo verification website (redo what we did before)
+        print("User or Password Incorrect.")
+        username_box = driver.find_element(by=By.NAME, value = "j_username") # we have to redefine the variables because the webpage reloads
+        password_box = driver.find_element(by=By.NAME, value = "j_password")
+        submit = driver.find_element(by=By.NAME, value = "_eventId_proceed")
+        username = input("Enter Username: ")
+        password = input("Enter Password: ")
+        username_box.clear() # the username box by default has your previous username entered, so we clear it
+        username_box.send_keys(username)
+        password_box.send_keys(password)
+        submit.click()
+    while len(driver.find_elements(By.XPATH, '/html/body/div/div/div[1]/div/div[2]/div[7]/a'))==0:
+        time.sleep(.1)
+    options  = driver.find_element(By.XPATH, '/html/body/div/div/div[1]/div/div[2]/div[7]/a')
+    options.click()
+    while len(driver.find_elements(By.XPATH, '/html/body/div/div/div[1]/div/div[1]/ul/li[1]/a')) == 0:
+        time.sleep(.1)
+    duo_option = driver.find_element(By.XPATH, '/html/body/div/div/div[1]/div/div[1]/ul/li[1]/a')
+    duo_option.click()
+    while len(driver.find_elements(By.XPATH, '/html/body/div/div/div[1]/div/div[2]/div[3]')) == 0:
+        time.sleep(.1)
+    print("Your DUO code: "+ driver.find_element(by= By.XPATH, value = "/html/body/div/div/div[1]/div/div[2]/div[3]").text) # print the duo code
+    while len(driver.find_elements(By.XPATH, '//*[@id="trust-browser-button"]'))==0: # we need to press the trust browser button, so we wait until that shows up
+        time.sleep(.1)
+    trust_button = driver.find_element(By.XPATH, '//*[@id="trust-browser-button"]') #find and click it
+    trust_button.click()
+    time.sleep(3)
+    if (driver.current_url == "https://sis.rpi.edu/rss/twbkwbis.P_GenMenu?name=bmenu.P_MainMnu"): # check if we're in the right place
+        return "Success"
+    else:
+        print("login failed")
+        return "Failure"
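The `while len(driver.find_elements(...)) == 0: time.sleep(.1)` loops in `login` never give up, so a changed DUO page would hang the scraper. A bounded polling helper is one way to address that. The sketch below is pure Python and the name `wait_for` is hypothetical; Selenium's own `WebDriverWait(driver, timeout).until(...)` with `expected_conditions`, which the file imports but does not use, provides the same idea built in.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_for(probe: Callable[[], T], timeout: float = 30.0, interval: float = 0.1) -> T:
    """Call probe() until it returns a truthy value or the timeout expires.

    Returns the truthy value, or raises TimeoutError so the caller can
    report a failure instead of looping forever. (Hypothetical helper.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# With Selenium, a probe could be written as (not executed here):
#   elements = wait_for(lambda: driver.find_elements(By.XPATH, xpath))
# since find_elements returns an empty, falsy list until a match appears.
```

Catching `TimeoutError` in `login` and returning "Failure" would keep the existing return contract while making a broken page visible.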
diff --git a/rpi_data/modules/image-1.png b/rpi_data/modules/image-1.png
diff --git a/rpi_data/modules/image-2.png b/rpi_data/modules/image-2.png
diff --git a/rpi_data/modules/image-3.png b/rpi_data/modules/image-3.png
diff --git a/rpi_data/modules/image-4.png b/rpi_data/modules/image-4.png
diff --git a/rpi_data/modules/image-5.png b/rpi_data/modules/image-5.png
diff --git a/rpi_data/modules/image.png b/rpi_data/modules/image.png