
BusinessAnalyticsTeamProject/DataMining


📄 Background

1. Increasing competition and unclear passing criteria


42 School is rapidly gaining global recognition as a prestigious educational institution, particularly in Korea, where the competition rate is an astonishing 44 to 1. However, the selection process includes a month-long testing period, and how the crucial activities in it are judged is kept secret. This makes it hard for prospective students to plan their preparation.

2. Globalized campuses of 42 School


There are 50 campuses of 42 School, and the education system originated at 42 Ecole, the first campus. As the schools become increasingly global, cracks will inevitably appear in their operating policies and educational standards. It is unclear whether students at each 42 campus are being evaluated according to the standards of the original 42 Ecole. Our team was formed to address this question using Business Analytics techniques.

😎 Purpose

image

1. Build a model that can predict the probability that an applicant will be Passed/Failed

2. Identify important factors that affect the final selection results of participants

3. Compare the selection criteria of 42 Seoul and Ecole 42 to identify the differences in educational priority and operating policies

📚 Data Collection

Overall Collecting Process


We completed the data collection process by following these five steps. We created raw data through API calls, merged it with the crawled data, and deleted unnecessary columns.

1. User Raw Data: API

image
By calling /v2/campus/:campus_id/users, we separately collected raw data for all users of the 42 Seoul campus and the 42 Ecole campus, whose campus_ids are 29 and 1, respectively.

image
To get the raw data, we found the campus IDs of the Seoul and Ecole campuses and retrieved the data through API requests. This is what user raw data looks like.
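
The request loop for this step can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the base URL, the bearer-token header, and the `page[number]` / `page[size]` query parameters are assumptions about the 42 API, and every function name is hypothetical. Only the endpoint path and the campus_ids 29 and 1 come from the text above.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.intra.42.fr"    # assumption: base URL of the 42 API
CAMPUS_IDS = {"seoul": 29, "ecole": 1}  # campus_ids stated in this README


def campus_users_url(campus_id, page, per_page=100):
    """Build the URL for one page of /v2/campus/:campus_id/users."""
    # Assumption: pagination uses page[number] and page[size] query parameters.
    query = urllib.parse.urlencode({"page[number]": page, "page[size]": per_page})
    return f"{API_ROOT}/v2/campus/{campus_id}/users?{query}"


def http_get_json(url, token):
    """GET a URL with a bearer token (assumed auth scheme) and decode the JSON body."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def fetch_campus_users(campus_id, get_json, per_page=100):
    """Collect all users of a campus by requesting pages until one comes back empty."""
    users, page = [], 1
    while True:
        batch = get_json(campus_users_url(campus_id, page, per_page))
        if not batch:
            break
        users.extend(batch)
        page += 1
    return users
```

With a valid token, a call such as `fetch_campus_users(CAMPUS_IDS["seoul"], lambda url: http_get_json(url, token))` would gather the Seoul users. Passing the fetch function as an argument keeps the paging logic testable without network access.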

2. Feedback and Evaluation Data: API with Python Code for Processing

image
By calling /v2/users/:user_id/scale_teams/as_corrector and /v2/users/:user_id/scale_teams/as_corrected, we obtained JSON data listing the evaluation events in which a user took part as a corrector and as a correction recipient. To use the feedback a user gave to others and the feedback that user received as independent variables, we counted the number of items in each JSON response.
image
This is a sample of the as_corrected data structure. By counting the items in each response, we worked out how many evaluations each user gave (as corrector) and how much feedback they received (as corrected).
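
The counting step can be sketched as follows, assuming each response body is a JSON array with one object per evaluation event. The function names and the two output column names are hypothetical and are not taken from the project code.

```python
import json


def count_items(response_text):
    """Return the number of items in a JSON array response."""
    items = json.loads(response_text)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of evaluation events")
    return len(items)


def feedback_counts(as_corrector_json, as_corrected_json):
    """Turn the two scale_teams responses of one user into two count features."""
    return {
        "evaluations_given": count_items(as_corrector_json),     # hypothetical column name
        "evaluations_received": count_items(as_corrected_json),  # hypothetical column name
    }
```

If a user's events span several pages, the per-page counts would have to be summed before being stored as features.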

3. Level, Group Assignments, Penalty, Highest C Piscine, Final Exam Score: Crawling with Python Code

image
In 42 School, each user has a personal page that shows statistical information about that user. We collected the following data by crawling each user's page.
Level: Overall progress earned through assignments and midterm exams
Group Assignments: Optional group assignments
Penalty: Number of times the user was caught cheating; each time, 42 points are deducted from the assignment score
Highest C Piscine: The highest level of C-language assignment completed (0~13)
Final Exam Score: The score on the final exam

4. Merge All Files from the API and Crawling

image image image image
The files created through API calls and crawling include CSV files and a plain text file recording assignments and exam scores. From the plain text file, we extracted the Highest C Piscine, the Final Exam Score, and the Number of Group Assignments. We then summed these scores and divided the sum by a certain value to derive a level that closely resembles the actual level, and created a CSV file from this data. Next, we performed an inner join on the CSV files using the 'id' and 'login' columns, and filtered out all dummy data based on level and generation. This is what the final data looks like. Since participants whose final exam score is under 42 automatically fail, we could see that almost ¼ of the students did not pass the final exam.
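
The join and the 42-point cutoff can be sketched in plain Python as follows. This is a stand-in for whatever DataFrame merge the project used, not the original code: the row format (a list of dicts per CSV file) and the `final_exam` column name are assumptions, and only the join keys 'id' and 'login' and the threshold of 42 come from the text above.

```python
def inner_join(left_rows, right_rows, keys=("id", "login")):
    """Inner join two lists of dict rows on the given key columns."""
    index = {}
    for row in right_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left_rows:
        # Rows without a partner on the other side are dropped (inner join).
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**row, **match})
    return joined


def auto_fail_share(rows, threshold=42, score_column="final_exam"):
    """Share of rows whose final exam score is below the automatic-fail threshold."""
    if not rows:
        return 0.0
    return sum(row[score_column] < threshold for row in rows) / len(rows)
```

Running `auto_fail_share` on the merged rows gives the fraction of participants who fail automatically on the exam score alone.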

🛣 EDA

1. Pair plot

42 Seoul


This data set excludes all data points with a Final Exam score below 42. Even among participants who scored 42 or more, many fail the final selection, and their number is almost equal to the number of participants who pass.

42 Ecole

image

2. Statistics

42 Seoul

42 Ecole

image

3. Box Plot

42 Seoul

42 Ecole

image

4. Correlation Matrix

42 Seoul

42 Ecole

image

5. Pie Chart for PASS/FAIL Distribution

42 Seoul

42 Ecole

image

6. Bar Chart for Distribution of Other Variables

42 Seoul

42 Ecole

image

7. Bubble Chart for Peer Reviews and Result of the Exam

42 Seoul

42 Ecole

image

8. C Piscine Levels and Scaled Passed Ratio

42 Seoul

42 Ecole

image

💻 Main Data Mining Process

Split the data into train set and test set

image
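
One way such a split could be done is sketched below. The actual ratio and method appear only in the image, so the 80/20 ratio, the stratification by label, the seed, and the function name here are all assumptions made for illustration. Stratifying keeps the PASS/FAIL proportions the same in both sets, which matters when the classes are imbalanced.

```python
import random


def stratified_split(rows, label_key, test_ratio=0.2, seed=42):
    """Split dict rows into train and test sets, preserving the label proportions."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    train, test = [], []
    for group in by_label.values():
        group = group[:]  # copy so the caller's list is not shuffled in place
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```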

Pipelining in 42 Seoul

Pipelining in 42 Ecole

image

🚏 Post-Processing

Feature Importance

In 42 Seoul

image

In 42 Ecole

image

ROC Curve

In 42 Seoul

image

In 42 Ecole

image

Confusion Matrix

In 42 Seoul

image

In 42 Ecole

image

Accuracy score

image

Is high Accuracy better?

image
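
A small worked example shows why accuracy alone can mislead when one class dominates. The counts below are invented purely for illustration and are not project results. FAIL is treated as the positive class, and the hypothetical model predicts PASS for everyone.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    """Fraction of actual positives that the model finds."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical data: 90 PASS and 10 FAIL participants, model always predicts PASS.
tp, fp, fn, tn = 0, 0, 10, 90
always_pass_accuracy = accuracy(tp, fp, fn, tn)
always_pass_recall = recall(tp, fn)
```

In this made-up case the accuracy is 0.9 even though the model never identifies a single failing participant, so recall on the minority class (or the ROC curve and confusion matrix above) is a more informative check.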

Conclusion

The original name of our project was 'what is important'. We thought there must be a reason why 42 Academy emphasizes the learning process and peer learning to La Piscine students over the results of individual problems. We therefore expected that some factors in this test would matter more than assignment and exam scores. The outcome of the project indicates that peer evaluation is more important than any other factor. Through this project, we were able to show quantitatively that enjoying knowledge through mutual learning and teaching, rather than the amount of individual study, is the core value that 42 Academy emphasizes.

More importantly, by comparing and analyzing data from the globally distributed 42 schools, we were able to recognize differences in the data and rank the important factors through modeling. This not only guides participants to learn in the right direction by showing them their passing probability, but also serves as a quality assurance tool for the global 42 School management team. Applying this model to data collected from more campuses would increase the likelihood of running a good program with consistent quality.

Shortcomings

In the data extraction stage, it was disappointing that we couldn't extract data from other campuses because crawling took so long. We also regret not being able to extract detailed data such as campus popularity polls. Additionally, unlike in Korea, the campus in France passed a considerably larger number of participants. As a result, even with sampling, the data imbalance made it difficult to find a well-performing model.

👪 Team Information
