
📄 Background

1. Increasing competition rate & Unclear passing criteria

42 School is rapidly gaining global recognition as a prestigious educational institution, particularly in Korea, where the competition rate is an astonishing 44 to 1. However, a significant challenge arises during the month-long testing period, as the nature of crucial activities remains shrouded in secrecy, posing a concern for prospective students and their preparation strategies.

2. Globalized Campuses of 42 School

There are 50 campuses of 42 School, and the education system originated at Ecole 42, the first campus. As the school becomes increasingly global, cracks will inevitably appear in its operating policies and educational standards. It is unclear whether students at every 42 campus are being evaluated according to the standards of Ecole 42, the original campus. Our team was launched to address this situation using Business Analytics techniques.

😎 Purpose

1. Build a model that can predict the probability that an applicant will be Passed/Failed

2. Identify important factors that affect the final selection results of participants

3. Compare the selection criteria of 42 Seoul and Ecole 42 to identify the differences in educational priority and operating policies

📚 Data Collection

Overall Collecting Process

We completed the data collection process by following these five steps. We created raw data through API calls, crawled additional statistics, merged the results, and deleted unnecessary columns.

1. User Raw Data: API

By calling /v2/campus/:campus_id/users, we collected raw data separately for all users of the 42 Seoul campus (campus_id 29) and the Ecole 42 campus (campus_id 1).

This is what the user raw data looks like.
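The per-campus collection described above can be sketched as a paginated fetch. This is a minimal sketch rather than the project's actual script: the base URL, the bearer-token header, the `page[number]`/`page[size]` query parameters, and the function names are assumptions to check against the 42 API documentation. Only the endpoint path and the campus IDs come from this README.

```python
import json
import urllib.request
from typing import Callable, Dict, List

API_ROOT = "https://api.intra.42.fr"      # assumed base URL
CAMPUS_IDS = {"seoul": 29, "ecole": 1}    # campus IDs stated in this README


def campus_users_url(campus_id: int, page: int, per_page: int = 100) -> str:
    """Build the URL for one page of /v2/campus/:campus_id/users.

    The pagination parameter names are an assumption.
    """
    return (f"{API_ROOT}/v2/campus/{campus_id}/users"
            f"?page[number]={page}&page[size]={per_page}")


def fetch_json(url: str, token: str) -> List[Dict]:
    """GET one page, authenticating with an (assumed) OAuth2 bearer token."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def collect_all_users(fetch_page: Callable[[int], List[Dict]],
                      per_page: int = 100) -> List[Dict]:
    """Call fetch_page(1), fetch_page(2), ... until a page comes back short."""
    users: List[Dict] = []
    page = 1
    while True:
        batch = fetch_page(page)
        users.extend(batch)
        if len(batch) < per_page:
            return users
        page += 1
```

With a valid token, `collect_all_users(lambda p: fetch_json(campus_users_url(CAMPUS_IDS["seoul"], p), token))` would gather every page for the Seoul campus. Passing the page fetcher in as a function keeps the pagination loop easy to test without network access.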

2. Feedback and Evaluation Data: API with Python Code for Processing

By calling /v2/users/:user_id/scale_teams/as_corrector and /v2/users/:user_id/scale_teams/as_corrected, we obtained JSON data listing the evaluation events in which a user took part as the corrector (the evaluator) and as the corrected (the person being evaluated). To add the feedback a user gave to others and the feedback that user received as independent variables, we counted the number of items in each JSON response.
This is a sample of the as_corrected data structure. By counting the items in each response, we determined how many evaluations each user gave (as corrector) and how much feedback they received (as corrected).
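The counting step can be illustrated with a short sketch. It assumes each endpoint returns a JSON array with one element per evaluation and that a single response holds all of a user's items (pagination is ignored here). The function names and the output keys `evaluations_given` and `feedback_received` are hypothetical, not the column names used in the notebooks.

```python
import json
from typing import Dict


def count_items(response_body: str) -> int:
    """Return the number of elements in a JSON array response."""
    items = json.loads(response_body)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of scale_team items")
    return len(items)


def evaluation_counts(as_corrector_body: str,
                      as_corrected_body: str) -> Dict[str, int]:
    """Turn the two scale_teams responses for one user into two features."""
    return {
        # items from .../scale_teams/as_corrector
        "evaluations_given": count_items(as_corrector_body),
        # items from .../scale_teams/as_corrected
        "feedback_received": count_items(as_corrected_body),
    }
```

For example, a user whose as_corrector response holds three items and whose as_corrected response holds one would yield `{"evaluations_given": 3, "feedback_received": 1}`.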

3. Level, Group Assignments, Penalty, Highest C Piscine, Final Exam Score: Crawling with Python Code

In 42 School, each user has a personal page that shows statistical information about that user. We collected this data by crawling each user's page.
Level: overall progress earned through assignments and midterm exams
Group Assignments: the number of optional group assignments
Penalty: the number of times a user was caught cheating; each time, 42 points are deducted from the assignment score
Highest C Piscine: the highest level of assignment completed in the C-language assignments (0 to 13)
Final Exam Score: the score on the final exam

4. Merge all files from the API and crawling

Files created through API calls and crawling include CSV files and a plain text file recording assignments and exam scores. From the plain text file, we extracted the Highest C Piscine, the Final Exam Score, and the Number of Group Assignments. We then summed these scores and divided them by a certain value to derive a level that closely resembles the actual level, and created a CSV file from this data. Next, we performed an inner join on the CSV files using the 'id' and 'login' columns. All dummy data was filtered out based on level and generation. This is what the final data looks like. Since participants who score under 42 on the final exam fail automatically, we found that almost ¼ of the students failed at the final exam.
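The join and the automatic-fail check described above can be sketched without dependencies. This is an illustration, not the code from the notebooks: the column name `final_exam_score` and the sample rows are hypothetical, while the join keys ('id' and 'login') and the threshold of 42 come from the text. With pandas, the same join would typically be written as `pandas.merge(left, right, how="inner", on=["id", "login"])`.

```python
from typing import Dict, List, Sequence

PASSING_EXAM_SCORE = 42  # scores below this fail automatically (from the text)


def inner_join(left: List[Dict], right: List[Dict],
               keys: Sequence[str] = ("id", "login")) -> List[Dict]:
    """Keep only rows whose key columns appear in both tables."""
    index = {tuple(row[k] for k in keys): row for row in right}
    merged = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            merged.append({**row, **match})
    return merged


def auto_fail_share(rows: List[Dict],
                    score_column: str = "final_exam_score") -> float:
    """Fraction of rows whose final exam score is below the threshold."""
    if not rows:
        return 0.0
    failed = sum(1 for row in rows if row[score_column] < PASSING_EXAM_SCORE)
    return failed / len(rows)


# Hypothetical sample rows, for illustration only.
api_rows = [
    {"id": 1, "login": "a", "evaluations_given": 5},
    {"id": 2, "login": "b", "evaluations_given": 2},
    {"id": 3, "login": "c", "evaluations_given": 0},
]
crawled_rows = [
    {"id": 1, "login": "a", "final_exam_score": 80},
    {"id": 2, "login": "b", "final_exam_score": 30},
    {"id": 4, "login": "d", "final_exam_score": 50},
]
merged_rows = inner_join(api_rows, crawled_rows)
```

Here only users 1 and 2 survive the join, because users 3 and 4 each appear in just one source. That is the behavior an inner join is chosen for when incomplete records should be dropped.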

🛣 EDA

1. Pair plot

42Seoul

This data set excludes all data points with a Final Exam score below 42. Even among participants with a Final Exam score of 42 or more, many fail the final selection, and their number is almost equal to the number who pass.

42Ecole

2. Statistics

42 Seoul

42Ecole

3. Box Plot

42 Seoul

42Ecole

4. Correlation Matrix

42 Seoul

42 Ecole

5. Pie Chart for PASS/FAIL Distribution

42 Seoul

42 Ecole

6. Bar Chart for Distribution of Other Variables

42 Seoul

42 Ecole

7. Bubble Chart for Peer Reviews and Result of the Exam

42 Seoul

42 Ecole

8. C Piscine Levels and Scaled Passed Ratio

42 Seoul

42 Ecole

💻 Main Data Mining Process

Split the data into train set and test set
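The README does not state the split ratio or method, so the following is only a sketch of one reasonable approach: a stratified split that keeps the PASS/FAIL proportions the same in both sets. The 80/20 ratio, the seed, and the function name are assumptions. With scikit-learn, `train_test_split(X, y, test_size=0.2, stratify=y)` gives the same kind of split.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[str], test_ratio: float = 0.2,
                     seed: int = 42) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), splitting each label at test_ratio."""
    by_label: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train: List[int] = []
    test: List[int] = []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

Splitting per label matters when one class is much smaller than the other, since a plain random split can leave the test set with very few examples of the minority class.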

Pipelining in 42Seoul

Pipelining in 42Ecole

🚏 Post-Processing

Feature Importance

In 42Seoul

In 42Ecole

ROC Curve

In 42Seoul

In 42Ecole

Confusion Matrix

In 42Seoul

In 42Ecole

Accuracy score

Is high Accuracy better?
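One reason accuracy alone can mislead is class imbalance, which this README reports for the French data. The sketch below uses invented numbers (90 PASS, 10 FAIL), not project results, to show how a model that always predicts PASS scores high on accuracy while never identifying a FAIL. Balanced accuracy, the mean of the per-class recalls, exposes this. The function names are hypothetical.

```python
from typing import Dict, Sequence


def confusion_counts(actual: Sequence[str], predicted: Sequence[str],
                     positive: str = "PASS") -> Dict[str, int]:
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def accuracy(c: Dict[str, int]) -> float:
    """Share of all predictions that are correct."""
    return (c["tp"] + c["tn"]) / sum(c.values())


def balanced_accuracy(c: Dict[str, int]) -> float:
    """Mean of the recall on the positive class and on the negative class."""
    positives = c["tp"] + c["fn"]
    negatives = c["tn"] + c["fp"]
    recall_pos = c["tp"] / positives if positives else 0.0
    recall_neg = c["tn"] / negatives if negatives else 0.0
    return (recall_pos + recall_neg) / 2


# Invented, imbalanced toy data: a baseline that always predicts PASS.
actual = ["PASS"] * 90 + ["FAIL"] * 10
always_pass = ["PASS"] * 100
baseline = confusion_counts(actual, always_pass)
baseline_accuracy = accuracy(baseline)             # 0.9
baseline_balanced = balanced_accuracy(baseline)    # 0.5
```

A 0.9 accuracy here reflects only the class ratio, so the confusion matrix and ROC curve are the more informative views when the classes are uneven.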

Conclusion

The original name of our project was 'what is important'. We thought there must be a reason why 42 Academy emphasizes the learning process and peer learning to La Piscine students over the results of the problems. We therefore believed that this test would contain elements more important than assignment and exam scores. The outcome of the project indicates that peer evaluation matters more than any other factor. Through this project, we were able to show quantitatively that enjoying knowledge through mutual learning and teaching, rather than the amount of individual study, is the core value that 42 Academy emphasizes.

More importantly, by comparing and analyzing data from globally distributed 42 campuses, we were able to recognize differences in the data and prioritize the important factors through modeling. This not only guides participants to learn in the right direction by showing them their probability of passing, but also serves as an important quality assurance tool for the global 42 School management team. Using this model and collecting and analyzing data from more campuses will increase the likelihood of running a good program with consistent quality.

Shortcomings

In the data extraction stage, we could not extract data from other campuses because crawling took too long. We were also unable to extract more detailed data, such as campus popularity polls. Additionally, unlike Korea, France passed a considerably larger number of participants. As a result, despite sampling, the data imbalance made it difficult to find a well-performing model.

👪 Team Information

Jeongmin Oh(jeongmino1207@gmail.com), Github Id: jeongmino
Kangmin Kim (rkdals0203@gmail.com), Github Id: rkdals0203
Aleksandra Kaniewska (@gmail.com), Github Id: alekann009
Eonseon Park (pocva6243@gmail.com), Github Id: eonpark


BusinessAnalyticsTeamProject/DataMining
