seawavve / newsTopicClassification Public


A Korean news article topic classification AI built on the National Institute of Korean Language Newspaper Corpus (Ver. 1.1) dataset

img
preprocessing
NewsTopicClassificationwithKoGPT2.ipynb
README.md


newsTopicClassification

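Korean news article topic classification AI
Uses a sampled dataset from the National Institute of Korean Language Newspaper Corpus (Ver. 1.1)
Fine-tunes SKT AI's KoGPT2
Predicts the topic of a news article from its content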

Data

Sampled dataset from the National Institute of Korean Language Newspaper Corpus (Ver. 1.1)

Label
- 0: ITscience
- 1: culture
- 2: economy
- 3: entertainment
- 4: health
- 5: life
- 6: politic
- 7: social
- 8: sport
train
500 news items (sentence + content) per label
test
50 news items (sentence + content) per label
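
The label table and split sizes above can be written down as a small mapping. This is only a sketch: the constant names (`ID2LABEL`, `LABEL2ID`, `TRAIN_PER_LABEL`, `TEST_PER_LABEL`) are illustrative and are not taken from the notebook. The totals follow directly from the per-label counts listed above.

```python
# Label ids and names exactly as listed in the README.
ID2LABEL = {
    0: "ITscience",
    1: "culture",
    2: "economy",
    3: "entertainment",
    4: "health",
    5: "life",
    6: "politic",
    7: "social",
    8: "sport",
}

# Reverse lookup from label name to id (hypothetical helper name).
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

# Per-label sample counts stated in the README.
TRAIN_PER_LABEL = 500
TEST_PER_LABEL = 50

# With 9 balanced labels the totals are 9 * 500 and 9 * 50.
train_size = TRAIN_PER_LABEL * len(ID2LABEL)
test_size = TEST_PER_LABEL * len(ID2LABEL)

print(train_size, test_size)  # 4500 450
```

Because every label has the same number of samples, plain accuracy is a reasonable first metric for this split.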

Model

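Fine-tunes the KoGPT2 model built by SKT

Why this model
- Trained on large-scale Korean data
  Trained on Korean Wikipedia, news, Modu Corpus v1, and Blue House national petitions
- Recent model
  GPT2 release year: 2019
  KoGPT2 release: May 2021
GPT2
A model specialized for natural language processing that achieves high performance with few model parameters
KoGPT2
A GPT2 model provided by SKT, trained on a large-scale Korean dataset
Trained on Korean Wikipedia, news, Modu Corpus V1, and Blue House national petitions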
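
The README does not show the fine-tuning code itself, so the following is only one possible way to set up KoGPT2 for 9-way classification, assuming the Hugging Face `transformers` library and PyTorch. The hub id `skt/kogpt2-base-v2` and the special tokens are the ones given in the SKT KoGPT2 repository; the function names, `max_length`, and the choice of `GPT2ForSequenceClassification` are assumptions, and the notebook may well use a different framework or head.

```python
MODEL_NAME = "skt/kogpt2-base-v2"  # hub id from the SKT KoGPT2 repository
NUM_LABELS = 9                     # ITscience ... sport


def build_classifier(model_name=MODEL_NAME, num_labels=NUM_LABELS):
    """Load the KoGPT2 tokenizer and a GPT-2 model with a classification head."""
    # Imported lazily so this file can be loaded without the heavy dependencies.
    from transformers import GPT2ForSequenceClassification, PreTrainedTokenizerFast

    tokenizer = PreTrainedTokenizerFast.from_pretrained(
        model_name,
        bos_token="</s>",
        eos_token="</s>",
        unk_token="<unk>",
        pad_token="<pad>",
        mask_token="<mask>",
    )
    # The classification head is newly initialized and must be fine-tuned.
    model = GPT2ForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    # The head uses the pad token id to find the last real token in padded batches.
    model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, model


def predict_topic(text, tokenizer, model, max_length=256):
    """Return the predicted label id for one article (max_length is an assumption)."""
    import torch

    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
        return_token_type_ids=False,  # GPT-2 does not need segment ids
    )
    model.eval()
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())
```

Before fine-tuning, `predict_topic` returns essentially arbitrary labels, because only the pretrained language model weights are loaded and the head is random.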

Result

Loss Curve
Accuracy Curve
classification report
confusion matrix

The life, social, and culture labels are confused with one another
Merging these labels or adding more data could improve the results
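
To make the remark about confused labels concrete, here is a minimal sketch of how a confusion matrix is counted and how merging labels changes the picture. The predictions below are made-up toy values, not results from the notebook, and the helper names (`confusion_matrix`, `merge_labels`, `accuracy`) are illustrative. Merging life, social, and culture into a single id is one assumed reading of "merging labels".

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts samples with true label t predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def merge_labels(label_ids, group, target):
    """Map every id in `group` to `target`; leave other ids unchanged."""
    return [target if i in group else i for i in label_ids]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data only: 1 = culture, 5 = life, 7 = social, 8 = sport.
y_true = [1, 1, 5, 5, 7, 7, 8, 8]
y_pred = [5, 1, 7, 5, 1, 7, 8, 8]

cm = confusion_matrix(y_true, y_pred, num_classes=9)
print(cm[1][5])  # culture predicted as life: 1

# Collapse culture, life, and social into one label (id 5 is an arbitrary choice).
group = {1, 5, 7}
merged_true = merge_labels(y_true, group, target=5)
merged_pred = merge_labels(y_pred, group, target=5)

print(accuracy(y_true, y_pred))            # 0.625
print(accuracy(merged_true, merged_pred))  # 1.0
```

Merged accuracy can only stay the same or go up, since errors inside the merged group disappear by construction. The trade-off is a coarser set of topics, so it is worth checking whether the merged category is still useful.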

Feedback

Remove Hanja (Chinese characters) and special characters
Move from the sampled data to the full data
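
The first feedback item could be implemented with a small regular-expression cleaner. This is a sketch under assumptions: the README does not say which characters count as "special", so the whitelist below (Hangul, ASCII letters, digits, whitespace, and `. , ? !`) and the Unicode ranges used for Hanja are illustrative choices, and `clean_text` is a hypothetical name.

```python
import re

# Assumed Hanja ranges: CJK Unified Ideographs, Extension A, Compatibility Ideographs.
HANJA = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]")

# Assumed whitelist; everything outside it is treated as a special character.
NOT_ALLOWED = re.compile(r"[^가-힣ㄱ-ㅎㅏ-ㅣa-zA-Z0-9\s.,?!]")

WHITESPACE = re.compile(r"\s+")


def clean_text(text):
    """Drop Hanja and non-whitelisted characters, then collapse whitespace."""
    # The whitelist alone would also remove Hanja; the explicit step keeps
    # the two feedback items visible as separate operations.
    text = HANJA.sub(" ", text)
    text = NOT_ALLOWED.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


print(clean_text("韓國 경제!! #성장"))  # 경제!! 성장
```

Replacing removed characters with a space rather than an empty string avoids gluing neighboring words together.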

References

SKT KoGPT2 repository
https://github.com/SKT-AI/KoGPT2
GPT2 Visualization
https://jalammar.github.io/illustrated-gpt2/
