My previous side projects were mainly about creating stuff with #genAi. This project is about testing how well Large Language Models (LLMs) can be used to evaluate content.
The "Deutscher Presserat" provides a code of ethics (the "Pressecodex") for German media. The Pressecodex is a lengthy text description on what constitutes good journalism and what doesn't. Ideal for LLMs.
My goal with this project is to have an automated process, for evaluating news articles against the Pressecodex.
I implemented this process with #python, #goose3, #beautifulsoup and #langchain. The process has 3 parts.
- Download & Parse: the script downloads a news article and parses it with goose3. Modern news sites obfuscate the content of their website a lot, to make the usage of adblockers or scripts harder. That's why goose3 is a blessing. For most pages it can bypass all the nonesense.
- LLM Evaluation: The LLM is called. Each Pressecodex is evaluated separately. Sections which cannot be evaluated purely by reading the article text are skipped. Prompt: https://raw.githubusercontent.com/bastiandg/pressekodex/main/prompts/compliance.md
- Results Display: Results are put in a table and displayed in a browser (see screenshot).
The results are impressive. Depending on which #LLM (#gpt4 and #claude3 tested) is used, it can spot sensationalism, discrimination, violations of human dignity and predjudice. My favorite bit of judgement comes from claude3 opus:
The characterization of [Name Redacted] as an "Israel-Hasserin" in the headline, […] are arguably an inappropriate attack on her honor that violates Ziffer 9. […] On balance, I judge it as likely non-compliant with Ziffer 9, but I acknowledge it is a close call based just on the text of that clause.
It seems to have an awareness for the nuances of the language and finds a good way to express its uncertainty appropriately.
Important Note: This is a proof of concept. It has flaws and sometimes the evaluation goes wrong, especially when using weaker models like claude3 haiku.