Our solution for the O'Reilly Architectural Kata (Winter 2025)
We are a team of three senior IT consultants, working on different mandates at IPT in Zurich, Switzerland.
Certifiable, Inc. is an accredited leader in the provisioning of software architect certifications. The current certification process requires candidates to pass two different tests: an aptitude test and an architecture case study. The evaluation of these exams and the maintenance of the exam database rely heavily on manual work by IT experts. This manual approach has become a bottleneck as the demand for certified architects continues to grow. To address this challenge, Certifiable, Inc. needs to modernize its software architecture by incorporating AI, allowing it to scale its certification process while maintaining high quality standards.
We present ARCHIFY, an innovative software component that seamlessly integrates with the existing software system of Certifiable, Inc. without requiring modifications to other components. ARCHIFY speeds up the certification evaluation process by leveraging a comprehensive existing database of 120,000 previously graded certifications. By enriching a Large Language Model (LLM) with this "historical" data, ARCHIFY generates both automated grading suggestions and detailed candidate feedback.
Our design prioritizes responsible AI integration by maintaining human oversight throughout the evaluation process. Rather than surrendering grading decisions entirely to AI, ARCHIFY integrates IT experts as "humans in the loop", ensuring accuracy and accountability in the certification assessment. This balanced approach addresses the bottlenecks that currently hinder scaling while preserving the critical role of human expertise in the evaluation.
We identified the following key objectives:
- Effective and Innovative AI Integration - Deliver a solution that incorporates generative AI in an innovative and practical way, following industry best practices
- Architectural Cohesion and Suitability - Deliver a solution that integrates well with the current architecture. Our main contributions for this objective are:
  - An ADR on how we want to integrate the AI within the existing architecture: ADR-010
- Accuracy and Reliability of AI Outcomes - Deliver a solution that contains mechanisms to maintain the integrity, correctness and trustworthiness of AI-generated results
We propose to integrate AI within two areas of the Certifiable Inc. System:
The manual effort in the current process is the main barrier to scalability. We address this by automating large parts of the grading process for both aptitude short questions and the architectural case study.
Aptitude exam questions will be graded by an AI system. We decided to use a state-of-the-art LLM API, with performance, context length and cost as the main decision drivers. Our solution decides which exams will additionally be reviewed by a human grader, and we propose methods to secure the system against both malicious prompt injections and erroneous LLM outputs. Another important consideration is how we enrich the LLM prompt: we use Retrieval-Augmented Generation (RAG) to enrich LLM prompts with known question-answer tuples from past exams.
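To make the RAG enrichment concrete, the sketch below retrieves the most similar previously graded answers and folds them into the grading prompt as few-shot examples. This is a minimal illustration, not the actual ARCHIFY implementation: `GradedAnswer`, `retrieve_similar`, the grade labels and the prompt wording are all assumptions, and answer embeddings are assumed to be precomputed by whichever embedding model is chosen.

```python
# Minimal RAG sketch: all names and the prompt wording are illustrative
# assumptions, not part of the actual ARCHIFY design.
from dataclasses import dataclass

import numpy as np


@dataclass
class GradedAnswer:
    question: str
    answer: str
    grade: str             # e.g. "correct", "partially correct", "incorrect"
    embedding: np.ndarray  # precomputed embedding of the answer text


def retrieve_similar(candidate_emb: np.ndarray,
                     history: list[GradedAnswer],
                     k: int = 3) -> list[GradedAnswer]:
    """Return the k past answers most similar to the candidate's answer."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    return sorted(history,
                  key=lambda g: cosine(candidate_emb, g.embedding),
                  reverse=True)[:k]


def build_grading_prompt(question: str, answer: str,
                         examples: list[GradedAnswer]) -> str:
    """Enrich the grading prompt with retrieved question-answer tuples."""
    shots = "\n\n".join(
        f"Question: {e.question}\nAnswer: {e.answer}\nGrade: {e.grade}"
        for e in examples)
    return ("You grade aptitude exam answers. Use the graded examples "
            "below as a reference.\n\n"
            f"{shots}\n\n"
            f"Question: {question}\nAnswer: {answer}\nGrade:")
```

The prompt returned by `build_grading_prompt` would then be sent to the chosen LLM API; its response is only a grading suggestion that feeds into the human review described below.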
The architecture exam will be automatically evaluated by an LLM. The prompt to the LLM will include the set of evaluation criteria as well as technical knowledge relevant to the given case study; which knowledge is relevant is determined with the help of a vector database. The result of the automatic grading will always be reviewed by a human.
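As one way to picture the vector-database lookup, the snippet below uses Chroma purely as an example product (our architecture does not prescribe a specific vector database); the collection name and knowledge entries are placeholders.

```python
# Sketch: populating and querying a vector database (Chroma here, as one
# possible choice) to find knowledge relevant to a given case study.
import chromadb

client = chromadb.Client()  # in-memory instance, for illustration only
knowledge = client.create_collection(name="architecture_knowledge")

# Placeholder knowledge-base entries.
knowledge.add(
    ids=["kb-001", "kb-002"],
    documents=[
        "Event-driven architectures decouple producers and consumers ...",
        "CQRS separates read and write models to scale them independently ...",
    ],
)

case_study = "Design a ticketing platform that must absorb large load spikes."
hits = knowledge.query(query_texts=[case_study], n_results=2)

# These documents, together with the evaluation criteria, form the LLM
# prompt; the resulting grade is only a suggestion for the human reviewer.
relevant_docs = hits["documents"][0]
```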
The second-largest challenge Certifiable, Inc. faces is the maintenance of its exam base. With our solution we propose to automate parts of the maintenance process: AI will support the creation and maintenance of architecture test cases, so human experts spend significantly less time on maintenance. New exams are created using existing knowledge bases and previously taken exams (including case study scenarios). As with the automated grading process, we keep the human in the loop: generated questions and case studies will always be reviewed by a human before they are allowed to be used in exams.
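The review gate for generated content could look like the following sketch: a generated question carries an explicit review status and only becomes usable in exams once an expert approves it. All type, field and function names are hypothetical.

```python
# Sketch of the human-in-the-loop review gate; names are illustrative.
from dataclasses import dataclass
from enum import Enum


class ReviewStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class GeneratedQuestion:
    text: str
    source: str = "llm"  # provenance of the draft
    status: ReviewStatus = ReviewStatus.PENDING
    reviewer: str | None = None

    def approve(self, reviewer: str) -> None:
        self.status = ReviewStatus.APPROVED
        self.reviewer = reviewer

    def reject(self, reviewer: str) -> None:
        self.status = ReviewStatus.REJECTED
        self.reviewer = reviewer


def usable_in_exams(question: GeneratedQuestion) -> bool:
    """Only expert-approved questions may appear in exams."""
    return question.status is ReviewStatus.APPROVED
```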
Data needed to automate these use cases will be read directly from the databases of the existing Certifiable, Inc. system. As there is no requirement for (near) real-time processing of data, this will be done by a polling mechanism within the new system components. Any output generated by the new system components will be written directly into the existing databases. This way, the new review processes can be integrated into the existing solutions for grading exams.
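A minimal sketch of this polling integration is shown below, with `sqlite3` standing in for the existing Certifiable, Inc. databases; the table name, column names and poll interval are assumptions.

```python
# Polling sketch: sqlite3 stands in for the existing databases, and the
# schema (aptitude_answers, grading_suggestion, ...) is an assumption.
import sqlite3
import time

POLL_INTERVAL_S = 60  # no (near) real-time requirement, so polling suffices


def grade_with_llm(answer_text: str) -> str:
    """Placeholder for the RAG-based grading sketched earlier."""
    return "suggestion: partially correct"  # dummy value for illustration


def poll_once(conn: sqlite3.Connection) -> None:
    # Fetch submitted answers that have no grading suggestion yet.
    rows = conn.execute(
        "SELECT id, answer_text FROM aptitude_answers "
        "WHERE grading_suggestion IS NULL"
    ).fetchall()
    for answer_id, answer_text in rows:
        suggestion = grade_with_llm(answer_text)
        # Write the suggestion back so the existing review workflow sees it.
        conn.execute(
            "UPDATE aptitude_answers SET grading_suggestion = ? WHERE id = ?",
            (suggestion, answer_id),
        )
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect("certifiable.db")
    while True:
        poll_once(connection)
        time.sleep(POLL_INTERVAL_S)
```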
In a kick-off workshop, we identified context, made assumptions and formulated requirements that are relevant for our architecture contributions. The results of that workshop are available under List of requirements and assumptions.
We identified the following context, assumptions and requirements as most relevant for our contributions:
- The manual grading process is quite time-intensive - 3 hours for aptitude tests and 8 hours for case studies, showing the significant human effort currently required.
- Experts don't just grade - they're also responsible for analyzing reports and updating test questions, indicating a complex role beyond pure evaluation.
- "Reasoning Models" from late 2024 can perform multi-step evaluations and align with human reasoning for grading architecture solutions. This suggests a significant advancement in AI capabilities specifically relevant to architecture evaluation.
- Exam data read from the existing Certifiable, Inc. system is available as clear-text.
- Architectural diagrams are created in a DSL, making them machine-readable and processable by LLMs.
- Grading quality must stay consistent and accurate.
- The one-week turnaround for exam evaluations must be kept when demand increases.
- Maintaining the exam base must stay feasible when demand increases.
ARCHIFY integrates into the existing Certifiable Inc. software system and is visualized through C4 diagrams:
The full context diagram with the description of the Actors and Systems can be found here.
The container diagram describes the high-level interactions of ARCHIFY and Certifiable, Inc. containers.
Aptitude Exam Automated Grading (C2) | Architecture Case Study Grading (C2) | Exam & Question Generation (C2)
The component diagrams contain a more detailed description of the design of the individual automation use cases:
Aptitude Exam Automated Grading (C3) | Architecture Case Study Grading (C3) | Exam & Question Generation (C3)
Our current architecture has the following constraints:
- Even when provided with context from previous exams and a knowledge base, LLMs might not be mature enough to grade an architecture case study end-to-end reliably at human-expert level.
- Due to the non-deterministic nature of LLMs, exam gradings can be inconsistent, even in rather trivial cases (a simple way to quantify this is sketched after this list).
- Generative models tend to overestimate their own grading capabilities.
- The graded exam database and the knowledge base can contain outdated or biased information, which potentially "corrupts" the LLM grading.
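To make the consistency constraint measurable, one could grade the same answer several times and check how often the runs agree. The sketch below is an illustration under assumptions, not part of ARCHIFY; `grade_once` stands for a single, non-deterministic LLM grading call supplied by the caller.

```python
# Sketch: quantifying grading (in)consistency of a non-deterministic grader.
from collections import Counter
from typing import Callable


def grading_agreement(grade_once: Callable[[str], str],
                      answer_text: str,
                      runs: int = 5) -> float:
    """Fraction of runs agreeing with the most common grade (1.0 = stable)."""
    grades = [grade_once(answer_text) for _ in range(runs)]
    most_common_count = Counter(grades).most_common(1)[0][1]
    return most_common_count / runs
```

Agreement noticeably below 1.0 on trivial answers is exactly the inconsistency noted above, and one more reason why every automatic grade is reviewed by a human.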