Can a codegen agent develop a competitive chess-playing program?
Most people cannot do it, even given a few weeks. A small number of people can, and the resulting programs are small and elegant (e.g. sunfish is just 131 lines of code). I am building this as a litmus test of whether we are approaching a threshold where agents can engage in non-trivial R&D work with a high degree of autonomy, and maybe even come up with some fresh ideas via trial and error.