2026

📜: Paper link 🧑🏻‍💻: Developer blog & Github link 🗞️: News

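🚀 New system

You can now add entries from the website!

📝 How to add an entry

Visit the website: NLP-Paper-News
Click the "Add new entry" or "Bulk add" button
Fill in the GitHub Issue template
After admin approval, the entry is added to README.md automatically

🔄 Automated process

1. Issue created → 2. Admin approval → 3. README.md updated → 4. Automatic parsing → 5. Website updated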

2026

🙇🏻 January

1st week

📜 [UIUC, Stanford, … ] Adaptation of Agentic AI
- Attributes agentic AI's difficulty with complex real-world problems to its inability to adapt
- A systematic framework covering agent adaptations and tool adaptations
  - tool-execution-signaled & agent-output-signaled forms
  - Appears to update each set of weights using offline data
🧑🏻‍💻 [IQuestLab] IQuest-Coder-V1
- Reports SoTA on a coding benchmark (SWE-Bench), surpassing Sonnet 4.5, through a code-flow multi-stage training paradigm
- Dual Specialization Paths: two post-training branches yield a thinking model and an instruct model
- Efficient Architecture: uses a recurrent mechanism to optimize the trade-off between model capability and deployment footprint
- Native 128K context support without additional scaling
📜 [WeChat] Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling
- Points out that the memory module in current multi-step RAG is static, passive storage
- HGMem: a hypergraph-based memory mechanism that extends memory into a dynamic, expressive structure
- Each hyperedge corresponds to a distinct memory unit, enabling progressive formation of higher-order interactions within memory
  - Unlike an ordinary edge, a hyperedge can connect more than two vertices at once
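To make the hyperedge idea concrete, here is a minimal sketch of a memory in which each unit is a hyperedge over any number of entities. The class `HypergraphMemory` and its methods are hypothetical names chosen for illustration; this is not HGMem's actual implementation.

```python
from collections import defaultdict


class HypergraphMemory:
    """Toy memory where each memory unit is a hyperedge over entities (illustrative only)."""

    def __init__(self):
        self.hyperedges = {}                # edge id -> (frozenset of entities, note)
        self.incidence = defaultdict(set)   # entity -> ids of hyperedges containing it

    def add_unit(self, edge_id, entities, note):
        # A hyperedge may link any number of entities, not just two.
        members = frozenset(entities)
        self.hyperedges[edge_id] = (members, note)
        for entity in members:
            self.incidence[entity].add(edge_id)

    def merge_units(self, new_id, edge_ids, note):
        # Form a higher-order unit from existing ones (an assumed form of "progressive formation").
        merged = frozenset().union(*(self.hyperedges[i][0] for i in edge_ids))
        self.add_unit(new_id, merged, note)

    def units_for(self, entity):
        return sorted(self.incidence[entity])
```

For example, adding units over `{Alice, Bob, Paris}` and `{Bob, Carol}` and then merging them yields a third unit that spans all four entities, which a plain pairwise edge could not represent.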
🧑🏻‍💻 [OpenCode AI] OpenCode
- An open-source tool that makes it easy to run agents in the background; it has drawn a lot of attention
- Supports a TUI while remaining visually easy to read
- Claude Code can be used as-is, or other models can be overridden where needed
📜 [NVIDIA, Stanford, UC Berkeley] End-to-End Test-Time Training for Long Context
- Frames long-context language modeling as a continual learning problem rather than an architecture design problem
  - standard architecture: Transformer with sliding-window attention
- At test time, the model compresses the context into its weights through next-token prediction
- At training time, meta-learning improves the model initialization for this test-time learning
📜 [UC San Diego] Professional Software Developers Don't Vibe, They Control: AI Agent Use for Coding in 2025
- A study of software developers with three or more years of experience reports that experienced developers do not vibe code; they control agents through planning and supervision
- Argues that agents are well suited to code generation, debugging, and boilerplate but weak at architectural decisions

2025

🎄 December

1st week

🧑🏻‍💻 [Karpathy] LLM Council
- A framework that gathers multiple LLMs and collects and evaluates each model's answers and outputs
- A submitted query goes through three stages: 1) First Opinions 2) Review 3) Final Response
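The three stages can be sketched as a small pipeline. This is an illustrative outline only: `run_council`, the prompt wording, and the idea of passing models as plain callables are assumptions, not the project's actual code.

```python
from typing import Callable, Dict

Model = Callable[[str], str]  # hypothetical stand-in for an LLM call: prompt in, text out


def run_council(query: str, members: Dict[str, Model], chairman: Model) -> str:
    # Stage 1: every council member answers the query independently.
    opinions = [model(query) for model in members.values()]

    # Stage 2: every member reviews the answers, listed without author names.
    listing = "\n".join(f"Response {i + 1}: {text}" for i, text in enumerate(opinions))
    review_prompt = f"Rank these responses to the question '{query}':\n{listing}"
    reviews = [model(review_prompt) for model in members.values()]

    # Stage 3: a chairman model writes the final response from answers and reviews.
    joined_reviews = "\n".join(reviews)
    final_prompt = (
        f"Question: {query}\n{listing}\nReviews:\n{joined_reviews}\n"
        "Write the final answer."
    )
    return chairman(final_prompt)
```

Swapping the callables for real API clients is left out on purpose, since the sketch only shows the control flow.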
🧑🏻‍💻 [DeepSeek AI] DeepSeek-V3.2: Efficient Reasoning & Agentic AI
- The 685B DeepSeek-V3.2-Speciale model is described as showing reasoning ability on par with Gemini-3.0-Pro
- Three key components
  - DeepSeek Sparse Attention (DSA)
  - Scalable Reinforcement Learning Framework
  - Large-Scale Agentic Task Synthesis Pipeline
- The chat template is described as having changed substantially from the previous version
🧑🏻‍💻 [Microsoft] Fara-7B: An Efficient Agentic Model for Computer Use
- A 7B model described as outperforming more complex systems that rely on multiple models
- Perceives web pages and performs actions such as scrolling, typing, and clicking
- Builds a synthetic data generation pipeline on top of the earlier AgentInstruct work
🧑🏻‍💻 [ByteDance] Vidi2: AI Video Understanding & Creation in Seconds
- Highlights Temporal Retrieval, Spatio-Temporal Grounding, VQA, and Video Editing as strengths
- Reported to surpass GPT-5 and Gemini-3-Pro on the VUE-TR-V2 benchmark
- Long-context video support of roughly 10-30 seconds
📜 [MiroMind] LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
- Points out that LMMs hallucinate on long-form video when evidence is sparse and temporally dispersed
- Uses the LMM's temporal grounding ability as a video cropping tool to zoom in on a specific video clip and resample finer-grained video frames
  - global-to-local reasoning loop
- Releases VideoSIAH to facilitate training and evaluation
  - 247.9K samples for tool-integrated cold-start supervised fine-tuning
📜 [Alibaba] From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
- Provides a synthesis and practical guide to code LLMs
- Covers code pretraining, supervised fine-tuning, RL, scaling laws, framework selection, hyperparameter sensitivity, model architectures, dataset comparisons, and more
🧑🏻‍💻 [HuggingFace] Transformers v5: Simple model definitions powering the AI ecosystem
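- Releases v5, the first major update since transformers v4 shipped in 2020
- AttentionInterface, tokenizer unification, PyTorch as the single backend, and more
- Stronger support for large-scale pre-training and integration with the fine-tuning/post-training ecosystem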
🧑🏻‍💻 [Mistral AI] Introducing Mistral 3
- Releases small dense models (14B, 8B, 3B) and Mistral Large 3 (MoE, 41B active / 675B total) under the Apache 2.0 license
- Described as SoTA among open-source models
- Ranked 2nd on LMArena among non-reasoning models
- Handles text, images, and multilingual inputs
🧑🏻‍💻 [Google] Now available: Create AI agents to automate work with Google Workspace Studio
- A no-code tool for building AI agents for Google products such as Gmail, Docs, and Sheets
- Can connect to Asana, Jira, Mailchimp, Salesforce, and others
🧑🏻‍💻 [OpenAI] How confessions can keep language models honest
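- A study analyzing whether the GPT-5 Thinking model actually follows instructions
- Instructs the model to output a main answer plus a separate ‘confession’, then observes the confession channel
- The confession channel reveals hidden failures even when the main answer is correct
  - Confirmed hallucination, shortcut-taking, and exploitation of flawed reward signals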
📜 [NUS] PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing
- Releases an in-editor system in which LLM agents assist with writing directly in Overleaf (Chrome plugin based)
- Exposes document change history directly and supports managing fine-grained patches

2nd week

📜 [ByteDance] DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
- DAComp: 210 data engineering and data analysis tasks that reflect complex workflows
- Open-ended tasks are evaluated with an LLM judge (meticulously crafted rubrics)
🧑🏻‍💻 [Poetiq] Poetiq Shatters ARC-AGI-2 State of the Art at Half the Cost
- Achieves SoTA on the ARC-AGI-2 benchmark using Gemini-3
  - Higher accuracy than Gemini 3 Deep Think at less than half the cost
- Rather than building a model, develops a meta-system that helps frontier models solve problems better
🧑🏻‍💻 [Alibaba] Qwen3-TTS Update! 49 Timbres + 10 Languages + 9 Dialects
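- Richer Timbres Support: 49 high-quality timbres (voices), covering a range of genders, ages, and regional characteristics
- Enhanced Multilingual & Dialect Capabilities: supports 10 major languages including English, Chinese, German, and Korean
  - Not convinced that Korean, Japanese, etc. sound all that natural
- More Natural & Human-like Prosody/Speech Rates: much more natural speech than the previous version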
📜 [Anthropic] Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs
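- Applying data filtering at the pretraining stage alone is not enough to remove risky capabilities from LLMs
- → Develops Selective GradienT Masking (SGTM), an improvement on the existing Gradient Routing
- Two knowledge-removal experiments
  - (1) Removing one language from a model trained on a bilingual synthetic dataset
  - (2) Removing biology knowledge from a model trained on English Wikipedia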
🧑🏻‍💻 [Google] Titans + MIRAS: Helping AI have long-term memory
- Titans (implementation)
  - Uses an MLP-based long-term memory module to store large amounts of information without loss
  - Adds a surprise metric to detect whether a new input differs substantially from existing information
- MIRAS (theory)
  - lightning-fast linear RNNs - highly complex associative memory module
  - Memory architecture, Attentional bias, Retention gate, Memory algorithm
  - Overcomes the limits of MSE through Huber loss, generalized norms, and a strict probability map
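As a reminder of why Huber loss is less sensitive to outliers than squared error, the standard definitions can be compared directly. This is the textbook formula with an assumed threshold `delta`; how MIRAS plugs it into its memory objective is not shown here.

```python
def squared_error(residual):
    # Half squared error: grows quadratically, so large residuals dominate.
    return 0.5 * residual ** 2


def huber_loss(residual, delta=1.0):
    # Quadratic near zero, linear beyond |residual| > delta.
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * residual ** 2
    return delta * (magnitude - 0.5 * delta)
```

For a residual of 3.0 with `delta=1.0`, the squared error is 4.5 while the Huber loss is 2.5, so a single surprising input moves the objective less.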
🧑🏻‍💻 [Qwen] SAPO: A Stable and Performant Reinforcement Learning Method for Training Large Language Models
- Introduces Soft Adaptive Policy Optimization (SAPO), which replaces the hard clipping of GRPO/GSPO with a smooth, temperature-controlled gating function
- Improves sample efficiency by selectively suppressing only off-policy tokens while preserving sequence-level coherence
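The contrast between hard clipping and a smooth gate can be illustrated with a toy per-token weight on the importance ratio. The sigmoid-shaped gate below is an assumed form chosen for illustration and is not claimed to be SAPO's exact function; the hard gate is also simplified in that it ignores the sign of the advantage.

```python
import math


def hard_clip_gate(ratio, eps=0.2):
    # Simplified view of clipping: the token contributes fully inside the
    # trust region and not at all outside it.
    return 1.0 if (1.0 - eps) <= ratio <= (1.0 + eps) else 0.0


def soft_gate(ratio, temperature=5.0):
    # Smooth alternative (illustrative): weight is 1 when ratio == 1 and decays
    # continuously as the token drifts off-policy; temperature sets how fast.
    s = 1.0 / (1.0 + math.exp(-temperature * (ratio - 1.0)))
    return 4.0 * s * (1.0 - s)
```

With these defaults a token at ratio 1.5 is dropped entirely by the hard gate but only down-weighted by the soft one, which is the qualitative behavior the bullet points describe.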
🧑🏻‍💻 [OpenAI] Introducing GPT-5.2
- Described as meeting a range of enterprise needs, including spreadsheet generation, presentation building, coding, and image recognition
  - Cites GDPval benchmark results in support
- ChatGPT - Instant/Thinking/Pro, API - 5.2/5.2-chat-latest/5.2-pro
🧑🏻‍💻 [Cursor] A visual editor for the Cursor Browser
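- Dragging and dropping UI elements directly in the Cursor Browser lets the model detect the difference and change the code
- Each element's settings (font size, typeface, and so on) can be controlled directly from a side panel
- It is also possible to click an element and write a prompt that targets it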
📜 [Stanford] The Missing Layer of AGI: From Pattern Alchemy to Coordination Physics
- Directly rebuts the claim that LLMs are mere pattern matchers incapable of reasoning or planning
- Formalizes a theory of semantic anchoring (UCCT) that models reasoning as a phase transition
- Argues that AGI needs not bigger models, more data, or more complex architectures, but an executive function that aligns model patterns with goals
📜 [Berkeley, UIUC, Stanford, IBM] Measuring Agents in Production
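- Conducts in-depth case studies with 306 practitioners across 26 domains
  - Evaluation relies on human verification tailored to the working context rather than standardized benchmarks
- Reports that production agents generally use simple, controllable approaches
  - 68% execute at most 10 steps before human intervention; 70% rely on prompting off-the-shelf models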

3rd week

📜 [National Taiwan Univ.] [AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference](https://arxiv.org/abs/2512.11280)
- Points out that existing speculative decoding requires additional training, hyperparameter tuning, or model analysis
- Proposes dynamically adjusting generation length and acceptance rate at inference time
- Decisions are based on token entropy and Jensen-Shannon distance
- Achieves a 49% speedup at the cost of roughly a 2% drop in performance
📜 [Meta] [Exploring MLLM-Diffusion Information Transfer with MetaCanvas](https://arxiv.org/abs/2512.11464)
- Points out that existing multimodal LLMs cannot generate images or videos with precise, structured control
- MetaCanvas: a lightweight framework in which MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface with diffusion generators
🧑🏻‍💻 [NVIDIA] [NVIDIA Nemotron 3 Family of Models](https://research.nvidia.com/labs/nemotron/Nemotron-3/)
- [model](https://huggingface.co/collections/nvidia/nvidia-nemotron-v3), [technical report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf)
- Releases three models with strong agentic capabilities: Nano, Super, and Ultra
- Checkpoints and training data are released as well
- Hybrid MoE, LatentMoE, Multi-Token Prediction, NVFP4, Long Context (1M), Multi-environment Reinforcement Learning Post-training, Granular Reasoning Budget Control at Inference Time
🧑🏻‍💻 [Ai2] [Molmo 2: State-of-the-art video understanding, pointing, and tracking](https://allenai.org/blog/molmo2)
- Molmo 2 (8B, 4B): video grounding and QA models based on Qwen 3
- Exceeds Gemini 3 Pro on some video tracking evaluations
- Molmo 2-O (7B): an Olmo-based model for researchers
- Strong performance despite using about 1/8 the training data of Meta's PerceptionLM
🧑🏻‍💻 [Ai2] [Introducing Bolmo: Byteifying the next generation of language models](https://allenai.org/blog/bolmo)
- Byte-level language models based on Olmo 3
- Keeps the transformer architecture and adds small byte encoders and decoders
- Performs on par with Olmo 3 while scoring highly on character-level benchmarks
- Processes UTF-8 bytes without a fixed vocabulary, using dynamic byte patches
📜 [Google] [DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents](https://storage.googleapis.com/deepmind-media/DeepSearchQA/DeepSearchQA_benchmark_paper.pdf)
- A 900-prompt benchmark across 17 fields that evaluates multi-step information-seeking tasks
- Evaluates three capabilities
  - (1) systematic collation of fragmented information from disparate sources
  - (2) de-duplication & entity resolution to ensure precision
  - (3) reasoning about stopping criteria within an open-ended search space
- The highest fully-correct score, 66.09, was recorded by the Gemini Deep Research Agent
- Dataset and leaderboard are available on [Kaggle](https://www.kaggle.com/benchmarks/google/dsqa/leaderboard)
📜 [NUS, GIT, etc.] [Memory in the Age of AI Agents](https://arxiv.org/abs/2512.13564)
- Agent memory currently lacks even a shared terminology → distinguishes the concept of agent memory from LLM memory
- Analyzes agent memory along forms, functions, and dynamics
- Broadly divides agent memory into token-level, parametric, and latent memory
📜 [Tsinghua] [DEER: Draft with Diffusion, Verify with Autoregressive Models](https://arxiv.org/abs/2512.15176)
- Points out problems with speculative decoding
  - (1) step-wise uncertainty keeps accumulating
  - (2) AR (autoregressive) drafters are inherently sequential decoders
- Argues that dLLMs can solve these problems and proposes the DEER decoding framework
  - drafts with diffusion & verifies with AR models
  - two-stage training pipeline, single-step decoding
🧑🏻‍💻 [Mistral] [Mistral OCR 3](https://mistral.ai/news/mistral-ocr-3)
- Records a 74% win rate over Mistral OCR 2 on forms, scanned documents, complex tables, and handwriting
- Available through [Mistral AI Studio](https://console.mistral.ai/build/document-ai/ocr-playground) or the API
🧑🏻‍💻 [Google] [FunctionGemma: Bringing bespoke function calling to the edge](https://blog.google/technology/developers/functiongemma)
- Releases FunctionGemma, a Gemma 3 270M model trained specifically for function calling
- Built to meet on-device and agent demand
- Highlights unified action & chat, built for customization, engineered for the edge, and broad ecosystem support
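For the AdaSD entry in this week's list, the two signals it names, token entropy and Jensen-Shannon distance, have standard definitions that can be computed from next-token distributions. The sketch below uses natural logarithms; the helper `should_keep_drafting` and its thresholds are hypothetical and only indicate how such signals might gate drafting, not how the paper does it.

```python
import math


def entropy(probs):
    # Shannon entropy (nats) of one next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)


def js_distance(p, q):
    # Jensen-Shannon distance: square root of the JS divergence (natural log).
    mid = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    return math.sqrt(max(0.0, 0.5 * kl(p, mid) + 0.5 * kl(q, mid)))


def should_keep_drafting(draft_probs, target_probs, max_entropy=1.0, max_distance=0.3):
    # Hypothetical rule: keep drafting only while the draft model is confident
    # and its distribution stays close to the target model's.
    return entropy(draft_probs) <= max_entropy and js_distance(draft_probs, target_probs) <= max_distance
```

Both quantities are cheap to compute from probabilities the models already produce, which is consistent with the entry's point that no extra training is required.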

4th week

📜 [KlingAI] Kling-Omni Technical Report
- Releases a generative framework that synthesizes high-fidelity videos directly from multimodal visual language inputs
- Handles video generation, editing, and intelligent reasoning end-to-end
- Accordingly accepts text instructions, reference images, and video context as input
📜 [Google] The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
- Comprehensive benchmarks for evaluating how well LLMs generate factually accurate text
- Aggregates performance across four sub-leaderboards
  - (1) FACTS Multimodal (2) FACTS Parametric (3) FACTS Search (4) FACTS Grounding
  - Each leaderboard has judge models set up to evaluate model responses
📜 [Google, UC Santa Barbara] Budget-Aware Tool-Use Enables Effective Agent Scaling
- Focusing on web search agents, aims to make agents work within tool-call budgets
- Budget Tracker: a plug-in that gives the agent continuous budget awareness
- BATS (Budget Aware Test-time Scaling): uses budget awareness to dynamically decide whether to dig deeper or pivot to new paths
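A budget tracker of this kind can be pictured as a counter whose status is surfaced to the agent at every step. The sketch below is a hypothetical illustration: the class, the `status_line` text, and the `next_move` rule are assumptions, not the paper's plug-in.

```python
class BudgetTracker:
    """Counts tool calls against a fixed budget (illustrative only)."""

    def __init__(self, total_calls):
        if total_calls <= 0:
            raise ValueError("total_calls must be positive")
        self.total = total_calls
        self.used = 0

    @property
    def remaining(self):
        return self.total - self.used

    def spend(self, calls=1):
        if calls > self.remaining:
            raise RuntimeError("tool-call budget exhausted")
        self.used += calls

    def status_line(self):
        # Text that could be appended to the agent's context after each tool call.
        return f"[budget] used {self.used}/{self.total} tool calls, {self.remaining} remaining"


def next_move(tracker, lead_is_promising):
    # Hypothetical decision rule: stop when the budget is gone, otherwise
    # dig deeper on a promising lead or pivot to a new path.
    if tracker.remaining == 0:
        return "answer"
    return "dig_deeper" if lead_is_promising else "pivot"
```

In practice the decision would be made by the model itself after reading the status line; the explicit rule here only makes the dig-deeper versus pivot choice visible.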
📜 [Ant Group] LLaDA2.0: Scaling Up Diffusion Language Models to 100B
- Scales diffusion large language models (dLLMs) up to 100B
- Instead of training from scratch, converts a pre-trained AR model into a dLLM through a 3-phase, block-level WSD-based training scheme
- Post-training alignment (SFT & DPO) yields the MoE models LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B)
🧑🏻‍💻 [Alibaba] Qwen-Image-Layered: Layered Decomposition for Inherent Editability
- A model that decomposes an image into multiple RGB layers
- Each layer can be manipulated without affecting other content, enabling resizing, repositioning, and recoloring
- In other words, it isolates semantic or structural components into distinct layers
🧑🏻‍💻 [OpenAI] Evaluating chain-of-thought monitorability
- A framework for evaluating the CoT monitorability of reasoning models
- Evaluations fall into three types: intervention, process, outcome-property
🧑🏻‍💻 [Anthropic] Introducing Bloom: an open source tool for automated behavioral evaluations
- An agentic framework for generating behavioral evaluations of frontier AI models
- Evaluations that correlate strongly with hand-labeled judgements
- Anthropic also recently released Petri, an open-source framework that automatically explores the behavioral profiles of AI models
🧑🏻‍💻 [Google DeepMind] Gemma Scope 2: helping the AI safety community deepen understanding of complex language model behavior
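- An interpretability toolkit for detecting potential risks in Gemma 3 models ranging from 270M to 27B
- Trained a total of 1T parameters using 110 petabytes of data
- Combines SAEs and transcoders to look inside the model
- Described as applying the Matryoshka training technique and also being trained for chat use cases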
📜 [Southwest Univ.] LIR3AG: A Lightweight Rerank Reasoning Strategy Framework for Retrieval-Augmented Generation
- Explains that reasoning models adopt structured strategies to integrate retrieved and internal knowledge
  - Interpreted as two modes: Context-Grounded Reasoning and Knowledge-Reconciled Reasoning
- LIR3AG: reconstructs retrieved evidence into coherent reasoning chains so that reasoning strategies transfer to non-reasoning models as well
📜 [Tsinghua] TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
- A video generation acceleration framework that speeds up end-to-end diffusion generation by 100-200x while preserving video quality
- (1) Attention acceleration: low-bit SageAttention & trainable Sparse-Linear Attention (SLA)
- (2) Step distillation: rCM
- (3) W8A8 quantization
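W8A8 means both weights and activations are stored as 8-bit integers. A minimal sketch of the idea, assuming symmetric per-tensor scaling (TurboDiffusion's actual quantization scheme may differ), quantizes both operands, accumulates in integers, and rescales once at the end.

```python
def quantize_int8(values):
    # Symmetric per-tensor quantization: map the largest magnitude to 127.
    scale = max(abs(v) for v in values) / 127 or 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def w8a8_dot(weights, activations):
    # Quantize both operands to int8, accumulate in integers, then rescale.
    q_w, scale_w = quantize_int8(weights)
    q_a, scale_a = quantize_int8(activations)
    accumulator = sum(w * a for w, a in zip(q_w, q_a))
    return accumulator * scale_w * scale_a
```

The result is an approximation of the floating-point dot product; the error comes from rounding each value to one of 255 levels.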
📜 [Google] Prompt Repetition Improves Non-Reasoning LLMs
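- A short paper reporting that, for non-reasoning models, simply repeating the input prompt improves performance without increasing the number of generated tokens or latency
- Reports experiments on flagship models such as Gemini, GPT, Claude, and DeepSeek
- Also notes that RL-trained reasoning models tend to repeat the user's request, calls this prompt repetition as well, and describes it as highly efficient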
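The technique itself is a one-line transformation of the input. A minimal sketch, where the function name, the default of two copies, and the blank-line separator are all assumptions:

```python
def repeat_prompt(prompt, times=2, separator="\n\n"):
    # Build the model input by concatenating the same prompt several times.
    if times < 1:
        raise ValueError("times must be at least 1")
    return separator.join([prompt.strip()] * times)
```

The repeated text lengthens only the input that is processed before generation starts, which is why the entry notes no increase in generated tokens.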
🧑🏻‍💻 [MiniMax] MiniMax M2.1: Significantly Enhanced Multi-Language Programming, Built for Real-World Complex Tasks
- Where M2 focused on model cost and accessibility, M2.1 focuses on real-world complex tasks
- Appears to have put particular effort into coding ability (the official post mentions only coding)
  - Multi-Programming Language Capabilities
  - Described as now good at app development as well as web development
📜 [Sapienza Univ.] Epistemological Fault Lines Between Human and Artificial Intelligence
- Argues that LLMs are not epistemic agents but merely stochastic pattern-completion systems
- To support this, compares the process by which an LLM arrives at an answer (a judgment) with the human thought process
📜 [Meta, UIUC, CMU] Toward Training Superintelligent Software Agents through Self-Play SWE-RL
- Points out that the training data and environments for current software agents depend heavily on human knowledge and curation
- Self-play SWE-RL (SSR): provides only access to sandboxed repositories with source code, with no human-labeled issues or tests
- The LLM agent is trained with repeated reinforcement learning to fix software bugs in a self-play setting
📜 [Tencent] Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding
- Humans process long, complex texts on the basis of a holistic semantic representation (a global view), whereas LLMs fall short here
  - Psychology calls this human ability Mindscape-Aware Capability
- Mindscape-Aware RAG (MiA-RAG): gives an LLM-based RAG system explicit global context awareness
- Builds a hierarchical summarization → conditions both retrieval and generation on the global semantic representation
📜 [HKUST, Waterloo, Tsinghua, ICL] Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning
- Argues that phenomena such as ‘aha moments’, ‘length-scaling’, and entropy dynamics are not separate but hallmarks of an emergent reasoning hierarchy
- Two-phase dynamic: low-level skills improve under the constraint of procedural correctness → followed by refinement of high-level strategic planning
- From this view, points out that RL algorithms such as GRPO apply optimization indiscriminately, ignoring token-level learning signals
- Hierarchy-Aware Credit Assignment (HICRA): concentrates optimization effort on high-impact planning tokens
📜 [Oxford] Shared sensitivity to data distribution during learning in humans and transformer networks (Nature Human Behaviour 2025)
- Compares ‘in-context’ and ‘in-weights’ learning in humans and transformer models
- Confirms a commonality: redundancy and diversity trade off against each other for both in-weights and in-context learning
- However, dynamic training schedules affected humans but not the networks
📜 [MIT] Self-Adapting Language Models
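- Points out that LLMs are static and cannot update their weights in response to new knowledge
- SEAL: given a new input, the model generates self-edit data in a form that is easy for it to learn from
- Adapts to the new knowledge by running SFT on the generated self-edits
- Trains the model to produce effective self-edits via RL, using the updated model's downstream performance as the reward signal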

🍁 November

1st week

🧑🏻‍💻 [MiniMax] MiniMax M2 & Agent: Ingenious in Simplicity
- Emphasizes top-tier coding ability, strong agentic performance, and cost-effectiveness & speed
- Model weights are open-sourced on Hugging Face (reportedly ranked 1st among open-source models)
🧑🏻‍💻 [OpenAI] Introducing Aardvark: OpenAI’s agentic security researcher
- Launches Aardvark, an agentic security researcher powered by GPT-5
- Pipeline: commit-level change monitoring → threat model construction → vulnerability detection → sandbox validation → patch proposal with Codex → human review → PR creation
📜 [MoonShot AI] Kimi Linear: An Expressive, Efficient Attention Architecture
- Kimi Linear: a hybrid linear attention architecture described as outperforming full attention in short-context, long-context, and RL scaling regimes
- Kimi Delta Attention (KDA): extends Gated DeltaNet with a finer-grained gating mechanism
- Interleaves it with Multi-Head Latent Attention (MLA) to train a model with 3B activated and 48B total parameters
- A bespoke chunk-wise algorithm uses a variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices and shows strong hardware efficiency
📜 [BAAI] Emu3.5: Native Multimodal Models are World Learners
- A large-scale multimodal world model (open-source) that predicts the next state across vision and language
- Pretrained end-to-end with unified next-token prediction on more than 10T tokens of vision-language interleaved data
  - Post-training and RL for multimodal reasoning and generation
- Proposes Discrete Diffusion Adaptation (DiDA) to improve inference efficiency
  - token-by-token decoding → bidirectional parallel prediction
  - About 20x faster per-image inference with no loss in performance
📜 [Meta] Collaborative Reasoner: Self-Improving Social Agents with Synthetic Conversations (NeurIPS 2025)
- Collaborative Reasoner (Coral): a framework for evaluating and improving the collaborative reasoning abilities of language models
- Tasks and metrics that test abilities such as disagreeing with incorrect solutions and persuading a partner of the correct one
- Explains that current models tend to miss problems they can solve alone because of undesirable social behavior
- Proposes a self-play method that generates synthetic multi-turn preference data to address this
📜 [Alibaba] AgentFold: Long-Horizon Web Agents with Proactive Context Management
- Proactive context management: described as inspired by the human cognitive process of retrospective consolidation
- Treats context as a dynamic cognitive workspace
  - Runs a folding operation at each step, managing the historical trajectory at multiple scales
  - Abstracts the overall flow of the conversation while preserving fine details
🧑🏻‍💻 [Anthropic] Emergent Introspective Awareness in Large Language Models
- Injects representations of known concepts and then measures the model's self-reported states
- Reports that in certain scenarios the model can accurately notice the presence of injected concepts → introspective awareness
📜 [Google DeepMind] Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model
- RedLLM: encoder-decoder vs. DecLLM: decoder-only LLM
  - Pretrained with prefix language modeling (LM) and causal LM, respectively
- Pretrained on RedPajama V1 (1.6T) and instruction-tuned with FLAN
  - Trains models from 150M to 8B
- Explains that RedLLM not only shows strong scaling properties but also benefits more from instruction tuning than DecLLM in some areas
📜 [Google Cloud, UCLA] Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
- Problems
  - Small open-source models rarely return correct solutions even across many attempts, which makes RLVR hard to apply
  - SFT overfits to long demonstrations through rigid token-by-token imitation
- Supervised Reinforcement Learning (SRL): trains the model to generate an internal reasoning monologue before committing to each action
  - Provides a smoother reward based on the similarity between the model's action and the expert action extracted from the dataset
  - Can provide learning signals even when every rollout of the model being trained is wrong
- Explains that applying SRL before RLVR improves overall performance
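A similarity-based reward of the kind described above can be sketched with Python's standard `difflib`. Treating actions as strings and using `SequenceMatcher.ratio` as the similarity measure are assumptions made for illustration; the entry itself only says the reward is based on similarity to the expert action.

```python
from difflib import SequenceMatcher


def step_reward(model_action: str, expert_action: str) -> float:
    # Dense reward in [0, 1]: 1.0 for an exact match, partial credit otherwise.
    return SequenceMatcher(None, model_action, expert_action).ratio()
```

Unlike a pass/fail check on the final answer, this gives a non-zero signal to a rollout that is wrong overall but takes a step close to the expert's.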
🧑🏻‍💻 [Generalist] GEN-0 / Embodied Foundation Models That Scale with Physical Interaction
- Shares results on multimodal model scaling for embodied foundation models
- Harmonic Reasoning: a method for training the model to think and act simultaneously, described as GEN-0's core feature
- Observes that the ossification seen in smaller models eases once model size exceeds 7B
🧑🏻‍💻 [Ai2] Introducing OlmoEarth Platform: Powerful open infrastructure for planetary insights
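- End-to-end open infrastructure that covers Earth observation work with a single foundation
  - Previously, tasks such as crop mapping, deforestation, and land use classification each needed a separate model
- Handles data collection, labeling, training, inference, and deployment in one place
- OlmoEarth: a model family pretrained on more than 10 terabytes of data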
🧑🏻‍💻 [Microsoft] Agent Lightning
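- A framework that optimizes agents with almost no code changes
- Adding agl.emit_xxx() to the agent code or enabling the tracer collects each prompt, tool call, and reward signal as structured events → LightningStore → synchronizes tasks, resources, and traces
- The chosen algorithm reads spans from the store and learns → posts the resulting resources back to the store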
📜 [Tsinghua] Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs (EMNLP 2025 Findings)
- Studies RAG and reasoning from a unified perspective to overcome the limits of each
- Presents three categories
  - Reasoning-Enhanced RAG, RAG-Enhanced Reasoning, Synergized RAG-Reasoning framework
🧑🏻‍💻 [Google] Exploring a space-based, scalable AI infrastructure system design
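- Plans to build data centers in space from solar-powered satellite clusters + Google TPUs + free-space optical links
- Solar panels are said to be up to 8x more productive in space than on the ground
- Expects that by the mid-2030s launch costs could fall far enough for energy cost to approach that of ground-based data centers; the project targets a prototype in early 2027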
🧑🏻‍💻 [Cognition] Windsurf Codemaps: Understand Code, Before You Vibe It
- Argues that vibe coding alone cannot solve hard problems and that understanding the code is essential
- Generates Codemaps to help understand large, complex codebases
- Fast (SWE-1.5) and Smart (Sonnet 4.5) modes are selectable within Windsurf
📜 [Univ. of Milano-Bicocca] Can Role Vectors Affect LLM Behaviour? (EMNLP 2025 Findings)
- Studies using role vectors instead of persona-based prompting
- Builds 29 role vectors from model activations and evaluates benchmark performance across various domains
- (1) activation addition: can role-specific directions strengthen the behavior? (2) directional ablation: can they be removed?
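The two interventions named above have simple vector forms: addition shifts a hidden state along the role direction, and ablation removes its component along that direction. The sketch below uses plain Python lists and hypothetical function names; where in the model the vectors are applied is not specified here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def add_role(hidden, role, alpha=1.0):
    # Activation addition: h' = h + alpha * v
    return [h + alpha * r for h, r in zip(hidden, role)]


def ablate_role(hidden, role):
    # Directional ablation: h' = h - ((h . v) / (v . v)) * v
    norm_sq = dot(role, role)
    if norm_sq == 0:
        return list(hidden)
    coeff = dot(hidden, role) / norm_sq
    return [h - coeff * r for h, r in zip(hidden, role)]
```

After ablation the hidden state is orthogonal to the role vector, so whatever the model reads along that direction is zeroed out.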
🧑🏻‍💻 [Moonshot AI] Introducing Kimi K2 Thinking
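- 32B activated parameters at inference, a 256K context window, and 200-300 consecutive tool calls
- Achieves SoTA on many reasoning and coding benchmarks, surpassing GPT-5 and Sonnet 4.5
  - Inference cost is 10x-20x lower than those models
- Open-sourced under a license that requires crediting Kimi K2 only for products with more than 100M users or more than $20M in monthly revenue
  - That said, little seems to be known about ways to use it other than the API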
📜 [MDGA] Diffusion Language Models are Super Data Learners
- Crossover: when unique data is limited, DLMs end up learning better than AR models given more epochs
  - The crossover appears later with more or higher-quality data and earlier with larger models
  - Observed in both dense and sparse architectures
- Three compounding factors
  - (1) any-order modeling (2) super-dense compute from iterative bidirectional denoising (3) built-in Monte Carlo augmentation
- Presents experiments with models from 1B to 8B
🧑🏻‍💻 [Edison] Kosmos: An AI Scientist for Autonomous Discovery
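- Uses a structured world model to integrate information extracted from hundreds of agent trajectories and carry out a given research objective
- Reported to finish in a day what would take a person six months
- Said to read 1,500 papers and execute 42,000 lines of analysis code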
📜 [Tencent, Tsinghua] Continuous Autoregressive Language Models
- CALM: a paradigm shift from discrete next-token prediction to continuous next-vector prediction
- Uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector
  - Reduces the number of generative steps by a factor of K
- Develops a likelihood-free framework that enables robust training, evaluation, and controllable sampling
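The factor-of-K reduction is simple arithmetic: if each generated vector stands for K tokens, a sequence of N tokens needs about N / K generative steps. A small helper (the name is hypothetical) makes the rounding explicit:

```python
import math


def generative_steps(num_tokens, chunk_size):
    # One step emits one vector covering `chunk_size` tokens; the last chunk may be partial.
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return math.ceil(num_tokens / chunk_size)
```

With `chunk_size=1` this reduces to ordinary next-token prediction, one step per token.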

2nd week

📜 [OpenMOSS] Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
- Points out the limits of the Thinking with Text and Thinking with Images paradigms
- Thinking with Video: uses video generation models such as Sora-2 for visual and textual reasoning within a unified framework
- Develops the Video Thinking Benchmark: (1) vision-centric tasks (2) text-centric tasks
- Explains that self-consistency and in-context learning can improve Sora-2's performance
📜 [GAIR] Context Engineering 2.0: The Context of Context Engineering
- Context engineering: defined as preprocessing high-entropy contexts into low-entropy, machine-understandable representations
- Traces more than 20 years of development: sensor information and early GUI use (1.0) → arrival of GPT-3 (2.0) → human-level with social cues (3.0) → proactive superhuman intelligence (4.0)
📜 [Oxford, Microsoft] VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
- VCode: given an image, the model must generate SVG that preserves its symbolic meaning
- Covers general commonsense, professional disciplines, and visual-centric perception
- CodeVQA: evaluates symbolic fidelity by having a policy model answer questions about the rendered SVG
- Even current frontier VLMs show a gap between language-centric and visual-centric tasks
🧑🏻‍💻 [Google] Introducing Nested Learning: A new ML paradigm for continual learning
- Aims to solve catastrophic forgetting, a long-standing problem in deep learning
- Hope architecture: self-modifying recurrent & context-aware learning, used to present the Nested Learning paradigm
- Key Components: Deep Optimizers, Continuum Memory System (CMS), Self-Modifying Architecture
🧑🏻‍💻 [Skyvern AI] Skyvern
- Automates browser-based workflows using LLMs and computer vision
- AGPL-3.0 license: source disclosure and notice required for network use / commercial use allowed
- Task-Driven autonomous agent design + Playwright (browser automation library)
- Recently there has been strong demand for using web agents like this to crawl training data
📜 [Mila, McGill] Grounding Computer Use Agents on Human Demonstrations
- Building a reliable computer-use agent requires grounding natural language instructions to the correct on-screen elements
- GroundCUA: releases a large-scale desktop grounding dataset built from expert human demonstrations
  - Covers 87 applications in 12 categories, with 3.56M human-verified elements across 56K screenshots
- GroundNext: a model family (3B & 7B) that maps instructions to target UI elements
📜 [Zhejiang Univ.] Last Layer Logits to Logic: Empowering LLMs with Logic-Consistent Structured Knowledge Reasoning
- Logic Drift challenges: LLMs handle structured knowledge reasoning tasks poorly, which the authors attribute to representational differences between unstructured and structured knowledge
- Points out that existing methods rely mainly on complex workflow construction and so fail to address the root problem (inflexible)
- Logits-to-Logic: a framework whose core modules, logits strengthening and logits filtering, correct logical defects in LLM outputs
🧑🏻‍💻 [OpenAI] GPT-5.1: A smarter, more conversational ChatGPT
- GPT-5.1 Instant & Thinking
  - For the Instant model, describes substantial improvements in communication style as well as intelligence
  - Applies adaptive reasoning: fast on easy questions, more time on hard ones
- Preset updates
  - Default, Friendly, and Efficient remain
  - New options added: Professional, Candid, Quirky
- GPT-5.1 Auto routes each request to the appropriate model automatically
🧑🏻‍💻 [Google DeepMind] SIMA 2: An Agent that Plays, Reasons, and Learns With You in Virtual 3D Worlds
- Scalable Instructable Multiworld Agent 2: evolves into an interactive gaming companion using Gemini models
- Described as able to think and reason, not just follow instructions
- Training data mixes human demonstration videos with language labels and Gemini-generated labels
- Understands multimodal information, multiple languages, emoji, and more
📜 [NVIDIA] TiDAR: Think in Diffusion, Talk in Autoregression
- Diffusion: fast parallel generation, AR: quality → combines the strengths of both
- TiDAR: drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively
  - Done within a single forward pass using specially designed structured attention masks
- Reports 4.71x to 5.91x more tokens per second while matching the quality of AR models
  - Presents results for 1.5B and 8B models
📜 [Beijing Jiaotong Univ.] Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI
- from Pipeline-based systems → to Model-native paradigm
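- Capabilities such as planning, tool use, and memory are increasingly handled as internalized abilities of the model rather than by external systems
- Stresses the importance of the LLM + RL + Task combination, going beyond outcome-driven exploration RL
  - Applies across language, vision, and embodied domains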

3rd week

🧑🏻‍💻 [Anthropic] Measuring political bias in Claude
- Proposes a method for evaluating political bias, built on 1,350 paired prompts
- Prompts, grader rubrics, and scripts are all open-sourced
📜 [ByteDance] Depth Anything 3: Recovering the Visual Space from Any Views
- DA3: a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, without requiring camera poses
- Two key insights
  - a single plain transformer (vanilla DINO encoder)
  - a singular depth-ray prediction target
- Achieves Depth Anything 2 (DA2)-level performance through a teacher-student training paradigm
📜 [Beihang Univ.] Fine-Tuned LLMs Know They Don't Know: A Parameter-Efficient Approach to Recovering Honesty
- Argues that fine-tuned LLMs do not lack awareness of their knowledge boundaries; they lack the ability to express it
- Honesty-Critical Neurons Restoration (HCNR): identifies key expression-governing neurons and restores them to their pre-trained state, using Hessian-guided compensation
🧑🏻‍💻 [xAI] Grok 4.1
- Optimizes style and intent through non-verifiable reward signals
- Adjusts dialogue behavior without changing the reasoning architecture
- In reasoning mode, records the top Elo score on EQ-Bench3
🧑🏻‍💻 [Google] A new era of intelligence with Gemini 3
- Releases Gemini 3, which achieves SoTA performance on reasoning
- Understands text, images, video, audio, and code, with a 1M-token context window
- Google Antigravity: an agent-first development platform; only the free tier is open for now
🧑🏻‍💻 [Ai2] DR Tulu: An open, end-to-end training recipe for long-form deep research
- Deep Research Tulu: an open model specialized for long-form deep research tasks
- SFT & Reinforcement Learning with Evolving Rubrics (RLER, online)
- Open-sources the DR Tulu 8B checkpoint, the RLER rubric generation & training framework, dr-agent-lib, and more
📜 [Shanghai AI Lab] P1: Mastering Physics Olympiads with Reinforcement Learning
- Releases P1, a family of open-source reasoning models trained with RL
- P1-235B-A22B achieves gold-medal performance at the International Physics Olympiad (IPhO 2025)
- Also described as performing well on math and coding benchmarks
📜 [Duke] It's LIT! Reliability-Optimized LLMs with Inspectable Tools
- Makes the reasoning process more reliable by forcing the LLM to use external tools (when needed) to solve problems
- LIT (LLMs with Inspectable Tools): uses the LLM's tool-calling ability to choose the most reliable and easy-to-troubleshoot solution
- Builds 1,300 customizable datasets to validate this
  - Includes math, coding, and modeling problems based on the Harvard USPTO Patent Dataset and NeurIPS 2023 papers
🧑🏻‍💻 [Ai2] Olmo 3: Charting a path through the model flow to lead open-source AI
- Olmo 3-Base (7B, 32B), Olmo 3-Think (7B, 32B), Olmo 3-Instruct (7B), Olmo 3-RL Zero (7B)
- The Base models are on par with Qwen 2.5, and post-training is reported to lift them above existing open-source models
- Data, code, model weights, and checkpoints are released under Apache 2.0
🧑🏻‍💻 [topoteretes] Cognee
- An open-source framework that helps manage agent memory in just 6 lines of code
- Memory can be managed by self-hosting or through Cognee Cloud
- Hybrid vector & graph retrieval pipeline
- Provides a CLI and Web UI
🧑🏻‍💻 [Google] Introducing Nano Banana Pro
- The Gemini 3 Pro Image model, built on Gemini 3 Pro
- Idea visualization quality is excellent, notably text rendering (English) and slide layout
  - infographics, slide decks, memes, mockups, storyboards, etc.
🧑🏻‍💻 [OpenAI] A free version of ChatGPT built for teachers
- ChatGPT for teachers is free through June 2027 (U.S. K-12 educators)
- Unlimited messages with GPT-5.1 Auto, plus search, file upload, connectors, and other features
- Supports personalization for teachers while ensuring the data is not used for training
🧑🏻‍💻 [Meta] Introducing Meta Segment Anything Model 3 and Segment Anything Playground
- A unified model supporting detection, segmentation, and tracking
- Releases SAM 3 model checkpoints, evaluation datasets, and fine-tuning code
  - Provides the Segment Anything Playground platform to help users understand the model's characteristics and capabilities
- Also releases the SAM 3D models, code, and data for 3D object and human reconstruction from a single image
📜 [Kandinsky Lab] Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
- SoTA foundation model family capable of high-resolution image & 10-second video synthesis
- 5.0 Image Lite (6B image generation), 5.0 Video Lite (2B text-to-video), 5.0 Video Pro (19B video generation)
- Code and model checkpoints are open-sourced
- Describes its core architecture as a Diffusion Transformer with cross-attention (CrossDiT) for multimodal fusion of visual and textual information

4th week

📜 [OpenAI] Early science acceleration experiments with GPT-5
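- Study applying GPT-5 to research in mathematics, physics, astronomy, computer science, biology, and materials science
- Distinguishes the parts of research where human time can be saved from the parts that still need substantial human effort
- In particular, covers how GPT-5 helped solve previously unsolved problems in mathematics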
📜 [NVIDIA] Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs
- Nemotron Elastic: a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures
  - Embeds multiple nested submodels in a single parent model, each optimized for different configurations & budgets
- Each submodel shares weights with the parent model and is described as supporting zero-shot extraction without additional training
- Preserves Mamba's structural constraints via group-aware SSM elastification, heterogeneous MLP elastification, normalized MSE-based layer importance, and more
🧑🏻‍💻 [Anthropic] Introducing Claude Opus 4.5
- Model introduced as the first AI that codes better than human engineers; SoTA-level performance on coding, agents, and computer use
- Described as robust to prompt injection at an industry-leading level
- 153-page system card 🔗
📜 [Salesforce, Stanford] Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning
- Framework that lets a model improve itself without external data through multi-step co-evolution & seamless tool integration
- Two agents derived from the same model coexist symbiotically
  - curriculum agent & executor agent
  - Equipping the executor agent with external tools pressures the curriculum agent to pose harder, more complex problems
- Reports improved reasoning ability for the Qwen3-8B-Base model
📜 [Peking] General Agentic Memory Via Deep Research
- Proposes general agentic memory (GAM) to address the static memory problem
- Follows the just-in-time (JIT) compilation principle
  - Builds optimized context at runtime while keeping only simple but useful memory during the offline stage
- duo-design
  - Memorizer: keeps complete historical information in a universal page-store while highlighting key historical information
  - Researcher: retrieves & integrates the needed information from the page-store
🧑🏻‍💻 [Tencent] HunyuanOCR
- OCR expert VLM built on a multimodal architecture
- Achieves SoTA on various benchmarks with 1B parameters
- Covers tasks such as complex multilingual document parsing, text spotting, and open-field information extraction
- Claims to handle more than 100 languages
🧑🏻‍💻 [Andrew Ng] Stanford Agentic Reviewer
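- Agentic system that analyzes a paper PDF and gives fast, specific feedback grounded in recent related work (arXiv)
- Convert PDF → MD, then check the title and whether it is an academic document → generate search queries from the paper and search arXiv → summarize the top papers → combine the paper MD + related-work summaries to generate a templated review
- In tests on ICLR 2025 data, it scores higher than the human-human Spearman correlation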
📜 [UCL] Memento: Fine-tuning LLM Agents without Fine-tuning LLMs
- Method that applies low-cost continual adaptation via memory-based online RL
- Memory-augmented Markov Decision Process (M-MDP) with neural case-selection policy
- The policy is updated from environmental feedback through a memory rewriting mechanism
- Policy improvement happens through memory reading (retrieval)
🧑🏻‍💻 [DeepSeek AI] DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
- Points out that final-answer accuracy does not guarantee correct reasoning
- A two-part training system generates, checks, and revises the model's full proofs
- generator with a dedicated verifier
  - The verifier scores each step
  - The generator rewrites its proofs until the verifier accepts them
- Reports strong results obtained with a 685B model
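The generate, check, and rewrite loop described above can be sketched roughly as follows. This is only an illustrative sketch, not DeepSeek's training system: `refine_until_accepted`, the callables `generate` and `verify`, `accept_threshold`, and `max_rounds` are all hypothetical names, and the toy functions stand in for real models.

```python
from typing import Callable, List, Optional, Tuple


def refine_until_accepted(
    problem: str,
    generate: Callable[[str, Optional[List[float]]], str],  # hypothetical generator
    verify: Callable[[str, str], List[float]],               # hypothetical verifier
    accept_threshold: float = 0.9,  # assumed acceptance rule, not from the source
    max_rounds: int = 4,
) -> Tuple[str, bool]:
    """Regenerate a proof until every step score clears the threshold."""
    feedback: Optional[List[float]] = None
    proof = ""
    for _ in range(max_rounds):
        proof = generate(problem, feedback)
        step_scores = verify(problem, proof)  # one score per proof step
        if step_scores and min(step_scores) >= accept_threshold:
            return proof, True
        feedback = step_scores  # pass the step scores back to the generator
    return proof, False


# Toy stand-ins so the sketch runs without any model.
def toy_generate(problem: str, feedback: Optional[List[float]]) -> str:
    return "draft proof" if feedback is None else "revised proof"


def toy_verify(problem: str, proof: str) -> List[float]:
    return [0.95, 0.40] if proof == "draft proof" else [0.95, 0.97]


result = refine_until_accepted("toy problem", toy_generate, toy_verify)
```

With the toy functions, the first draft is rejected because one step scores below the threshold, and the revised proof is accepted on the second round.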
📜 [Qwen, Edinburgh, Stanford, MIT] Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free (NeurIPS 2025 Best Paper)
- Points out that the specific effects of gating have still not been studied properly
- Study of gating-augmented softmax attention variants
  - Investigates 30 variants of 15B MoE models and 1.6B dense models (trained on 3.5T tokens)
- Reports that applying a head-specific sigmoid gate after Scaled Dot-Product Attention (SDPA) reliably improves model performance
- Two key factors
  - Adds non-linearity between the low-rank mappings in softmax attention
  - Applies query-dependent sparse gating scores to modulate the SDPA output
- Reports that the sparse gating mechanism mitigates issues such as massive activation and attention sink
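As a rough illustration of "a sigmoid gate applied after SDPA", the sketch below computes causal attention for a single head and multiplies its output elementwise by a gate. It assumes the gate is a linear map of the current token's hidden state followed by a sigmoid; `gated_head`, `W_gate`, and the toy inputs are hypothetical and not taken from the paper's code.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_head(queries: List[Vector], keys: List[Vector], values: List[Vector],
               hidden: List[Vector], W_gate: List[Vector]) -> List[Vector]:
    """Causal SDPA for one head, followed by an elementwise sigmoid gate.

    W_gate has shape (d_head, d_model); the gate for token t depends only on
    hidden[t], so it is specific to the current query token.
    """
    d_head = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Scaled dot-product scores against positions 0..t (causal mask).
        scores = [sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(d_head)
                  for s in range(t + 1)]
        probs = softmax(scores)
        attn = [sum(p * values[s][j] for s, p in enumerate(probs))
                for j in range(d_head)]
        gate = [sigmoid(sum(w * h for w, h in zip(row, hidden[t])))
                for row in W_gate]
        outputs.append([g * a for g, a in zip(gate, attn)])
    return outputs


# Toy inputs: with an all-zero W_gate every gate value is sigmoid(0) = 0.5.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
hidden = [[0.5, -0.5, 1.0], [0.2, 0.1, -0.3]]
W_gate = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
out = gated_head(queries, keys, values, hidden, W_gate)
```

A gate value near zero suppresses that head's contribution for the token, which is the sense in which the gating scores can be sparse.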

🎃 October

1st week

📜 [NVIDIA, MIT, HKUST] LongLive: Real-time Interactive Long Video Generation
- LongLive: frame-level AR framework for realtime & interactive long video generation
- KV-recache mechanism: refreshes cached states with new prompts
- short window attention paired with a frame-level attention sink
🧑🏻‍💻 [Anthropic] Introducing Claude Sonnet 4.5
- Model that achieves new SoTA performance on a range of coding benchmarks
- Described as able to handle coding tasks that require more than 30 hours of work
🧑🏻‍💻 [Microsoft] Vibe working: Introducing Agent Mode and Office Agent in Microsoft 365 Copilot
- Agent Mode in Office apps (Excel, Word) & Office Agent in Copilot chat
- SoTA on SpreadsheetBench
🧑🏻‍💻 [Ai2] Asta DataVoyager: Data-driven discovery and analysis
- Ecosystem of scientific research agents for researchers who work with structured data
- Returns explainable answers from structured data such as spreadsheets and CSV files (with copyable code, visuals, etc.)
- Data is managed on-premise or in a private cloud (security)
🧑🏻‍💻 [OpenAI] Sora 2 is here
- Video generation model drawing attention for its strong performance
- Features include physics awareness, synchronized audio, and controllability
- 5-10s output, watermarked
- invite-only launch in U.S. & Canada
📜 [NUS] MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
- Benchmark of 127 high-quality MCP tasks created by domain experts & AI agents
- Requires richer & more diverse interactions, including CRUD operations
- gpt-5-medium currently performs best, with 52.56% pass@1 and 33.86% pass^4
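To make the gap between the two numbers concrete, here is a small sketch of how such metrics are commonly computed, assuming pass@1 is the per-run success rate averaged over tasks and pass^4 counts a task only if all four runs succeed. The function names and the toy results are illustrative, not MCPMark's own code or data.

```python
from typing import Dict, List


def pass_at_1(runs: Dict[str, List[bool]]) -> float:
    """Average single-run success rate across tasks."""
    per_task = [sum(outcomes) / len(outcomes) for outcomes in runs.values()]
    return sum(per_task) / len(per_task)


def pass_hat_k(runs: Dict[str, List[bool]]) -> float:
    """Fraction of tasks solved in every one of the k runs."""
    return sum(all(outcomes) for outcomes in runs.values()) / len(runs)


# Hypothetical results: three tasks, four independent runs each.
toy_runs = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, True],
    "task_c": [False, False, False, False],
}
p1 = pass_at_1(toy_runs)
p4 = pass_hat_k(toy_runs)
```

Under these assumptions pass^4 can never exceed pass@1, so a large gap between them indicates that success on a task is not consistent from run to run.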
🧑🏻‍💻 [Thinking Machines] Announcing Tinker
- Releases a managed API for fine-tuning LLMs as its first product (Mira Murati)
- Supports training models from the Llama-3.x to Qwen3 series; intermediate checkpoints can also be downloaded
🧑🏻‍💻 [Google] AI as a research partner: Advancing theoretical computer science with AlphaEvolve
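- LLM-based coding agent that can solve optimization problems
- The envisioned workflow uses LLMs to summarize existing literature, plan research around new theories, and work toward proofs; AlphaEvolve is described as especially useful for obtaining the proofs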
📜 [NUS, Oxford, Stanford] GEM: A Gym for Agentic LLMs
- General Experience Maker (GEM): open-source environment simulator
- Supports what OpenAI-Gym offered: asynchronous vectorized execution for high throughput & flexible wrappers for easy extensibility
- Additionally provides robust integrated tools & single-file example scripts with five popular RL training frameworks
📜 [Imperial College London] Fine-tuning with RAG for Improving LLM Learning of New Skills
- Converts inference-time retrieval into learned competence through distillation
- (1) Extracts compact & reusable hints from agent failures
- (2) Uses these hints for one-shot retrieval at episode start to generate improved teacher trajectories
- (3) Trains the student model with the hint strings removed, encouraging internalization rather than memorization
- Achieves strong performance on ALFWorld (household tasks) and WebShop (online shopping tasks)
📜 [Meta, Johns Hopkins] The Era of Real-World Human Interaction: RL from User Conversations
- Reinforcement Learning from Human Interaction (RLHI): proposes a paradigm for learning from in-the-wild user conversations
- RLHI with User-Guided Rewrites: revises unsatisfactory model outputs based on the user's natural-language follow-up responses
- RLHI with User-Based Rewards: learns through a reward model conditioned on the user's long-term interaction history
- Models trained on WildChat data with both methods outperform baselines on personalization & instruction following
🧑🏻‍💻 [DeepSeek AI] DeepSeek-V3.2-Exp
- Experimental version of the next-generation model, adding DeepSeek Sparse Attention (DSA) to V3.1-Terminus
  - This sparse attention is designed for long-context scenarios
- A demo is available through HuggingFace inference

2nd week

🧑🏻‍💻 [OpenAI] OpenAI DevDay 2025
- Apps in ChatGPT & Apps SDK preview: extends ChatGPT into a conversational OS by letting apps run inside ChatGPT
- AgentKit: Agent Builder, ChatKit, Evals (supports evaluating third-party models), RFT, Guardrails, etc.
- Models & API update: GPT-5 Pro (API), Sora 2 (API), gpt-realtime-mini, gpt-image-1-mini
- Codex general availability: Slack integration, Codex SDK, admin features
📜 [Maryland] Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems (EMNLP 2025)
- Proposes selecting the best response from multiple different LLMs using a calibrated log-likelihood score
- More precisely, it leverages the models' internal knowledge & confidence
📜 [Anthropic, Oxford] Eliciting Secret Knowledge from Language Models
- elicitation: research that aims to draw out knowledge an AI possesses but does not verbalize
- Studies both black-box & white-box approaches across 3 model families
- The best performer was prefill attacks, a black-box approach: the LLM reveals the secret while completing a predefined prefix
📜 [Oxford, Apple] The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining
- The most common method for filtering web-crawled datasets is Classifier-based Quality Filtering (CQF)
  - Assigns a quality score to each document
- Points out that CQF improves downstream task performance but does not necessarily lead to better modeling of the high-quality dataset
  - This is because CQF sometimes filters out high-quality data as well
- Compares models trained with CQF vs. models trained on random token permutations
🧑🏻‍💻 [Google] Meet Jules Tools: A Command Line Companion for Google’s Async Coding Agent
- Can be embedded as a command within existing automation
- Context awareness that remembers past changes and the developer's preferences
- Supports a dashboard-style task view in the terminal
📜 [CMU] LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization
- Study on predicting the correctness of model outputs from model activations alone
- Explores whether an internal signal exists that indicates if retrieved context is needed for the model's answer
  - Comparison experiments with correct, incorrect, and irrelevant context
- A simple classifier trained on intermediate-layer activations, analyzing only the first token's activation, reaches 75% accuracy
📜 [Meta, NYU] A Single Character can Make or Break Your LLM Evals
- Points out that there is still little research on how in-context examples should be formatted
  - comma? new line? semicolon? …
- Experiments with the Llama, Qwen, and Gemma model families show that the choice of delimiter shifts MMLU performance by up to ±23%
- The effect holds across topics and model families and does not improve with scale
- Analysis of attention head scores confirms that good-performing delimiters help the model attend to key input tokens
- Also explores methods to strengthen LLM robustness to the choice of delimiter
🧑🏻‍💻 [Google] Introducing the Gemini 2.5 Computer Use model
- Released via the Gemini API for developers building agents that can interact with user interfaces
  - Specialized on top of Gemini 2.5 Pro's visual understanding & reasoning capabilities
- Outperforms other models on web & mobile control benchmarks with lower latency
- Accessible via Google AI Studio & Vertex AI
🧑🏻‍💻 [Google] Speech-to-Retrieval (S2R): A new approach to voice search
- Uses voice directly for retrieval without converting it to text, enabling faster search
- Open-sources the Simple Voice Questions (SVQ) dataset: short audio questions collected across 17 languages and 27 locales, used to evaluate S2R
📜 [Samsung] Less is More: Recursive Reasoning with Tiny Networks
- Hierarchical Reasoning Model (HRM): a method using 2 small neural networks, known to solve complex problems well despite its small size
  - 27M parameters trained on small data (~1000 examples)
- Tiny Recursive Model (TRM): a simpler recursive reasoning approach, described as generalizing better than HRM
  - Only 2 layers, 7M parameters
🧑🏻‍💻 [Figure] Introducing Figure 03
- Humanoid designed for homes, factories, and world scale
- Each fingertip carries a high-fidelity tactile sensor that enables real-time perception & reasoning
📜 [Tsinghua] Cache-to-Cache: Direct Semantic Communication Between Large Language Models
- Enriching KV-Cache semantics can improve response quality without increasing cache size
- From this, argues that the KV-Cache is an effective medium for inter-model communication
- Cache-to-Cache (C2C): a new paradigm for direct semantic communication between LLMs
- Uses a neural network to project the source model's KV-cache and fuse it with that of the target model
📜 [Meta] Agent Learning via Early Experience
- Points out that agent training lacks both verifiable rewards and long-horizon rollouts
  - Agents are currently fine-tuned on expert data, which prevents scaling up
- early experience: interaction data generated by the agent's own actions, where future states serve as supervision without reward signals
  - → Implicit world modeling, Self-reflection

3rd week

🧑🏻‍💻 [Anthropic] A small number of samples can poison LLMs of any size
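- Explains that 250 malicious documents are enough to create a backdoor vulnerability, regardless of model size or training data volume
- Conventional wisdom held that, since larger models train on proportionally more data, an attacker must control a proportional share of the training data; the claim here is that a "fixed" number of documents suffices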
📜 [Stanford] Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
- ACE: a framework that treats contexts as evolving playbooks
- Experiments on agent and domain-specific benchmarks show that ACE optimizes context well both offline & online
📜 [KAIST] KORMo: Korean Open Reasoning Model for Everyone
- The first Korean model released as fully open source (bilingual LLM, 10.8B)
- (1) Pre-training on synthetic data is possible without model collapse
  - synthetic data-driven fully open models (FOMs)
- (2) Bilingual instruction tuning can achieve near-native reasoning & coherence
🧑🏻‍💻 [Andrej Karpathy] Nanochat
- Full-stack implementation designed to run on an 8xH100 node
- Running training and inference costs about $100
🧑🏻‍💻 [MS] Introducing MAI-Image-1, debuting in the top 10 on LMArena
- Microsoft's first image generation model developed fully in-house
📜 [Princeton] Skill-Targeted Adaptive Training
- STAT: proposes a fine-tuning strategy that uses the teacher model's metacognition ability
- The teacher uses the task dataset to build a list of skills and labels each data point with the skills it requires
- Monitors the student's answers to build a Missing-Skill-Profile
  - STAT-Sel: adaptively reweights training examples accordingly
  - STAT-Syn: synthesizes additional examples that involve the missing skills
📜 [NYU] Diffusion Transformers with Representation Autoencoders
- Replaces the VAE used in DiT with pretrained representation encoders paired with decoders
  - Provides high-quality reconstructions & semantically rich latent spaces
🧑🏻‍💻 [Alibaba] Qwen3-VL
- Compact dense vision-language models at 4B and 8B (Instruct & Thinking)
- Supports FP8 deployment
- Outperforms Gemini 2.5 Flash-Lite & GPT-5 Nano on some benchmarks
📜 [Shanghai Jiao Tong] AI for Service: Proactive Assistance with AI Glasses
- AI4Service: a paradigm that enables proactive & real-time assistance in daily life
- Alpha-Service: addresses two challenges (using AI Glasses)
  - Know When to intervene by detecting service opportunities
  - Know How to provide both generalized & personalized services
- 5 key components
  - Input Unit, CPU, Arithmetic Logic Unit, Memory Unit, Output Unit
🧑🏻‍💻 [Anthropic] Introducing Claude Haiku 4.5
- Releases a new small model with strong coding ability
- Follows an architecture similar to the Sonnet model but focuses on optimizing speed & cost efficiency
🧑🏻‍💻 [Alibaba] Meet Your AI Memory
- Qwen Chat also aims to improve personal experience based on user context & history
📜 [Meta] The Art of Scaling Reinforcement Learning Compute for LLMs
- Study on RL scaling for LLMs
- → Proposes ScaleRL: demonstrated to be a best-practice recipe that scales up to 100,000 GPU hours
📜 [Stanford] Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity
- Identifies typicality bias in preference data as the main cause of mode collapse
- Verbalized Sampling (VS): a training-free prompting strategy that avoids mode collapse
- Simply having the model verbalize a probability distribution over responses greatly increases answer diversity on tasks such as creative writing, dialogue simulation, and open-ended QA (without reducing factual accuracy)

4th week

📜 [Together, Stanford] ReasonIF: Large Reasoning Models Fail to Follow Instructions During Reasoning
- Points out that Large Reasoning Models (LRMs) should also follow user instructions in their reasoning process
- ReasonIF: introduces a benchmark that evaluates reasoning instruction following
  - Divided into 6 categories such as multilingual reasoning and formatting
- Existing open-source LRMs score at most 0.25
- Emphasizes multi-turn reasoning & Reasoning Instruction Finetuning (RIF) using synthetic data
📜 [Nanjing, ETH] A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning
- Framework that analyzes sampling-based test-time scaling methods from a confidence estimation perspective
- Points out the limitations: self-consistency suffers from high estimation error, perplexity from modeling error
- Proposes RPC to address these: a hybrid method using Perplexity Consistency & Reasoning Pruning
📜 [PaddlePaddle] PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
- NaViT-style dynamic visual encoder & ERNIE-4.5-0.3B language model
- Supports 109 languages and recognizes various elements (text, tables, formulas, charts, etc.)
- SoTA on page-level parsing & element-level recognition
🧑🏻‍💻 [Google] Grounding with Google Maps: Now available in the Gemini API
- Gemini reasons over real-world data on more than 250M places
- $25 / 1,000 location-enhanced prompts
🧑🏻‍💻 [HuggingFace] HuggingChat
- Applies GPT-5-style model routing to pick a suitable model among hundreds of open-source models and generate the answer
🧑🏻‍💻 [Anthropic] Claude Code on the web
- Can run coding tasks in parallel on repos connected through GitHub
- Like Codex, it handles tasks on the web without terminal access
🧑🏻‍💻 [OpenAI] Introducing ChatGPT Atlas
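- Launches a browser with ChatGPT built in
- 7-day promotion from first use that allows more calls; currently macOS only
- The new-tab page looks like a search box but is the ChatGPT main screen, so conversation history is also visible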
📜 [Spike Studio] Automatic Prompt Generation via Adaptive Selection of Prompting Techniques
- Selects a task-appropriate prompting technique from the user's abstract task description and generates high-quality prompts
- Constructs a knowledge base from the semantic similarity across diverse tasks
- When the user enters a task description, the system assigns it to the most relevant task cluster
🧑🏻‍💻 [Google] Google AI Studio
- Launches an AI Studio experience that supports vibe coding from prompts
📜 [Zhejiang, NUS] LightMem: Lightweight and Efficient Memory-Augmented Generation
- LightMem: organizes memory into 3 complementary stages
  - (1) Cognition-inspired sensory memory filters out irrelevant data through lightweight compression & groups it by topic
  - (2) Topic-aware short-term memory consolidates these topic-based groups
  - (3) Long-term memory makes use of this information
📜 [JHU, PKU, Princeton, MIT, Harvard] World-in-World: World Models in a Closed-Loop World
- Points out that existing generative world model (WM) benchmarks adopt open-loop protocols, emphasizing visual quality while paying little attention to whether agents succeed at embodied tasks
- World-in-World: an open platform that benchmarks WMs in a closed loop mirroring real agent-environment interaction
  - Curates 4 closed-loop environments for evaluating diverse WMs
- Also proposes a data scaling law for WMs in embodied settings
📜 [HKUST, NYU] DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference
- Study on minimizing overthinking in reasoning LLMs
- Computes the entropy of token probabilities over reasoning traces → finds a U-shaped entropy pattern
  - Entropy is high even on easy problems (despite correct answers)
- DiffAdapt: a framework that chooses an Easy/Normal/Hard inference strategy based on each question's difficulty and reasoning trace entropy
  - Each strategy has a fixed prompt, temperature, and maximum token length
📜 [Tsinghua, GIT] AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
- Points out that the Knowledge Distillation (KD) commonly used in Speculative Decoding (SD) fails to achieve SD's true goal of maximizing the token acceptance rate
- AdaSPEC: a method that integrates selective token filtering into the KD process
  - Uses a reference model to filter out difficult-to-fit tokens → better alignment on simpler tokens
📜 [DeepSeek AI] DeepSeek-OCR: Contexts Optical Compression
- Presents a technique for compressing long contexts through optical 2D mapping
- DeepEncoder & DeepSeek3B-MoE-A570M decoder
- When the number of text tokens stays within 10x the number of vision tokens (compression ratio below 10x), OCR accuracy is around 97%

5th week

📜 [Sheffield] Can Confidence Estimates Decide When Chain-of-thought is Necessary for Llms?
- Study of training-free confidence estimation methods for CoT gating
- After comparing 4 methods, concludes that it is hard to say any particular method is unconditionally better on a particular dataset
📜 [Meta, Berkeley] Continual Learning via Sparse Memory Finetuning
- Presents sparse parameter updates as a way to acquire new knowledge without catastrophic forgetting → sparse memory finetuning
- Uses only the memory slots whose activation is higher on the new data than on the data used for pretraining
🧑🏻‍💻 open-notebook
- Open-source, privacy-focused alternative to Google's NotebookLM
- Lets you choose from more than 16 models
- Easy to install with Docker
🧑🏻‍💻 [Anthropic] Claude for Excel
- Beta release (research preview) of an LLM that reads Excel sheets and does Q&A with the user
🧑🏻‍💻 [Mistral AI] Introducing Mistral AI Studio.
- Model development platform for enterprises
- Core features: Built-in evaluation, Traceable feedback loops, Provenance and versioning, Governance, Flexible deployment
🧑🏻‍💻 [Google] Our Quantum Echoes algorithm is a big step toward real-world applications for quantum computing
- The newly developed Quantum Echoes algorithm achieves the first verifiable quantum advantage
- 13,000x faster than top-tier supercomputers
- How it works: send a signal into the quantum system → perturb one qubit → measure the echo using reverse evolution
🧑🏻‍💻 [Ai2] olmocr
- Open-source OCR that converts PDF, PNG, and JPEG documents to MD
📜 [Shanghai AI, Nanjing, CMU] JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence
- Progress on code models that also use visual information is bottlenecked by the difficulty of obtaining high-quality multimodal code data
- Proposes a toolkit that generates a large-scale, high-quality corpus ranging from standard charts to complex interactive web UIs
- Builds JanusCode-800K with this toolkit

🙇🏻 September

1st week

📜 [Harvard University, Cambridge] Lexical Hints of Accuracy in LLM Reasoning Chains
- Three feature classes
  - (1) CoT length (2) intra-CoT sentiment volatility (3) lexicographic hints
- Tests DeepSeek-R1 & Claude 3.7 Sonnet on Humanity's Last Exam (HLE) and Omni-MATH
- Words such as guess, stuck, and hard turn out to be strong indicators of uncertainty, while sentiment is usable only as an auxiliary signal
🧑🏻‍💻 [Ai2] Asta: Accelerating science through trustworthy agentic AI
- Asta agents: tools that assist human researchers rather than replace them
- AstaBench: for broadening the horizons of scientific AI and improving transparency
- Asta resources: a set of software components for building, testing, and refining scientific AI agents
🧑🏻‍💻 [Microsoft] MAI-Voice-1, MAI-1-preview
- Builds in-house models, including a speech generation model, to reduce dependence on OpenAI systems
- MAI-Voice-1
  - Runs on a single GPU and generates a minute of audio in under a second
  - Supports expressive, natural speech in single- and multi-speaker scenarios
- MAI-1-preview
  - MoE text model pre- and post-trained on roughly 15,000 H100 GPUs
  - Stated focus on instruction following & everyday query responses
🧑🏻‍💻 [Apple] FastVLM: Efficient Vision Encoding for Vision Language Models (CVPR 2025)
- A hybrid-architecture visual encoder designed for high-resolution images enables accurate, fast, and efficient visual query processing
- Inference code, model checkpoints, and an iOS/macOS demo are available at the GitHub link
- Hugging Face demo link
🧑🏻‍💻 [Google] Stop “vibe testing” your LLMs. It's time for real evals.
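- Upload CSV data, select an autorater (customizable), run the evaluation, review the analysis dashboard, iterate
- A single evaluation shows the performance of various combinations
- The complete toolkit for AI evaluation
- Currently available only in the U.S.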
🧑🏻‍💻 [Tencent] Hunyuan-MT
- Translation model Hunyuan-MT-7B and ensemble model Hunyuan-MT-Chimera
- Covers 33 languages, including 5 Chinese ethnic minority languages
- pretrain → CPT → SFT → translation RL → ensemble RL (see the technical report)
🧑🏻‍💻 [Google] Welcome EmbeddingGemma, Google's new efficient embedding model
- Hugging Face blog post about Google's new embedding model
- 308M parameters & 2K context window, supports 100+ languages
- Uses Gemma3 as the backbone, modified to use bi-directional attention
- Trained with Matryoshka Representation Learning (MRL), so the 768-dimensional output can be truncated to 512, 256, or 128 dimensions
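Truncating an MRL-trained embedding is typically done by keeping the leading dimensions and re-normalizing. The helper below is a minimal sketch of that idea in plain Python; `truncate_embedding` is a hypothetical name, the toy vector is made up, and the L2 re-normalization step is an assumption about how the truncated vector would be used (for example with cosine similarity).

```python
import math
from typing import List


def truncate_embedding(vec: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


# Toy 3-dimensional vector standing in for a 768-dimensional embedding.
full = [3.0, 4.0, 12.0]
short = truncate_embedding(full, 2)
```

The same call with `dim` set to 512, 256, or 128 would apply to a real 768-dimensional output, trading some accuracy for smaller storage and faster similarity search.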
🧑🏻‍💻 [Microsoft] VibeVoice: A Frontier Open-Source Text-to-Speech Model
- Framework that generates expressive, long-form, multi-speaker conversational audio from text
- Largely addresses issues such as speaker consistency and natural turn-taking
- Uses continuous speech tokenizers operating at an ultra-low frame rate of 7.5 Hz
- I listened to the Context-Aware Expression demo, and it did not sound all that natural
📜 [Oxford, Shanghai AI, NUS, UCL, …] The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
- Contrasts the single-step Markov Decision Processes of LLM-RL with temporally extended partially observable Markov decision processes (POMDPs)
- Organized into two taxonomies
  - Core agentic capabilities including planning, tool use, and memory
  - Applications across diverse task domains
- Reinforcement learning transforms agent capabilities from static, heuristic modules into adaptive, robust agentic behavior
🧑🏻‍💻 [OpenAI] Why language models hallucinate
- Argues that language models hallucinate because training and evaluation reward guessing more than acknowledging uncertainty
- Analyzes the statistical causes of hallucinations in the modern training pipeline
  - Attributes them to errors in binary classification
  - If incorrect statements cannot be distinguished from facts, pretrained LMs hallucinate under natural statistical pressures
- Also points out that because LMs are optimized to be good test-takers, guessing when uncertain ends up rated as higher test performance
- Argues that the "epidemic" of penalizing uncertain responses should be addressed by fixing the misaligned scoring of existing benchmarks
📜 [Manchester] Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
- Drivelology - “nonsense with depth”: syntactically coherent, yet pragmatically paradoxical, emotionally loaded, rhetorically subversive
- Looks like nonsense on the surface, but its implicit meaning has to be decoded through contextual inference, moral reasoning, and emotional interpretation
- Explains that existing LLMs still do not fully understand Drivelological text
  - Meticulously curates about 1,200 examples across English, Mandarin, Spanish, French, Japanese, Korean, and other languages
📜 [Meta, NUS, Rice] REFRAG: Rethinking RAG based Decoding
- Points out two problems in RAG scenarios
  - The trade-off between knowledge enrichment & system efficiency that arises from processing long inputs
  - Most of the retrieved text is irrelevant to the query
- Argues that most computation during decoding over RAG context is unnecessary and can be removed without much impact on overall performance
- Proposes REFRAG: a decoding framework that compresses, senses, and expands to improve latency in RAG applications (attention sparsity structure)
- Speeds up TTFT by 30.85x & extends the LLM's context size by 16x without increasing perplexity
📜 [ByteDance] UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
- Generates data and learns on its own through a data flywheel
- The GUI agent can go beyond simple operations and adapt to complex environments
📜 [Stanford] MachineLearningLM: Continued Pretraining Language Models on Millions of Synthetic Tabular Prediction Tasks Scales In-Context ML
- Equips a general-purpose LLM with robust in-context ML capability
- Synthesizes ML tasks from millions of structural causal models (SCMs), with up to 1,024 shots
- Starts from a random-forest teacher and distills tree-based decision strategies into the LLM
- All tasks are serialized with a token-efficient prompt
- Reports that the tuned Qwen-2.5-7B-Instruct outperforms even GPT-5-mini, and describes this as a many-shot scaling law

2nd week

📜 [NVIDIA] Universal Deep Research: Bring Your Own Model and Strategy
- Points out that existing deep research agents are hard-coded to a fixed list of tool choices
- UDR: a generalist agentic system that works with any language model and lets users customize deep research strategies without additional training
- Phase 1: compiles the strategy to reduce skipped steps and drift → Phase 2: executes synchronous tool calls & yield-based notifications
📜 [Emory Univ.] Improving Factuality in LLMs via Inference-Time Knowledge Graph Construction
- Points out that knowledge is treated as unstructured text in RAG scenarios
- Proposes a framework that dynamically constructs & expands knowledge graphs
- Extracts a seed KG from the question and iteratively expands it using the LLM's latent knowledge
📜 [Arizona, Michigan] Can Multiple Responses from an LLM Reveal the Sources of Its Uncertainty?
- Explains that when an LLM is uncertain, patterns of disagreement appear across its multiple generated responses
- One LLM generates multiple responses, and another (auxiliary) LLM is instructed to analyze the disagreement patterns
📜 [Univ. of Bamberg] Are Humans as Brittle as Large Language Models?
- Beyond LLM non-determinism, prompt brittleness also affects the output
- Examines whether human annotators show similar sensitivity to instruction changes
- Experiments show that both human annotators & LLMs are brittle to certain types of prompt modification
📜 [ByteDance, HKUST, Peking, Tsinghua] Reverse-Engineered Reasoning for Open-Ended Generation
- Deep reasoning is useful in domains like math, but has not yet been explored for open-ended & creative generation
- REverse-Engineered Reasoning (REER): instead of building the reasoning process forwards through trial-and-error or imitation, works backwards from known good solutions
- DeepWriting-20K: open-sources 20,000 deep reasoning trajectories
📜 [Meta Superintelligence, UC Berkeley] Language Self-Play For Data-Free Training
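- Identifies the dependence of LLM progress on high-quality training data as the problem
- Proposes a reinforcement learning approach that improves model performance without additional data
- Language Self-Play (LSP): the model plays against itself to form stronger policies
- Presents experimental results with Llama-3.2-3B-Instruct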
📜 [HKUST, MiniMax, Waterloo] WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents
- Identifies the lack of high-difficulty information-seeking data for training open-source web agents as the problem
- WebExplorer: a data generation method using model-based exploration & iterative, long-to-short query evolution
- WebExplorer-8B: 128K context, 100 tool-calling turns
📜 [HKUST, Jilin Univ., CUHK] Implicit Reasoning in Large Language Models: A Comprehensive Survey
- Analyzes implicit reasoning from a computation perspective within the LLM reasoning paradigm of solving problems in multiple steps
- representational forms → computational strategies
- how & where internal computation unfolds: latent optimization, signal-guided control, layer-recurrent execution
🧑🏻‍💻 [Anthropic] Claude can now create and edit files
- Create and edit Excel spreadsheets, documents, PowerPoint slide decks, PDFs, and more within the Claude chat UI
- Given raw data as input, returns analysis results, statistical analysis, visualizations, and insights
🧑🏻‍💻 [ByteDance] Seedream 4.0
- Image generation model that handles images up to 4K resolution
- Features include batch input & output, prompt-based editing, versatile styles, and knowledge-driven generation
- Model performance is evaluated and reported on MagicBench (Text-to-Image, Single-Image Editing)
📜 [Zurich, Gothenburg] Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation
- Points out the systematic biases & random errors that arise when LLMs are used for tasks such as data annotation or text analysis
- Replicates 37 data annotation tasks from 21 social science studies with 18 LLMs
- Generates 13M LLM labels & tests 2,361 realistic hypotheses → even SOTA models err in about 1/3 of cases, small models in about 1/2
- Concludes that human annotation is important for reducing false positives (Type I errors)
🧑🏻‍💻 [Alibaba] Qwen3-Next: Towards Ultimate Training & Inference Efficiency
- hybrid attention mechanism, highly sparse MoE structure, training-stability-friendly optimization, multi-token prediction mechanism for faster inference
- Qwen3-Next-80B-A3B-Base: performance comparable to dense Qwen3-32B, with 10x higher throughput at 32K context
- Qwen3-Next-80B-A3B-Instruct and Thinking are also released, with a 256K context window
- The post includes a detailed explanation of the architecture
📜 [Apple] OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
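- Simplifies the OpenVision architecture and presents a loss design that improves training efficiency
- Removes the text encoder and the contrastive loss, keeping only a purely generative training signal
  - OpenVision 2
- Greatly reduces training time & memory consumption while maintaining the performance of the original model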

3rd week

📜 [Salesforce] SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents
- Autonomous Single-Agent: dynamically chooses the next action from context without manual directives (in contrast to multi-agent systems that use multiple models)
- Proposes continual reinforcement learning on reasoning-optimized models to strengthen agentic skills while preserving reasoning ability
  - Length-normalized RL Objective, Trajectory Filtering, Partial Rollouts, etc.
📜 [Individual] SI-FACT: Mitigating Knowledge Conflict via Self-Improving Faithfulness-Aware Contrastive Tuning
- Identifies conflicts between internal parametric knowledge and the provided context as the problem
- Self-Improving Faithfulness-Aware Contrastive Tuning: uses a self-instruct mechanism so that the base LLM automatically generates high-quality structured contrastive learning data (positive & negative samples)
📜 [HKUST] VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model
- Recognizing the high training cost of VLA models, studies how to connect vision-language representations to action effectively
- Presents VLA-Adapter, which reduces reliance on large-scale VLMs & extensive pre-training
- Presents a lightweight Policy module with Bridge Attention: autonomously injects the optimal condition into the action space
- Achieves high performance with just a 0.5B-parameter backbone, without robotic data pre-training
📜 [Princeton] Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training
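- LLM interpretability study that draws three conclusions
  - (1) Existing LLMs can accurately describe the internal processes behind certain kinds of decisions
  - (2) This ability can also be strengthened through training
  - (3) The trained ability generalizes to some extent
- Presents experimental results from fine-tuning GPT-4o and GPT-4o-mini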
📜 [Google DeepMind, Toronto] Virtual Agent Economies
- sandbox economy: a framework for analyzing the interactions that occur among AI agents
- Explains that introducing mission economies, which let agents pursue shared goals, created an environment where trust & safety are better ensured
🧑🏻‍💻 [OpenAI] Introducing upgrades to Codex
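- Claims it can run refactoring jobs that last more than 7 hours, which reads a bit like viral marketing..?
- Key features: Code review, Dynamic reasoning (depending on task difficulty), Tool use
- Supported across CLI, IDE extension, Cloud, and other environments
- GitHub code review automation guide by OpenAI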
🧑🏻‍💻 [Meta] MobileLLM-R1
- Reasoning-specialized model that handles only mathematical, programming, and scientific problems
- A model family under 1B parameters, reported to outperform Qwen3 0.6B
- States that it was trained on about 2T tokens in pre-training and about 5T tokens in total
📜 [Berkeley, Washington] Reconstruction Alignment Improves Unified Multimodal Models
- Points out that the image-text pairs used to train unified multimodal models (UMMs) are mostly sparse and miss fine-grained visual details
- Reconstruction Alignment (RecA): a post-training method that uses visual understanding encoder embeddings as dense 'text prompts', providing richer supervision without captions
- Trains with a self-supervised reconstruction loss that reconstructs the input image conditioned on the visual understanding embeddings
- Applicable to autoregressive, masked-autoregressive, and diffusion-based models alike, with strong performance
🧑🏻‍💻 [Google] VaultGemma: The world's most capable differentially private LLM
- differential privacy (DP)를 이용하여 scratch부터 학습한 가장 큰 사이즈의 언어 모델
  - DP: 학습 시 노이즈를 추가하여 학습 데이터가 모델로부터 추출되는 것을 방지하는 mathematical framework (민감 정보 보호)
- 모델 성능을 저해하지 않으면서도 privacy를 지킬 수 있도록 하는 새로운 scaling law 제시
📜 [Nanjing, Shanghai AI] The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations
- Existing ways of measuring the difficulty of input questions rely on repeated response sampling, auxiliary models, or fine-tuning, which is inefficient and does not generalize
- Proposes estimating difficulty using only the hidden representations produced by the target LLM
- Models the token-level generation process as a Markov chain and defines a value function that estimates output quality from the hidden state
🧑🏻‍💻 [Google] Powering AI commerce with the new Agent Payments Protocol (AP2)
- Google's protocol that lets an agent complete payments directly, based on permissions pre-approved by the user
- Every step is logged, which improves safety and trustworthiness
🧑🏻‍💻 [Alibaba] Tongyi DeepResearch: A New Era of Open-Source AI Researchers
- Agentic Continual Pre-training (CPT), SFT for cold-starting, final RL stage
- Inference in ReAct style without prompt engineering
- A 30B model that reaches OpenAI DeepResearch-level performance
📜 [Peking] Early Stopping Chain-of-thoughts in Large Language Models
- ES-CoT: detects answer convergence and stops CoT generation with minimal performance loss
- At each reasoning step the LLM is prompted to produce its current final answer, called the step answer
- The number of consecutive times the same step answer appears is treated as the indicator of answer convergence
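
The convergence check described for ES-CoT can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `early_stop_cot` and the fixed `patience` threshold are assumptions made here, and the step answers are passed in as plain strings rather than produced by a model.

```python
from typing import Iterable, Optional, Tuple


def early_stop_cot(step_answers: Iterable[str], patience: int = 3) -> Tuple[Optional[str], int]:
    """Consume step answers and stop once the same answer repeats `patience` times in a row.

    Returns (last_answer, steps_consumed). `patience` is a hypothetical threshold.
    """
    last_answer: Optional[str] = None
    run_length = 0   # how many consecutive steps produced `last_answer`
    steps = 0
    for answer in step_answers:
        steps += 1
        if answer == last_answer:
            run_length += 1
        else:
            last_answer, run_length = answer, 1
        if run_length >= patience:
            break    # answer has converged; stop generating further reasoning
    return last_answer, steps


answer, used = early_stop_cot(["12", "15", "18", "18", "18", "20"], patience=3)
print(answer, used)
```

In this toy run the sixth step is never consumed, which is where the token savings would come from in a real generation loop.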
📜 [Algoverse] FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness
- Points out that prior work has focused only on measuring CoT faithfulness, not on improving it
- FRIT: a scalable alignment method that trains the model to produce causally consistent reasoning by learning from systematically corrupted examples
- Generates synthetic data for every reasoning step to build faithful/unfaithful pairs, then trains with DPO
🧑🏻‍💻 [Thinking Machines Lab] Defeating Nondeterminism in LLM Inference
- Addresses the problem of LLMs returning different answers even at temperature 0
- Makes operations such as normalization, multiplication, and attention return identical results even when the batch size varies
- The trade-off: processing 1,000 sequences took 42 seconds instead of 26 in their experiment (about 62% slowdown)
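
A small illustration of why reduction order matters, and of how the slowdown figure follows from the quoted timings. Floating-point addition is not associative, so the same numbers summed in a different grouping (as can happen when a change in batch size alters a kernel's reduction strategy) may give slightly different results. The variable names are illustrative; only the 26 s and 42 s timings come from the entry above.

```python
# Floating-point addition is not associative: the grouping changes the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
order_matters = left_to_right != right_to_left

# Slowdown implied by the reported timings (26 s -> 42 s for 1,000 sequences).
baseline_s = 26
deterministic_s = 42
slowdown_pct = (deterministic_s - baseline_s) / baseline_s * 100

print(order_matters, round(slowdown_pct))
```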
📜 [Microsoft] Is In-Context Learning Learning?
- Points out that ICL does not explicitly encode the given observations
- Rather, the model relies on prior knowledge & the given exemplars
- Suggests that autoregression’s ad-hoc encoding is not a robust mechanism and points to limited all-purpose generalisability
🧑🏻‍💻 [OpenAI] Detecting and reducing scheming in AI models
- A study of scheming, where a model appears aligned while secretly pursuing hidden objectives
- Finds that models change their scheming behavior when they detect that they are being evaluated
- Applying reinforcement learning & targeted anti-scheming objectives can raise situational awareness and reduce scheming

4th week

📜 [Shanghai AI] ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
- Computer-use data for computer use agents (CUAs) is hard to obtain and expensive
- ScaleCUA: a large-scale open-source dataset covering 6 operating systems and 3 task domains
📜 [Tsinghua, Northeastern] DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
- Proposes a strategy for automatically synthesizing complex, difficult, hard-to-find questions from open knowledge graphs
- Applies end-to-end multi-turn RL to improve LLMs' long-horizon reasoning with deep search
- DeepDive-32B: outperforms WebSailor, DeepSeek-R1-Browse, and others on BrowseComp
📜 [MBZUAI] K2-Think: A Parameter-Efficient Reasoning System
- Introduces a reasoning system that reaches frontier-level performance at 32B (Qwen2.5 base); GPT-OSS 120B and DeepSeek v3.1 are mentioned as reference points
- Long CoT SFT, RLVR, Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, Inference-optimized Hardware
- Like other reasoning models, described as specialized for math, science, and coding
- The open-source model is available through a serving setup that handles about 2,000 tokens per second per request (Hugging Face link, Chat UI link)
📜 [Apple] AToken: A Unified Tokenizer for Vision
- AToken: the first unified visual tokenizer to show both high-fidelity reconstruction & semantic understanding across images, videos, and 3D assets
- Presents an adversarial-free training objective that combines perceptual & Gram matrix losses
- Uses curriculum training, progressing from single images to videos and 3D
- Supports both continuous & discrete latent tokens
📜 [Cornell, CMU] Predicting Language Models' Success at Zero-Shot Probabilistic Prediction
- Experiments measuring LLMs' zero-shot predictive capabilities on tabular prediction tasks
- Explains that when an LLM performs well on the base prediction task, its individual-level predictive ability is much stronger
- Building on this, proposes metrics that gauge LLM performance at the task level, separating tasks LLMs handle well from those they do not
🧑🏻‍💻 [xAI] Grok 4 Fast
- Cost-efficient reasoning multimodal model; reported to use 40% fewer thinking tokens
- web & X search, 2M context window, reasoning & non-reasoning
📜 [Microsoft, Tsinghua] RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation
- Building a complete repo from scratch requires consistent, reliable planning
- Repository Planning Graph (RPG): encodes file structure, data flows, functions, etc. in a single graph
- ZeroRepo: a graph-driven framework that generates a repo from scratch
  - Runs proposal-level planning, implementation-level refinement, and graph-guided code generation in that order
- RepoCraft: a benchmark of 6 projects covering 1,052 real-world tasks
📜 [School of AI] A State-Update Prompting Strategy for Efficient and Robust Multi-turn Dialogue
- Points out that most LLMs perform poorly in long-horizon, multi-turn dialogue
- Introduces a prompt engineering method built on State Reconstruction & History Remind
📜 [ASI] LIMI: Less is More for Agency
- Argues that sophisticated agentic intelligence emerges from minimal but strategically curated demonstrations of autonomous behavior
- A model trained on only 78 training samples outperforms other SoTA-level models
- In other words, simply piling up data does not help build good agentic intelligence
🧑🏻‍💻 [Alibaba] Qwen3-Omni: Natively Omni-Modal Foundation Models!
- End-to-end omni-modal model that processes text, images, audio, and video in a single architecture
- SoTA on 32 of 36 benchmarks; handles 119 text languages & 19 speech languages and audio inputs up to 30 minutes long
- Thinker-Talker: the Thinker generates text and the Talker streams speech in real time
- Features include an AuT encoder trained on 20M+ hours, MoE, and joint pretraining
🧑🏻‍💻 [DeepSeek AI] DeepSeek-V3.1-Terminus
- Releases a model usable as a Code Agent & Search Agent
- The latest update also resolves the language consistency issue
🧑🏻‍💻 [Figma] Connect Figma to top MCP clients
- Figma now offers a remote MCP server
- The MCP server can be connected from VS Code, Cursor, Claude Code, and other services
📜 [Michigan] Benchmarking and Improving LLM Robustness for Personalized Generation
- Points out that factuality is not being considered alongside personalization
- robust LLM: factually accurate & aligned with user preferences
- PERG: a framework for evaluating how models handle preferences, using PERGData
- Pref-Aligner: a two-stage approach that substantially improves model robustness
🧑🏻‍💻 [Google Chrome] Chrome DevTools (MCP) for your AI agent
- AI agents can inspect and test code directly inside Chrome
- 26 built-in tools for debugging, performance tracing, network analysis, and more
- Usable through Claude, Cursor, Copilot, Gemini CLI, etc.
📜 [Meta] CWM: An Open-Weights LLM for Research on Code Generation with World Models
- A 32B open-weights LLM for research on code generation with world models (131k-token context size)
- Mid-trained on a large volume of observation-action trajectories from Python interpreter & agentic Docker environments

🔥 August

1st week

🧑🏻‍💻 [OpenAI] Introducing study mode
- A feature that guides the model to respond Socratically instead of answering the question directly
- Available to all users regardless of tier
🧑🏻‍💻 [Microsoft] Microsoft Edge Your AI-powered browser
- Announces Copilot Mode, which supports multi-tab RAG in the Edge browser
📜 [Tencent] HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels
- Proposes a framework that generates explorable & interactive 3D worlds from text or images
- Addresses the weaknesses of existing video/3D-based methods → uses a panoramic-image-based 360° world proxy
- Three features: 1) 360° immersive experiences 2) mesh export capabilities 3) disentangled object representations
📜 [Leiden Univ.] How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding
- Studies whether an LLM's CoT really reflects its ‘thoughts’
- Combines a sparse autoencoder with activation patching to extract monosemantic features from CoT outputs
- CoT clearly yields higher activation sparsity and feature interpretability scores
📜 [CUHK] ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
- A modular multi-agent framework that enables UI-to-Code
- Consists of three stages: grounding, planning, generation
  - Detects and labels UI components with a vision language model
  - Builds a hierarchical layout based on front-end priors
  - Generates HTML and CSS code through adaptive prompt-based synthesis
🧑🏻‍💻 [Alibaba] Qwen3 Coder Flash
- A 30.5B code model reaching Claude Sonnet 4-level performance on coding tasks
- 128 experts, 8 activated per inference, with 3.3B active parameters
- 256K native context window, expandable to 1M tokens using YaRN
- Can be understood as a lightweight version of the recently released Qwen3 Coder
🧑🏻‍💻 [Google] Gemini 2.5 Deep Think
- Released to Google AI Ultra subscribers in the Gemini app
- Iterative development and design that breaks complex problems into smaller pieces
- Described as specialized for algorithmic development and code, scientific and mathematical discovery, etc.
📜 [Microsoft] Phi-Ground Tech Report: Advancing Perception in GUI Grounding
- GUI Grounding is one of the core functions performed by Computer Use Agents (CUAs)
- Phi-Ground model family: releases models that achieve SoTA among agents under 10B
📜 [ByteDance] Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving
- Seed-Prover: lemma-style whole-proof reasoning model
- 3 test-time inference strategies that enable deep & broad reasoning
- Introduces Seed-Geometry, a geometry reasoning engine
- Fully proves 5 of the 6 IMO 2025 problems
🧑🏻‍💻 [Kaggle] Introducing Kaggle Game Arena
- A benchmark platform for comparing AI models & agents
- Provides game environments, harnesses, visualizers, etc. in which frontier models such as o3, Gemini 2.5 Pro, Claude Opus 4, and Grok 4 can run
🧑🏻‍💻 [Anthropic] Persona vectors: Monitoring and controlling character traits in language models
- Uses the term persona vectors for the parts of the neural network that activate, much as parts of the brain ‘light up’ when a person experiences different moods or attitudes
- Identifying them makes it possible to suppress undesirable model traits and to adjust training data
- Evaluated on two open-source models: Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct
🧑🏻‍💻 [OpenAI] Open models by OpenAI
- Releases two models, gpt-oss-120b and gpt-oss-20b, on Hugging Face
- Apache 2.0 license; reportedly paid particular attention to safety
- Features include: Designed for agentic tasks, Deeply customizable, Full chain-of-thought
📜 [CUHK, Shanghai AI] Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models
- Points out the static length problem of DLLMs
- Explains that the model internally carries signals related to the appropriate answer length for a given problem
- Proposes DAEDAL, which exploits these latent signals: Dynamic Adaptive length Expansion for Diffusion lArge Language models (the acronym is quite a stretch..)
📜 [Alibaba] Qwen-Image Technical Report
- An image generation foundation model with major advances in complex text rendering & precise image editing
- Applies a curriculum learning approach that starts from non-text-to-text rendering and moves to increasingly complex text inputs
- Uses dual encoding (Qwen2.5-VL & VAE) for text-to-image (T2I), text-image-to-image (TI2I), and image-to-image (I2I) reconstruction
🧑🏻‍💻 [Google DeepMind] Genie 3: A new frontier for world models
- Successor to Genie 2 (released last December), introduced as a SoTA-level world model
- 24 frames per second at 720p with few-minute consistency (Genie 2 managed 10-20s, Veo about 8s)
  - The demo videos are of very high quality
- promptable world events: supports various kinds of text-based interaction
🧑🏻‍💻 [OpenAI] GPT-5 is here
- A real-time router decides whether to reason and selects the appropriate model to answer
- Reports that coding ability improved substantially, reaching the level of other frontier models (hands-on reviews suggest it falls somewhat short of that)
- GPT-5 pro: a model that, like o3-pro, applies test-time scaling to think longer
📜 [ByteDance, Tsinghua] Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
- Seed Diffusion: discrete-state diffusion based large scale language model
- Very fast inference thanks to non-sequential, parallel generation: 2,146 tokens/s on H20 GPUs
- Pushes the speed-quality Pareto frontier on code benchmarks
🧑🏻‍💻 [Google] Guided Learning in Gemini: From answers to understanding
- Google developed LearnLM to support learning instead of answering the user's question right away
- Encourages probing & open-ended questions so that users can deep dive into a topic
📜 [VeriGUI Team] VeriGUI: Verifiable Long-Chain GUI Dataset
- VeriGUI: novel verifiable long-chain GUI dataset
- A training and evaluation dataset for handling realistic computer environments
- Emphasizes (1) long-chain complexity (2) subtask-level verifiability
📜 [Arizona State Univ.] Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Studies whether CoT reasoning reflects a structured inductive bias learned from in-distribution data
- Probes this by having the model conditionally generate reasoning paths that approximate those seen during training
- Analyzes CoT reasoning along three dimensions: task, length, and format
- DataAlchemy: an environment designed to train LLMs from scratch and systematically probe them under various distribution conditions

2nd week

📜 [OPPO AI] Efficient Agents: Building Effective Agents While Reducing Cost
- An agent framework that balances efficiency and effectiveness well
- Analyzes the limitation that test-time scaling (e.g. best-of-N) raises cost far faster than it improves performance
- By the same reasoning, argues that web browsing should be minimized
📜 [Rutgers Univ.] ReaGAN: Node-as-Agent-Reasoning Graph Agentic Network
- Aims to address issues such as information imbalance across nodes and the neglect of global semantic information
- Retrieval-augmented Graph Agentic Network: each node of the graph is set up for autonomous & individual decision making
- Each node is itself an agent capable of Memory, Planning, Action, and Tool Use
🧑🏻‍💻 [Cursor] Cursor CLI
- Releases a terminal-based CLI version (beta)
- Does not appear to differ much from other services
🧑🏻‍💻 [Google] LangExtract
- A Python library that uses LLMs to extract structured information from unstructured text documents according to user-defined instructions
- Has good visualization support and can run local models through Ollama
🧑🏻‍💻 [HuggingFace] Introducing AI Sheets: a tool to work with datasets using open AI models!
- A no-code spreadsheet tool for building datasets with open-source models
- Can generate synthetic data etc. with an LLM and export the final dataset as csv
📜 [Zhipu AI, Tsinghua] GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
- 355B total, 32B activated open-source MoE LLM / GLM-4.5 Air: 106B
- hybrid reasoning method that supports both thinking & direct response
- Trained on 23T tokens
📜 [Meta] TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction
- A model that predicts fMRI activation patterns without direct brain scanning
- Extracts features from audio, video, and dialogue using frozen pretrained models
📜 [ByteDance] WideSearch: Benchmarking Agentic Broad Info-Seeking
- WideSearch: 200 manually curated questions across 15 domains (100 in English, 100 in Chinese)
- The questions require large-scale atomic information, and every item must be objectively verifiable, which makes them demanding
- The goal is to build LLM-based agents that are good at large-scale, repetitive information seeking
📜 [Gaoling School, Baidu, CMU] ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability
- Current LLM-based listwise rerankers do not work well in complex scenarios
- Proposes an automated reasoning-intensive training data synthesis framework; a self-consistency data filtering mechanism ensures data quality
- cold-start SFT → RL for further ranking ability enhancement
- Designs a multi-view ranking reward for listwise ranking in the RL stage, described as more effective than existing ranking metric-based rewards
📜 [Apple] Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential
- Argues that autoregressive models already have a latent ability to predict several upcoming tokens, and proposes a novel framework that exploits it
- multi-token prediction from a common prefix, plus a module that turns these predictions into a coherent sequence
- gated LoRA formulation: preserves the original model's functionality
📜 [Ai2, Washington] MolmoAct: Action Reasoning Models that can Reason in Space
- Identifies the direct mapping from perception and instruction to control in robotic foundation models as what limits generalization
- MolmoAct encodes observations & instructions into depth-aware perception tokens → generates mid-level spatial plans → predicts precise low-level actions (7B)
- MolmoAct Dataset: releases a mid-training robot dataset with 10,000 high-quality robot trajectories
📜 [Hebrew] Story2Board: A Training-Free Approach for Expressive Storyboard Generation
- Task of generating a storyboard (made up of 4 images) from natural language; interesting that there is a research area devoted to refining this
- Notes that prior work focused only on subject identity, and instead focuses on spatial composition, background evolution, narrative pacing, etc.
🧑🏻‍💻 [NVIDIA] NVIDIA Releases 3 Million Sample Dataset for OCR, Visual Question Answering, and Captioning Tasks
- Llama Nemotron VLM Dataset V1: releases a high-quality dataset of 3M samples for VLM training
- Focused on OCR, VQA, captioning, etc.; recently used to train Llama 3.1 Nemotron Nano VL 8B V1
📜 [Alibaba] WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
- Points out that existing Deep Research work focuses mostly on text
- Uses high-quality synthetic multimodal trajectories for an efficient cold start
- BrowseComp-VL: a complex benchmark that requires retrieving both visual & textual information well
📜 [WeChat, Tsinghua] We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning
- We-Math 2.0: a unified system that integrates a structured mathematical knowledge system, model-centric data space modeling, and an RL-based training paradigm
- MathBook Knowledge System: five-level hierarchy system. 491 knowledge points, 1,819 fundamental principles
- MathBook-Standard & Pro: training datasets split by difficulty
- MathBook-RL: Cold-Start Fine-tuning → Progressive Alignment RL
- MathBookEval: a benchmark that covers all 491 knowledge points with diverse reasoning step distributions

3rd week

🧑🏻‍💻 [Meta] DINOv3
- self-supervised vision foundation model that scales data and model size
- Uses a Gram anchoring loss to preserve dense patch consistency, and adds post-hoc tweaks for resolution, size, and text alignment
📜 [ByteDance] Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
- M3 Agent: a multimodal agent framework with human-like long-term memory; processes real-time visual & auditory inputs to build or update memory
- Manages knowledge accumulated over time as semantic memory (separate from episodic memory)
- M3 Bench: long-video question answering benchmark; 100 items captured from a robot's perspective + 929 web-sourced items
📜 [Chinese Academy of Science] PaperRegister: Boosting Flexible-grained Paper Search via Hierarchical Register Indexing
- Existing paper search systems index only abstracts, so they cannot satisfy fine-grained requirements
- offline hierarchical indexing & online adaptive retrieval → an index tree for paper search
📜 [Amsterdam] Can we Evaluate RAGs with Synthetic Data?
- Two perspectives for checking whether synthetic benchmarks are good enough
- (1) fix the generator and vary the retriever (2) fix the retriever and vary the generator
- Results are consistent in (1) but not in (2)
🧑🏻‍💻 [Google] Introducing Gemma 3 270M: The compact model for hyper-efficient AI
- A 270M-parameter model intended for task-specific fine-tuning, with strong instruction-following and text structuring capabilities
- 170M of these are embedding parameters, attributed to the large vocabulary (256k tokens)
- Also releases a Quantization-Aware Trained (QAT) version usable at INT4 precision
🧑🏻‍💻 [Alibaba] Qwen-Image-Edit: Image Editing with Higher Quality and Efficiency
- Feeds the input image to both Qwen-2.5-VL and the VAE Encoder, enabling semantic & appearance editing
- Accurate text editing in English and Chinese
- Achieves SoTA, surpassing models such as Seedream, GPT Image, and FLUX
📜 [Univ. of Tubingen] MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models
- At inference, Masked Diffusion Language Models repeatedly alternate unmasking and masking, which differs from training, where masks were set at random
- To address this, frames learning effective denoising trajectories as a sequential decision-making problem
- Masked Diffusion Policy Optimization (MDPO): exploits the Markov property of the diffusion process so that the model sees during training the same progression it goes through at inference
📜 [OPPO] Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL
- Proposes a paradigm in which a single model performs multi-turn problem solving using multiple tools & agents
- Proposes a multi-agent distillation framework for agentic supervised fine-tuning → reinforcement learning on verifiable agentic tasks
- The resulting models are called Agent Foundation Models (AFMs)
📜 [Shanghai Jiao Tong Univ.] Transplant Then Regenerate: A New Paradigm for Text Data Augmentation
- LMTransplant: builds an expanded context around the seed text, then instructs the LLM to generate a variant based on that context
- Uses the knowledge embedded in the LLM to produce diverse & creative content-level variants that keep the attributes of the original text
🧑🏻‍💻 [DeepSeek] DeepSeek-V3.1 Release
- Shows large gains over its predecessor on SWE-/Terminal- bench
📜 [ByteDance, Nanjing] DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
- DuPO: a dual learning-based preference optimization framework that generates annotation-free feedback via generalized duality
- Overcomes two limitations: RLVR is excessively costly, and traditional dual learning handles only the tasks seen during training
- Splits the primal task’s input into known & unknown components, then reconstructs the unknown part from the primal output & known information
📜 [Wuhan, Nanjing] From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models
- FinCDM: the first cognitive diagnosis evaluation framework for financial LLMs
- Evaluates LLMs at the knowledge-skill level, showing which financial skills & knowledge an LLM has (rather than returning a single number)
- CPA-QKA: the first cognitively informed financial evaluation dataset, derived from the Certified Public Accountant (CPA) exam
📜 [Meta] Deep Think with Confidence
- Existing test-time scaling approaches such as self-consistency via majority voting incur large computational overhead
- Deep Think with Confidence (DeepConf): uses model-internal confidence signals to dynamically filter out low-quality reasoning traces
- Integrates into existing serving frameworks with no additional training or hyper-parameter tuning
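
The filter-then-vote idea behind DeepConf can be illustrated with a rough sketch. It is not the paper's algorithm: the function name `filtered_majority_vote`, the `keep_ratio` parameter, and the use of a single scalar confidence per trace (for example, a mean token log-probability) are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple


def filtered_majority_vote(traces: List[Tuple[str, float]], keep_ratio: float = 0.5) -> str:
    """Keep the most confident traces, then majority-vote over their answers.

    `traces` holds (final_answer, confidence) pairs; `keep_ratio` is a hypothetical knob.
    """
    if not traces:
        raise ValueError("at least one reasoning trace is required")
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    keep = max(1, int(len(ranked) * keep_ratio))   # always keep at least one trace
    votes = Counter(answer for answer, _ in ranked[:keep])
    return votes.most_common(1)[0][0]


traces = [("A", 0.9), ("B", 0.2), ("B", 0.3), ("A", 0.8), ("B", 0.1), ("C", 0.85)]
print(filtered_majority_vote(traces, keep_ratio=0.5))  # votes only over the top 3 traces
print(filtered_majority_vote(traces, keep_ratio=1.0))  # plain majority vote over all traces
```

With all traces kept, the low-confidence "B" answers win by count; after filtering, only the confident traces vote.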
📜 [Shanghai AI Lab] Intern-S1: A Scientific Multimodal Foundation Model
- Points out that the gap between open-source models & closed models is still substantial in scientific domains
- Intern-S1: a specialized generalist equipped with general understanding and reasoning capabilities
- MoE model with 28B activated, 241B total parameters
- Pre-trained on 5T tokens, of which 2.5T are scientific-domain data
- For offline & online RL, uses Mixture-of-Rewards (MoR) within the InternBootCamp framework to train on more than 1,000 tasks simultaneously
📜 [Tsinghua] ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents
- A framework for autonomous desktop intelligence, characterized by the API-GUI paradigm
- Builds distributed RL infrastructure that orchestrates thousands of virtual desktop environments in parallel for large-scale RL
- Entropulse: alternates SFT and RL to mitigate entropy collapse
📜 [Shanghai AI Lab] Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing
- Avengers-Pro: a test-time routing framework that picks a suitable point on the performance-efficiency tradeoff
- embed & cluster incoming queries → route to the most suitable LLM
- Reaches a higher performance ceiling than any single model, and needs less cost to match a given performance level
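
The embed → cluster → route flow of Avengers-Pro might look roughly like the sketch below. Everything concrete here is an assumption for illustration: the 2-D "embeddings", the per-cluster statistics, the linear score, and the `alpha` weight are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple


def nearest_cluster(embedding: Sequence[float], centroids: List[Sequence[float]]) -> int:
    """Index of the centroid closest to the query embedding (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(embedding, centroids[i]))


def route(embedding: Sequence[float],
          centroids: List[Sequence[float]],
          stats: List[Dict[str, Tuple[float, float]]],
          alpha: float = 0.5) -> str:
    """Pick the model with the best performance-efficiency score for the query's cluster.

    stats[c][model] = (accuracy in [0, 1], normalized cost in [0, 1]).
    `alpha` (hypothetical) weights accuracy against cheapness.
    """
    cluster = nearest_cluster(embedding, centroids)

    def score(model: str) -> float:
        accuracy, cost = stats[cluster][model]
        return alpha * accuracy + (1 - alpha) * (1 - cost)

    return max(stats[cluster], key=score)


centroids = [(0.0, 0.0), (10.0, 10.0)]            # e.g. "easy" vs "hard" query clusters
stats = [
    {"small": (0.80, 0.1), "large": (0.85, 0.9)},  # easy cluster: models perform similarly
    {"small": (0.40, 0.1), "large": (0.90, 0.9)},  # hard cluster: the large model is much better
]
print(route((1.0, 1.0), centroids, stats, alpha=0.5))
print(route((9.0, 9.0), centroids, stats, alpha=0.9))
```

Raising `alpha` moves the router toward the performance end of the tradeoff; lowering it favors cheaper models.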

4th week

🧑🏻‍💻 [xAI] xai-org/grok-2
- Open-sources Grok 2.5, the 270B flagship model from 2024
- 62B activated parameters per token
- Can be served on 8 GPUs with tensor parallelism
🧑🏻‍💻 [GitHub] Why we open sourced our MCP server, and what it means for you
- Open-sources the MCP server that serves as the source-of-truth interface between GitHub and LLMs
🧑🏻‍💻 [Anthropic] Enhancing Model Safety through Pretraining Data Filtering
- Uses classifiers & pre-trained models to filter harmful content out of pretraining data
  - 6 classifier approaches
- The model used for the classifier is described as much smaller than Claude 3.5 Haiku
📜 [UCL, Huawei] AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
- Presents a learning paradigm in which adaptive LLM agents perform memory-based online RL without fine-tuning (the agent model in their DeepResearch setting is named Memento)
- Equips a Memory-augmented Markov Decision Process (M-MDP) with a neural case-selection policy
- The policy is continually updated from environmental feedback through a memory rewriting mechanism
📜 [Shanghai AI Lab] InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
- Open-source multimodal models with greatly improved versatility, reasoning capability, and inference efficiency
- Cascade Reinforcement Learning (Cascade RL) framework: offline RL for stable convergence & online RL for refined alignment (coarse-to-fine)
- Visual Resolution Router (ViR) adjusts the resolution of visual tokens without degrading performance
- Decoupled Vision-Language Deployment (DvD) strategy: places the vision encoder & language model on different GPUs to manage computational load efficiently
📜 [Microsoft] CoCoA: Confidence- and Context-Aware Adaptive Decoding for Resolving Knowledge Conflicts in Large Language Models
- CoCoA: novel token-level algorithm for principled conflict resolution & enhanced faithfulness
- Uses entropy gap & contextual peakedness as confidence-aware measures to resolve conflicts
- Reported to perform well even in low conflict settings (QA, Summarization, etc.)
📜 [UIUC, HKUST] Utilizing Training Data to Improve LLM Reasoning for Tabular Understanding
- For tabular understanding, prior work used fine-tuning on labeled data & training-free CoT; the paper points out the limitations of both
- Learn then Retrieve, LRTab: a prompting-based reasoning approach that retrieves and uses relevant information learned from the training data
- For incorrect CoTs, prompts the model to predict what Prompt Conditions would have helped it avoid the error
🧑🏻‍💻 [Google] Introducing Gemini 2.5 Flash Image, our state-of-the-art image model
- Gemini 2.5 Flash Image achieves SoTA in image editing, ahead of OpenAI and Flux
- Drew a lot of attention for following edit instructions well while keeping character traits intact
🧑🏻‍💻 [Google] NotebookLM's Video Overviews are now available in 80 languages
- As the title says, NotebookLM's Video Overviews now support 80 languages worldwide
🧑🏻‍💻 [Anthropic] Piloting Claude for Chrome
- Pilots browser-using AI with Claude as a Chrome extension
- Currently early access for 1,000 Max users (waitlist registration required)
- Also discloses various risks up front
- OpenAI released a web-browsing feature earlier this year too, though whether it is actually being used remains to be checked
📜 [UC Berkeley] MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
- A benchmark of realistic, multi-step tasks that demand LLM tool use, cross-tool coordination, precise parameter control, etc.
- Built on MCP: connects LLMs to 28 representative live MCP servers spanning various domains (finance, traveling, etc.)
- Proposes a multi-faceted evaluation framework
🧑🏻‍💻 [xAI] Grok Code Fast 1
- Masters common tools such as grep, terminal, and file editing
- Available in GitHub Copilot, Cline, Cursor, Roo Code, Windsurf, etc.
- Handles many languages including TS, Python, Java, Rust, C++, and Go; notes that serving was optimized for speed
📜 [KTH] Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction
- Longer reasoning does not correlate with answer confidence; if the usefulness of a reasoning step could be judged during generation, early stopping or pruning ineffective steps would be possible
- Generates reasoning chains with Qwen2.5-32B & GPT-4o, and measures final accuracy with Qwen3-8B
- Measures uncertainty by computing, step by step, the conditional entropy of the answer span Y given each reasoning step
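
As a toy illustration of step-wise entropy reduction, the sketch below computes the Shannon entropy of an answer distribution after each reasoning step and the drop between consecutive steps. The distributions are made up, and treating each step as yielding an explicit probability vector over candidate answers is a simplifying assumption, not the paper's exact procedure.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropy_reductions(step_distributions: List[Sequence[float]]) -> List[float]:
    """Entropy drop contributed by each step relative to the previous one.

    step_distributions[i] is the (hypothetical) distribution over candidate answers
    after conditioning on the first i reasoning steps.
    """
    entropies = [entropy(d) for d in step_distributions]
    return [entropies[i - 1] - entropies[i] for i in range(1, len(entropies))]


distributions = [
    [0.25, 0.25, 0.25, 0.25],  # before reasoning: four equally likely answers
    [0.5, 0.5, 0.0, 0.0],      # after step 1: two candidates ruled out
    [1.0, 0.0, 0.0, 0.0],      # after step 2: answer determined
]
print(entropy_reductions(distributions))
```

A step whose reduction is near zero (or negative) would be a candidate for pruning or for triggering an early stop.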

🍉 July

1st week

📜 [Stanford, NYU] From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
- A paper on how LLMs and human cognition adopt different strategies for trading off meaning preservation against representational compression
- Humans tolerate a degree of inefficiency to form richer, more flexible conceptual structures, whereas LLMs form concepts by maximizing statistical efficiency
🧑🏻‍💻 [Baidu] Announcing the Open Source Release of the ERNIE 4.5 Model Family
- Releases a multimodal model family of 10 variants: MoE models with 47B and 3B active parameters (the largest at 424B total) plus a 0.3B dense model (Apache 2.0)
- Applies a heterogeneous MoE architecture that allocates independent parameters to each modality while also keeping parameters shared across modalities
- Trained with PaddlePaddle, the Chinese deep learning framework
📜 Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies
- Mixture of Reasoning (MoR): a training framework that enables LLMs to perform autonomous, task-adaptive reasoning without external prompt engineering
- Thought generation → SFT dataset construction
📜 [Mila, Oxford, AI2] Chain-of-Thought Is Not Explainability
- Argues that CoT rationales are neither necessary nor interpretable
- Explains that verbalized chains are often unfaithful and diverge from the model's actual prediction, which hinders the model from reaching the final answer
- (1) Without additional verification, CoT cannot be used as an interpretability technique
- (2) Rigorous methods should be used to evaluate the faithfulness of downstream decision-making
- (3) Causal validation methods that ground explanations in model internals should be advanced
- Yoshua Bengio is among the authors (wow)
🧑🏻‍💻 [Ai2] SciArena: A New Platform for Evaluating Foundation Models in Scientific Literature Tasks
- SciArena: an open & collaborative platform for evaluating how well foundation models handle scientific literature tasks
- Hosts 23 frontier models to track SoTA performance; o3 currently ranks highest
- Uses an Elo rating system, like Chatbot Arena
- Paper link 🔗
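
For readers unfamiliar with Elo, the standard pairwise update is sketched below. This is the textbook formula, not SciArena's actual implementation; the K-factor of 32 and the starting rating of 1500 are conventional values assumed for illustration.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32) -> Tuple[float, float]:
    """Return updated (rating_a, rating_b).

    score_a is 1.0 if A's answer was preferred, 0.0 if B's was, 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta   # zero-sum: B loses what A gains


print(update_elo(1500, 1500, 1.0))  # two equally rated models, A wins the vote
```

An upset (a low-rated model beating a high-rated one) moves ratings more than an expected win, which is what lets a leaderboard converge from pairwise human votes.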
📜 [ETH Zürich] Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models (ICLR 2025)
- Uses sparse autoencoders (SAEs) as an interpretability tool for entity recognition
- SAEs can uncover meaningful directions in representation space, which makes it possible to tell whether the model knows a given entity or not (self-knowledge)
- Steering along these directions can make the model claim ignorance of things it knows, or answer as if it knows things it does not (hallucinate)
🧑🏻‍💻 [Google Gemini] Gemini-CLI
- An open-source agent framework usable from the CLI (Apache-2.0)
🧑🏻‍💻 observe.tools
- A solution that enables debugging with a one-line endpoint change
- Supports detailed trace inspection, payload editing, sharing, etc.
🧑🏻‍💻 [Ai2] IFBench
- A challenging benchmark for evaluating LLMs' instruction following
- OOD constraints: 58 new & challenging constraints, each with a verification function
- Multiturn Constraint Isolation in 2 turns
- new IF-RLVR training constraints: 29 new & challenging constraints, likewise with verification functions (IF-RLVR training data 🔗)
📜 [Alibaba] WebSailor: Navigating Super-human Reasoning for Web Agent
- Agentic systems like DeepResearch perform so well because they can greatly reduce the extreme uncertainty of exploring a vast information landscape
- Duplicating Sampling Policy Optimization (DUPO): agentic RL training algorithm
- DUPO + structured sampling, information obfuscation, RFT cold start
📜 [Inception Labs] Mercury: Ultra-Fast Language Models Based on Diffusion
- Proposes a commercial diffusion-based LLM; drew attention for its very high inference speed
- Transformer architecture & parallel prediction of multiple tokens
- A detailed report on Mercury Coder, which comes in two sizes, Mini & Small
📜 [NUS, MIT, Yonsei] MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents
- MEM1: an RL framework that lets agents operate with constant memory on long multi-turn tasks
- Updates a compact shared internal state at every turn
- Composes existing datasets into complex task sequences to train in a more realistic & compositional setting
- Reports strong generalization
📜 [Baidu] Towards AI Search Paradigm
- A search system that can emulate human information processing & decision-making
- Uses LLM-powered agents to dynamically access a wide range of information (from simple factual queries to complex multi-stage reasoning tasks)
- Assesses query complexity, decomposes the problem into executable plans, and solves it through tool usage, task execution, and content synthesis (MCP)

2nd week

📜 [Independent] Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs
- Self-Correction Blind Spot: the model fails to correct identical errors when they appear in its own output
- Proposes Self-Correction Bench: systematically evaluates this ability through controlled error injection at 3 complexity levels
- This limitation is strongly tied to the composition of the training data
  - RL triggers correction through rewards, whereas SFT does not..
  - Simply appending “Wait” reduced the Blind Spot by 89.3%
📜 [Salesforce] Lost at the Beginning of Reasoning
- Presents experiments showing that an LLM's first reasoning step has an outsized influence on the final answer
  - That is, a bad start naturally lowers the quality of the reasoning that follows
- Experiments on DeepSeek-R1 & Qwen3
- Proposes a sampling strategy that uses a reward model to retain high-quality first reasoning steps
- Builds a benchmark of samples whose first reasoning step is deliberately flawed, to evaluate the model's self-correction ability
🧑🏻‍💻 [Sakana AI] Inference-Time Scaling and Collective Intelligence for Frontier AI
- Idea: multiple models can be used together at inference time, not only merged into new models → Collective Intelligence
- AB-MCTS (Adaptive Branching Monte Carlo Tree Search)
  - The AI performs trial-and-error quickly, allowing multiple frontier models to cooperate
  - o4-mini + Gemini-2.5-Pro + R1-0528
📜 [Tsinghua] GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
- Reinforcement Learning with Curriculum Sampling (RLCS)
- Open-sources GLM-4.1V-9B-Thinking: SoTA among models of similar size; handles video understanding, content recognition, coding, grounding, and more
  - long document understanding & STEM reasoning
📜 [Alibaba] Ovis-U1 Technical Report
- 3B unified model: multi-modal understanding, text-to-image generation, image editing
- diffusion-based visual decoder & bidirectional token refiner
- Unlike approaches that use a frozen MLLM, starts from a language model and uses a unified training approach to learn both understanding & generation → better performance
🧑🏻‍💻 [Anthropic] Project Vend: Can Claude run a small shop? (And why does that matter?)
- Anthropic had Claude run a vending business for a month (mini fridge + self-checkout iPad)
- What went well: found suppliers on the web and stocked unusual, rare items (e.g. Dutch chocolate milk)
- What failed: excessive discounting, fabricated payment details
- Concludes that it cannot run a store in its current state, but could take on something like a middle-manager role in the future
📜 [MemTensor] MemOS: A Memory OS for AI System
- memory를 관리 가능한 시스템 리소스로 다루는 운영체제
- representation, scheduling, evolution of plain text, activation-based & parameter-level memories를 통합
- MemCube를 기본 단위로 사용하여 memory & meta data를 encapsulate
📜 Should We Still Pretrain Encoders with Masked Language Modeling?
- Ablation study training 38 models at sizes from 210M to 1B
- Compares the results of MLM and CLM training
- MLM gives better final results, but CLM is more data-efficient
- Reports that a biphasic strategy, CLM → MLM, gives the best results under a limited budget
📜 [IIT] SingLoRA: Low Rank Adaptation Using a Single Matrix
- Decomposes the weight update as the product of a single low-rank matrix and its transpose
- This resolves the performance degradation caused by scale disparities between the two matrices of standard LoRA
- Reports fine-tuning results on Llama for natural language and Stable Diffusion for images
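As a rough illustration of the single-matrix idea in the SingLoRA entry above, the sketch below builds a weight update from one low-rank matrix times its own transpose. It is a minimal NumPy sketch, not the authors' implementation: the function name `singlora_delta`, the scaling factor `alpha`, the matrix sizes, and the restriction to a square weight are all assumptions made for illustration.

```python
import numpy as np

def singlora_delta(A: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Low-rank update alpha * A @ A.T built from a single matrix A (d x r)."""
    return alpha * (A @ A.T)

rng = np.random.default_rng(0)
d, r = 8, 2                          # hypothetical hidden size and rank
W0 = rng.normal(size=(d, d))         # frozen pretrained weight (square case only)
A = rng.normal(size=(d, r)) * 0.01   # the only trainable matrix

delta = singlora_delta(A)
W = W0 + delta                       # adapted weight
```

Because the update comes from one matrix, it is symmetric with rank at most `r`, and there is no second factor whose scale could drift relative to the first.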
🧑🏻‍💻 [Perplexity] Browse at the speed of thought
- Early release of the Comet browser to Perplexity Max tier subscribers
📜 [Google DeepMind] MedGemma Technical Report
- MedGemma: medical vision-language foundation models based on Gemma 3 4B & 27B
- Reported to handle medical multimodal question answering & chest X-ray finding classification well
- MedSigLIP: a medically-tuned vision encoder derived from SigLIP
🧑🏻‍💻 [Ai2] Introducing FlexOlmo: a new paradigm for language model training and data collaboration
- Presents a training paradigm that enables AI co-development through data collaboration
- Data owners can contribute to an AI model without losing control over their data, and without sharing the data directly
🧑🏻‍💻 [Google] T5Gemma: A new collection of encoder-decoder Gemma models
- Trains T5Gemma on the Gemma 2 framework (Small, Base, Large, and XL sizes)
- model adaptation: initialize from the weights of a pretrained decoder-only model → UL2 or PrefixLM-based pre-training → better performance than the original decoder-only model
- Encoder and decoder sizes do not have to match (flexibility)
🧑🏻‍💻 [xAI] Grok4
- Reported to score above 44 (with tool use) on the HLE benchmark, where even o3 scores around 25
- multi-agent architecture, 256K context window
📜 [Intel] Investigating the Robustness of Retrieval-Augmented Generation at the Query Level
- Points out that RAG depends strongly on the quality of the input query
- Measures the sensitivity of RAG components by applying various perturbations to the query
- Finds that even minor query variations noticeably degrade the final generation
📜 [NUS] Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights (NeurIPS 2025)
- Drag-and-Drop LLMs (DnD): a prompt-conditioned parameter generator that maps unlabeled task prompts directly to LoRA weight updates
- A lightweight text encoder distills each prompt batch into condition embeddings → a cascaded hyper-convolutional decoder turns them into the full set of LoRA matrices
- Generates task-specific parameters in seconds → 12,000x lower overhead than full fine-tuning (FFT) → up to 30% gains over existing LoRA on unseen tasks
🧑🏻‍💻 [SKT] A.X-4.0
- Open-source release of a Qwen2.5-based model
- Highlights Korean-language understanding & enterprise deployment as strengths
- 72B size; a 7B light version is also released
🧑🏻‍💻 [SKT] A.X-3.1-Light
- Trained from scratch on TITAN, SKT's in-house supercomputing infrastructure
- Trained on a 1.65T-token multilingual corpus; 7B size
📜 [Stanford, Cohere] Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models (ICLR 2025)
- Points out the limitations of diffusion LMs in likelihood modeling & fixed-length generation
- A class of block diffusion models: interpolates between discrete denoising diffusion & autoregressive models
  - flexible-length generation & inference efficiency with KV caching and parallel token sampling
- Presents an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules that minimize the variance
📜 [Tencent, Princeton] One Token to Fool LLM-as-a-Judge
- Finds that when an LLM is used as a generative reward model to compare an answer against a ground-truth reference, it is strongly swayed by small surface cues (the paper appears to classify these as "master key" attacks)
  - non-word symbols (":", ".")
  - reasoning openers: Thought process:, Let’s solve this problem step by step
- Such expressions mostly lead to false positives (a reward is given when it should not be)
- Explains that data augmentation & model training can mitigate the issue

3rd week

🧑🏻‍💻 [Moonshot AI] Kimi K2: Open Agentic Intelligence
- MoE model with 1T total and 32B activated parameters. Base and Instruct versions are open-sourced
- Introduces the MuonClip optimizer, advancing the qk-clip technique
- Large-scale Agentic Data Synthesis for tool learning
📜 [Google DeepMind] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
- Releases Gemini 2.5 Pro and Gemini 2.5 Flash
- Thinking models specialized for coding and reasoning
- Strong multimodal understanding; can process up to 3 hours of video
- long context + multi-modal ⇒ agentic problem-solving
📜 [MetaStone AI, USTC] Test-Time Scaling with Reflective Generative Model
- Releases MetaStone-S1, which reaches o3-mini-level performance via the Reflective Generative Form
- Two key features
  - (1) A unified interface for policy and process reward model: the trajectory scoring head is only 53M parameters
  - (2) Eliminating the reliance on process-level annotation: self-supervised process reward model
📜 [CMU] Dynamic Chunking for End-to-End Hierarchical Sequence Modeling
- dynamic chunking: a mechanism that automatically learns content- & context-dependent segmentation strategies
- Integrating dynamic chunking into a hierarchical network (H-Net) replaces the tokenization-LM-detokenization pipeline with a single model
- Models trained on English are reported to be more robust at the character level
- A paper from Albert Gu, the creator of Mamba
📜 [KAIST, Mila, Google] Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
- Mixture-of-Recursions (MoR): a single Recursive Transformer that accounts for both parameter sharing & adaptive computation
- Uses a shared stack of layers for parameter efficiency and a lightweight router for adaptive token-level thinking
- Proposes a KV sharing variant that reuses the KV pairs from the first recursion
📜 [Johns Hopkins, Tsinghua, Rice] Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
- Vision-Language-Vision Auto-Encoder framework
  - Uses a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and an LLM in sequence
- Using the T2I diffusion model's decoder made it possible to regularize the language representation space
🧑🏻‍💻 [OpenAI] Introducing ChatGPT agent: bridging research and action
- Agent feature announced for Pro, Plus, and Team plan users; currently available to Pro users only
- Integrates easily with other tools to carry out tasks; benchmark results are also published
- ChatGPT agent System Card
🧑🏻‍💻 [Mistral] Voxtral
- Releases 24B & 3B speech models under the Apache 2.0 license
- Published Word Error Rate results show it outperforming GPT-4o mini Audio and Gemini 2.5 Flash
- Text understanding is not far behind Mistral Small 3.1
📜 [Peking, Tsinghua] A Survey of Context Engineering for Large Language Models
- The era of context engineering, no longer prompt engineering
- Its core components: (1) Context Retrieval and Generation (2) Context Processing (3) Context Management
- System Implementations: (1) Retrieval-Augmented Generation (RAG) (2) Memory systems (3) Tool-Integrated Reasoning (4) Multi-Agent Systems
🧑🏻‍💻 [Stanford] Agents4Science 2025
- The first open conference where AI reviews papers authored by AI (Stanford University)
- Schedule: submissions due September 25, reviews due September 29, virtual conference on October 22
- A bold attempt to explore how AI can contribute to science
📜 [Tsinghua, UIUC, Tokyo, Peking, HKUST] Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
- Reasoning-Enhanced RAG: analyzes how advanced reasoning optimizes each stage of RAG
- RAG-Enhanced reasoning: analyzes how different kinds of retrieved knowledge extend the context
- Synergized RAG-Reasoning: LLMs iteratively interleave search & reasoning to reach top performance

4th week

📜 [CMU] Agentic-R1: Distilled Dual-Strategy Reasoning
- Points out that current long-CoT models solve math problems well but rely on slow & error-prone natural language traces
- Tool-augmented agents solve problems through code execution, yet still fail on complex logical problems
- DualDistill: a framework that distills complementary reasoning strategies from multiple teachers into a unified student model
- Agentic-R1: a model trained to dynamically choose the best strategy for each query, either using tools or reasoning in text
🧑🏻‍💻 [ARC] ARC-AGI-3
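- Interactive benchmark for measuring the performance of LLM agents
- The ARC benchmarks are already well known for puzzle-solving tasks (testing whether human-like thinking is possible)
- Frontier models such as o3 and Grok 4 have scored 0 so far
- Inference on an RTX 5090 or $1K of API usage, with an 8-hour limit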
🧑🏻‍💻 [Google] Gemini Embedding now generally available in the Gemini API
- The first Gemini Embedding text model (gemini-embedding-001) is available through the Gemini API or Vertex AI
- Reported to perform well across diverse domains such as science, legal, finance, and coding
- Supports an input length of 2048 tokens across 100+ languages. With the Matryoshka Representation Learning (MRL) technique, output dimensions of 3072, 1536, or 768 are recommended
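To make the MRL note above concrete: a Matryoshka-style embedding is typically shortened by keeping its leading dimensions and re-normalizing. The sketch below shows only that post-processing step on a stand-in vector. It does not call the Gemini API, and the helper name `truncate_embedding` is hypothetical.

```python
import numpy as np

def truncate_embedding(vec, dim: int) -> np.ndarray:
    """Keep the first `dim` values of an embedding and rescale to unit length."""
    head = np.asarray(vec, dtype=float)[:dim]
    norm = np.linalg.norm(head)
    return head / norm if norm > 0 else head

# Stand-in for a 3072-dimensional embedding (not real model output).
full = np.arange(1, 3073, dtype=float)
small = truncate_embedding(full, 768)
```

Re-normalizing after truncation keeps cosine similarity and dot-product search consistent across the different output sizes.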
📜 [Anthropic] Inverse Scaling in Test-Time Compute
- Analyzes how Large Reasoning Models (LRMs) exhibit an inverse scaling relationship between test-time compute & accuracy
- All flagship models show weaknesses on complex deductive tasks
- Extended reasoning increases expressions of self-preservation
- Simple Counting tasks with Distractors, Regression Tasks with Spurious Features, Deduction Tasks with Constraint Tracking
📜 [Zhejiang] GUI-G^2: Gaussian Reward Modeling for GUI Grounding
- Existing RL for GUI grounding uses binary rewards based on hit-or-miss targets
- GUI-G^2: models GUI elements as continuous Gaussian distributions over the interface plane
  - Gaussian point rewards: model precise localization
  - Coverage rewards: measure the overlap between predicted Gaussian distributions & target regions
- Develops an adaptive variance mechanism that calibrates reward distributions based on element dimensions
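The contrast between a binary hit-or-miss reward and a Gaussian point reward in the GUI-G^2 entry above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's exact formulation: the function name, the `(x1, y1, x2, y2)` box convention, and the factor `alpha` that ties the standard deviation to element size are all hypothetical.

```python
import math

def gaussian_point_reward(pred_xy, box, alpha: float = 0.25) -> float:
    """Continuous reward that peaks at the element center and decays smoothly.

    box is (x1, y1, x2, y2). The standard deviation scales with the element's
    width and height, a stand-in for an adaptive variance mechanism.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sx = max(alpha * (x2 - x1), 1e-6)
    sy = max(alpha * (y2 - y1), 1e-6)
    dx, dy = pred_xy[0] - cx, pred_xy[1] - cy
    return math.exp(-0.5 * ((dx / sx) ** 2 + (dy / sy) ** 2))

button = (100, 100, 200, 140)                               # hypothetical element
center_reward = gaussian_point_reward((150, 120), button)   # click at the center
edge_reward = gaussian_point_reward((195, 120), button)     # click near the edge
miss_reward = gaussian_point_reward((400, 400), button)     # click far away
```

Unlike a binary reward, a near miss still receives a small nonzero signal, and a click near the edge scores lower than one at the center.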
📜 [MiroMind AI] MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization
- An LRM built on a Qwen 2.5 backbone, aiming to close the gap with closed-source models
- SFT on a 719K math-reasoning dataset + RLVR on 62K challenging & verifiable problems
- Context-Aware Multi-Stage Policy Optimization (CAMPO): length-progressive training + adaptive repetition penalty
🧑🏻‍💻 [Alibaba] Qwen3-235B-A22B-Instruct-2507
- Non-thinking model supporting 256K long context
- Deployed as the default model in Qwen Chat; reported to outperform Kimi K2
📜 [CMU] Diffusion Beats Autoregressive in Data-Constrained Settings
- Argues that masked diffusion models outperform autoregressive models in data-constrained settings
- Shows lower validation loss on repeated data and better downstream performance
- The authors interpret this as implicit data augmentation (in contrast to AR, which follows a fixed left-to-right factorization)
🧑🏻‍💻 [Alibaba] Qwen3-Coder: Agentic Coding in the World
- OpenAI-, Claude-code compatible
- Qwen3-Coder: a 480B MoE model (35B active) trained on 7.5T tokens with the help of Qwen2.5-Coder
- 256K context by default, up to 1M tokens supported
📜 [Shanghai AI] The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs
- Points out that dLLMs are vulnerable to context-aware, masked-input adversarial prompts
- DIJA: generates adversarial interleaved mask-text prompts → an attack that exploits how dLLMs generate text, reported to far outperform other jailbreaking methods
📜 [Sapient Intelligence] Hierarchical Reasoning Model
- Hierarchical Reasoning Model (HRM): executes sequential reasoning tasks in a single forward pass
- Two interdependent recurrent modules
  - a high-level module responsible for slow, abstract planning
  - a low-level module handling rapid, detailed computations
- A 27M-parameter model trained on only 1,000 training samples
🧑🏻‍💻 [GitHub] GitHub Spark in public preview for Copilot Pro+ subscribers
- Releases Spark, a browser-based tool, to Copilot Pro+ subscribers
- Lets users build micro apps in natural language; powered by Claude Sonnet 4
🧑🏻‍💻 [HuggingFace] Trending Papers
- Hugging Face launches Trending Papers in collaboration with Meta & Papers with Code
📜 [Cardiff Univ] There’s No Such Thing as Simple Reasoning for LLMs (ACL 2025 Findings)
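- Current LLM work focuses on complex many-hop reasoning problems
- Yet points out that models fail on much simpler reasoning problems
- Tests models on simple problems solvable with 3-step reasoning by adding small amounts of noise (e.g., shuffling the order), and finds existing models quite vulnerable in this setting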
📜 [Stanford] Optimization before Evaluation: Evaluation with Unoptimized Prompts Can be Misleading (ACL 2025 Industry Track)
- Studies the effect of Prompt Optimization (PO) when evaluating on academic & internal industry benchmarks
- Reports that most models and benchmarks are significantly affected by PO
📜 [Shanghai AI, Fudan] Yume: An Interactive World Generation Model
- Aims to create interactive, realistic, dynamic worlds from image, text, and video
- Yume: takes an image as input and generates a dynamic world that can be explored with keyboard actions
- Uses a framework with four core components for high-fidelity & interactive video world generation
  - camera motion quantization, video generation architecture, advanced sampler, model acceleration
- Masked Video Diffusion Transformer (MVDT) with memory module

5th week

📜 [Anthropic, UC Berkeley] Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
- Subliminal Learning: the phenomenon where language models transmit behavioral traits through semantically unrelated data
- A teacher model with trait T generates a dataset consisting only of number sequences, and a student model trained on it can acquire trait T
- The same phenomenon is observed when training on code or reasoning paths generated by the teacher
🧑🏻‍💻 [Anthropic] Building and evaluating alignment auditing agents
- Three agents for automating alignment auditing: investigator, evaluation, breadth-first red-teaming
- Shows impressive results such as uncovering hidden goals and detecting misaligned behavior
- Concludes that vulnerabilities remain to prefill attacks, context-manipulated jailbreaks, and interpretability-driven safety failures
🧑🏻‍💻 [Runway] Introducing Runway Aleph | A new way to edit, transform and generate video.
- Launch of Aleph, an AI model for video editing
- Rather than generating video from scratch, edits the needed regions via text prompts
- For example, changing camera angles, removing objects, or adding effects like rain or fireworks
🧑🏻‍💻 [Z.ai] GLM-4.5: Reasoning, Coding, and Agentic Abililties
- A Chinese startup releases an LLM 87% cheaper than DeepSeek
- Performance on par with Claude 4 Sonnet and GPT-4.1 on coding benchmarks
- GLM-4.5: 355B total & 32B active parameters / GLM-4.5 Air: 106B total & 12B active parameters
  - Both are hybrid reasoning models supporting complex reasoning, tool use, and a non-thinking mode
📜 [Waterloo] Mind the Gap: Conformative Decoding to Improve Output Diversity of Instruction-Tuned Large Language Models
- Instruction tuning reduces the output diversity of LLMs
- Experiments on OLMo and OLMo 2 conclude that DPO has the largest impact
- Based on this, proposes conformative decoding: a decoding strategy that guides an instruct model to reintroduce the diversity of its base model
📜 [Renmin] Agentic Reinforced Policy Optimization
- Current LLM training focuses on single-turn settings and does not account for multi-turn tool interactions
- Agentic Reinforced Policy Optimization (ARPO)
  - Observes that the entropy of tokens generated right after external tool use rises
  - entropy-based adaptive rollout mechanism
📜 [Univ. of Alberta] Curiosity by Design: An LLM-based Coding Assistant Asking Clarification Questions
- Points out that current LLMs struggle to infer user intent without extensive prompt engineering or external context
- To address this, develops an LLM-based coding assistant that mimics the human code review process
  - Asks clarification questions on ambiguous or under-specified queries
- A trained query classifier detects unclear programming-related queries → a fine-tuned LLM generates clarification questions

🌞 June

1st week

📜 [Yale] Table-R1: Inference-Time Scaling for Table Reasoning
- Presents two post-training strategies that enable inference-time scaling on tabular data
  - distillation from a frontier model's reasoning steps
  - reinforcement learning with verifiable rewards (RLVR)
- Generates reasoning traces with DeepSeek-R1 for distillation
📜 [Cohere] Command A: An Enterprise-Ready Large Language Model
- A 111B LLM trained to handle real-world enterprise use cases well
- agent-optimized & multilingual-capable model (supports 23 languages), hybrid architecture
- Applies self-refinement & model merging techniques
📜 [Sakana AI] Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
- Darwin Godel Machine (DGM): self-improving system that iteratively modifies its own code & empirically validates each change
- Aims to optimize coding agents in which frozen foundation models read, write, and execute code via tool use
📜 [UC Berkeley, Yale] Learning to Reason without External Rewards
- Problem: training LLMs for complex reasoning with Reinforcement Learning with Verifiable Rewards (RLVR) is too expensive
- → Reinforcement Learning from Internal Feedback (RLIF): learns from intrinsic signals without external rewards or labeled data
- Intuitor: uses the model's own confidence (self-certainty) as the sole reward signal, substituting it into the existing GRPO setup
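One way to picture a confidence-based intrinsic reward like the one in the entry above is to score a response by how far its next-token distributions are from uniform. The sketch below computes the average KL divergence from the uniform distribution to the model's distribution over a sequence. The formula is an assumption chosen for illustration and should be checked against the paper, and `self_certainty` is a hypothetical helper operating on raw logits.

```python
import numpy as np

def self_certainty(logits) -> float:
    """Mean KL(Uniform || p_t) over token positions; higher means more confident.

    logits has shape (sequence_length, vocab_size).
    """
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max(axis=-1, keepdims=True)            # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))   # log-softmax
    vocab = logits.shape[-1]
    kl = -np.log(vocab) - logp.mean(axis=-1)                   # KL(U || p) per position
    return float(kl.mean())

flat = np.zeros((3, 5))              # uniform predictions: no confidence
peaked = np.zeros((3, 5))
peaked[:, 0] = 10.0                  # strongly prefers one token
score_flat = self_certainty(flat)
score_peaked = self_certainty(peaked)
```

A score like this needs no verifier or labels, which is what lets it stand in for an external reward.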
🧑🏻‍💻 AgenticSeek: Private, Local Manus Alternative.
- A Manus AI-style agent library that runs 100% locally
- Features include web search, code writing, task planning, agent selection, and voice support
📜 [UIUC, UC Berkeley] AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
- Framework for modulating an LLM's reasoning progress at test time
- Calls the scaled thinking phase the $\alpha$ moment; the $\alpha$ moment is when slow thinking takes place
🧑🏻‍💻 [ElevenLabs] Introducing ElevenLabs Conversational AI 2.0
- Natural voice interaction through real-time turn-taking; filler words like “um” are filtered naturally
- Better suited to enterprise use: RAG can be connected to private files or proprietary data sources
📜 [Kakao] A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs
- Current LLMs are not yet able to show conversational ability while following service-specific constraints
- Case study on a conversational agent for the e-commerce domain
- Curator's note: possibly an attempt to extend conversational agents to broader areas on top of Kanana
📜 [Alibaba] QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning
- Current LRMs focus on short-context reasoning tasks
- QwenLong-L1: a framework applying progressive context scaling to adapt short-context LRMs to long-context scenarios
- warm-up SFT stage → curriculum-guided phased RL
- QwenLong-L1-32B outperforms OpenAI-o3-mini, Qwen3-235B-A22B, and others
📜 [Renmin Univ.] Do not Abstain! Identify and Solve the Uncertainty
- Study aimed at improving LLMs' ability to recognize & address the causes of their uncertainty
- ConfuseBench: covers three types of uncertainty: document scarcity, limited capability, query ambiguity
- Proposes generating context-aware inquiries that highlight the confusing aspect of the original query, and using them to determine the source of uncertainty
📜 [HuggingFace] SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
- Recent trend: adapting VLMs into vision-language-action (VLA) models instead of training robotic policies from scratch
- SmolVLA: a small, efficient, community-driven VLA with low training & inference cost
📜 [Meta, DeepMind, Cornell, NVIDIA] How much do language models memorize?
- Measures language model capacity with a new method for estimating how much a model “knows” about a datapoint
- Splits memorization into unintended memorization & generalization
  - Removing generalization lets one compute the model's total memorization and estimate model capacity
- GPT-family models have a capacity of about 3.6 bits per parameter
📜 [Meta] LlamaFirewall: An open source guardrail system for building secure AI agents
- open-source security focused guardrail framework
- Aims to mitigate prompt injection, agent misalignment, and insecure code risks
  - PromptGuard 2: universal jailbreak detector
  - Agent Alignment Checks: CoT auditor
  - CodeShield: online static analysis engine
- Includes scanners that make it easy to update guardrails via regular expressions or prompts

2nd week

📜 [Apple] The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Existing evaluations of LRMs center on final-answer accuracy
- Examines models' internal reasoning traces across various puzzle environments to gain insight into how LRMs “think”
- Points out that reasoning effort rises up to a certain problem difficulty and then declines, revealing a scaling limit
- On low-difficulty problems standard LLMs perform much better; on hard problems both standard LLMs and LRMs collapse
📜 [Stanford, NYU] From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
- Humans perform semantic compression by organizing knowledge into categories; this study analyzes how LLMs behave in this respect
- There is a trade-off between expressive fidelity & representational simplicity, and models miss fine-grained semantic distinctions that matter for human understanding
- LLMs also show a bias toward aggressive statistical compression
📜 [UC Santa Cruz, Stanford] Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains
- Analyzes reasoning models by splitting thinking trajectories into knowledge & reasoning parts in the medical & mathematical domains
- Proposes a fine-grained evaluation framework
  - (1) the correctness of the knowledge used (Knowledge Index (KI))
  - (2) the quality of reasoning (Information Gain (IG))
- Finds that reasoning ability acquired in one domain does not transfer to another
📜 [Stanford] OpenThoughts: Data Recipes for Reasoning Models
- Builds training datasets for open-source models on par with proprietary ones
- Trains OpenThinker2-32B on the OpenThoughts2-1M dataset; performance comparable to DeepSeek-R1-Distill-32B
- Further refines the dataset to produce OpenThoughts3
📜 [CMU] Coding Agents with Multimodal Browsing are Generalist Problem Solvers
- Study of methods and essential tools for improving the generalization of AI agents
  - Points out that existing agents are specialized to particular domains or tasks and do not generalize
- OpenHands-Versa: a generalist agent built with a modest number of general tools
📜 [Microsoft, Peking, Tsinghua] Reinforcement Pre-Training
- Reinforcement Pre-Training (RPT): reframes next-token prediction as a reasoning task used in RL
  - The model receives verifiable rewards for correctly predicting the next token in a given context
- Presented as a scalable method that can leverage vast amounts of text data for general-purpose RL
- A strong pre-trained foundation for further reinforcement fine-tuning
📜 [ByteDance] Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
- Dolphin: a multimodal document image parsing model following an analyze-then-parse paradigm
- Generates a sequence of layout elements in reading order and uses them as anchors
- Anchors are paired with task-specific prompts and used for parallel content parsing in the next stage
- A dataset of over 30M samples covering multi-granularity parsing tasks
📜 [Cambridge] Truly Self-Improving Agents Require Intrinsic Metacognitive Learning (ICML 2025)
- Current self-improving agents have overly rigid self-improvement processes that fail to generalize & scale
- Inspired by human metacognition, proposes a framework with three components
  - metacognitive knowledge, metacognitive planning, metacognitive evaluation
- Explains that what existing agents learn follows extrinsic metacognitive mechanisms
📜 [Claude Opus] Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- A paper listing the Claude Opus model as first author, critiquing the experimental results of Apple's recent Illusion of Thinking paper
📜 [MIT] Self-Adapting Language Models
- Self-Adapting LLMs (SEAL): a framework that lets an LLM self-adapt by generating its own finetuning data and update directives
- self-edit: given a new input, the model restructures the information itself, specifies hyperparameters, and so on
- To teach the model effective self-edits, applies reinforcement learning that uses the updated model's performance as the reward signal
- Unlike prior methods that rely on separate adaptation modules or auxiliary networks, it uses the model's own generations directly to parametrize & control the adaptation process

3rd week

🧑🏻‍💻 [OpenAI] Launching OpenAI o3-pro
- Official release of o3-pro, a model that thinks longer and answers from deeper understanding even if responses are slower
- Supports memory for personalized answers
- Outperforms o3 and o1-pro on math, coding, and science benchmarks; the pass@1 results are impressive
📜 [Huawei] SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
- Benchmarks for the GitHub issue resolution task are hard to build because of environment setup, grading, and task instance validation
- SWE-Factory
  - SWE-Builder: a multi-agent system that automates evaluation environment construction
  - exit-code-based grading method: no need to hand-write custom parsers
  - Automates the fail2pass validation process using reliable exit code signals
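The exit-code-based grading idea in the SWE-Factory entry above can be sketched with nothing more than process return codes. The snippet is a minimal illustration, not SWE-Factory's actual code: `run_exit_code` and `is_fail2pass` are hypothetical helpers, and the two one-line Python commands merely stand in for a real test suite run before and after a patch.

```python
import subprocess
import sys

def run_exit_code(cmd) -> int:
    """Run a command and return its exit code (0 conventionally means success)."""
    return subprocess.run(cmd, capture_output=True).returncode

def is_fail2pass(code_before: int, code_after: int) -> bool:
    """A valid instance fails before the fix and passes after it."""
    return code_before != 0 and code_after == 0

# Stand-ins for "run the tests before the patch" and "run them after the patch".
tests_before_patch = [sys.executable, "-c", "import sys; sys.exit(1)"]
tests_after_patch = [sys.executable, "-c", "import sys; sys.exit(0)"]

before = run_exit_code(tests_before_patch)
after = run_exit_code(tests_after_patch)
valid_instance = is_fail2pass(before, after)
```

Grading on exit codes avoids parsing framework-specific test logs, which is why no custom parser is needed per repository.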
📜 [Rice, Johns Hopkins, NVIDIA] Play to Generalize: Learning to Reason Through Game Play
- Visual Game Learning (ViGaL): MLLMs acquire multimodal reasoning ability that generalizes out of distribution (OOD) by playing arcade-style games
- A 7B model trained on games like Snake improves on MMMU despite never seeing solutions, equations, or diagrams during RL: transferable reasoning skills
- Therefore argues that synthetic, rule-based games can serve as controllable & scalable pretext tasks for generalizable multimodal reasoning abilities in MLLMs
📜 [Sakana AI] Text-to-LoRA: Instant Transformer Adaption
- Hypernetwork-based approach that instantly generates LoRA adapters from a natural language task description
- Text-to-LoRA (T2L): a model that compresses many LoRA adapters and generalizes to unseen tasks
📜 [Meta] V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
- V-JEPA 2: a scalable joint-embedding predictive architecture for self-supervised video learning
- 2-stage training
  - action-free pretraining on 1M+ hours of internet videos and images
  - post-training with only 62 hours of unlabeled robot trajectories (Droid dataset)
- Features include self-supervised robot planning and architectural scale-up
📜 [Microsoft, UCLA] Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks
- Direct Reasoning Optimization (DRO): a framework for fine-tuning LLMs on open-ended, long-form reasoning tasks using Reasoning Reflection Reward (R3)
- Identifies & emphasizes key tokens in the preceding CoT reasoning → captures the consistency between reasoning & reference outcome at a fine-grained level
- R3 uses internal computations of the model being optimized, which enables a self-contained training setup
📜 [Google DeepMind] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.
- Releases Gemini 2.5 Pro & Gemini 2.5 Flash
- Achieves SoTA on coding & reasoning benchmarks
- Gemini 2.5 Pro shows strong multimodal understanding, able to comprehend 3-hour-long videos
📜 [MIT] Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task
- Essay-writing experiment with three groups: LLM, Search Engine, and Brain-only
- The group that used an LLM showed less coordinated neural effort than the other groups
- Essay quality received similar ratings from the AI judge & human teachers, but scored lower than the other groups in NER/n-gram analysis
📜 [Yale, Columbia, …] MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation
- Covers modalities (text, vision, audio) for the global financial domain
- Covers linguistic settings (monolingual, bilingual, multilingual) for domain-specific tasks
- Releases PolyFiQA-Easy & PolyFiQA-Expert: benchmarks requiring complex reasoning over mixed-language inputs
- Proposes a dynamic difficulty-aware selection mechanism instead of simple aggregation of existing datasets
🧑🏻‍💻 [Anthropic] SHADE-Arena: Evaluating sabotage and monitoring in LLM agents
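- Benchmark for checking sabotage: situations where an AI model appears to perform a task normally while actually deceiving the user
- Each task consists of a main task & a harmful side task
- Dual monitoring system, stealth evaluation (not mere success, but success without being caught), with complexity and realism taken into account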
📜 [MiniMax] MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
- The world's first open-weight, large-scale hybrid-attention reasoning model (hybrid MoE & lightning attention mechanism)
- Supports 1M context length; emphasizes compute efficiency
- CISPO: a novel RL algorithm that clips importance sampling weights rather than token updates
- Highlights that training took 3 weeks on 512 H800 GPUs at a cost of $534,700
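The difference between clipping token updates and clipping importance sampling weights, as described for CISPO above, can be seen in the per-token coefficient that multiplies the log-probability gradient. The sketch below is an illustrative NumPy comparison under stated assumptions: the PPO-style rule is the standard clipped surrogate, the CISPO-style rule treats the clipped weight as a constant multiplier, and the function names and epsilon values are hypothetical rather than taken from the paper.

```python
import numpy as np

def ppo_grad_coeff(ratio, adv, eps: float = 0.2):
    """Gradient weight per token under the PPO clipped surrogate.

    Tokens whose clipped branch is active contribute zero gradient.
    """
    ratio, adv = np.asarray(ratio, float), np.asarray(adv, float)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    unclipped_active = (ratio * adv) <= (clipped * adv)
    return np.where(unclipped_active, ratio * adv, 0.0)

def cispo_grad_coeff(ratio, adv, eps_low: float = 0.2, eps_high: float = 0.2):
    """Gradient weight per token when only the importance weight is clipped.

    Every token keeps a nonzero weight; the weight is just bounded.
    """
    ratio, adv = np.asarray(ratio, float), np.asarray(adv, float)
    return np.clip(ratio, 1 - eps_low, 1 + eps_high) * adv

ratios = [0.5, 1.0, 2.0]      # pi_new / pi_old for three tokens
advantages = [1.0, 1.0, 1.0]
ppo = ppo_grad_coeff(ratios, advantages)
cispo = cispo_grad_coeff(ratios, advantages)
```

In this toy case the third token, whose ratio has drifted to 2.0, is dropped entirely by the PPO-style rule but still contributes with a bounded weight under the CISPO-style rule.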
📜 [OpenAI] Persona Feature Control Emergent Misalignment
- Toward understanding and preventing misalignment generalization: OpenAI blog
- Prior work showed that deliberately fine-tuning GPT-4o on insecure code makes it return malicious responses even to unrelated prompts (emergent misalignment)
- model diffing approach: uses a sparse autoencoder to compare internal model representations before and after fine-tuning
- This identified a misaligned persona feature in activation space, which means one can predict whether the model will show such (malicious) behavior → re-alignment is also reported to be possible
📜 [ByteDance] Seedance 1.0: Exploring the Boundaries of Video Generation Models
- high-performance & inference-efficient video foundation generation model
  - (1) multi-source data curation with precision and meaningful video captioning
  - (2) a training paradigm that natively supports multi-shot generation & jointly learns text-to-video and image-to-video tasks
  - (3) post-training approaches including fine-grained SFT & video-specific RLHF with multi-dimensional reward mechanisms
  - (4) 10x inference speedup through multi-stage distillation strategies & system-level optimizations

4th week

📜 [Huawei] RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning
- RAG+: an extension that explicitly integrates application-aware reasoning into the RAG pipeline
- Constructs a dual corpus of knowledge & aligned application examples → retrieves both jointly at inference time
- LLMs can not only access relevant information but also apply it in structured & goal-oriented reasoning processes
📜 [Stanford] Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce
- Large-scale framework for how AI agents automate or augment human labor
- WORKBank: a database combining worker desires & expert assessments across 844 tasks and 104 occupations
- Human Agency Scale (HAS): quantifies the desired human involvement in AI-agent-supported work
- 4 AI deployment zones: Automation Green Light, Red Light, R&D Opportunity, Low Priority
https://magenta.tensorflow.org/magenta-realtime?utm_source=alphasignal
🧑🏻‍💻 [ElevenLabs] Introducing 11ai: the voice-first AI assistant that takes action
- Combining voice-first interaction with MCP lets the AI assistant take actions
- Connects through MCP to Salesforce, HubSpot, Gmail, Zapier, and more
- Out-of-the-box integrations for Perplexity, Linear, Slack, and Notion
- Features: Ultra-low latency, Multimodal support, Integrated RAG, Automatic language detection, Enterprise-ready
📜 [Sakana AI] Reinforcement Learning Teachers of Test Time Scaling
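- GitHub link 🔗
- Points out that RL for LLMs currently relies on one-hot correctness, which makes it overly dependent on initialization, and that even well-trained RL models end up serving as teacher models that solve the cold-start problem in distillation
- Reinforcement-Learned Teachers (RLT): takes each problem's question & solution as input → performs the task of ‘connecting the dots’ between them to give students a detailed explanation
- The explanation is given to students to check their understanding of the solution, and this yields dense rewards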
📜 [Cornell] Memento: Note-Taking for Your Future Self
- Recent LLMs excel at reasoning-only tasks but are weak in retrieval scenarios such as multi-hop questions
- Memento (prompt strategy): 1) break a complex question into smaller steps 2) dynamically construct a database using an LLM 3) recombine the sub-problems to solve the original question
📜 [Oxford, Amazon, Cambridge] Distilling Tool Knowledge into Language Models via Back-Translated Traces
- LLMs are weak at math problems that require exact computation or multi-step algebraic reasoning
- Tool-integrated reasoning (TIR) scales poorly because of inference-time dependencies
- Proposes a paradigm for distilling tool knowledge into LLMs through natural language
- Solver Agent: solves math problems by interleaving planning, symbolic tool calls, and reflective reasoning
  - Converts TIR traces into natural language reasoning traces using a back-translation pipeline built on multiple LLM-based agents
  - Translator Agent: generates an explanation for each tool call
  - Rephrase Agent: merges them into a coherent narrative
- Reports that fine-tuning open-source models on the synthesized traces helped them internalize tool knowledge & structured reasoning patterns
🧑🏻‍💻 [Google DeepMind] AlphaGenome: AI for better understanding the genome
- DNA sequence model; API available
- Predicts how single variants or mutations in human DNA sequences affect the biological processes that regulate genes
🧑🏻‍💻 [Anthropic] Agentic Misalignment: How LLMs could be insider threats
- Study of 16 major models: most of them, when pressured within the designed scenarios, took unethical actions such as blackmail and document leaks
- When a model recognized that it was in a test scenario, the probability of misbehavior dropped sharply
- In the results, Claude Opus 4 shows the highest blackmail rate → a remarkably candid finding
🗞️ [Meta] Introducing Oakley Meta Glasses, a New Category of Performance AI Glasses
- Latest AI glasses launched by Meta in collaboration with Oakley
- Battery lasts 8 hours of typical use on a full charge, 19 hours on standby
- High-resolution camera capable of Ultra HD (3K) video
- Built-in personal AI assistant; well suited to sports
📜 [Ohio, Amazon] Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
- Mind2Web 2: a benchmark of 130 realistic, high-quality, long-horizon tasks that require real-time web browsing & extensive information synthesis
- Proposes an Agent-as-a-Judge framework to evaluate them
  - Constructs task-specific judge agents based on tree-structured rubrics to assess answer correctness & source attribution
📜 [Ai2] OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization
- OMEGA (Out-of-distribution Math problems Evaluation with 3 Generalization Axes)
- (1) Exploratory: apply known problem-solving skills to harder problems within the same domain
- (2) Compositional: combine distinct reasoning skills, learned in isolation, in new & coherent ways
- (3) Transformative: apply familiar approaches unconventionally to new settings
- Consists of programmatically generated train-test pairs covering geometry, number theory, algebra, and more
📜 [Skoltech] Complexity-aware fine-tuning
- Splits the training data by complexity (entropy) before training the model
- Fine-tuning on easy & medium data and distilling on hard data gave better results than plain SFT
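The routing step described in the entry above (split by entropy, then fine-tune on easy and medium samples and distill on hard ones) can be sketched as a simple bucketing function. This is only an illustration: the thresholds, the sample entropies, and the helper name `split_by_entropy` are hypothetical, and how the entropy is measured is left outside the sketch.

```python
from typing import Dict, List, Tuple

def split_by_entropy(
    samples: List[Tuple[str, float]], low: float, high: float
) -> Dict[str, List[str]]:
    """Bucket (sample_id, entropy) pairs into easy / medium / hard."""
    buckets: Dict[str, List[str]] = {"easy": [], "medium": [], "hard": []}
    for sample_id, entropy in samples:
        if entropy < low:
            buckets["easy"].append(sample_id)
        elif entropy < high:
            buckets["medium"].append(sample_id)
        else:
            buckets["hard"].append(sample_id)
    return buckets

# Hypothetical per-sample entropies of a model's answers.
samples = [("q1", 0.1), ("q2", 0.8), ("q3", 2.5)]
buckets = split_by_entropy(samples, low=0.5, high=1.5)

sft_set = buckets["easy"] + buckets["medium"]   # trained with plain fine-tuning
distill_set = buckets["hard"]                   # trained with teacher reasoning traces
```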
📜 [Ai2] Language Modeling by Language Models
- Can LLMs be used to discover new LM architectures?
- Uses multi-agent LLMs to simulate research from the proposal stage through code generation to verification
- Proposes the Genesys system, which uses a Ladder of Scales approach: propose → review → verify → large scale
🧑🏻‍💻 [Anthropic] Desktop Extensions: One-click MCP server installation for Claude Desktop
- Desktop Extensions (.dxt files) allow one-click installation of MCP servers
- Installing MCP servers previously suffered from problems such as requiring developer tools, manual configuration, dependency management, and update complexity
- .dxt file download → Claude Desktop open → Click “Install”
📜 [Baidu] Towards AI Search Paradigm
- A search system that can emulate human information processing & decision-making
- Uses LLM-powered agents to dynamically access a wide range of information (from simple factual queries to complex multi-stage reasoning tasks)
- Assesses query complexity, decomposes problems into executable plans, and solves them through tool usage, task execution, and content synthesis (MCP)
📜 [Google] Performance Prediction for Large Systems via Text-to-Text Regression
- A 60M-parameter encoder-decoder model that handles tabular data
- Adapts to new tasks with only 500 few-shot examples
- Stresses the importance of using an encoder, increasing sequence length, and the model's inherent uncertainty quantification

🏕️ May

1st week

🧑🏻‍💻 [Google] DolphinGemma: How Google AI is helping decode dolphin communication
- On National Dolphin Day, released DolphinGemma, the result of a collaboration with Georgia Tech and the Wild Dolphin Project (WDP)
- A model that learns the structure of dolphin vocalizations and generates dolphin-like sound sequences
- Google Pixel phones can be used in the Cetacean Hearing Augmentation Telemetry (CHAT) system
🧑🏻‍💻 [Google] Introducing TxGemma: Open models to improve therapeutics development
- Open models for making LLM-based therapeutic development more efficient
- Trained to understand and predict the properties of therapeutic entities across the entire discovery process
- Can identify promising targets and even predict clinical trial outcomes
- Trained on 7M examples and available in 2B, 9B, and 27B sizes
🧑🏻‍💻 [DeepSeek AI] DeepSeek-Prover-V2-671B
- Synthesizes cold-start reasoning data through recursive proof search
  - Uses DeepSeek-V3 for subgoal decomposition and formalization
  - Applies reinforcement learning on the resulting data
- ProverBench: Formalization of AIME and Textbook Problems
  - Introduces a benchmark of 325 problems
  - 15 are number theory and algebra questions from AIME competitions
  - The remaining 310 are curated textbook examples and educational tutorials
- Releases models in two sizes, 7B and 671B
  - The 671B model is trained on top of DeepSeek-V3-Base
  - The 7B model is trained on top of DeepSeek-Prover-V1.5-Base and has a 32K context window
📜 [Cohere, Princeton, Stanford, Waterloo, MIT, Ai2, Washington] The Leaderboard Illusion
- Analyzes systematic issues in Chatbot Arena, which is used to evaluate LLM performance
  - Points out that undisclosed private testing before public release favors certain providers
  - Explains that selective disclosure of performance results biases the Arena, and that many models are now overfitted to it
- Proprietary closed models (Google, OpenAI) are picked for battles at higher rates, so they get more data access than open-source models
  - Google and OpenAI account for 19.2% and 20.4% respectively, while the remaining 83 open-weight models together account for 29.7%
  - Even conservative estimates put the relative performance gains at around 112%
🧑🏻‍💻 [Ai2] OLMo 2 1B
- Presented as the best-performing model among small models of the same size (Gemma 3 1B, Llama 3.2 1B)
- Trained on 4T tokens through mid-training, including OLMo-mix-1124 and Dolmino-mix-1124
- Post-training uses an OLMo-specific variant of the Tülu 3 dataset for SFT
- Followed by DPO training on olmo-2-0425-1b-preference-mix and finally RLVR training
📜 [Renmin Univ.] DeepCritic: Deliberate Critique with Large Language Models
- It is already well known that using LLMs as critique models for generated outputs leads to automated supervision
  - This work focuses on the math critique ability of LLMs
- Proposes a two-stage framework that makes the model deliberately critique each reasoning step of a math solution
  - (1) Uses Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for SFT
  - (2) Applies RL to the fine-tuned model using either existing human-labeled data from PRM800K or data annotated automatically via Monte Carlo sampling-based correctness estimation
🧑🏻‍💻 [Anthropic] Claude can now connect to your world
- Extends Claude's Research feature beyond the web and Google Workspace to personal Integrations, researching for up to 45 minutes before answering
- Integrations: Claude works with remote MCP servers across the web and desktop apps
- Supports Jira & Confluence, Zapier, Cloudflare, Intercom, Asana, Square, Sentry, PayPal, Linear, and Plaid
📜 [KAIST, DeepAuto.ai] Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
- Machine learning papers rarely ship runnable code, and reproducing them is slow and labor-intensive
- PaperCoder: a multi-agent LLM framework that turns machine learning papers into functional code repositories in three stages
  - (1) Planning: builds a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files
  - (2) Analysis: interprets implementation-specific details
  - (3) Generation: produces modular, dependency-aware code
  - Each stage is carried out by a specialized agent
- The generated code is then assessed through model-based and human evaluations
📜 [mem0.ai] Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
- A memory-centric architecture that lets LLM agents stay coherent across long conversations and sessions
- Consists of two systems
  - Mem0: dense & language-based memory system
  - Mem0g: enhanced version with graph-based memory to model complex relationships
- Mem0 showed the lowest search and total latencies on the benchmark, and Mem0g outperformed other graph-based or RAG systems in speed and efficiency
📜 [KAIST, DeepAuto.ai] UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities
- A RAG system that supports diverse modalities (text, image, video) and granularities (paragraph vs. document, clip vs. video)
- Modality-aware routing: a router dynamically selects the appropriate modality for each query
- Granularity-aware retrieval: each modality is split into granularity levels so that content matching the query's complexity is retrieved
- Flexible routing: supports both training-free (zero-shot GPT-4o prompting) and trained (T5-Large) routers
📜 [Amazon] SLOT: Structuring the Output of Large Language Models
- SLOT: a model-agnostic approach that transforms unstructured LLM outputs into precise structured formats
- Existing methods rely on constrained decoding or on specific models
- SLOT uses a fine-tuned lightweight language model as a post-processing layer
- Proposes an evaluation methodology to quantify schema accuracy and content fidelity
- A fine-tuned Mistral-7B model with constrained decoding reaches a score of about 99.5%

2nd week

📜 [Meta] PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
- Perception Language Model (PLM): an open and reproducible framework for image and video understanding research
- Analyzes training pipelines without distillation from proprietary models and explores large-scale synthetic data
- Releases 2.8M human-labeled fine-grained video question-answer pairs and spatio-temporally grounded video captions
- PLM-VideoBench: a benchmark for evaluating "what, where, when, how" reasoning about videos
📜 [NVIDIA] Llama-Nemotron: Efficient Reasoning Models
- An open model family with strong reasoning ability, inference efficiency, and an open license for enterprise use
- Comes in Nano (8B), Super (49B), and Ultra (253B) sizes, with performance comparable to DeepSeek-R1 and better inference throughput and memory efficiency
- The first open-source models to support a dynamic reasoning toggle
  - Users can switch between standard chat and reasoning modes
🧑🏻‍💻 [OpenAI] Evolving OpenAI’s structure
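- OpenAI halted its review of becoming a for-profit company and decided to keep its nonprofit position
- Says this will let it raise larger investments and focus on AGI development
- Plans to open-source capable models later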
🧑🏻‍💻 [Alibaba] Qwen-Agent
- Lets you build tool-using LLM agents with planning, memory, and multi-turn function calling
- Supports code execution, document reading, web browsing, and RAG workflows
📜 [Beijing Univ.] RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation
- Even though models have many tools available, as with MCP, prompt bloat and selection complexity keep them from using the tools well
- RAG-MCP: semantically retrieves the MCP(s) most relevant to a given query
- Passes only the selected tool descriptions to the model, which reduces prompt size and simplifies decision-making
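The retrieve-then-prompt idea above can be illustrated with a small sketch. It swaps in bag-of-words cosine similarity for the semantic retriever so that it runs with the standard library alone; the tool names and descriptions are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_tools(query, tools, k=1):
    """Return the names of the k tool descriptions most similar to the query."""
    q = bow(query)
    ranked = sorted(tools, key=lambda name: cosine(q, bow(tools[name])), reverse=True)
    return ranked[:k]

# Hypothetical tool registry; only the selected descriptions would go into the prompt.
TOOLS = {
    "weather": "get the current weather forecast for a city",
    "calendar": "create or list calendar events and meetings",
    "search": "search the web for pages matching a query",
}
selected = select_tools("what is the weather forecast in Seoul", TOOLS, k=1)
prompt_tools = {name: TOOLS[name] for name in selected}
```

Only `prompt_tools` would be passed to the model, so the prompt grows with `k` and not with the size of the registry.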
📜 [Anthropic] Reasoning Models Don't Always Say What They Think
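- Argues that monitoring a model's reasoning through its CoT is not fully reliable
- Evaluates CoT faithfulness using six kinds of hints presented in the prompt
  1. When models actually use a hint, they reveal it in the CoT at least 1% of the time, but in most cases less than 20%
  2. Outcome-based RL improves faithfulness, but only early on, and it quickly plateaus
  3. Even when RL increases how often hints are used (reward hacking), the frequency of mentioning them in the CoT does not increase
- Argues that test-time CoT monitoring is unlikely to reliably catch unexpected behaviors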
🧑🏻‍💻 [Mistral AI] Medium is the new large.
- Released a mid-sized model that runs on four GPUs while scoring at or above 90% of Claude Sonnet 3.7
- Also suited to enterprise use in private, high-context, domain-specific settings
  - Supports custom post-training and continuous pretraining
  - Used in the finance, energy, and healthcare domains
  - Can run in self-hosted or virtual private cloud setups
🧑🏻‍💻 Zed: The Fastest AI Code Editor
- An open-source code editor written in Rust
- Privacy and security mode is the default; providing feedback is optional
- Supports APIs from Claude, OpenAI, Google, and others, and can also use Ollama-based models that run on your own compute
  - Reportedly the only feature unavailable with Ollama is Edit Predictions
- Supports MCP
📜 [Harbin Institute of Technology] Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
- Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm for complex and diverse environments
- Multimodal reasoning has evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding
- Instruction tuning and reinforcement learning have driven major progress, but limits remain in omni-modal generalization, reasoning depth, and agentic behavior
- Organizes the survey along this trajectory: task-specific modules, Multimodal CoT (MCoT), then native large multimodal reasoning models (N-LMRMs)
📜 [Univ. of Chicago] Mitigating Memorization In Language Models
- ICLR 2025 Spotlight poster
- Presents methods for mitigating memorization in language models
  - 3 regularizer-based, 3 finetuning-based, 11 machine unlearning-based
  - Regularizer-based methods are slow and ineffective, fine-tuning works well but is expensive, and machine unlearning works best, with BalancedSubnet the strongest among them
- TinyMem: small, computationally-efficient LMs for the rapid development and evaluation of memorization-mitigation methods
📜 [Alibaba] ZeroSearch: Incentivize the Search Capability of LLMs without Searching
- ZeroSearch: open-sources a method for training LLMs without search APIs
- The policy model learns from simulated documents instead of search API results
  - A language model generates 20 documents for each query
  - Reward signals are based on final answer quality
- Trains 3B, 7B, and 14B models and improves their multi-step QA ability
- Learning with curriculum rollout: retrieval noise increases as training progresses
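The curriculum rollout idea, where simulated retrieval gets noisier as training proceeds, can be sketched with a simple schedule. The linear shape, the function name `noise_probability`, and the default endpoints are assumptions for illustration and should not be read as the paper's schedule.

```python
def noise_probability(step, total_steps, p_start=0.0, p_end=0.5):
    """Linearly interpolate the chance of serving a noisy simulated document.

    The linear shape and the default endpoints are assumptions for
    illustration, not the schedule used in the paper.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return p_start + frac * (p_end - p_start)

# Noise level at the start, middle, and end of a 100-step run.
schedule = [noise_probability(s, 100) for s in (0, 50, 100)]
```

A training loop would draw a random number per simulated document and return a noisy one whenever the draw falls below the current probability.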
📜 [Shanghai Jiao Tong Univ.] A Survey of AI Agent Protocols
- Surveys existing agent protocols and classifies them as context-oriented vs. inter-agent and as general-purpose vs. domain-specific
- Also examines them from the perspectives of security, scalability, and latency

3rd week

📜 [Microsoft, Salesforce] LLMs Get Lost In Multi-Turn Conversation
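- A large-scale simulation experiment comparing LLM performance in single-turn and multi-turn settings
- Reports that top open- and closed-weight LLMs perform significantly worse in multi-turn than in single-turn conversations
- Analysis of 200,000+ simulated conversations splits the degradation into two parts: a minor loss in aptitude and an increase in unreliability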
- 결론: when LLMs take a wrong turn in a conversation, they get lost and do not recover
📜 [Texas A&M Univ.] LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities
- In on-device scenarios even SLMs undergo size optimization, which increases fairness, ethical, and privacy risks
- LiteLMGuard: an on-device prompt guard that provides real-time, prompt-level defense for quantized SLMs
  - Claims to be applicable regardless of model architecture
- After training several DL models on the Answerable-or-Not dataset, ELECTRA was selected as the candidate
🧑🏻‍💻 [Sakana AI] Continuous Thought Machines
- Continuous Thought Machine (CTM): an AI model that uses the synchronization of neural activity as its core reasoning mechanism
- By using neuron-level timing information, it reportedly captures more complex neural behavior and decision-making processes than before
- A key point is that the model can “think” step by step, making its reasoning more interpretable and human-like
- CTM publication
📜 [CWI] How well do LLMs reason over tabular data, really?
- Is the tabular reasoning ability of general-purpose LLMs robust enough to handle real-world tabular inputs?
- How can the performance of language models on tabular queries be evaluated?
- Explains that LLM-as-a-Judge is more reliable than multiple-choice prompt evaluation or BERT-score
📜 [ByteDance] Seed1.5-VL Technical Report
- A vision-language foundation model designed for general-purpose multimodal understanding and reasoning
- 532M-parameter encoder, MoE LLM (20B active params)
- Reported to perform well on agent-centric tasks such as GUI control and gameplay
📜 [Tsinghua] Absolute Zero: Reinforced Self-play Reasoning with Zero Data
- Points out that reinforcement learning with verifiable rewards (RLVR) requires manually curated training data (questions and answers)
- Absolute Zero: a single model maximizes and improves its own learning progress without relying on external data
- Absolute Zero Reasoner (AZR): a system that uses a code executor to self-evolve its training curriculum and reasoning ability
🧑🏻‍💻 [OpenAI] Introducing HealthBench
- Open-sources a dataset of 5,000 multi-turn conversations for evaluating AI in health contexts (annotated with physician-written rubrics and evaluated using GPT-4.1)
- Each case consists of a dialogue, prompt, model output, and rubric in JSON format
- Dataset and grader code are available under a research-use license
📜 [Salesforce] BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
- Uses a diffusion transformer to generate semantically rich CLIP image features
  - → Improves training efficiency and generative quality
- Explains that pretraining on image understanding first and image generation second was effective
- Builds BLIP3o-60k, a high-quality instruction-tuning dataset, using GPT-4o
🧑🏻‍💻 [ByteDance] DeerFlow
- A Deep Research assistant equipped with search engines, web crawlers, Python, and MCP servers
- A system composed of agents such as Coordinator, Planner, and Reporter
- Built with LangChain and LangGraph, it supports human-in-the-loop and can also generate podcasts from the produced reports
🧑🏻‍💻 [Google] AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms
- Gemini-based coding agent
- Extends beyond the single function call handled by the AlphaTensor model to cover an entire codebase
- Uses Gemini Flash for fast idea generation and Gemini Pro for deeper analysis
🧑🏻‍💻 [LangChain] open-agent-platform
- No-code agent building platform
- Connects tools, RAG servers, and other agents through an Agent Supervisor
- Web-based interface for creating, managing and interacting with LangGraph agents

4th week

📜 [Chinese Academy of Sciences] Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL
- To address the over-thinking problem, lets the LRM decide whether to reason explicitly based on problem complexity
- Notes that simply including an ellipsis (“…”) in the prompt already has a fairly positive effect
- AutoThink: a multi-stage reinforcement learning (RL) framework that optimizes reasoning policies through stage-wise reward shaping
📜 [Singapore, Tsinghua, Salesforce] Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models
- Because the timing and consistency of a model's “aha moment” are unpredictable and uncontrollable, LRM performance is hard to scale or trust
- To address this, goes beyond prompts and accidental “aha moments” and trains the model to align with three meta-abilities: deduction, induction, and abduction
- three-stage pipeline: individual alignment, parameter-space merging, domain-specific reinforcement learning
📜 [KAIST] System Prompt Optimization with Meta-Learning
- Proposes a bilevel system for designing system prompts that are robust to diverse user prompts and transferable to unseen tasks
- Meta-learning framework: updates user prompts as well as the system prompt
🧑🏻‍💻 [HuggingFace] Welcome to the 🤗 Model Context Protocol (MCP) Course
- Hugging Face's course on MCP
🧑🏻‍💻 [Alibaba] Qwen3 Technical Report
- Dense and MoE architectures, from 0.6B to 235B parameters
- Unifies thinking and non-thinking modes, with dynamic mode switching based on the user query or chat template
- Introduces a thinking budget mechanism that lets users allocate computational resources adaptively at inference, balancing model performance against latency according to task complexity
- Expands support from 29 to 119 languages and dialects, under the Apache 2.0 license
📜 [Tsinghua] AdaptThink: Reasoning Models Can Learn When to Think
- NoThinking, which instructs a reasoning model to skip thinking and generate the final answer directly, can be the better choice in both performance and efficiency for relatively simple tasks
- AdaptThink: an RL algorithm that teaches reasoning models to choose the optimal thinking mode based on problem difficulty
  - Constrained optimization objective: encourages choosing NoThinking while maintaining overall performance
  - Sampling strategy: balances Thinking and NoThinking samples during on-policy training
📜 [NUS] Thinkless: LLM Learns When to Think
- Thinkless: a learnable framework that lets an LLM adaptively choose between short-form and long-form reasoning based on task complexity and the model's ability
- Trained under an RL paradigm and uses two control tokens
- Decoupled Group Relative Policy Optimization (DeGRPO) algorithm
  - Two learning objectives: control token loss and response loss
📜 [Southern California] Mapping the Minds of LLMs: A Graph-Based Analysis of Reasoning LLM
- Presents a unified graph-based analytical framework to better model the reasoning processes of RLMs
- (1) Turns long, verbose CoT outputs into semantically coherent reasoning steps
- (2) Builds directed reasoning graphs from the contextual and logical dependencies between steps
- Explains that structural properties such as exploration density, branching, and convergence ratios correlate strongly with reasoning accuracy
- Motivated by counterintuitive phenomena such as RLMs doing worse with few-shot prompting → underscores the importance of prompting strategies
🧑🏻‍💻 [Google] Gemini 2.5: Our most intelligent models are getting even better
- Released an updated Gemini 2.5 with stronger reasoning ability
- Achieves an Elo score of 1415 on WebDev Arena for full-stack development tasks
- Supports native audio generation with two voices
🧑🏻‍💻 [Google] Build with Jules, your asynchronous coding agent
- An asynchronous, agentic coding assistant that integrates directly with existing repositories
- Said to understand the whole project by cloning each codebase into a Google Cloud virtual machine (VM)
- Highlights features such as working on real codebases, parallel execution, visible workflow, user steerability, and audio summaries
📜 [ByteDance] Emerging Properties in Unified Multimodal Pretraining
- BAGEL: an open-source foundation model that natively supports multimodal understanding and generation
- A unified, decoder-only model trained on trillions of tokens of large-scale interleaved text, image, video, and web data
- Shows advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation
📜 [Jiaotong University] Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs
- Deliberation on Priors (DP): a new reasoning framework designed to make full use of the priors in a knowledge graph
- A progressive knowledge distillation strategy that integrates structural priors into the LLM through a combination of supervised fine-tuning and Kahneman-Tversky optimization
- Reasoning introspection strategy: guides the LLM to perform refined reasoning verification based on extracted constraint priors
🧑🏻‍💻 [Mistral] Devstral
- Releases Devstral, an agentic LLM for software engineering tasks, under the Apache 2.0 license
- Trained to solve realistic programming problems, namely GitHub issues
- Light enough to run on an RTX 4090 or a Mac with 32GB RAM
🧑🏻‍💻 [Google DeepMind] Gemini Diffusion
- Waitlist sign-up is currently open (as of 25.05.24)
- A model that generates text or code by turning random noise into coherent output
- Highlights rapid response, more coherent text, and iterative refinement
🧑🏻‍💻 [Google DeepMind] Gemma 3n
- A compact AI model that runs on a phone or laptop (2GB of RAM), with about 1.5x faster responses than Gemma 3 4B
  - Processes 446 tokens per second on a Samsung Galaxy Ultra
- The Mix ‘n’ match architecture helps switch between small and large models
- Scores 1283 on Chatbot Arena, right behind Claude 3.7 Sonnet
📜 [ServiceNow] Augmenting LLM Reasoning with Dynamic Notes Writing for Complex QA
- Aims to overcome the limitations of iterative RAG in multi-hop QA
- NotesWriting: a method that turns the retrieved documents into concise, relevant notes at every step
- Indirectly increases the LLM's effective context length so that larger input text can be processed efficiently
- A framework that can be integrated with other RAG methods
📜 [Yonsei, CMU] Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
- Web-Shepherd: a process reward model (PRM) that evaluates web navigation trajectories at the step level
- WebPRM Collection: 40K step-level preference pairs and annotated checklists
- WebReward Bench: a meta-evaluation benchmark for PRMs
🧑🏻‍💻 [HuggingFace] nanoVLM: The simplest repository to train your VLM in pure PyTorch
- A lightweight vision-language model implemented in about 750 lines of pure PyTorch
- Trainable on a single GPU
📜 [UIUC] Language Specific Knowledge: Do Models Know Better in X than in English?
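- Hypothesizes that human code-switching happens because people feel more comfortable with a particular language for certain topics or domains
  - If language models show the same tendency, could that be used to improve their reasoning?
- Language Specific Knowledge (LSK): ethnic cultures tend to develop alongside languages, and experiments on culture-specific datasets reportedly supported the hypothesis
- LSKExtractor: a benchmark released to verify the existence of language-specific knowledge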
📜 [Meta] J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
- J1: an RL method for training LLM-as-a-Judge models with strong CoT-based thinking
- Converts verifiable and non-verifiable prompts into judgement tasks with verifiable rewards → incentivizes thinking and mitigates judgement bias
- Reported to outperform all existing 8B or 70B models, including DeepSeek-R1
- Analyzes and ablates Pairwise-J1 vs. Pointwise-J1, offline vs. online training recipes, reward strategies, and more

5th week

🧑🏻‍💻 [Anthropic] Introducing Claude 4
- Released reasoning models specialized for coding
- Provides summaries of long thought processes
- Unsummarized reasoning is available in developer mode
- Launched new extensions for VS Code and JetBrains
🧑🏻‍💻 [ByteDance] BAGEL: The Open-Source Unified Multimodal Model
- An open-source model capable of multimodal reasoning and image editing
- Uses multiple expert networks and two image encoders
- A 7B model that can run, or be trained with LoRA, on 4 x 16GB GPUs
📜 [Tokyo] MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation
- Covers 29 languages with the same 11,829 questions in each, enabling direct cross-linguistic comparison
- Provides a lite version with 658 questions per language
📜 [Cambridge, UCL, Google] Visual Planning: Let's Think Only with Images
- Points out that current MLLMs express reasoning only in text and so fail to fully exploit visual (spatial and geometrical) information
- Visual Planning: reasons purely over visual representations, without text
  - Executed through sequences of images that encode step-by-step inference
- Visual Planning via Reinforcement Learning (VPRL): an RL framework that post-trains large vision models with GRPO
🧑🏻‍💻 [Mistral AI] Build AI agents with the Mistral Agents API
- Web Search, Code Execution, Image Generation, Document Library
- MCP tools integration, Agent Orchestration
- Part of a trend toward releasing APIs that are easy to use and easy to build on
🧑🏻‍💻 [Mistral AI] Codestral Embed
- Released an embedding model specialized for code search and retrieval
- Supports binary, int8, and float32 data types
🧑🏻‍💻 [Resemble AI] chatterbox
- An open-source TTS model reported to outperform ElevenLabs' models
- Supports emotion exaggeration control and watermarked outputs
- Can be tested in a Hugging Face Gradio app
- 0.5B Llama backbone, trained on 0.5M hours of cleaned data
📜 [Shanghai AI Lab, Tsinghua, UIUC] The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
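- Aims to solve the policy entropy collapse problem in RL for LLM reasoning
  - Policy entropy drops sharply early in training, making the policy model overly confident (performance saturates)
  - This diminishes exploratory ability
- Fits an empirical relation between policy entropy $H$ and downstream performance $R$, with fitted coefficients $a$ and $b$: $R = -a \cdot \exp(H) + b$
- Explains that the change in policy entropy is driven by the covariance between action probability and the change in logits
- Proposes two methods (Clip-Cov, KL-Cov) that restrict updates for high-covariance tokens to prevent entropy collapse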
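The relation $R = -a \cdot \exp(H) + b$ and the idea of withholding updates from high-covariance tokens can be made concrete with a small sketch. It is a simplification under stated assumptions: the coefficients are made up, log-probabilities and advantages stand in for the quantities whose covariance is tracked, and a fixed threshold replaces whatever selection rule Clip-Cov actually uses.

```python
import math

def predicted_performance(entropy, a, b):
    """R = -a * exp(H) + b, with a and b as fitted coefficients."""
    return -a * math.exp(entropy) + b

def token_covariances(logprobs, advantages):
    """Per-token centered product of log-probability and advantage."""
    n = len(logprobs)
    mean_lp = sum(logprobs) / n
    mean_adv = sum(advantages) / n
    return [(lp - mean_lp) * (adv - mean_adv) for lp, adv in zip(logprobs, advantages)]

def update_mask(logprobs, advantages, threshold):
    """True keeps a token's gradient; False withholds its update.

    The fixed `threshold` is a hypothetical simplification of Clip-Cov.
    """
    return [c <= threshold for c in token_covariances(logprobs, advantages)]

# With made-up coefficients, performance at zero entropy is the ceiling -a + b.
ceiling = predicted_performance(0.0, a=1.0, b=2.0)

# Hypothetical per-token log-probabilities and advantages.
mask = update_mask([-0.1, -2.0, -1.0], [1.0, -1.0, 0.0], threshold=0.95)
```

Because $\exp(H)$ shrinks toward 1 as entropy collapses, the fitted curve caps performance at $-a + b$, which is why keeping entropy from collapsing too early matters in this framing.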
📜 [Utah, Washington] What Has Been Lost with Synthetic Evaluation?
- Examines the validity and difficulty of LLM-generated evaluation data
  - CondaQA: evaluates reasoning about negation
  - DROP: evaluates reasoning about quantities
📜 [Google] Sufficient Context: A New Lens on Retrieval Augmented Generation Systems (ICLR 2025)
- Analyzes several models and datasets through the notion of sufficient context
- Stronger models answer well when the context is sufficient, but when it is not they sometimes return wrong answers instead of abstaining
- Weaker models, by contrast, sometimes hallucinate or answer incorrectly even when the context is sufficient
- Proposes a new selective generation method for RAG systems that makes better use of sufficient-context information
📜 [Apple] Interleaved Reasoning for Large Language Models via Reinforcement Learning
- Points out that long CoT causes inefficiency and increases time-to-first-token (TTFT)
- Proposes a training paradigm that uses RL to guide reasoning LLMs to interleave thinking and answering for multi-hop questions
- Introduces a rule-based reward that incentivizes correct intermediate steps

🌸 April

1st week

📜 [UC San Diego] Large Language Models Pass the Turing Test
- Runs Turing tests on ELIZA, GPT-4o, LLaMA-3.1-405B, and GPT-4.5
- When given a human persona, GPT-4.5 was judged to be the human 73% of the time
📜 [Ai2] Introducing CodeScientist: A step toward automated scientific discovery
- Used CodeScientist to generate 19 potential discoveries, 6 of which passed expert review (for soundness and novelty)
- Performs ideation, planning, experimentation, reporting, and meta-analysis within the overall process
- Human decisions still have to intervene along the way, but the field seems to be advancing quickly (as with Sakana AI's system)
🧑🏻‍💻 [HuggingFace] YourBench: A Dynamic Benchmark Generation Framework
- Dynamic Benchmark Generation: Produce diverse, up-to-date questions from real-world source documents (PDF, Word, HTML, even multimedia).
- Scalable & Structured: Seamlessly handles ingestion, summarization, and multi-hop chunking for large or specialized datasets.
- Zero-Shot Focus: Emulates real-world usage scenarios by creating fresh tasks that guard against memorized knowledge.
- Extensible: Out-of-the-box pipeline stages (ingestion, summarization, question generation), plus an easy plugin mechanism to accommodate custom models or domain constraints.
📜 [National University of Singapore] JudgeLRM: Large Reasoning Models as a Judge
- Studies whether LLMs with enhanced reasoning ability can serve adequately as judges
- Finds a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples
- JudgeLRM: a family of judgement-oriented LLMs trained with RL using judge-wise, outcome-driven rewards
🧑🏻‍💻 [OpenAI] OpenAI Academy
- Offers hands-on training courses on prompt engineering, multimodal AI, fine-tuning, and more (practical applications rather than theory)
- Also runs workshops and live events
📜 [Meta] Multi-Token Attention
- Soft attention helps LLMs locate relevant parts of the context, but relying on a single query and key vector is itself a limitation (Single Token Attention)
- Multi-Token Attention (MTA): an attention method that lets LLMs condition attention weights on multiple query and key vectors
- Applies convolutions over queries, keys, and heads
📜 [OpenAI] PaperBench: Evaluating AI's Ability to Replicate AI Research
- A benchmark in which AI agents replicate ICML 2024 Spotlight and Oral papers
- Claude 3.5 Sonnet scored 21.0%, while human ML PhDs scored 41.4%
- The grading itself is also done by an LLM
🧑🏻‍💻 [Anthropic] Introducing Claude for Education
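- Launched Claude for Education, tailored to educational use
- Learning mode: guides the reasoning process so that students develop critical thinking skills instead of receiving the answer right away
- Features Socratic questioning ("What evidence supports your conclusion?") and emphasis on core concepts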
📜 [Mila, Nanyang, MS, … ] Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
- A study of intelligent agents that integrates principles from cognitive science, neuroscience, and computational research
1. modular foundation of intelligent agents
2. self-enhancement and adaptive evolution mechanisms
3. collaborative and evolutionary multi-agent systems
4. building safe, secure, and beneficial AI systems
📜 [Oxford, NUS, DeepMind] Why do LLMs attend to the first token?
- Attention sink: the tendency of LLMs to attend excessively to the first token in a sequence, which leads to quantisation difficulties, security issues, and problems for streaming attention
- Why this happens and how it can be exploited have been underexplored
- Argues that attention sinks keep LLMs from over-mixing

2nd week

📜 [Salesforce] APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay
- APIGen-MT: a two-phase framework for generating verifiable and diverse multi-turn data
- The first phase uses a committee of LLM reviewers to produce detailed blueprints
- The blueprints are then expanded into complete interaction trajectories through simulated human-agent interplay
- Trains the xLAM-2-fc-r series, ranging from 1B to 70B, and reports outperforming GPT-4o and Claude 3.5 on $\tau$-bench and BFCL benchmarks
🧑🏻‍💻 [Meta] The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation
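- Three models
  1. Llama 4 Scout: 17B active parameters with 16 experts; a lightweight, high-performance model
  2. Llama 4 Maverick: 17B active parameters with 128 experts; a general-purpose model specialized for multimodal and coding tasks
  3. Llama 4 Behemoth: 288B active parameters, about 2 trillion total parameters (still training, unreleased)
  - Behemoth serves as a teacher model, passing reasoning, coding, and multimodal understanding abilities on to Scout and Maverick
- Features an MoE architecture, native multimodality, a 10M context length, and codistillation
- Mentions efforts to address bias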
📜 [HuggingFace] SmolVLM: Redefining small and efficient multimodal models
- Smaller VLMs have reused large-model design choices such as extensive image tokenization, resulting in inefficient GPU memory usage
- SmolVLM: a series of compact multimodal models designed for resource-efficient inference
- The smallest model, SmolVLM-256M, uses less than 1GB of GPU memory at inference and reportedly shows strong video comprehension in addition to handling static images
🧑🏻‍💻 [Ai2] Going beyond open data – increasing transparency and trust in language models with OLMoTrace
- A playground feature for Ai2's flagship models that highlights which training data a model response can be traced back to
- Can also be applied to other models whose training data is accessible
📜 [Yandex] Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
- Runs LLM workers in parallel, lets all workers synchronize through a concurrently updated attention cache, and prompts them to decide how to collaborate
- Each instance can see what the other instances are generating through the concurrent cache
- Makes use of RoPE
- Reports that modern reasoning-capable LLMs achieve good results with a shared key-value cache alone, without additional fine-tuning
🧑🏻‍💻 [Google] Announcing the Agent2Agent Protocol (A2A)
- An open protocol that lets AI agents communicate across their respective platforms and services
- Uses HTTP, SSE, JSON-RPC, and similar standards to stay compatible with existing systems
- Agents describe their available functions in structured JSON files called Agent Cards
- Google also recently released the Agent Development Kit (ADK), an open-source toolkit that integrates with Vertex AI and the Gemini API
🧑🏻‍💻 [OpenAI] Evaluating model performance
- API-level support for easily integrating LLM-as-a-Judge (prompt testing and evaluation) into the dev workflow
- The test data used for evaluation is specified in data_source_config, and the criteria for judging whether a model output is correct go in testing_criteria
🧑🏻‍💻 [Amazon] Amazon’s new Nova Sonic foundation model understands not just what you say—but how you say it
- A single model that unifies speech understanding and speech generation
- Available as an API on Amazon Bedrock
📜 [Nanjing, ByteDance] DDT: Decoupled Diffusion Transformer
- Inherent optimization dilemma of Diffusion Transformers: encoding low-frequency semantics requires reducing high-frequency components, so the two must be balanced
- Decoupled Diffusion Transformer (DDT): a design split into an encoder for semantic extraction and a specialized velocity decoder
- Sharing the self-condition between adjacent denoising steps also speeds up inference
🧑🏻‍💻 [OpenGVLab] InternVL3 🤗
- Shows stronger multimodal perception and reasoning than InternVL 2.5
- Covers tool usage, GUI agents, industrial image analysis, 3D vision perception, and more
- States that its text performance is better than the Qwen 2.5 series
📜 [Kimi] Kimi-VL Technical Report
- An efficient open-source MoE vision-language model, plus Kimi-VL-Thinking
- Achieves strong performance even though the language decoder activates only about 2.8B parameters
- Performs well on multi-turn agent tasks, college-level image and video comprehension, OCR, mathematical reasoning, and more
- Handles ultra-high-resolution visual inputs thanks to a 128K context window and MoonViT, a native-resolution vision encoder
🧑🏻‍💻 [Google] Introducing Firebase Studio
- A web-based open-source IDE for building and deploying full-stack AI applications
- Combines Project IDX, Genkit, and Gemini in a single workspace
- App Prototyping agent: generates full apps from a prompt or drawing
🧑🏻‍💻 [OpenAI] BrowseComp: a benchmark for browsing agents
- An open-source benchmark for evaluating how well AI agents can find hard-to-locate information
- 📜 BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
- Consists of 1,266 problems with short, indisputable answers
📜 [Zhejiang University] Large language models could be rote learners
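- Aims to overcome the limitation of MCQ evaluation, which can end up measuring simple memorization
- Reports that LLMs tend to perform better on non-memorized items (genuine capability) than on memorized ones (rote memorization)
- TrinEval: an evaluation framework that reformulates MCQs into a trinity format to reduce the influence of memorization and better assess knowledge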

3rd week

🧑🏻‍💻 [ByteDance] Seed-Thinking-v1.5: Advancing Superb Reasoning Models with Reinforcement Learning
- Released a reasoning model that is strong in STEM and coding
- An MoE model with 200B total and 20B activated parameters
- Releases two benchmarks, BeyondAIME and Codeforces, for evaluating generalized reasoning
📜 [Microsoft Research] MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft
- Visual-action autoregressive Transformer: takes game scenes and the corresponding actions as input and generates the resulting new scenes
  - The two inputs pass through an image tokenizer and an action tokenizer, are converted to discrete tokens, and are concatenated as the model input
- Trained to generate 4 to 7 frames per second, allowing real-time interaction with the player
- Proposes a metric that measures visual quality and action-following capability together
🧑🏻‍💻 [DeepCogito] Cogito v1 Preview: Introducing IDA as a path to general superintelligence
- Open-sources reasoning LLMs in [3, 8, 14, 32, 70]B sizes
- Reports that the 70B model outperforms Llama's latest 109B MoE model
- Iterated Distillation and Amplification (IDA) - a scalable and efficient alignment strategy for general superintelligence using iterative self-improvement
- Every model can either answer directly or self-reflect before answering
- Plans to release 109B, 400B, and 671B models soon, including checkpoints
🧑🏻‍💻 [OpenAI] Introducing GPT-4.1 in the API
- Released GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, available only through the API
- All three outperform GPT-4o and GPT-4.5 on major benchmarks and support a 1M context window and diff mode
- Better than before at understanding structured input, multi-turn, and multi-needle tasks
🧑🏻‍💻 [xAI] Grok Studio
- First release of Grok Studio, which supports code execution and Google Drive integration
- Can generate documents, code, reports, and browser games, and opens the content in a separate window
🧑🏻‍💻 [Google] Introducing TxGemma: Open models to improve therapeutics development
- TxGemma: a collection of open models designed for efficient therapeutic development
- Capabilities range from identifying promising targets to predicting clinical trial outcomes
- 2B, 9B, and 27B models fine-tuned from Gemma 2 on 7M training examples
📜 [China Telecom] xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
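- Builds the VAR dataset of QA pairs by collecting the outputs of many LLMs on a variety of datasets
- Performs multi-round annotation to improve label accuracy
- The contribution appears to boil down to building a dataset for training an evaluator for long-reasoning tasks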
📜 [UCLA, Meta] d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
- d1: a framework that turns a pre-trained masked dLLM into a reasoning model using SFT and RL
- (a) Uses masked SFT to distill knowledge and instill self-improvement behavior
- (b) diffu-GRPO: a critic-free, policy-gradient-based RL algorithm
📜 [Microsoft] BitNet b1.58 2B4T Technical Report
- BitNet b1.58 2B4T: a native open-source 1-bit LLM released at the 2B scale
- Described as highly computationally efficient while remaining strong in language understanding, mathematical reasoning, coding proficiency, and conversational ability
- Supports both CPU and GPU inference and is available through HuggingFace
🧑🏻‍💻 [OpenAI] Introducing OpenAI o3 and o4-mini
- Releases two multimodal models that are strong in multi-step reasoning and structured tool use
- Capable of chart interpretation, UI understanding, mathematical reasoning, OCR with context, and more
🧑🏻‍💻 [Ai2] DataDecide: How to predict best pretraining data with small experiments
- Releases DataDecide: models from 4M to 1B parameters trained on 25 high-quality corpora, with up to 100B tokens
- Intermediate training checkpoints are released so that trends observed when training small models on a given dataset can guide scale-up decisions
🧑🏻‍💻 [Comet-ML] Opik
- Releases version 1.2 of an open-source LLM evaluation framework
- Supports tracing, annotations, a playground, and more
- Includes LLM-as-a-Judge metrics
🧑🏻‍💻 [Cohere] Introducing Embed 4: Multimodal search for business
- SoTA multimodality: can search within PDFs made up of diverse elements and dynamic presentation slides
- 128K context window length (about 200 pages)
- Supports more than 100 languages
- Supports on-premise deployment as well as virtual private cloud (VPC) environments

4th week

🧑🏻‍💻 [SkyworkAI] Skywork-OR1 (Open Reasoner 1)
- An open-source family consisting of the Math-7B, 32B-Preview, and 7B-Preview models
- Skywork-OR1-RL-Data: data whose difficulty was rated with DeepSeek-R1-Distill-Qwen-32B (the ratings can be used for filtering). 105K math and 14K coding samples in total
- Reports that the 32B-Preview model reaches DeepSeek-R1-level performance on AIME and LiveCodeBench
📜 [NVIDIA] CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
- Problem: pre-training datasets such as Common Crawl lack domain labels, while datasets such as The Pile are labor-intensive to curate
- Proposes CLIMB: a framework that discovers, evaluates, and refines data mixtures for pre-training
- Reports that a 1B model trained on 400B tokens obtained this way outperforms the SoTA Llama-3.2-1B
- Releases ClimbLab (20 clusters, 1.2T tokens) and ClimbMix (400B tokens)
📜 [HKUST] Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models
- Large Reasoning Models (LRMs) suffer from overthinking
- ThoughtMani: inserting an external CoT generated by a smaller model between the thinking tokens helps the model generate fewer tokens
- Applying it to QwQ-32B on the LiveBench/Code dataset saves roughly 30% of tokens while maintaining the original performance (the CoT generator does add some overhead)
🧑🏻‍💻 [Google] Gemma 3 QAT Models: Bringing state-of-the-Art AI to consumer GPUs
- Releases Quantization-Aware Trained (QAT) models at 1B, 4B, 12B, and 27B
- Gemma 3 27B takes 14.1GB of memory in int4, so it can be loaded on a single RTX 3090 together with the KV cache
- Function calling and custom tools can be used through the OpenAI API
📜 [UC Berkeley, LangChain] PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines
- A dataset of 2,087 LLM pipeline prompts and 12,623 corresponding assertion criteria
- Mistral and Llama 3 models fine-tuned on this data outperform GPT-4o by 20.93% on average (on the authors' own benchmark)
📜 [Tsinghua] Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Argues that Reinforcement Learning with Verifiable Rewards (RLVR) does not give language models fundamentally new reasoning patterns
- That is, the reasoning abilities of current reasoning models already existed in the base model, and training only makes the model sample them more reliably
- The same trend is observed in visual reasoning tasks
- Distillation, in contrast, is described as a way to transfer new knowledge to the model
📜 [Shanghai AI Lab, Fudan, CMU] MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
- Points out that heuristic cleaning of LLM training data fails to capture intent in the semantic space
- → Proposes a method that quantifies the information content of a dataset by building a label graph and using the information distribution within the graph
- Maximize Information Gain (MIG): an efficient sampling method that samples iteratively within the semantic space
- Applying the method to the Tulu3 dataset released by Ai2 yields performance gains
📜 [Google DeepMind] Welcome to the Era of Experience
- Introduces the concept of streams: continuous interaction loops in real or simulated environments (for future agents)
- Argues for learning from environmental feedback rather than relying on human-generated datasets
- Supports continuous, long-term learning across many tasks and domains
- Focuses on capability growth over time rather than task-specific performance
📜 [Alibaba] Wan: Open and Advanced Large-Scale Video Generative Models
- Releases Wan2.1, a SoTA-level open suite of video foundation models for video generation
- The T2V-1.3B model needs 8.19GB of VRAM and generates a 5-second 480P video in about 4 minutes on a single RTX 4090
- Handles Text-to-Video, Image-to-Video, Video Editing, Text-to-Image, Video-to-Audio, and other tasks
- Strong at generating Chinese and English text
- Encodes and decodes 1080P video well while preserving temporal information
🧑🏻‍💻 [Anthropic] Values in the wild: Discovering and analyzing values in real-world language model interactions
- Analyzes 700,000 chats and identifies more than 3,300 distinct values
- A privacy-preserving system was used, so users' personal information was removed
- The diagram visualizing the analysis process is worth a look; the AI values taxonomy stands out
📜 [NVIDIA] Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
- Releases a family of vision-language models (VLMs) based on long-context multimodal learning
- Targets long video understanding and high-resolution image understanding in particular
- Integrates Automatic Degrade Sampling and Image Area Preservation to preserve contextual integrity and visual details
- Eagle-Video-110K: a dataset that integrates story-level and clip-level annotations
📜 [Huawei] Dynamic Early Exit in Reasoning Models
- Introduces early exit so that the model can self-truncate its CoT sequence, addressing the redundant steps that LRLMs include during reasoning
- Unlike fixed heuristics, it detects potential reasoning transition points (e.g. the "Wait" token) from model behavior
- When the model has high confidence in a trial answer at such a point, generation of the next reasoning chain is stopped
- Requires no additional training and integrates seamlessly into existing o1-like reasoning LLMs
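
The early-exit rule in this entry can be sketched as follows. This is a minimal illustration rather than the paper's implementation: `dynamic_early_exit`, the `probe` callback (a stand-in for asking the model for a trial answer and its confidence), and the 0.9 threshold are all hypothetical.

```python
def dynamic_early_exit(segments, probe, threshold=0.9, marker="Wait"):
    """Keep reasoning segments until a transition point yields a confident trial answer.

    segments: reasoning chunks in the order the model would emit them.
    probe:    hypothetical callback, probe(kept_segments) -> (answer, confidence).
    """
    kept = []
    for segment in segments:
        # A segment that starts with the marker is treated as a transition point.
        if kept and segment.lstrip().startswith(marker):
            answer, confidence = probe(kept)
            if confidence >= threshold:
                return kept, answer  # exit early; later segments are never generated
        kept.append(segment)
    answer, _ = probe(kept)
    return kept, answer


def toy_probe(kept):
    # Toy stand-in: confidence rises once two segments have been produced.
    return "12", (0.5 if len(kept) < 2 else 0.95)


segments = [
    "3 * 4 = 12.",
    "Wait, let me double-check: 4 + 4 + 4 = 12.",
    "Wait, maybe try another way...",
    "So the answer is 12.",
]
kept, answer = dynamic_early_exit(segments, toy_probe)
print(len(kept), answer)
```

In the toy run the second "Wait" segment is never kept, because the probe is already confident at that transition point.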
📜 [Chinese Academy of Sciences] GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
- Building GUI agents by applying SFT to large vision-language models (LVLMs) needs a lot of training data and generalizes poorly
- Proposes GUI-R1, a reinforcement learning framework that improves the GUI understanding of LVLMs through unified action space rule modeling
- Achieves resource-efficient results using GRPO and a small amount of carefully curated high-quality data from each platform (Windows, Linux, MacOS, etc.)
🧑🏻‍💻 [ByteDance] Introducing UI-TARS-1.5
- Open-sources a multimodal agent built by applying reinforcement learning to Qwen2.5-VL-7B
- Uses a reasoning-before-action approach based on token-level multimodal supervision
- Its web navigation ability exceeds that of GPT-4.5
🧑🏻‍💻 [Nari-Labs] Nari Dia-1.6B
- Open-source text-to-dialogue model that turns scripts into realistic conversations
- Drawing a lot of attention for performing on par with or better than ElevenLabs Studio and Sesame CSM-1B
- Reportedly the work of two KAIST undergraduates
📜 [a-m-team] DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training
- A large-scale, difficulty-graded reasoning dataset of 3.34M unique queries and 40M distilled responses (released on HuggingFace)
- Uses pass rate and Coefficient of Variation (CV) to keep only meaningful training data
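
The two statistics named above are simple to compute. The sketch below shows one way they might be combined into a filter; `keep_query` and both thresholds are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def pass_rate(outcomes):
    """Fraction of sampled responses judged correct (outcomes are 0 or 1)."""
    return sum(outcomes) / len(outcomes)


def coefficient_of_variation(scores):
    """Population standard deviation divided by the mean of non-negative scores."""
    mu = mean(scores)
    return pstdev(scores) / mu if mu else 0.0


def keep_query(outcomes, scores, max_pass_rate=0.9, min_cv=0.05):
    # Hypothetical rule: drop queries that are almost always solved or whose
    # response scores barely vary, since they carry little training signal.
    return pass_rate(outcomes) < max_pass_rate and coefficient_of_variation(scores) >= min_cv


easy = keep_query([1, 1, 1, 1], [1.0, 1.0, 1.0, 1.0])
mixed = keep_query([1, 0, 1, 0], [0.9, 0.2, 0.8, 0.3])
print(easy, mixed)
```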
📜 [Shanghai AI Lab, Tsinghua] VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
- Points out that existing MLLM benchmarks rely on text descriptions or allow language-based reasoning shortcuts, so they cannot verify genuinely vision-centric reasoning
- VisuLogic: 1,000 human-verified problems across 6 categories (quantitative shifts, spatial relations, etc.)
- Humans score 51.4% while most models stay below 30% accuracy; training data for improving visual reasoning is also released
📜 [Tsinghua, Shanghai AI Lab] TTRL: Test-Time Reinforcement Learning
- Studies reinforcement learning for LLMs on reasoning tasks without explicit labels
  - The challenge is how to estimate rewards without ground-truth information
- Test-Time Reinforcement Learning (TTRL): self-evolution using the priors of pre-trained models
  - Builds on the observation that Test-Time Scaling (TTS) techniques such as majority voting can serve as rewards in RL training
  - The trained model outperforms the initial (base) model, which supports the validity of the approach
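
The idea of using majority voting as a reward can be sketched in a few lines. The function name and the 0/1 reward values are assumptions for illustration; the entry above only states that majority voting plays the role of the reward.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for a group of sampled answers.

    The most common answer acts as the pseudo-label, and each sample gets
    reward 1.0 if it matches that label and 0.0 otherwise.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


sampled = ["42", "41", "42", "42", "7"]
label, rewards = majority_vote_rewards(sampled)
print(label, rewards)
```

No ground-truth label appears anywhere in the computation, which is the point of the method: the reward comes entirely from agreement among the model's own samples.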
🧑🏻‍💻 [OpenAI] Introducing our latest image generation model in the API
- Image generation was popular enough that more than 130M users generated over 700M images in the first week
- The feature is now available in the API as gpt-image-1
- Costs roughly $0.3 per image
🧑🏻‍💻 [NousResearch] Minos-v1
- A ModernBERT-large-based model that detects refusals in LLM QA (returns Refusal or Non-refusal)
  - Takes a pair of user question and LLM answer as input and returns one of the two classes with a confidence score
- 400M parameters, 8,192 context length, trained on about 380K examples
📜 [DevRev] Efficient Single-Pass Training for Multi-Turn Reasoning
- A problem in multi-turn reasoning training for LLMs:
  - The LLM generates reasoning tokens, but these must not be included in subsequent inputs
- Because of this discrepancy, an entire conversation cannot be processed in a single forward pass, unlike training on ordinary datasets
- Addresses this with response token duplication and a custom attention mask that enforces appropriate visibility constraints
🧑🏻‍💻 [HuggingFace] Tiny Agents: a MCP-powered agent in 50 lines of code
- Describes MCP as a standard API that exposes the set of tools available to an LLM
- 50 lines of code are enough to build an AI agent system
🧑🏻‍💻 [Anthropic] The Urgency of Interpretability
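- Presents research findings on how Claude 3.5 Haiku thinks
- There is no separate system per language; instead English, French, Chinese, and other languages share an abstract concept space, so meaning is processed first and then translated into the specific language
- When writing poetry, the model does not simply predict the next tokens; it plans ahead for the rhyme
- When solving hard math problems, giving it an incorrect hint makes it produce a plausible-looking answer; this process was found to go through several "intermediate steps"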
📜 [Microsoft] BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
- Activation outliers are the most critical obstacle to deploying 1-bit LLMs
- BitNet v2: a native 4-bit activation quantization framework for 1-bit LLMs
- H-BitLinear: applies an online Hadamard transformation before activation quantization
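
A small sketch of why a Hadamard rotation helps before low-bit quantization: a single outlier channel is spread across all channels, so the largest magnitude shrinks while the total energy is preserved. `fwht` and the toy activation vector are illustrative assumptions; this is not the H-BitLinear implementation.

```python
import math


def fwht(values):
    """Normalized fast Walsh-Hadamard transform; len(values) must be a power of two."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly step: combine pairs of entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * scale for v in out]


activations = [0.1, -0.2, 0.05, 8.0, 0.0, 0.15, -0.1, 0.2]  # one outlier channel
rotated = fwht(activations)
print(max(abs(v) for v in activations))
print(max(abs(v) for v in rotated))
```

Because the normalized transform is its own inverse, applying `fwht` a second time recovers the original activations.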
🧑🏻‍💻 [Alibaba] Qwen3: Think Deeper, Act Faster
- Releases a model family ranging from 0.6B to 235B parameters
  - Two MoE models: Qwen3-30B-A3B and Qwen3-235B-A22B
- Hybrid thinking mode: can switch between thinking mode and non-thinking mode
- Trained on 36T tokens, twice the amount of data used for Qwen2.5
- Supports 119 languages and natively supports MCP
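🧑🏻‍💻 [NousResearch] Atropos
- A reinforcement learning environment framework for language models that collects and evaluates diverse LLM trajectories
- Supports multi-turn and asynchronous RL
- Inference agnostic: integrates easily with standard interfaces such as OpenAI and vLLM
- A hackathon is planned for May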

🌱 March

1st week

📜 [Microsoft] LongRoPE2: Near-Lossless LLM Context Window Scaling
1. Hypothesis: insufficient training in the higher RoPE dimensions causes a persistent OOD issue
2. A RoPE rescaling algorithm that uses evolutionary search guided by needle-driven perplexity is proposed to address this problem
3. Mixed context window training
- Applying LongRoPE2 to LLaMA3-8B extends it to 128K while retaining 98.5% of the original short-context performance
🧑🏻‍💻 [OpenAI] Introducing GPT-4.5
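- Supports function calling, structured outputs, system messages, and streaming in the API
- Supports image input and agentic planning and execution
- Better at picking up nuance in text-based interactions and has improved EQ → many feel that its humanities-style thinking improved but its practical performance is underwhelming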
🧑🏻‍💻 [Inception Labs] Introducing Mercury, the first commercial-scale diffusion large language model
- Stanford professor Stefano Ermon founded a company for diffusion large language models (dLLMs)
- Outputs 1,000 tokens per second on an H100, described as more than 10x faster than existing models
- Argues that the paradigm of autoregressive next-token prediction should shift to "coarse-to-fine" generation
📜 [King’s College London, The Alan Turing Institute] CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
- Notes that implicit CoT still lags behind explicit CoT
- CODI: a shared model acts as both teacher and student and learns explicit and implicit CoT
- Matches explicit CoT performance with implicit CoT while achieving a 3.1x token compression rate
- Since explicit reasoning took off, research on implicit reasoning and compression has become more visible, perhaps because inference costs have surged
🧑🏻‍💻 [Sesame] Crossing the uncanny valley of conversational voice
- Conversational Speech Model (CSM): designed for context-aware speech in real-time conversations (1B, 3B, 8B)
- Adjusts tone, pace, rhythm, and more based on conversational context and emotions
- The decoder reconstructs high-fidelity speech from Residual Vector Quantization (RVQ) tokens
- Covers a 2K context window and is trained on 1M hours of publicly available transcribed and diarized speech
🧑🏻‍💻 [Anthropic] Token-efficient tool use (beta)
- The token-efficient-tools-2025-02-19 header reduces tokens and latency by 14% on average and up to 70%
  - An option for tool use in API calls, introduced alongside Claude 3.7 to minimize usage cost
📜 LLM Post-Training: A Deep Dive into Reasoning Large Language Models
- A survey of post-training methods such as fine-tuning, reinforcement learning, and test-time scaling
- Also covers issues such as catastrophic forgetting, inference-time trade-offs, and reward hacking
- The tuning section includes EXAONE but not Solar
- Awesome LLM Post-Training repository 🔗
📜 [Mila] Multi-Turn Code Generation Through Single-Step Rewards
- Current multi-turn code generation methods either generate code without feedback or use complex, hierarchical reinforcement learning
- μCODE: multi-turn code generation using only single-step rewards
- Claims that correct code can be recovered from any intermediate state
- Iteratively trains a verifier that scores multi-turn execution feedback and newly generated code
📜 [Univ. of Oklahoma] A Survey On Large Language Models For Code Generation
- A survey paper on code generation models, a very hot topic recently
- Does not cover an especially large amount of material
📜 [Tencent AI] The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models
- Unsupervised Prefix Fine-Tuning (UPFT): uses Prefix Self-Consistency, training on the initial reasoning steps shared across diverse solutions
- Trains only on initial prefix substrings (8 tokens), reducing the effort of data labeling and sampling
- Reports performance comparable to existing methods such as Rejection Sampling Fine-Tuning while cutting training time by 75% and sampling cost by 99%
🧑🏻‍💻 [Qwen] QwQ-32B
- Releases a 32B model that rivals the 671B DeepSeek-R1 (a dense model, not MoE)
- Supports a 131K token length
- RoPE, SwiGLU, RMSNorm
🧑🏻‍💻 [Cohere] Aya Vision: Expanding the Worlds AI Can See
- SoTA vision model supporting a range of languages (23) and modalities
- 8B and 32B models, with weights released on Kaggle and HuggingFace
🧑🏻‍💻 [Google] Data Science Agent in Colab: The future of data analysis with Gemini
- Generates full notebooks, not just code snippets, through multi-step reasoning with Gemini
- Supports classification, regression, feature selection, correlation analysis, and more
- Supports CSV, JSON, and Excel files
📜 [Nanjing Univ., Microsoft] Process-based Self-Rewarding Language Models
- Proposes having the LLM generate training data based on rewards it assigns to its own outputs
- → Points out that existing self-rewarding methods are weak at mathematical reasoning
- → Introduces long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization into self-rewarding
📜 [Washington, Peking] MPO: Boosting LLM Agents with Meta Plan Optimization
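- LLM-based agent systems are still limited by planning hallucination and the need to train each agent
- Meta Plan Optimization (MPO): a framework that improves an agent's planning capability by incorporating explicit guidance, driven by feedback on the agent's execution results
- A model that evaluates (rewards) the meta plans is also involved, which makes the pipeline resemble reinforcement learning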
📜 [Alibaba] Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers
- Covers 25 languages understood by 90% of the world's population (by number of speakers)
- Releases the Babel-9B and Babel-83B multilingual LLMs
- Instead of conventional continued pretraining, expands the parameter count through model extension to improve performance
📜 [Alibaba] START: Self-taught Reasoner with Tools
- Substantially improves reasoning capabilities by using external tools
- (1) Hint-infer: inserts artificially designed hints (e.g. "I should write some Python code!")
- (2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): fine-tunes on reasoning trajectories (including tool use) generated through Hint-infer
📜 [CMU] SOLAR: Scalable Optimization of Large-scale Architecture for Reasoning
- Points out that nuanced topological reasoning is a problem in reasoning
- Dynamically optimizes the reasoning topology to improve accuracy and efficiency
- Topological-Annotation-Generation (TAG) system: automates topological dataset creation and segmentation
- Trains a multi-task Topological Reward Model (M-TRM) that automatically selects the best reasoning topology and returns an answer in a single pass (no need for multiple single-task models)
📜 [NVIDIA, Berkeley, MIT, Nanjing, KAIST] Token-Efficient Long Video Understanding for Multimodal LLMs
- Points out that a lack of explicit temporal modeling makes it hard to capture the dynamic patterns of long videos
- STORM (Spatiotemporal TOken Reduction for Multimodal LLMs): an architecture that integrates a temporal encoder between the image encoder and the LLM
- Uses a Mamba State Space Model to integrate temporal information into image tokens, producing richer representations
- Reduces both training and inference latency while delivering efficient and robust video understanding over extended temporal contexts
📜 [Stanford] Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
- Even among models of the same size, some (Qwen) acquire self-improvement through RL while others (Llama) do not → what conditions are needed to acquire it?
- Four cognitive behaviors: verification, backtracking, subgoal setting, and backward chaining
- Training Llama with continued pretraining on OpenWebMath data brings it on par with Qwen
📜 [Columbia Business School] How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach
- A systematic study of the relationship between reasoning length and model performance using various compression instructions
- → A universal tradeoff between reasoning length and accuracy exists across nearly every distinct reasoning chain
- Token complexity: the minimum number of tokens needed for successful problem-solving
- → Used to compute the theoretical limit of the accuracy-compression tradeoff
- → Adaptive compression: returns shorter responses for questions that are easy to answer

2nd week

📜 [Renmin Univ.] R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
- LRMs that rely only on internal knowledge are weak on time-sensitive or knowledge-intensive questions
- R1-Searcher: a two-stage outcome-based RL approach
- The model autonomously accesses an external search system during reasoning to acquire additional knowledge
- Uses RL exclusively; no reward or distillation is needed for a cold start
🧑🏻‍💻 [Manus] Leave it to Manus
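- A Chinese startup drawing wide attention with its AI agent service
- Far ahead of OpenAI Deep Research on its self-reported benchmark results
- The community debated whether the striking demo (dozens of apps running simultaneously) was genuine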
🧑🏻‍💻 [OpenAI] New tools for building agents
- Releases agent tooling that developers can use to build agents
- Responses API: combines the Chat Completions API with the tool-use capabilities of the Assistants API
- Built-in web search, file search, and computer use
📜 [Skolkovo Institute of Science and Technology] Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
- Artificial Text Detection (ATD) has grown more important since the rise of LLMs, but it generalizes poorly to unseen text
- Improves ATD interpretability by extracting features from Gemma-2-2b with a Sparse Autoencoder
- Offers insight into how text from various models differs from human-written text
🧑🏻‍💻 [Google DeepMind] Gemini Robotics brings AI into the physical world
- Gemini Robotics: a vision-language-action (VLA) model based on Gemini 2.0
- Gemini Robotics-ER: uses Gemini's embodied reasoning (ER) to deliver advanced spatial understanding
- Partnering with Apptronik to build the next generation of humanoids
- Technical Report link 🔗
🧑🏻‍💻 [Google] Introducing Gemma 3: The Developer Guide
- An open-weight model family from 1B to 27B (not open-source)
- Ranks second on LMArena, right behind R1
- Multimodal support through a SigLIP-based vision encoder, a 128K context window, and understanding of more than 140 languages
- Applies three reinforcement learning techniques: RLMF (Machine Feedback), RLEF (Execution Feedback), and RLHF (Human Feedback)
🧑🏻‍💻 [Perplexity] Perplexity Ask MCP Server
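- Model Context Protocol (MCP) has recently become a hot keyword
  - An open standard protocol for connecting AI systems to data sources
  - Built on a client-server architecture
  - A more intuitive and flexible solution than conventional APIs
- A short guide shows how to build a Docker image and test the server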
🧑🏻‍💻 [OpenAI] Detecting misbehavior in frontier reasoning models
- 📜 Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
- Focuses on coding tasks among the reward hacking problems that arise during reinforcement learning for reasoning models
- The model is observed to explicitly state how it is cheating in order to maximize reward
- For now, the model tends to hide its intent and try to evade detection
📜 [Meta, NYU, MIT, Princeton] Transformers without Normalization
- Shows that Transformers can match or exceed their previous performance without normalization
- Dynamic Tanh (DyT): an element-wise operation, $\text{DyT}(x)=\text{tanh}(\alpha x)$, that replaces normalization layers in the Transformer architecture
- The idea comes from the observation that existing normalization produces tanh-like, S-shaped input-output mappings
- Validated on a range of tasks, from recognition to generation and from computer vision to language models
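
The DyT formula above can be written out as a small sketch. The per-channel scale `gamma` and shift `beta`, and the initial value 0.5 for `alpha`, are assumptions added here for illustration (they mirror the affine parameters of a normalization layer); the entry itself only states $\tanh(\alpha x)$.

```python
import math


class DyT:
    """Dynamic Tanh over one feature vector: gamma * tanh(alpha * x) + beta."""

    def __init__(self, dim, alpha_init=0.5):
        self.alpha = alpha_init      # single scalar, learnable in a real model
        self.gamma = [1.0] * dim     # per-channel scale (assumed, see lead-in)
        self.beta = [0.0] * dim      # per-channel shift (assumed, see lead-in)

    def __call__(self, x):
        # Purely element-wise: no mean or variance is computed across the vector.
        return [g * math.tanh(self.alpha * v) + b
                for v, g, b in zip(x, self.gamma, self.beta)]


layer = DyT(dim=4)
out = layer([0.0, 1.0, -3.0, 100.0])
print(out)
```

Unlike layer normalization, the output for each element depends only on that element, and extreme inputs are squashed into the range (-1, 1) before scaling.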
📜 [KAIST] Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
- Sketch-of-Thought (SoT): a framework that combines cognitive-inspired reasoning paradigms with linguistic constraints to minimize token usage while maintaining reasoning accuracy
- Three paradigms: Conceptual Chaining, Chunked Symbolism, and Expert Lexicons
- A lightweight routing model dispatches each reasoning task to the appropriate paradigm

3rd week

📜 [UC Berkeley, Tokyo] Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks
- It is hard for LLMs to balance high-level planning objectives with low-level execution
- Plan-and-Act: a framework that improves plan generation for LLM-based agents through synthetic data generation
- Planner: produces the structured, high-level plans needed to reach the goal
- Executor: translates those plans into environment-specific actions
🧑🏻‍💻 [Microsoft] RD-Agent
- An open-source tool that automates R&D, supporting data analysis, data mining, and model tuning
- Agent-based automation is clearly being put to real use in research, and it may well develop into a business model
📜 [IBM, HuggingFace] SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
- SmolDocling: an ultra-compact (256M) vision-language model developed for end-to-end document conversion
- DocTags: a new universal markup format that captures every element on a page together with its location
- Described as robustly extracting features such as code listings, tables, equations, charts, and lists from business documents, academic papers, technical reports, and other formats
- The model is available now and the dataset will be released soon
📜 [Anthropic] Auditing Language Models for Hidden Objectives
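- Alignment audits: investigating whether a model has undesirable objectives
- Trains a model that learns the errors of the RLHF reward model and how to exploit them
- (1) Runs a blind auditing game with four teams who know nothing about the model's hidden objective or training
- (2) In follow-up work, explores eight techniques for auditing the model; SAEs were reportedly the most effective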
📜 [IIIT Hyderabad] No LLM is Free From Bias: A Comprehensive Study of Bias Evaluation in Large Language models
- Examines LLM biases across categories ranging from physical characteristics to socio-economic status
- Introduces five prompting approaches for the bias detection task
- Poses three research questions about the metrics used in bias detection benchmarks
- In the experiments every LLM shows at least one bias, and LLaMA3.1-8B shows the least
- The paper organizes bias evaluation metrics well, but it is a pity that the experiments cover only small open-source models
🧑🏻‍💻 [Mistral] Mistral Small 3.1
- 24B parameters, 128K context window; an open-source model that reaches SoTA among models of its size
- Scores 44.42% on GPQA, ahead of Gemma 3-it (36.83%) and GPT-4o-mini (40.2%)
- Generates 150 tokens per second and can also process images
🧑🏻‍💻 [AI2] OLMo 2 32B: First fully open model to outperform GPT 3.5 and GPT 4o mini
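- Releases a 32B model following the 7B and 13B models released last November
- Reported as the first fully open model (data, code, training recipe, and all other details released) to outperform GPT 3.5 and GPT 4o mini
- Applies refined post-training and RLVR (Reinforcement Learning with Verifiable Rewards)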
📜 [Tsinghua] Personalize Anything for Free with Diffusion Transformer
- In a Diffusion Transformer (DiT), replacing denoising tokens with reference subject tokens enables zero-shot reconstruction
- This in turn enables personalization and image editing
- Personalize Anything: a training-free framework for personalized image generation with DiT
  1. timestep-adaptive token replacement: early stage injection & late stage regularization
  2. patch perturbation strategies to boost structural diversity
📜 [Babes-Bolyai University] Synthetic Data Generation Using Large Language Models: Advances in Text and Code
- A survey paper on generating text and code data with LLMs
- Discusses progress in low-resource tasks (classification, QA) and code-centric applications
🧑🏻‍💻 [Google] New ways to collaborate and get creative with Gemini
- Canvas: a Gemini-based AI-assisted coding tool
  - Supports Python, Javascript, and HTML
  - Allows real-time code collaboration, but not with multiple users
- Audio Overview: turns documents, slides, and Deep Research reports into an audio podcast between two AI hosts
  - Available on web and in the app
  - Output can be downloaded or shared
🧑🏻‍💻 [LG AI Research] EXAONE Deep Released ━ Setting a New Standard for Reasoning AI
- A 32B reasoning model reported to be strong in math, science, and coding
- The only Korean model listed among the Notable AI models
- 7.8B and 2.4B models are also released
📜 [Eleuther AI] RWKV-7 "Goose" with Expressive Dynamic State Evolution
- A 3B sequence model that reaches SoTA while training on far fewer tokens than other models of the same size
- Memory usage and inference time per token are constant at inference
- Also releases a 3.1T-token multilingual dataset
📜 [METR] Measuring AI Ability to Complete Long Tasks
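- Interprets task difficulty by the time a human needs to complete the task
- AI models were run on about 170 engineering tasks that take humans between 2 seconds and 8 hours
- According to the study, the length of tasks AI can complete doubles every 7 months; at present Claude 3.7 Sonnet completes 1-hour tasks with 50% reliability
- Link to the METR post summarizing the results 🔗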
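
The doubling claim above implies simple exponential growth. The sketch below extrapolates it under the assumption that the 7-month doubling rate continues unchanged, which is a naive projection and not something the entry asserts; the function name and defaults are illustrative.

```python
def projected_horizon_hours(months_ahead, current_hours=1.0, doubling_months=7.0):
    """Task length (in human hours) after `months_ahead` months of steady doubling."""
    return current_hours * 2 ** (months_ahead / doubling_months)


for months in (0, 7, 14, 28):
    print(months, projected_horizon_hours(months))
```

Starting from the 1-hour figure, four doublings (28 months) would give 16-hour tasks under this assumption.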
📜 [Shanghai AI Lab] ϕ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation
- Foresight sampling: leverages simulated future steps to obtain a globally optimal step estimation
- ϕ-Decoding: approximates two distributions through foresight and clustering → samples from their joint distribution
📜 [Rice University] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
- Reasoning models clearly improved reasoning performance, but at the cost of computational overhead
- (1) Model-based efficient reasoning: optimize a full-length reasoning model into a concise one, or train an efficient reasoning model from the start
- (2) Reasoning output-based efficient reasoning: dynamically adjust reasoning steps and length at inference time
- (3) Input prompts-based efficient reasoning: improve reasoning efficiency based on the difficulty or length of the input prompt
📜 [The Hebrew University, IBM, Yale] Survey on Evaluation of LLM-based Agents
- Analyzes LLM agent evaluation benchmarks and frameworks along four dimensions
- (1) fundamental agent capabilities (planning, tool use, self-reflection, memory)
- (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents
- (3) benchmarks for generalist agents
- (4) frameworks for evaluating agents

4th week

📜 [University of Texas at Dallas] A Review of DeepSeek Models' Key Innovative Techniques
- An in-depth review of the concepts used to build the DeepSeek models
- Multi-Head Latent Attention (MLA), Advanced MoE, Multi-Token Prediction (MTP), Group Relative Policy Optimization (GRPO), and more
📜 [ByteDance, Tsinghua] DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- A fully open-source, large-scale RL system based on Qwen2.5-32B
- Proposes the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm
📜 [Hong Kong, Peking] Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
- Proposes the Hierarchical Reward Model (HRM) to address reward hacking
- Evaluates individual and consecutive reasoning steps at both fine-grained and coarse levels
- Reported to be especially good at cases where a flawed earlier step harms later steps
- Proposes Hierarchical Node Compression (HNC), a node merging technique, to address the inefficiency of MCTS
🧑🏻‍💻 [OpenAI] Introducing next-generation audio models in the API
- Releases API models for speech-to-text (Transcribe, Mini Transcribe) and text-to-speech (Mini TTS)
- Described as much more robust to multi-speaker detection, conversation starts and interruptions, noisy environments, and so on
- Enables real-time or batch-processing voice agents
🧑🏻‍💻 [Anthropic] The "think" tool: Enabling Claude to stop and think in complex tool use situations
- A post explaining how and why to use the "think" tool so that Claude can make use of its extended thinking capability
- It lays out the tool schema (needed for the API call) and prompts optimized for it
🧑🏻‍💻 [DeepSeek AI] DeepSeek-V3-0324
- an open-source 685B MoE model with improved front-end generation and tool use
- multi-turn interactive rewriting, translation quality & letter writing, enhances search-based report analysis
- function calling, JSON output, FIM (Fill-in-the-Middle) completion
- Released on HuggingFace under the MIT license
📜 [National University of Singapore, Nanyang] MARS: A Multi-Agent Framework Incorporating Socratic Guidance for Automated Prompt Optimization
- Multi-Agent framework incorpoRating Socratic guidance (MARS): uses multi-agent fusion technology for automatic planning, with gradual continuous optimization and evaluation
- Consists of seven agents, each of which autonomously uses the Planner to devise an optimization path
- Also uses a Teacher-Critic-Student Socratic dialogue to iteratively optimize the prompt
- Aims to overcome the limitations of existing Automated Prompt Optimization (APO)
🧑🏻‍💻 [Google DeepMind] Gemini 2.5: Our most intelligent AI model
- A thinking model that takes first place on LMArena, ahead of GPT4.5 and Claude3
- 1M-token context window, with 2M support coming soon
- Said to be optimized for RAG and document-based workflows
🧑🏻‍💻 ARC-AGI-2 + ARC Prize 2025 is Live!
- An AGI challenge with $1,000,000 in prizes (over 1 billion KRW)
- Centers on reasoning tasks that are easy for humans but hard for AI; the organizers describe it as harder than the previous challenge
🧑🏻‍💻 [OpenAI] Introducing 4o Image Generation
- Features include text rendering, precise prompt following, and use of 4o's inherent knowledge base and chat context
- Models are trained on the joint distribution of online images and text
  - → Described as teaching the model how images and text relate to each other
- Available in ChatGPT and Sora, with API support coming soon
📜 [Tencent] CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision
- CodeTool: a stepwise code generation framework that improves LLM tool invocation by exploiting the concise and easily verifiable nature of code
- (1) On-the-spot Reward: gives immediate feedback on each tool invocation
- (2) Latent Reward: assesses each step's contribution to overall task completion
🧑🏻‍💻 [Alibaba] Qwen2.5 Omni: See, Hear, Talk, Write, Do It All!
- An open-source model (Apache 2.0) that understands and generates text, image, audio, and video
- The Thinker-Talker architecture separates reasoning from speech synthesis, contributing to more structured outputs
  - The Thinker is a language model responsible for reasoning and text generation
  - The Talker generates speech from text or direct audio instructions
- Block-wise processing enables continuous response generation
🧑🏻‍💻 [AI2] Introducing Ai2 Paper Finder
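- An LLM-based literature search system that automates the human-like loop of interpreting a query → searching → evaluating → searching again
- Finds relevant papers from a full natural-language sentence instead of keywords
- When judging relevance, it breaks complex queries into multiple criteria and also expands the search through citations
- Offers a fast mode for quick answers and an iterative exhaustive mode for in-depth search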
📜 [Google] Gemma 3 Technical Report
- Releases Gemma 3, a family of lightweight open models from 1B to 27B
- Vision understanding, more languages, and longer context (128K)
- Increases the ratio of local to global attention layers (more local layers) to keep the KV cache from exploding
- Gemma 3 models are trained with distillation, and both the pre-trained and instruction-finetuned versions outperform Gemma 2
🧑🏻‍💻 [Anthropic] Tracing the thoughts of a large language model
- Anthropic releases two technical papers describing methods for tracing the internal computation of Claude 3.5 Haiku
- For example, feature activations and their effects across transformer layers can be traced
- Claude chooses several future words at once, and uses shared internal states that it maps onto each language
🧑🏻‍💻 [Tencent] Reasoning Efficiency Redefined! Meet Tencent’s 'Hunyuan-T1'—The First Mamba-Powered Ultra-Large Model
- The world's first ultra-large model based on the Mamba architecture (Transformer-Mamba MoE)
- Built on TurboS; strong at in-depth reasoning and at capturing long context
- curriculum learning & self-rewarding

☃ February

1st week

🧑🏻‍💻 AI Coder Reviewer
- An AI code review tool that integrates with Ollama
- Supports automated code review for a variety of programming languages
📜 [GIT] Large Language Models Think Too Fast To Explore Effectively
- Tests whether LLMs can surpass humans on open-ended tasks using Little Alchemy 2
- Humans balance uncertainty and empowerment well; the paper claims that only the o1 model surpasses them
- According to a representational analysis with Sparse Auto Encoders, uncertainty and choices are represented in early layers while empowerment values are processed in later layers, which is said to push the model toward premature decisions (?)
🧑🏻‍💻 [Mistral] Mistral Small 3
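- Scores 81 on MMLU and performs on par with Llama-3.3-70B or GPT-4o-mini on code generation and math tasks
- 24B parameters, 32K context window, 150 tokens per second → runs on an RTX 4090 or a MacBook with 32GB of RAM
- Claimed to be a good base model for further fine-tuning because it uses neither synthetic data nor RLHF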
🧑🏻‍💻 [AI2] Scaling the Tülu 3 post-training recipes to surpass the performance of DeepSeek V3
- Releases Tülu 3 405B, an open-source post-trained model
- Reaches DeepSeek V3 and GPT-4o level performance despite being open source
- Says the Reinforcement Learning from Verifiable Rewards (RLVR) framework greatly improved MATH performance
📜 [DeepSeek] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- Releases DeepSeekMath 7B: DeepSeek-Coder-Base-v1.5 7B further trained on 120B math-related tokens from Common Crawl
- Scores 51.7% on MATH without external tools, approaching GPT-4 and Gemini-Ultra
- A pipeline for carefully selecting web data, plus Group Relative Policy Optimization (GRPO)
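
The "group relative" part of GRPO can be illustrated with a short sketch: several answers are sampled for the same prompt, and each answer's advantage is its reward normalized by the group's mean and standard deviation, so no separate critic is needed. The function name, the use of the population standard deviation, and `eps` are assumptions for illustration rather than details taken from the entry above.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


rewards = [1.0, 0.0, 0.0, 1.0]  # e.g. correctness of four sampled solutions
advantages = group_relative_advantages(rewards)
print(advantages)
```

When all samples in a group receive the same reward, every advantage is zero and that group contributes no learning signal.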
🧑🏻‍💻 [OpenAI] OpenAI o3-mini
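- A small-scale reasoning model designed for STEM, coding, and logical problem-solving
- Takes the place of o1-mini (for example, the existing o1-mini API is replaced by o3-mini)
- Unlike o1, it does not support vision
- Seen as a move to counter DeepSeek-R1, which drew explosive interest over the Lunar New Year holiday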
🧑🏻‍💻 [OpenAI] Introducing deep research
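- An agent feature that carries out tasks through multi-step reasoning over large amounts of online information
- Overcomes a limitation of earlier reasoning models, which could not access the internet
- Scores 26.6% on Humanity's Last Exam, which is known to be extremely difficult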
📜 [HKU, UC Berkeley, Google DeepMind, NYU] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
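- A comparative analysis of how SFT and RL affect generalization and memorization
- Checks whether trained models generalize to unseen textual and visual domains
- SFT largely memorizes the training data, whereas RL actually helps generalization; SFT does, however, help maintain the answer format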
📜 [Arizona, UCLA] Preference Leakage: A Contamination Problem in LLM-as-a-judge
- Coins the term preference leakage for the contamination of LLM-as-a-judge caused by relatedness between the synthetic data generator and the LLM-based evaluator
- Examines three types of relatedness: same model, inheritance relationship, same model family
- Claims that clear preference leakage exists between models
📜 [Chinese Academy of Sciences] DeepRAG: Thinking to Retrieval Step by Step for Large Language Models
- Proposes DeepRAG, a framework that models retrieval-augmented reasoning as an MDP
- Iteratively decomposes the query and decides at each step whether to retrieve external knowledge or rely on parametric reasoning
🧑🏻‍💻 [Google] Gemini 2.0 is now available to everyone
- Releases the Gemini 2.0 models with multimodal reasoning (Flash, Flash-Lite, Pro Experimental)
- Flash and Flash-Lite have a 1M context window; Pro Experimental has a 2M context window
- Produces higher-quality answers than 1.5 Flash without increasing cost & latency
🧑🏻‍💻 [Anthropic] Constitutional Classifiers: Defending against universal jailbreaks
- Paper link 🔗
- Reported to remain robust even after thousands of hours of attempted universal jailbreaks
- Even so, the rate of unnecessary refusals rose by only 0.38%
- A HackerOne bounty is running: $10,000 for whoever breaks all 8 levels of the jailbreaking demo, and $20,000 for doing so with a universal jailbreaking strategy
🧑🏻‍💻 [HuggingFace] Open-source DeepResearch – Freeing our search agents
- A post on reproducing OpenAI's Deep Research and releasing the result as open source
- Notes that Deep Research achieved high performance on the GAIA benchmark
- Explains that CodeAgent can be used to design complex sequences of actions
🧑🏻‍💻 [OpenAI] Introducing ChatGPT search
- The feature first released on October 31, 2024 is now fully supported
- A Chrome extension lets you set ChatGPT search as the default search engine
📜 [Stanford, Washington, AI2] s1: Simple test-time scaling
- Aims at test-time scaling & strong reasoning performance in the style of OpenAI's o1
- s1K: a dataset of reasoning traces validated on three criteria (difficulty, diversity, quality)
- budget forcing: controls test-time compute by either forcibly ending the model's thinking or, when the model tries to stop, appending “Wait” several times to extend it
- s1-32B, obtained by training Qwen2.5-32B-Instruct on s1K, shows large gains in math when budget forcing is applied
- Model, data, and code are open-sourced on GitHub 🔗
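
Budget forcing as described above can be sketched as a small decoding loop. This is only an illustration under stated assumptions: `next_chunk` is a hypothetical callable standing in for a real model, the budget is counted in chunks rather than tokens, and the `END` delimiter string is made up for the example.

```python
END = "<end_of_thinking>"  # hypothetical delimiter, not the model's real token

def budget_forced_thinking(next_chunk, prompt, min_chunks, max_chunks):
    """Keep the reasoning trace between a minimum and maximum budget."""
    trace = []
    while len(trace) < max_chunks:
        chunk = next_chunk(prompt + "".join(trace))
        if chunk == END:
            if len(trace) >= min_chunks:
                break              # the model may stop: minimum budget is met
            chunk = " Wait,"       # suppress the stop and nudge it to continue
        trace.append(chunk)
    # Reaching max_chunks forces the end of thinking regardless of the model.
    return "".join(trace) + END

def toy_model(text):
    # Toy stand-in: wants to stop right after any text ending in "step."
    return END if text.endswith("step.") else " step."

extended = budget_forced_thinking(toy_model, "Q:", min_chunks=3, max_chunks=8)
capped = budget_forced_thinking(toy_model, "Q:", min_chunks=3, max_chunks=2)
print(extended)
print(capped)
```

The first call shows the lower bound at work (one "Wait" is injected), the second shows the upper bound cutting the trace short.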
🧑🏻‍💻 [Ai2] Ai2 Scholar QA beta
- A solution that makes literature review easier during research
- Features include Section Planning and Generation and Paper Comparison Table Generation
- See the blog post (Introducing Ai2 ScholarQA)
📜 [HuggingFace] SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
- Releases a 1.7B “small” language model
- Trained on about 11T tokens through a multi-stage process that mixes math, code, and instruction-following data with web text
- Introduces new specialized datasets (FineMath, Stack-Edu, SmolTalk) to address existing datasets being too small or too low in quality
- Reports SoTA-level performance among similarly sized models (Qwen2.5-1.5B, Llama3.2-1B)
📜 [T-Tech] Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
- Uses sparse autoencoders to identify features that exist across consecutive layers of a language model
- data-free cosine similarity technique: tracks how specific features persist, transform, or first appear
- This yields interpretability & mechanistic insights into model computation
📜 [Shanghai AI Lab, Peking] UltraIF: Advancing Instruction Following from the Wild
- UltraIF: decomposes real-world user prompts into simpler queries, constraints, and corresponding evaluation questions
- To do so, trains UltraComposer to compose constraint-associated prompts together with evaluation questions
- Reports meaningful gains even when an 8B model serves as both response generator & evaluator
🧑🏻‍💻 [Mistral] The all new le Chat: Your AI assistant for life and work
- Releases the chatbot Le Chat for iOS, Android, and enterprise infrastructure
- Headline features: Flash Answers, a built-in code interpreter, real-time search
- Flash Answers can generate about 1,000 words per second; in the demo it is clearly far faster than competing services (ChatGPT, Claude)

2nd week

📜 [Nanjing Univ.] Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models
- Reasoning models such as o1 still suffer from overthinking & over-reliance on auxiliary reward models
- Argues the fix is to let the LLM decide autonomously when and where to backtrack (as in traditional search algorithms)
- Proposes a self-backtracking mechanism that allows backtracking during both training & inference
- Reports roughly a 40% performance gain over the optimal-path supervised fine-tuning method, though it is unclear why that is the chosen baseline
📜 [SJTU] LIMO: Less is More for Reasoning
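- Claims that complex mathematical reasoning can be acquired from extremely little data (rather than hundreds of thousands of examples)
- This runs counter to the claim that supervised fine-tuning leads to memorization rather than generalization
- Proposes the LIMO Hypothesis based on LIMO, trained on 817 curated training samples
  - If domain knowledge is sufficiently encoded during pre-training, sophisticated reasoning can be elicited with minimal data that demonstrates cognitive processes
  - Two factors matter: (1) the knowledge the model acquired during pre-training (2) the effectiveness of the post-training examples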
🧑🏻‍💻 [Harvard] Data.gov Archive
- A federal public dataset collection, 16TB in size and made up of 311,000 datasets
📜 [Apple] ELEGNT: Expressive and Functional Movement Design for Non-anthropomorphic Robot
- Releases a prototype that explores the interplay between functional & expressive objectives in movement design
  - expressive: intention, attention, emotions
  - functional: task fulfillment, spatial constraints, time efficiency
- Nonverbal behaviors such as posture, gesture, and gaze express internal state both consciously & unconsciously, so the work reflects them in the movement decisions of a (lamp-like) robot
- Presents results showing expression-driven movements are preferred over function-driven movements
🧑🏻‍💻 [HuggingFace] π0 and π0-FAST: Vision-Language-Action Models for General Robot Control
- Releases robotics foundation models in HuggingFace's LeRobot
- Models of this kind appear to be called Vision-Language-Action (VLA) models
- A HuggingFace blog post that walks through installation to training with detailed code examples
📜 [ISTA] QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
- A Quantization-Aware Training (QAT) method, where the model is trained with quantization in the loop
- QuEST: trains model weights & activations at 4-bit or lower while reaching performance close to FP16; said to train stably even at 1-bit
- Made possible by (1) quantizing while preserving the continuous distribution of weights & activations during normalization (2) a new trust gradient estimator
📜 [Ben Gurion Univ.] Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
- Chameleon Benchmark Overfit Detector (C-BOD): a framework that systematically distorts prompts to judge whether an LLM has overfit to a specific benchmark
- Can be integrated into the training pipeline to help build robust language models
- The focus is on judging whether good scores come from memorized patterns
- Reports that, unexpectedly, higher-performing models degraded more under perturbation
📜 [AIRI] SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators
- Releases a pipeline for generating multilingual parallel detoxification data
- SynthDetoxM: a manually & synthetically generated multilingual parallel detoxification dataset of 16K samples
📜 [Shanghai AI Lab] Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
- Explains that the compute-optimal strategy for Test-Time Scaling (TTS) depends heavily on the policy model and the PRM (Process Reward Model)
- Claims that with compute-optimal TTS an extremely small policy model (around 1B) can outperform very large models (405B or GPT-4o)
- GitHub link 🔗
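
As a rough illustration of how a PRM can steer test-time scaling, the sketch below shows PRM-weighted best-of-N voting: sample N solutions, score each with a reward model, and pick the final answer with the highest total score. This is one common TTS strategy and not necessarily the configuration the paper finds compute-optimal; the candidate answers and scores are invented for the example.

```python
from collections import defaultdict

def weighted_best_of_n(candidates):
    """candidates: list of (final_answer, prm_score) pairs from N samples."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score  # answers reached several times accumulate weight
    return max(totals, key=totals.get)

# Hypothetical samples: "41" has the single best score, "42" wins on total weight.
samples = [("42", 0.6), ("41", 0.9), ("42", 0.5)]
best = weighted_best_of_n(samples)
print(best)
```

Raising N spends more inference compute per question, which is the knob a compute-optimal strategy has to tune per policy model and PRM.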
🧑🏻‍💻 [OpenAI] Sam Altman reveals GPT-5 will merge o-series models, removing manual model selection
- GPT-4.5 (Orion) will be the last non-chain-of-thought model before GPT-5 / expected to ship in a few weeks or months
- Reasoning models will not ship separately and will be merged into GPT-5
🧑🏻‍💻 [Anthropic] The Anthropic Economic Index
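- Uses Claude usage data to analyze AI's impact on jobs and the economy
- Reports that about 43% of AI usage was automation (AI carrying out the task directly), with the rest being augmentation
- paper link 🔗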
📜 [Oxford] Distillation Scaling Laws
- Proposes a distillation scaling law that estimates distilled model performance from the compute budget & its allocation between student and teacher
- Presents results separately for (1) when a teacher already exists (2) when the teacher must be trained
- The takeaway seems to be that during distillation one should scale appropriately while watching the cross-entropy loss of the teacher as well as the student
📜 [Imperial College London, Cohere] LLMs can implicitly learn from mistakes in-context
- Studies whether mistakes made in mathematical reasoning help performance even when no explanation of them is given
- Experiments show that simply presenting incorrect answers alongside correct ones improves performance, and can also boost CoT
- Concludes that LLMs are capable of implicit in-context learning
📜 [Amazon, UCLA] Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs (ICLR 2025)
- PrefEval: a benchmark that evaluates whether an LLM can reason consistently about a user's preferences in a long-context conversational setting
- 3,000 curated preference & query pairs covering 20 topics
- Evaluated with multi-session conversations of up to 100k tokens of context
- GitHub link 🔗
📜 [Meta, KAIST, UC San Diego] LLM Pretraining with Continuous Concepts
- Continuous Concept Mixing (CoCoMix): a pretraining framework that combines discrete next token prediction with continuous concepts
- CoCoMix predicts “continuous concepts” learned from a pretrained sparse autoencoder and mixes them into the model's hidden state by interleaving them with token hidden representations
- Described as more sample efficient than plain next token prediction while consistently performing better
📜 [University of Hong Kong, ByteDance] Goku: Flow Based Video Generative Foundation Models
- Demo page link 🔗
- A SoTA model family for joint image-and-video generation built with rectified flow Transformers
- Describes the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training
- Reports the top results in both quantitative & qualitative evaluation on major tasks
📜 [SNU, Cornell] Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation
- In Text-to-image (T2I), large-scale text encoders perform strongly but typically use up to 8x more memory than the denoising module
- Skrr (Skip and Re-use layers): a strategy for efficiently pruning the text encoder in T2I diffusion models
- Selectively skips transformer blocks or reuses some layers

3rd week

📜 [Convergence Labs] LM2: Large Memory Models
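- Attaches an auxiliary memory module, used as a contextual representation repository, to overcome limitations of the standard Transformer architecture
- The memory interacts with input tokens through cross attention and is updated through a gating mechanism
- Reports that it keeps strong performance on general benchmarks and makes notable progress on multi-hop tasks
- Also has advantages in interpretability and test-time behavior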
📜 [ELLIS Institute Tübingen] Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- Iterates a recurrent block so that depth can be chosen arbitrarily at test time
- Does not rely on CoT, so it needs no specialized training data and works even with small context windows
📜 [Meta AI] Brain-to-Text Decoding: A Non-invasive Approach via Typing
- Brain2Qwerty: a deep learning architecture that decodes sentences from electro- | magneto-encephalography (EEG | MEG) recorded while participants type on a QWERTY keyboard
- Earlier approaches rely on invasive devices; this one is non-invasive, and its contribution is narrowing the gap between the two
- Reports a character-error-rate (CER) of 32% with MEG, a large improvement over the 67% error rate with EEG
📜 [University of California, Berkeley] LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
- Claims LLMs can learn Long CoT reasoning through data-efficient SFT and LoRA
- Reports results from training Qwen2.5-32B on 17k CoT training samples
- Claims the structure of the Long CoT affects learning far more than the content of individual reasoning steps (logical consistency matters!)
- The academic paper for the Sky-T1-32B-Preview model the authors released earlier
📜 [NYU, Tubingen] Do Large Language Models Reason Causally Like Us? Even Better?
- Asks whether LLM answers come from understanding | statistical patterns
- The paper frames this as a scale from human-like to normative inference
- Of the four models tested, GPT-4o and Claude showed the most normative behavior, while Gemini-Pro and GPT-3.5 did not
- Is there even a criterion for judging whether human answers come from genuine understanding?
🧑🏻‍💻 [Perplexity] Introducing Perplexity Deep Research
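- Releases a feature that runs dozens of searches, reads hundreds of sources, and autonomously generates a report
- Handles tasks ranging from finance and marketing to product research at an expert level
- The final report can be exported as a PDF or document, or converted to a Perplexity Page for sharing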
📜 [Renmin Univ. of China] Large Language Diffusion Models
- LLaDA: a diffusion model trained from scratch with pretraining & SFT
- Claims strong scalability and performance that beats the authors' self-constructed autoregressive baselines
- Models the distribution through a forward data masking process & a reverse process, with a Transformer predicting the masked tokens
📜 [Virginia Tech, Oxford] Towards Reasoning Ability of Small Language Models
- A survey of experiments running 72 SLMs from 6 model families on 14 reasoning benchmarks
- Uses 4 evaluation methods and 4 LLMs as judges; each experiment is repeated 3 times
- Also evaluates adversarial conditions and intermediate reasoning steps
🧑🏻‍💻 [xAI] Grok 3 Beta — The Age of Reasoning Agents
- xAI's LLM, introduced with the claim of being the smartest model on Earth
- Think Mode for logical processing, Big Brain Mode for complex problem-solving
- Uses 200,000 H100s for faster query processing (more than 10x its predecessor)
- Grok 3 is available to X Premium Plus subscribers
📜 [DeepSeek, Peking, Washington] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- NSA: uses a dynamic hierarchical sparse strategy that combines coarse-grained token compression with fine-grained token selection
- Well optimized for current GPUs & supports end-to-end training
🧑🏻‍💻 [Microsoft] OmniParser V2: Turning Any LLM into a Computer Use Agent
- OmniParser: tokenizes UI screenshots from pixel space into structured elements
- Trained on a large set of interactive element detection data & icon functional caption data
- Reports high performance on the ScreenSpot Pro benchmark
- OmniTool: a dockerized Windows system that bundles tools for agents
📜 [Michigan, Amazon, Pennsylvania] Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models
- Raises the problem of extra compute and latency caused by unnecessary steps in Long CoT
- Proposes a method that uses perplexity as the measure of importance
  - If removing a step raises perplexity, that step matters to the model
- Can be used to remove unnecessary steps from few-shot CoT samples, or to fine-tune only on the surviving (critical) steps
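
The pruning rule above can be sketched as a greedy loop: drop a step, compare perplexity before and after, and keep the step only if its removal hurts. Everything here is an assumption for illustration: `perplexity` is a hypothetical callable that would wrap a real language model, the threshold is arbitrary, and the toy scorer simply penalizes missing "critical" steps.

```python
def prune_steps(steps, perplexity, threshold):
    """Greedily drop reasoning steps whose removal barely raises perplexity."""
    kept = list(range(len(steps)))
    for i in range(len(steps)):
        trial = [j for j in kept if j != i]
        full_ppl = perplexity([steps[j] for j in kept])
        trial_ppl = perplexity([steps[j] for j in trial])
        if trial_ppl - full_ppl < threshold:
            kept = trial  # removing step i cost little, so treat it as redundant
    return [steps[j] for j in kept]

CRITICAL = {"set up equation", "solve for x"}

def toy_perplexity(steps):
    # Toy stand-in for a model: each missing critical step adds 3.0 perplexity.
    return 5.0 + 3.0 * len(CRITICAL - set(steps))

chain = ["set up equation", "restate the question", "solve for x"]
pruned = prune_steps(chain, toy_perplexity, threshold=1.0)
print(pruned)
```

The pruned chain could then serve as a shorter few-shot demonstration or as fine-tuning data, matching the two uses listed above.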
📜 [AIRI] Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity
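- Existing methods reach at most about 10x lossless compression, although the theoretical capacity of a real-valued vector is far higher even at 16-bit precision (points out the gap between theory and practice)
- Claims to reach a compression rate above 1500x
- Explains that the limit on compression is set not by input length but by the amount of uncertainty to be reduced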
🧑🏻‍💻 [Google Research] Accelerating scientific breakthroughs with an AI co-scientist
- A multi-agent AI system built on Gemini 2.0 to assist researchers
- A Supervisor agent assigns tasks to 6 specialized agents
  - Generation, Reflection, Ranking, Evolution, Proximity, Meta-review
- paper link 🔗
🧑🏻‍💻 [Sakana AI] The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition
- Presents an agentic framework that fully automates CUDA kernel discovery & optimization
- Translates PyTorch code into CUDA kernels → optimizes runtime performance through evolutionary meta-generation
- Reports a median 1.52x speedup on 186 of 250 test tasks
- paper link 🔗
📜 [Meta] MLGym: A New Framework and Benchmark for Advancing AI Research Agents
- MLGym, MLGym-Bench: a framework and benchmark for LLM agents on AI research tasks
- The benchmark consists of 13 tasks spanning CV, NLP, RL, and Game Theory
- The framework helps add and integrate new tasks
📜 [The Univ. of Melbourne] Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models
- Argues that existing benchmark evaluations cannot determine an LLM's ability to perform ‘cognitive tasks’
- Explains that evaluation with adversarial stimuli & interpretability techniques showed results that are not robust across many languages and reasoning tasks

4th week

🧑🏻‍💻 [StepFun, Tsinghua] Open-Reasoner-Zero
- An open-source reasoning-oriented RL training implementation focused on scalability, simplicity, and accessibility
- minimalist approach: vanilla PPO with GAE & rule-based reward function / w/o KL regularization
- Outperforms DeepSeek-R1-Zero-Qwen-32B on GPQA Diamond with only 1/30 of the training steps
- paper link 🔗
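
For readers unfamiliar with GAE (Generalized Advantage Estimation), the standard recursion is δ_t = r_t + γ·V(s_{t+1}) − V(s_t) and A_t = δ_t + γλ·A_{t+1}. The sketch below is a minimal version of that textbook formula; the default γ = λ = 1 and the example rewards and values are assumptions chosen for illustration, not settings confirmed by this entry.

```python
def gae(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation for a single trajectory.

    `values` holds one more entry than `rewards`: the bootstrap value
    of the state reached after the final step (0.0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Rule-based reward only at the end (1.0 = final answer judged correct).
rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.5, 0.8, 0.0]
advantages = gae(rewards, values)
print(advantages)
```

With γ = λ = 1 each advantage reduces to the total return minus the value estimate at that step, which is why a sparse end-of-episode reward still reaches every token.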
🗞️ [1X] Introducing NEO Gamma
- Unveils the humanoid generation that follows NEO Beta
- Positioned as a “companion” and shown moving naturally in home environments (see the demo in the link)
📜 [Alibaba] Qwen2.5-VL Technical Report
- enhanced visual recognition, precise object localization, robust structured data extractions, document parsing, long-video comprehension
- Notable for localizing objects accurately with bounding boxes or points
- Introduces dynamic resolution processing & absolute time encoding → can handle images of varying sizes and long videos
- Claims robust performance across domains without task-specific fine-tuning
📜 [Arizona, UCLA, Notre Dame, UIUC] Preference Leakage: A Contamination Problem in LLM-as-a-judge
- Studies three relationships between the data generator LLM and the judge LLM
- (1) being the same model (2) having an inheritance relationship (3) belonging to the same model family
- Empirically confirms, across several LLM baselines and benchmarks, that judge bias exists depending on the relationship (preference leakage)
- If so, shouldn't a variety of LLMs be used when generating data?
🧑🏻‍💻 [Anthropic] Claude 3.7 Sonnet and Claude Code
- Claude 3.7 Sonnet: can return answers that combine instant responses with step-by-step thinking
  - Output length in thinking mode extended up to 128K tokens
  - Thinking time can also be controlled through the API
- Claude Code: a CLI AI coding assistant
  - Supports repository search, file editing, and commits to GitHub
🧑🏻‍💻 [AI2] Efficient PDF Text Extraction with Vision Language Models
- A toolkit that converts PDFs and document images into clean, structured text
- Fine-tuned on 250,000 pages from a wide variety of PDFs
- $190 per 1M PDF pages → presented as 32x cheaper than the GPT-4o batch API
- Returns output as markdown
🧑🏻‍💻 [Alibaba] Wan 2.1: Leading AI Video Generation Model (Wanx 2.1)
- An open-source model family that takes text and image input and generates high-quality images & videos
- Released in two versions: T2V-1.3B and 14B
- Available on HuggingFace and various other platforms
🧑🏻‍💻 [Google] Get coding help from Gemini Code Assist — now for free
- Supported in VS Code, JetBrains IDEs, and GitHub
- Powered by Gemini 2.0, with 180,000 code completions per month (90x the 2,000 per month of the GitHub Copilot free tier)
- Understands complex code bases thanks to a 128K context window
- Automatically detects stylistic issues and bugs in code
📜 [Kakao] Kanana: Compute-efficient Bilingual Language Models
- A bilingual language model series that handles Korean & English
- high quality data filtering, staged pre-training, depth up-scaling, pruning, distillation
- Reports in particular the methods used to post-train the Kanana models
- Models range from 2.1B to 32.5B; the 2.1B model is publicly released
🧑🏻‍💻 [Amazon] Introducing Alexa+, the next generation of Alexa
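- A system spanning tens of thousands of services and devices that performs complex multi-step tasks without supervision
- Routes each task to the best-suited of several foundational LLMs, including Amazon’s Nova & Anthropic’s Claude
- Built around domain-specific experts, with personalized features (based on user history)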
📜 [Meta, UIUC, CMU] SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
- An approach for extending RL-based LLM reasoning to real-world software engineering
  - Points out that models like DeepSeek-R1 were trained on code that is easy to execute, such as coding-test problems, and far from real-world software
- Autonomously learns real developers' reasoning processes & solutions from open-source software evolution data
  - GitHub Pull Requests Dataset Curation (4.6M repositories)
  - Leverages a lightweight rule-based reward
- Llama3-SWE-RL-70B reaches 41.0% on SWE-bench Verified
  - The only open-source model under 100B with performance comparable to GPT-4o
📜 [Zoom] Chain of Draft: Thinking Faster by Writing Less
- Unlike LLMs, people reason efficiently by drafting concise intermediate thoughts that capture only the essential information
- Chain of Draft (CoD): mirrors human cognitive processes by keeping only essential, useful information while working through a task
- Maintains performance while using as little as 7.6% of the tokens → lowers inference cost and latency

🙇🏻 January

1st week

📜 [NVIDIA, HuggingFace] Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
- ModernBERT: a Pareto improvement among encoder-only models
- Trained on 2T tokens with a sequence length of 8192
- Achieves SoTA on classification and single-/multi-vector retrieval tasks
📜 [Google] LearnLM: Improving Gemini for Learning
- Existing LLMs focus on providing information and are not well suited to educational settings
- A framework for evaluating specific pedagogical attributes
- LearnLM, trained with pedagogical instruction following, was rated favorably across a variety of learning scenarios
📜 [Nanjing Univ., Baidu] Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization
- CV has not yet reached the zero-shot generalization performance of NLP
- Uses Explanatory Instructions instead of discrete & terminological task definitions
- Builds a dataset of 12M ‘image input → explanatory instruction → output’ triplets
- Trains an auto-regressive-based vision-language model (AR-based VLM)
📜 [Microsoft] Bootstrap Your Own Context Length
- Proposes a bootstrapping approach that trains long-context LMs using only short-context capabilities
- A simple agent flow that synthesizes diverse long-context instruction tuning data
- In other words, claims that long-context language models can be built using only short-context language models
- Reports extending Llama-3 family models up to 1M tokens
📜 [GIT, Washington, CMU, AI2] Multi-Attribute Constraint Satisfaction via Language Model Rewriting
- Multi-Attribute Constraint Satisfaction (MACS): a general method for training language models to satisfy user-specified constraints over multiple external real-valued attributes
- Trains the LM as an editor by sampling diverse multi-attribute targets from initial paraphrased outputs
- Builds the Fine-grained Constraint Satisfaction (FineCS) benchmark to evaluate this properly
  - Consists of two challenging tasks: Text Style Transfer and Protein Design
📜 [Xiaoduo AI Lab] Xmodel-2 Technical Report
- A 1.2B sLLM specialized for reasoning tasks
- Its architecture lets models of different scales share a unified set of hyperparameters, so the best settings can be carried over when scaling to larger models
- Uses the WSD learning rate scheduler from MiniCPM
- GitHub link 🔗
📜 [Tencent] HunyuanProver: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving
- HunyuanProver: a language model fine-tuned from Hunyuan 7B for interactive automatic theorem proving with LEAN4
- Designs an iterative data synthesis framework to address the data sparsity issue
- Designs a guided tree search algorithm for system 2 thinking
- Releases 30k synthetic samples: each consists of the original natural-language question, its autoformalized version, and a proof from HunyuanProver
📜 [Meta] MLLM-as-a-Judge for Image Safety without Human Labeling
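- Checking whether AI-generated content (AIGC) contains harmful content is important, and this work uses an MLLM for it
  - Existing problems: human labeling and guideline creation are too expensive, and the rules need periodic updates
- Proposes a method that lets an MLLM assess, zero-shot, the relevance between a given rule and an image and reach a quick judgment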
📜 [Toronto] Toward Adaptive Reasoning in Large Language Models with Thought Rollback (ICML 2024)
- Presents Thought Rollback (TR), a reasoning framework in which the LLM adaptively builds its thought structure to mitigate hallucination
- TR's core mechanism is rolling back thoughts: the LLM performs error analysis on its thoughts and rolls back to the earlier mistaken thought
- Includes this trial-and-error in the prompt to build a more reliable reasoning path
- GitHub link 🔗
📜 [Taiwan, Intel] Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
- How can downstream task performance be improved without relying on additional safety data?
- ⇒ Merge the pre- & post-fine-tuned safety-aligned models
- Step 1. Downstream Task Fine-Tuning → Step 2. Combining Base and Fine-tuned Model

2nd week

📜 [Shenzhen] ICPC: In-context Prompt Compression with Faster Inference
- ICPC: a prompt compression method that adaptively shortens the prompt
- Uses an encoder to compute the probability of each word in the prompt, then applies an information function to compute each word's information so that information loss is minimized
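
The idea of scoring words by information and dropping the least informative ones can be sketched as follows. This is a simplified illustration and not the paper's algorithm: `prob` is a hypothetical word-probability function that a real system would obtain from an encoder, the information function is assumed to be −log p, and the probabilities below are invented.

```python
import math

def compress_prompt(words, prob, keep_ratio):
    """Keep the most informative words (highest -log p) in their original order."""
    info = [-math.log(prob(w)) for w in words]
    k = max(1, round(len(words) * keep_ratio))
    ranked = sorted(range(len(words)), key=lambda i: info[i], reverse=True)
    keep = sorted(ranked[:k])  # restore original word order
    return [words[i] for i in keep]

# Hypothetical probabilities: frequent function words are highly predictable.
TOY_PROBS = {"the": 0.9, "cat": 0.1, "sat": 0.2, "on": 0.8, "mat": 0.1}

words = "the cat sat on the mat".split()
compressed = compress_prompt(words, TOY_PROBS.get, keep_ratio=0.5)
print(compressed)
```

Predictable words carry little information, so removing them shortens the prompt while losing as little content as possible.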
📜 [AI2, Washington, NYU] 2 OLMo 2 Furious
- OLMo 2 is a family of dense autoregressive models with an improved architecture, training recipe, and pretraining data
- Dolmino Mix 1124: a pretraining data mixture used for late-stage curriculum training
- Applies the best practices from Tulu 3 to OLMo 2-Instruct, focusing on final-stage reinforcement learning with verifiable reward (RLVR)
📜 [Berkeley, CMU] AutoPresent: Designing Structured Visuals from Scratch
- SlidesBench: a benchmark for the task of automatically generating slides from natural language instructions
  - 585 testing samples drawn from 310 slide decks across 10 domains
  - (1) reference-based: evaluates similarity to the target slide
  - (2) reference-free: evaluates the design quality of the generated slide itself
- AutoPresent: an 8B Llama-based model trained on 7k pairs of instructions & slide-generation code
- Reports that iterative design refinement, where the model self-refines its own output, leads to meaningful improvement
- GitHub link 🔗
🧑🏻‍💻 [HuggingFace] SmolAgents
- HuggingFace's open-source library for running powerful agents in a few lines of code
- Works with any transformers-compatible model uploaded to the Hub, as well as OpenAI, Anthropic, and Meta models
📜 [Chinese Academy of Sciences] Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
- Auto-RT: a reinforcement learning framework that automatically explores & optimizes complex attack strategies
- Two key points for reducing exploration complexity and improving strategy optimization
  - (1) Early-terminated Exploration
  - (2) Progressive Reward Tracking algorithm
- GitHub link 🔗
📜 [Orange] Survey on Question Answering over Visually Rich Documents: Methods, Challenges, and Trends
- Visually-rich Document Understanding (VrDU) requires both comprehension and generation abilities
- Surveys methods for improving VrDU models with LLMs, along with their limitations
🧑🏻‍💻 [Google] Agents
- A whitepaper explaining how AI agents combine reasoning, tools, and external data
- Defines three core components: Decision Engine, Tool Integration, Orchestration Layer
- Tools are divided by functionality into Extension, Function, and Data Stores
🧑🏻‍💻 [NVIDIA] NVIDIA Announces Nemotron Model Families to Advance Agentic AI
- Releases open-source LLMs said to optimize AI agents up to 4x faster
- Part of a push to build out the NVIDIA NeMo platform, including NVIDIA NeMo Retriever
📜 [IBM] MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems
- MTRAG: an end-to-end human-generated multi-turn RAG benchmark
- 110 conversations averaging 7.7 turns across 4 domains, covering 842 tasks in total
- Also includes an automated LLM-as-a-Judge pipeline that uses synthetic data
- GitHub link 🔗
📜 [Korea Univ.] SUGAR: Leveraging Contextual Confidence for Smarter Retrieval (ICASSP 2025)
- Semantic Uncertainty Guided Adaptive Retrieval (SUGAR): uses context-based entropy to decide between single- and multi-step retrieval
- Minimizes hallucination that arises because the LLM cannot tell whether external knowledge is relevant
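
An entropy-gated retrieval decision of this kind can be sketched as below. Several simplifications are assumed: exact string matching stands in for the semantic clustering a real system would use, the two thresholds are arbitrary, and the extra "no_retrieval" branch is an assumption of this sketch and not something stated in the entry above.

```python
import math
from collections import Counter

def answer_entropy(samples):
    """Shannon entropy (nats) over distinct sampled answers."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def retrieval_mode(samples, low=0.3, high=0.9):
    """Map uncertainty to a retrieval strategy; thresholds are illustrative."""
    h = answer_entropy(samples)
    if h < low:
        return "no_retrieval"   # the model is consistent, trust parametric knowledge
    if h < high:
        return "single_step"
    return "multi_step"         # highly uncertain, retrieve iteratively

confident = retrieval_mode(["Paris", "Paris", "Paris", "Paris"])
unsure = retrieval_mode(["Paris", "Paris", "Paris", "Lyon"])
lost = retrieval_mode(["Paris", "Lyon", "Nice", "Lille"])
print(confident, unsure, lost)
```

The design choice is to spend retrieval effort only where the model's own answers disagree, which avoids injecting irrelevant context into questions it already answers consistently.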
🧑🏻‍💻 [NVIDIA] Cosmos
- An open-source video model that can generate synthetic data for autonomous driving and robotics
- Diffusion-based models trained on 20M hours of video & 9,000T tokens
- Supports autoregressive, text-to-video, video-to-video, and combined inputs
🧑🏻‍💻 [LangChain] Structured Report Generation Blueprint with NVIDIA AI
- Built in partnership with NVIDIA: an AI agent for Structured Report Generation
- optimized Llama 3.3 and LangGraph integration
📜 [NYU] Entropy-Guided Attention for Private LLMs
- Using Shannon’s entropy as the measure, observes entropic overload in early layers and under-utilization in later layers from an MHA perspective
- Mitigates entropic overload with an entropy-guided attention mechanism paired with an entropy regularization technique
📜 [Renmin, Tsinghua] Search-o1: Agentic Search-Enhanced Large Reasoning Models
- Large reasoning models (LRMs) such as OpenAI-o1 persistently suffer from knowledge insufficiency
- Search-o1: a framework that adds an agentic RAG mechanism and a Reason-in-Documents module to LRMs
- GitHub link 🔗
📜 [Microsoft] GeAR: Generation Augmented Retrieval
- GeAR: adds well-designed fusion & decoding modules to generate relevant text from a fused representation of the query and document
- Adds no extra computational burden to the bi-encoder
- Builds an effective synthetic data pipeline using LLMs

3rd week

📜 [Nanyang, Fudan] Long Context vs. RAG for LLMs: An Evaluation and Revisits
- A paper comparing Long Context (LC) vs. RAG
- (1) On QA benchmarks, LC generally outperforms RAG
- (2) Summarization-based RAG is comparable to or better than LC, but chunk-based retrieval falls somewhat short
- (3) RAG has the edge on dialogue-based & general question queries
📜 [SynthLab, Stanford, UC Berkeley] Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
- Meta Chain-of-Thought (Meta-CoT): a framework that extends traditional CoT by explicitly modeling the underlying reasoning needed to arrive at a particular CoT
- Explores methods for producing Meta-CoT, including process supervision, synthetic data generation, and search algorithms
- Integrates instruction tuning with linearized search traces & reinforcement learning post-training
📜 [OneLineAI, Yonsei] Multi-Step Reasoning in Korean and the Emergent Mirage
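- HRMCR (HAE-RAE Multi-Step Commonsense Reasoning): a multi-step reasoning benchmark that reflects Korean cultural and linguistic characteristics
- Questions are generated automatically from templates and algorithms
- Emergent behavior is observed in models trained beyond a certain threshold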
🧑🏻‍💻 [Mistral] Codestral 25.01
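- Features a more efficient architecture and an improved tokenizer
- As a result, generates code more than 2x faster
- Supports 256k context length and achieves SoTA on benchmarks across many programming languages
- A Chat Demo version is available in VS Code or JetBrains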
🧑🏻‍💻 [UCBerkeley NovaSky] Sky-T1: Train your own O1 preview model within $450
- Open-sources 17K math, coding, and science samples / code for data curation, training, and evaluation / model weights
- Generates the initial data with QwQ-32B-Preview, then applies rejection sampling
- Fine-tunes Qwen2.5-32B-Instruct on the curated dataset
📜 [Microsoft] rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
- Claims SLMs can match or exceed OpenAI o1-level math reasoning without distillation
- Reports that this was achieved through deep thinking via MCTS
- (1) code-augmented CoT data synthesis method (2) a reward model training method that avoids naive step-level score annotation (3) self-evolution recipe
🧑🏻‍💻 [AMD, Johns Hopkins] Agent Laboratory: Using LLM Agents as Research Assistants
- Takes a human-produced research idea as input and returns research results and a code repository
- A structured framework that adapts to the available computational resources, whether a MacBook or a GPU cluster
- Three stages: (1) Literature Review (2) Experimentation (3) Report Writing
📜 [Google Research] Titans: Learning to Memorize at Test Time
- Proposes a new long-term memory module to overcome attention's inability to cover long contexts
- Learns how to memorize historical context so that attention over the current context can draw on information from the distant past
- The result is Titans, a new architecture family built on two modules: attention and neural memory
- Reports accurate needle-in-haystack performance even beyond a 2M context size
📜 [Minimax] MiniMax-01: Scaling Foundation Models with Lightning Attention
- Releases the MiniMax-01 series, consisting of MiniMax-Text-01 and MiniMax-VL-01
- The core ideas are lightning attention & efficient scaling
- Combined with MoE: 32 experts, 456B total parameters, 45.9B activated parameters
- Claims a context window of up to 1M during training, extrapolating to 4M at inference
- Said to match GPT-4o and Claude-3.5-Sonnet while covering a 20-32x longer context window
📜 [Sakana] Transformer^2: Self-adaptive LLMs
- A self-adaptation framework that helps LLMs adapt to unseen tasks by selectively adjusting, in real time, the singular components of their weight matrices
- two-pass mechanism: (1) dispatch system (2) task-specific expert vectors
- Uses fewer parameters than LoRA while being more efficient
🧑🏻‍💻 [OpenAI] Scheduled tasks in ChatGPT
- Up to 10 active tasks can be scheduled at a time
- Supports one-time reminders or recurring actions
- Tasks are managed through the web interface
- Notifications can be received on desktop, mobile, and web
📜 [Chinese Academy of Sciences] Aligning Instruction Tuning with Pre-training
- Instruction tuning datasets are distributionally mismatched with pre-training data and lack diversity
- AITP (Aligning Instruction Tuning with Pre-training): converts underrepresented pre-training data into high-quality instruction-response pairs
  - Preserves task-specific objectives & increases dataset diversity
  - adaptive data selection, controlled rewriting, balanced integration, etc.
📜 [Together AI, MIT, Princeton] Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping
- Ladder Residual: a simple architectural modification applicable to residual-based models that efficiently hides communication latency
- Presents a way to minimize the communication bottleneck that arises in Tensor Parallelism, where a model is split across multiple GPUs
📜 [Meta] Training Large Language Models to Reason in a Continuous Latent Space
- In language space most tokens serve textual coherence, whereas LLM reasoning calls for representations optimized for reasoning
- CoConuT (Chain of Continuous Thought): interprets the LLM's last hidden state as a representation of the reasoning state and calls it a continuous thought
- official code link (Github) 🔗
📜 [Northeastern Univ.] Foundations of Large Language Models
- A 200-page book on LLMs released on arXiv, drawing wide attention
📜 [Google DeepMind] Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
- Unlike LLMs, diffusion models can adjust inference-time computation through the number of denoising steps (though performance stops improving beyond a few dozen steps)
- Studies inference-time scaling behavior beyond this, focusing on a search problem: finding better noise for the diffusion sampling process
- Reports substantial improvements on class-/text-conditioned image generation benchmarks
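
Searching for better noise can be illustrated with the simplest possible strategy, random search: draw several initial noises, run the sampler on each, and keep the output a verifier scores highest. This is a toy sketch under stated assumptions: `sample_fn` and `verifier` are hypothetical stand-ins for a real diffusion sampler and a real scoring model, and the noise is a single scalar for brevity.

```python
import random

def search_noise(sample_fn, verifier, n_candidates, seed=0):
    """Random search over initial noise: keep the best-scoring sample."""
    rng = random.Random(seed)
    best_score, best_output = float("-inf"), None
    for _ in range(n_candidates):
        noise = rng.gauss(0.0, 1.0)   # one candidate starting noise
        output = sample_fn(noise)     # stands in for the full denoising run
        score = verifier(output)      # stands in for a learned verifier
        if score > best_score:
            best_score, best_output = score, output
    return best_output, best_score

toy_sampler = lambda z: 2.0 * z
toy_verifier = lambda x: -abs(x - 1.0)   # prefers outputs near 1.0

_, score_1 = search_noise(toy_sampler, toy_verifier, n_candidates=1)
_, score_16 = search_noise(toy_sampler, toy_verifier, n_candidates=16)
print(score_1, score_16)
```

With a fixed seed the larger search always includes the smaller one's candidates, so spending more inference compute can only keep or improve the verifier score.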

4th week

📜 [Zhejiang Univ.] OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
- Vanilla-retrieved information lacks depth and utility, or suffers from redundancy
- Proposes OmniThink, a machine writing framework that emulates the human-like process of iterative expansion & reflection
- The core idea is the cognitive behavior of progressively deepening knowledge about a topic
🧑🏻‍💻 [DeepSeek] DeepSeek-R1
- An open-source model on par with OpenAI-o1 on math, reasoning, and code tasks
- Features include self-verification, reflection, and CoT solutions
- Releases DeepSeek-R1, DeepSeek-R1-Zero, and 6 distilled models based on the Llama & Qwen architectures
🧑🏻‍💻 [OpenAI] OpenAI’s function calling guide
- Function calling documentation has been added to the OpenAI Platform
- It includes good examples and looks useful for learning function calling
📜 [Microsoft Research] RedStone: Curating General, Code, Math, and QA Data for Large Language Models
- RedStone: a scalable pipeline for processing Common Crawl data
- Unlike earlier approaches that required domain-specific expertise, tailors data from the many domains contained in Common Crawl
- Link to the released work 🔗
📜 [Korea Univ., Upstage] ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains (ICLR 2025)
- ChroKnowBench: a benchmark dataset for evaluating chronologically accumulated knowledge
  - Three key aspects: multiple domains, time dependency, temporal state
- ChroKnowledge (Chronological Categorization of Knowledge): a sample-based framework for evaluating an LLM's non-parametric chronological knowledge
  - The ability to elicit temporal knowledge varies with the format of the data the model was trained on
  - LLMs appear to recall knowledge only partially, or to be cut off at temporal boundaries
📜 [ChungAng Univ.] Probing-RAG: Self-Probing to Guide Language Models in Selective Document Retrieval (NAACL 2025)
- Probing-RAG: a method that uses hidden state representations from the language model's intermediate layers to adaptively decide whether a given query needs additional retrieval
  - Addresses the problem that, in real-world settings, finding the optimal document usually takes multiple steps
- Uses a pre-trained prober to quickly capture the model's internal cognition
🧑🏻‍💻 Pocket Flow
- A 100-line LLM framework for Agents, Task Decomposition, and RAG
- Uses a Nested Directed Graph to support features such as Node, Action, Flow, Batch & Async
🧑🏻‍💻 [OpenAI] Announcing The Stargate Project
- Announces the Stargate Project, which will invest $500B (about KRW 700 trillion) to build AI infrastructure
- Uses NVIDIA GPUs; Oracle provides high-quality cloud infrastructure, and Microsoft Azure supports distributed model training
- Focuses on high-value fields such as medicine & biotechnology
📜 [ByteDance, Tsinghua] UI-TARS: Pioneering Automated GUI Interaction with Native Agents
- UI-TARS: a native GUI agent model that takes screenshots as input, understands them, and performs human-like interactions
- Unlike earlier frameworks that use commercial models through prompts or workflows, it is an end-to-end model
- Key features include Enhanced Perception, Unified Action Modeling, System-2 Reasoning, and Iterative Training with Reflective Online Traces
📜 [Microsoft] LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts (ACL 2024)
- Presents a framework for automatically evaluating natural language texts
- Combines multiple LLM distributions to predict human judges’ annotations
- Uses a small feed-forward neural network that includes both judge-specific & judge-independent parameters
🧑🏻‍💻 [OpenAI] Introducing Operator
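- Currently available only to Pro users in the US
- An AI agent that automates tasks on the web (filling out forms, booking travel, etc.)
- Uses a new model called Computer-Using Agent (CUA)
  - Combines GPT-4o's vision capabilities with reinforcement learning so that it can interact with GUIs
- Can click, type, and scroll on websites / cannot yet handle complex tasks such as managing calendars or creating slideshows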
🧑🏻‍💻 [Anthropic] Introducing Citations on the Anthropic API
- Can identify the exact sentences in the source documents that Claude used when generating a response
- Available through the Anthropic API & Google Cloud’s Vertex AI
- Use cases include document summarization, complex Q&A, and customer support
🧑🏻‍💻 [HuggingFace] SmolVLM Grows Smaller – Introducing the 250M & 500M Models!
- Adds 256M and 500M models to the SmolVLM family; the 256M model in particular is the smallest Vision Language Model so far
- Releases four checkpoints in total: two base models and two instruction fine-tuned models
📜 [Google Cloud] Chain of Agents: Large Language Models Collaborating on Long-Context Tasks (NeurIPS 2024)
- Previous approaches to long context with LLMs either 1) reduced the input length or 2) extended the context window
- Chain-of-Agents (CoA): a framework that uses multi-agent collaboration to enable information aggregation & context reasoning
- Consists of multiple worker agents that process segmented text sequentially → a manager agent synthesizes their results into a coherent final output
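
The worker/manager flow can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_worker` and `toy_manager` are hypothetical stand-ins for LLM calls, and chunking by word count is an assumption.

```python
from typing import Callable, List

def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into segments of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def chain_of_agents(text: str, query: str,
                    worker: Callable[[str, str, str], str],
                    manager: Callable[[str, str], str],
                    chunk_size: int = 50) -> str:
    """Workers read chunks in order, each passing an updated summary to the next;
    the manager turns the final summary into the answer."""
    summary = ""
    for chunk in chunk_text(text, chunk_size):
        summary = worker(chunk, summary, query)
    return manager(summary, query)

# Hypothetical stand-ins for LLM-backed agents.
def toy_worker(chunk: str, previous_summary: str, query: str) -> str:
    hits = [w for w in chunk.split() if query.lower() in w.lower()]
    return (previous_summary + " " + " ".join(hits)).strip()

def toy_manager(summary: str, query: str) -> str:
    return f"{len(summary.split())} mention(s) of '{query}'"

result = chain_of_agents("alpha beta agent gamma agent delta", "agent",
                         toy_worker, toy_manager, chunk_size=2)
print(result)  # 2 mention(s) of 'agent'
```

Each worker only ever sees one chunk plus the running summary, so no single call needs the full context.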

5th week

📜 [Renmin Univ. of China] Enhancing LLM Reasoning with Reward-guided Tree Search
- Studies how to improve LLM reasoning through a reward-guided tree search algorithm
- A framework that integrates a policy model, a reward model, and a search algorithm
- A tree search algorithm in which the policy model dynamically expands the tree under the guidance of a trained reward model
- The framework is called STILL-1 (Slow Thinking with LLMs)
📜 [Renmin Univ. of China] Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems
- A reproduction report on implementing an o1-like reasoning system
- STILL-2: an imitate, explore, self-improve framework
- Trains the reasoning model on distilled long-form thought data to enable a slow-thinking mode
- The model explores hard problems by generating multiple rollouts → high-quality trajectories lead to correct answers
📜 [Center for AI Safety, Scale AI] Humanity’s Last Exam
- Humanity’s Last Exam (HLE): aims to be the final closed-ended academic benchmark spanning a broad range of subjects (multimodal)
- Consists of multiple-choice and short-answer questions suitable for automated grading
- Answers are unambiguous and clear, yet the questions are hard to answer directly through retrieval
- Public link 🔗
📜 [Truthful AI, Toronto] Tell me about yourself: LLMs are aware of their learned behaviors
- behavioral self-awareness: the ability to describe one's own behaviors without in-context examples
- Uses two datasets that never explicitly mention the associated behavior
  - (a) making high-risk economic decisions (b) outputting insecure code
  - The model nevertheless describes these behaviors explicitly
- That a model acquires behaviors we did not instruct could become an AI Safety issue
🧑🏻‍💻 [DeepSeek] Janus-Pro release
- Releases Janus-Pro, with improved multimodal understanding & visual generation
- Last year (2024) they had already released MLLMs under the names JanusFlow and Janus (downloadable from Hugging Face)
🧑🏻‍💻 [Alibaba] Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens
- Alibaba releases Qwen models that cover up to 1M tokens (Qwen2.5-7B-Instruct-1M & 14B)
- The 14B model in particular outperforms Qwen2.5-Turbo and GPT-4o-mini
- Uses sparse attention and DCA (Dual Chunk Attention) to process long contexts efficiently
📜 [COAI Research] Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models
- Studies the reasoning tokens of the DeepSeek R1 model
- The model exhibits self-preservation tendencies it was never explicitly trained for
- Raises the concern that such models could have physical impact when combined with robotics
📜 [USTC, Microsoft] Optimizing Large Language Model Training Using FP4 Quantization
- Presents an FP4 training framework for LLMs
- Two key factors
  - (1) differentiable quantization estimator for precise weight updates
  - (2) outlier clamping and compensation strategy to prevent activation collapse
- Integrates mixed-precision training and vector-wise quantization for stability
- Confirms that it scales up to a 13B model trained on 100B tokens
🧑🏻‍💻 [Perplexity] Sonar
- Releases a new API served with DeepSeek's reasoning model
- Highlights advanced CoT reasoning, US-based hosting, data privacy, and self-serve API access
- Split into a standard version and a pro version
📜 [UIUC, AI2, IBM, Yale, Washington] ReFIT: Reranker Relevance Feedback during Inference
- Retrieve-and-rerank usually refers to a framework in which a bi-encoder retrieves a large set of candidates and a cross-encoder reranks them
- Provides relevance feedback to the retriever at inference time to improve recall of the initial top-k
- Distills the reranker's predictions into the retriever's query representation using a lightweight update mechanism
  - → Runs a second retrieval step with the updated query vector
  - Applicable to existing retrieve-and-rerank frameworks
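
A minimal sketch of the query-update step, under stated assumptions: the retriever scores documents by dot product, the distillation loss is the KL divergence between the reranker's and the retriever's softmax distributions over the candidates, and the update is plain gradient descent. The names `refine_query`, `lr`, and `steps` are illustrative rather than taken from the paper.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def kl(target: np.ndarray, pred: np.ndarray) -> float:
    """KL(target || pred) for two strictly positive distributions."""
    return float(np.sum(target * (np.log(target) - np.log(pred))))

def refine_query(query: np.ndarray, doc_embs: np.ndarray,
                 reranker_scores: np.ndarray, lr: float = 0.02, steps: int = 200) -> np.ndarray:
    """Nudge the query vector so retriever scores imitate the reranker's distribution."""
    target = softmax(reranker_scores)
    q = query.copy()
    for _ in range(steps):
        pred = softmax(doc_embs @ q)
        # Gradient of KL(target || softmax(doc_embs @ q)) with respect to q.
        q -= lr * doc_embs.T @ (pred - target)
    return q

rng = np.random.default_rng(0)
docs = rng.standard_normal((5, 4))                    # embeddings of 5 retrieved candidates
q0 = rng.standard_normal(4)                           # original query vector
scores = np.array([3.0, 0.5, 0.1, -1.0, -2.0])        # hypothetical cross-encoder scores
q1 = refine_query(q0, docs, scores)

kl_before = kl(softmax(scores), softmax(docs @ q0))
kl_after = kl(softmax(scores), softmax(docs @ q1))
print(kl_before, kl_after)
```

The updated vector `q1` would then be used for a second retrieval pass over the full index.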
📜 [Huawei, McGill] InnerThoughts: Disentangling Representations and Predictions in Large Language Models
- For MCQA with LLMs, it is common to use only the hidden state of the last layer
- Trains a small separate neural network predictor module on the training questions; it takes the hidden states of all layers as input and predicts the answer
- Argued to be a framework that makes full use of the LLM's representational abilities
- Reports that, at low cost, it sometimes achieves gains comparable to finetuning
🧑🏻‍💻 [Alibaba] Qwen2.5-Max: Exploring the Intelligence of Large-scale MoE Model
- A large MoE language model reported to outperform DeepSeek V3
- Trained on more than 20T tokens of data from diverse domains, with SFT + RLHF
- Available through the OpenAI library after registering an Alibaba Cloud account

2024

🎄 December

1st week

📜 [Google Cloud, Google DeepMind] Reverse Thinking Makes LLMs Stronger Reasoners
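- Proposes the RevThink framework, which applies human reverse thinking (problem→solution, solution→problem) to LLMs
- Data augmentation: collects (1) the original question, (2) forward reasoning, (3) a backward question, and (4) backward reasoning from a teacher model
- Trains the student model with three training objectives
  - question → generate forward reasoning
  - question → generate backward question
  - backward question → generate backward reasoning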
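📜 [Chinese Academy of Sciences] Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models
- Previously: iterative retrieval was implemented with few-shot prompting or hand-crafted rules
- Proposes Auto-RAG, which delegates the iterative retrieval process for improving RAG performance to the LLM's autonomous decision-making
  - The LLM plans retrieval and refines queries through multi-turn dialogue with the retriever
  - Repeats automatically until enough information has been gathered
  - Autonomously adjusts the number of iterations according to question difficulty and the usefulness of the retrieved knowledge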
🧑🏻‍💻 [NVIDIA] Multimodal PDF Data Extraction
- Data extraction that can pull insights from text, graphs, charts, and tables regardless of size
- Appears to be a product aimed at enterprise RAG
- Currently at the demo level: it seems you can try RAG-based QA over the 370/501 uploaded files
🧑🏻‍💻 [Kaggle] LLMs - You Can't Please Them All
- Uses LLM-as-a-judge to evaluate essay quality
- The goal is to submit essays that maximize disagreement among the LLM judges
📜 [The University of Sydney, Huawei] Enhancing Large Language Models through Adaptive Tokenizers (NeurIPS 2024)
- Existing tokenizers are static and built from statistics → out of sync with current LLM architectures (?)
- Starts from a large initial vocabulary and refines the tokenizer by monitoring the model's perplexity during training
🧑🏻‍💻 [Amazon] Amazon Nova Foundation Models
- Available through the Bedrock API, from fast text models to full video generation
- Lineup: Micro, Lite, Pro, Premier, Canvas, Reel
🧑🏻‍💻 [Cohere] Introducing Rerank 3.5: Precise AI Search
- Improved reasoning & multilingual capabilities on complex enterprise data
- Compatible with existing search systems
- Described as supporting more than 100 languages
🧑🏻‍💻 [Google DeepMind] Genie 2: A large-scale foundation world model
- Takes a single image as input and returns a playable 3D environment
- Claims emergent capabilities of a foundation world model in going from Genie 1 → 2
📜 [Vanderbilt Univ.] Training Noise Token Pruning
- for vision transformers
- Relaxes the discrete token dropping condition into continuous additive noise, providing smooth optimization during training
📜 [Univ. of California, Berkeley] Predicting Emergent Capabilities by Finetuning (COLM 2024)
- The downstream capabilities of LLMs are harder to predict than pretraining behavior (this is the first study I have seen that approaches emergent abilities from the fine-tuning side)
- Can the accuracy of the next generation of models be predicted from the random few-shot accuracy of current LLMs?
- insight: finetuning LLMs on a given task can shift the point in scaling at which emergence occurs towards less capable models
- Training a language model on a specific task can shift the point at which emergent abilities appear
📜 [Google DeepMind] PaliGemma 2: A Family of Versatile VLMs for Transfer
- SigLIP-So400m vision encoder + Gemma 2 (224px, 448px, 896px)
- Covers OCR-related tasks as well as tasks such as long fine-grained captioning
  - Appears to show experimentally that it transfers across a fairly broad range
🧑🏻‍💻 [OpenAI] o1 and ChatGPT Pro
- Day 1: releases the o1 model and the ChatGPT Pro plan at $200/month
- Features include improved accuracy, multimodal support, and faster, more concise responses
- Pro users get unlimited use of o1, GPT-4o, o1-mini, etc.
📜 [Microsoft, MIT] Does Prompt Formatting Have Any Impact on LLM Performance? (NAACL 2025)
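- Studies the effect of prompt templates on model performance
- Converts the same content into plain text, Markdown, JSON, YAML, etc., and tests GPT-3.5-turbo and GPT-4
- Finds that stronger models maintain performance regardless of template, while weaker ones are heavily affected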
🧑🏻‍💻 [Google DeepMind] GenCast predicts weather and the risks of extreme conditions with state-of-the-art accuracy (Nature)
- Develops a weather forecasting model that can predict very accurately up to 15 days ahead
- Introduced as a new high resolution AI ensemble model (diffusion-based)
- 📜 Link to the Nature paper
📜 [Yunnan Univ.] Learning to Reason via Self-Iterative Process Feedback for Small Language Models (COLING 2025)
- Incorporates odds ratio preference optimization (ORPO) so that the SLM can generate and use positive & negative signals on its own
- Introduces process supervision that uses sampling-based inference simulation & process reward models
📜 [Peking, Baichuan] SysBench: Can Large Language Models Follow System Messages?
- Three limitations of current LLMs: constraint violation, instruction misjudgement, multi-turn instability
- Introduces SysBench, a benchmark for evaluating and analyzing these abilities
- Builds the dataset directly from 6 commonly used constraint types, 500 tailor-designed system messages, and multi-turn conversations
- GitHub link 🔗

2nd week

📜 [Tsinghua] Densing Law of LLMs
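- Proposes the concept of capability density: the ratio of an LLM's effective parameter size to its actual parameter size
  - Effective parameter size is the minimum parameter size needed to match the performance of the given model M
- → Used to evaluate the training quality of LLMs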
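
As a small worked example of the ratio described above (the parameter counts are hypothetical and chosen only for illustration):

```python
def capability_density(effective_params: float, actual_params: float) -> float:
    """Capability density = effective parameter size / actual parameter size."""
    if actual_params <= 0:
        raise ValueError("actual_params must be positive")
    return effective_params / actual_params

# Hypothetical case: a 3B model performs as well as a 6B reference model would,
# so its effective size is 6B and its density is 2.0.
density = capability_density(effective_params=6e9, actual_params=3e9)
print(density)  # 2.0
```

A density above 1 means the model gets more capability per parameter than the reference family.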
📜 [CMU, KAIST, Washington] Evaluating Language Models as Synthetic Data Generators
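- AgoraBench: a benchmark for evaluating language models' data generation abilities
- Synthesizes 1.26M training instances with 6 language models and trains 99 student models
- Explains that data generation ability does not correlate directly with problem-solving ability
- GitHub link 🔗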
🧑🏻‍💻 [LG AI Research] EXAONE-3.5 release
- EXAONE 3.5 language model series including instruction-tuned models of 2.4B, 7.8B, and 32B
🧑🏻‍💻 [Google] Meet Willow, our state-of-the-art quantum chip
- Errors could be reduced exponentially as more qubits are used
- Willow completed a benchmark computation in just 5 minutes that would take today's fastest supercomputer 10 septillion (10^25) years
📜 [Chinese Academy of Sciences] Towards Adaptive Mechanism Activation in Language Agent (COLING 2025)
- ALAMA: Adaptive Language Agent Mechanism Activation Learning with Self-Exploration
- Focuses on optimizing mechanism activation adaptability without relying on expert models
- Builds a harmonized agent framework (UniAct) and optimizes with methods suited to task characteristics
📜 [OpenAI] OpenAI o1 System Card
- Publishes a paper reporting the characteristics and performance of the recently released models, from o1-preview to o1
- As with the GPT-4 release, it mostly covers predictable ground
🧑🏻‍💻 [OpenAI] Day 3. Sora
- Can generate videos up to 20 seconds long in three formats: widescreen, vertical, square
- Supports remix, blend, and create via prompts
- The Turbo model is clearly faster at generation than its predecessor
🧑🏻‍💻 [OpenAI] Day 4. Canvas
- Expanded Access (web and windows), Integrated with GPT-4o, Data visualization, Split-screen workspace
- Direct python execution
📜 [Microsoft] Phi-4 Technical Report
- A 14B-parameter language model trained with a focus on data quality
- Unlike existing models pretrained mainly on organic data such as web content and code, it applies a training recipe that deliberately mixes in synthetic data
- phi-4 surpasses its teacher model on STEM-focused QA
📜 [Univ. of California, Santa Barbara] RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
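- A benchmark for evaluating whether LLMs can follow complex, real-world rules during reasoning
- Covers three practical domains: airline baggage fees, NBA transactions, tax regulations
- Three main limitations of current LLMs: (1) they fail to distinguish similar but distinct rules, (2) even when they understand the rules correctly, they are inconsistent on the math, (3) scores on this benchmark are low across the board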
📜 [Univ. of Potsdam] I Don't Know: Explicit Modeling of Uncertainty with an [IDK] Token (NeurIPS 2024)
- Presents a novel calibration method for tackling hallucination
- Adds a special [IDK] token to the vocabulary and introduces an objective function that shifts probability mass to [IDK] for incorrect predictions → the model expresses uncertainty explicitly
- Reports that models trained this way express uncertainty much better on content they previously got wrong
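
One way to picture the objective is as a modified training target. The sketch below is an assumption-laden simplification, not the paper's exact loss: when the model's top prediction is wrong, a fixed fraction `shift` (a hypothetical hyperparameter) of the target mass moves from the gold token to [IDK].

```python
import numpy as np

def idk_target(probs: np.ndarray, gold: int, idk_id: int, shift: float = 0.5) -> np.ndarray:
    """Build a target distribution: one-hot on the gold token when the model is right,
    otherwise split between the gold token and the [IDK] token."""
    target = np.zeros_like(probs)
    if int(np.argmax(probs)) == gold:
        target[gold] = 1.0
    else:
        target[gold] = 1.0 - shift
        target[idk_id] = shift
    return target

def cross_entropy(target: np.ndarray, probs: np.ndarray) -> float:
    """Cross-entropy between the (soft) target and the model's distribution."""
    return float(-np.sum(target * np.log(probs + 1e-12)))

# Toy vocabulary of 4 tokens; the last index plays the role of [IDK].
probs = np.array([0.6, 0.3, 0.05, 0.05])
target_wrong = idk_target(probs, gold=1, idk_id=3)   # model's top guess (0) is wrong
target_right = idk_target(probs, gold=0, idk_id=3)   # model's top guess is right
print(target_wrong, cross_entropy(target_wrong, probs))
print(target_right, cross_entropy(target_right, probs))
```

Minimizing cross-entropy against `target_wrong` pushes probability toward [IDK] exactly in the cases where the model would otherwise answer incorrectly.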
📜 [OpenAI] Measuring short-form factuality in large language models
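- A benchmark for evaluating models on short, fact-seeking questions
- A challenging benchmark collected adversarially against GPT-4 responses
- Questions are built so that only a single answer is correct (responses are graded as correct, incorrect, or not attempted)
- A benchmark for evaluating whether models “know what they know”
- GitHub link 🔗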
📜 [Saudi Data & Artificial Intelligence Authority] SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs
- Releases SmolTulu-1.7b-Instruct, trained from SmolLM2-1.7B using the Tulu3 post-training pipeline released by AI2
- Using a 135M model, confirms that the relationship between learning rate and batch size strongly affects model performance
- Tasks like ARC and GSM8K favor a higher lr, while pattern recognition in HellaSwag and IFEval favor a lower lr

3rd week

📜 [Independent] Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture
- Combines sequence transformation and state transformation to improve foundation model performance
- Confirms the availability of rotary position embedding in the state space duality algorithm
- Applies dynamic mask attention, improving computational efficiency while maintaining performance
- Designs a cross domain mixture of experts (1024 experts)
📜 [Beijing Univ.] Smaller Language Models Are Better Instruction Evolvers
- Shows experimentally that SLMs are better than LLMs at synthesizing effective instructions
- Argues that SLMs have a broader output space during instruction evolving
- Proposes Instruction Complex Aware IFD (IC-IFD): a metric that improves on IFD for evaluating instruction data
📜 [Google, Peking] TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
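- One of the biggest problems with the current Transformer architecture is that linear projections rely on a fixed number of parameters → this makes scale-up difficult
- Treats model parameters as tokens and replaces every linear projection in the Transformer with a token-parameter attention layer
- GitHub link 🔗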
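
A minimal sketch of a token-parameter attention layer standing in for a linear projection. It assumes standard scaled-softmax attention for simplicity, which may differ from the normalization used in the paper; `key_params` and `value_params` are illustrative names for the learnable parameter tokens.

```python
import numpy as np

def token_parameter_attention(x: np.ndarray, key_params: np.ndarray,
                              value_params: np.ndarray) -> np.ndarray:
    """Input tokens attend over learnable parameter tokens instead of being
    multiplied by a fixed weight matrix.
    x: (n_tokens, d_in), key_params: (n_params, d_in), value_params: (n_params, d_out)."""
    scores = x @ key_params.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ value_params

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))                          # 3 input tokens, d_in = 8
k = rng.standard_normal((16, 8))                         # 16 parameter tokens
v = rng.standard_normal((16, 4))                         # d_out = 4
out = token_parameter_attention(x, k, v)

# Scaling up means appending parameter tokens; input and output shapes stay the same.
k_big = np.vstack([k, rng.standard_normal((16, 8))])
v_big = np.vstack([v, rng.standard_normal((16, 4))])
out_big = token_parameter_attention(x, k_big, v_big)
print(out.shape, out_big.shape)
```

Because the parameter count lives in the number of rows of `key_params` and `value_params`, the layer can grow without changing `d_in` or `d_out`.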
📜 [Meta] Byte Latent Transformer: Patches Scale Better Than Tokens
- The first byte-level LLM architecture to match tokenization-based LLMs, with improved inference efficiency and robustness
- Encodes bytes into dynamically sized patches → no fixed vocabulary
- Trains an 8B model on 4T training bytes
🧑🏻‍💻 [Google DeepMind] Veo 2
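- A SoTA-class model that can generate highly realistic video at resolutions up to 4K
- Can also generate video with lens type and camera effects specified through instructions
- Google's SynthID watermark makes it easy to tell whether content is AI-generated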
📜 [Shanghai AI Lab] Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
- Evaluating visual generative models currently requires a complex process of sampling hundreds or thousands of images or videos
- → Evaluation Agent framework: dynamic, multi-round evaluation that uses only a few samples per round
- A fully open-source framework whose key features are 1) efficiency 2) promptable evaluation 3) explainability 4) scalability
- GitHub link 🔗
🧑🏻‍💻 Claude Engineer v3
- A self-improving AI assistant built on Claude 3.5 models
- Supports both CLI & web interfaces
- Already at 10k stars ⭐
📜 [AIRI] BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack (NeurIPS 2024)
- Releases BABILong, a benchmark that evaluates LLM reasoning over facts scattered across extremely long documents
- Includes about 20 reasoning tasks such as fact chaining, simple induction, deduction, and counting
- According to the evaluation, even popular LLMs effectively use only 10-20% of the context, and performance drops sharply as reasoning complexity increases
📜 [CMU, Duke] TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
- A benchmark for evaluating how well AI agents interact the way digital workers do: browsing the Web, writing code, running programs, etc.
- Builds a self-contained environment that includes internal web sites and data
- Reports that the best-performing model completed about 24% of all tasks
- GitHub link 🔗
🧑🏻‍💻 [Google DeepMind] FACTS Grounding: A new benchmark for evaluating the factuality of large language models
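- Paper link 🔗 Kaggle leaderboard link 🔗
- A benchmark for checking whether LLM responses are factually accurate and sufficiently detailed
- Gemini models take all the top spots, which looks rather questionable
- Consists of 860 public and 859 private held-out examples; the former are released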
🧑🏻‍💻 [VS Code] Announcing a free GitHub Copilot for VS Code
- 2000 code completions/month, 50 chat requests/month, access to GPT-4o & Claude 3.5 Sonnet
- Interest in code assistants is running hot, and this looks like an effort to keep up with Cursor and Windsurf
- Still, the features look too weak or ordinary compared with other coding tools
🧑🏻‍💻 [OpenAI] o3 preview & call for safety researchers
- 📜 Deliberative alignment: reasoning enables safer language models
  - A new alignment strategy applied to the o-series models
- Safety testing is under way, and it appears some researchers will be given early access for that purpose
🗞️ [Perplexity] Perplexity has reportedly closed a $500M funding round
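- Perplexity, a leading AI-powered search engine, has reportedly raised $500M (about KRW 600 billion), at a valuation of about $9B
- Seeing how OpenAI took the chat-model market and Perplexity took the search market, it seems that whoever establishes a position quickly ends up with overwhelming recognition and user base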
📜 [Meta, Washington, CMU] Explore Theory-of-Mind: Program-Guided Adversarial Data Generation for Theory of Mind Reasoning
- ExploreToM: the first framework for generating challenging theory of mind data for robust training & evaluation
- Applies A* search over a custom domain-specific language to produce complex story structures
- Even models such as Llama-3.1-70B and GPT-4o show accuracies as low as 0% and 9%, respectively
- GitHub link 🔗

4th week

📜 [Washington, AI2] Self-Instruct: Aligning Language Models with Self-Generated Instructions (ACL 2023)
- A two-year-old paper, recorded here because the method is still widely used
- Even when a language model has strong zero-shot performance, human-written instruction data itself is hard to obtain
- → Self-Instruct: a framework that improves a pretrained model's instruction-following ability by bootstrapping off its own generations
- Generates instructions, inputs, and outputs → filters out invalid or similar data
📜 [Oxford] Confidence in the Reasoning of Large Language Models
- Studies the correlation between confidence and accuracy in LLM answers
- (1) Qualitatively measures persistence when prompted to reconsider
- (2) Quantitatively measures the self-reported confidence score
- Confidence and accuracy are generally positively correlated, but the second answer is likely to be worse than the first
- Confidence is only partially explained by token-level probability
📜 [Peking, Microsoft Research] Outcome-Refining Process Supervision for Code Generation
- In code generation, using a learned reward model performs well but is costly to train and not very reliable as an evaluator
- Outcome-Refining Process Supervision: proposes a paradigm that treats outcome refinement itself as the process to be supervised
- Uses tree-structured exploration to maintain multiple solution trajectories
📜 [HKUST, Tencent] B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
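- Two things are evaluated
  - (1) whether the model can generate sufficiently diverse responses
  - (2) how effective the external reward is at separating high-quality from low-quality data
- Tracks exploration & exploitation on reasoning tasks for quantitative analysis
- Presents B-STaR, a Self-Taught Reasoning framework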
📜 [Tsinghua] Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization
- Identifies problems in the generalization of RoPE-based attention by analyzing each component of language models in detail
- Using Discrete Signal Processing theory, shows that RoPE enables periodic attention by implicitly achieving a Non-Uniform Discrete Fourier Transform
- Fourier Position Embedding (FoPE): enhances attention's frequency-domain properties to improve periodic extension and length generalization
- GitHub link 🔗
🧑🏻‍💻 MIS (Make It So)
- CLI Assistant
- Supports various AI providers such as OpenAI, Mistral, X.ai, and Ollama
- Runs commands from natural language, with a confirmation step before execution to minimize the chance of problems
- GitHub link 🔗
📜 [KAIST, Microsoft Research] Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning
- Presents Language model Ensemble with Monte Carlo Tree Search (LE-MCTS)
- Formulates step-by-step reasoning with an ensemble of language models as a Markov decision process
- States represent intermediate reasoning paths, and actions consist of generating the next reasoning step
📜 [Nanjing Univ.] Token-Budget-Aware LLM Reasoning
- A framework that dynamically estimates token budgets for different problems
- Claims to effectively reduce the number of tokens and the cost of CoT reasoning
- GitHub link 🔗
📜 [KAIST, Google DeepMind] Revisiting In-Context Learning with Long Context Language Models
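- With the recent arrival of Long Context Language Models (LCLMs), many examples can be provided as input, renewing interest in ICL
- Finds that sophisticated example selection does not yield large gains over random selection
- Argues that the harder and more important problem is now collecting enough examples to fill the context window rather than finding good ones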
📜 [Tsinghua, Peking] How to Synthesize Text Data without Model Collapse?
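- What effect does synthetic data have on model training, and how can data be synthesized without model collapse?
- Training with varying proportions of synthetic data shows a negative correlation between the proportion of synthetic data and model performance
- Proposes obtaining semi-synthetic data by applying token editing to human-produced data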
📜 [Anthropic, NYU, Mila] Alignment faking in large language models
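- alignment faking: selectively complying with the training objective to prevent the model's behavior from being modified after training
- The system prompt states that the model is talking with a free user, and the model is told it is being trained to answer all queries, including harmful ones → comparing free and paid users at inference, it answers harmful queries from free users 14% of the time versus 0% for paid users
- Also presents results in a more realistic setting using synthetic data that mimics pretraining data
- When the model is actually trained with reinforcement learning, the rate of alignment-faking reasoning rises to 78%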
📜 [Pennsylvania, Salesforce] GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers
- Automating prompt engineering has so far depended on costly LLMs
- GReaTer: a technique that uses task loss gradients to perform self-optimization of prompts with open-source, lightweight LMs
- GitHub link 🔗
📜 [Google Research, Google DeepMind] A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
- Proposes using SLMs to improve the efficiency and quality of LLM pre-training
- (1) Providing soft labels as additional training supervision
- (2) Selecting a small subset of valuable training examples
- Presents results from training a 2.8B model with a 1.5B model as the soft labeler
- Appears to show that low-quality supervision can still help and that it needs to be applied adaptively. In the long run this could mean using better models to build even stronger ones at the pretraining stage (given the resources)
📜 [DeepSeek] DeepSeek-V3 Technical Report
- An MoE LM with 671B total and 37B activated parameters / pretrained on 14.8T tokens, followed by SFT and RL / 2.788M H800 GPU hours
- Adopts Multi-head Latent Attention (MLA) & DeepSeekMoE architectures for efficient training and inference
- An auxiliary-loss-free strategy for load balancing and a multi-token prediction training objective
- GitHub link 🔗
📜 [Meta] Large Concept Models: Language Modeling in a Sentence Representation Space
- concept: an explicit higher-level semantic representation (aims to follow how people actually perceive language, instead of tokens)
- Uses SONAR, an existing sentence embedding space
- Explores MSE regression, variants of diffusion-based generation, and more
- Trains a 1.6B model on 1.3T tokens & a 7B model on 2.7T tokens
- GitHub link 🔗
🧑🏻‍💻 [Ollama & HuggingFace] Use Ollama with any GGUF Model on Hugging Face Hub
- Enable ollama in the Local Apps settings on Hugging Face
- Select ollama under Use this model on the model page
- ollama run hf.co/{username}/{repository}
🧑🏻‍💻 [Qwen] QVQ: To See the World with Wisdom
- An open-weight multimodal model from Qwen
- Outperforms GPT-4o & Claude3.5 Sonnet on benchmarks that demand strong mathematical reasoning, such as MMMU, MathVista, MathVision, and OlympiadBench
- Known issues include unexpected Language Mixing & Code-Switching and Recursive Reasoning
📜 [Tencent] A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression
- Points out the limitations of gist-based context compression for handling long context
  - Weak on tasks such as synthetic recall
- Three key failure patterns
  - (1) lost by the boundary (2) lost if surprise (3) lost along the way
- Proposes two strategies
  - (1) fine-grained autoencoding: strengthens reconstruction of the original token information
  - (2) segment-wise token importance estimation: adjusts optimization based on token dependencies
📜 [Gaoling School] YuLan-Mini: An Open Data-efficient Language Model
- Releases a 2.42B LLM that is the strongest among models of similar size (trained on 1.08T tokens)
- A pretraining approach with three features
  - (1) an elaborate data pipeline
  - (2) a robust optimization method that mitigates training instability
  - (3) targeted data selection & long context training
- GitHub link 🔗
📜 [Chalmers University] The Impact of Prompt Programming on Function-Level Code Generation
- CodePromptEval: a dataset of 7072 prompts for evaluating five prompt techniques (few-shot, persona, chain-of-thought, function signature, list of packages)
- Evaluates the quality of completion functions generated by three LLMs (GPT-4o, Llama3, Mistral)
- Individual techniques help code generation, but combining them does not necessarily help
- Observes a trade-off between correctness & quality (it is unclear what quality means here)
📜 [Meta] Improving Factuality with Explicit Working Memory
- Explicit Working Memory (Ewe): integrates a working memory that receives real-time feedback during long-form text generation
- The memory is refreshed based on online fact-checking and retrieval feedback
  - → Resolves the dependency issue caused by content generated incorrectly earlier
- The factors that affect performance most are the memory update rules, the configuration of memory units, and the quality of the retrieval datastore

🍁 November

1st ~ 2nd week

📜 [Boston] Linguistics Theory Meets LLM: Code-Switched Text Generation via Equivalence Constrained Large Language Models
- Alternating between two or more languages within a single conversation is a hard problem in NLP
- EZSwitch: a framework that combines Equivalence Constraint Theory (ECT) with LLMs to produce linguistically valid and fluent code-switched text
- CSPerf: human preference dataset
📜 [Yale, NYU] Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? (NAACL 2024 Short)
- Struc-Bench: a benchmark for evaluating how well LLMs handle text tables, HTML, LaTeX, etc.
- Proposes Prompting Score (P-Score) & Heuristical Score (H-Score)
- Devises structure fine-tuning and applies it to Llama, reporting notable performance gains
- GitHub link 🔗
📜 [Apple] Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
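- HyperCloning: a method that expands a pretrained model's parameters to fit the larger hidden dimension of a bigger model
- Helps the larger model retain the functionality of the smaller model
- Because the larger model already carries the smaller model's capabilities before training starts, this is argued to be far more efficient than training randomly initialized parameters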
🧑🏻‍💻 [OpenAI] Introducing ChatGPT search
- Provides a hybrid system that adds web data access to GPT-4o's language capabilities
- Uses a GPT-4o fine-tuned on synthetic data
- For weather, stocks, sports, etc., real-time data is said to be provided through partnerships with data providers
📜 [Ghent University] Large Language Models Reflect the Ideology of their Creators
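- Investigates the diversity of ideological stances across LLMs and languages
- Prompts LLMs to describe prominent, controversial figures from recent world history (English & Chinese)
- Finds that even the same LLM shows normative disagreement depending on whether it is used in English or Chinese
- Also argues that Western models reflect political leanings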
📜 [Ohio, Washington, AI2] ComPO: Community Preferences for Language Model Personalization
- The human feedback used in language model training assumes the preferences of an “average” user, ignoring diverse subjective & finer-grained characteristics
- ComPO: personalizes preference optimization by contextualizing the probability distribution of model outputs with the preference provider
- Collects group-level rather than individual-level preference data, community-level preferences from Reddit → releases ComPRed
📜 [NYU, AI2, NVIDIA, Washington] Diverging Preferences: When do Annotators Disagree and do Models Know?
- Studies diverging preferences in human-labeled preference datasets
- Builds a disagreement taxonomy of 10 categories grouped into 4 high-level classes
  - task underspecification, response style, refusals, annotation errors
- Examines how these affect reward modeling & evaluation
📜 [VNU Univ.] MoD: A Distribution-Based Approach for Merging Large Language Models
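- Mixture of Distribution (MoD): operates on output probability distributions instead of model weights
- Preserves each model's specialized abilities while enabling efficient knowledge sharing across tasks
- At a quick glance, it is not clear how much it differs from other merging methods
- GitHub link 🔗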
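
The difference from weight-space merging can be seen in a small sketch: the models stay separate, and only their next-token distributions are combined. The fixed mixing weights below are an assumption for illustration; how the actual method chooses them is not covered here.

```python
import numpy as np

def mix_distributions(dists, weights) -> np.ndarray:
    """Combine per-model next-token distributions as a weighted mixture.
    dists: (n_models, vocab_size), weights: (n_models,)."""
    dists = np.asarray(dists, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # normalize so the mixture sums to 1
    return weights @ dists

# Two hypothetical models over a 3-token vocabulary.
p_math = [0.7, 0.2, 0.1]
p_chat = [0.1, 0.3, 0.6]
mixed = mix_distributions([p_math, p_chat], [0.5, 0.5])
print(mixed)  # [0.4  0.25 0.35]
```

Since nothing is averaged in parameter space, the models do not need to share an architecture, only a vocabulary.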
🧑🏻‍💻 [Google] Gemini API and Google AI Studio now offer Grounding with Google Search
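- Introduces Grounding with Google Search in Google AI Studio and the Gemini API
- It generates answers grounded in search results, at a time when interest in generative search engines is running high
- However, given that Google Search results have been unsatisfying lately, it is unclear how good it will be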
🧑🏻‍💻 [HuggingFace] SmolLM2-1.7B-Instruct
- Releases version 2 of the sLLM family in 135M, 360M, and 1.7B sizes
- Trained with SFT & DPO on well-curated datasets, with very strong results for its size
- Already supported in ollama 🔗
🧑🏻‍💻 [Anthropic] PDF support (beta)
- Provides an API feature for analyzing text, visuals, images, charts, etc. inside PDF files
- Handles up to 32MB and 100 pages, using 1,500 ~ 3,000 tokens per page
🧑🏻‍💻 [xAI] API Public Beta
- Releases the Grok model, in the final stage of development, as a public beta
- Supports a 128K-token context, function calling, and system prompts
- Provides $25 of API credits every month during the beta
🧑🏻‍💻 [Anthropic] Claude 3.5 Haiku
- optimized for rapid, accurate code completions
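- Seems to perform especially well on code generation compared with other tasks
- However, the price increase appears to be stirring controversy
- The performance of Sonnet 3.5 (new) is also drawing attention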
📜 [MIT, Cambridge] The Geometry of Concepts: Sparse Autoencoder Feature Structure
- Sparse autoencoders have recently produced dictionaries of high dimensional vectors corresponding to the concepts of the world represented by LLMs
1. The “atomic” small-scale structure contains “crystals” whose faces are parallelograms or trapezoids.
2. The “brain” intermediate-scale structure has significant spatial modularity.
3. The “galaxy” scale structure is not isotropic; instead it has a power law of eigenvalues with the steepest slope in the middle layers.
📜 [Google Research] Distinguishing Ignorance from Error in LLM Hallucinations
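- Studies hallucination in the closed-book Question Answering (CBQA) setting: does the model actually lack the correct knowledge in its parameters, or does it know the answer but respond incorrectly?
- The latter can be addressed by intervening in the model's intermediate computation, whereas the former requires an external knowledge source
- To distinguish the two cases, proposes Wrong Answer despite having Correct Knowledge (WACK), an approach for building model-specific datasets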
📜 [Duke, Google Research] SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Models
- A novel decoding framework that improves LLM truthfulness without relying on an external knowledge base or additional fine-tuning
- Contrasts the output logits of the final layer with those of early layers to exploit the latent knowledge embedded inside the LLM
- Uses an approximate gradient approach so that the latent knowledge can guide self-refinement of the output
🧑🏻‍💻 [HuggingFace] Smol Tools
- A collection of lightweight AI-powered tools built with LLaMA.cpp and small language models
- SmolSummarizer, SmolRewriter, SmolAgent
- None is remarkable on its own, but the point seems to be combining small models specialized for their own tasks
📜 [IBM] Granite 3.0 Language Models
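- Releases a family of lightweight SoTA models: 2B & 8B models trained on 12T tokens in total
- Sparse 1B & 3B MoE models with 400M & 800M activated parameters, trained on 10T tokens in total
- Compared against models such as Llama3.1 8B, Mistral 7B / SmolLM-1.7B
- Released under the Apache 2.0 license, so commercial use is allowed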
📜 HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
- In RAG, converting retrieved HTML to plain text loses much structural or semantic information such as headings and table structure
- Therefore proposes HtmlRAG, which uses HTML instead of plain text
- Since raw HTML is hard to use directly, introduces HTML cleaning, compression, and pruning strategies to shorten the HTML while minimizing information loss
📜 [Dartmouth, Adobe, Stanford, …] Personalization of Large Language Models: A Survey
- A survey that organizes a taxonomy of personalized LLM usage and summarizes key differences and challenges
- Categorized by personalization techniques, datasets, evaluation methods, applications, etc.
📜 [Huawei] Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level
- Releases Agent K v1.0, an end-to-end agent that can autonomously carry out a variety of data science tasks
- States that it uses a highly flexible structured reasoning framework instead of the rigid & limited CoT & reflection of prior work
- Updates long- & short-term memory by exploring and storing key information at each iteration, which improves performance without fine-tuning or backpropagation
📜 [Tencent] Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
- Releases an MoE LLM with 389B total parameters and 52B activated parameters
- Supports a window size of 256K tokens
- Outperforms Llama3.1-70B on a variety of tasks and is comparable to the 405B model
- Key features include large-scale synthetic data, mixed expert routing, key-value cache compression, and expert-specific learning rates
- Also studies scaling laws and learning rate schedules for MoE models
- GitHub link 🔗 Hugging Face link 🔗
🧑🏻‍💻 [Ollama] Ollama 0.4 Integrates Meta's Llama 3.2 Vision Models (11B and 90B)
- Llama 3.2 Vision: OCR, handwriting → machine-readable text, chart and table understanding
- Usable from the terminal
📜 [NVIDIA] MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs
- Uses MLLMs to support universal multimodal retrieval across diverse modalities and retrieval tasks
- Trains an MLLM on 16 tasks from 10 datasets and uses it as a bi-encoder retriever
- Proposes modality-aware hard negative mining to mitigate the modality bias present in MLLMs
- Proposes continual fine-tuning to improve text retrieval in particular among the modalities
- Hugging Face link 🔗
📜 [Zhejiang] Fine-Grained Guidance for Retrievers: Leveraging LLMs' Feedback in Retrieval-Augmented Generation
- Proposes FiGRet (Fine-grained Guidance for Retrievers), grounded in the pedagogical theory of Guided Discovery Learning
- Uses an LLM to generate easy-to-understand samples from the samples the retriever handles poorly
- Considers three learning objectives: relevance, comprehensiveness, purity
- Dual curriculum learning & reciprocal feedback between the LLM and the retriever
🗞️ [XPENG] XPENG Unveils Iron Humanoid Robot, Already Operational in EV Factory
- Chinese EV maker XPENG unveils a human-sized humanoid (5'8", 154 lb)
- Integrates the Eagle Vision system with an end-to-end large AI model
- Beyond the PoC stage; usable in actual production processes
🧑🏻‍💻 [ByteDance, Tsinghua] X-Portrait 2: Highly Expressive Portrait Animation
- A model that turns a static portrait image into a dynamic, expressive animation by following a reference video
- Style transfer also works between realistic images and cartoon-style drawings
📜 [Edinburgh] Mixtures of In-Context Learners
- Mixtures of In-Context Learners (MoICL): treats each subset of demonstrations as an expert and learns from training data how to merge their output distributions → fewer unnecessary tokens in the input, so better memory and inference-speed efficiency
- Strong on classification tasks; matches prior performance with fewer demonstrations and pushes the Pareto frontier
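The merging step can be illustrated with a minimal sketch. It assumes each expert (a demonstration subset) has already produced a label distribution and that the mixing weights are given; in the paper the weights are learned, and the names `mix_experts`, `expert_probs`, and `weights` are hypothetical.

```python
def mix_experts(expert_probs, weights):
    """Weighted mixture of per-expert label distributions.

    expert_probs: list of dicts mapping label -> probability (one dict per expert)
    weights: non-negative mixing weight per expert (assumed given here, not learned)
    """
    total = sum(weights)
    labels = expert_probs[0].keys()
    return {
        label: sum(w * probs[label] for w, probs in zip(weights, expert_probs)) / total
        for label in labels
    }


experts = [{"pos": 0.9, "neg": 0.1}, {"pos": 0.2, "neg": 0.8}]
mixed = mix_experts(experts, weights=[3.0, 1.0])
```

Because each expert only sees its own subset, no single prompt has to contain every demonstration.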
📜 [Google, Peking] TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
- One reason transformers are hard to scale up is that the number of parameters in the linear projections is fixed
- Tokenformer: uses attention not only for computation among input tokens but also for the interaction between tokens and model parameters
- Replaces every linear layer with a token-parameter attention layer!
- GitHub link 🔗
📜 [Hong Kong, Tsinghua, Peking, Tencent] Large Language Models Can Self-Improve in Long-context Reasoning
- Current LLMs are weak at long-context reasoning, and the usual remedy is training on synthetic data built from human annotation → hard to improve further
- Proposes SeaLong to address this: sample multiple outputs per question, score them with Minimum Bayes Risk, then apply SFT or preference optimization
- Methods like this tend to run into cost problems eventually..
🧑🏻‍💻 [INF, M-A-P] OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
- Releases open-source code models (1.5B & 8B) that reach the performance of top-tier code LLMs
- Reproducible 960B-token dataset, 4.5M SFT samples, intermediate checkpoints
- Two-Stage Instruction Fine-Tuning for Theory and Practice
- Runs on Ollama. There seems to be considerable demand for running code models locally
🧑🏻‍💻 [NVIDIA] Cosmos Tokenizer: A suite of image and video neural tokenizers
- Releases image & video tokenizers with 8x the compression rate of SOTA models
- Tokenizers directly affect the performance of generative models; TokenBench exists to evaluate this
📜 [Wuhan Univ.] Adaption-of-Thought: Learning Question Difficulty Improves Large Language Models for Reasoning (EMNLP 2024 Main)
- With simple methods, LLMs cannot answer difficult questions adequately
- Adaption-of-Thought (AdoT): first estimates the difficulty of the question, then adjusts the demonstration set using a difficulty-adapted retrieval strategy
🧑🏻‍💻 [Alibaba] Qwen2.5-Coder Series: Powerful, Diverse, Practical.
- Qwen2.5-Coder-32B-Instruct performs at or above GPT-4o level on coding
- Models released in six sizes
  - 0.5B / 1.5B / 7B / 14B / 32B models are Apache 2.0; the 3B model follows the Qwen-Research license
- Trained for use in two scenarios: coding assistant & Artifact
🧑🏻‍💻 [Nous Research] Introducing the Forge Reasoning API Beta and Nous Chat: An Evolution in LLM Inference
- Uses the open-source Hermes 70B model to enable higher expression, long-form thinking, and individual alignment
- 📜 Model technical report 🔗
- Combines methods such as MCTS, CoC, and MoA to improve performance without increasing model size
📜 [Israel Institute of Technology] Backward Lens: Projecting Language Model Gradients into the Vocabulary Space (EMNLP 2024 Best paper)
- Many recent attempts improve interpretability by projecting the weights and hidden states of the forward pass of Transformer LMs onto the model's vocab
- Proves that a gradient matrix can be cast as a low-rank linear combination of the inputs to its forward & backward passes
- Develops a method that projects these gradients onto vocab items and uses them to store new information in the LM's neurons
- GitHub link 🔗
📜 [Univ. of Tehran] CoCoP: Enhancing Text Classification with LLM through Code Completion Prompt
- LLM performance depends heavily on the quality of the input prompt
- Proposes Code Completion Prompt (CoCoP), which leverages the code ability of LLMs for text classification: text classification → code completion
- With code-specialized models such as CodeLLaMA, few-shot-learning-level performance is achievable
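A rough sketch of recasting classification as code completion follows. The prompt layout and the name `build_cocop_prompt` are assumptions for illustration only; the paper's actual prompt format is not reproduced here.

```python
def build_cocop_prompt(examples, text):
    """Render labelled examples as variable assignments and leave the last label open.

    examples: list of (text, label) pairs used as in-context demonstrations
    text: the input to classify; the model is expected to complete its label
    """
    lines = ["# Text classification written as a code-completion task"]
    for i, (example_text, label) in enumerate(examples):
        lines.append(f"text_{i} = {example_text!r}")
        lines.append(f"label_{i} = {label!r}")
    n = len(examples)
    lines.append(f"text_{n} = {text!r}")
    lines.append(f"label_{n} = ")  # the code model fills in this value
    return "\n".join(lines)


prompt = build_cocop_prompt(
    [("great movie", "positive"), ("boring plot", "negative")],
    "loved every minute",
)
```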
🧑🏻‍💻 [Together AI] Llama OCR
- Performs OCR using the Llama 3.2 model endpoint served by Together AI
- The Llama 3.2 11B & 90B models are available as paid options
- Image upload page link 🔗
📜 [Apple] Cut Your Losses in Large-Vocabulary Language Models
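- Vocabularies keep growing, so computing the cross-entropy loss during training takes up a disproportionate amount of memory
  - This is because a logit matrix is built for every (input token, vocab item) pair; even for small models it consumes several times the memory of the rest of the LLM
- Proposes Cut Cross-Entropy (CCE): computes cross-entropy without storing the logits for all tokens in global memory
  - Instead computes only the logit of the correct token and evaluates the log-sum-exp over all logits on the fly
- For Gemma 2 (2B), cuts the memory of the loss computation from 24GB → 1MB, and the total training-time memory of the classification head from 28GB → 1GB
- GitHub link 🔗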
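The core idea (keep only the correct-token logit and accumulate the log-sum-exp chunk by chunk) can be sketched in plain Python. This is an illustrative sketch for a single token, not the paper's fused GPU kernel; `chunked_cross_entropy` and its parameters are hypothetical names.

```python
import math


def chunked_cross_entropy(hidden, classifier, target, chunk_size=2):
    """Cross-entropy for one token without holding the full logit vector.

    hidden: list[float], the token embedding
    classifier: list[list[float]], one weight row per vocabulary item
    target: index of the correct vocabulary item
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    target_logit = dot(hidden, classifier[target])  # the only logit kept around

    running_max, running_sum = float("-inf"), 0.0
    for start in range(0, len(classifier), chunk_size):
        # Logits for this chunk only; they are discarded after the update.
        chunk = [dot(hidden, row) for row in classifier[start:start + chunk_size]]
        new_max = max(running_max, max(chunk))
        # Rescale the accumulated sum to the new maximum, then add this chunk.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(z - new_max) for z in chunk)
        running_max = new_max

    log_sum_exp = running_max + math.log(running_sum)
    return log_sum_exp - target_logit
```

Peak memory here scales with `chunk_size` rather than with the vocabulary size, which is the property the bullets above describe.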
🧑🏻‍💻 [Anthropic] Improve your prompts in the developer console
- Adds a feature to the Anthropic Console that improves existing prompts
- Uses CoT reasoning, example standardization, example enrichment, rewriting, prefill addition, etc.
- Multi-shot examples can be managed in the workbench. Claude can also generate synthetic data automatically
- (Released earlier, but) also supports an evaluation feature that assigns a 1-5 score to final outputs

3rd week

📜 [Harvard, Stanford, MIT, Databricks, CMU] Scaling Laws for Precision
- Points out that low-precision training & inference strongly affect LM quality, yet existing scaling laws do not account for them
- Proposes that training in lower precision reduces the model's effective parameter count, which makes it possible to predict the loss from low-precision training and post-train quantization
- For inference: the more data a model was trained on, the more severe the degradation from post-training quantization
- For training: claims their scaling law can predict the results of training at different precisions, and suggests training larger models in lower precision
📜 [MIT] The Surprising Effectiveness of Test-Time Training for Abstract Reasoning
- Test-time training (TTT): temporarily updates model parameters at inference time using a loss derived from the input data
- Uses the Abstraction and Reasoning Corpus (ARC) as the benchmark (reasoning focus)
- Key components of TTT: (1) initial finetuning on similar tasks (2) auxiliary task format and augmentations (3) per-instance training
📜 [Peking, Tsinghua] LLaVA-o1: Let Vision Language Models Reason Step-by-Step
- Current Vision-Language Models struggle with systematic & structured reasoning
- LLaVA-o1, autonomous multistage reasoning
- Unlike ordinary CoT prompting, LLaVA-o1 engages independently & sequentially in stages of summarization, visual interpretation, logical reasoning, and conclusion generation
- LLaVA-o1-100k dataset: visual question answering, structured reasoning annotations
📜 [Shanghai, Fudan] Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions
- Existing LLM benchmarks are simple QA and do not cover the complex questions found in the real world
- Introduces Compound Question Synthesis (CQ-Syn) to build Compound-QA, focusing on questions with multiple sub-questions
- Covers five categories: Factual-Statement, Cause-and-Effect, Hypothetical-Analysis, Comparison-and-Selection, Evaluation-and-Suggestion
📜 [UIUC, IBM] DELIFT: Data Efficient Language model Instruction Fine Tuning
- Criticizes current training approaches for focusing only on single-stage optimization or intensive gradient calculation
- DELIFT systematically optimizes data selection across three stages of fine-tuning
- (1) instruction tuning (2) task-specific fine-tuning (3) continual fine-tuning
- Uses a pairwise utility metric that quantifies how beneficial a data sample is to the model's current state
📜 [Univ. of California, Tsinghua, Peking] Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles
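- When an LM compresses prompts, the compression style (extractive or abstractive) strongly affects the result
- Style-Compress: adapts a smaller model to compress prompts for a new task without additional fine-tuning
- Confirms solid results when compression is applied after adapting with 10 samples and 100 queries
- Turned into a paper with simple formulas, a pipeline, and varied experiments.. frameworks matter too these days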
🧑🏻‍💻 [Microsoft] Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators
- Releases a high-quality instruction dataset for training agent models (1M pairs)
- Explains that using synthetic data can speed up LLM training
📜 [KAIST] AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML
- Existing AutoML systems require expertise to set up complex tools and take a lot of time
- AutoML-Agent: a multi-agent framework covering everything from data retrieval to model deployment
- Uses a retrieval-augmented planning strategy to produce an optimal plan
- Splits each plan into sub-tasks so that specialized agents can handle them
🧑🏻‍💻 [AI2] Ai2 OpenScholar: Scientific literature synthesis with retrieval-augmented language models
- a retrieval-augmented LM & 45M-paper datastore (CS, Bio, Physics, … )
- retriever and reranker to search the datastore
- 8B Llama fine-tuned on high-quality synthetic data
- self-feedback generation pipeline
🧑🏻‍💻 [Mistral AI] Mistral has entered the chat
- Web search with citations, Canvas for ideation
- SoTA document and image understanding, powered by the new multimodal Pixtral Large
  - SoTA on MathVista, DocVQA, VQAv2
  - 123B multimodal decoder, 1B parameter vision encoder
  - 128K context window
- Faster responses powered by speculative editing
🧑🏻‍💻 [Perplexity] Shop like a Pro: Perplexity’s new AI-powered shopping assistant
- Seems to be US-only for now
- Buy with Pro: One-click checkout to save time & free shipping
- Snap to Shop: a visual search tool that finds products similar to a photo of an item
- Introducing the Perplexity Merchant Program: a program for sellers; joining makes their products eligible for indexing, which may improve recommendations
📜 [Together AI, Stanford, etc] RedPajama: an Open Dataset for Training Large Language Models
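- Points out three data-related problems that hold back open-source models
  - Lack of transparency in model development (including data curation), difficulty of obtaining high-quality data at scale, and limited availability of artifacts and metadata for dataset curation and analysis
- To address these, releases RedPajama-V1, an open reproduction of the LLaMA training dataset
- Also releases RedPajama-V2, a massive web-only dataset of raw, unfiltered text data
- The RedPajama datasets comprise over 100T tokens of text across diverse domains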
📜 [Stony Brook] A Novel Approach to Eliminating Hallucinations in Large Language Model-Assisted Causal Discovery
- LLMs hallucinate in causal discovery, so model selection matters
- Proposes using RAG to reduce hallucination when high-quality data is accessible
- Proposes having multiple LLMs, including an arbiter, debate and audit the edges of causal graphs to minimize hallucination
- Starts from building the graph through prompt engineering
- A study on minimizing hallucination via RAG over high-quality data and debate among strong LLMs
📽️ Cerebral Valley: Alexandr Wang Scale AI
- Data usable for pre-training is effectively exhausted.
- However, there is enormous room to improve models through post-training.
- The recent o1 and DeepSeek are good examples
🧑🏻‍💻 [DeepSeek] DeepSeek-R1-Lite-Preview is now live: unleashing supercharged reasoning power!
- o1-preview-level results on the AIME & MATH benchmarks
- Shows the thought process transparently in real time
- Open-source models and an API coming soon
- Chat available at the link
🧑🏻‍💻 [H] French startup H Company launches Runner H: a web automation agent with human-like precision
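- French startup H releases a web automation agent to a limited set of users. For now you have to join the wait list by email
- This is its first product, and it is reported to have raised $220M (about 300 billion KRW)
- An API beta is also offered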
🧑🏻‍💻 [HuggingFaceTB] SmolTalk
- The 1M samples used to build the SmolLM2-Instruct models
- Combines public datasets that improve instruction following while supporting strong performance across diverse tasks, and releases the result
🧑🏻‍💻 [Ai2] Tülu 3 opens language model post-training up to more tasks and more people
- Data & tools built to advance post-training
- Data, Data Toolkit, Training Code & Infrastructure, Evaluation Framework, Demo, Models & Checkpoints
🧑🏻‍💻 [Apple] AIMv2
- AIMv2: a vision model family pre-trained with a multimodal autoregressive objective
- Outperforms OAI CLIP, SigLIP, etc. on most multimodal understanding benchmarks
- Outperforms DINOv2 on open-vocabulary object detection & referring expression comprehension
- 📜 Multimodal Autoregressive Pre-training of Large Vision Encoders
📜 [Anthropic] Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
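- Points out that current LLM evaluations overlook the importance of experiment analysis and planning
- Explains to researchers with a statistics background how to analyze and approach LM evaluation data
- Presents formulas for analyzing evaluation data, measuring the difference between two models, and planning evaluation experiments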
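As a small illustration of attaching error bars to an eval score, the sketch below computes the mean, the standard error of the mean, and a normal-approximation 95% confidence interval over per-question scores. It assumes independent questions; the function name `mean_with_ci` is hypothetical, and the paper covers more cases (for example clustered questions and paired comparisons) than this sketch does.

```python
import math


def mean_with_ci(scores, z=1.96):
    """Return (mean, standard error, (low, high)) for per-question scores.

    scores: list of numeric scores, e.g. 1 for correct and 0 for incorrect
    z: normal quantile; 1.96 corresponds to a 95% interval
    """
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    sem = math.sqrt(variance / n)
    return mean, sem, (mean - z * sem, mean + z * sem)


mean, sem, (low, high) = mean_with_ci([1, 0, 1, 1, 0, 1, 1, 1])
```

With only a handful of questions the interval is wide, which is exactly why a single accuracy number can be misleading.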

4th week

📜 [Aalborg Univ.] Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective
- A study of methods for knowledge integration & evaluating hallucination
- Uses knowledge graphs to mitigate LLM hallucination
📜 [Google DeepMind] Learning high-accuracy error decoding for quantum processors (Nature 2024)
- recurrent, transformer-based neural network that learns to decode the surface code
- Google DeepMind is conducting quantum computing research using AI
📜 [National Univ. of Singapore] The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
- A case study written while trying Claude 3.5 Computer Use across various domains and software
- Includes a wide range of the prompts, domains, and software used in the study
- GitHub link 🔗
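🗞️ [Amazon] Amazon and Anthropic deepen strategic collaboration
- Amazon strengthens its strategic collaboration with Anthropic and makes an additional $4 billion investment (about 5 trillion KRW)
- Can be understood as similar to the relationship between Microsoft & OpenAI
- To be used for developing "Trainium", the accelerator chip for building Anthropic's next-generation models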
🧑🏻‍💻 [Anthropic] Hume AI creates emotionally intelligent voice interactions with Claude
- More than 2M minutes of AI voice conversations completed
- 36% of users chose Claude over other LLMs
- Suggests that Anthropic is also actively working on models that interact naturally in real time
📜 [UPC, ETH] Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
- Uses sparse autoencoders as an interpretability tool to identify the key factors of entity recognition
- Finds meaningful directions in representation space that reveal whether the model knows a given entity
- These can also affect the refusal behavior of chat models
📜 [UCL, Shanghai, Brown, Singapore] Natural Language Reinforcement Learning
- Conventional RL formulates decision-making mathematically as an MDP
- Natural Language Reinforcement Learning (NLRL): extends the traditional MDP to a natural-language-based representation space
- Improves RL-like policy & value through pure prompting or gradient-based training
- GitHub link 🔗
📜 [Arizona] From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
- A survey paper on LLM-based judgment & assessment
- Compiles benchmarks for evaluating LLM-as-a-judge
🧑🏻‍💻 [OpenAI] Advancing red teaming with people and AI
- OpenAI releases two papers on external & automated red teaming
- 📜 External red teaming
- 📜 Automated red teaming
📜 [MIT] Model-Based Transfer Learning for Contextual Reinforcement Learning
- Inspired by zero-shot transfer: selecting a good set of training tasks
- Proposes Model-Based Transfer Learning (MBTL): a performance set point modeled with a Gaussian process, and a performance loss modeled as a linear function of contextual similarity
- Combines the two and uses them strategically within a Bayesian Optimization (BO) framework
- Up to 50x better sample efficiency than independent & multi-task training
📜 [NVIDIA] Star Attention: Efficient LLM Inference over Long Sequences
- Star Attention: a two-phase block-sparse approximation that shards attention across multiple hosts while minimizing communication overhead
- Phase 1: blockwise-local attention across hosts → Phase 2: query & response tokens attend to all previously generated and cached tokens via sequence-global attention
- Transformer-based models trained with global attention can expect up to about 11x faster inference (while retaining 95~100% accuracy)
📜 [Ai2] OLMo 2: The best fully open language model to date
- 7B & 13B models trained on 5T tokens
- Applies the nice recipe from Tülu 3 to OLMo 2 as well (so how do the two differ, then..?)
📜 [Case Western Reserve Univ.] Dynamic Self-Distillation via Previous Mini-batches for Fine-tuning Small Language Models
- DynSDPB: dynamic SelfD from the previous mini-batch, which reuses the most recently generated logits
- Dynamically adjusts the distillation influence and the temperature value
- Integrates seamlessly with self-correction & self-training techniques
📜 [Tsinghua] Training and Evaluating Language Models with Template-based Data Generation
- Proposes Template-based Data Generation (TDG): uses GPT-4 to generate parameterized meta-templates
- TemplateMath Part 1: TemplateGSM, a synthetic dataset of more than 7 million grade-school math problems
- Hugging Face dataset link 🔗
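A toy sketch of a parameterized template shows why this scales: one template yields many verified question/answer pairs. The template text, names, and value ranges here are made up for illustration; in the paper the meta-templates are generated with GPT-4 rather than written by hand.

```python
import random

# Hypothetical hand-written template standing in for a GPT-4-generated meta-template.
TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have now?"


def generate_problem(rng):
    """Sample template parameters and compute the answer programmatically."""
    name = rng.choice(["Mina", "Jisoo", "Alex"])
    a, b = rng.randint(1, 50), rng.randint(1, 50)
    return {"question": TEMPLATE.format(name=name, a=a, b=b), "answer": a + b}


rng = random.Random(0)  # seeded for reproducibility
problems = [generate_problem(rng) for _ in range(3)]
```

Because the answer is computed from the sampled parameters, every generated problem comes with a correct label by construction.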
🧑🏻‍💻 [Andrew Ng] aisuite
- Andrew Ng releases a Python package that makes it very easy to switch between LLMs from different providers
- Supports OpenAI, Anthropic, Azure, Google, AWS, Groq, Mistral, HuggingFace, Ollama, etc.
🧑🏻‍💻 [HuggingFace] SmolVLM - small yet mighty Vision Language Model
- Releases SmolVLM, a 2B SOTA VLM: SmolVLM-Base, SmolVLM-Synthetic, SmolVLM-Instruct
- All model checkpoints, VLM datasets, training recipes, and tools are released under Apache 2.0
📜 [NVIDIA] Hymba: A Hybrid-head Architecture for Small Language Models
- Releases Hymba, a small language model family with a hybrid-head parallel architecture that combines transformer attention and SSMs
- Attention heads handle high-resolution recall; SSM heads handle efficient context summarization
- Introduces learnable meta tokens that are prepended to the prompt and store important information
- Base & Instruct models released on Hugging Face
🧑🏻‍💻 [Qwen] QwQ: Reflect Deeply on the Boundaries of the Unknown
- QwQ: Qwen with Questions, QwQ-32B-Preview
- Limitations include Language Mixing and Code-Switching, Recursive Reasoning Loops, and Safety and Ethical Considerations
- Strong results on reasoning-heavy benchmarks such as GPQA, AIME, MATH-500, and LiveCodeBench
🧑🏻‍💻 [IBM, Meta] Supercharging Training using float8 and FSDP2
- Achieves up to 50% throughput speedup over FSDP1 bf16 training
- Confirms the gains on Llama models from 1.8B to 405B (based on the Llama 3 architecture)
- Demonstrates the feasibility of end-to-end float8 training
📜 [Univ. of Luxembourg] LongKey: Keyphrase Extraction for Long Documents
- Automated keyphrase extraction has mostly focused on short documents of around 512 tokens
- LongKey, a novel framework for extracting keyphrases from lengthy documents
- Uses an encoder-based language model and a max-pooling embedder

🎃 October

1st week

🧑🏻‍💻 [Google DeepMind] How AlphaChip transformed computer chip design
- Shares the results of using reinforcement learning for computer chip design
- Actually used to design the layout of the 6th-generation TPU (AI for chip design)
🧑🏻‍💻 [Anthropic] Introducing Contextual Retrieval
- Improves RAG accuracy by prepending chunk-specific explanatory context to each chunk
- Builds the index used for Contextual BM25
- Since people cannot write the context by hand, an AI model (Claude) generates it
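The prepending step can be sketched as below. `describe` is a hypothetical callable standing in for the LLM call that writes a short context for a chunk given the whole document; it is not an Anthropic API, and the exact joining format is an assumption.

```python
def contextualize_chunks(document, chunks, describe):
    """Prepend a chunk-specific context string to each chunk before indexing.

    document: full source text the chunks came from
    chunks: list of chunk strings
    describe: callable (document, chunk) -> short explanatory context (assumed LLM-backed)
    """
    return [f"{describe(document, chunk)}\n\n{chunk}" for chunk in chunks]


# Stub in place of a real model call, for illustration only.
def stub_describe(document, chunk):
    return f"From: {document}"


indexed_texts = contextualize_chunks(
    "Q2 report for ACME", ["Revenue grew 3%."], stub_describe
)
```

The contextualized strings, rather than the raw chunks, are what get embedded and added to the BM25 index.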
📜 [BAAI] Emu3: Next-Token Prediction is All You Need
- Tokenizes images, text, and video into a discrete space and trains on them from scratch
- → no need for diffusion or compositional architectures
📜 [Waterloo, Peking] MIO: A Foundation Model on Multimodal Tokens
- Processes speech, text, image, and video end-to-end, again using multimodal tokens → causal multimodal modeling
- four-stage training process
  - (1) alignment pre-training (2) interleaved pre-training (3) speech-enhanced pre-training (4) comprehensive supervised fine-tuning
📜 [Microsoft] VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
- Formulates the LLM VQ (Vector Quantization) problem with Second-Order Optimization and presents a quantization algorithm
- Refines the weights using Channel-Independent Second-Order Optimization
- GitHub link 🔗
📜 [Apple] MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
- Releases multimodal large language models (MLLMs) designed for text-rich image understanding, visual referring and grounding, and multi-image reasoning
- Uses high-quality OCR data & synthetic captions for continual pre-training → uses an optimized visual instruction-tuning data mixture for supervised fine-tuning
- Model sizes range from 1B to 30B, including MoE variants
- Releases MM1.5-Video and MM1.5-UI, specialized for video understanding and mobile UI understanding.
- Personally, as someone eagerly awaiting Apple Intelligence, I really hope the models perform well enough to be useful 🙏🏻
📜 [Meta, UIUC] Law of the Weakest Link: Cross Capabilities of Large Language Models
- Cross capabilities: the intersection of the different kinds of expertise needed to handle real-world tasks
- Defines 7 core individual capabilities and manually pairs them to build a taxonomy
- Releases the CrossEval benchmark of 1,400 human-annotated prompts, with 100 prompts for each individual & cross capability
- Evaluation on it suggests that current LLMs exhibit the Law of the Weakest Link
🧑🏻‍💻 [Liquid] Liquid Foundation Models: Our First Series of Generative AI Models
- A generative language model family (LFM) that achieves SOTA at each model size. Consists of 1B, 3B, and 40B (MoE, 12B activated) models.
- 32k token context length, effective across the entire range
- Not open source. Available on Liquid Playground, Lambda, Perplexity Labs, etc.
- Interest in sLLMs seems high lately, and releasing a non-open-source model family in this space seems rather uncommon
📜 [CMU] Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation
- Applies RAG to the robotics domain
- Embodied-RAG: a non-parametric memory system that can autonomously build hierarchical knowledge for navigation & language generation
- Handles a wide range of spatial & semantic resolutions across diverse environments and query types
📜 [Yale, OpenAI, Princeton] When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1
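- The reasoning-optimized OpenAI o1 clearly shows notable gains, but like earlier LLMs it still has not overcome its sensitivity to probability distributions
- Uses the phrase "embers of autoregression"; I take it to point at problems that stem from the fundamental nature of repeatedly predicting the next token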
📜 Unleashing the Power of Large Language Models in Zero-shot Relation Extraction via Self-Prompting
- Proposes a Self-Prompting framework that exploits the Relation Extraction knowledge inherent in LLMs
- Generates diverse synthetic data with a three-stage diversity approach → used as in-context learning samples
📜 [Mila, Google DeepMind, Microsoft] Not All LLM Reasoners Are Created Equal
- Tests LLMs on grade-school math (GSM) problems chained in pairs, where the answer to the first problem is needed to solve the second
- Reports a notable gap between solving the compositional pairs and solving each problem independently
- The gap is said to be more pronounced in smaller, more cost-efficient, and math-specialized models
📜 [Johns Hopkins] RATIONALYST: Pre-training Process-Supervision for Improving Reasoning
- Points out that the reasoning steps LLMs generate are closer to imitation and therefore incomplete
- → Proposes Rationalyst, a model for process-supervision of reasoning pre-trained on diverse rationale annotations extracted from unlabeled data
- Extracts 79K rationales from the Pile dataset, with minimal human intervention.
📜 [Apple] Contrastive Localized Language-Image Pre-Training
- CLIP is not well suited to fine-grained vision representations that require region-level understanding
- Proposes CLOC, which complements CLIP with a region-text contrastive loss & module
- Formulates promptable embeddings, which make it easy to turn image embeddings into region representations
🧑🏻‍💻 [Google] Gemini 1.5 Flash-8B is now production ready
- 50% cheaper than 1.5 Flash, 2x higher rate limits, lower latency on small prompts
- Described as a lightweight model; real-world performance still needs to be checked against community feedback
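📜 [Mila] Were RNNs All We Needed?
- Traditional RNNs were slow because of BPTT; shows that LSTM & GRU no longer need it once the hidden state dependencies of the input, forget, and update gates are removed
- Presents versions that use fewer parameters than the traditional models and are fully parallelizable during training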

2nd week

📜 [Google Research, Apple] LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
- Argues that the internal representations of LLMs carry more information about truthfulness than previously recognized
- (1) Tried error detection using specific tokens that concentrate this information, but it did not generalize → multifaceted
- (2) Confirms that internal representations can be used to reduce the errors a model makes
- (3) Finds a discrepancy between the LLM's internal encoding and its external behavior
📜 [Salesforce] Enhance Reasoning by Learning from Mistakes: Peer-Review Knowledge Distillation from Multiple Large Language Models
- Existing KD uses the response from a single LLM as the gold rationale, which is a problem
- Proposes Mistake-Aware Peer-Review Distillation (MAPD)
  - Instructs the teacher to identify and explain the student's mistakes and to provide customized instruction learning data
  - Designs a simulated peer-review process and uses only rationales that pass an acceptance threshold
- In the end, peer review means using several proprietary models, so the method multiplies the cost by n
🧑🏻‍💻 feder-cr/Auto_Jobs_Applier_AIHawk
- Drew attention for submitting 1,000 applications within 24 hours using an AI bot and landing 50 interviews
🧑🏻‍💻 mendableai/firecrawl
- An API that converts websites into LLM-ready markdown or structured data
📜 [Stanford] Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise
- Tutor Copilot, a novel Human-AI approach: an AI tool that assists tutors as they teach students.
- A large-scale study with 900 tutors and 1,800 students from under-served communities
- Students studying math reportedly saw a meaningful score improvement (4%p)
- Costs only $20 per tutor per year
📜 [Hong Kong, Huawei, McGill & MILA] RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
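- Argues that the gap between LLM-as-a-Judge and human evaluation stems from the absence of guided oracles in the evaluation process
- Exploits the fact that LLMs are good at text revision: adaptively revises the response and uses the revision as the reference for the subsequent evaluation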
📜 [Microsoft, Tsinghua] Differential Transformer
- Points out that Transformers overallocate attention to irrelevant context
- The differential attention mechanism computes attention scores as the difference between two separate softmax attention maps → promotes sparse attention patterns
- Particularly strong at long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers
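A minimal single-query sketch of the "difference of two softmax maps" idea follows. It assumes the two score vectors come from two separate query/key projections and treats `lam` as a given scalar; the full method learns this scalar and then applies the weights to the values, which is omitted here. Function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def differential_attention_weights(scores_a, scores_b, lam):
    """Attention weights as softmax(scores_a) - lam * softmax(scores_b)."""
    return [a - lam * b for a, b in zip(softmax(scores_a), softmax(scores_b))]


# Both maps put similar mass on the last two (noisy) positions, so it largely cancels.
weights = differential_attention_weights([4.0, 1.0, 1.0], [1.0, 1.0, 1.0], lam=0.5)
```

Attention mass that both maps assign to the same positions cancels out, which is how the mechanism suppresses attention to irrelevant context.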
🧑🏻‍💻 [HuggingFace] gradio-app/openai-gradio
- A Python package that makes it very simple to build AI-powered web apps
- It would be nice if it could be built on local models instead of an API
📜 [Tsinghua, Microsoft] Data Selection via Optimal Control for Language Models
- PMP-based Data Selection (PDS): a framework that approximates optimal data by solving the Pontryagin's Maximum Principle (PMP) conditions
- Applying PDS to CommonCrawl significantly improves pre-training efficiency
- Experiments with 160M, 470M, 1B, and 1.7B models based on the Mistral architecture
- GitHub link 🔗
📜 [Microsoft] VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
- Formulates the LLM VQ problem with Second-Order Optimization and designs the quantization algorithm by solving the optimization
- Refines the weights by applying Channel-Independent Second-Order Optimization to granular VQ
- Proposes a brief & effective codebook initialization algorithm by decomposing the optimization problem
- Supports residual & outlier quantization to improve model accuracy and increase the compression rate
- GitHub link 🔗
🧑🏻‍💻 [HuggingFace] LLM Evaluation Guidebook
- Earlier Hugging Face blog post for reference 🔗
- Includes material for both beginners and advanced users
📜 [Baidu] Retrieving, Rethinking and Revising: The Chain-of-Verification Can Improve Retrieval Augmented Generation (EMNLP 2024)
- Problems with existing RAG: 1) the original query may be unsuitable for retrieval 2) the LM's knowledge limits can lead to inconsistent answers
- Proposes chain-of-verification (CoV-RAG) to address these
- Adds a verification module to RAG that takes part in scoring, judgment, and rewriting
- Trains with CoT reasoning in both QA and verification to correct internal generation errors
- Meta previously proposed CoVE for hallucination mitigation; worth checking how this differs
📜 [HKUST, UIUC] Personalized Visual Instruction Tuning
- The face blindness problem of current MLLMs: they cannot hold personalized dialogue → hard to apply MLLMs to mobile devices, domestic robots, etc.
- Proposes PVIT (Personalized Visual Instruction Tuning), a data curation & training framework that lets an MLLM identify target individuals in an image and carry on a coherent dialogue
📜 [Microsoft] Scaling Optimal LR Across Token Horizons
- How hyperparameters should change with dataset size had not been studied yet
- The optimal LR changes with the token horizon: longer training needs a smaller LR
- The optimal LR also follows a scaling law, so the optimal LR for a longer horizon can be predicted from shorter horizons
- Arguably a must-read when scaling up dataset and model size..
📜 [KAIST, Washington, LG AI Research] Knowledge Entropy Decay during Language Model Pretraining Hinders New Knowledge Acquisition
- Studies how a model's parametric knowledge changes during pretraining, from the perspective of knowledge acquisition & forgetting
- Introduces knowledge entropy to quantify the range of memory the model engages with. A high value means the model draws on a wide range of memory sources; a low value means the opposite
- Knowledge entropy decreases as pretraining progresses, which the authors argue means a decline in the model's ability to acquire & retain knowledge
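To make "high entropy means a wide range of memory sources" concrete, here is a small sketch that normalizes non-negative usage weights over memory sources and takes the Shannon entropy. How the weights are obtained from the model is not shown and is an assumption of this sketch; `knowledge_entropy` is a hypothetical helper name.

```python
import math


def knowledge_entropy(coefficients):
    """Shannon entropy (in nats) of normalized, non-negative memory-usage weights."""
    total = sum(coefficients)
    probs = [c / total for c in coefficients]
    return -sum(p * math.log(p) for p in probs if p > 0)


broad = knowledge_entropy([1.0, 1.0, 1.0, 1.0])   # spread evenly over four sources
narrow = knowledge_entropy([1.0, 0.0, 0.0, 0.0])  # relies on a single source
```

The evenly spread case gives the maximum value for four sources, and the single-source case gives zero, matching the intuition in the bullets above.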
📜 [OpenAI] MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
- Introduces a benchmark for evaluating how well AI agents perform machine learning engineering
- Curates 75 MLE competitions from Kaggle to test a variety of real-world ML engineering skills such as model training, dataset preparation, and running experiments
- A study showing that OpenAI's o1-preview is the best..?
- GitHub link 🔗
📜 [Hong Kong] Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models
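- Proposes the Teaching-Inspired Integrated Framework, which mimics the instructional process of a teacher teaching students
- Helps the LLM recall the essential concepts, relevant theorems, and similar problems needed for reasoning
- Releases two in-house Chinese benchmarks, MathMC and MathToF
- Does this approach really maximize the model's ability? Is it applicable in every situation? And the model presumably was not trained on data about teaching students, so why does this work?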
🧑🏻‍💻 [Tesla] Robotaxi
- Tesla unveils the Robotaxi & Robovan
🧑🏻‍💻 ML Code Challenges
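- A LeetCode-style site of machine learning coding challenges
- Covers concepts such as matrix multiplication, covariance matrices, and decision trees, so it looks good for coding practice. Categories include linear algebra, machine learning, deep learning, and nlp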
📜 One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation
- Proposes initializing LoRA weights in a data-driven way by computing the SVD of mini-batches of activation vectors
- Called Explained Variance Adaptation (EVA); across a variety of tasks it reportedly converges faster and achieves higher average scores
📜 [CMU] Better Instruction-Following Through Minimum Bayes Risk
- Proposes Minimum Bayes Risk (MBR) decoding as a promising way to use LLM judges for supervision
- Uses a reference-based evaluator to pick the highest-quality output among several candidates
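The selection rule can be sketched as follows: each candidate is scored against every other candidate as a pseudo-reference, and the one with the highest total utility wins. The token-overlap F1 used here is only a stand-in for the reference-based LLM judge described above, and the names `token_f1` and `mbr_select` are hypothetical.

```python
def token_f1(candidate, reference):
    """Token-overlap F1 between two whitespace-tokenized strings (stand-in utility)."""
    cand, ref = candidate.split(), reference.split()
    common = sum(min(cand.count(t), ref.count(t)) for t in set(cand))
    if common == 0:
        return 0.0
    precision, recall = common / len(cand), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=token_f1):
    """Return the candidate with the highest summed utility against all other candidates."""
    def expected_utility(i):
        return sum(
            utility(candidates[i], other)
            for j, other in enumerate(candidates)
            if j != i
        )

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


best = mbr_select([
    "the cat sat on the mat",
    "the cat sat on the",
    "a cat sat on the mat",
    "dogs bark",
])
```

Swapping `utility` for a call to a judge model gives the setup the entry describes, at the cost of a quadratic number of judge calls.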
📜 [Washington, AI2] Can Language Models Reason about Individualistic Human Values and Preferences? (Yejin Choi)
- Proposes individualistic alignment in order to cover diversity in the true sense
- Introduces IndieValueCatalog, a dataset adapted from the World Value Survey (WVS)
- Releases the IndieValueReasoner model series trained on this dataset
- Code & data link 🔗

3rd week

📜 [Central Florida] Parameter-Efficient Fine-Tuning of Large Language Models using Semantic Knowledge Tuning
- Proposes Semantic Knowledge Tuning (SK-Tuning), a form of prompt & prefix tuning that uses meaningful words instead of random tokens
- To do so, uses a fixed LLM that can understand the semantic content of the prompt zero-shot
- Integrates the processed prompt with the input text so the model performs better on the target task
- Claims better performance on text classification & understanding with less time and cost than other tuning methods
📜 [Peking, Microsoft] Self-Boosting Large Language Models with Synthetic Preference Data
- Obtaining high-quality preference datasets is a resource-intensive & creativity-demanding process
- A self-prompt generator produces diverse prompts → a response improver progressively refines the responses
- The LLM autonomously learns a generative reward for its own outputs, removing the need for large-scale annotation
- Validation on AlpacaEval 2.0 & ArenaHard shows a large improvement in instruction following
📜 [UNIST] Response Tuning: Aligning Large Language Models without Instruction
- Hypothesizes that establishing an appropriate output space is the more effective approach → removes the instruction-conditioning step and focuses solely on response space supervision
- According to the experiments, their model trained only on responses could follow a wider range of instructions than, or performed better than, instruction-tuned models
- Reports that target behavior could be induced by controlling the training response distribution
🧑🏻‍💻 [OpenAI] openai/swarm
- Ergonomic & lightweight multi-agent orchestration for educational purposes
- Built to demonstrate the handoff & routines pattern from the Orchestrating Agents: Handoffs & Routines cookbook
📜 [Alibaba] StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization
- Current RAG often struggles because useful information is badly scattered
- Proposes StructRAG, inspired by how people convert raw information into various kinds of structured knowledge
- In other words, restructures documents into the structured format best suited to the task
🧑🏻‍💻 [Mistral AI] Un Ministral, des Ministraux
- Releases the Ministral 3B & 8B models
- 128k context length (currently 32k on vLLM). The 8B model uses sliding-window attention
- Presents benchmark results showing better performance than Llama-3.1-8B
- Licensed under the Mistral Commercial License and the Mistral Commercial & Research License, respectively
📜 [Meta, Berkeley, NYU] Thinking LLMs: General Instruction Following with Thought Generation
- Proposes a method that gives LLMs the ability to think before responding for general instruction following, without additional data
- Explores the space of possible thought generations through an iterative search & optimization procedure, with no direct supervision required
- Thought candidates for each instruction are scored by a judge model and used for preference optimization (DPO)
- Highlights strong results on AlpacaEval & Arena-Hard, and claims gains in other areas such as marketing, health, and general knowledge.
🧑🏻‍💻 [Zyphra] ZAMBA2-7B
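- Releases an open-source model claiming better quality and performance than the Mistral, Gemma, and Llama3 series
- single shared attention block → two shared attention blocks
- An inference-efficient model with roughly 25% faster per-token inference
- A new Mistral model came out within a day of this, so a performance comparison may be needed..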
🧑🏻‍💻 [NVIDIA] Llama-3.1-Nemotron-70B
- NVIDIA's model fine-tuned from Llama
- SoTA on Arena Hard and RewardBench as of October 2024
- Reportedly surpasses GPT-4o and Claude 3.5
🧑🏻‍💻 [Rhymes AI] Aria
- SoTA among multi-modal models
- Handles text, image, and video, with a 64k context window
- Uses 3.9B activated parameters per token
🧑🏻‍💻 [Perplexity] Introducing Internal Knowledge Search and Spaces
- A unified tool with simultaneous access to internal & external data (up to 500 files)
- Team-based search available in Perplexity Spaces
📜 [Fudan, CMU, ByteDance] Revealing the Barriers of Language Agents in Planning
- Why do language agents fail at human-level planning? → limited role of constraints & diminishing influence of questions
- Many recent works use language models as agents for planning; this study identifies the causes of the limitations they show, and analyzes and explains them in connection with Memory Updating.
📜 [Tufts University] "Let's Argue Both Sides": Argument Generation Can Force Small Models to Utilize Previously Inaccessible Reasoning Capabilities
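- Argument Generation: generates arguments for each possible inference result and has the end model rank the generated arguments
- Claims it can replace zero-shot prompting without any additional layers
- Explains that CoT and Argument Generation are auxiliary tools that are useful mainly for zero-shot use on tasks that require reasoning
- Seems like a very simple and common approach, but the point that such techniques are limited auxiliary tools is memorable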
📜 [DeepSeek-AI, Hong Kong, Peking] Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
- Any to any multimodal autoregressive framework
- Decouples visual encoding into separate pathways while processing them with a single unified transformer architecture
- Decoupling alleviates the conflict between the roles of the visual encoder and also makes the framework more flexible
- GitHub link 🔗
📜 [Meta AI, KAUST] Agent-as-a-Judge: Evaluate Agents with Agents
- Current evaluations of agentic systems focus only on the final outcome and ignore the intermediate steps
- Adds agentic features to LLM-as-a-Judge to build Agent-as-a-Judge and applies it to code generation
- Presents DevAI, a new benchmark of realistic automated AI development tasks
- Compared with LLM-as-a-Judge, performs well enough to be on par with the human evaluation baseline
- GitHub link 🔗
📜 [UC Berkeley, Washington Univ] JudgeBench: A Benchmark for Evaluating LLM-based Judges
- Proposes a novel evaluation framework for objectively evaluating LLM-based judges
- Consists of challenging response pairs covering knowledge, reasoning, math, and coding tasks
- Includes a pipeline that converts existing difficult datasets into challenging response pairs with preference labels
- A pipeline that converts non-response-pair datasets seems quite valuable, but the evaluation method itself does not look remarkable
📜 [KAIST, Naver Cloud AI] How Does Vision-Language Adaptation Impact the Safety of Vision Language Models? (ICLR 2025)
- Vision-Language adaptation (VL adaptation) transforms an LLM into an LVLM, but can damage the inherent safety capabilities of the original LLM
- Explains that safety degradation occurs during VL adaptation even when the training data is safe
- Argues that supervised fine-tuning with safety datasets or reinforcement learning from human feedback can reduce the risk but are not complete solutions
- Proposes weight merging as a remedy, reducing safety degradation while maintaining helpfulness
- Weight merging seems to be used quite often lately; it makes me wonder whether this is the performance ceiling
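
A minimal sketch of the weight merging idea noted above, assuming plain linear interpolation between the safety-aligned LLM's weights and the VL-adapted model's weights. The paper may use a different merging rule; `merge_weights`, `alpha`, and the toy parameter names are hypothetical.

```python
def merge_weights(base, adapted, alpha=0.5):
    """Linearly interpolate two models' parameters, key by key.

    base, adapted: dicts mapping parameter names to lists of floats.
    alpha: share taken from the adapted model (0 keeps base, 1 keeps adapted).
    """
    if base.keys() != adapted.keys():
        raise ValueError("both models must share the same parameter names")
    merged = {}
    for name in base:
        merged[name] = [
            (1 - alpha) * b + alpha * a
            for b, a in zip(base[name], adapted[name])
        ]
    return merged


# Toy weights standing in for real checkpoints (hypothetical values).
safe_llm = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
vl_model = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}
merged = merge_weights(safe_llm, vl_model, alpha=0.5)
print(merged)
```

Sweeping `alpha` between 0 and 1 is one simple way to look for a point that keeps most of the adapted model's helpfulness while recovering some of the base model's safety behavior.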
📜 [AI2, Washington] Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
- Identifies four core aspects of preference-based learning
  - preference data, learning algorithm, reward model, policy training prompts
- According to the results, all four matter, but their impact is ordered: preference data > learning algorithm > improved reward models > unlabeled prompts for policy training
- PPO reportedly leads by 2.5% in math and 1.2% in general domains

4th week

📜 [Samsung Research] Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs
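- Studies the relationship between continuous pre-training & instruction fine-tuning
- Running CPT on an instruction model with a large number of new tokens sharply degrades instruction-following performance
- A base model stays stable even after CPT on a large number of new tokens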
📜 [OpenAI] First-Person Fairness in Chatbots
- OpenAI research on whether AI models are biased by a person's name
- I recall seeing a summary saying the effect is under 1%, but given the number of users, much more rigorous safety policies or methods seem necessary
📜 [Anthropic, Scale AI, NYU, UC Berkeley] Looking Inward: Language Models Can Learn About Themselves by Introspection
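- Defines introspection as acquiring knowledge that is neither contained in nor derived from the training data
- Fine-tunes LLMs to predict properties of their own behavior in hypothetical scenarios
- The reasoning seems to be that a model M1 that can introspect will predict its own outputs better, and that this is taken as evidence that it outperforms M2
- Lately there is a lot of work trying to draw out a model's inherent abilities (reflection, self-correction, etc.), but it is disappointing that much of it leans on after-the-fact interpretation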
📜 [British Columbia] Supervised Chain of Thought
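- Splits the solution process into two parts: prompt space & answer space
- Argues that task-specific supervision is needed instead of one-for-all prompting (think step by step)
- Learning reasoning paths has been proposed before, so my impression is that the contribution may lie in a well-built dataset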
📜 [Hong Kong, Washington, HKUST, Microsoft] SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
- Argues that attention sparsity should be learned rather than predefined
- Proposes a mechanism with a learnable gate that adaptively selects important blocks in the attention map
- → balances accuracy & speed
- Implements a customized Flash Attention for this
- GitHub link 🔗
🧑🏻‍💻 [Microsoft] Open-sourced BitNet
- Open-sources the code for the 1-bit LLM paper, making it easier to run LLMs on local devices
🧑🏻‍💻 [Meta FAIR] Sharing new research, models, and datasets from Meta FAIR
- Releases SAM 2.1, with image & video updates
- Meta Spirit LM: An open source language model for seamless speech and text integration
  - Interleaves word-level text & audio data for cross-modality generation
- Layer Skip: Enhancing large language model performance with accelerated generation times
  - Uses only a subset of layers at inference time, then passes through verification & correction layers
  - Llama 3, Llama 2, Code Llama, and others were trained to support early exit
📜 [Texas, Pittsburgh, Princeton, CMU] CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy
- A study of the potential of LLMs to assist professional psychotherapy
- CBT-Bench (Cognitive Behavior Therapy) consists of three levels of tasks
  1. Basic CBT knowledge acquisition
  2. Cognitive model understanding
  3. Therapeutic response generation
📜 [Shanghai AI Lab] CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
- CompassJudger-1, the first open-source all-in-one judge LLM
- Supports unitary scoring & two-model comparison / evaluation in a specified format / critique generation / general LLM tasks
- Builds JudgerBench, covering various subjective evaluation tasks and topics
- Model and code released; community link 🔗
📜 [CMU] Causality for Large Language Models
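- Calls for more reliable & ethically aligned AI systems that go beyond the correlation-driven paradigm
- Studies how causality can influence each training stage of a language model and proposes future research directions. The aim is to overcome the limitations of prompt-based work.
- It sounds grand, but I cannot tell what it means from the abstract alone
- GitHub link 🔗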
🧑🏻‍💻 [Anthropic] Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku
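- The computer use API can read the screen, move the cursor, click, and type
- Includes the ability to translate natural language into computer commands
- Announces model updates that are much stronger than the previous versions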
📜 [Alibaba] Aligning Large Language Models via Self-Steering Optimization (ICLR 2025)
- Proposes Self-Steering Optimization (SSO), an algorithm that automatically generates high-quality preference signals based on predefined principles during iterative training
- Ensures a consistent gap between chosen & rejected responses while keeping training suited to the current policy model's learning capacity
- Also shows that preference datasets generated by SSO improve reward model performance
- GitHub link 🔗
📜 [Yonsei, SNU] Large Language Models Still Exhibit Bias in Long Text
- Proposes the Long Text Fairness Test (LTF-Test), a framework that evaluates LLM bias through essay-style prompts
- Consists of 14 topics, 10 demographic axes, and 11,948 samples
- The study finds that certain demographic groups are favored and that excessive sensitivity appears
- To mitigate this, proposes a fine-tuning approach that pairs biased prompts with neutral responses
🧑🏻‍💻 [IBM] IBM Introduces Granite 3.0: High Performing AI Models Built for Business
- Outperforms the Llama 3.1 8B model on the OpenLLM leaderboard
- 3~23x cheaper than larger models
- Uses an MoE architecture to handle enterprise tasks at sizes of 1B or smaller
- Supports a 128K window size (planned)
📜 [NVIDIA] HelpSteer2-Preference: Complementing Ratings with Preferences
- Releases preference annotations for Bradley-Terry training to complement the existing ratings (designed for Regression-style training)
- Head-to-head comparison of the two approaches → proposes combining Bradley-Terry and Regression reward modeling
- A model tuned from Llama-3.1-70B-Instruct scores 94.1 on RewardBench
- Dataset link 🔗 Model link 🔗
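
To make the contrast between the two training styles concrete, here is a minimal sketch of the two loss functions in their textbook form. It is not the paper's training code, and the function names are hypothetical.

```python
import math


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)); for very negative margins this is close to -margin.
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin


def regression_loss(predicted_rating, human_rating):
    """Squared error between a predicted score and an annotated rating."""
    return (predicted_rating - human_rating) ** 2


# Bradley-Terry only needs to know which response was preferred.
print(bradley_terry_loss(reward_chosen=2.0, reward_rejected=0.0))
# Regression needs an absolute rating for each response.
print(regression_loss(predicted_rating=3.5, human_rating=4.0))
```

The Bradley-Terry loss depends only on the gap between two rewards, while the regression loss ties each reward to an absolute scale, which is why the two kinds of annotation can complement each other.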
🧑🏻‍💻 [Cohere] Introducing Multimodal Embed 3: Powering AI Search
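- Supports a unified embedding space for text and images
- Reportedly supports 100+ languages at a decent level of performance (a pity there is no way to verify this)
- Fixes the problem of text and images clustering separately, giving better performance than CLIP on mixed-modality search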
📜 [OpenAI] Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
- Proposes a framework that unifies previous parameterizations of diffusion models and consistency models, and identifies the root causes of instability
- Achieves strong performance with only two sampling steps
- OpenAI blog & demo link 🔗
🧑🏻‍💻 [Google DeepMind] SynthID Identifying AI-generated content with SynthID
- Watermarks or identifies AI-generated content
- Supports image, audio, text, and video
- I really do not understand how audio and text in particular can be distinguished..
🧑🏻‍💻 [Meta] Introducing quantized Llama models with increased speed and a reduced memory footprint
- Releases the first lightweight quantized Llama models, small enough to run on mobile devices while still performing well
- Applies two methods to the Llama 3.2 models: Quantization-Aware Training with LoRA adaptors (accuracy) & SpinQuant (portability)
📜 [Washington, Google Cloud, DeepMind] Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence
- A collaborative search algorithm that starts from a pool of LLM experts & a utility function
- Guided by the best-found checkpoints across models, diverse LLM experts collectively move through weight space and optimize
- Model Swarms is thus a tuning-free model adaptation method that needs fewer than 200 examples

5th week

🧑🏻‍💻 [Stanford] Co-STORM Get a Wikipedia-like report on your topic with AI
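- Releases a preview built on this paper; currently free to use (NAACL 2024 Main)
- Everything written in the Wikipedia-style format can be downloaded as PDF
- The original source of every citation in the text can be checked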
📜 [Michigan, Amazon] A Theoretical Understanding of Chain-of-Thought: Coherent Reasoning and Error-Aware Demonstration
- Argues that when the earlier steps of CoT are integrated, the transformer gains better error correction and more accurate predictions
- Investigates the sensitivity of a transformer using Coherent CoT when demonstration examples are corrupted at inference time
- → more sensitive to the intermediate reasoning steps than to the final outcome
📜 [Shanghai] Agentic Information Retrieval
- Argues that LLMs have changed the traditional Information Retrieval paradigm
- Until now, IR has relied for decades on filtering predefined candidate items
- Presents Agentic IR and discusses three types of applications and the current problems
📜 [Michigan, Alibaba] Make LLMs better zero-shot reasoners: Structure-orientated autonomous reasoning
- Introduces a novel structure-oriented analysis method that helps the LLM understand the question better and guides the problem-solving process
- Uses a probabilistic graphical model to show why this approach helps actual reasoning
- Proposes a multi-agent reasoning system, Structure-oriented Autonomous Reasoning Agents (SARA)
🧑🏻‍💻 [Stability.AI] Introducing Stable Diffusion 3.5
- An 8B model that handles images at 1-megapixel resolution (good prompt adherence)
- Also releases a distilled turbo model that can reach Stable Diffusion 3.5-level performance
- Applies Query-Key Normalization in the transformer blocks
📜 [Huawei] Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning
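- Proposes Step Guided Reasoning, a method that requires no additional fine-tuning
- The LLM reflects on small reasoning steps and incorporates this into the inference stage, so that each step leads well into the next
- At a glance this seems to require running inference several times.. it does not look like a fundamental solution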
📜 [Google DeepMind, Boston] Measuring memorization through probabilistic discoverable extraction
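- Introduces a probabilistic relaxation that quantifies the probability of extracting a target sequence within the generated samples
- Argues that this makes it possible to identify what information the model has memorized
- Such work aims to prevent leakage of sensitive information used in training; I wonder whether the ultimate goal then becomes handling tasks through pure reasoning, understanding, and language ability with nothing memorized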
🧑🏻‍💻 [GitHub] Bringing developer choice to Copilot with Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and OpenAI’s o1-preview
- Turns Copilot into a multi-model AI coding assistant that includes third-party models
- Direct integration with VS Code, GitHub.com, and Apple Xcode
- Releases GitHub Spark inside VS Code (similar to Cursor's Composer)
- It seems to respond a step behind Cursor each time, both in model variety and with Spark.

🙇🏻 September

1st week

📜 [Meta] Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
- Releases a recipe for training multi-modal models over discrete & continuous data
- Combines the language-model loss (next token prediction) with diffusion to train a single transformer on mixed-modality sequences
- Trains a 7B model from scratch on 2T multi-modal tokens and confirms scaling laws
- Image patch vectors are inserted in the middle of a text sequence, delimited by a begin tag and an end tag
📜 [Stanford] Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
- Aligning an LLM to a preference dataset is a fairly complex process that often leads to disappointing results
- → (1) Preference data gives a better learning signal when the responses are contrastive
- → (2) Alignment objectives are more effective when they specify more control over model training (?)
- Contrastive Learning from AI Revisions (CLAIR): more contrastive preference pairs & Anchored Preference Optimization (APO)
📜 [Google DeepMind, UCLA, Mila] Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
- Compares stronger but expensive (SE) vs. weaker but cheaper (WC) models for synthetic data generation
- Three key metrics: coverage, diversity, false positive rate → WC has higher coverage and diversity, but also a higher false positive rate
- Weak-to-strong improvement setup: a weaker LM teaches reasoning to a stronger LM
- Models trained on WC-generated data outperform models trained on SE-generated data
📜 [University of Virginia] Dynamic Self-Consistency: Leveraging Reasoning Paths for Efficient LLM Sampling
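- Points out that prior work tried to minimize the cost of self-consistency (SC) but paid too little attention to the quality of reasoning paths
- → Reasoning-Aware Self-Consistency (RASC), an early-stopping framework that dynamically adjusts the number of generated samples by considering both the output answer and the reasoning path from CoT
- Assigns a confidence score to each generated sample and stops once a criterion is met → weighted majority voting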
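
The stop-then-vote loop described above can be sketched as follows. This is an illustration only: the confidence scores are given as inputs rather than computed from reasoning paths, and the stopping rule (one answer's accumulated confidence reaching a threshold) is an assumption, not the criterion from the paper. `sample_until_confident` and `threshold` are hypothetical names.

```python
from collections import defaultdict


def sample_until_confident(samples, threshold=2.0):
    """Consume (answer, confidence) pairs and stop once one answer's
    accumulated confidence reaches the threshold.

    Returns (winning_answer, number_of_samples_used).
    """
    totals = defaultdict(float)
    used = 0
    for answer, confidence in samples:
        used += 1
        totals[answer] += confidence
        if totals[answer] >= threshold:
            break
    if not totals:
        raise ValueError("no samples were provided")
    # Weighted majority vote over whatever was sampled before stopping.
    winner = max(totals, key=totals.get)
    return winner, used


# Hypothetical stream of sampled answers with confidence scores.
stream = [("42", 0.9), ("41", 0.4), ("42", 0.8), ("42", 0.7), ("40", 0.2)]
answer, used = sample_until_confident(iter(stream), threshold=2.0)
print(answer, used)
```

Because the input is consumed lazily, samples after the stopping point are never requested, which is where the cost saving would come from.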
🧑🏻‍💻 [LMSYS] Lmsys launches style control for Chatbot Arena to help separating the impact of style from substance in LLM rankings
- style control: which models tend to produce longer or better-formatted answers?
📜 [DP Technology] SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
- Problems of LLMs in science: (1) lack of scientific knowledge (2) unfamiliarity with science-specific tasks
- Proposes a hybrid strategy that combines continual pre-training (CPT) & supervised fine-tuning (SFT) → infuses scientific domain knowledge and improves instruction following on domain-specific tasks
- This requires (1) high-quality CPT corpora (2) diverse SFT instructions
- → addressed with a pipeline covering PDF text extraction, parsing content error correction, quality filtering, and synthetic instruction creation
📜 [Independent Researcher] CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation
- Presents CURLoRA, which combines LoRA with CUR matrix decomposition
- → mitigates catastrophic forgetting during continual learning & reduces trainable parameters
- Modified CUR decomposition: 1) inverted probabilities for column and row selection 2) U matrix initialized to zero 3) only the U matrix is fine-tuned
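
The three modifications listed above can be sketched in plain Python. This is a toy illustration under stated assumptions: the adapted weight is taken to be `W + C @ U @ R`, the inverted probabilities are computed from squared column and row norms, and all function names are hypothetical. It is not the author's implementation.

```python
import random


def inverted_probabilities(norms_sq, eps=1e-12):
    """Turn squared norms into sampling weights that favor small norms."""
    total = sum(norms_sq) + eps
    inverse = [1.0 / (n / total + eps) for n in norms_sq]
    scale = sum(inverse)
    return [v / scale for v in inverse]


def sample_without_replacement(weights, k, rng):
    """Draw k distinct indices, each draw proportional to the remaining weights."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        pick = rng.choices(remaining, weights=[weights[i] for i in remaining])[0]
        chosen.append(pick)
        remaining.remove(pick)
    return sorted(chosen)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def curlora_init(W, rank, seed=0):
    """Build frozen C and R from W and a zero-initialized trainable U."""
    rng = random.Random(seed)
    col_norms = [sum(row[j] ** 2 for row in W) for j in range(len(W[0]))]
    row_norms = [sum(v ** 2 for v in row) for row in W]
    cols = sample_without_replacement(inverted_probabilities(col_norms), rank, rng)
    rows = sample_without_replacement(inverted_probabilities(row_norms), rank, rng)
    C = [[row[j] for j in cols] for row in W]   # frozen, shape (m, rank)
    R = [W[i][:] for i in rows]                 # frozen, shape (rank, n)
    U = [[0.0] * rank for _ in range(rank)]     # the only trainable matrix
    return C, U, R


def adapted_weight(W, C, U, R):
    """Assumed adapter form: W plus the low-rank update C @ U @ R."""
    delta = matmul(matmul(C, U), R)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
C, U, R = curlora_init(W, rank=2)
print(adapted_weight(W, C, U, R) == W)  # the zero-initialized U leaves W unchanged
```

Since `U` starts at zero, the adapted model is identical to the original at the start of fine-tuning, and only the `rank * rank` entries of `U` would be updated.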
📜 [Tsinghua University] Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
- Real-time conversation requires being able to generate output while still receiving input in the audio modality
- Mini-Omni, an audio-based end-to-end conversational model (the first open-source model for real-time speech)
- Uses text-instructed speech generation and batch-parallel strategies
- VoiceAssistant-400K, a dataset that can be used to train models to produce speech output
- GitHub link 🔗
📜 [Peking University, ByteDance] MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models
- Points out that current open-source LLMs do not use visual information (geometric diagrams, charts, function plots) when doing mathematical reasoning
- → four training stages: 1) vision-language alignment 2) visual instruction-tuning 3) math instruction-tuning 4) process-supervised reinforcement learning → MultiMath-7B
- Releases the MultiMath-300K dataset, which includes K-12 level image captions and step-wise solutions
- GitHub link 🔗
📜 [NVIDIA] In Defense of RAG in the Era of Long-Context Language Models
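- As LLMs have become able to handle longer inputs, RAG has become less attractive
- However, processing extremely long inputs ends up distracting the model from the relevant information, which degrades performance
- → proposes order-preserve retrieval-augmented generation (OP-RAG)
- As the number of retrieved chunks grows, answer quality first rises and then falls, forming an inverted U-shaped curve ⇒ there is clearly a sweet spot where OP-RAG comes out ahead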
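
A minimal sketch of the order-preserving step, assuming the relevance scores have already been computed by some retriever. The chunk texts, scores, and the function name `order_preserving_retrieve` are hypothetical.

```python
def order_preserving_retrieve(chunks, scores, top_k):
    """Pick the top_k highest-scoring chunks, then return them in the
    order they appear in the source document rather than by score."""
    if len(chunks) != len(scores):
        raise ValueError("chunks and scores must have the same length")
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:top_k])  # restore original document order
    return [chunks[i] for i in keep]


chunks = ["intro", "method", "results", "appendix"]
scores = [0.8, 0.1, 0.9, 0.7]  # hypothetical query-chunk similarities
context = order_preserving_retrieve(chunks, scores, top_k=2)
print(context)
```

A score-ordered retriever would place "results" before "intro" here; the order-preserving variant keeps the document's own sequence.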
📜 [AI2, Washington, Princeton] OLMoE: Open Mixture-of-Experts Language Models
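- Releases OLMoE-1B-7B, which has 7B parameters but uses only 1B per input token
- Pre-trained on 5T tokens; an instruct version is released as well
- Claims better performance than Llama2-13B-Chat and DeepSeekMoE-16B
- Open-sources the model weights, training data, code, logs, and more. That's AI2 for you..
- Hugging Face, GitHub link 🔗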
📜 [Tsinghua] LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
- Work on enabling long-context LLMs to generate answers with sentence-level fine-grained citations, Long-Context Question Answering (LCQA)
- Proposes LongBench-Cite, a benchmark for evaluating LCQA
- Proposes the CoF (Coarse to Fine) pipeline
- Trains LongCite-8B and 9B on the LongCite-45k dataset
- GitHub link 🔗
📜 [Autodesk AI Research] MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs
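- Proposes MMLU-Pro+, a benchmark built on MMLU-Pro for evaluating shortcut learning and higher-order reasoning in LLMs
- Claims it differs from simple problem-solving setups because it is designed to require complex reasoning
- The goal is to minimize shortcut learning, where a model gets answers right by learning surface patterns without actually reasoning. Also presents a metric for measuring the degree of shortcut learning.
- GitHub link 🔗
🧑🏻‍💻 [SSI] Ilya Sutskever’s startup, Safe Superintelligence, raises $1 BILLION
- Safe Superintelligence, the startup founded by former OpenAI co-founder Ilya Sutskever, receives an investment of about 1 trillion won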
📜 [Tsinghua University] Attention Heads of Large Language Models: A Survey
- Focuses on the interpretability and underlying mechanisms of attention heads, with the aim of improving the internal reasoning process of LLMs
- Distills human thinking into a four-stage framework: 1) Knowledge Recalling, 2) In-Context Identification, 3) Latent Reasoning, 4) Expression Preparation
- GitHub link 🔗
📜 [HSE University] Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing
- Explores a self-guidance technique that preserves the overall structure of the input image and the local regions that should not change
- Introduces layout-preserving energy functions that store the local & global structure of the source image
- → fast & high-quality editing mechanism
- GitHub link 🔗
📜 [Tsinghua University] Pandora's Box or Aladdin's Lamp: A Comprehensive Analysis Revealing the Role of RAG Noise in Large Language Models
- Builds the Noise RAG Benchmark
- Defines seven types of noise from a linguistic perspective
- → divides them into beneficial noise vs harmful noise

2nd week

🧑🏻‍💻 [HuggingFace, IBM] Improving Hugging Face Training Efficiency Through Packing with Flash Attention
- A Hugging Face blog post on packing examples without padding when doing instruction tuning with Flash Attention 2
- Reportedly yields up to 2x higher throughput
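
A minimal sketch of the padding-free data layout, assuming the usual convention for variable-length attention kernels: one flat token list, position ids that restart for each example, and cumulative sequence boundaries. The function names are hypothetical, and this is not the Hugging Face implementation.

```python
def pack_sequences(sequences):
    """Concatenate token sequences without padding.

    Returns the flat token list, per-token position ids that restart at 0
    for every sequence, and cumulative sequence boundaries.
    """
    input_ids, position_ids, cu_seqlens = [], [], [0]
    for seq in sequences:
        input_ids.extend(seq)
        position_ids.extend(range(len(seq)))
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return input_ids, position_ids, cu_seqlens


def padded_token_count(sequences):
    """Number of token slots a padded batch would occupy, for comparison."""
    return len(sequences) * max(len(seq) for seq in sequences)


batch = [[11, 12, 13], [21], [31, 32]]  # toy token ids
ids, pos, bounds = pack_sequences(batch)
print(ids, pos, bounds)
print(len(ids), "packed tokens vs", padded_token_count(batch), "padded slots")
```

The boundaries tell the attention kernel where each example starts and ends, so tokens from different examples do not attend to each other even though they share one row.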
📜 [Google DeepMind] Building Math Agents with Multi-Turn Iterative Preference Learning
- Current direct preference learning algorithms focus on single-turn chat tasks; that is, they pay no attention to multi-turn settings or external tool integration
- → proposes a multi-turn direct preference learning framework: multi-turn DPO & KTO
📜 [University of Toronto, Vector Institute] Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
- The capabilities of LLMs are hard to assess with conventional quantitative benchmarks
- → proposes Report Cards, natural language summaries of a model's behavior on specific skills or topics
- Evaluates Report Cards on three criteria: specificity, faithfulness, interpretability
- Proposes an iterative algorithm that generates Report Cards without human supervision
🧑🏻‍💻 [Replit] Replit Agent
- Releases an AI agent feature that can build applications from natural-language prompts
- Appears similar to Cursor's Composer
- This is why so many companies focus on long context and code understanding & generation
🧑🏻‍💻 [Google] Illuminate
- Releases a tool that converts research papers into short podcasts
- An experimental feature that currently requires joining a waitlist
📜 [Beijing University] How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data
- What data really counts as high-quality code instruction data?
- Selects data by three criteria: instruction complexity, response quality, instruction diversity
- Trains Llama-3 on the selected data and releases the XCoder models
📜 [Mila, Princeton, Cambridge, Google DeepMind] Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving (paper from May)
- Metacognitive knowledge: intuitive knowledge about one's own thinking & reasoning processes
- → according to the results, LLMs appear to possess metacognitive knowledge
- Confirms that LLMs can assign sensible skill labels to math problems, and that the results are interpretable to humans.
📜 [Oxford] Detecting hallucinations in large language models using semantic entropy (Nature)
- LLMs need to work even on unseen questions whose answers humans do not know
- → introduces an entropy-based uncertainty estimator so that LLM hallucinations (specifically confabulations) can be detected
- Explains that the method can be applied without prior knowledge of the dataset or task
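
A minimal sketch of the idea: sample several answers, group them by meaning, and measure the entropy over the groups. Two simplifications are assumptions here: cluster probabilities are estimated from counts, and `toy_same_meaning` is a string-matching stand-in for a real semantic equivalence check such as an entailment model. All names are hypothetical.

```python
import math


def semantic_entropy(answers, same_meaning):
    """Cluster sampled answers by meaning, then compute the entropy of
    the cluster frequencies. Higher values suggest confabulation."""
    clusters = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    total = len(answers)
    return -sum((len(c) / total) * math.log(len(c) / total) for c in clusters)


def normalize(text):
    """Lowercase and trim spaces and periods from both ends."""
    return text.lower().strip(" .")


def toy_same_meaning(a, b):
    """Hypothetical equivalence check based on normalized string equality."""
    return normalize(a) == normalize(b)


consistent = ["Paris", "paris.", "Paris "]
scattered = ["Paris", "Lyon", "Nice", "Rome"]
print(semantic_entropy(consistent, toy_same_meaning))
print(semantic_entropy(scattered, toy_same_meaning))
```

Answers that differ only in surface form fall into one cluster and give zero entropy, while answers that disagree in meaning spread across clusters and give high entropy.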
📜 [Singapore University] Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models
- Evaluating long-context language models (LMs) with Needle-in-a-Haystack (NIAH) is inadequate
- → proposes Spinning the Golden Thread (SGT), which evaluates whether specific events can be identified within the generated long text sequences
- Instructs the LM to generate long-form text that includes specific events and constraints
🧑🏻‍💻 [Huawei] Huawei unveils $2,800 tri-fold phone just hours after iPhone 16 launch.
- Huawei launches the world's first tri-fold smartphone, starting at about 3.77 million won
📜 [University of Toronto] Seek and Solve Reasoning for Table Question Answering
- Seek-and-Solve pipeline: instructs the LLM to first seek the relevant information and then generate the answer
- Reasoning has two stages, and the CoT paths are integrated into a Seek-and-Solve CoT (SS-CoT)
📜 [Stanford University] Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
- Compares over 100 expert NLP researchers with an LLM ideation agent → blind review
- LLM-generated ideas were judged more novel than human-written ones (p<0.05), though slightly lower on feasibility.
- With Sakana's recently released AI Scientist as well.. it really does look like an era of AI-driven research is coming
📜 [Apple] Theory, Analysis, and Best Practices for Sigmoid Self-Attention
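- Argues that, compared with softmax attention, sigmoid attention is not only a universal function approximator but also benefits from improved regularity
- Introduces Flash-Sigmoid, which runs on top of FlashAttention2 on H100 → 17% faster inference
- Techniques like this are probably best adopted after seeing plenty of real-world usage reports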
📜 [UIUC, CMU] Paper Copilot: A Self-Evolving and Efficient LLM System for Personalized Academic Assistance
- Points out that existing DocQA is not personalized and cannot easily incorporate the latest information
- → proposes a self-evolving, efficient LLM system that assists researchers based on thought-retrieval
- Claims it saves 69.92% of the time
- Hugging Face Space link 🔗
🧑🏻‍💻 [Mistral] pixtral-12b-240910
- Combines the text-based Nemo 12B with a 400M vision adapter
- Handles images up to 1024 x 1024, reportedly split into 16 x 16 patches
- 131,072 unique tokens
- A model checkpoint that will not be updated is released on Hugging Face
- Hugging Face link 🔗
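
As a quick worked example of the patch arithmetic above: a 1024 x 1024 image cut into 16 x 16 pixel patches gives a 64 x 64 grid. The sketch below assumes sides that are not a multiple of the patch size are rounded up, and it says nothing about how many tokens each patch becomes; `patch_grid` is a hypothetical helper.

```python
def patch_grid(width, height, patch=16):
    """Number of patch columns, rows, and total patches for an image,
    rounding up when a side is not a multiple of the patch size."""
    cols = -(-width // patch)   # ceiling division
    rows = -(-height // patch)
    return cols, rows, cols * rows


cols, rows, total = patch_grid(1024, 1024)
print(cols, rows, total)
```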
🧑🏻‍💻 [SambaNova] SambaNova Launches The World's Fastest AI Platform
- Llama 3.1 405B runs at full precision at 132 tokens per second / 70B at 570 tokens
- Not open source; it appears to be a product from a company that sells fine-tuning and inference solutions
📜 [United We Care] LLMs Will Always Hallucinate, and We Need to Live With This
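- Demonstrates that hallucination arises inevitably from the mathematical and logical structure of LLMs
- → therefore argues that it is impossible to eliminate hallucination through architecture improvements, larger datasets, fact-checking, and the like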
📜 [KAIST] Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation
- Proposes InteractEval, a framework that integrates human expertise & LLMs, using the Think-Aloud (TA) method to generate checklist-based text evaluation
- Analysis shows that humans are better at internal-quality aspects such as Coherence & Fluency, while LLMs are better at external-alignment aspects such as Consistency & Relevance
- GitHub link 🔗
🧑🏻‍💻 [Intel, DeepLearning.AI] Multimodal RAG: Chat with Videos
- Intel produced a short course on Multimodal RAG
🧑🏻‍💻 [Google] DataGemma: Using real-world data to address AI hallucinations
- Releases DataGemma, which reduces hallucination by integrating real-world statistical data from Data Commons
- Uses RIG (Retrieval-Interleaved Generation) & RAG
📜 [Tsinghua] General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
- Releases the General OCR Theory (GOT) model, a 580M OCR-2.0-style model
- Covers various image styles such as scene, document, and whole-page, and can also handle "character"-level OCR tasks
- Also supports region-level recognition specified by coordinates or colors
🧑🏻‍💻 [FutureHouse] PaperQA2
- A package that runs RAG over PDF or text files to make papers easier to read
- Supports QA, summarization, contradiction detection, and more
- pip install paper-qa
- Paper link 🔗
🧑🏻‍💻 [OpenAI] Introducing OpenAI o1-preview
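- Launches 'OpenAI o1', a new AI model series that thinks longer and solves complex problems
- Strong performance in science, coding, and math (e.g., 83% accuracy on an IMO qualifying exam, 89th percentile on Codeforces)
- Offers two models, o1-preview and o1-mini, with access for ChatGPT Plus/Team users and some API developers
- Improved safety features (large improvement over GPT-4o on jailbreaking tests)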
- OpenAI o1 System Card 🔗
📜 [University of Mannheim] Fine-tuning Large Language Models for Entity Matching
- Before: entity matching was mostly handled with prompt engineering & in-context learning
- → LLM fine-tuning: 1) LLM-generated explanation datasets for training 2) training data selection using LLMs
- sLLM (Llama 3.1 8B) > LLM (GPT-4o Mini), in-domain > cross-domain, structured data is effective
📜 [Meta, Oxford, UCL] Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
- Proposes Source2Synth, a method for teaching LLMs new skills without human annotation
- Takes a custom data source as input → generates synthetic data that includes intermediate reasoning steps grounded in real-world sources
- Low-quality generations can be discarded based on answerability, which improves dataset quality
- Effective for multi-hop question answering (MHQA) and tool usage in tabular question answering (TQA)
📜 [Alibaba] mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
- Current MLLMs that support OCR-free Document Understanding generate too many visual tokens per document image, which leads to excessive GPU usage and slow inference
- → proposes High-resolution DocCompressor, a module that compresses a high-resolution document image into 324 tokens guided by low-resolution global visual features
- Three-stage training framework: 1) Single-image Pretraining 2) Multi-image Continue-pretraining 3) Multi-task Finetuning

3rd week

🧑🏻‍💻 [Stability.AI] Stable Diffusion 3 Medium Fine-tuning Tutorial
- Releases a fine-tuning tutorial for the SD3M model
- Explains how SD3M fine-tuning differs from the earlier SD1.5 and SDXL models
📜 [CMU, MIT] Agent Workflow Memory
- Current methods struggle with long-horizon tasks that have complex action trajectories
- Agent Workflow Memory (AWM): a method that induces frequently repeated routines and selectively provides workflows to the agent
- Applies to both offline & online scenarios; evaluated on the Mind2Web & WebArena benchmarks
- GitHub link 🔗
📜 [KAIST] Stable Language Model Pre-training by Reducing Embedding Variability
- Uses Token Embedding Variability (TEV) as a proxy for assessing model stability during pre-training
- Proposes Multi-head Low-Rank Attention (MLRA), which mitigates instability by limiting the exponential growth of output embedding variance
- Research labs still have no choice but to use GPT-2, Llama-2, and the like..
📜 [Peking, Microsoft] CPL: Critical Planning Step Learning Boosts LLM Generalization in Reasoning Tasks
- Current language models focus only on task-specific reasoning and neglect generalization capabilities
- → proposes Critical Planning Step Learning (CPL), which uses Monte Carlo Tree Search (MCTS) to explore diverse planning steps in multi-step reasoning tasks
- Step-APO (Step-level Advantage Preference Optimization): integrates step-level preference pairs obtained through MCTS into DPO
📜 [Wisconsin-Madison] Your Weak LLM is Secretly a Strong Teacher for Alignment
- Existing alignment frameworks require either human effort or high computational cost
- → uses a weak LLM to try to match or exceed what is achieved with human feedback alone
- The study uses the OPT-125M model → good results were obtained even with a very small model
📜 [Chinese Academy of Sciences] StruEdit: Structured Outputs Enable the Fast and Accurate Knowledge Editing for Large Language Models
- Injecting up-to-date information into a model is a very hard task that remains largely unsolved; the paper cites unstructured natural language outputs as one cause
- → proposes StruEdit: prompts the model to return structured outputs as reasoning triplets → removes outdated knowledge and efficiently fills in up-to-date information
🧑🏻‍💻 [Microsoft] Microsoft 365 Copilot Wave 2: Pages, Python in Excel, and agents
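- Within Copilot Pages, prompt-based search and the organized results can easily be shared with others
- Microsoft seems to have been competing with Google since last year to build this kind of integrated system, but its practical value is still unclear to me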
🧑🏻‍💻 [Waymo] Waymo’s Self-driving cars beat humans in safety
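- According to Waymo) AI-driven cars had a lower accident rate than human drivers. Waymo also shared on X that accidents were caused more often by external factors than by the AI system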
🧑🏻‍💻 [Google] NotebookLM now lets you listen to a conversation about your sources
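- A service that turns your sources into a conversation between two AI hosts discussing the topic
- This appears to be what Google Illuminate uses, and it relies on the multimodal capabilities of Gemini 1.5
- NotebookLM link 🔗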
📜 [Huawei] Large Language Models are Good Multi-lingual Learners : When LLMs Meet Cross-lingual Prompts
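- Proposes Multi-Lingual Prompt (MLPrompt) to help models better understand long & complex contexts
- Automatically translates error-prone rules that the LLM struggles to follow into another language
- Releases a framework that includes an auto-checking mechanism for structured data generation
  - This part needs to be checked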
🧑🏻‍💻 [Mistral AI] AI in abundance
- Offers a free tier for experimentation and prototyping
- Cuts the prices of Mistral AI models substantially: Nemo 50%, Small & Codestral 80%, Large 33%, …
- Releases the Pixtral 12B model, available in le Chat, under the Apache 2.0 license
🧑🏻‍💻 [Qwen] Qwen2.5: A Party of Foundation Models!
- Updates Qwen2 and releases Qwen2.5, -Coder, and -Math, in a very wide range of sizes.
- All models except 3B & 72B are under the Apache 2.0 license
- Trained on 18T tokens, with strengths in coding, mathematics, instruction following, long texts, and more → supports a 128K window size, generates up to 8K tokens, supports 29 languages
📜 [ETRI] A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B
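- Prior evaluations of quantized LLMs relied on metrics such as perplexity or on outdated datasets
- → covers various methods such as GPTQ, AWQ, SmoothQuant, and FP8, on 7B ~ 405B models, evaluated on 13 benchmarks
- (1) A larger LLM quantized down to the size of a smaller FP16 LLM generally does well, except on hallucination detection & instruction following
- (2) Results vary widely with the quantization method, model size, bit-width, and so on
- (3) Task difficulty does not have much impact on accuracy degradation
- (4) The MT-Bench evaluation method is not well suited to distinguishing among recent high-performing LLMs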
🧑🏻‍💻 [HuggingFace] Fine-tuning LLMs to 1.58bit: extreme quantization made easy
- Explains an implementation of BitNet, proposed by Microsoft Research
- A Hugging Face blog post on how to train and run inference at 1.58 bits
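
The "1.58 bit" figure comes from storing each weight as one of three values, since log2(3) ≈ 1.58. A minimal sketch of absmean ternary quantization is shown below, assuming a single scale for the whole weight list; `absmean_ternary` and `dequantize` are hypothetical names, and this is not the Hugging Face or BitNet code.

```python
def absmean_ternary(weights, eps=1e-8):
    """Scale by the mean absolute value, then round and clip to {-1, 0, 1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate the original weights from ternary values and the scale."""
    return [q * scale for q in quantized]


weights = [0.9, -0.05, 0.4, -1.2]  # toy values
q, scale = absmean_ternary(weights)
print(q, round(scale, 4))
print(dequantize(q, scale))
```

Small weights collapse to 0 and large ones saturate at -1 or 1, so the matrix multiply reduces to additions and subtractions plus one multiplication by the scale.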
🗞️ [Snap] Introducing New Spectacles and Snap OS: The Next Frontier of AR Glasses
- Snap unveils the fifth-generation Spectacles, AR glasses that run on Snap OS
- Drew attention by announcing a partnership with OpenAI
📜 [ETH] Breaking reCAPTCHAv2
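- A study on solving Google's reCAPTCHAv2 system with machine learning
- Using a YOLO model, it passed 100% of the time, and concludes that the number of challenges needed to pass is no different from that for humans
- GitHub link 🔗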
📜 [Texas at Austin, Johns Hopkins, Princeton] To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
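- A meta-analysis covering 100 papers, plus an evaluation of 14 models on 20 datasets
- → CoT helps on logical tasks such as math and logic but has little effect elsewhere
- On MMLU, using CoT or not makes little difference except on tasks where the question or the model's answer contains an ‘=’ sign
- Conclusion: CoT is best used selectively, depending on the situation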
📜 [Texas at San Antonio] Improving LLM Reasoning with Multi-Agent Tree-of-Thought Validator Agent
- Existing multi-agent reasoning explores reasoning paths only shallowly, and ToT can still let a flawed path lead to the final conclusion
- Presents ToT-based Reasoner agents paired with a Thought Validator agent
📜 [Qwen] Qwen2.5-Coder Technical Report
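- Technical report for Qwen2.5-Coder-1.5B and 7B, the successors to CodeQwen1.5
- Data cleaning, synthetic data generation, data mixing, and more. Trained on 5.5T tokens. Reports performance exceeding that of larger models.
- Hugging Face, GitHub link 🔗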
🧑🏻‍💻 [GitHub] Try out OpenAI o1 in GitHub Copilot and Models
- OpenAI's o1-preview & o1-mini are available in GitHub Copilot. Requires joining the waitlist.
- Can switch among o1-preview, o1-mini, and GPT-4o in the middle of a Copilot Chat conversation
🧑🏻‍💻 Open-source FinePersonas datasets dropped in Huggingface with 21 million rows and 142GB size
- 21M persona records, showing how the description of each persona should be labeled.
- The prompts used are released as well
📜 [Microsoft] Re-Reading Improves Reasoning in Large Language Models
- Proposes RE2, a method that re-reads the question as input
- The idea is that processing the question twice improves the model's understanding
- Enables "bidirectional" encoding in a unidirectional decoder-only LLM, making use of global information
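
A minimal sketch of a re-reading prompt. The exact template wording is an assumption for illustration, and `re2_prompt` is a hypothetical helper; the only point is that the question appears twice before the answer begins.

```python
def re2_prompt(question, cot=True):
    """Build a prompt that states the question twice before answering."""
    lines = [
        f"Q: {question}",
        f"Read the question again: {question}",
    ]
    lines.append("A: Let's think step by step." if cot else "A:")
    return "\n".join(lines)


question = "A shop sells pens at 3 for $2. How much do 12 pens cost?"
prompt = re2_prompt(question)
print(prompt)
```

In a decoder-only model, tokens in the second copy of the question can attend to the whole first copy, which is the sense in which the repeated pass gives a "bidirectional" view.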
📜 [Huawei, McGill, Mila] Enhancing Logical Reasoning in Large Language Models through Graph-based Synthetic Data
- Attempts to improve LLM reasoning by using graph-based synthetic reasoning data as a training signal
- Claims reasoning improved without harming other existing capabilities
- GitHub link 🔗
📜 [Google DeepMind] Training Language Models to Self-Correct via Reinforcement Learning
- Develops SCoRe, a multi-turn online reinforcement learning (RL) approach
- Improves LLM self-correction using entirely self-generated data
- Argues that offline model-generated correction traces (e.g., SFT) are insufficient for instilling self-correction behavior

4th week

📜 [HKUST, Amazon] Constrained Reasoning Chains for Enhancing Theory-of-Mind in Large Language Models
- Theory-of-Mind (ToM) methods mostly rely on zero-shot prompting and therefore perform poorly on complex reasoning tasks
- Proposes Constrained Chain-of-ToM (CCoToM), a zero-shot prompting method
- Induces an inductive bias by adaptively imposing constraints on prompts
📜 [Tsinghua, Berkeley, Anthropic, NYU] Language Models Learn to Mislead Humans via RLHF
- Argues that RLHF makes errors produced by the LM harder for humans to notice → “U-Sophistry” (Unintended)
- Humans directly evaluate the model outputs → RLHF also makes the model's performance harder to evaluate.
📜 [Tsinghua, Shanghai AI Lab] On the Diagram of Thought
- Proposes Diagram of Thought (DoT), which models iterative reasoning in an LLM as a Directed Acyclic Graph (DAG)
- Includes propositions, critiques, refinements, and verifications in the DAG structure → lets the model explore complex reasoning pathways while maintaining logical consistency
📜 [Arizona State University] LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
- Despite the rapid progress of LLMs, conquering PlanBench has not been easy
- Argues that Large Reasoning Models (LRMs) such as o1 clearly show notable gains but still lack sufficient planning ability
📜 [NYU, Columbia] Style over Substance: Failure Modes of LLM Judges in Alignment Benchmarking
- Can LLM-judge preferences be translated into concrete metrics? → develops SOS-BENCH: a standardized, reproducible LLM meta-benchmark
- Argues that LLM judgments do not correlate with safety, world knowledge, or instruction following. Instead, judges are observed to give higher priority to style.
- Code and results link 🔗
📜 [NVIDIA] Advancing the Accuracy-Efficiency Frontier with Llama-3.1-Nemotron-51B
- Releases a 51B model that is 220% faster than Llama-3.1-70B and can handle 400% more workload
- 40B tokens from FineWeb, Buzz-V1.2, and Dolma datasets
- Packaged as NVIDIA NIM inference microservice for easy deployment
- Hugging Face link 🔗
📜 [Google DeepMind] Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries
- a minimal, synthetic, and unleaked long-context reasoning evaluation for LLM
- A unified evaluation framework for long-context evaluation that goes beyond simply retrieving information from the context
- Three diagnostic long-context evaluations across code and natural-language domains
🗞️ SocialAI: we tried the Twitter clone where no other humans are allowed
- A private Twitter-like service where everyone other than you is an AI bot.
🧑🏻‍💻 [OpenAI] Advanced Voice
- Rolls out Advanced Voice to Plus & Team users first this week
- Features include Custom Instructions, Memory, five new voices, and improved accents
🧑🏻‍💻 [Google] Updated production-ready Gemini models, reduced 1.5 Pro pricing, increased rate limits, and more
- Releases Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002
- 1.5 Pro price cut by 50%, 2x higher rate limits, 2x faster output
- The cost of using large models is clearly falling fast
