[week22] NeurIPS 2025 (Stanford AI Lab)


mapsycoy 2025. 12. 7. 18:08

In this post, I want to go over the papers accepted to the Neural Information Processing Systems (NeurIPS) conference that the Stanford AI Lab shared on X on the 2nd.

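To be fair, plenty of other places, Johns Hopkins included, have put out their own paper lists..
Why Stanford in particular? Because both Sam Altman and Elon Musk have ties to the school.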

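NeurIPS was founded in 1987 and is said to be the most prestigious machine learning and AI conference in the world today.
The most influential papers from its history can be checked directly at the 👇 address below.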

"Best Paper" Digest

www.paperdigest.org

The Stanford AI Lab announced a total of 23 papers accepted to NeurIPS 2025.

In particular, the focus is said to be on improving and benchmarking LLM-based agents, and on advances in generative models.

Stanford AI Lab Papers and Talks at NeurIPS 2025

The official Stanford AI Lab blog

ai.stanford.edu

Let's first look at how the papers are split across topics.

Topic	Papers
LLM & Agents🧠	10
Generative Models👁️	6
Robotics🦾	3
ML & Benchmarking📊	4
Total	23

Now I'll go through the papers one by one, in the order they appear at the link above.


01. MCP Explorer: Interactive Learning Experience

MCP Explorer - NeurIPS 2025 Education Materials

neurips-mcp-presentation.vercel.app

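Category: LLM & Agents🧠(1)

MCP is a concept that is hard to grasp fully from a plain description alone.

This project is said to be a beginner-friendly interactive demo, built for a workshop, that makes the concept easier to understand.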

02. Preference Learning with Response Time

Preference Learning from Human Response Time, by Ayush Sawarni

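Category: ML & Benchmarking📊(1)

Building on cognitive psychology research showing that shorter response times indicate stronger preferences, the video presents a new approach that uses the time a person takes to respond to options offered by an AI to dramatically improve the accuracy of preference learning.

It goes on about reward functions under binary choices and so on.. and honestly I have no idea what any of it means.🤣

More details on the [EZ diffusion model] shown in the video thumbnail can be found in the paper at the 👉 link.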
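Since the math lost me, here is a toy sketch of just the core intuition, not the paper's method. It uses the textbook drift-diffusion model with unit noise, where evidence drifts toward one of two barriers. Treating the drift as "preference strength" and the function names `choice_probability` and `mean_decision_time` are my own assumptions for illustration.

```python
import math


def choice_probability(drift: float, barrier: float) -> float:
    """P(option A is chosen) in a symmetric drift-diffusion model with unit noise."""
    return 1.0 / (1.0 + math.exp(-2.0 * barrier * drift))


def mean_decision_time(drift: float, barrier: float) -> float:
    """Expected time until the evidence hits +barrier or -barrier (unit noise)."""
    if abs(drift) < 1e-12:
        return barrier ** 2  # limit of (a / v) * tanh(a * v) as v -> 0
    return (barrier / drift) * math.tanh(barrier * drift)


barrier = 1.0  # hypothetical decision threshold
for drift in (0.1, 0.5, 2.0):  # hypothetical preference strengths
    p = choice_probability(drift, barrier)
    t = mean_decision_time(drift, barrier)
    print(f"preference strength {drift}: P(A)={p:.3f}, mean time={t:.3f}")
```

In this toy model a stronger preference makes the choice more lopsided and the expected decision time shorter, which is why response time can carry information beyond the binary choice itself.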

03. Procurement Auctions with Predictions: Improved Frugality for Facility Location

NeurIPS Poster Procurement Auctions with Predictions: Improved Frugality for Facility Location

We study the problem of designing procurement auctions for the strategic uncapacitated facility location problem: a company needs to procure a set of facility locations in order to serve its customers and each facility location is owned by a strategic agen

neurips.cc

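Category: ML & Benchmarking📊(2)

Running AI inference in practice is expensive, and since even a single model is spread across independently owned GPU clusters, finding out the true price is said to be close to impossible.

So to save money, the suppliers have to be pushed to quote as close to their real costs as possible, and to achieve that, a VCG-style mechanism pays out an extra reward on top, equal to the amount saved.

The problem is that in some cases, if all suppliers collude to inflate the going price, that reward can come out extremely high relative to the original amount.

This study argues that AI predictions can ease that problem.

The idea is that past data makes it possible to tell whether the price being quoted now is close to what was paid before.

With that evidence on hand, most suppliers are less likely to get away with lying, so the reward naturally shrinks and costs go down.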
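To make the "reward on top" idea concrete, here is a minimal sketch of a second-price (Vickrey) procurement auction with a single winning supplier, which is a much simpler setting than the paper's facility location problem. The prediction-based price cap (`reserve`) is my own simplification to show how a good prediction can cut overpayment. It is not the mechanism from the paper, and the supplier names and numbers are made up.

```python
from typing import Dict, Optional, Tuple


def reverse_vickrey(
    bids: Dict[str, float], reserve: Optional[float] = None
) -> Tuple[Optional[str], float]:
    """Lowest bidder wins and is paid the second-lowest bid, capped by `reserve`."""
    ranked = sorted(bids.items(), key=lambda item: item[1])
    winner, lowest_bid = ranked[0]
    payment = ranked[1][1]  # the winner's own bid never sets the price
    if reserve is not None:
        if lowest_bid > reserve:
            return None, 0.0  # nobody is willing to sell at or below the cap
        payment = min(payment, reserve)
    return winner, payment


# Hypothetical true costs; in this auction format, bidding them truthfully is optimal.
true_costs = {"A": 10.0, "B": 40.0, "C": 45.0}

winner, paid = reverse_vickrey(true_costs)
print(winner, paid, paid / true_costs[winner])

# A cap taken from a (hypothetical) prediction of what the job should cost.
winner_capped, paid_capped = reverse_vickrey(true_costs, reserve=15.0)
print(winner_capped, paid_capped, paid_capped / true_costs[winner_capped])
```

Without the cap, the buyer pays 40 for something that costs the winner 10. With a cap of 15, the same supplier still wins but the overpayment drops from 4x to 1.5x of the true cost.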

04. Retrospective In-Context Learning for Temporal Credit Assignment with Large Language Models

Retrospective In-Context Learning for Temporal Credit Assignment...

Learning from self-sampled data and sparse environmental feedback remains a fundamental challenge in training self-evolving agents. Temporal credit assignment mitigates this issue by transforming...

openreview.net

Category: LLM & Agents🧠(2)

This is said to be research on a method where an LLM looks back on its own past actions, generates feedback, and learns from it to get smarter.

05. A Practical Guide for Incorporating Symmetry in Diffusion Policy

A Practical Guide for Incorporating Symmetry in Diffusion Policy

Recently, equivariant neural networks for policy learning have shown promising improvements in sample efficiency and generalization, however, their wide adoption faces substantial barriers due to implementation complexity. Equivariant architectures typical

sym-in-dp.github.io

Category: Robotics🦾(1)

This is said to be research on a method that lets a robot recognize and learn the same action from different angles and positions.

06. Agentic Bridge Framework: Closing the Gap Between Agentic Capability and Performance Benchmarks

GitHub - stanford-ppl/Agentic_Benchmarking_For_GAIA

Contribute to stanford-ppl/Agentic_Benchmarking_For_GAIA development by creating an account on GitHub.

github.com

Category: LLM & Agents🧠(3)

This study proposes a framework that analyzes the detailed logs produced while evaluating AI agent systems, finds the system's bottlenecks, and helps optimize performance.

07. Agentic Plan Caching: Test-Time Memory for Fast and Cost-Efficient LLM Agents

Agentic Plan Caching: Test-Time Memory for Fast and Cost-Efficient...

LLM-based agent applications have shown increasingly remarkable capabilities in complex workflows but incur substantial costs and latency due to extensive planning and reasoning requirements....

openreview.net

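Category: LLM & Agents🧠(4)

This is said to be research on caching and reusing plans at test time, as a kind of test-time memory, to make LLM agents faster and more cost-efficient.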
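The general idea of plan caching can be sketched as below. This is only a minimal illustration under my own assumptions: the class `PlanCache`, the exact-match key on a task type, and the stand-in `expensive_planner` function are all hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List


class PlanCache:
    """Remembers the plan produced for a task type so the planner runs only once."""

    def __init__(self, planner: Callable[[str], List[str]]) -> None:
        self._planner = planner
        self._plans: Dict[str, List[str]] = {}
        self.planner_calls = 0

    def get_plan(self, task_type: str) -> List[str]:
        if task_type not in self._plans:
            # Cache miss: pay for one planning call and store the result.
            self.planner_calls += 1
            self._plans[task_type] = self._planner(task_type)
        return self._plans[task_type]


def expensive_planner(task_type: str) -> List[str]:
    # Stand-in for a slow, costly LLM planning call.
    return [f"look up data for {task_type}", f"summarize {task_type}"]


cache = PlanCache(expensive_planner)
first = cache.get_plan("quarterly report")
second = cache.get_plan("quarterly report")  # served from the cache
print(cache.planner_calls)
```

The second request reuses the stored plan, so the costly planner runs only once for the two requests.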

08. CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis

CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis

Inductive program synthesis, or programming by example, requires synthesizing functions from input-output examples that generalize to unseen inputs. While large language model agents have shown promise in programming tasks guided by natural language, their

arxiv.org

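Category: LLM & Agents🧠(5)

The story here is that when an inductive synthesis benchmark called CodeARC was used to actually test whether AI can write a perfect Python function from examples alone, it turned out there is still a long way to go.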
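To show why "examples only" is hard, here is a toy illustration of programming by example. It is not the benchmark's actual harness: the hidden function, the candidate guess, and the helper names are all made up for this sketch.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def hidden_target(x: int) -> int:
    """The function the agent is supposed to rediscover (made up for this demo)."""
    return x * x if x >= 0 else -x


# The agent only ever sees these input-output pairs.
examples: List[Tuple[int, int]] = [(x, hidden_target(x)) for x in (0, 1, 2, 3)]


def candidate(x: int) -> int:
    """A plausible guess after seeing only the examples above."""
    return x * x


def fits_examples(fn: Callable[[int], int], pairs: List[Tuple[int, int]]) -> bool:
    return all(fn(x) == y for x, y in pairs)


def find_counterexample(
    fn: Callable[[int], int], target: Callable[[int], int], probes: Iterable[int]
) -> Optional[int]:
    """Return the first probe input where the guess and the target disagree."""
    for x in probes:
        if fn(x) != target(x):
            return x
    return None


print(fits_examples(candidate, examples))
print(find_counterexample(candidate, hidden_target, range(-3, 4)))
```

The guess matches every example it was shown, yet it already fails on an unseen negative input. That gap between fitting the examples and generalizing is what makes this kind of task difficult.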

09. Discovering Latent Graphs with GFlowNets for Diverse Conditional Image Generation

Discovering Latent Graphs with GFlowNets for Diverse Conditional Image Generation

Capturing diversity is crucial in conditional and prompt-based image generation, particularly when conditions contain uncertainty that can lead to multiple plausible outputs. To generate diverse images reflecting this diversity, traditional methods often m

arxiv.org

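Category: Generative Models👁️(1)

A condition (prompt) can contain several hidden possibilities.
This study covers a method that uses a framework called Rainbow to discover diverse paths and generate an image with a different feel along each path.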

10. DynaGuide: Steering Diffusion Polices with Active Dynamic Guidance

DynaGuide: Steering Diffusion Polices with Active Dynamic Guidance

dynaguide.github.io

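Category: Robotics🦾(2)

Robot AI is said to have recently become good at learning complex behaviors through diffusion models.

But it is very hard to impose entirely new conditions on an already trained AI or to modify its behavior.
So this study proposes adding an external guide, a dynamics model, to the diffusion denoising process.
With this approach the guidance happens in real time, so the AI can reportedly adjust its behavior to new situations flexibly, without retraining.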

11. Exploring Diffusion Transformer Designs via Grafting

Exploring Diffusion Transformer Designs via Grafting

Designing model architectures requires decisions such as selecting operators (e.g., attention, convolution) and configurations (e.g., depth, width). However, evaluating the impact of these decisions on model quality requires costly pretraining, limiting ar

grafting.stanford.edu

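Category: Generative Models👁️(2)

This paper proposes 'Grafting' as a way to edit a pretrained diffusion transformer (DiT) to realize new architectures.

[Grafting] (접목 in Korean) originally means cutting a part from one plant and attaching it to another.

It is described as a technique for creating and exploring new models with little compute by 'grafting' new architectural components or operators onto a model.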

12. Fantastic Bugs and Where to Find Them in AI Benchmarks

Fantastic Bugs and Where to Find Them in AI Benchmarks

Benchmarks are pivotal in driving AI progress, and invalid benchmark questions frequently undermine their reliability. Manually identifying and correcting errors among thousands of benchmark questions is not only infeasible but also a critical bottleneck f

arxiv.org

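Category: ML & Benchmarking📊(3)

Finding and fixing errors by hand across thousands of benchmark questions is practically impossible, and it is also a serious bottleneck that undermines the reliability of evaluation.

This study proposes a systematic benchmark revision framework that uses statistical analysis of response patterns to pick out questions likely to be flawed, and then hands them over to expert review.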
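As a rough feel for what "statistical analysis of response patterns" can look like, here is a minimal sketch using one simple statistic: the correlation between getting a question right and the score on all the other questions. This choice of statistic, the threshold, and the response matrix are my own assumptions for illustration, and the paper's actual statistics may differ.

```python
from typing import List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either side has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def flag_suspicious(responses: List[List[int]], threshold: float = 0.0) -> List[int]:
    """Flag questions that stronger models tend to get wrong more often."""
    flagged = []
    for q in range(len(responses[0])):
        item = [float(row[q]) for row in responses]
        rest = [float(sum(row) - row[q]) for row in responses]  # score without q
        if pearson(item, rest) < threshold:
            flagged.append(q)
    return flagged


# Made-up data: rows are models (strongest first), columns are questions, 1 = correct.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
print(flag_suspicious(responses))
```

Only the last question is flagged, because it is the one that weak models get right and strong models get wrong. A flag like this is a hint for a human expert to look at the question, not a verdict.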

13. From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries

From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries

How can we learn to generate realistic scenes from limited data? Our key insight is that, despite inherent noisiness, indoor scenes retain significant underlying structure based on how rooms were intentionally designed, following social norms and preferenc

stanford.edu

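Category: Generative Models👁️(3)

This study proposes FactoredScenes, a framework that synthesizes realistic 3D scenes by learning how object placement varies in spaces people have actually lived in, while also exploiting the basic structure of the room.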

14. HouseLayout3D Benchmark and Training-Free Baseline for 3D Layout Estimation in the Wild

HouseLayout3D Dataset

Current 3D layout estimation models are primarily trained on synthetic datasets containing simple single room or single floor environments. As a consequence, they cannot natively handle large multi floor buildings and require scenes to be split into indivi

houselayout3d.github.io

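Category: Generative Models👁️(4)

Current 3D layout estimation models are mostly trained on simple synthetic datasets made up of single rooms or single floors.

As a result, they cannot handle large multi-floor buildings as they are, and the scene has to be cut up floor by floor and analyzed separately. The limitation of that approach is that it loses the context needed to understand structures that connect floors, such as staircases.
So this study proposes HouseLayout3D, a real-world benchmark that supports research on layout estimation at the scale of an entire building, including multiple floors and complex architectural structures.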

15. In-Context Learning Strategies Emerge Rationally

In-Context Learning Strategies Emerge Rationally

Recent work analyzing in-context learning (ICL) has identified a broad set of strategies that describe model behavior in different experimental conditions. We aim to unify these findings by asking why a model learns these disparate strategies in the first

arxiv.org

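Category: LLM & Agents🧠(6)

When an AI is given a task where the problems are predictable and all look alike, memorizing is the better 'value for money', so it simply memorizes. When the problems get more diverse and complex, the AI is said to finally switch over to generalization, that is, a rule-learning strategy.
This study reportedly gives a mathematical account of why that happens.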

16. Joint Design of Protein Surface and Structure Using a Diffusion Bridge Model

GitHub - guanlueli/Pepbridge

Contribute to guanlueli/Pepbridge development by creating an account on GitHub.

github.com

Category: Generative Models👁️(5)

PepBridge is said to be a new framework for designing protein surface and core structure at the same time.

17. LLM-Guided Autoscheduling for Large-Scale Sparse Machine Learning

LLM-Guided Autoscheduling for Large-Scale Sparse Machine Learning

Optimizing sparse machine learning (ML) workloads requires navigating a vast schedule space. Two of the most critical aspects of that design space include which operators to fuse and which...

openreview.net

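Category: ML & Benchmarking📊(4)

This is said to be research showing that bringing an LLM into complex machine learning system optimization makes efficient scheduling possible with 'minimal human intervention'.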

18. Latent Policy Barrier: Learning Robust Visuomotor Policies by Staying In-Distribution

Latent Policy Barrier

Visuomotor policies trained via behavior cloning are vulnerable to covariate shift, where small deviations from expert trajectories can compound into failure. Common strategies to mitigate this issue involve expanding the training distribution through huma

project-latentpolicybarrier.github.io

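Category: Robotics🦾(3)

This study is about imitation learning that keeps a robot robust even in unfamiliar situations (OOD, out-of-distribution).

It is described as a technique that lets the machine pick up and follow, on its own, the safety rules and know-how (tacit knowledge) that an expert never put into words, through the distribution of the demonstration data (latent space).

[week10] On AI tacit knowledge learning

Tacit knowledge learning in AI means that an AI acquires, through data or training, knowledge that is hard to express explicitly, such as human intuition, experience, or 'know-how'. Reinforcement learning (r

mapsycoy.tistory.com

It is closely related to the [tacit knowledge learning] I covered back in week 10.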

19. On the Entropy Calibration of Language Models

NeurIPS Poster On the Entropy Calibration of Language Models

We study the problem of entropy calibration, which asks whether a language model's entropy over generations matches its log loss on human text. Past work found that models are miscalibrated, with entropy per step increasing (and text quality decreasing) as

neurips.cc

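Category: LLM & Agents🧠(7)

This is said to be research on entropy calibration, that is, whether a language model's entropy over its own generations matches its log loss on human text, with the aim of making the model's confidence more trustworthy and reducing errors.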
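The two quantities being compared can be shown with a single-step toy example. The paper is concerned with whole generations, so this is only a sketch of the definitions, and the distributions and tokens are made up.

```python
import math
from typing import Dict, List


def entropy(dist: Dict[str, float]) -> float:
    """Entropy (in nats) of the model's next-token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def log_loss(dist: Dict[str, float], human_tokens: List[str]) -> float:
    """Average negative log-probability the model assigns to human-written tokens."""
    return -sum(math.log(dist[token]) for token in human_tokens) / len(human_tokens)


# Hypothetical tokens that humans wrote at this position.
human_tokens = ["the", "the", "a", "cat"]

matched = {"the": 0.5, "a": 0.25, "cat": 0.25}        # mirrors the human frequencies
overconfident = {"the": 0.9, "a": 0.05, "cat": 0.05}  # too sure about "the"

for name, dist in (("matched", matched), ("overconfident", overconfident)):
    print(f"{name}: entropy={entropy(dist):.3f}, log loss={log_loss(dist, human_tokens):.3f}")
```

For the matched distribution the two numbers agree, while the overconfident one has an entropy well below its log loss. That kind of gap between the two quantities is what miscalibration means here.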

20. SATBench: Benchmarking LLMs’ Logical Reasoning via Automated Puzzle Generation from SAT Formulas

GitHub - Anjiang-Wei/SATBench

Contribute to Anjiang-Wei/SATBench development by creating an account on GitHub.

github.com

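Category: LLM & Agents🧠(8)

SATBench is said to be a benchmark for evaluating the logical reasoning ability of large language models (LLMs) through logic puzzles derived from the satisfiability problem (SAT)^[1].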
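For anyone who, like me, had to look up what SAT is, here is a minimal brute-force sketch. It simply tries every true/false combination, which only works for tiny formulas, and it has nothing to do with how SATBench generates its puzzles. The clause encoding (positive number for a variable, negative for its negation) and the function name `solve_sat` are my own choices for illustration.

```python
from itertools import product
from typing import Dict, List, Optional

Clause = List[int]  # 3 means x3 must be true, -3 means x3 must be false


def solve_sat(clauses: List[Clause], num_vars: int) -> Optional[Dict[int, bool]]:
    """Return an assignment satisfying every clause, or None if none exists."""
    for values in product([False, True], repeat=num_vars):
        assignment = {index + 1: value for index, value in enumerate(values)}
        # A clause holds if at least one of its literals matches the assignment.
        if all(
            any(assignment[abs(lit)] == (lit > 0) for lit in clause)
            for clause in clauses
        ):
            return assignment
    return None


# (x1 OR x2) AND (NOT x1 OR x2) AND (NOT x2 OR x3)
satisfiable = [[1, 2], [-1, 2], [-2, 3]]
print(solve_sat(satisfiable, num_vars=3))

# x1 AND NOT x1 can never be true.
unsatisfiable = [[1], [-1]]
print(solve_sat(unsatisfiable, num_vars=1))
```

The first formula has a solution and the second has none. The number of combinations doubles with every extra variable, so brute force stops being practical very quickly.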

21. SWE-smith: Scaling Data for Software Engineering Agents

SWE-smith

Citation @misc{yang2025swesmith, title={SWE-smith: Scaling Data for Software Engineering Agents}, author={John Yang and Kilian Lieret and Carlos E. Jimenez and Alexander Wettig and Kabir Khandpur and Yanzhe Zhang and Binyuan Hui and Ofir Press and Ludwig S

swesmith.com

Category: LLM & Agents🧠(9)

It is said to be able to automatically generate hundreds to thousands of task instances^[2] from any GitHub repository.

22. SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions

SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions

Humans continuously infer the states, goals, and behaviors of others by perceiving their surroundings in dynamic, real-world social interactions. However, most Theory of Mind (ToM) benchmarks only evaluate static, text-based scenarios, which have a signifi

arxiv.org

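Category: LLM & Agents🧠(10)

Humans perceive their surroundings in dynamic, real-world social interactions and continuously infer the states, goals, and behaviors of others.

However, most existing ToM benchmarks evaluate only [static, text-based scenarios], which leaves a large gap from real interaction.
SoMi-ToM is said to be built on rich multimodal interaction data generated in the SoMi interaction environment, covering diverse crafting goals and social relationships.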

23. VIPScene: Video Perception Models for 3D Scene Synthesis

Video Perception Models for 3D Scene Synthesis

3D scene synthesis traditionally demands expert knowledge and manual effort. Automating this process could greatly benefit fields such as architectural design, robotics simulation, VR, and gaming. Recent approaches to 3D scene synthesis often rely on the c

vipscene.github.io

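Category: Generative Models👁️(6)

The framework proposed in this study, VIPScene, is said to use the commonsense knowledge of the 3D physical world built into video generation models to ensure scene layouts and object placements that stay consistent across viewpoints.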

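Characteristics of this year's selected papers

🧠 : securing cognitive autonomy
👁️ : building models that can understand the physical real world
🦾 : control that stays safe and robust under all kinds of variables
📊 : securing reliability, minimizing human intervention, preference learning

What is the ultimate destination of all this?

🧠+ 👁️+ 🦾+ 📊 = 🤖

None other than our future friend, [Physical AI].

To wrap up, I brought the "Wow, cool robot!" meme.

The meme satirizes how the creator of [Mobile Suit Gundam] tried to convey the dangers of war and technology in the work, yet the people who actually watched it consumed only the visually cool robots and cyberpunk elements.

The LLMs and generative models that underlie physical AI have been calculatedly turned into 'entertainment' for the general public.

That is because rapidly growing consumption not only brings in more data, but the revenue earned that way can also cover development costs.

And behind the scenes, elite engineers keep gathering to carry on research like this.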

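Thumbnail source: https://www.cs.jhu.edu/news/johns-hopkins-researchers-to-present-work-at-neurips-2025/

[1] Given a logical formula over some variables, the problem of determining whether there is an assignment of values to the variables that makes the formula true.
[2] An instance generally refers to any running process, or to a currently created object of a class.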
