Yang Chen

Server-Side R&D Lead of Coze Compass at ByteDance

Yang Chen is the Server-Side Technical Lead of Coze Compass and a Technical Expert in ByteDance's AI Platform department. He built the AI AgentOps platform from the ground up, supporting AI applications across multiple business lines, including ByteDance Flow, TikTok, e-commerce, and Dongchedi. He closely tracks trends and innovations in AI application development platforms and has deep insight into AI application deployment and performance optimization. He leads and actively contributes to the open-source Coze and Coze Compass projects, which drew strong community attention with more than 11k GitHub stars for `coze-studio` and 4k for `coze-loop` in their first week of launch, and he is deeply involved in building and fostering their open-source community.

Topic

Coze Loop: Practical Evaluation and Iterative Optimization of Agent Performance

In 2025, Agents have moved from proof of concept to production deployment, with enterprises shifting from traditional chatbots to complex intelligent agents capable of multi-turn, multi-modal, and cross-tool interactions. Compared with traditional software testing, evaluating Agent performance raises new challenges: ambiguous metric definitions, high result uncertainty, and fluctuating online behavior. Drawing on ByteDance's experience deploying Agents across multiple business lines, this talk systematically walks through the full chain of practices, from constructing evaluation datasets and designing metric systems to continuous integration and online monitoring. It explores how to build a reusable performance evaluation system under uncertain AI behavior that supports rapid iteration and stable online operation.

Outline:

1. Introduction
   * Background
     * Current state of Agent application development
     * AgentOps: a new paradigm for Agent performance evaluation
   * Challenges
     * Continuous integration: how to quickly reach production quality given the uncertainty of large models, in contrast to deterministic software quality metrics
     * Online monitoring: how to continuously track and optimize online performance
     * Designing robust, scientific metrics that comprehensively evaluate increasingly complex Agents
2. Core Process of Agent Performance Evaluation
   * Evaluation process
     * Testing phase: offline quality evaluation and continuous integration
     * Online phase: continuous online evaluation, monitoring, and iterative optimization
3. Coze Compass Evaluation Practices
   * Building continuously iterated evaluation datasets
     * Methods for constructing multi-modal, multi-turn dialogue evaluation sets
   * Selecting appropriate evaluation metrics for different business scenarios
     * Designing and applying metrics for Agents, multi-modality, multi-turn dialogue, and consistency
     * Evaluation practices using LLM-as-Judge, code-based evaluators, and other methods (a minimal sketch follows this outline)
   * Identifying bad cases from flexible, intelligent evaluation results
     * Methods for single-experiment analysis and multi-experiment comparison
     * Insights: how Agents can intelligently detect issues and provide recommendations
   * Continuous online monitoring, optimization, and iteration
     * Using online evaluation to detect and address performance issues
4. Case Studies
   * ByteDance internal: live-streaming business
     * Short-video compliance review, transitioning from human to AI review; managing evaluation datasets and multi-modal, multi-turn evaluation methods
   * Commercial: Agent evaluation solutions
     * Full-code Agent applications, trace-based online evaluation, and evaluation dataset management
5. Future Planning and Prospects
   * Evaluating the performance of complex Agents and multi-agent systems
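To make the LLM-as-Judge practice referenced in the outline concrete, the sketch below shows one common shape of a judge-based evaluator in Python. It is a minimal illustration, not Coze Loop's actual SDK or API: `call_llm` is a hypothetical placeholder for whatever chat-completion client you use, and the prompt, scoring scale, and aggregation are assumptions you would adapt to each business scenario.

```python
import json
import statistics

# Hypothetical prompt template for a 1-5 quality score; the criteria would be tuned per scenario.
JUDGE_PROMPT = """You are an impartial evaluator. Given a question, a reference answer,
and a candidate answer, rate the candidate from 1 (poor) to 5 (excellent) for correctness
and helpfulness. Reply with JSON only: {{"score": <int>, "reason": "<short explanation>"}}

Question: {question}
Reference: {reference}
Candidate: {candidate}
"""


def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call (any OpenAI-compatible client would do)."""
    raise NotImplementedError("wire this up to your model provider")


def judge_case(case: dict) -> dict:
    """Score one case (keys: question, reference, candidate), tolerating unparseable judge output."""
    raw = call_llm(JUDGE_PROMPT.format(**case))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"score": None, "reason": f"unparseable judge output: {raw!r}"}


def evaluate(cases: list[dict]) -> dict:
    """Run the judge over an evaluation set; report a mean score plus the worst cases to inspect."""
    results = [{**case, **judge_case(case)} for case in cases]
    scored = [r for r in results if isinstance(r["score"], int)]
    return {
        "mean_score": statistics.mean(r["score"] for r in scored) if scored else None,
        "bad_cases": sorted(scored, key=lambda r: r["score"])[:10],  # candidates for manual review
    }
```

The same pattern extends to multi-turn or multi-modal cases by swapping the prompt template and case schema, and the `bad_cases` slice is the usual starting point for the manual review and iterative optimization the talk describes.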
