AI Breakthrough: Code-Based Systems Reshape How Language Models Think and Process Information
Amsterdam, Sunday, 15 February 2026.
Revolutionary CodeAct and recursive language model approaches are transforming AI development by replacing traditional token-feeding methods with executable code frameworks. CodeAct boosts agent performance by up to 20 percentage points across benchmarks by treating Python code as the universal action language, while recursive language models address context limitations through programmatic exploration. These innovations enable AI systems to self-debug, handle complex reasoning tasks more effectively, and maintain performance even on massive datasets, marking a fundamental shift from scaling context windows to restructuring computation itself.
CodeAct Framework: Transforming Agent Performance Through Executable Actions
The CodeAct framework represents a fundamental departure from traditional AI agent architectures by unifying all agent actions as executable Python code rather than relying on JSON schemas or fixed tool calls [1]. Instead of generating structured outputs, agents produce arbitrary Python code such as "result = 5 * 2; print(result)", which executes in a Python interpreter that returns outputs and errors as feedback [1]. This approach enables multi-turn revision, loops, library use, and self-debugging capabilities that were previously unavailable to language model agents [1]. Testing across 17 different large language models demonstrates CodeAct's consistent superiority over traditional JSON-based tool use, with improvements of roughly 20 percentage points each on API-Bank, ToolAlpaca, and newly introduced agent benchmarks [1]. Even for simple, atomic APIs, performance improves by 5 to 10% because code naturally expresses loops and conditionals [1].
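To make the execution-as-action idea concrete, the sketch below shows a minimal CodeAct-style loop: the model emits Python code instead of a JSON tool call, the code runs in an interpreter, and the captured output or traceback becomes the next observation. This is an illustrative simplification, not the framework's actual API; the call_llm helper, the prompt format, and the FINAL_ANSWER convention are assumptions.

```python
import io
import traceback
from contextlib import redirect_stdout

def run_action(code: str, env: dict) -> str:
    """Execute model-generated Python and return stdout or the traceback as feedback."""
    buffer = io.StringIO()
    try:
        with redirect_stdout(buffer):
            exec(code, env)  # actions are arbitrary Python, not fixed tool schemas
        return buffer.getvalue() or "(no output)"
    except Exception:
        return traceback.format_exc()  # errors become observations the agent can self-debug

def codeact_loop(task: str, call_llm, max_turns: int = 5) -> str:
    """Minimal multi-turn CodeAct loop; call_llm is an assumed text-in/text-out helper."""
    env: dict = {}                       # interpreter state persists across turns
    history = [f"Task: {task}"]
    for _ in range(max_turns):
        code = call_llm("\n".join(history) + "\nWrite Python to make progress:")
        feedback = run_action(code, env)
        history.append(f"Action:\n{code}\nObservation:\n{feedback}")
        if "FINAL_ANSWER" in env:        # assumed convention: agent sets FINAL_ANSWER when done
            return str(env["FINAL_ANSWER"])
    return history[-1]
```

Because the interpreter state persists between turns, a failed action's traceback can be read and corrected in the next turn, which is where the multi-turn self-debugging behavior comes from.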
Recursive Language Models: Addressing Context Limitations Through Code
Recursive Language Models (RLMs) tackle the persistent challenge of context rot, the degradation of language model performance as contexts grow larger, by separating the variable space from the token space through programmatic exploration [2]. An RLM operates through a sandboxed Python REPL environment, treating the context as external data that can be examined through code execution and recursive sub-LLM calls [2]. The system runs an iterative REPL loop: the language model receives metadata about the context, writes Python code to explore the data, executes that code in the sandboxed interpreter, can issue sub-LLM calls via llm_query(prompt), and finally returns its answer with SUBMIT(output) [2]. This architecture targets scenarios where the context is too large for traditional processing, the task benefits from programmatic exploration, or the language model needs to decide how to decompose a complex problem [2]. The implementation relies on Deno and Pyodide to create a local WASM sandbox for secure Python execution, with default limits of 20 iterations, 50 LLM calls, and 10,000 output characters [2].
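The sketch below illustrates that loop under stated assumptions: llm_query and SUBMIT mirror the function names in the RLM description, while the orchestration code and the call_root_llm and call_sub_llm helpers are hypothetical simplifications. The real system executes the generated code inside a Deno/Pyodide WASM sandbox rather than a plain exec call.

```python
import io
from contextlib import redirect_stdout

MAX_ITERS, MAX_LLM_CALLS, MAX_OUTPUT_CHARS = 20, 50, 10_000  # default limits cited in [2]

def recursive_lm(question: str, context: str, call_root_llm, call_sub_llm) -> str:
    """Iterative REPL loop: the root LLM writes code that inspects `context` as data."""
    result = {"answer": None}
    llm_calls = 0

    def llm_query(prompt: str) -> str:             # recursive sub-LLM call
        nonlocal llm_calls
        llm_calls += 1
        if llm_calls > MAX_LLM_CALLS:
            raise RuntimeError("sub-LLM call budget exceeded")
        return call_sub_llm(prompt)

    def SUBMIT(output):                            # ends the loop with a final answer
        result["answer"] = str(output)

    env = {"context": context, "llm_query": llm_query, "SUBMIT": SUBMIT}
    transcript = [f"Question: {question}\ncontext: str of length {len(context)}"]  # metadata only

    for _ in range(MAX_ITERS):
        code = call_root_llm("\n".join(transcript) + "\nWrite Python to explore `context`:")
        buf = io.StringIO()
        try:
            with redirect_stdout(buf):
                exec(code, env)                    # the real RLM runs this in a WASM sandbox
        except Exception as exc:
            buf.write(f"Error: {exc!r}")
        if result["answer"] is not None:
            return result["answer"]
        transcript.append(f"Code:\n{code}\nOutput:\n{buf.getvalue()[:MAX_OUTPUT_CHARS]}")
    return "(no answer submitted within iteration budget)"
```

Note that the full context never enters the prompt; the model only ever sees metadata plus the truncated output of its own exploration code, which is how the token space stays small regardless of data size.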
ROMA Framework: Multi-Agent Systems with Recursive Task Decomposition
On February 14, 2026, researchers introduced ROMA (Recursive Open Meta-Agents), a domain-agnostic framework that standardizes agent construction around four modular roles: Atomizer, Planner, Executor, and Aggregator [3]. ROMA decomposes goals into dependency-aware subtask trees that can be executed in parallel and uses aggregation to compress and validate intermediate results [3]. The framework addresses limitations of current agentic frameworks on long-horizon tasks by providing transparency and traceability of agent behavior through structured, hierarchical traces [3]. GEPA+, an improved Genetic-Pareto prompt proposer introduced alongside ROMA, adapts the framework to specific tasks without fine-tuning [3]. Benchmarks released with the framework show ROMA instantiated with GLM-4.6 improving accuracy on SEAL-0 by 9.9 percentage points over Kimi-Researcher, reaching 45.9% versus Perplexity Deep Research's 31.5% [3]. On the EQ-Bench long-form writing benchmark, ROMA enables DeepSeek-V3 to match the performance of leading closed-source models such as Claude Sonnet 4.5 [3].
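A compact sketch of the four-role recursion is shown below. The role names follow the paper, but their signatures, the recursion-depth cap, and the simple thread-pool parallelism are assumptions for illustration; the actual framework also tracks dependencies between subtasks, which is elided here by treating planned subtasks as independent.

```python
from concurrent.futures import ThreadPoolExecutor

def roma_solve(task: str, atomizer, planner, executor, aggregator,
               depth: int = 0, max_depth: int = 3) -> str:
    """Recursive Atomizer -> Planner -> Executor -> Aggregator pass over a subtask tree."""
    # Atomizer: decide whether the task is atomic or needs decomposition.
    if depth >= max_depth or atomizer(task):
        return executor(task)                      # Executor handles atomic subtasks directly

    # Planner: produce a list of subtasks (treated as independent and run in parallel here).
    subtasks = planner(task)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(
            lambda sub: roma_solve(sub, atomizer, planner, executor, aggregator,
                                   depth + 1, max_depth),
            subtasks,
        ))

    # Aggregator: compress and validate intermediate results into one answer for the parent task.
    return aggregator(task, results)
```

Each recursive call produces a node in a hierarchical trace, which is what gives the framework its transparency: every intermediate plan, execution, and aggregation step can be inspected after the fact.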
S1-NexusAgent: Self-Evolving Scientific Research Framework
The S1-NexusAgent framework, also introduced on February 14, 2026, applies these computational advances to multidisciplinary scientific research [4]. S1-NexusAgent employs a hierarchical Plan-and-CodeAct execution paradigm with a dual-loop architecture that decouples scientific planning from tool execution [4]. The system natively supports the Model Context Protocol (MCP) and integrates thousands of cross-disciplinary scientific tools through intention-aware dynamic tool retrieval [4]. To handle long-context and large-scale data challenges, S1-NexusAgent introduces object-reference-based sparse context management, enabling sub-task context isolation and intermediate-result compression [4]. A Critic Agent evaluates complete execution trajectories and distills high-quality research paths into reusable Scientific Skills, forming a closed loop for continuous self-evolution [4]. Evaluations on authoritative scientific benchmarks show S1-NexusAgent achieving state-of-the-art results: 76.07% average accuracy on Biomni-Eval1 using Claude-4.5-Sonnet, superior performance on ChemBench across both open-weight and proprietary model categories, and leading results on MatSciBench, outperforming agents built on strong foundation models such as Claude-3.7-Sonnet [4]. The framework was developed by researchers at the Institute of Automation, Chinese Academy of Sciences (CASIA), with project leadership by Jiajun Zhang [4].
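The object-reference idea can be illustrated with a small sketch: large intermediate results stay in an object store, and only short references plus summaries enter the planning context. The ObjectStore class and the summary strings below are hypothetical, intended only to show how sub-task context isolation and intermediate-result compression might fit together rather than how S1-NexusAgent implements them.

```python
import uuid

class ObjectStore:
    """Holds large intermediate results; the planning context only carries references."""
    def __init__(self):
        self._objects = {}

    def put(self, value, summary: str) -> str:
        ref = f"obj://{uuid.uuid4().hex[:8]}"
        self._objects[ref] = value
        return f"{ref} ({summary})"                 # compressed form shown to the planner

    def get(self, ref: str):
        return self._objects[ref.split(" ")[0]]     # tool executors dereference on demand

# Example: a sub-task produces a large table, but the planning loop only sees a reference.
store = ObjectStore()
large_result = [{"compound": i, "score": i * 0.1} for i in range(100_000)]
ref = store.put(large_result, summary="100k screening scores")
plan_context = f"Step 3 complete -> {ref}"          # what the planning loop actually carries
print(plan_context)
print(len(store.get(ref)))                          # the execution loop can still load full data
```

Keeping the full data out of the prompt is what lets each sub-task run in an isolated context while the planner reasons over compact summaries of everything produced so far.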