← Back to Papers

RLM

2026-02-05

https://arxiv.org/pdf/2512.24601

1. What is the paper about as a whole?

Improving the performance of LLMs on long-context tasks. The authors propose an inference-time method for managing the prompt itself.

2. What is being said in detail, and how?

LLMs typically struggle with long-context tasks: they can run out of context window or suffer from context rot. Existing methods try to overcome this, and benchmarks show they help, but models still struggle.

The authors propose treating the task prompt as a string variable. Instead of prompting the LLM directly with the task, they start a REPL and instruct the LLM to write functions that operate on the prompt and 'recursively' spawn LLM calls to complete the task. For example, one of the benchmarks involves searching for a character name across a whole book. Naively, you would give the model instructions and paste in the entire book. With RLM, the model instead works in a Python REPL: it writes the required regex, defines a function that chunks the book into chapters, and then spins up a sub-agent for each chapter. The root model's context never fills up.
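The chunk-and-recurse pattern above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `llm` here is a hypothetical stand-in (a regex-based stub) for a real LLM API call, and the chapter-splitting regex is an assumption for the toy book below.

```python
import re

def llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM sub-call; here it just
    # "answers" by extracting any captain names present in its chunk.
    return ", ".join(sorted(set(re.findall(r"\b(Captain [A-Z]\w+)\b", prompt))))

def rlm_map(task: str, long_prompt: str, chunk_re: str) -> list[str]:
    # The root model never reads `long_prompt` directly: it is held as a
    # string variable, split into chunks, and each chunk is handed to a
    # fresh sub-call whose context contains only that chunk plus the task.
    chunks = re.split(chunk_re, long_prompt)
    return [answer for c in chunks if (answer := llm(f"{task}\n\n{c}"))]

book = (
    "Chapter 1\nCaptain Ahab paced the deck.\n"
    "Chapter 2\nThe storm raged on.\n"
    "Chapter 3\nCaptain Ahab spoke to Captain Peleg.\n"
)
answers = rlm_map("List every captain named in this chapter.", book, r"Chapter \d+\n")
# Each entry comes from one chapter-sized sub-call, so no single call
# ever holds the whole book in context.
```

The key design point is that the root context only ever holds the orchestration code and the per-chunk answers, never the full document.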

3. Is the paper true, in whole or part?

The paper does appear to achieve substantial improvements on the benchmarks shown. However, it is unclear how representative those benchmarks are of truly complex tasks.

4. What of it?

This is promising because the authors achieved it purely at inference time, without the models being trained for this use case. It seems likely that post-training with RLM in mind will further improve results. It does appear to save some $$ for the use cases it solves, but more experimentation is necessary.