The Open Reward Standard (ORS) is an HTTP-based protocol that standardises how agents interact with reinforcement learning environments. It defines:
  • How agents discover available tools (actions they can take)
  • How agents access tasks (problems to solve)
  • How agents receive rewards (feedback signals for RL training)
  • How episodes progress until completion (via finished signals)
In ORS, an environment is a server that agents connect to via HTTP. The server implements the ORS protocol, providing endpoints for tool discovery, task retrieval, and tool execution.

Key Principle: Actions are Tools

A fundamental assumption in ORS:
The only way agents interact with environments is by calling tools.
This design decision has important benefits:
  • Leverages existing capabilities: Major LLMs support function calling
  • Clear interface boundary: Agent actions are explicit and well-defined
  • Traceable interactions: Every action is a structured function call
  • Type safety: Tools have schemas defining their inputs and outputs
For example, in a math environment, the agent might have access to a submit tool which it can use to submit an answer to a prompt:
{
  "name": "submit",
  "description": "Submit an answer to the current math problem",
  "input_schema": {
    "type": "object",
    "title": "AnswerParams",
    "properties": {
      "answer": {"type": "number"}
    },
    "required": ["answer"]
  }
}
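Because every tool carries a JSON Schema, a server can reject malformed inputs before executing anything. The sketch below checks a call against the `submit` definition above; it is a minimal illustration covering only `required` and primitive `type` checks (a real implementation would use a full JSON Schema validator), and the function name `validate_input` is ours, not part of ORS.

```python
import json

# The "submit" tool definition shown above.
SUBMIT_TOOL = json.loads("""
{
  "name": "submit",
  "description": "Submit an answer to the current math problem",
  "input_schema": {
    "type": "object",
    "title": "AnswerParams",
    "properties": {"answer": {"type": "number"}},
    "required": ["answer"]
  }
}
""")

def validate_input(tool: dict, tool_input: dict) -> list[str]:
    """Return a list of validation errors (empty if the input is valid).

    Minimal sketch: checks 'required' fields and primitive 'type' values only.
    """
    schema = tool["input_schema"]
    errors = []
    for name in schema.get("required", []):
        if name not in tool_input:
            errors.append(f"missing required field: {name}")
    type_map = {"number": (int, float), "string": str, "boolean": bool}
    for name, spec in schema.get("properties", {}).items():
        if name in tool_input and spec["type"] in type_map:
            if not isinstance(tool_input[name], type_map[spec["type"]]):
                errors.append(f"{name}: expected {spec['type']}")
    return errors

print(validate_input(SUBMIT_TOOL, {"answer": 7}))        # []
print(validate_input(SUBMIT_TOOL, {"answer": "seven"}))  # ['answer: expected number']
```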

Primary use case: Reinforcement Learning

ORS is designed to accelerate agentic reinforcement learning by making it easier to define and interact with RL environments.

How RL works with ORS

  1. An agent is connected to a task from an ORS environment: it receives the environment's exposed tools and an initial prompt describing the task to accomplish.
  2. The agent executes tools, receiving tool output and rewards, until it receives a finished signal.
  3. At the end of the episode, we have a trajectory and its rewards, which we can use for credit assignment.
  4. We apply an RL algorithm of choice, e.g. a GRPO-based policy gradient.
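The rollout part of this loop (steps 1-3) can be sketched in a few lines. The `Session` and `Agent` interfaces below are illustrative stand-ins, not part of the ORS spec; the stub classes replay the math example used throughout this page so the loop can run without a server.

```python
def collect_trajectory(session, agent):
    """Run one episode and return (trajectory, total_reward)."""
    trajectory = []
    observation = session.get_prompt()            # initial prompt (step 1)
    while True:
        name, tool_input = agent.act(observation)   # agent picks a tool call
        result = session.call(name, tool_input)     # execute the tool (step 2)
        trajectory.append((observation, name, tool_input, result["reward"]))
        if result["finished"]:                      # finished signal ends the episode
            break
        observation = result["blocks"]              # tool output becomes the next observation
    return trajectory, sum(step[3] for step in trajectory)

class StubSession:
    """Toy single-step math episode standing in for a real ORS session."""
    def get_prompt(self):
        return [{"type": "text", "text": "If x + 5 = 12, what is x?"}]
    def call(self, name, tool_input):
        correct = name == "submit" and tool_input == {"answer": 7}
        return {"blocks": [], "reward": 1.0 if correct else 0.0, "finished": True}

class StubAgent:
    def act(self, observation):
        return "submit", {"answer": 7}

trajectory, total_reward = collect_trajectory(StubSession(), StubAgent())
print(total_reward)  # 1.0
```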

Example: Math Problem Solving

Consider training an agent on math problems. Here’s the protocol flow:
1. List available tasks
   POST /math/tasks {"split": "train"}
   → {"tasks": [{"question": "If x + 5 = 12, what is x?", "answer": "7"}, ...]}

2. Create session
   POST /create_session
   → {"sid": "session-123"}

3. Create episode with a task
   POST /create
   Headers: X-Session-ID: session-123
   Body: {"env_name": "math", "task_spec": {"question": "If x + 5 = 12, what is x?", "answer": "7"}}

4. Get initial prompt
   GET /math/prompt
   Headers: X-Session-ID: session-123
   → [{"text": "If x + 5 = 12, what is x?", "detail": null, "type": "text"}]

5. Call submit tool
   POST /math/call
   Headers: X-Session-ID: session-123
   Body: {"name": "submit", "input": {"answer": "7"}}
   → (SSE) {"ok": true, "output": {
       "blocks": [{"text": "Correct!", "detail": null, "type": "text"}],
       "metadata": null,
       "reward": 1.0,
       "finished": true
     }}
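The five steps above map directly onto client code. The sketch below replays the flow against canned responses copied from the traces; `send` is a hypothetical stand-in transport you would replace with a real HTTP client, and the `{"ok": true}` body for `POST /create` is our assumption (the response shape is not shown above).

```python
import json

# Canned responses copied from the protocol flow above.
CANNED = {
    ("POST", "/math/tasks"): {"tasks": [{"question": "If x + 5 = 12, what is x?", "answer": "7"}]},
    ("POST", "/create_session"): {"sid": "session-123"},
    ("POST", "/create"): {"ok": True},  # assumed response body
    ("GET", "/math/prompt"): [{"text": "If x + 5 = 12, what is x?", "detail": None, "type": "text"}],
    ("POST", "/math/call"): {"ok": True, "output": {
        "blocks": [{"text": "Correct!", "detail": None, "type": "text"}],
        "metadata": None, "reward": 1.0, "finished": True}},
}

def send(method, path, headers=None, body=None):
    """Stand-in transport; swap for a real HTTP client to talk to a server."""
    return CANNED[(method, path)]

# 1-2. List tasks, then create a session.
tasks = send("POST", "/math/tasks", body={"split": "train"})["tasks"]
sid = send("POST", "/create_session")["sid"]
headers = {"X-Session-ID": sid}

# 3-4. Create an episode for the first task, then fetch the prompt.
send("POST", "/create", headers, {"env_name": "math", "task_spec": tasks[0]})
prompt = send("GET", "/math/prompt", headers)

# 5. Call the submit tool and read the reward.
out = send("POST", "/math/call", headers, {"name": "submit", "input": {"answer": "7"}})["output"]
print(out["reward"], out["finished"])  # 1.0 True
```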
To use this data for reinforcement learning, we need to represent the trajectory in a way the model can consume, e.g. by tokenising it, and use the reward (1.0) as part of the gradient update. In GRPO, for example, the reward would be used to calculate the group advantage.
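The group-advantage step can be made concrete: GRPO samples several rollouts for the same task and normalises each episode's reward against the group's mean and standard deviation. A minimal sketch (function name ours):

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantage: normalise each rollout's reward
    against the group of rollouts sampled for the same task."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of the same task: two correct, two incorrect.
# Correct answers get advantage ~ +1, incorrect ~ -1.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))
```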

Secondary use case: Agentic Evaluation

While designed for RL training, ORS can also be used for agentic evaluation:
  • Standardised benchmarks: Common interface across different environments
  • Train/test splits: Tasks are organised into clear training/evaluation splits
  • Reproducible results: A standardised environment serves the same tasks identically to every agent
  • Diverse task types: ORS supports tasks from basic question/answer environments to more complicated agentic workflows involving sandbox execution and computer-use.

Core Components

An ORS server provides access to four core components:

1. Tools

Tools are the actions available to agents. Each tool has:
  • A name (e.g., bash, submit, read_file)
  • A description explaining what it does
  • An input schema (JSON Schema) defining parameters
  • A return type (ToolOutput with blocks, reward, finished)
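The four fields above suggest a simple data model. The dataclasses below are an illustrative sketch, with field names taken from the JSON shown elsewhere on this page rather than from a normative schema:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Tool:
    name: str                      # e.g. "bash", "submit", "read_file"
    description: str               # what the tool does
    input_schema: dict[str, Any]   # JSON Schema for the parameters

@dataclass
class ToolOutput:
    blocks: list[dict[str, Any]]   # text/image blocks returned to the agent
    reward: float = 0.0            # RL feedback signal
    finished: bool = False         # episode-termination flag
    metadata: Optional[dict] = None

submit = Tool(
    name="submit",
    description="Submit an answer to the current math problem",
    input_schema={"type": "object",
                  "properties": {"answer": {"type": "number"}},
                  "required": ["answer"]},
)
print(submit.name)  # submit
```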

2. Tasks

Tasks are the problems agents need to solve. Each task is a JSON object containing problem-specific data:
{
  "question": "What is 2+2?",
  "ground_truth": 4,
  "difficulty": "easy"
}
The structure is environment-specific. Math environments have questions and answers. Coding environments have problem descriptions and test cases.

3. Splits

Splits organise tasks into categories. The standard splits are:
  • train - Tasks for training agents
  • validation - Tasks for hyperparameter tuning
  • test - Tasks for final evaluation
Custom splits are also possible, for example per environment version, or splits with different hardware requirements (e.g. CPU vs. GPU).

4. Prompts

Prompts are the initial instructions given to agents for each task. They’re returned as blocks (text or images):
# Agent gets prompt at start of episode
prompt = session.get_prompt()
# → [TextBlock(text="What is 2+2?")]
Prompts can be multi-modal (text + images) and are generated dynamically based on the task.
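A client typically flattens the text blocks of a prompt before handing them to a model; image blocks would be passed separately. A small sketch (helper name ours), using the block shape from the `/prompt` response above:

```python
def render_prompt(blocks):
    """Join the text blocks of a prompt into one string."""
    return "\n".join(b["text"] for b in blocks if b.get("type") == "text")

blocks = [{"text": "What is 2+2?", "detail": None, "type": "text"}]
print(render_prompt(blocks))  # What is 2+2?
```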

Episodes are sessions

In ORS, a session is an RL episode.

Episode Lifecycle

1. Create session → Start episode with a specific task
2. Get prompt     → Receive initial state
3. Call tools     → Take actions, get results and rewards
4. Repeat step 3  → Until finished=True
5. End session    → Episode complete
The episode continues until a tool returns finished: true. This is different from typical API sessions - there’s semantic meaning to when an episode ends. It represents task completion (success or failure).

Episode Example

Episode 1: Single-step (correct answer)
POST /create_session → session_id_1
POST /create (task: problem_1)
POST /env/call ("submit", {"answer": 42})
→ finished=true, reward=1.0
Episode 2: Multi-step interaction
POST /create_session → session_id_2
POST /create (task: problem_2)

Step 1: Explore
POST /env/call ("bash", {"command": "cat question.txt"})
→ finished=false, reward=0.0

Step 2: Solve
POST /env/call ("submit", {"answer": "Tokyo"})
→ finished=true, reward=1.0

Rewards

Rewards are numeric feedback signals that enable RL training.

Reward Design

  • Sparse rewards: Only at task completion (0 or 1)
  • Dense rewards: After each action (incremental progress)
  • Shaped rewards: Guide agent toward solution
Example sparse rewards:
POST /env/call ("submit", {"answer": 42})
→ reward=1.0, finished=true    # Correct

POST /env/call ("submit", {"answer": 43})
→ reward=0.0, finished=true    # Incorrect
Example dense rewards (trading environment):
POST /env/call ("place_trade", {"ticker": "AAPL", "action": "buy", "quantity": 10})
→ reward=0.0, finished=false    # Position opened

POST /env/call ("place_trade", {"ticker": "GOOGL", "action": "buy", "quantity": 5})
→ reward=0.0, finished=false    # Another position opened

POST /env/call ("end_day", {})
→ reward=0.023, finished=true   # Day settled, portfolio returned +2.3%
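Whatever the reward design, training needs the episode's return. A minimal sketch of a discounted return (with gamma=1 it reduces to the plain sum, which is what sparse terminal rewards give you; the function name is ours):

```python
def episode_return(rewards, gamma=1.0):
    """Discounted return of one episode's per-step rewards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# The dense trading episode above: two neutral steps, then settlement.
print(episode_return([0.0, 0.0, 0.023]))  # 0.023
```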

Protocol Overview

ORS uses HTTP + Server-Sent Events for communication:

HTTP for Control

Standard REST endpoints for:
  • Listing tools, splits, tasks
  • Creating/deleting sessions
  • Health checks

SSE for Tool Execution

Tool calls return results via Server-Sent Events:
  • Chunks large responses into smaller pieces for reliable delivery
  • Keeps connections alive during long-running tool calls
  • Allows clients to reconnect and resume results via task IDs
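On the wire, each SSE event arrives as one or more `data:` lines followed by a blank line. The sketch below collects those payloads into JSON objects; a real client reads incrementally from the socket, and it assumes each event's data lines concatenate into one JSON document (the exact chunking scheme is server-defined).

```python
import json

def parse_sse(stream: str):
    """Parse the `data:` payloads of a captured SSE stream into JSON objects."""
    events = []
    for event in stream.split("\n\n"):          # blank line separates events
        data = "".join(line[len("data:"):].strip()
                       for line in event.splitlines()
                       if line.startswith("data:"))
        if data:
            events.append(json.loads(data))
    return events

raw = 'data: {"ok": true, "output": {"reward": 1.0, "finished": true}}\n\n'
events = parse_sse(raw)
print(events[0]["output"])  # {'reward': 1.0, 'finished': True}
```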

Language-Agnostic

Because ORS is HTTP-based, it can be implemented in any language:
  • Python: OpenReward SDK (reference implementation)
  • TypeScript: Custom server with Express/Fastify
  • Go: Custom server with stdlib http
  • Rust: Custom server with Actix/Axum

ORS vs MCP

Both ORS and MCP involve agents calling tools, but they serve different purposes.
MCP (Model Context Protocol):
  • Purpose: Connect LLMs to tools, data sources, workflows
  • Use case: General-purpose tool access
  • Protocol: JSON-RPC over stdio/SSE
  • Key feature: Seamless tool integration
ORS (Open Reward Standard):
  • Purpose: Connect agents to RL training environments
  • Use case: Training and evaluating agents
  • Protocol: HTTP + SSE
  • Key features: Rewards, episodes, task organization

What’s Different?

ORS adds RL-specific features:
Feature  | MCP | ORS | Why ORS Needs It
Rewards  | No  | Yes | RL training signal
Finished | No  | Yes | Episode termination
Tasks    | No  | Yes | Problem organization
Splits   | No  | Yes | Train/test separation

Can They Work Together?

Yes! They serve complementary purposes:
  • MCP: Agent uses tools to access external data/APIs
  • ORS: Agent operates in structured RL environment with rewards
You might use both: an agent in an ORS environment that uses MCP tools to access external resources.

Next Steps

Quick Start

Build your first ORS server with GSM8K example

Protocol Specification

Dive into the HTTP API details

Core Concepts

Understand tools, tasks, rewards, and prompts

Implementation Guide

Learn how to implement an ORS server

Key Takeaway: ORS brings RL to language models by providing a standardised protocol with rewards, episode structure, and task organization.