A reward signal defines the goal in a reinforcement learning problem. Typically this is a single number (scalar feedback) that tells an agent how good or bad an outcome is in an immediate sense. In ORS, rewards are obtained by executing tools and are returned as part of a ToolOutput. They can be used for training in reinforcement learning, or for agentic evaluation; for example, a binary reward can be used to compute accuracy.

What is a Reward?

A reward is a numeric signal (typically a float) that indicates how good an action was:
interface ToolOutput {
  blocks: Blocks
  reward?: number  // ← The reward signal
  finished: boolean
  metadata?: JSONObject
}
Characteristics:
  • Optional field (may be omitted or null)
  • Often normalised to the range [-1, 1] or [0, 1]
  • Can be returned with any tool call
  • Accumulated over an episode

Reward Examples

Positive reward (success):
{
  "blocks": [{"text": "Correct!", "detail": null, "type": "text"}],
  "metadata": null,
  "reward": 1.0,
  "finished": true
}
Zero reward (neutral):
{
  "blocks": [{"text": "File listed successfully", "detail": null, "type": "text"}],
  "metadata": null,
  "reward": 0.0,
  "finished": false
}
Negative reward (penalty):
{
  "blocks": [{"text": "Error: Command failed", "detail": null, "type": "text"}],
  "metadata": null,
  "reward": -0.1,
  "finished": false
}

Reward Design Patterns

Pattern 1: Sparse Rewards

Only reward at episode end:
def submit(self, params) -> ToolOutput:
    correct = (params.answer == self.task["answer"])

    return ToolOutput(
        blocks=[TextBlock(text="Correct!" if correct else "Incorrect")],
        reward=1.0 if correct else 0.0,
        finished=True
    )
All intermediate steps return zero reward:
def bash(self, params) -> ToolOutput:
    output = execute_command(params.command)

    return ToolOutput(
        blocks=[TextBlock(text=output)],
        reward=0.0,  # No reward for intermediate steps
        finished=False
    )
Pros:
  • Simple to implement
  • Clear success/failure signal
  • Easy to understand
Cons:
  • Hard for agent to learn (delayed feedback)
  • Credit assignment problem
  • Slow learning
Best for: Simple tasks, short episodes

Pattern 2: Dense Rewards

Reward every meaningful action:
def bash(self, params) -> ToolOutput:
    output = execute_command(params.command)

    # Reward progress
    reward = 0.0
    if "answer.txt" in output:
        reward = 0.2  # Found relevant file
    if len(output) > 0:
        reward += 0.1  # Successful command

    return ToolOutput(
        blocks=[TextBlock(text=output)],
        reward=reward,
        finished=False
    )

def submit(self, params) -> ToolOutput:
    correct = (params.answer == self.task["answer"])

    return ToolOutput(
        blocks=[TextBlock(text="Correct!" if correct else "Incorrect")],
        reward=1.0 if correct else 0.0,  # Final reward
        finished=True
    )
Pros:
  • Faster learning
  • Guides agent toward solution
  • Better credit assignment
Cons:
  • Harder to design
  • Can bias agent toward suboptimal paths
  • Risk of reward hacking
Best for: Complex tasks, long episodes

Pattern 3: Shaped Rewards

Reward based on distance to goal:
def calculate_progress_reward(self, state) -> float:
    """Reward based on how close to solution"""
    # Example: Math problem solving

    # Check if agent has seen the problem
    if not state["viewed_problem"]:
        return 0.0

    # Check if agent has attempted calculation
    if state["attempted_calculation"]:
        reward = 0.3

        # Check if calculation is close to answer
        if state["last_guess"] is not None:  # a guess of 0 is still a guess
            error = abs(state["last_guess"] - self.task["answer"])
            if error < 5:
                reward = 0.7  # Getting close!
            if error < 1:
                reward = 0.9  # Very close!

        return reward

    return 0.1  # Viewed the problem but has not attempted a calculation
Pros:
  • Strong learning signal
  • Guides exploration
  • Accelerates training
Cons:
  • Requires domain knowledge
  • Can create reward hacking opportunities
  • Complex to implement
Best for: Well-understood domains, complex navigation
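To make the shaping concrete, here is a short, self-contained trace of how the reward above grows as the agent progresses. The RewardShaper wrapper and the state dictionaries are illustrative (not part of ORS), assuming a task whose answer is 42:

```python
class RewardShaper:
    """Illustrative wrapper around the shaped-reward logic above."""

    def __init__(self, answer: float):
        self.task = {"answer": answer}

    def calculate_progress_reward(self, state) -> float:
        if not state["viewed_problem"]:
            return 0.0
        if state["attempted_calculation"]:
            reward = 0.3
            if state["last_guess"] is not None:
                error = abs(state["last_guess"] - self.task["answer"])
                if error < 5:
                    reward = 0.7  # Getting close!
                if error < 1:
                    reward = 0.9  # Very close!
            return reward
        return 0.1  # Viewed the problem but has not attempted a calculation


shaper = RewardShaper(answer=42.0)

# Reward climbs as the (hypothetical) agent state gets closer to the answer
trajectory = [
    {"viewed_problem": False, "attempted_calculation": False, "last_guess": None},
    {"viewed_problem": True, "attempted_calculation": False, "last_guess": None},
    {"viewed_problem": True, "attempted_calculation": True, "last_guess": 30.0},
    {"viewed_problem": True, "attempted_calculation": True, "last_guess": 40.0},
    {"viewed_problem": True, "attempted_calculation": True, "last_guess": 41.5},
]
print([shaper.calculate_progress_reward(s) for s in trajectory])
# → [0.0, 0.1, 0.3, 0.7, 0.9]
```

The monotonically increasing sequence is the point of shaping: each step of genuine progress is visible in the reward, rather than only at episode end.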

Pattern 4: Penalty-Based Rewards

Penalize bad actions:
def bash(self, params) -> ToolOutput:
    # Penalize dangerous commands
    dangerous = ["rm -rf", ":(){ :|:& };:", "dd if=/dev/zero"]
    if any(cmd in params.command for cmd in dangerous):
        return ToolOutput(
            blocks=[TextBlock(text="Error: Dangerous command blocked")],
            reward=-1.0,  # Large penalty
            finished=True  # End episode
        )

    # Penalize inefficient actions
    if self.step_count > 20:
        reward = -0.1  # Penalty for taking too long
    else:
        reward = 0.0

    output = execute_command(params.command)

    return ToolOutput(
        blocks=[TextBlock(text=output)],
        reward=reward,
        finished=False
    )
Use cases:
  • Safety constraints
  • Efficiency requirements
  • Avoiding bad behaviors

Pattern 5: Binary Rewards

Simple success/failure:
def submit(self, params) -> ToolOutput:
    correct = (params.answer == self.task["answer"])

    return ToolOutput(
        blocks=[TextBlock(text="Correct!" if correct else "Incorrect")],
        reward=1.0 if correct else 0.0,  # Binary: 1 or 0
        finished=True
    )
Best for: Classification, Q&A, simple decisions
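With binary rewards, accuracy falls out directly: it is simply the mean of the final episode rewards. A minimal sketch (episode_rewards is a hypothetical list of final rewards, one per episode):

```python
def accuracy(episode_rewards: list[float]) -> float:
    """Mean of binary episode rewards = fraction of successful episodes."""
    return sum(episode_rewards) / len(episode_rewards)


# Four episodes: three successes, one failure
print(accuracy([1.0, 0.0, 1.0, 1.0]))  # → 0.75
```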

Reward Scales

Common Reward Ranges

[0, 1] scale:
  • 0 = failure
  • 1 = perfect success
  • 0.5 = partial success
[-1, 1] scale:
  • -1 = worst outcome
  • 0 = neutral
  • +1 = best outcome
Custom scales:
  • Can use any range, but normalise before feeding into RL algorithms
  • Some RL algorithms normalise automatically, e.g. the group-relative advantage in GRPO
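As a sketch of the kind of normalisation GRPO-style methods apply, here is a group-relative advantage computed over a group of episode rewards: each reward is standardised against the group mean and standard deviation. This is only an illustration of the normalisation step, not a full GRPO implementation:

```python
def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalise a group of episode rewards to zero mean, unit variance.

    Mirrors the group-relative advantage in GRPO-style algorithms:
    each reward is scored relative to the other rewards in its group,
    so the raw reward scale matters less.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Binary rewards in any group map to roughly ±1 advantages
print(group_advantages([1.0, 0.0, 1.0, 0.0]))
```

Because the output depends only on relative ordering and spread, the same function works whether the raw rewards live in [0, 1], [-1, 1], or a custom scale.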

Cumulative Rewards

In RL, we often care about total episode reward:
total_reward = 0.0
finished = False

# Episode loop
while not finished:
    ...
    result = session.call_tool(...)
    total_reward += result.reward or 0.0
    finished = result.finished

print(f"Total episode reward: {total_reward}")
Interpretation:
  • Higher total reward = better performance
  • Compare across episodes for learning progress
  • Use for evaluation metrics

Rewards for Evaluation

Rewards in ORS can be used for evaluation as well as training:
# Run evaluation
total_rewards = []

for task in test_tasks:
    session = create_session(task)
    episode_reward = 0.0
    finished = False

    while not finished:
        result = agent.step(session)
        episode_reward += result.reward or 0.0
        finished = result.finished

    total_rewards.append(episode_reward)

# Evaluation metrics
avg_reward = sum(total_rewards) / len(total_rewards)
success_rate = sum(1 for r in total_rewards if r > 0.9) / len(total_rewards)

print(f"Average reward: {avg_reward:.3f}")
print(f"Success rate: {success_rate:.1%}")

Next Steps

Tools

Design tools that return rewards

Tasks & Splits

Organize tasks for RL training

Sessions & Episodes

Understand episode lifecycle and reward accumulation

Implementing a Server

Build an ORS server with reward logic

Key Takeaway: Rewards are the learning signal for RL. Design them carefully to align with your true objective, provide timely feedback, and avoid unintended behaviors. Good reward design is critical for successful RL training.