
Rewards

Rewards are numeric feedback signals that enable reinforcement learning with ORS. They tell the agent which actions lead to success, forming the foundation of RL training.

Why Rewards?

From the ORS specification:
“Actions in an environment yield states with rewards; for example, submitting a correct solution to a mathematics problem may yield a positive reward.”
Primary purpose: Enable RL training of language model agents.
Secondary purpose: Provide evaluation scores.

What is a Reward?

A reward is a numeric signal (typically a float) that indicates how good an action was:
interface ToolOutput {
  blocks: Blocks
  reward?: number  // ← The reward signal
  finished: boolean
  metadata?: JSONObject
}
Characteristics:
  • Optional field (can be null; see the sketch below)
  • Typically in range [-1, 1] or [0, 1]
  • Returned with every tool call
  • Accumulated over an episode
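
Because the reward field is optional, clients should treat a missing value as zero when reading tool results. A minimal sketch, assuming a session object like the one used in the Cumulative Rewards section below (the tool name and arguments are illustrative):
# Read the reward from a single tool call; the tool name and arguments
# here are illustrative, not part of a fixed API.
result = session.call_tool("submit", {"answer": 42})

# reward is optional, so treat a missing (null) value as 0.0
step_reward = result.reward if result.reward is not None else 0.0
print(f"Step reward: {step_reward}, finished: {result.finished}")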

Reward Examples

Positive reward (success):
{
  "blocks": [{"text": "Correct!", "detail": null, "type": "text"}],
  "metadata": null,
  "reward": 1.0,
  "finished": true
}
Zero reward (neutral):
{
  "blocks": [{"text": "File listed successfully", "detail": null, "type": "text"}],
  "metadata": null,
  "reward": 0.0,
  "finished": false
}
Negative reward (penalty):
{
  "blocks": [{"text": "Error: Command failed", "detail": null, "type": "text"}],
  "metadata": null,
  "reward": -0.1,
  "finished": false
}

Reward Design Patterns

Pattern 1: Sparse Rewards

Only reward at episode end:
def submit(self, params) -> ToolOutput:
    correct = (params.answer == self.task["answer"])

    return ToolOutput(
        blocks=[TextBlock(text="Correct!" if correct else "Incorrect")],
        reward=1.0 if correct else 0.0,
        finished=True
    )
All intermediate steps return zero reward:
def bash(self, params) -> ToolOutput:
    output = execute_command(params.command)

    return ToolOutput(
        blocks=[TextBlock(text=output)],
        reward=0.0,  # No reward for intermediate steps
        finished=False
    )
Pros:
  • Simple to implement
  • Clear success/failure signal
  • Easy to understand
Cons:
  • Hard for the agent to learn (delayed feedback)
  • Credit assignment problem
  • Slow learning
Best for: Simple tasks, short episodes

Pattern 2: Dense Rewards

Reward every meaningful action:
def bash(self, params) -> ToolOutput:
    output = execute_command(params.command)

    # Reward progress
    reward = 0.0
    if "answer.txt" in output:
        reward = 0.2  # Found relevant file
    if len(output) > 0:
        reward += 0.1  # Successful command

    return ToolOutput(
        blocks=[TextBlock(text=output)],
        reward=reward,
        finished=False
    )

def submit(self, params) -> ToolOutput:
    correct = (params.answer == self.task["answer"])

    return ToolOutput(
        blocks=[TextBlock(text="Correct!" if correct else "Incorrect")],
        reward=1.0 if correct else 0.0,  # Final reward
        finished=True
    )
Pros:
  • Faster learning
  • Guides the agent toward the solution
  • Better credit assignment
Cons:
  • Harder to design
  • Can bias the agent toward suboptimal paths
  • Risk of reward hacking
Best for: Complex tasks, long episodes

Pattern 3: Shaped Rewards

Reward based on distance to goal:
def calculate_progress_reward(self, state) -> float:
    """Reward based on how close to solution"""
    # Example: Math problem solving

    # Check if agent has seen the problem
    if not state["viewed_problem"]:
        return 0.0

    # Check if agent has attempted calculation
    if state["attempted_calculation"]:
        reward = 0.3

        # Check if calculation is close to answer
        if state["last_guess"]:
            error = abs(state["last_guess"] - self.task["answer"])
            if error < 5:
                reward = 0.7  # Getting close!
            if error < 1:
                reward = 0.9  # Very close!

        return reward

    return 0.1  # Viewed the problem but has not attempted a calculation
Pros:
  • Strong learning signal
  • Guides exploration
  • Accelerates training
Cons:
  • Requires domain knowledge
  • Can create reward hacking opportunities
  • Complex to implement
Best for: Well-understood domains, complex navigation

Pattern 4: Penalty-Based Rewards

Penalize bad actions:
def bash(self, params) -> ToolOutput:
    # Penalize dangerous commands
    dangerous = ["rm -rf", ":(){ :|:& };:", "dd if=/dev/zero"]
    if any(cmd in params.command for cmd in dangerous):
        return ToolOutput(
            blocks=[TextBlock(text="Error: Dangerous command blocked")],
            reward=-1.0,  # Large penalty
            finished=True  # End episode
        )

    # Penalize inefficient actions
    if self.step_count > 20:
        reward = -0.1  # Penalty for taking too long
    else:
        reward = 0.0

    output = execute_command(params.command)

    return ToolOutput(
        blocks=[TextBlock(text=output)],
        reward=reward,
        finished=False
    )
Use cases:
  • Safety constraints
  • Efficiency requirements
  • Avoiding bad behaviors

Pattern 5: Binary Rewards

Simple success/failure:
def submit(self, params) -> ToolOutput:
    correct = (params.answer == self.task["answer"])

    return ToolOutput(
        blocks=[TextBlock(text="Correct!" if correct else "Incorrect")],
        reward=1.0 if correct else 0.0,  # Binary: 1 or 0
        finished=True
    )
Best for: Classification, Q&A, simple decisions

Pattern 6: Continuous Rewards

Gradual reward based on quality:
def submit(self, params) -> ToolOutput:
    # Calculate how close the answer is
    correct_answer = self.task["answer"]
    submitted_answer = params.answer

    # Reward based on proximity
    error = abs(correct_answer - submitted_answer)
    max_error = 100  # Define reasonable max

    if error == 0:
        reward = 1.0  # Perfect
    else:
        # Reward proportional to accuracy
        reward = max(0, 1.0 - (error / max_error))

    return ToolOutput(
        blocks=[TextBlock(text=f"Your answer: {submitted_answer}")],
        reward=reward,
        finished=True
    )
Best for: Regression, optimization, generation quality

Reward Scales

Common Reward Ranges

[0, 1] scale (most common):
  • 0 = failure
  • 1 = perfect success
  • 0.5 = partial success
[-1, 1] scale:
  • -1 = worst outcome
  • 0 = neutral
  • +1 = best outcome
Custom scales:
  • Can use any range, but normalize for RL algorithms (see the sketch below)
  • Example: [0, 100] for percentage scores
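
If you do use a custom scale, normalize it before training. A minimal sketch, assuming raw scores on a [0, 100] percentage scale (normalize_reward is a hypothetical helper, not part of ORS):
def normalize_reward(raw_score: float, max_score: float = 100.0) -> float:
    """Map a raw score on [0, max_score] to the [0, 1] scale."""
    # Clamp first so out-of-range scores cannot produce rewards outside [0, 1]
    clamped = min(max(raw_score, 0.0), max_score)
    return clamped / max_score

# Example: an 87% score becomes a reward of 0.87
reward = normalize_reward(87.0)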

Recommendation

Use the [0, 1] scale:
  • Standard in RL
  • Works well with most algorithms
  • Easy to interpret
  • Prevents reward explosion

Cumulative Rewards

In RL, we often care about total episode reward:
total_reward = 0.0
finished = False

# Episode loop
while not finished:
    result = session.call_tool(...)
    total_reward += result.reward or 0.0
    finished = result.finished

print(f"Total episode reward: {total_reward}")
Interpretation:
  • Higher total reward = better performance
  • Compare across episodes to track learning progress
  • Use for evaluation metrics

Reward Design Principles

1. Align with True Objective

# Good - reward actual goal
def submit(self, params) -> ToolOutput:
    correct = check_answer(params.answer)
    return ToolOutput(..., reward=1.0 if correct else 0.0, finished=True)

# Bad - reward proxy metric
def submit(self, params) -> ToolOutput:
    # Rewarding answer length, not correctness!
    reward = min(len(str(params.answer)) / 10, 1.0)
    return ToolOutput(..., reward=reward, finished=True)

2. Be Consistent

# Good - consistent scale
def tool_a(...) -> ToolOutput:
    return ToolOutput(..., reward=1.0, finished=True)  # [0, 1]

def tool_b(...) -> ToolOutput:
    return ToolOutput(..., reward=0.5, finished=True)  # [0, 1]

# Bad - inconsistent scale
def tool_a(...) -> ToolOutput:
    return ToolOutput(..., reward=1.0, finished=True)  # [0, 1]

def tool_b(...) -> ToolOutput:
    return ToolOutput(..., reward=100.0, finished=True)  # [0, 100]???

3. Provide Immediate Feedback

# Good - immediate reward
def bash(self, params) -> ToolOutput:
    if self.found_solution_file():
        reward = 0.5  # Progress!
    else:
        reward = 0.0

    return ToolOutput(..., reward=reward, finished=False)

# Bad - delayed feedback
def bash(self, params) -> ToolOutput:
    # Always 0 until the end
    return ToolOutput(..., reward=0.0, finished=False)

4. Avoid Reward Hacking

# Bad - hackable reward
def bash(self, params) -> ToolOutput:
    # Agent can game this by running trivial commands
    reward = 0.1  # Reward every command!
    return ToolOutput(..., reward=reward, finished=False)

# Good - meaningful progress only
def bash(self, params) -> ToolOutput:
    # Only reward useful commands
    useful = self.command_provides_new_info(params.command)
    reward = 0.1 if useful else 0.0
    return ToolOutput(..., reward=reward, finished=False)

5. Scale Appropriately

# Good - balanced rewards
final_reward = 1.0  # Success
step_reward = 0.01  # Small progress

# Bad - imbalanced rewards
final_reward = 1.0  # Success
step_reward = 0.9  # Too large - swamps final reward!

Common Pitfalls

Pitfall 1: Reward Sparsity

Problem: The agent never learns because it receives no feedback until the very end.
Solution: Add intermediate rewards or shape the reward function, as in the sketch below.
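
One way to do this is to pay small, one-time rewards for verifiable milestones while keeping the large reward for final success. A minimal sketch in the style of the dense-reward pattern above; has_located_input, has_written_output, and the milestones dict are hypothetical helpers, not part of ORS:
def bash(self, params) -> ToolOutput:
    output = execute_command(params.command)

    # Small, one-time milestone rewards; the final submit still pays 1.0,
    # so intermediate rewards never dominate the episode total.
    # has_located_input/has_written_output and self.milestones are hypothetical.
    reward = 0.0
    if self.has_located_input() and not self.milestones["located_input"]:
        self.milestones["located_input"] = True
        reward += 0.1
    if self.has_written_output() and not self.milestones["wrote_output"]:
        self.milestones["wrote_output"] = True
        reward += 0.1

    return ToolOutput(
        blocks=[TextBlock(text=output)],
        reward=reward,
        finished=False
    )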

Pitfall 2: Reward Hacking

Problem: The agent finds an unintended way to maximize the reward.
Example:
# Agent learns to run "echo" repeatedly for +0.1 reward each time
def bash(self, params) -> ToolOutput:
    reward = 0.1  # Oops - too easy!
    return ToolOutput(..., reward=reward, finished=False)
Solution: Only reward meaningful progress.

Pitfall 3: Reward Scaling Issues

Problem: Some rewards dominate others.
Example:
step_reward = 1.0  # Each step
final_reward = 1.0  # Final success

# Problem: 100 steps = 100 reward, success = 1 reward
# Agent learns to take many steps, not solve task!
Solution: Keep final reward >> step rewards.

Pitfall 4: Ambiguous Rewards

Problem: The same reward is given for different outcomes.
Example:
# Both failure and partial success get 0.0?
if perfect:
    reward = 1.0
else:
    reward = 0.0  # No distinction between partial and complete failure
Solution: Use graduated rewards, as in the sketch below.
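
For example, grade a submission by how many independent checks it passes rather than all-or-nothing. A minimal sketch, assuming the task exposes a list of checks and a hypothetical passed_check helper:
def submit(self, params) -> ToolOutput:
    # Score by the fraction of checks passed, so partial solutions land
    # between 0.0 and 1.0 instead of collapsing to 0.0.
    checks = self.task["checks"]  # e.g. a list of test cases (assumed structure)
    passed = sum(1 for check in checks if self.passed_check(params.answer, check))
    reward = passed / len(checks)

    return ToolOutput(
        blocks=[TextBlock(text=f"Passed {passed}/{len(checks)} checks")],
        reward=reward,
        finished=True
    )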

Rewards for Evaluation

While rewards are primarily for RL training, they’re also useful for evaluation:
# Run evaluation
total_rewards = []

for task in test_tasks:
    session = create_session(task)
    episode_reward = 0.0
    finished = False

    while not finished:
        result = agent.step(session)
        episode_reward += result.reward or 0.0
        finished = result.finished

    total_rewards.append(episode_reward)

# Evaluation metrics
avg_reward = sum(total_rewards) / len(total_rewards)
success_rate = sum(1 for r in total_rewards if r > 0.9) / len(total_rewards)

print(f"Average reward: {avg_reward:.3f}")
print(f"Success rate: {success_rate:.1%}")

Key Takeaway: Rewards are the learning signal for RL. Design them carefully to align with your true objective, provide timely feedback, and avoid unintended behaviors. Good reward design is critical for successful RL training.