ToolOutput. They can be used for training in reinforcement learning or for agentic evaluation; for example, a binary reward can be used to compute accuracy.
What is a Reward?
A reward is a numeric signal (typically a float) that indicates how good an action was:
- Optional field (can be null)
- Often normalised in the range [-1, 1] or [0, 1]
- Returned with every tool call
- Accumulated over an episode
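As a minimal sketch of the properties listed above, the snippet below models a tool result with an optional reward and sums the rewards of one episode. `ToolResult` and `episode_reward` are hypothetical names chosen for illustration, not the actual ORS `ToolOutput` type, and treating a null reward as contributing nothing to the total is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolResult:
    """Hypothetical stand-in for a tool call result (not the real ORS type)."""
    content: str
    reward: Optional[float] = None  # None means "no reward for this call"


def episode_reward(results: list[ToolResult]) -> float:
    """Sum the rewards of one episode, skipping calls without a reward (assumption)."""
    return sum((r.reward for r in results if r.reward is not None), 0.0)


episode = [
    ToolResult("opened file"),                 # no reward signal
    ToolResult("tests passed", reward=1.0),    # success
    ToolResult("lint warning", reward=-0.25),  # small penalty
]
total = episode_reward(episode)
```

Keeping the reward as `Optional[float]` lets a tool distinguish "no signal" from an explicit reward of zero, which matters when rewards are averaged rather than summed.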
Reward Examples
Positive reward (success): for example, 1.0 returned when the agent completes the task.
Reward Design Patterns
Pattern 1: Sparse Rewards
Only reward at episode end.
Pros:
- Simple to implement
- Clear success/failure signal
- Easy to understand
Cons:
- Hard for agent to learn (delayed feedback)
- Credit assignment problem
- Slow learning
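A sparse scheme of this kind might look like the sketch below. The function name `sparse_reward` and its `done`/`success` flags are illustrative assumptions, as is the choice to return 0.0 (rather than null) on intermediate steps.

```python
def sparse_reward(done: bool, success: bool) -> float:
    """Return a reward only on the final step of an episode."""
    if not done:
        return 0.0  # intermediate steps carry no learning signal
    return 1.0 if success else 0.0


# A five-step episode that succeeds on its last step (illustrative).
rewards = [sparse_reward(done=(step == 4), success=True) for step in range(5)]
```

Every step before the last looks identical to the learner, which is where the credit assignment problem comes from.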
Pattern 2: Dense Rewards
Reward every meaningful action.
Pros:
- Faster learning
- Guides agent toward solution
- Better credit assignment
Cons:
- Harder to design
- Can bias agent toward suboptimal paths
- Risk of reward hacking
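One simple way to sketch a dense scheme is a lookup table of per-action rewards. The action names and values below are invented for illustration and are not part of ORS.

```python
# Hypothetical per-action rewards for a coding task (illustrative values).
STEP_REWARDS = {
    "read_relevant_file": 0.1,
    "write_patch": 0.2,
    "run_tests_pass": 0.7,
}


def dense_reward(action: str) -> float:
    """Reward each meaningful action; unrecognised actions earn nothing."""
    return STEP_REWARDS.get(action, 0.0)


trajectory = ["read_relevant_file", "list_directory", "write_patch", "run_tests_pass"]
rewards = [dense_reward(action) for action in trajectory]
```

A table like this also shows how reward hacking can arise: an agent could call `read_relevant_file` repeatedly to collect reward without ever solving the task, so a real design would likely cap or deduplicate such rewards.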
Pattern 3: Shaped Rewards
Reward based on distance to goal.
Pros:
- Strong learning signal
- Guides exploration
- Accelerates training
Cons:
- Requires domain knowledge
- Can create reward hacking opportunities
- Complex to implement
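A distance-based shaping term can be sketched as the reduction in distance to the goal between consecutive steps. The function name and the distance values are assumptions for illustration; how distance is measured is the domain knowledge the pattern requires.

```python
def shaped_reward(prev_distance: float, curr_distance: float) -> float:
    """Reward progress: positive when the agent moves closer to the goal."""
    return prev_distance - curr_distance


# Distance to goal observed after each step (illustrative numbers).
distances = [10.0, 7.0, 8.0, 3.0, 0.0]
rewards = [shaped_reward(a, b) for a, b in zip(distances, distances[1:])]
```

Because each reward is a difference of consecutive distances, the rewards telescope: their sum equals the starting distance minus the final distance, so moving back and forth earns nothing extra.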
Pattern 4: Penalty-Based Rewards
Penalize bad actions. Useful for:
- Safety constraints
- Efficiency requirements
- Avoiding bad behaviors
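The sketch below combines a task reward with two penalties, one for efficiency (a per-step cost) and one for safety (an unsafe action). All names and default magnitudes are assumptions for illustration.

```python
def penalised_reward(
    task_reward: float,
    steps_used: int,
    unsafe_action: bool,
    step_cost: float = 0.01,      # assumed efficiency penalty per step
    unsafe_penalty: float = 1.0,  # assumed safety penalty
) -> float:
    """Subtract efficiency and safety penalties from the task reward."""
    reward = task_reward - step_cost * steps_used
    if unsafe_action:
        reward -= unsafe_penalty
    return reward


efficient = penalised_reward(task_reward=1.0, steps_used=10, unsafe_action=False)
unsafe = penalised_reward(task_reward=1.0, steps_used=10, unsafe_action=True)
```

The relative size of the penalties encodes priorities: here a single unsafe action cancels a full task reward, while an extra step costs only a hundredth of it.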
Pattern 5: Binary Rewards
Simple success/failure: 1 if the task succeeded, 0 otherwise.
Reward Scales
Common Reward Ranges
[0, 1] scale:
- 0 = failure
- 0.5 = partial success
- 1 = perfect success
[-1, 1] scale:
- -1 = worst outcome
- 0 = neutral
- +1 = best outcome
Custom scale:
- Can use any range, but normalise for RL algorithms
- Some RL algorithms normalise automatically, e.g. the group advantage in GRPO
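The two kinds of normalisation mentioned above can be sketched as follows: a min-max rescale of a custom range into [0, 1], and a group-relative standardisation in the spirit of the GRPO advantage (each reward minus the group mean, divided by the group standard deviation). The function names, the use of the population standard deviation, and the `eps` term are assumptions; actual implementations may differ in these details.

```python
from statistics import mean, pstdev


def rescale(reward: float, low: float, high: float) -> float:
    """Min-max rescale a reward from [low, high] to [0, 1]."""
    return (reward - low) / (high - low)


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardise rewards within a group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


scaled = rescale(50.0, low=0.0, high=100.0)
advantages = group_advantages([0.0, 1.0, 0.0, 1.0])
```

With group-relative standardisation, only the ordering and spread of rewards within a group matter, which is why the absolute reward range is less critical for such algorithms.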
Cumulative Rewards
In RL, we often care about total episode reward:
- Higher total reward = better performance
- Compare across episodes for learning progress
- Use for evaluation metrics
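As a small illustration of using totals as evaluation metrics, the sketch below computes per-episode totals, their mean, and an accuracy figure for episodes that end with a binary reward. The episode data and the rule "a total of at least 1.0 counts as success" are assumptions made for this example.

```python
# Per-step rewards for four episodes with a binary final reward (illustrative).
episodes = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Total reward per episode.
totals = [sum(episode) for episode in episodes]

# Mean total reward, comparable across training checkpoints.
mean_return = sum(totals) / len(totals)

# Fraction of successful episodes (assumed threshold: total >= 1.0).
accuracy = sum(total >= 1.0 for total in totals) / len(totals)
```

With binary rewards the mean total and the accuracy coincide; with dense or shaped rewards they diverge, so it is worth reporting a task-level success metric alongside the mean total.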
Rewards for Evaluation
Rewards in ORS can be used for evaluation as well as training: for example, averaging binary rewards across tasks yields accuracy.
Next Steps
- Tools: Design tools that return rewards
- Tasks & Splits: Organize tasks for RL training
- Sessions & Episodes: Understand episode lifecycle and reward accumulation
- Implementing a Server: Build an ORS server with reward logic
Key Takeaway: Rewards are the learning signal for RL. Design them carefully to align with your true objective, provide timely feedback, and avoid unintended behaviors. Good reward design is critical for successful RL training.

