Rewards
Rewards are numeric feedback signals that enable reinforcement learning with ORS. They tell the agent which actions lead to success, forming the foundation of RL training.Why Rewards?
From the ORS specification:“Actions in an environment yield states with rewards; for example, submitting a correct solution to a mathematics problem may yield a positive reward.”Primary purpose: Enable RL training of language model agents Secondary purpose: Provide evaluation scores
What is a Reward?
A reward is a numeric signal (typically a float) that indicates how good an action was:Reward Examples
Positive reward (success):Reward Design Patterns
Pattern 1: Sparse Rewards
Only reward at episode end:Pattern 2: Dense Rewards
Reward every meaningful action:Pattern 3: Shaped Rewards
Reward based on distance to goal:Pattern 4: Penalty-Based Rewards
Penalize bad actions:Pattern 5: Binary Rewards
Simple success/failure:Pattern 6: Continuous Rewards
Gradual reward based on quality:Reward Scales
Common Reward Ranges
[0, 1] scale (most common):- 0 = failure
- 1 = perfect success
- 0.5 = partial success
- -1 = worst outcome
- 0 = neutral
- +1 = best outcome
Recommendation
Use [0, 1] scale: -Standard in RL -Works well with most algorithms -Easy to interpret -Prevents reward explosionCumulative Rewards
In RL, we often care about total episode reward:Reward Design Principles
1. Align with True Objective
2. Be Consistent
3. Provide Immediate Feedback
4. Avoid Reward Hacking
5. Scale Appropriately
Common Pitfalls
Pitfall 1: Reward Sparsity
Problem: Agent never learns because no feedback. Solution: Add intermediate rewards or shape reward function.Pitfall 2: Reward Hacking
Problem: Agent finds unintended way to maximize reward. Example:Pitfall 3: Reward Scaling Issues
Problem: Some rewards dominate others. Example:Pitfall 4: Ambiguous Rewards
Problem: Same reward for different outcomes. Example:Rewards for Evaluation
While rewards are primarily for RL training, they’re also useful for evaluation:Next Steps
Tools
Design tools that return rewards
Tasks & Splits
Organize tasks for RL training
Sessions & Episodes
Understand episode lifecycle and reward accumulation
Implementing a Server
Build an ORS server with reward logic
Key Takeaway: Rewards are the learning signal for RL. Design them carefully to align with your true objective, provide timely feedback, and avoid unintended behaviors. Good reward design is critical for successful RL training.

