Quick Start

Get started with ORS by building a simple math environment server and testing it locally.

What You’ll Build

A working ORS server for math problems (GSM8K-style) with:
  • One tool (submit) for submitting answers
  • Train and test task splits
  • Reward signals for RL training
  • Local HTTP server you can test with curl or Python
Time: ~15 minutes

Prerequisites

  • Python 3.8+ installed
  • Basic Python knowledge
  • Terminal/command line access

Step 1: Install Dependencies

The OpenReward Python SDK is one implementation of the ORS specification. We’ll use it for this quickstart.
pip install openreward pandas
Remember: The Python SDK is ONE way to implement ORS. You can also implement the HTTP protocol from scratch in any language.

Step 2: Download GSM8K Data

Download the GSM8K dataset files from the Hugging Face repository:
  1. Download train-00000-of-00001.parquet
  2. Download test-00000-of-00001.parquet
Place both files in your working directory.
GSM8K is a dataset of grade school math word problems. Each task has a question and an answer; the answer field holds a worked solution that ends with #### followed by the final integer answer.
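To make the answer format concrete, here is a minimal sketch that splits a GSM8K answer field into its worked solution and its final answer. The helper name split_gsm8k_answer is hypothetical and not part of the SDK; the sample string is the first training task used later in this guide.

```python
from typing import Tuple


def split_gsm8k_answer(answer: str) -> Tuple[str, str]:
    """Split a GSM8K answer field into (worked_solution, final_answer).

    Hypothetical helper: assumes the final answer follows the last '####'.
    If no '####' is present, the whole string is treated as the final answer.
    """
    solution, _, final = answer.rpartition("####")
    return solution.strip(), final.strip()


sample = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)

solution, final = split_gsm8k_answer(sample)
print(final)  # prints: 72
```

The environment in Step 3 only needs the final answer, so it uses a plain split on "####" rather than keeping the worked solution.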

Step 3: Create Your Environment

Create a file gsm8k_env.py:
from openreward.environments import Environment, Server, tool
from openreward.environments.types import ToolOutput, TextBlock
from pydantic import BaseModel
import pandas as pd

# Load GSM8K tasks from parquet files
train_tasks = pd.read_parquet("train-00000-of-00001.parquet").to_dict(orient="records")
test_tasks = pd.read_parquet("test-00000-of-00001.parquet").to_dict(orient="records")

# Add IDs to tasks
for i, task in enumerate(train_tasks):
    task['id'] = str(i)
for i, task in enumerate(test_tasks):
    task['id'] = str(i)

# Tool parameter schema (must be defined before GSM8KEnvironment)
class SubmitParams(BaseModel):
    answer: str

class GSM8KEnvironment(Environment):
    """GSM8K math problem environment"""

    @classmethod
    def list_splits(cls):
        return ["train", "test"]

    @classmethod
    def list_tasks(cls, split: str):
        if split == "train":
            return train_tasks
        elif split == "test":
            return test_tasks
        raise ValueError(f"Unknown split: {split}")

    def get_prompt(self):
        question = self.task_spec["question"]
        return [TextBlock(text=question)]

    @tool
    def submit(self, params: SubmitParams) -> ToolOutput:
        """Submit your answer to the math problem"""
        # Extract the final answer from GSM8K format (after ####)
        gold_answer = self.task_spec["answer"].split("####")[-1].strip()
        user_answer = str(params.answer).strip()

        if user_answer == gold_answer:
            return ToolOutput(
                blocks=[TextBlock(text="Correct!")],
                reward=1.0,
                finished=True
            )
        else:
            return ToolOutput(
                blocks=[TextBlock(text=f"Incorrect. The answer was {gold_answer}.")],
                reward=0.0,
                finished=True
            )

# Create and run server
if __name__ == "__main__":
    server = Server([GSM8KEnvironment])
    server.run(port=8080)
What this code does:
  • Loads GSM8K tasks from the parquet files
  • Defines an ORS environment with math tasks
  • Implements list_splits(), list_tasks(), and get_prompt()
  • Creates a submit tool that checks answers and returns rewards
  • Starts an HTTP server on port 8080
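The submit tool above compares answers as exact strings, so a submission such as "72.0" or "1,000" would be scored as incorrect even when it is numerically right. One possible refinement, sketched here under the assumption that lenient numeric matching is acceptable for your use case, is to normalize both sides before comparing. The helpers normalize_answer and answers_match are hypothetical names, not SDK functions.

```python
from decimal import Decimal, InvalidOperation


def normalize_answer(text: str):
    """Return a Decimal for numeric-looking answers, else the cleaned string.

    Assumption: thousands separators and a leading '$' carry no meaning.
    """
    cleaned = text.strip().replace(",", "").lstrip("$")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Not a number: fall back to comparing the cleaned text.
        return cleaned


def answers_match(user_answer: str, gold_answer: str) -> bool:
    """Compare two answers numerically when possible, textually otherwise."""
    return normalize_answer(user_answer) == normalize_answer(gold_answer)


print(answers_match("72.0", "72"))    # prints: True
print(answers_match("1,000", "1000")) # prints: True
print(answers_match("73", "72"))      # prints: False
```

Inside submit, the condition user_answer == gold_answer could then be swapped for answers_match(user_answer, gold_answer). Whether that leniency is desirable depends on how strictly you want the reward to reflect output formatting.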

Step 4: Run the Server

python gsm8k_env.py
You should see:
INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
Your ORS server is now running!

Step 5: Test with HTTP

Let’s test the server with curl. Open a new terminal:

List environments

curl http://localhost:8080/list_environments
Response:
["gsm8kenvironment"]

List tools

curl http://localhost:8080/gsm8kenvironment/tools
Response:
{
  "tools": [
    {
      "name": "submit",
      "description": "Submit your answer to the math problem",
      "input_schema": {...}
    }
  ]
}

List splits

curl http://localhost:8080/gsm8kenvironment/splits
Response:
[
  {"name": "train", "type": "train"},
  {"name": "test", "type": "test"}
]

List tasks

curl -X POST http://localhost:8080/gsm8kenvironment/tasks \
  -H "Content-Type: application/json" \
  -d '{"split": "train"}'
Response (first 2 tasks shown):
{
  "tasks": [
    {
      "id": "0",
      "question": "Natalia sold clips to 48 of her friends in April...",
      "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n...#### 72"
    },
    {
      "id": "1",
      "question": "Weng earns $12 an hour for babysitting...",
      "answer": "...#### 20"
    }
  ],
  "env_name": "gsm8kenvironment"
}
The full dataset contains 7,473 training tasks and 1,319 test tasks.
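The same task listing can be requested from Python without the SDK. The sketch below uses only the standard library and mirrors the curl call above; the function names are illustrative, and the response is assumed to carry the task list under a "tasks" key, as in the sample response.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # local server from Step 4


def build_tasks_request(env_name: str, split: str, base_url: str = BASE_URL):
    """Build the POST request for /<env_name>/tasks without sending it."""
    body = json.dumps({"split": split}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{env_name}/tasks",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_tasks(env_name: str, split: str, base_url: str = BASE_URL):
    """Send the request and return the task list (requires a running server)."""
    request = build_tasks_request(env_name, split, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["tasks"]
```

With the server running, fetch_tasks("gsm8kenvironment", "train") should return the same tasks that curl printed.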

Step 6: Run an Episode

Now let’s run a complete episode (session):

Create session ID

curl -X POST http://localhost:8080/create_session
Response:
{"sid": "abc-123-def-456"}
Save this session ID for the next steps.

Create episode

Use a task from the dataset:
curl -X POST http://localhost:8080/create \
  -H "X-Session-ID: abc-123-def-456" \
  -H "Content-Type: application/json" \
  -d '{
    "env_name": "gsm8kenvironment",
    "task_spec": {
      "id": "0",
      "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
      "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72"
    },
    "secrets": {}
  }'
Response:
{"sid": "abc-123-def-456"}
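Writing the task_spec by hand is error-prone, since it has to match a task from the dataset. As a rough sketch, the request can be assembled in Python from a task dictionary (for example, one returned by the tasks endpoint). The function name build_create_request is hypothetical; the payload fields and the X-Session-ID header follow the curl call above.

```python
import json
import urllib.request


def build_create_request(sid: str, env_name: str, task: dict,
                         base_url: str = "http://localhost:8080"):
    """Build the POST /create request that starts an episode for one task."""
    payload = {
        "env_name": env_name,
        "task_spec": task,   # the task dictionary is passed through unchanged
        "secrets": {},       # no secrets are needed for this environment
    }
    return urllib.request.Request(
        url=f"{base_url}/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Session-ID": sid, "Content-Type": "application/json"},
        method="POST",
    )


# Example task, shortened from the first GSM8K training task.
task = {"id": "0", "question": "Natalia sold clips...", "answer": "...#### 72"}
request = build_create_request("abc-123-def-456", "gsm8kenvironment", task)
print(request.full_url)  # prints: http://localhost:8080/create
```

Passing the request to urllib.request.urlopen sends it to the running server.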

Get prompt

curl http://localhost:8080/gsm8kenvironment/prompt \
  -H "X-Session-ID: abc-123-def-456"
Response:
[
  {
    "text": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "detail": null,
    "type": "text"
  }
]

Call submit tool

curl -N -X POST http://localhost:8080/gsm8kenvironment/call \
  -H "X-Session-ID: abc-123-def-456" \
  -H "Accept: text/event-stream" \
  -H "Content-Type: application/json" \
  -d '{"name": "submit", "input": {"answer": "72"}}'
Response (SSE stream):
event: task_id
data: 877bb56c594e4a0f921ad55c439a3762

event: end
data: {"ok":true,"output":{"blocks":[{"text":"Correct!","detail":null,"type":"text"}],"metadata":null,"reward":1.0,"finished":true}}
Success! The agent got reward 1.0 and finished: true.
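The tool result arrives as a server-sent event stream rather than a single JSON body, so a client has to pick out the end event and decode its data. The following is a minimal sketch of that parsing, assuming the stream has the shape shown above (event and data lines, with a blank line between events). The function names are illustrative and the parser ignores other SSE fields such as id and retry.

```python
import json


def parse_sse(stream_text: str):
    """Parse SSE text into a list of (event_name, data) tuples."""
    events = []
    event_name, data_lines = None, []
    for line in stream_text.splitlines():
        if line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and (event_name is not None or data_lines):
            # A blank line terminates the current event.
            events.append((event_name or "message", "\n".join(data_lines)))
            event_name, data_lines = None, []
    if event_name is not None or data_lines:
        # Flush a final event that was not followed by a blank line.
        events.append((event_name or "message", "\n".join(data_lines)))
    return events


def final_output(stream_text: str):
    """Return the decoded JSON payload of the 'end' event, or None."""
    for name, data in parse_sse(stream_text):
        if name == "end":
            return json.loads(data)
    return None


SAMPLE_STREAM = (
    "event: task_id\n"
    "data: 877bb56c594e4a0f921ad55c439a3762\n"
    "\n"
    "event: end\n"
    'data: {"ok":true,"output":{"blocks":[{"text":"Correct!","detail":null,'
    '"type":"text"}],"metadata":null,"reward":1.0,"finished":true}}\n'
)

result = final_output(SAMPLE_STREAM)
print(result["output"]["reward"], result["output"]["finished"])  # prints: 1.0 True
```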

Cleanup

curl -X POST http://localhost:8080/delete \
  -H "X-Session-ID: abc-123-def-456"

Step 7: Test with Python Client

Create test_client.py:
from openreward import OpenReward

# Connect to local server
client = OpenReward()
env = client.environments.get(
    name="gsm8kenvironment",
    base_url="http://localhost:8080"
)

# Get tasks
tasks = env.list_tasks(split="train")
print(f"Found {len(tasks)} training tasks")

# Run an episode
task = tasks[0]  # First task from GSM8K

with env.session(task=task) as session:
    # Get prompt
    prompt = session.get_prompt()
    print(f"Question: {prompt[0].text[:80]}...")  # Show first 80 chars

    # Submit answer (the correct answer is 72)
    result = session.call_tool("submit", {"answer": "72"})

    print(f"Result: {result.blocks[0].text}")
    print(f"Reward: {result.reward}")
    print(f"Finished: {result.finished}")
Run it:
python test_client.py
Output:
Found 7473 training tasks
Question: Natalia sold clips to 48 of her friends in April, and then she sold half...
Result: Correct!
Reward: 1.0
Finished: True

Understanding the Code

Key Components

1. Environment Class
class GSM8KEnvironment(Environment):
Inherits from the Environment base class, which handles HTTP protocol details.

2. Splits and Tasks
@classmethod
def list_splits(cls):
    return ["train", "test"]

@classmethod
def list_tasks(cls, split: str):
    return [...]  # Task list
Organize problems into train/test sets.

3. Prompt Generation
def get_prompt(self):
    return [TextBlock(text=self.task_spec["question"])]
Convert the task into the initial agent prompt.

4. Tools
@tool
def submit(self, params: SubmitParams) -> ToolOutput:
    # Check the answer, then return a reward and a finished signal
    ...
Actions agents can take. Each tool returns a ToolOutput with a reward and a finished flag.

5. Tool Output
ToolOutput(
    blocks=[TextBlock(text="Correct!")],
    reward=1.0,  # RL feedback signal
    finished=True  # Episode termination
)
Structured response with content, reward, and termination signal.

What You’ve Learned

  • How to implement an ORS server using the Python SDK
  • Core ORS concepts: splits, tasks, tools, prompts, rewards
  • How sessions (episodes) work
  • The HTTP API for ORS
  • How to test an ORS server locally

Common Issues

“ModuleNotFoundError: No module named ‘openreward’”

Install the SDK:
pip install openreward

“404 Environment not found”

Check that the environment name matches the lowercased class name:
  • Class: GSM8KEnvironment
  • Name: gsm8kenvironment
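If you are unsure which name to use in URLs, you can derive it from the class. This sketch assumes the server names each environment by lowercasing the class name, which is what the GSM8KEnvironment example suggests; it is not a documented guarantee. The stand-in class below only exists to keep the snippet self-contained.

```python
class GSM8KEnvironment:
    """Stand-in for the environment class defined in gsm8k_env.py."""


def env_name_for(env_class: type) -> str:
    """Return the assumed URL name for an environment class."""
    return env_class.__name__.lower()


print(env_name_for(GSM8KEnvironment))  # prints: gsm8kenvironment
```

When in doubt, the list_environments endpoint from Step 5 shows the names the server actually registered.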

“Connection refused”

Make sure the server is running:
python gsm8k_env.py

“Session not found”

Create a new session ID:
curl -X POST http://localhost:8080/create_session

Congratulations! You’ve built your first ORS server. You now understand the core concepts and can start building more complex environments for RL training and agent evaluation.