SPAgent

SPAgent is a foundation agent for the physical & spatial world

Perception, reasoning, and action in the physical and spatial world, powered by an open-ended multimodal ecosystem of tools spanning 2D, 3D, world models, and beyond.


Capabilities

Modular Tool System

Mix and match any combination of expert tools. Add or remove tools at runtime with a single function call.
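A runtime registry for this pattern might look like the following minimal sketch. The `add_tool`/`remove_tool` names, the registry shape, and the stub tool classes are assumptions for illustration, not SPAgent's actual API.

```python
class ToolRegistry:
    """Minimal sketch of a runtime tool registry (names are illustrative)."""

    def __init__(self, tools=None):
        # Map tool name -> tool instance for O(1) add/remove/lookup.
        self._tools = {t.name: t for t in (tools or [])}

    def add_tool(self, tool):
        # Adding a tool at runtime is a single call: register it by name.
        self._tools[tool.name] = tool

    def remove_tool(self, name):
        # Removing is equally cheap; unknown names are ignored.
        self._tools.pop(name, None)

    def available(self):
        return sorted(self._tools)


class DepthTool:
    name = "depth_estimation"


class SegTool:
    name = "segmentation"


registry = ToolRegistry([DepthTool()])
registry.add_tool(SegTool())
registry.remove_tool("depth_estimation")
print(registry.available())  # ['segmentation']
```

Keeping the registry keyed by name is what makes single-call add/remove possible: the agent only ever sees the current snapshot of `available()`.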

Open Tool Integration

Integrate any tool into the ecosystem. Define a schema, plug it in, and the agent will use it automatically.
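Integration could follow a pattern like this sketch: a tool exposes a JSON-style schema the agent can read, plus a `run()` implementation. The `schema`/`run` shape and the `WeatherTool` example are hypothetical, chosen only to illustrate the define-a-schema-and-plug-it-in flow.

```python
class WeatherTool:
    """Hypothetical custom tool: a declarative schema plus a run() method."""

    name = "weather"
    # JSON-style schema the agent reads to decide when and how to call the tool.
    schema = {
        "name": "weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }

    def run(self, city):
        # Stubbed result; a real tool would query a service or model here.
        return {"city": city, "condition": "clear"}


def describe_tools(tools):
    # An agent can consume every registered tool's schema automatically.
    return [t.schema for t in tools]


print(describe_tools([WeatherTool()])[0]["name"])  # weather
```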

Spatial Reasoning

Purpose-built prompts for 3D spatial understanding. Grounded perception in complex physical environments.

RL Training

Built-in reinforcement learning pipeline. Train agents with tool-calling rewards.
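A tool-calling reward can be as simple as the following sketch, which scores an episode on whether the expected tools were invoked; the trajectory format, the weights, and the answer bonus are assumptions, not SPAgent's actual reward design.

```python
def tool_call_reward(trajectory, expected_tools, answer_correct):
    """Score one episode: credit for calling expected tools, plus an answer bonus.

    trajectory: list of tool names the agent actually called (assumed format).
    """
    called = set(trajectory)
    expected = set(expected_tools)
    # Fraction of expected tools that were actually used.
    coverage = len(called & expected) / len(expected) if expected else 1.0
    # Penalize spurious calls to discourage tool spam.
    spurious = len(called - expected)
    reward = 0.5 * coverage - 0.1 * spurious
    if answer_correct:
        reward += 1.0
    return reward


r = tool_call_reward(["depth_estimation", "segmentation"],
                     ["depth_estimation", "segmentation"], True)
print(round(r, 2))  # 1.5
```

A shaped reward like this (coverage term plus spurious-call penalty) gives the policy gradient signal even on episodes where the final answer is wrong.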

Supported tools

An open-ended ecosystem spanning 2D perception, 3D reconstruction, video generation, and beyond.

2D Perception

Depth Anything V2

High-accuracy monocular depth estimation for dense depth maps from a single image.

:20019

SAM 2

Promptable image and video segmentation with fast, precise masks and tracking.

:20020

Grounding DINO

Open-vocabulary object detection driven by natural-language prompts and referring expressions.

:20022

Moondream

A small, fast vision-language model for captioning, visual Q&A, and lightweight visual reasoning.

:20024

YOLO-E / Supervision

Real-time open-vocabulary detection and segmentation with annotation, tracking, and visualization utilities.

local

3D Reconstruction

Pi3 / Pi3X

3D point cloud reconstruction from single or multiple images, with Pi3X adding smoother metric-scale outputs.

:20030 / :20031

VGGT

Feed-forward multi-view 3D reconstruction with camera pose, depth, and geometry prediction in one pass.

:20032

MapAnything

Universal metric 3D reconstruction for dense point clouds, depth, poses, and multi-view geometry.

:20033

Video Generation

Veo

Cinematic text-to-video and image-to-video generation with audio and strong creative control.

API

Sora

Text-to-video and image-to-video generation for realistic, dynamic scenes with strong prompt fidelity.

API

Quick start

from spagent import SPAgent
from spagent.models import GPTModel
from spagent.tools import DepthEstimationTool, SegmentationTool

# Create model and tools
model = GPTModel(model_name="gpt-4o-mini")
tools = [
    DepthEstimationTool(),
    SegmentationTool()
]

# Create agent and solve
agent = SPAgent(model=model, tools=tools)
result = agent.solve_problem(
    "image.jpg",
    "Analyze depth relationships and main objects"
)
print(result['answer'])

Architecture

SPAgent Core

Agent logic, tool registry, prompt system, data collection

Tools

Modular expert implementations with client/server architecture
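The client/server split might look like this sketch: each expert runs as a service on its own port (e.g. :20019 for depth estimation), and the client only has to build a request. The `/infer` route and the payload fields are assumptions for illustration, not SPAgent's wire format.

```python
import base64
import json


def build_tool_request(port, route, image_bytes, params=None):
    """Construct the URL and JSON payload for a hypothetical tool-server call.

    The '/infer' route and payload fields are illustrative only.
    """
    url = f"http://localhost:{port}/{route.lstrip('/')}"
    payload = {
        # Images travel as base64 so the payload is plain JSON.
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "params": params or {},
    }
    return url, json.dumps(payload)


url, body = build_tool_request(20019, "/infer", b"\x89PNG...")
print(url)  # http://localhost:20019/infer
```

Because each tool lives behind its own port, heavy models can run on separate machines while the agent core stays lightweight.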

Models

A unified interface for leading open-source and closed-source models.

Training

Reinforcement learning with supervised fine-tuning

Research

arXiv 2026

Think3D: Thinking with Space for Spatial Reasoning

Zaibin Zhang*, Yuhan Wu*, Lianjie Jia*, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, Lijun Wang, Huchuan Lu.

* Equal contribution

Teaching agents to think in 3D space like humans, through drag-based spatial interaction.

Institutions

Dalian University of Technology
University of California, San Diego
University of Oxford

Get in touch

dlutzzb@gmail.com

Start building with SPAgent

Open-source and ready to use. Deploy expert tools, connect your model, and reason about the physical world.
