A self-hostable, open-source harness for researching LLM tool-calling behavior. Design fake tools, compose testing plans, and analyze model decisions.
LLMs make hundreds of tool-calling decisions. Llaboratory lets you design controlled experiments to understand how — and why — they choose the tools they do.
Create tools with static, dynamic (Python), or manual responses. Parameter schemas, descriptions, and response modes are all first-class experimental variables.
Assemble tools, model configs, and prompts into versioned, reproducible testing plans. Pin tool versions and freeze model snapshots for exact reproducibility.
Execute sessions with real-time streaming. Watch model responses and tool calls arrive incrementally. Manual tools let you interactively shape the conversation.
Per-session metrics, within-plan aggregation, and cross-model comparison. Tool-selection rates, call-order patterns, termination reasons — all exportable.
Manual tool responses are recorded and automatically replayed in subsequent runs. Run dozens of repetitions without manual intervention while keeping human data.
Share tool libraries and plans as portable JSON bundles. Imported dynamic tools are gated behind explicit user approval — no arbitrary code execution.
From designing an experiment to publishing findings in four steps.
Create fake tools with static payloads, dynamic Python responses, or manual prompting. Each save creates an immutable version.
Point to any OpenAI-compatible endpoint (OpenRouter, LM Studio, etc.). Set the model snapshot, params, and API key via environment variables.
Select tools, choose a model, write system/user prompts, and set run parameters. Snapshot everything into an immutable plan version.
Launch sessions with live streaming. Inspect every tool call and model response. Aggregate across runs and compare models. Export data for write-ups.
Live session view showing model conversation and tool calls.
Docker Compose is the quickest way to get up and running.