v0.2.0 · Open Source · MIT Llaboratory logo

Study how LLMs choose tools

A self-hostable, open-source harness for researching LLM tool-calling behavior. Design fake tools, compose testing plans, and analyze model decisions.

Get Started GitHub

Why Llaboratory?

LLMs make hundreds of tool-calling decisions. Llaboratory lets you design controlled experiments to understand how — and why — they choose the tools they do.

Design Fake Tools

Create tools with static, dynamic (Python), or manual responses. Parameter schemas, descriptions, and response modes are all first-class experimental variables.

Compose Testing Plans

Assemble tools, model configs, and prompts into versioned, reproducible testing plans. Pin tool versions and freeze model snapshots for exact reproducibility.

Run & Watch Live

Execute sessions with real-time streaming. Watch model responses and tool calls arrive incrementally. Manual tools let you interactively shape the conversation.

Analyze Results

Per-session metrics, within-plan aggregation, and cross-model comparison. Tool-selection rates, call-order patterns, termination reasons — all exportable.

Record & Replay

Manual tool responses are recorded and automatically replayed in subsequent runs. Run dozens of repetitions without manual intervention while keeping human data.

Import & Export

Share tool libraries and plans as portable JSON bundles. Imported dynamic tools are gated behind explicit user approval — no arbitrary code execution.

Workflow

From designing an experiment to publishing findings in four steps.

1. Build your tool library

Create fake tools with static payloads, dynamic Python responses, or manual prompting. Each save creates an immutable version.

2. Configure models

Point to any OpenAI-compatible endpoint (OpenRouter, LM Studio, etc.). Set the model snapshot, params, and API key via environment variables.

3. Assemble a testing plan

Select tools, choose a model, write system/user prompts, and set run parameters. Snapshot everything into an immutable plan version.

4. Run & analyze

Launch sessions with live streaming. Inspect every tool call and model response. Aggregate across runs and compare models. Export data for write-ups.

See it in action

Live session view showing model conversation and tool calls.

Llaboratory screenshot showing the tool-calling session UI

Get started in seconds

Docker Compose is the quickest way to get up and running.

# Clone & launch
git clone https://github.com/ampyard/Llaboratory.git
cd Llaboratory
docker compose up --build
Full Quickstart Guide