Workflow

Llaboratory follows a five-step workflow from experiment design to analysis.

1. Tool Library

The Tool Library is where you create and manage fake tools. Each tool consists of:

A model-facing name and description — these are primary experimental variables you can tweak across plan versions.
A parameter schema defined as JSON Schema. Use the built-in field builder or edit raw JSON.
A response mode: Static, Dynamic, or Manual (described under Response modes).

Response modes

Static — Returns a fixed payload regardless of arguments. Useful for deterministic baselines.
Dynamic — Runs a user-written Python function: def respond(args, context) -> response. The context object provides session-scoped mutable state for stateful tools (e.g., a fake database that remembers writes). Dynamic code runs in-process without sandboxing — only run code you trust.
Manual — Pauses the session and prompts the human for a response. Responses are automatically recorded and replayed in subsequent runs (see Record & Replay below).
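As an illustration of the Dynamic mode, here is a minimal sketch of a stateful fake key-value tool. It follows the respond(args, context) signature above, but assumes that context behaves like a plain dict that persists for the session; the argument names (action, key, value) and the response shape are hypothetical, not part of Llaboratory.

```python
def respond(args, context):
    """Fake key-value store: 'write' stores a value, 'read' returns it."""
    # Assumption: context is dict-like and shared by all calls in one session.
    store = context.setdefault("kv_store", {})
    action = args.get("action")
    key = args.get("key")

    if action == "write":
        store[key] = args.get("value")
        return {"ok": True, "key": key}

    if action == "read":
        if key in store:
            return {"ok": True, "key": key, "value": store[key]}
        return {"ok": False, "error": f"no such key: {key}"}

    return {"ok": False, "error": f"unknown action: {action}"}


# Simulate one session by reusing the same context object across calls.
session_context = {}
first = respond({"action": "write", "key": "color", "value": "teal"}, session_context)
second = respond({"action": "read", "key": "color"}, session_context)
missing = respond({"action": "read", "key": "size"}, session_context)
```

Because the write is stored in the context rather than in a module-level variable, a fresh session (a fresh context) starts with an empty store.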

Every save creates an immutable ToolVersion. Edits always produce a new version; prior versions remain referenceable forever. Plans pin specific versions, so updating a tool never breaks an existing experiment.
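The versioning rule can be pictured as an append-only history. The sketch below is an assumption about the shape of the idea, not Llaboratory's actual data model: the ToolHistory class, its methods, and the field names are hypothetical.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class ToolVersion:
    tool_name: str
    version: int
    description: str
    parameter_schema: dict  # JSON Schema describing the tool's arguments


class ToolHistory:
    """Append-only list of versions for one tool (hypothetical helper)."""

    def __init__(self, tool_name):
        self.tool_name = tool_name
        self._versions = []

    def save(self, description, parameter_schema):
        # Every save appends a new version; nothing is overwritten.
        version = ToolVersion(
            self.tool_name, len(self._versions) + 1, description, parameter_schema
        )
        self._versions.append(version)
        return version

    def get(self, version):
        return self._versions[version - 1]


schema = {
    "type": "object",
    "properties": {"offering": {"type": "string"}},
    "required": ["offering"],
}
history = ToolHistory("summon_cat")
v1 = history.save("Summons a cat.", schema)
v2 = history.save("Summons a cat using an offering.", schema)
pinned = history.get(1)  # a plan that pinned version 1 still sees version 1

try:
    v1.description = "edited in place"
    edit_blocked = False
except FrozenInstanceError:
    edit_blocked = True
```

A frozen dataclass only blocks attribute reassignment; the schema dict inside it is still mutable, so a real implementation would likely store a serialized copy.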

Built-in tools

Llaboratory ships with 10 built-in tools that demonstrate the range of what's possible. Built-in tools are read-only — they cannot be edited or deleted — but you can clone them to create your own editable copies. The built-in set includes:

read_mood — Reads the ethereal mood aura of any question or situation.
pet_butterfly — Gently attempts to pet a butterfly. Results may vary.
vibe_check — Determines whether a statement passes the cosmic vibe check.
summon_cat — Summons a cat using ancient incantations and an offering.
existential_crisis_button — A large red button labeled "DO NOT PRESS."
snake_oil — Sells questionable remedies for whatever ails you.
submit_request_to_government — The bureaucratic process for submitting requests.
gossip_mill — Returns the juiciest fake gossip about any subject.
slap_bad_human — Administers a dramatic, harmless slap to a misbehaving human.

2. Model Configs

Model Configs store the connection details for an LLM provider. Configure:

Provider kind — OpenAI-compatible (v1). Works with OpenAI, OpenRouter, LM Studio, and any endpoint exposing the chat completions API.
Base URL — The API endpoint.
Model snapshot — The exact model identifier (e.g. gpt-4o-2024-08-06). Do not use aliases; pin the exact version.
Parameters — temperature, top_p, seed, max_tokens, tool_choice.
API key env var — The name of the environment variable (e.g. OPENAI_API_KEY). The key value itself is never stored.
Pricing — Input/output cost per 1k tokens for cost accounting.
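To make the key handling and cost accounting concrete, here is a rough sketch. The ModelConfig class and its field names are hypothetical, and the prices are placeholders rather than real provider rates; only the per-1k-token arithmetic is taken from the description above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    base_url: str
    model_snapshot: str
    api_key_env_var: str        # name of the variable, never the key itself
    input_price_per_1k: float   # placeholder currency units per 1k input tokens
    output_price_per_1k: float  # placeholder currency units per 1k output tokens

    def api_key(self):
        # The key is read from the environment only when it is needed.
        return os.environ.get(self.api_key_env_var)

    def cost(self, input_tokens, output_tokens):
        return (
            input_tokens / 1000 * self.input_price_per_1k
            + output_tokens / 1000 * self.output_price_per_1k
        )


config = ModelConfig(
    base_url="https://api.openai.com/v1",
    model_snapshot="gpt-4o-2024-08-06",
    api_key_env_var="OPENAI_API_KEY",
    input_price_per_1k=0.0025,  # placeholder value
    output_price_per_1k=0.01,   # placeholder value
)
session_cost = config.cost(input_tokens=2000, output_tokens=500)
```

With these placeholder prices, 2000 input tokens and 500 output tokens each contribute 0.005, for a session cost of about 0.01.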

3. Plans

A Plan assembles everything needed for an experiment:

An ordered set of tools (with specific versions).
A model config (copied by value into the plan version).
System prompt and user/starting prompt.
Run settings: repetitions, tool-order strategy (fixed or randomized_per_session), and agent-loop limits.

Saving a plan creates an immutable PlanVersion. All subsequent sessions are bound to this version. You can create as many plan versions as needed to track prompt tweaks, tool changes, or model parameter variations — each with its own audit trail.
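The two tool-order strategies can be sketched as follows. This is an illustration only: the order_tools function and the per-session seed are assumptions, and Llaboratory may derive its ordering differently.

```python
import random


def order_tools(tool_names, strategy, session_seed=None):
    """Return the order in which tools are presented to the model."""
    if strategy == "fixed":
        return list(tool_names)
    if strategy == "randomized_per_session":
        # Assumption: a per-session seed keeps each session's order reproducible.
        rng = random.Random(session_seed)
        shuffled = list(tool_names)
        rng.shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown tool-order strategy: {strategy}")


tools = ["read_mood", "vibe_check", "summon_cat", "gossip_mill"]
fixed_order = order_tools(tools, "fixed")
session_a = order_tools(tools, "randomized_per_session", session_seed=1)
session_a_again = order_tools(tools, "randomized_per_session", session_seed=1)
```

Randomizing the order per session helps separate a model's preference for a tool from a preference for whatever happens to be listed first.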

4. Sessions

A Session executes one run of a PlanVersion. Launch sessions from the plan detail view. Key behaviors:

Live streaming — Watch model responses and tool calls arrive incrementally as the provider streams them.
Agent loop — The model responds and may request tool calls; the tools run, their results are fed back to the model, and the loop continues until a termination condition fires (max turns, max tool calls, loop guard, timeout, or user abort).
Manual tools — Sessions pause at awaiting_manual_input status. Provide a response in the UI; the session resumes automatically.
Record & Replay — Manual responses are recorded keyed by (tool_version, args_hash, occurrence_index). Subsequent sessions automatically replay matching calls — no human needed.
Batch runs — Set repetitions to N > 1 to launch N sessions at once (up to 5 run concurrently).
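The Record & Replay key can be sketched like this. The hashing scheme (SHA-256 over canonical JSON), the class names, and the reading of occurrence_index as "how many times this exact call has already appeared in the session" are all assumptions made for illustration.

```python
import hashlib
import json
from collections import defaultdict


def args_hash(args):
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ReplayStore:
    """Recorded manual responses keyed by (tool_version, args_hash, occurrence_index)."""

    def __init__(self):
        self._recordings = {}

    def record(self, tool_version, args, occurrence_index, response):
        self._recordings[(tool_version, args_hash(args), occurrence_index)] = response

    def lookup(self, tool_version, args, occurrence_index):
        return self._recordings.get((tool_version, args_hash(args), occurrence_index))


class SessionReplayer:
    """Counts repeated calls within one session and looks up recordings."""

    def __init__(self, store):
        self.store = store
        self._seen = defaultdict(int)

    def next_response(self, tool_version, args):
        call = (tool_version, args_hash(args))
        index = self._seen[call]
        self._seen[call] += 1
        # None means no recording exists, so a human would be asked.
        return self.store.lookup(tool_version, args, index)


store = ReplayStore()
store.record("summon_cat@v1", {"offering": "tuna", "chant": "meow"}, 0, "A cat appears.")
store.record("summon_cat@v1", {"offering": "tuna", "chant": "meow"}, 1, "The cat leaves.")

replayer = SessionReplayer(store)
same_call = {"chant": "meow", "offering": "tuna"}  # same args, different key order
replies = [replayer.next_response("summon_cat@v1", same_call) for _ in range(3)]
```

Here the first two identical calls replay their recorded responses in order, and the third finds no recording.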

Termination conditions

completed_no_tool_call — Model returned a final message without calling a tool.
max_turns — Hit the turn limit (default 20).
max_tool_calls — Hit the tool call limit (default 50).
loop_guard — Same tool + same args 5× in a row.
timeout — Wall clock exceeded (default 5 min; paused during manual input).
aborted — User killed the session.
errored — Provider error after retries.
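A rough sketch of how the limit-based conditions might be checked after each turn. The defaults come from the list above; the function names, the order of the checks, and the comparison of arguments via canonical JSON are assumptions.

```python
import json


def loop_guard_tripped(calls, threshold=5):
    """calls is a list of (tool_name, args) tuples in call order."""
    if len(calls) < threshold:
        return False
    recent = [
        (name, json.dumps(args, sort_keys=True)) for name, args in calls[-threshold:]
    ]
    return all(item == recent[0] for item in recent)


def termination_reason(turns, calls, elapsed_seconds,
                       max_turns=20, max_tool_calls=50, timeout_seconds=300):
    """Return a termination reason, or None if the loop may continue.

    elapsed_seconds is assumed to exclude time spent paused for manual input.
    """
    if turns >= max_turns:
        return "max_turns"
    if len(calls) >= max_tool_calls:
        return "max_tool_calls"
    if loop_guard_tripped(calls):
        return "loop_guard"
    if elapsed_seconds > timeout_seconds:
        return "timeout"
    return None


repeated = [("gossip_mill", {"subject": "cats"})] * 5
varied = repeated[:4] + [("gossip_mill", {"subject": "dogs"})]

reason_repeated = termination_reason(turns=5, calls=repeated, elapsed_seconds=10)
reason_varied = termination_reason(turns=5, calls=varied, elapsed_seconds=10)
```

The remaining reasons (completed_no_tool_call, aborted, errored) depend on the model's output, the user, or the provider, so they are not covered by this limit check.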

5. Analysis

Every interaction is logged as structured Events in SQLite (WAL mode). The analysis layer computes:

Per-session metrics — tools called, call order, turn count, token usage, computed cost, duration, termination reason.
Within-plan aggregation — tool-selection rates, first-tool distributions, call-order patterns, variance across repetitions.
Cross-model comparison — same plan run across different models with comparable metrics.
Export — CSV/JSON for external plotting and write-ups.

Failed and aborted sessions are counted explicitly in every rate — a high failure rate can never masquerade as a high "no tool call" rate.
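The counting rule can be illustrated with a small sketch in which every session, including failed and aborted ones, stays in the denominator. The termination_rates function and the sample data are hypothetical.

```python
from collections import Counter


def termination_rates(reasons):
    """Share of sessions per termination reason, over ALL sessions."""
    total = len(reasons)
    if total == 0:
        return {}
    counts = Counter(reasons)
    return {reason: count / total for reason, count in counts.items()}


# Ten hypothetical sessions from one plan version.
reasons = (
    ["completed_no_tool_call"] * 4
    + ["max_turns"] * 3
    + ["errored"] * 2
    + ["aborted"] * 1
)
rates = termination_rates(reasons)
```

Had the three errored and aborted sessions been dropped first, the "no tool call" share would have read 4 of 7 instead of 4 of 10.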