Not Really Philosophy
📜 Origin

Not Really Design Philosophy

Not Really started as "Claude Code, but for media." It evolved into something else — a full agentic tooling and experience layer designed specifically for creative media work.

It is not Claude Code wrangled into doing media. It is whatever it needs to be, FOR media.

💡 Core Insight

The Core Insight

All humans are terrible prompters — you don't know what you want until you see something, then ideas start flowing. But humans are great preference machines. Shown options, you know what you prefer.

The goal is not to make a better prompt box. It's to build a system where the human steers through preference signals, and the system learns to anticipate. Show them things — search results, reference images, style explorations — and let their reactions reveal what words can't. The interface is a shared workspace — a canvas — where humans and AI agents communicate through lightweight signals and build creative artifacts together.

Why Claude Code Works (and Media Doesn't — Yet)

Coding agents work because: (1) programming languages are rigorous and Markovian, (2) verification tools — compilers, test runners, linters — let the agent check its own work. The agent loop closes: generate code, verify, fix, verify again.

Media has neither property. It's subjective and has no "compiler." The agent generates an image and has no idea if it's good. The loop is open.

🔍 The Four Gaps

The Four Gaps

Claude Code for media requires filling four gaps that coding agents get for free:

Gap 1: Human-in-the-Loop

For coding, correctness is objective — tests pass or fail. For media, "correctness" lives in the human's head. Turn-based chat forces full articulation upfront (impossible for creative work) and delays feedback (too costly).

The system must make feedback early, frequent, and low-friction. A 👍 on an image is worth a thousand words of prompt engineering. The experience should be fun — unlike coding, the creative process is as important as the output because it inspires and leads to novel creation.

Design principles:

  • Progressive disclosure — low-fi previews first, expensive high-fi later
  • Continuous steering — interrupt, redirect, refine mid-generation; non-blocking, async
  • Beyond-text preference — pointing, circling, A/B picks, reactions, mood boards; minimize the cost of "I like this but not that"
  • Proactive clarification — show visual options when uncertain instead of guessing
  • Context accumulation — every reaction (even implicit: hovering, skipping) builds a preference model

Gap 2: Automated Evaluation — The Media Compiler

Coding agents verify output via compiler, linter, type-checker, test suite — deterministic, immediate, machine-readable. Media agents are blind after generating. Without a verification signal independent of the generator, the agentic loop can't close.

The Actor-Critic pattern: a generating agent (Actor) paired with an evaluating VLM (Critic). The Critic provides natural language feedback — "background too cluttered, subject off-center, palette doesn't match" — not just scalar scores. NL critique lets the agent reason about what to fix, like reading compiler errors. In practice: fast scalar metrics for hard checks (alignment, technical quality) + VLM critic for soft checks (aesthetics, composition, vibe).
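The loop described above can be sketched as code. This is a minimal illustration, not the system's implementation: `generate_image`, `hard_checks`, and `vlm_critic` are hypothetical stand-ins for whatever generation and evaluation backends sit behind the Actor and the Critic.

```python
from dataclasses import dataclass

@dataclass
class Critique:
    passed: bool
    notes: str  # natural-language feedback, read like a compiler diagnostic

def generate_image(prompt: str) -> str:
    return f"<image for: {prompt!r}>"        # stub Actor

def hard_checks(image: str) -> bool:
    return True                              # stub fast scalar metrics (alignment, artifacts)

def vlm_critic(image: str, brief: str) -> Critique:
    # stub Critic; a real one is a VLM returning NL critique, not a scalar
    ok = "cluttered" not in image
    return Critique(passed=ok, notes="" if ok else "background too cluttered")

def actor_critic_loop(brief: str, max_rounds: int = 3):
    prompt = brief
    image, critique = None, None
    for _ in range(max_rounds):
        image = generate_image(prompt)       # Actor proposes a candidate
        if not hard_checks(image):           # cheap hard checks gate first
            prompt = brief + "\nFix technical quality."
            continue
        critique = vlm_critic(image, brief)  # Critic does the soft checks
        if critique.passed:
            break
        # feed the critique back the way an agent re-reads a compiler error
        prompt = brief + "\nRevise: " + critique.notes
    return image, critique
```

The key design point is the two-tier gate: deterministic metrics run before the expensive VLM call, mirroring how a coding agent runs the type-checker before asking for a code review.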

Evaluation axes (with code analogs):

  1. Prompt alignment — does it match the request? (tests pass)
  2. Aesthetic quality — does it look good? (no code equivalent)
  3. Technical quality — artifacts, blur, noise? (compiles clean)
  4. Compositional correctness — right objects, right places? (signatures match)
  5. Semantic understanding — can the agent "see" what it made? (reading its own code)
  6. Holistic NL critique — what works and what doesn't (code review)

Gap 3: Version Control & Non-Destructive Editing

Git makes the coding agent loop safe: every change is reversible, diffable, branchable, mergeable. The agent can try things aggressively. Media has no equivalent.

  • Reversibility enables boldness — git revert is free. Destructive pixel overwrites make the agent brittle. Non-destructive editing (layers, masks, param snapshots) enables experimentation.
  • Diffs enable evaluation — git diff shows what changed. Media needs visual/audio diffs, overlay toggles, difference maps.
  • Branching enables exploration — "try warmer" and "try cooler" are parallel paths. Fork, explore, let the human pick or merge.
  • A/B comparison as first-class interaction — comparing is cognitively cheaper than describing. This is where human feedback gets efficient.
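One way to get git-like properties without git is to store edit operations instead of pixels: each node records an operation and a parent, so any state is replayable, any fork is a branch, and revert is just re-pointing the canvas. A minimal sketch, with all names and the operation shape as illustrative assumptions:

```python
from dataclasses import dataclass, field
from typing import Optional
import itertools

_ids = itertools.count()

@dataclass
class EditNode:
    """One non-destructive edit; pixels are derived by replaying the lineage."""
    op: dict                                  # e.g. {"type": "color_grade", "temp": "warmer"}
    parent: Optional["EditNode"] = None
    id: int = field(default_factory=lambda: next(_ids))

    def lineage(self):
        node, path = self, []
        while node:
            path.append(node.op)
            node = node.parent
        return list(reversed(path))           # replay order: root -> leaf

root = EditNode({"type": "generate", "prompt": "sunset street"})
warm = EditNode({"type": "color_grade", "temp": "warmer"}, parent=root)  # branch A
cool = EditNode({"type": "color_grade", "temp": "cooler"}, parent=root)  # branch B
# "revert" is free: point the canvas back at `root`; both branches survive
```

"Try warmer" and "try cooler" become sibling leaves under the same parent, which is exactly the shape an A/B comparison widget wants to render.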

Gap 4: Persistent Preference Learning

This is the media equivalent of .editorconfig, claude.md, README — the non-code context that coding agents write down, commit, and know to look for.

Without preference memory, every session starts cold. The human re-explains "warm tones," "less saturated," "more negative space" every time — like a coding agent forgetting your project uses TypeScript.

What to learn:

  • Visual — palettes, saturation, contrast, lighting, composition tendencies
  • Stylistic — photorealistic vs. illustrated, moody vs. bright, vintage vs. modern
  • Feedback patterns — recurring corrections become implicit constraints
  • Rejection signals — what the user skips is as informative as what they choose

How to build it:

  • Implicit signal capture — track selections, hovers, skips, rejections; every A/B comparison generates preference data
  • Lightweight preference model — structured style profile (preference axes + weights), not a fine-tune; a .editorconfig for aesthetics
  • Cross-session memory — persist and let the user inspect/edit; "you seem to prefer warm, desaturated palettes — correct?"
  • Per-project overrides — like per-repo configs: default style is warm/minimal, but this project is bold/maximalist
  • Drift detection — notice when recent feedback contradicts stored preferences; ask if taste has changed
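The "structured style profile" above can be as small as named axes with weights, nudged by implicit signals and checked for drift. A sketch under assumed names; the axes, the update rule, and the drift threshold are all illustrative, not a specification:

```python
# Preference axes with weights in [-1, 1]; the ".editorconfig for aesthetics"
PROFILE = {"warmth": 0.0, "saturation": 0.0, "negative_space": 0.0}

def record_signal(profile, axis, direction, lr=0.2):
    """direction: +1 (chose/liked), -1 (skipped/rejected)."""
    profile[axis] = max(-1.0, min(1.0, profile[axis] + lr * direction))

def detect_drift(profile, recent, threshold=0.5):
    """Flag axes where recent feedback contradicts a strongly held preference."""
    return [axis for axis, direction in recent.items()
            if profile[axis] * direction < 0 and abs(profile[axis]) > threshold]

# Session one: user repeatedly picks warmer, desaturated options
for _ in range(4):
    record_signal(PROFILE, "warmth", +1)
    record_signal(PROFILE, "saturation", -1)

# Later session: user starts rejecting warm options -> ask, don't assume
drifted = detect_drift(PROFILE, {"warmth": -1})
```

Because the profile is a plain, named structure rather than a fine-tune, it stays inspectable and editable by the user, which is what makes the "you seem to prefer warm, desaturated palettes — correct?" interaction possible.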

🏗️ What We're Building

What We're Building

A multi-agent AI creative workspace built on a generic engine + configurable team blueprints (hatsets):

  • A master agent (creative director) — the one permanent coordinator. Interprets intent, recruits specialists, learns taste. Its core behavior is fixed; hatsets append domain context.
  • Generic specialist agents — no hardcoded roles. Each specialist receives identity entirely from the hatset roster: system prompt, tool whitelist, personality, model preference. An image expert, a critic, a researcher — all configured, not coded.
  • Autonomous canvas observers — agents independently read canvas state via tools. The master sends short nudges, not detailed specs. Each agent owns a watermark tracking what it's seen.
  • The human steers through reactions, selections, comments, and spatial arrangement — not prompt engineering. Interactions batch until the user triggers the next agent run.
  • A whiteboard (persistent scratchpad) — path-addressed markdown filesystem for agent memory. Style guides, briefs, preference profiles. The .editorconfig for aesthetics.
  • Hatsets (team blueprints) — YAML-defined configurations that control master context, specialist roster, eval criteria, layout preference, and suggested brief. Same engine, completely different creative experiences. A Hat Shop marketplace lets users browse, fork, and share team configurations.
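The shape of a hatset can be sketched as the structure its YAML would deserialize into. The field names below are assumptions inferred from the description (master context, specialist roster, eval criteria, layout, suggested brief), not the actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class SpecialistSpec:
    """A generic agent's entire identity: configured, not coded."""
    name: str
    system_prompt: str
    tools: list[str]              # tool whitelist; the tools themselves stay generic
    model: str = "default"        # model preference

@dataclass
class Hatset:
    master_context: str           # appended to the master's fixed core behavior
    roster: list[SpecialistSpec] = field(default_factory=list)
    eval_criteria: list[str] = field(default_factory=list)
    layout: str = "grid"
    suggested_brief: str = ""

# Hypothetical example: same engine, a manga-flavored team
manga_team = Hatset(
    master_context="You direct a manga production team.",
    roster=[
        SpecialistSpec("inker", "You ink clean line art.", ["generate_image", "edit_image"]),
        SpecialistSpec("critic", "You review panel flow.", ["read_canvas", "comment"]),
    ],
    eval_criteria=["panel flow", "character consistency"],
)
```

Swapping the YAML swaps the whole creative experience; the engine and its tool primitives never change, which is also what makes hatsets forkable and shareable in a Hat Shop.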

The output isn't one image — it's an organized creative artifact: a book of pages, each a canvas of grouped, connected, reacted-to widgets.

Architecture Bets

  1. VM-less agentic infrastructure — long-running, coordinative, async multi-agent system without giving each user a persistent VM. Ship anything without dedicated infrastructure
  2. Collaboration over generation — it should feel like working with colleagues on a shared canvas, not prompting a machine
  3. Canvas over chat — spatial interaction beats sequential conversation for creative work
  4. Fluid feedback over typing — reactions, selections, comparisons, and spatial arrangement replace written descriptions as the primary input
  5. Agents observe, not receive — each agent subscribes to and intelligently observes the canvas, like a human expert watching a shared workspace, not waiting to be briefed
  6. Generic tools, specific hats — tools are use-case-agnostic primitives; all domain behavior lives in hatset prompts. If it can be prompt-driven, it should be
  7. Non-turn-based UX on turn-based LLMs — the user is never blocked waiting for the system, and the system is never blocked waiting for the user
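Bet 5 ("agents observe, not receive") combined with the per-agent watermark from the canvas-observer bullet suggests a pull model over an append-only event log. A minimal sketch; the event shape and names are illustrative assumptions:

```python
CANVAS_LOG = []   # append-only (revision, event) log of canvas activity

def post(event):
    CANVAS_LOG.append((len(CANVAS_LOG) + 1, event))

class Observer:
    """An agent that pulls canvas deltas instead of receiving detailed specs."""
    def __init__(self, name):
        self.name = name
        self.watermark = 0        # highest revision this agent has processed

    def observe(self):
        new = [e for rev, e in CANVAS_LOG if rev > self.watermark]
        if CANVAS_LOG:
            self.watermark = CANVAS_LOG[-1][0]
        return new                # the agent reasons over only what it hasn't seen

# Batched human interactions land on the canvas...
post({"type": "reaction", "emoji": "👍", "widget": "img-1"})
post({"type": "comment", "text": "warmer tones here"})

# ...and a nudged agent catches up on its own
critic = Observer("critic")
first = critic.observe()          # sees both events
second = critic.observe()         # nothing new; watermark is current
```

The master's "short nudge" then only needs to say "look at the canvas," because catching up is the agent's own responsibility, not the coordinator's.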

🎯 Design Principles & Goals

Design Principles

  • Progressive fidelity — cheap previews first, expensive renders only when approved
  • Continuous steering — interrupt, redirect, refine mid-generation
  • Beyond-text preference — pointing, comparing, reacting, arranging > describing
  • Proactive clarification — show visual options when uncertain instead of guessing
  • Context accumulation — every signal compounds into a preference model

Goals

  1. VM-less multi-agent orchestration — scales to dogfood (~200 users) without dedicated infrastructure
  2. Non-turn-based UX — turn-based LLMs under the hood, but the user is never forced into chatbot-style interaction
  3. Hatset-driven experiences — the engine doesn't make manga or brand kits. Hatsets do. Same engine, completely different creative workflows via configuration
  4. Fluid feedback capture — the UI minimizes the cost of expressing preference and maximizes the signal extracted from every interaction
  5. Hat Shop ecosystem — users browse, fork, share, and distill team configurations. The platform grows through community-created hatsets, not new features

🤔 So Is This...

Is this Claude Code?

Is this Figma with a prompt box?

Is this a wrapper around better image models?

Is this another AI art generator?

Is this a chatbot that makes pictures?

Not Really