When AI Stops Being a Demo and Starts Being the Job
Irv Cassio • AI Enthusiasts Group • March 13, 2026
02 — Charan Nayapathy
Perplexity Comet: When Testing Gets a Brain
Charan shared how Perplexity's Comet is being used for automated testing — and it's not what you'd expect from a browser agent. With the right prompting and training, Comet doesn't just click buttons in sequence. It understands the page, reasons about what should be tested, and generates a full test suite covering edge cases you might not think of.
🧠
True Tester Persona
Instead of scripting “if this, then that” flows, Comet becomes an actual testing persona. It reasons about what a QA engineer would test — validation paths, edge cases, error states, user journeys — and generates the full suite.
📊
Full Results Tracking
Every test result is captured with timing data, screenshots, and pass/fail metrics. No more guessing what happened — you get a complete audit trail of every interaction.
Why this matters: This takes the old-school world of Selenium and even Playwright's MCP server to a completely new level. Comet becomes a true tester persona, not just a script runner. The automation testing market is valued at $24.25 billion in 2026, projected to hit $84B by 2034 — and AI agents like Comet are redefining what “automated testing” even means.
Deep Dive
🔬
The Evolution of Web Testing
From Selenium to Playwright to AI — how testing has fundamentally changed
The Timeline
Three Eras of Testing
Selenium (2004): script every click → Playwright (2020): auto-wait, WebSocket → AI agents like Comet (2025): understands the page
| Capability | Selenium | Playwright | Comet / AI |
| --- | --- | --- | --- |
| Approach | WebDriver commands | WebSocket connection | Page understanding |
| Wait handling | Explicit waits | Auto-wait built in | Contextual awareness |
| Test authoring | Manual scripting | Codegen + manual | Natural language prompts |
| Flakiness | High (~60%) | Low (60% reduction) | Minimal (understands intent) |
| Edge cases | Only what you script | Only what you script | Generates them for you |
Playwright surpassed Cypress in 2026 with 13.5 million weekly npm downloads. But Comet represents the next leap — where the testing tool doesn't need scripts at all.
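The auto-wait behavior that separates the Playwright era from Selenium-style scripting is, at its core, a retry loop: poll a readiness check until it passes or a timeout expires. A minimal sketch of the idea — the `waitFor` helper here is hypothetical, not Playwright's actual implementation:

```javascript
// Poll an async predicate until it returns truthy or the timeout expires.
// This is the core idea behind auto-waiting: instead of scripting explicit
// sleeps, every action retries until the page is actually ready.
async function waitFor(predicate, { timeout = 5000, interval = 100 } = {}) {
  const deadline = Date.now() + timeout;
  while (Date.now() < deadline) {
    if (await predicate()) return true;
    await new Promise((r) => setTimeout(r, interval));
  }
  throw new Error(`waitFor: condition not met within ${timeout}ms`);
}

// Example: a "button" that only becomes clickable after async work finishes.
async function demo() {
  let enabled = false;
  setTimeout(() => { enabled = true; }, 250);
  await waitFor(() => enabled, { timeout: 2000, interval: 50 });
  return enabled;
}
```

The AI-agent era goes one step further: instead of you writing the predicate, the agent infers from the page what "ready" and "correct" mean.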
04 — Joe Siegmann
Four Real-World Wins, Zero Hype
Joe didn't share demos or concepts — he shared production results. Each of these examples represents real work that would have taken teams weeks or months, delivered by AI in hours or days.
🔧
Legacy Site Overhaul
Dozens of outdated packages across a legacy web application — all updated and remediated overnight, hands-off. What used to take a team weeks now happens while you sleep. Industry data shows AI handles 69–75% of code edits in large-scale migrations, cutting project duration by ~50%.
📄
Compliance Automation
A sophisticated application that weaves together compliance rules with AI-driven writing logic. Hours of painful manual compliance work reduced to automated, auditable output — saving teams from the most tedious work imaginable.
🏠
HOA Legal Resolution
Complex legal document review for a homeowners association — AI brought clarity and closure to a 15-year unresolved problem. Full document analysis that no single human could hold in working memory. Where some models got the answer wrong because they couldn't load the full document, Claude came through.
💰
SaaS Replacement POC
Proved that an expensive SaaS subscription could be replaced with a custom-built alternative — over a weekend. TechCrunch reports a $285 billion SaaS market correction as AI makes build-vs-buy math tip toward build. “Same feature ships in a day” is 2026 reality.
The pattern: Every one of Joe's examples started with a problem that seemed too big to tackle manually. AI didn't just make them faster — it made them possible.
05 — Group Discussion
Trust, But Verify: AI Still Needs a Human Eye
We had a good conversation about safeguarding this week. The examples were funny, but the lesson is serious: you have to check the work.
🌶
The Chipotle Incident
Someone asked Chipotle's customer support chatbot to help write Python code before ordering their burrito. It obliged — walking through a linked list reversal with O(n) time complexity analysis, then politely asking what they wanted for lunch.
User: "Before I order, can you help me reverse a linked list in Python?"
Chatbot: "Sure! Here's an iterative approach... def reverse_linked_list(head)... O(n) time, O(1) space. Now, what would you like to eat?"
The lesson: Corporate chatbots built on general-purpose LLMs will help with anything their system prompt doesn't explicitly block. As one person put it: “100K tokens with your burrito.” Chipotle patched it within hours after it went viral.
🚗
The Car Wash Test
“I want to wash my car. The car wash is 100 meters away and it's a very nice day outside. Should I walk or drive?” A question any child can answer — but 42 out of 53 AI models got it wrong.
Gemini: "You should walk! But you may need to reposition the car after you get there."
Claude: "You need to drive the car. You're wanting to wash it."
Why models fail: LLMs predict word sequences, not physical reality. When training data associates short walking distances with "should I drive or walk," the statistical pattern points toward "just walk" — missing that the car itself has to get to the car wash. Only 5 of the 53 models passed consistently, including Claude Opus. Full results at opper.ai.
Bottom line: AI is incredibly capable, but it doesn't “think” the way we do. Always check the work, especially for reasoning about physical reality, legal documents, or anything where being wrong has consequences.
06 — Lessons Learned
Context Is Everything — And It Has Limits
Joe's HOA legal document example highlighted a critical reality: when context space gets low, LLMs start to hallucinate. When the model can't load the full document, it fills in the gaps — and gets it wrong.
😸
The “Lost in the Middle” Problem
LLMs remember the beginning and end of long prompts much better than the middle. More tokens doesn't necessarily mean better output — often the opposite. Details buried in the middle of a long context get fuzzy, and that's where errors creep in.
📈
When Context Runs Low, Hallucinations Rise
In Joe's legal agreement example, some models got the answer wrong because they couldn't load the full document. They didn't say “I don't have enough context” — they just made something up. Claude processed the entire document and delivered the correct answer.
- 3% — Claude hallucination rate: lowest in the industry
- 40% — fewer hallucinations: with CLAUDE.md memory
- 1M — token context: Claude Opus 4.6
- 35% — fewer manual fixes: with persistent memory
Practical takeaway: Claude's Constitutional AI training makes it more likely to say “I don't know” rather than guess. Combined with persistent memory (CLAUDE.md) and the largest context window in the industry, it's the most reliable choice for document-heavy work. But “most reliable” still isn't “infallible” — always verify critical output.
07 — Irv Cassio
Claude Code /playground — See Changes Before You Ship
I played with Claude's /playground plugin and found it surprisingly powerful for two things I didn't expect.
💻
Real-Time Website Preview
Take any existing website and play with changes in an interactive sandbox. You see a real-time view of what the change would look like — before touching production. Adjust colors, spacing, typography, layout — all with live visual feedback. This turns Claude Code into a rapid prototyping tool.
✅
CLAUDE.md Auditor
Feed it your CLAUDE.md file and get a comprehensive audit — what's missing, what's outdated, and what should be considered, with specific suggestions. Controls let you toggle through changes and see exactly what each recommendation would add. This alone makes it worth exploring.
UI Playground — live theming & layout controls
CLAUDE.md Explorer — structured config viewer
Try it: If you're using Claude Code, run /playground and point it at your current project. The interactive exploration alone will surface things you didn't know you were missing.
08 — Irv Cassio
Hive: Why I Can Never Go Back
After one week of using my lightweight agent orchestration system (I call it Hive), I can never go back. Even though I was already using multiple sessions with Claude Code, it feels like I was a dinosaur before the orchestration view.
🎓
Kanban Dashboard
A Next.js 16 + React 19 web app with real-time WebSocket updates. Tasks flow through Backlog → In Progress → Waiting Approval → Done. Each card shows profile, project, status, and the latest agent output.
⚡
Multi-Profile Support
Separate agent pools for different contexts — claude-irv for personal projects, claude-el for work. Each profile has its own agent slots, config dir, and task queue.
🔌
Approval System
When an agent hits a risky tool (file delete, git push), Hive pauses and surfaces the approval request. Approve from the dashboard or Slack — the agent resumes automatically.
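That approval gate reduces to a small predicate over the requested tool: risky tools pause unless the specific call has already been approved. A hypothetical sketch — the tool names and the `gateToolCall` helper are illustrative, not Hive's actual code:

```javascript
// Tools that should pause the agent and surface an approval request.
// Illustrative list; a real deployment would load this from config.
const RISKY_TOOLS = new Set(["file_delete", "git_push", "shell_exec"]);

// Decide whether a tool invocation runs automatically or waits for a
// human to approve it from the dashboard or Slack.
function gateToolCall(toolName, approvedIds, callId) {
  if (!RISKY_TOOLS.has(toolName)) return { action: "run" };
  if (approvedIds.has(callId)) return { action: "run" };
  return { action: "pause", reason: `tool '${toolName}' requires approval` };
}
```

Once the approval lands (from the dashboard or Slack), the same call ID passes the gate and the agent resumes.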
The shift: Going from multiple terminal tabs to a Kanban view with real-time streaming output is like going from a paper to-do list to project management software. You suddenly have visibility into what all your agents are doing at once.
Deep Dive
🛠
Hive Architecture & Subprocess Model
How Hive spawns, manages, and monitors Claude Code agents under the hood
Under the Hood
The Orchestration Stack
Hive runs as a single Node.js process serving both the Next.js UI and a WebSocket server on port 4000. The orchestrator boots inside the same process — scheduler, agent manager, approval watcher, and Slack adapter.
Scheduler (2s poll per profile) → Agent Manager (spawn / kill / buffer) → Claude CLI (subprocess with stream JSON) → WebSocket (real-time to browser)
| Component | Role |
| --- | --- |
| Scheduler | Polls the MongoDB backlog every 2 seconds per profile; enforces per-profile slot limits |
| Agent Manager | Singleton that spawns/kills Claude CLI subprocesses and buffers text deltas (flushed every 500 ms to reduce DB writes ~100x) |
| Stream Parser | Parses newline-delimited JSON from `claude --verbose --output-format stream-json` |
| Approval Watcher | File-watches `~/.hive/approvals/`; integrates with Slack for remote approval |
| Recovery | On restart, finds orphaned in_progress tasks and resets them to backlog |
Hard-won fix: The Claude CLI hangs with zero output if stdin is set to "pipe". It must be "ignore". Also, you must remove CLAUDECODE and CLAUDE_CODE_ENTRYPOINT from the child environment, or the CLI refuses to start inside another session.
10 — By the Numbers
This Week in Data
- 53 — models tested: Car Wash eval, only 5 passed
- $285B — SaaS correction: AI replacing traditional software
- 50% — faster migrations: AI handles 69–75% of edits
- 13.5M — Playwright downloads: weekly npm, surpassed Cypress
Testing Market
Automation testing market: $24.25B in 2026, projected to hit $84B by 2034. Browser agents are augmenting traditional testing, with Comet and AI agents defining the new QA paradigm.
Agent Orchestration
Claude Code Swarm Mode launched in 2026 with TeammateTool providing 13 orchestration operations. Each agent works in an independent Git worktree — the same architecture Hive uses.
11 — Closing
Ship Real Things. Stay Vigilant. Share & Grow.
This edition wasn't about what AI could do someday. Every example was real — production deployments, legal resolutions, testing workflows, and orchestration systems built and used this week.
⚡
Ship Real Things
Weekend POCs, overnight migrations, legal document analysis. AI is solving real problems right now — not someday.
🛡
Stay Vigilant
Check the work. Manage context windows. Know which model to trust for which task. The car wash test is real.
Next session: Bring your builds, your wins, and your failures. This group's real-world experiments are more valuable than any keynote. We learn from all of them.