# Pythoughts - Complete Content Archive

> A publishing platform for thoughtful writing. Discover articles on technology, science, philosophy, and creative writing from independent creators.

Last updated: 2026-02-26T03:11:55.693Z
Total posts: 8

---

## Building Professional Websites with Claude Code

URL: https://pythoughts.com/@moelkholy/posts/building-professional-websites-with-claude-code
Author: Mohamed Elkholy (@moelkholy)
Published: 2026-02-19
Tags: claude, code, ai

![cc](https://pythoughts.com/api/media/384519d5-3c9c-492d-9a66-96d6adbf9a70/BETuLFtLQES3vJVBIqqud/S3644yBCRlJqI9sLunxZQ/1771559584021-ivvlf6-cc.webp)

> A complete guide to setting up, designing, and deploying production-ready websites using Claude Code — from local development to live deployment.

---

## Getting Started

### Setup Requirements

| Requirement | Details |
| --- | --- |
| Editor | Visual Studio Code (VS Code) |
| Extension | Claude Code (install via Extensions panel) |
| Account | Anthropic/Claude paid subscription |
| Tier | Pro or Max (free tier not supported) |

> 💡 Start with **Pro tier** — upgrade to **Max** only if you hit usage limits.

---

## Hack #0 — The Claude.md File

### What It Is

The `claude.md` file acts as a **system prompt** — Claude Code reads it before every single action. Think of it as the constitution of your project.

> *"Every time before you ask Claude Code to do something, it will read the claude.md file first. It will always process that."*

### Key Rules

- Keep it **concise** — no unnecessary bloat
- Include all project-critical rules and context
- Update it **iteratively** as your project evolves

### What to Include

```markdown
# Example claude.md Structure
- Front-end design skill invocation instructions
- Screenshot workflow configuration (Puppeteer)
- Brand asset reference guidance
- Coding and styling best practices
```

### Getting Started

A free template (`web design cloud.mmd`) is available from the free school community. Simply drag it directly into your VS Code file explorer.
---

## Hack #1 — Front-End Design Skill

### What Are Skills?

Skills are **custom instruction files** that supercharge Claude's capabilities for specific tasks. They work like this:

```
User sends prompt
        ↓
Claude reads claude.md
        ↓
Claude checks: "Do I have a skill for this?"
        ↓
Yes → Reads skill → Takes action
No  → Uses general knowledge
```

Skills are stored as Markdown files and **install globally** across all Claude Code projects.

### Why the Front-End Design Skill Matters

Without it, AI-generated designs tend to look generic and "vibecoded." With it:

- ✅ Modern, professional-looking designs
- ✅ Animations and dynamic elements
- ✅ Significantly elevated quality from minimal prompting
- ✅ Does not appear AI-generated

### Installation

Run the provided commands directly in Claude Code — installation and configuration are handled automatically.

---

## Hack #2 — Screenshot Loop

### The Core Problem

AI can move projects toward a goal, but initial outputs rarely match the final vision. Manual iteration is slow and tedious.

### The Solution: Automated Visual Feedback

The screenshot loop creates a **self-correcting cycle**:

```
Build → Screenshot → Compare vs. Reference → Fix → Repeat
```

### How It Works

- Claude creates a `temporary_screenshots/` folder
- Captures screenshots automatically as it builds
- Compares screenshots against your reference/goal
- Makes visual adjustments
- Runs a **two-pass review and polish** process

**Technical stack:** Implemented via **Puppeteer**, configured in `claude.md`.

### ⚠️ When to Turn It OFF

> *"Sometimes you might want to turn off the screenshot tool."*

Disable it for **animated elements** — static screenshots can't capture motion, causing infinite correction loops.

**Example instruction:**

```
Because this is an animated background, do not use the screenshot tool to compare. Just work in the code and I will review manually.
```

---

## Hack #3 — Reference Websites for Inspiration

### Where to Find Inspiration

| Resource | Type |
| --- | --- |
| Dribbble | Design inspiration platform |
| Godly.website | Curated website showcases |
| Awwwards | Award-winning web designs |

### 3-Step Process to Clone Any Website

#### Step 1 — Capture a Full-Page Screenshot

- Press `F12` to open DevTools
- Go to the **Console** tab
- Press `Ctrl+Shift+P` and search for **"screenshot"**
- Download the full-page screenshot

#### Step 2 — Extract Styling Information

- Open the **Elements** tab in DevTools
- Copy all code from the **Styles** section
- This captures all HTML/CSS defining the page appearance

#### Step 3 — Give It to Claude Code

```
1. Paste the style code into Claude Code
2. Drag in the full-page screenshot
3. Reference assets with the @ symbol
4. Prompt: "Clone this website for us"
```

### What to Expect

The agent will automatically:

- Write the website code
- Start a local server
- Take screenshots
- Run two comparison rounds (build vs. reference)
- Fix any mismatches
- Repeat until satisfied

### Applying Your Own Branding

After cloning, add your brand in a new prompt:

```
Work in our brand assets — brand guidelines, logo, and colors.
```

The agent automatically integrates **colors, typography, logos, and copy**.

---

## Hack #4 — Individual Component Inspiration

### The Resource: 21st.dev

**[21st.dev](https://21st.dev/)** offers a curated gallery of high-quality, reusable components:

- Shaders & animated backgrounds
- Home screens & hero sections
- Buttons (rainbow outlines, shiny effects, etc.)
- Mouse highlight effects
- Dark/light mode toggles
- And much more

### Implementation Process

- Browse the gallery on 21st.dev
- Copy the component code or prompt
- Tell Claude Code where to place it:

```
I want you to work in this [element] right behind the [location]
```

- Paste the copied code/prompt
- Agent integrates it into the existing design

### Managing Animated Components

Use the same rule as the screenshot loop — **disable comparison for animated elements**:

```
Because this is an animated background, do not use the screenshot tool to compare. Just work in the code and I will let you know if we need changes.
```

### Iterative Refinement

Give specific visual feedback to guide improvements:

| Vague Feedback ❌ | Specific Feedback ✅ |
| --- | --- |
| "It looks bad" | "The animation is too distracting" |
| "Fix the background" | "The background looks pixelated and cheap" |
| "Make it better" | "Needs more contrast with the text overlay" |

---

## Deployment: GitHub + Vercel Pipeline

### The Overall Workflow

```
Claude Code (local dev)
        ↓
GitHub (version control)
        ↓
Vercel (live deployment + auto-deploy)
```

---

### Step 1 — Set Up a GitHub Repository

- Go to **[github.com](https://github.com/)** and create an account
- Click **"Create new repository"**
- Name your repo (e.g., `my-project-website`)
- Click **Create repository**

Then in Claude Code:

```
Push this to a GitHub repository called [repository-name]
```

Claude Code will:

- Authenticate with GitHub (prompts you)
- Auto-create a `.gitignore` file
- Handle all git setup and initial commit

> **🔒 Security Warning:** Never push API keys, passwords, or sensitive credentials to GitHub.

---

### Step 2 — Deploy with Vercel

- Go to **[vercel.com](https://vercel.com/)** — sign up with GitHub credentials
- Click **"Add New Project"**
- Choose your GitHub repository → **Import**
- Click **"Deploy Project"**

Your site will be live at:

```
https://[project-name].vercel.app
```

For a **custom domain**, go to Project Settings → Domains.
---

### Step 3 — Continuous Deployment

Once connected, every push to GitHub **automatically deploys** to Vercel:

```
1. Make changes locally in Claude Code
2. Test on localhost
3. Approve → "Push these changes to GitHub"
4. GitHub receives the commit
5. Vercel detects the change
6. Vercel auto-deploys within minutes
7. Live site is updated ✅
```

**Best practice — add to your claude.md:**

```
We're always going to test on localhost until I explicitly tell you to push to GitHub.
```

---

## Bypass Permissions Mode

### What It Does

Allows Claude Code to **execute commands without confirmation prompts** — faster, more fluid development.

### How to Enable

```
Settings → Search "claude code" → Enable "Allow dangerously skip permissions"
```

### ⚠️ Security Warning

> *"This is dangerous. It has the potential to run any command that it wants."*

**Safe usage rules:**

- ✅ Always stay nearby and actively monitor
- ✅ Never set it to run unattended overnight
- ✅ Supervise all executions — in practice, not a major risk if watched

---

## Resources & Community

### Free Resources

- Free `claude.md` template available via the free school community
- Includes design resources and downloadable assets
- Available through the link in the video description

### Paid Community — AI Automation Society Plus

A paid community with thousands of members building AI-based businesses, featuring:

- Deep-dive Claude Code courses
- AI automation strategies
- Building AI agencies
- Continuously updated course content

---

## Best Practices & Workflow Tips

### Development Process

```
Simple prompt → Review → Targeted feedback → Iterate → Approve → Ship
```

- **Start simple** — vague prompts work well with a good foundation
- **Test before pushing** — always verify on localhost first
- **Iterate in small steps** — don't try to fix everything at once
- **Get specific** — as you refine, your feedback should become more precise

### Testing Checklist

- [ ] Test on localhost before any GitHub push
- [ ] Visual
inspection in browser
- [ ] Screenshot verification for animated elements
- [ ] Review against reference design (if cloning)

### File Organization

```
project/
├── claude.md               ← System prompt / project rules
├── brand_assets/           ← Logos, colors, brand guidelines
│   ├── logo.svg
│   └── brand-guidelines.md
├── temporary_screenshots/  ← Auto-created during dev (deletable)
└── src/                    ← Your actual code
```

Reference files directly in prompts using the `@` symbol.

### Design Quality Checklist

| Factor | Action |
| --- | --- |
| Front-end skill | Install globally — enables modern aesthetics automatically |
| Brand consistency | Always reference brand_assets/ in prompts |
| Component quality | Source from 21st.dev for polished UI |
| Animations | Disable screenshot loop for dynamic elements |
| Production readiness | Test locally, then deploy via GitHub → Vercel |

---

*Built with Claude Code · Deployed on Vercel · Versioned on GitHub*

---

## Tool Calling: Fundamentals and Its Evolution in Modern AI Systems

URL: https://pythoughts.com/@mk_99/posts/tool-calling-fundamentals-and-its-evolution-in-modern-ai-systems
Author: Malik Rabee (@mk_99)
Published: 2026-02-23
Tags: ai, llm, anthropic, openai

![ChatGPT Image Feb 23, 2026, 01 14 00 AM](https://pythoughts.com/api/media/f4274aa4-832a-4f57-bf69-e7ebf947f77c/ILvZlmVvmhfGaUV88tQnA/uploads/1771827261213-bu48ej-chatgpt-image-feb-23_-2026_-01_14_00-am.png)

Large language models (LLMs) have rapidly evolved from text generators into action-oriented systems capable of interacting with APIs, databases, and external tools. At the heart of this transformation is **tool calling** (also known as function calling). This article explores how traditional tool calling works, its limitations, and how newer approaches—particularly programmatic tool execution and dynamic filtering—are reshaping agent architectures.

---

## What Is Tool Calling?
**Tool calling** is the mechanism that allows a language model to move beyond generating text and instead produce structured outputs (typically JSON) that invoke external functions or APIs.

In practical terms, tool calling:

- Enables LLMs to trigger real-world actions
- Allows models to interact with databases, web services, CRMs, and other systems
- Forms the foundation of modern AI agents

For nearly two years, the core mechanics of tool calling remained largely unchanged. However, recent advancements have significantly improved its efficiency and reliability.

---

# The Traditional Tool Calling Architecture

## How It Works

The conventional tool calling workflow typically looks like this:

- The agent receives a list of available tools, including:
  - Tool name
  - Description
  - Parameters
- A user submits a request.
- The model receives:
  - The user message
  - The conversation history
  - The full tool schema
- The model returns a structured JSON response specifying:
  - Which tool to call
  - The parameters to use
- The server executes the tool.
- The tool’s response is sent back to the model.
- The model synthesizes everything into a final response.

Behind the scenes, this creates a continuous back-and-forth loop between the model and the execution environment.

---

## Limitations of the Traditional Approach

While functional, this design introduces several systemic inefficiencies.

### 1. Inefficiency in Multi-Step Workflows

For complex tasks requiring multiple tool calls, the model must:

- Re-read context
- Regenerate structured parameters
- Make sequential decisions repeatedly

This creates excessive round trips between the model and the backend.

---

### 2. Nondeterministic Behavior

The model must regenerate precise parameters each time. Even small deviations in formatting can break workflows.

For example:

- An email search returns a list of message IDs.
- The model must correctly reproduce those exact IDs to fetch individual emails.
This repeated regeneration increases fragility.

---

### 3. Token Waste and Context Bloat

Tool responses often contain:

- Excess metadata
- Irrelevant fields
- Large payloads not needed for the next step

All of this gets injected into the model’s context window. Even though modern models advertise very large context windows, effective usable context is far smaller once noise accumulates.

---

### 4. Web and Nested Tool Problems

Consider a web research workflow:

- Web search returns URLs.
- Web fetch retrieves raw HTML.
- HTML is injected into context (including navigation bars, scripts, ads).
- The model must manually extract relevant content.
- That content is then passed to a writing tool.

At every stage, irrelevant data increases token usage and reduces clarity.

---

# The Shift Toward Programmatic Tool Calling

Recent advancements introduce a major shift: instead of using the model purely as a decision engine that emits JSON, the model can generate **executable code** inside a controlled environment. This approach fundamentally changes agent design.

---

## Core Idea

Rather than:

- Model → JSON → Tool → Response → Model → Next Tool

The model:

- Writes executable code
- Calls tools directly within that code
- Uses loops and conditionals
- Processes intermediate results internally

The model becomes an orchestrator operating inside a sandboxed runtime.

---

## How It Works

A secure code execution environment is added to the agent. Key aspects:

- The model is allowed to output executable code.
- Each tool specifies an `allowed_caller` parameter.
- The execution sandbox runs the model’s generated code.
- Intermediate data stays inside the runtime environment instead of flooding the context window.

The system then returns only relevant results back to the model.
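This data flow can be sketched in a few lines of Python. The sketch is purely illustrative, not any vendor's actual sandbox API: `search_emails` and `fetch_email` are hypothetical stand-in tools, and the "model-generated" code reuses the returned message IDs inside the runtime, so the model never has to retype them.

```python
# Hypothetical sketch of programmatic tool execution (stand-in tool names).

def search_emails(query: str) -> list[str]:
    """Stand-in search tool: returns message IDs only."""
    return ["id-101", "id-102", "id-103"]

def fetch_email(msg_id: str) -> dict:
    """Stand-in fetch tool: returns one email record."""
    return {"id": msg_id, "subject": f"Invoice {msg_id}", "unread": msg_id != "id-102"}

# Code the model might emit instead of a chain of JSON tool calls.
# The IDs flow tool-to-tool inside the sandbox; they never round-trip
# through the model's context, so they cannot be mistyped.
generated_code = """
ids = search_emails("invoices last week")
emails = [fetch_email(i) for i in ids]
result = [e["subject"] for e in emails if e["unread"]]
"""

# Expose only the allowed tools to the generated code, then run it.
sandbox = {"search_emails": search_emails, "fetch_email": fetch_email}
exec(generated_code, sandbox)

# Only this compact result is surfaced back into the model's context.
print(sandbox["result"])  # → ['Invoice id-101', 'Invoice id-103']
```

A real deployment would run the generated code in an isolated interpreter with resource limits rather than a bare `exec`, but the data flow is the same: intermediate results stay inside the runtime, and only the final value re-enters the context window.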
---

## Example Workflow

User request: “Query customer purchase history from the last quarter and identify the top five customers by revenue.”

Instead of making multiple separate tool calls:

- The model writes code that:
  - Calls the database
  - Aggregates revenue
  - Sorts customers
  - Selects the top five

All processing happens inside the execution environment. Only the final structured result is surfaced.

---

## Benefits of Programmatic Tool Execution

### 1. Significant Token Reduction

Because intermediate results are not injected into the context window, token consumption can drop substantially.

### 2. Fewer Model Round Trips

Instead of repeated decision cycles, logic is executed deterministically within code.

### 3. Deterministic Logic

Code enables:

- Loops
- Conditional filtering
- Batch processing
- Structured transformations

This reduces unpredictability compared to repeated JSON emissions.

### 4. Better Handling of Large Data

Large datasets can be processed internally without overwhelming the model’s context window.

---

## When This Approach Is Most Useful

Programmatic tool execution is especially effective for:

- Large dataset aggregation
- Financial calculations
- Multi-step data pipelines
- Structured workflows with deterministic logic
- Scenarios requiring filtering before summarization

---

# Dynamic Filtering for Web Fetch

Another major improvement addresses web retrieval inefficiency.

---

## The Problem

Traditional web fetch tools return full HTML pages, including:

- Navigation menus
- Scripts
- Styling
- Ads
- Irrelevant sections

Most of this content is useless for answering a question.

---

## The Solution: Intermediate Filtering

Modern implementations introduce a filtering layer inside the web fetch tool:

- HTML is retrieved.
- Code extracts only relevant content.
- Cleaned, structured data is returned to the model.

The model no longer sees the entire raw page.
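The filtering idea can be illustrated with the Python standard library alone. This is my own minimal sketch using `html.parser` (production web-fetch tools use far more robust extraction): it drops `<script>`, `<style>`, and `<nav>` subtrees and returns only the visible text.

```python
from html.parser import HTMLParser

class ContentFilter(HTMLParser):
    """Keeps visible text but drops <script>, <style>, and <nav> subtrees."""
    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a skipped subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside every skipped subtree.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def filtered_fetch(raw_html: str) -> str:
    """Return the cleaned text a model would actually need to see."""
    parser = ContentFilter()
    parser.feed(raw_html)
    return " ".join(parser.chunks)

page = "<html><nav><a>Home</a></nav><script>track()</script><p>Quarterly revenue grew 12%.</p></html>"
print(filtered_fetch(page))  # → Quarterly revenue grew 12%.
```

The nav link and the tracking script never reach the model; only the one sentence that answers the question does, which is exactly the context-pollution fix described above.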
---

## Benefits

- Reduced token usage
- Improved accuracy
- Lower noise in reasoning
- More focused responses

This approach prevents context pollution at the source.

---

# Tool Search: Scalable Tool Loading

As agents grow more complex, another problem emerges: too many tools.

---

## The Scalability Challenge

Loading hundreds of tool schemas into the model’s context:

- Consumes tokens
- Slows reasoning
- Reduces clarity

Even when most tools are irrelevant to the current task.

---

## The Solution: Tool Search

Instead of preloading everything:

- A single **tool search tool** is provided.
- When needed, the model retrieves only relevant tools.
- Non-essential schemas remain unloaded.

This dynamic loading model makes large agent systems more scalable and efficient.

---

# The Broader Implication

The evolution from static JSON-based tool calling to programmatic execution represents a major architectural shift:

- From reactive orchestration
- To structured, deterministic workflows

Large language models are generally stronger at writing code than repeatedly emitting structured JSON while reasoning step-by-step. Leveraging that strength produces more efficient, scalable agent systems.

---

# Final Thoughts

Tool calling transformed language models from passive text generators into active systems capable of real-world interaction. However, early implementations exposed limitations around efficiency, token usage, and deterministic execution.

The newer generation of approaches—programmatic execution, dynamic filtering, and tool search—signals a maturation of agent design. These improvements reduce noise, lower costs, and increase reliability while enabling more complex workflows.

As AI systems continue to scale, the architecture of tool calling will remain one of the most critical foundations in building robust, production-grade agents.
---

## Claude Sonnet & Opus 4.6 vs GLM‑5, Kimi K2.5, and MiniMax M2.5

URL: https://pythoughts.com/@moelkholy/posts/claude-sonnet-opus-46-vs-glm5-kimi-k25-and-minimax-m25
Author: Mohamed Elkholy (@moelkholy)
Published: 2026-02-19
Tags: claude, ai, coding

![ChatGPT Image Feb 19, 2026, 05 44 14 PM](https://pythoughts.com/api/media/384519d5-3c9c-492d-9a66-96d6adbf9a70/BETuLFtLQES3vJVBIqqud/uploads/1771541176218-3g3bcx-chatgpt-image-feb-19_-2026_-05_44_14-pm.png)

### A five‑model cage match for agentic coding, browsing, and “real work” economics

Two polished, gated enterprise thoroughbreds (Claude Sonnet 4.6 + Opus 4.6) step into the ring with three open‑weights bruisers (GLM‑5, Kimi K2.5, MiniMax M2.5). This isn’t a beauty contest. It’s a *deployment decision*—and in 2026, that decision is increasingly governed by three forces:

- **Agentic capability** (terminal work, browsing, repo fixing)
- **Variance** (how often you need retries or human babysitting)
- **Economics** (especially output tokens)

If you’re choosing a model for “real work,” you’re choosing how often your system breaks and how much it costs when it doesn’t.

---

## Executive summary

Across widely cited agentic benchmarks (terminal tasks, browsing, and computer‑use), **Claude Opus 4.6** remains the strongest overall closed model in this comparison, while **Claude Sonnet 4.6** offers a notably better performance‑per‑dollar profile for many enterprise workflows—especially “office work” proxy tasks—at lower token prices.

Anthropic’s benchmark tables show Opus 4.6 leading Sonnet 4.6 on several core agentic / reasoning anchors:

- **Terminal‑Bench 2.0:** 65.4 vs 59.1
- **BrowseComp:** 84.0 vs 74.7
- **GPQA Diamond:** 91.3 vs 89.9
- **HLE with tools:** 53.1 vs 49.0

Sonnet 4.6 stays competitive and occasionally leads on applied “work” metrics (e.g., GDPval‑AA Elo and Finance Agent v1.1 in Anthropic’s published materials).
Among the open‑weights contenders:

- **GLM‑5** appears the most “agentic-search/tool-use” optimized in public reporting, with vendor charts showing strong BrowseComp and MCP‑Atlas results alongside a strong SWE‑bench Verified score.
- **Kimi K2.5** emphasizes multimodal and long-horizon agent workflows, plus unusually explicit evaluation setup disclosures (temperature, top‑p, context length) that improve reproducibility.
- **MiniMax M2.5** positions as a “real‑work productivity” model with standout engineering economics, pairing strong SWE‑bench Verified performance with very low output token pricing and unusually transparent evaluation methodology.

**Pricing (uncached input / output) is stark:**

- **MiniMax M2.5:** $0.30/M input, $1.20/M output
- **GLM‑5:** $1.00/M input, $3.20/M output
- **Claude Sonnet 4.6:** $3/M input, $15/M output
- **Claude Opus 4.6:** $5/M input, $25/M output

Kimi K2.5’s pricing is published in RMB with cache hit/miss semantics; USD conversion depends on FX.

**Practitioner recommendation (default choices):**

- Default premium-but-scalable: **Claude Sonnet 4.6**
- Hardest long-horizon agentic tasks: **Claude Opus 4.6**
- Output-token economics dominate: **MiniMax M2.5**
- Open-weights with elite tool/search posture: **GLM‑5**
- Multimodal + long-horizon “agent swarm” workflows: **Kimi K2.5**

---

## The cast: what each model *wants* to be

### Claude Sonnet 4.6 — the operator

Sonnet 4.6 is what you deploy when you want a default premium model that doesn’t create operational chaos. It’s the workhorse: strong enough to ship, predictable enough to scale. It’s also the model that makes finance people stop emailing you at 2 a.m.

**Why it matters:** Sonnet is the “reliable daily driver” you can actually afford to run everywhere—until the task becomes a long, brittle chain of tool calls.

---

### Claude Opus 4.6 — the closer

Opus 4.6 is the “pay for the last 10% because that 10% is the whole job” option.
When tasks get gnarly—long horizon, high branching, tool failures, and partial progress that must be recovered—Opus is the model you deploy when you’d rather pay more than explain the failure.

**Why it matters:** Opus tends to win on the benchmarks that look like real agent work instead of pretty chat.

---

### GLM‑5 — the agentic engineer

GLM‑5 reads like it was built by someone who got tired of watching models crumble in tool loops. It’s one of the most “agentic” open‑weights contenders in public reporting—strong browsing, credible terminal performance, and a SWE‑bench Verified score that keeps it in the same room as the frontier.

**Why it matters:** if you care about hosting, customization, instrumentation, and you still want serious agent capability, GLM‑5 is a loud signal.

---

### Kimi K2.5 — the long‑context multitool

Kimi K2.5’s identity is “long horizon, multimodal, agent workflows.” It wants to be the model you point at messy, real-world research tasks—files, screenshots, browser trails, and multi‑step synthesis.

**Why it matters:** it’s one of the more explicit vendors about eval conditions (mode/settings), which helps practitioners reproduce behavior rather than worship scoreboard numbers.

---

### MiniMax M2.5 — the cost assassin

MiniMax M2.5 shows up holding a spreadsheet and smiling. It pairs a strong SWE‑bench Verified headline with output token pricing that’s dramatically lower than the closed leaders. That doesn’t just change your bill—it changes your *behavior*. You can afford to try more things.

**Why it matters:** when output tokens dominate cost (PRs, long diffs, extensive explanations, multi‑candidate generation), MiniMax’s economics can make previously “too expensive” workflows suddenly viable.
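To make that economics point concrete, here is a back-of-envelope calculation using the first-party prices quoted in this article (uncached, USD per million tokens). The 20M-input / 50M-output monthly workload is an invented example for illustration, not a measured figure.

```python
# Per-million-token prices (USD, uncached) as quoted in this article.
PRICES = {
    "MiniMax M2.5":      {"in": 0.30, "out": 1.20},
    "GLM-5":             {"in": 1.00, "out": 3.20},
    "Claude Sonnet 4.6": {"in": 3.00, "out": 15.00},
    "Claude Opus 4.6":   {"in": 5.00, "out": 25.00},
}

def monthly_cost(model: str, m_in: float, m_out: float) -> float:
    """Cost in USD for m_in / m_out millions of input / output tokens."""
    p = PRICES[model]
    return m_in * p["in"] + m_out * p["out"]

# A hypothetical output-heavy agent workload: 20M input, 50M output per month.
for model in PRICES:
    print(f"{model:18} ${monthly_cost(model, 20, 50):>9,.2f}")
```

On these assumed numbers the same output-heavy month comes to roughly $66 on MiniMax M2.5 versus roughly $1,350 on Claude Opus 4.6, which is the behavioral point: cheap output tokens let you attempt more candidates per task.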
---

## The benchmarks that actually feel like work

The “agentic” set is where models get exposed:

- **SWE‑bench Verified:** real GitHub issues, real repos, real pain
- **Terminal‑Bench 2.0:** terminal tasks where “close enough” fails
- **BrowseComp:** browsing + navigation + synthesis under constraints
- **OSWorld‑Verified:** computer-use tasks across real apps
- **HLE (with tools) + GPQA Diamond:** reasoning anchors under pressure

If a model looks good here, it’s usually good in production.

---

## Round 1: Terminal work (where agents go to die)

Terminal‑Bench 2.0 is a cruel test. It punishes brittle planning, sloppy state tracking, and the classic failure mode of modern agents: “I will now describe the correct approach instead of doing it.”

Anthropic’s published tables show:

- **Opus 4.6:** 65.4
- **Sonnet 4.6:** 59.1

GLM‑5’s reported chart shows:

- **GLM‑5:** 56.2

**How this feels in practice:**

- Opus is most likely to keep its footing after tool errors, partial progress, and annoying file state surprises.
- Sonnet is close enough that you’ll often prefer it—until the task requires long, brittle chains where small mistakes compound.
- GLM‑5 looks tuned to survive agent loops.

---

## Round 2: Browsing (the “can you actually *find* it?” test)

BrowseComp is the archetypal research-agent benchmark: navigate, persist, extract, verify.

Anthropic reports:

- **Opus 4.6:** 84.0
- **Sonnet 4.6:** 74.7

GLM‑5’s reported chart shows:

- **GLM‑5:** 75.9

Kimi and MiniMax both publish BrowseComp results with context management variants in their release materials.

**Interpretation that matters:**

- Opus is the “I need the answer and I need it clean” browsing choice.
- Sonnet is the “I need this at scale” browsing choice.
- GLM‑5 is the “open weights, still elite at browsing loops” option.

---

## Round 3: SWE‑bench Verified (repo‑fixing credibility)

SWE‑bench is where marketing gets punched in the face.
Reported scores put several of these models in the same neighborhood:

- Claude Sonnet 4.6: 79.6
- Claude Opus 4.6: 80.8
- GLM‑5: 77.8
- MiniMax M2.5: 80.2
- Kimi K2.5: strong coverage across SWE‑bench variants with configuration notes

**What it means:** SWE‑bench is no longer “closed models only.” The real differentiator becomes variance: how often it works *without* retries, and how expensive those retries are when they happen.

---

## The money round: token economics that change behavior

Pricing isn’t just budget. It changes what you *attempt*.

### First‑party API pricing (uncached input/output)

- **MiniMax M2.5:** $0.30/M input, $1.20/M output
- **GLM‑5:** $1.00/M input, $3.20/M output
- **Claude Sonnet 4.6:** $3/M input, $15/M output
- **Claude Opus 4.6:** $5/M input, $25/M output

### Kimi K2.5 (RMB with cache semantics)

Moonshot publishes RMB pricing and makes cache hit/miss explicit:

- input (cache hit): ¥0.70/M
- input (cache miss): ¥4.00/M
- output: ¥21.00/M

**Why it matters:**

- If you reuse a stable system prompt + long context, Kimi’s cache semantics can be meaningful.
- If you generate huge outputs, MiniMax’s output pricing becomes a structural advantage.
- If you need compliance and predictability at scale, Sonnet is the safer default.
- If your agent loops regularly reach the “hard mode” cliff, Opus often pays for itself by avoiding reruns.

---

## Decision flowchart: pick your default in 60 seconds

### If you’re building enterprise agent workflows

Use **Claude Sonnet 4.6** as the default. Upgrade to **Opus 4.6** when:

- the task is long-horizon and brittle
- the downside of failure is high
- you’re doing tool-heavy terminal or browsing chains

### If you want open weights *and* top-tier agent posture

Pick **GLM‑5** for agentic tool/search behavior.

### If multimodal + long-horizon research is the core product

Pick **Kimi K2.5**, especially if caching and explicit configuration control matter.
### If output tokens dominate your bill

Pick **MiniMax M2.5** and invest in scaffolding (tests, validators, strict tool schemas) to reduce variance.

---

## The uncomfortable truth: benchmark numbers aren’t clean

Even with honest vendors, results are sensitive to:

- tool availability & tool schema strictness
- context management
- retries and timeouts
- “thinking mode” / effort settings
- scaffold quality

Treat published results as **directional signals**, not absolute truth. If your workloads are high-stakes, you still need a reproducible harness.

---

## The final punchline

In 2026, “the best model” is increasingly a *portfolio*:

- **Sonnet** for the majority of production work
- **Opus** for the jobs that break agents
- **MiniMax** when output economics unlock scale
- **GLM‑5** when you need open weights with elite agent posture
- **Kimi** when multimodal + long-horizon research workflows are the product

If you want to ship, don’t pick a winner. Pick a default, define an escalation path, and make cost part of your architecture—not an afterthought.

---

## Mastering Error Handling and Bug Catching with Claude Code: A Comprehensive Guide

URL: https://pythoughts.com/@hamza/posts/mastering-error-handling-and-bug-catching-with-claude-code-a-comprehensive-guide
Author: Hamza (@hamza)
Published: 2026-02-25
Tags: ai, claude, debug

![ChatGPT Image Feb 25, 2026, 05 27 03 PM](https://pythoughts.com/api/media/afb138d5-349a-4d36-af08-4335febefe96/1g94l5ONuhMIwlTSJJ1Vg/uploads/1772058446142-bm272g-chatgpt-image-feb-25_-2026_-05_27_03-pm.png)

---

## Introduction

Claude Code is an AI coding assistant from Anthropic that can read your code, run commands, and make changes to your project. It can significantly accelerate development by automating tasks, fixing bugs, and building features—but strong results in **error handling** and **bug catching** require a deliberate setup.
This guide walks you through the most effective patterns for making Claude Code reliable in real-world development:

- Building a **verifiable feedback loop** so Claude can validate its own work
- Using a **systematic workflow** for bug resolution
- Configuring **Hooks**, **MCP integrations**, and a well-structured **CLAUDE.md**
- Applying advanced options like skills, sub-agents, agent teams, and context management

A key theme throughout is that Claude performs best when it can check itself via tests, linters, and deterministic automation.

---

## The Core Philosophy: Verifiable Work

The single most important principle for a successful Claude Code setup is **verifiability**. Claude Code is not just a chatbot—it’s an agent acting on your codebase. Its performance improves dramatically when it has a clear mechanism to verify the success or failure of its work.

> Claude performs dramatically better when it can verify its own work, like run tests, compare screenshots, and validate outputs. Without clear success criteria, it might produce something that looks right but actually doesn’t work.

In practice, this means building a feedback loop around:

- Unit tests and integration tests
- Linters and formatters
- Build/compile steps
- Simple scripts that validate outputs

When Claude can run these checks itself, it becomes a far more reliable partner.

---

## A Systematic Workflow for Bug Resolution

A structured process prevents Claude from jumping to conclusions and helps ensure a correct fix—especially in unfamiliar codebases.

| Phase | Description | Key Actions |
| --- | --- | --- |
| 1. Explore | Understand the bug without making changes. | Use Plan Mode. Ask Claude to read relevant files, explain the architecture, and trace execution flow. |
| 2. Plan | Create a step-by-step fix plan. | Ask Claude to write a plan. Use Ctrl+G to open and refine it in your editor. |
| 3. Implement | Apply changes following the plan. | Switch to Normal Mode. Instruct Claude to implement and run verification steps. |
| 4. Commit | Finalize and integrate. | Ask Claude to commit with a clear message and open a pull request. |

This separation of exploration, planning, and implementation is especially useful for complex bugs.

---

## Hooks for Error Handling and Automation

**Hooks** are user-defined shell commands that execute at specific points in Claude Code’s lifecycle. They provide deterministic control—ensuring certain actions (like running tests, formatting code, or blocking unsafe commands) happen automatically.

Hooks are one of the fastest ways to improve bug catching because they:

- Enforce quality gates consistently
- Provide immediate feedback when something fails
- Reduce reliance on “remembering to run tests”

### Hook Events That Matter Most

- **PreToolUse:** Runs before a tool executes. Great for validation and blocking risky operations.
- **PostToolUse:** Runs after a tool succeeds. Great for running tests or formatting.
- **PostToolUseFailure:** Runs when a tool fails. Useful for logging and alerting.
- **SessionStart:** Runs at the start of a session—useful for initialization or reinjecting context.
- **Notification:** Runs when Claude is waiting for user input.
### Practical Hook Examples

**1) Run tests after any file write**

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write",
        "hooks": [
          { "type": "command", "command": "pytest tests/" }
        ]
      }
    ]
  }
}
```

**2) Log failures automatically**

```json
{
  "hooks": {
    "PostToolUseFailure": [
      {
        "matcher": ".*",
        "hooks": [
          { "type": "command", "command": "echo 'Tool failed: $TOOL_INPUT' >> /tmp/claude-error.log" }
        ]
      }
    ]
  }
}
```

**3) Block dangerous commands (example script)**

```bash
#!/bin/bash
# Prevent dangerous commands
if [[ $TOOL_INPUT =~ (rm -rf|sudo) ]]; then
  echo "Dangerous command blocked: $TOOL_INPUT" >&2
  exit 2  # Exit code 2 blocks the action
fi
exit 0
```

**4) Auto-format after writes (Prettier example)**

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write",
        "hooks": [
          { "type": "command", "command": "npx prettier --write ." }
        ]
      }
    ]
  }
}
```

### Hook Best Practices

- Validate before execution using `PreToolUse`
- Always run tests/linting after changes
- Log failures so debugging data isn’t lost
- Use matchers to keep expensive checks targeted
- Start simple, then iterate

---

## MCP Integration

**MCP (Model Context Protocol)** is an open standard that lets Claude connect to external tools and data sources. This is a major upgrade for debugging because it gives Claude real-world context beyond your local codebase.
### What MCP Enables - Query production errors from monitoring tools - Read and update issues in trackers - Interact with GitHub/GitLab PRs and CI logs - Query databases to reproduce state-dependent bugs - Pull design specs from tools like Figma ### Essential MCP Integrations for Debugging - **Error monitoring:** Sentry (stack traces, frequency, context) - **Project management:** Jira / Linear (requirements and acceptance criteria) - **Version control:** GitHub / GitLab (PR discussions, checks, CI failures) - **Databases:** PostgreSQL (query state, logs, user journeys) ### MCP Best Practices - Use OAuth or secure auth wherever possible - Scope access tightly—only what Claude needs - Test each integration with small prompts first - Limit tool output size to avoid flooding context - Combine MCP + Hooks for automated feedback loops --- ## The `CLAUDE.md` File **CLAUDE.md** is a project-level Markdown file that becomes part of Claude’s system prompt when you work in that directory. It’s the best place to define project-specific rules and workflows. 
### What to Put in CLAUDE.md

- Tech stack and versions
- Project structure and “where things live”
- Testing commands (unit, integration, e2e)
- Error handling conventions (exceptions, logging, retries)
- Coding standards and formatting rules
- Deployment notes and required env vars
- Team conventions (branches, commits, PR expectations)
- Known bugs and reproduction steps

### Example Sections

**Tech stack**

```markdown
# Tech Stack
- Language: Python 3.10
- Framework: Django 4.2
- Database: PostgreSQL 15
- Env Vars: DATABASE_URL, SECRET_KEY, DEBUG
```

**Testing & validation**

```markdown
# Testing
- Unit tests: `pytest tests/`
- E2E tests: `./run_e2e_tests.sh`
- Run tests before merging to main
- New features must include tests
```

**Error handling conventions**

```markdown
# Error Handling
- APIs return JSON: {"status": "...", "message": "..."}
- Log exceptions with stack traces
- Use custom exceptions:
  - BadRequestError: validation issues
  - DatabaseError: DB failures
```

### CLAUDE.md Best Practices

- Keep it short, clear, and practical
- Use headers and bullet points for scanability
- Split large repos using imports (e.g., `@security.md`)
- Update it as standards change
- Add short code examples for common patterns

---

## Other Advanced Setup Considerations

### Skills and Sub-Agents

Skills extend Claude with specialized workflows (generate tests, security checks, refactoring helpers). Sub-agents are separate Claude instances that can tackle focused subtasks—great for debugging deep modules without cluttering the main thread.

### Agent Teams

Agent teams allow parallel investigation—one agent reads logs, another traces backend flow, another inspects frontend state. This can accelerate root-cause discovery in multi-system bugs.

### Memory and Context Management

Claude can compact context during long sessions.
To avoid losing critical details: - Keep CLAUDE.md concise and high-signal - Use `SessionStart` hooks to re-inject key project conventions when needed - Maintain a lightweight “registry” document for recurring bugs and fixes ### Performance Tuning - Cap overly large MCP outputs - Prefer targeted tasks (fix one function) over vague requests (fix everything) - Ensure the environment can run tests quickly—feedback speed matters --- ## Conclusion A strong Claude Code setup transforms it from a helpful assistant into a more autonomous partner for maintaining code quality. The winning formula is an integrated loop: - **CLAUDE.md**: teaches Claude how your project works - **Hooks**: enforce safety, formatting, and tests deterministically - **MCP**: supplies real-world context from monitoring, tickets, CI, and databases Start small—add a test-running hook and a basic CLAUDE.md—then expand toward MCP and advanced automation. Over time, Claude becomes dramatically better at catching bugs early and fixing them correctly. --- --- ## 🐍 Beginner’s Complete Guide to Building an MCP Server in Python URL: https://pythoughts.com/@hamza/posts/beginners-complete-guide-to-building-an-mcp-server-in-python Author: Hamza (@hamza) Published: 2026-02-25 Tags: ai, ML, mcp ![ChatGPT Image Feb 25, 2026, 01 14 44 AM](https://pythoughts.com/api/media/afb138d5-349a-4d36-af08-4335febefe96/1g94l5ONuhMIwlTSJJ1Vg/uploads/1772000107410-2oe5j3-chatgpt-image-feb-25_-2026_-01_14_44-am.png) --- # What Is MCP? **MCP (Model Context Protocol)** is a standard that allows AI models to interact with external tools in a structured and reliable way. Instead of the model guessing how to call APIs, MCP: - Clearly defines available tools - Defines required parameters - Allows safe execution - Returns structured responses Think of MCP as: > A universal contract between AI models and your backend systems. 
---

# Why MCP Is Important

Without MCP:

- You parse natural language manually
- You extract intent manually
- You map to backend functions manually
- You handle errors manually

With MCP:

- Tools are structured
- Parameters are validated
- AI understands what it can use
- Execution becomes standardized

---

# High-Level Architecture

Here’s how the system works:

```plaintext
User → AI Model → MCP Client → MCP Server → Your Backend Logic
```

### 🔹 MCP Server (What We’ll Build)

- Defines tools
- Validates inputs
- Executes logic
- Returns structured results

### 🔹 MCP Client

- Connects AI to MCP server
- Sends tool calls
- Returns results

### 🔹 AI Model

- Decides which tool to use
- Supplies parameters

---

# What We’ll Build

We’ll build a simple Python MCP Server with:

- `add_numbers`
- `get_weather`
- `get_current_time`

It will:

- Expose tool metadata
- Accept tool execution requests
- Return structured JSON responses

---

# Setup Environment

### Install Python 3.9+

Check version:

```bash
python --version
```

### Create Project

```bash
mkdir python-mcp-server
cd python-mcp-server
```

### Create Virtual Environment (Recommended)

```bash
python -m venv venv
source venv/bin/activate   # Mac/Linux
venv\Scripts\activate      # Windows
```

### Install Dependencies

We’ll use:

- FastAPI (for API server)
- Uvicorn (for running server)
- Pydantic (for validation; installed automatically as a FastAPI dependency)

```bash
pip install fastapi uvicorn
```

---

# Create MCP Server File

Create:

```plaintext
server.py
```

---

# Step 1 — Basic FastAPI Setup

Add:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Dict, Any
import datetime

app = FastAPI(title="Python MCP Server")
```

Run server:

```bash
uvicorn server:app --reload
```

Visit:

```plaintext
http://127.0.0.1:8000/docs
```

FastAPI automatically gives you API documentation 🎉

---

# Step 2 — Define MCP Tools Metadata

MCP tools must include:

- name
- description
- parameters schema

Add this below your app definition:

```python
tools = [
    {
        "name": "add_numbers",
        "description": "Add two numbers together",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "number", "description": "First number"},
                "b": {"type": "number", "description": "Second number"}
            },
            "required": ["a", "b"]
        }
    },
    {
        "name": "get_weather",
        "description": "Get current weather by city name",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    },
    {
        "name": "get_current_time",
        "description": "Get current server time",
        "parameters": {
            "type": "object",
            "properties": {}
        }
    }
]
```

---

# Step 3 — Expose Tools Endpoint

Add:

```python
@app.get("/tools")
def list_tools():
    return tools
```

Now open:

```plaintext
http://127.0.0.1:8000/tools
```

You’ll see available tools.

---

# Step 4 — Define Execution Request Model

We need a structured request body. Add:

```python
class ExecuteRequest(BaseModel):
    name: str
    arguments: Dict[str, Any]
```

---

# Step 5 — Implement Tool Execution Logic

Add this route:

```python
@app.post("/execute")
def execute_tool(request: ExecuteRequest):
    name = request.name
    args = request.arguments

    try:
        if name == "add_numbers":
            a = args.get("a")
            b = args.get("b")
            if a is None or b is None:
                raise HTTPException(status_code=400, detail="Missing parameters")
            return {"result": a + b}

        elif name == "get_weather":
            city = args.get("city")
            if not city:
                raise HTTPException(status_code=400, detail="City is required")
            # Fake weather data
            return {"result": f"The weather in {city} is 25°C and sunny."}

        elif name == "get_current_time":
            now = datetime.datetime.now()
            return {"result": now.isoformat()}

        else:
            raise HTTPException(status_code=400, detail="Tool not found")

    except HTTPException:
        # Re-raise HTTP errors as-is so 400s aren't rewrapped as 500s
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
```

---

# Testing the MCP Server

Using curl:

```bash
curl -X POST http://127.0.0.1:8000/execute \
  -H "Content-Type: application/json" \
  -d '{"name":"add_numbers","arguments":{"a":10,"b":5}}'
```

Response:

```json
{
  "result": 15
}
```

Your MCP server works 🎉

---

# How AI Uses This Server

When connected to an AI model:

- Model reads `/tools`
- User asks: “What time is it?”
- Model selects:

```plaintext
get_current_time({})
```

- MCP Client sends request to `/execute`
- Server responds
- Model formats response for user

---

# Improving the Server (Best Practices)

## ✅ Add Proper Validation

Use Pydantic models per tool instead of manual `.get()` checks.

## ✅ Separate Logic From Routing

Better structure:

```plaintext
/tools/
    weather.py
    math.py
main.py
```

## ✅ Add Logging

```python
import logging
logging.basicConfig(level=logging.INFO)
```

## ✅ Add Authentication

For production, add:

- API keys
- JWT validation
- OAuth

## ✅ Rate Limiting

Use:

```plaintext
pip install slowapi
```

---

# Connecting to Real APIs

Replace fake weather with a real API:

```python
import requests

response = requests.get("https://api.weatherapi.com/...")
data = response.json()
```

Return structured data instead of plain text.

---

# Production Deployment

## Using Gunicorn

```bash
pip install gunicorn
gunicorn -w 4 -k uvicorn.workers.UvicornWorker server:app
```

## Using Docker

Example Dockerfile:

```dockerfile
FROM python:3.10
WORKDIR /app
COPY . .
RUN pip install fastapi uvicorn
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
```

---

# Common Beginner Mistakes

❌ Not validating parameters
❌ Returning plain text instead of structured JSON
❌ Putting all logic in one giant function
❌ No error handling
❌ No security layer

---

# Understanding the Big Picture

MCP servers are the foundation for:

- AI agents
- Enterprise AI systems
- Autonomous workflows
- AI copilots
- Internal automation

When you build an MCP server, you are basically making your backend “AI-ready”.
---

# 🎯 Final Summary

| Component | Role |
| --- | --- |
| MCP Server | Hosts and executes tools |
| Tool | Structured callable function |
| Parameters | JSON schema definition |
| Execute endpoint | Runs tool logic |
| AI Model | Chooses when to call tools |

---

# 🚀 Congratulations

You now know how to:

- Build a Python MCP Server
- Define structured tools
- Execute tool calls
- Return safe results
- Prepare for AI integration

---

---

## Inside the Black Box: 5 Surprising Truths About How Modern AI Actually Thinks

URL: https://pythoughts.com/@hamza/posts/inside-the-black-box-5-surprising-truths-about-how-modern-ai-actually-thinks
Author: Hamza (@hamza)
Published: 2026-02-23
Tags: ai, ML, neuralnetworks

![ChatGPT Image Feb 23, 2026, 12 27 08 PM](https://pythoughts.com/api/media/afb138d5-349a-4d36-af08-4335febefe96/1g94l5ONuhMIwlTSJJ1Vg/uploads/1771867698519-riavk9-chatgpt-image-feb-23_-2026_-12_27_08-pm.png)

We have all experienced it: that "uncanny valley" moment where a chatbot provides a response so nuanced, empathetic, or witty that it feels like there must be a ghost in the machine. It is a striking transition from the rigid, "if-then" logic of traditional computing to a system that seems to genuinely understand the subtle textures of human thought.

But behind the curtain, there is no ghost—only elegant architecture. Modern Large Language Models (LLMs) have bridged the gap between raw calculation and fluent conversation through a series of technical breakthroughs. By looking at the structural dynamics of these systems, we can uncover five surprising truths about how AI actually "thinks."

### 1. AI Doesn’t See Words, It Navigates a "Latent Space"

When you type a sentence into an AI, the model immediately discards the letters and words. Instead, an **Embedding Layer** converts that text into high-dimensional numerical vectors. This mathematical environment, where words are represented as coordinates in a universe often spanning 512 to over 1,000 dimensions, is known as **Latent Space**.
Think of Latent Space as a massive "meaning map." In this space, every concept is assigned a precise coordinate. Because the model has been trained on billions of parameters and pages of text, it has learned that words with similar meanings should be physically positioned closer together. For example, "manager" and "executive" will be located near each other in this 1,000-dimensional neighborhood, while a word like "banana" will be in a completely different sector. This allows the AI to understand semantic relationships without being manually told that two words are synonyms; it simply "sees" their mathematical proximity. "Latent space is a machine learning concept that refers to a compressed, abstract representation of data. Instead of working with raw, high-dimensional information... machine learning algorithms use latent space to focus on the essential features, uncovering hidden patterns or relationships." — *Coursera* ### 2. The "Attention" Mechanism and "Positional Encoding" Solved the Memory Problem Before the era of "Transformer" models, AI relied on Recurrent Neural Networks (RNNs) that processed words one by one. These models suffered from "short-term memory" because they were forced to process information sequentially; by the time they reached the end of a long paragraph, they had literally "forgotten" the beginning. The **Multi-Head Attention** mechanism changed this by allowing Transformers to look at the entire sequence of text simultaneously. However, because Transformers process everything in parallel, they face a unique challenge: they don’t naturally know the order of words. To solve this, developers use **Positional Encoding**, injecting information about word order using sine and cosine waves. Without this "secret ingredient," the model wouldn’t know the difference between "The dog bit the man" and "The man bit the dog." Once order is established, **self-attention** allows the model to resolve ambiguity. 
Consider the sentence: *"Although the CEO praised the engineer, she declined the promotion."* Through self-attention, the model calculates the relationship between "she" and every other word, successfully resolving that "she" refers to the CEO.

"The power of the attention mechanism is that it doesn't suffer from short-term memory... the attention mechanism in theory and given enough compute resources have an infinite window to reference from." — *The AI Hacker*

### 3. "Tokenization" is the Secret Bridge for Words the AI has Never Seen

If an AI only understood whole words, it would be stumped by typos, new slang, or complex medical terms. To solve this, AI uses **Subword Tokenization**, breaking words into smaller, meaningful chunks. This solves the "Out-of-Vocabulary" (OOV) problem and allows the model to handle morphologically rich languages by understanding prefixes and suffixes (like "un-", "believ", "-able").

The three primary methods of tokenization used by major models involve different strategies for handling spaces and merging characters:

| Tokenization Method | Core Algorithm | Space Handling | Primary Model Use Case |
| --- | --- | --- | --- |
| Byte Pair Encoding (BPE) | Iterative merging of most frequent adjacent pairs | Requires whitespace pre-tokenization | GPT-4, Llama |
| WordPiece | Probabilistic model maximizing corpus likelihood | Uses "##" to mark subword continuation | BERT |
| SentencePiece | Language-agnostic (treats text as raw stream) | Treats spaces as a special character (_) | T5, Llama-2 |

While BPE and WordPiece rely on language-specific rules like whitespace, **SentencePiece** is uniquely language-agnostic because it doesn't require pre-tokenization, making it a vital bridge for non-European languages.

### 4. Why AI Needs "Human Rewards" to Stay on the Rails

Raw AI models are trained on the internet, which means they are technically capable of being rude or biased.
To move from a "raw" model to a helpful assistant, developers use a two-step process: **Supervised Fine-Tuning (SFT)** and **Reinforcement Learning from Human Feedback (RLHF)**. In RLHF, humans rank multiple AI-generated responses. This feedback is used to train a separate **Reward Model**, which then acts as a "digital judge" to score the LLM’s future attempts. The AI practices generating responses to maximize its "reward" score, learning human preferences through trial and error. This "human touch" provides three critical benefits: - **Improved Performance:** Better alignment with complex user intent. - **Reduced Bias:** Identifying and mitigating prejudices found in raw data. - **Dynamic Adaptability:** While SFT provides **static safeguards** through fixed examples, RLHF allows the model to adapt to nuanced, subjective human values. ### 5. "Temperature" is the Dial for AI Creativity Have you ever noticed that the same prompt can yield a boring, factual answer or a creative, unexpected one? This is controlled by a hyperparameter called **Temperature**. It is essentially a "randomness dial" for the model’s word choice. - **Low Temperature (e.g., 0.2):** The model becomes deterministic and "staid." It will almost always choose the most statistically probable next word. This is ideal for medical or legal tasks where precision is paramount. - **Standard Temperature (1.0):** This represents the model's standard probability distribution. - **High Temperature (0.8 to 2.0):** The model "flattens" the probability distribution, making it more likely to pick less-probable words. While this encourages creative flair, moving toward a temperature of **2.0** often leads to "nonsensical" or "unstable" outputs. ### Conclusion: The Future is Agentic We are currently witnessing a shift from "Traditional AI"—which followed hand-coded rules—to a generation of models that can reason through latent relationships. However, the next frontier is **Agentic AI**. 
We are moving away from models that merely "speak" toward systems structured to plan, use tools and APIs, and iterate independently toward a goal. This transition introduces a new technical layer: **function calling**. Instead of just predicting the next word, an agentic model can decide to call a software tool to solve a math problem or check a real-time database. As we move from predictive accuracy to agentic autonomy, we face a fascinating crossroads: how do we balance the emerging "reasoning" and planning abilities of these machines with our absolute need for ethical alignment? The "black box" is opening, but the responsibility of steering what is inside remains firmly with us. --- ## Claude Code Agents Design Overview URL: https://pythoughts.com/@moelkholy/posts/claude-code-agents-design-overview Author: Mohamed Elkholy (@moelkholy) Published: 2026-02-22 Tags: claude, ai, design # ![Gemini Generated Image 4tzeqx4tzeqx4tze](https://pythoughts.com/api/media/384519d5-3c9c-492d-9a66-96d6adbf9a70/BETuLFtLQES3vJVBIqqud/uploads/1771802013520-jqyd3t-gemini_generated_image_4tzeqx4tzeqx4tze.png) ## What are Agent Teams - **Agent Teams**: An experimental feature released with Claude Opus 4.6 that enables multiple agents to work in parallel while communicating directly with one another. - Key distinction from sub-agents: Agent Teams allow multiple agents to operate concurrently and coordinate through direct communication. This is the primary differentiator from traditional sub-agent architectures. - With sub-agents, coordination is not possible. Sub-agents report their results back to a main agent but cannot communicate directly with each other. - With Agent Teams, agents share a task list and communicate autonomously during execution. - "Teammate one can send a message to teammate three. Teammate two can send a message back to the main agent as it is working. They can complete tasks and coordinate with each other." 
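The topology difference described above (sub-agents reporting only back to a main agent, versus teammates messaging each other directly while sharing a task list) can be illustrated with a toy sketch. This is purely illustrative plain Python, not Claude Code's actual mechanism, and all names are hypothetical:

```python
# Toy illustration of the two coordination topologies.
# Sub-agents: results flow only back to the main agent.
# Agent teams: teammates can message each other directly.

class SubAgent:
    def __init__(self, name):
        self.name = name

    def run(self, task):
        # A sub-agent can only report back to whoever spawned it.
        return f"{self.name} finished: {task}"

class Teammate:
    def __init__(self, name, inboxes):
        self.name = name
        self.inboxes = inboxes      # shared mailbox dict: name -> list of messages
        self.inboxes[name] = []

    def send(self, to, message):
        # Direct peer-to-peer message, which sub-agents cannot do.
        self.inboxes[to].append(f"{self.name}: {message}")

# Sub-agent pattern: the main agent fans out work and collects results.
main_results = [SubAgent(f"sub-{i}").run("build module") for i in range(2)]

# Agent-team pattern: teammates coordinate around a shared task list.
inboxes = {}
shared_tasks = ["schema", "backend", "frontend"]
t1 = Teammate("teammate-1", inboxes)
t3 = Teammate("teammate-3", inboxes)
t1.send("teammate-3", f"I took '{shared_tasks[0]}', schema contract is ready")
print(inboxes["teammate-3"])
```

The point of the sketch: in the sub-agent pattern all information funnels through `main_results`, while in the team pattern `teammate-3` learns about the schema directly from `teammate-1` without the main agent mediating.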
## Limitations of Agent Teams - **Non-deterministic behavior**: Allowing agents to control coordination introduces reduced determinism and less direct control for developers. - **Token overhead**: Communication between agents introduces significant token usage overhead, particularly when operating in parallel. - **Reliability**: "Agent teams are currently somewhat unreliable but clearly indicate the direction of future agentic development." - Expected maturity timeline: Within the next six months, building with agent teams is expected to become standard practice. - Current readiness: Not yet production-grade for all use cases. The feature remains experimental but demonstrates strong potential. ## Cost Analysis - Usage costs are lower than commonly assumed. - Example from demonstration: 16% of the monthly session limit was used to complete a full payment integration build. - 11% of the weekly limit was used across multiple development days. - With $100–$200 per month Claude plans, developers can frequently deploy agent teams without consistently reaching usage limits. - Token overhead is manageable when compared to the time saved through parallel execution. # Planning Methodology for Agentic Development ## The Priming Process - **Slash Prime Command**: An initial step-by-step process used to help a coding agent understand the existing codebase. - Purpose: Establish context at the outset so the agent understands what research is required prior to implementing new features. - Typically includes instructing the agent to search the codebase, review core files, and report findings. - Used at the beginning of each new development conversation. - Can be customized and made dynamic depending on the project. ## Question-Based Planning Strategy - **Primary Objective**: "The number one goal of planning is to reduce the number of assumptions the coding agent is making." - Two common categories of agent mistakes: - Producing incorrect or suboptimal code. 
- Deviating from the intended implementation. - Both outcomes stem from insufficient planning clarity and are ultimately the developer’s responsibility. - **Best Practice**: Require the agent to ask a minimum of 10 or more clarifying questions instead of attempting to over-specify requirements within a single prompt. - Benefits of a question-driven approach: - Surfaces assumptions that may otherwise go unnoticed. - Enables rapid execution through multiple-choice responses. - Supports both quick selections and detailed custom answers. - Claude Code includes a built-in "ask user question" tool with a multiple-choice interface. ## Plan Review and High Leverage - Structured plans are high-leverage artifacts. A single planning error can result in hundreds of lines of incorrect implementation. - Conduct a thorough review before proceeding to implementation. - A full line-by-line audit is not required; however, structural integrity and alignment must be verified. - Confirm that: - The plan follows an established structure. - Database and schema considerations are addressed. - Existing codebase conventions are respected (critical in brownfield development). - A comprehensive validation strategy is defined. ## Context Window Management - Once a structured plan is finalized, it becomes the sole context required for implementation. - The context window can be reset after planning is complete. - This separates unstructured exploration from structured execution. - Reduces token usage during implementation by providing focused and explicit instructions. # Contract-First Approach for Agent Teams ## Implementation Strategy - **Lead Agent Role**: Responsible for defining contracts between frontend, backend, and database components before deploying worker agents. - Contracts define communication expectations and data structures upfront. - Enables parallel execution without overlap or conflict. - Eliminates the need for fully sequential task completion. 
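A contract in the sense described above can be as simple as a pair of typed records that the lead agent freezes before any worker starts. A minimal sketch (field names and the token-per-tier mapping are illustrative, not the project's actual schema):

```python
from dataclasses import dataclass, asdict

# Contract the lead agent fixes up front: frontend, backend, and
# database workers all implement against these shapes in parallel.

@dataclass
class PurchaseRequest:
    user_id: str
    tier: int          # purchase tier: 1, 2, or 3

@dataclass
class PurchaseResult:
    user_id: str
    tokens_credited: int
    new_balance: int

# Illustrative tier -> token mapping agreed in the contract.
TIER_TOKENS = {1: 100, 2: 250, 3: 600}

def credit_purchase(req: PurchaseRequest, current_balance: int) -> PurchaseResult:
    """Backend worker's implementation against the agreed contract."""
    tokens = TIER_TOKENS[req.tier]
    return PurchaseResult(req.user_id, tokens, current_balance + tokens)

# The frontend worker only needs the contract, not the implementation.
result = credit_purchase(PurchaseRequest("user-42", tier=2), current_balance=10)
print(asdict(result))  # {'user_id': 'user-42', 'tokens_credited': 250, 'new_balance': 260}
```

Because both sides code against `PurchaseRequest` and `PurchaseResult` rather than against each other's code, the frontend and backend tasks can proceed concurrently and only meet again at validation time.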
## Task Organization - Identify tasks that can run in parallel versus those requiring sequential execution. - Invest time upfront defining contracts prior to spawning agents. - Worker agents implement against defined contracts autonomously. - Lead agent performs final coordination and validation. ## Parallel vs. Sequential Work - Example: Database schema must be finalized before backend implementation can proceed effectively. - Once contracts are defined, frontend, backend, and database tasks can proceed concurrently. - Recommended team size: 3–5 agents for optimal coordination, with scalability beyond that as needed. # Real-World Project: Payment Integration with ChargeB ## Project Overview - **Base Application**: Agentic chat application similar to ChatGPT or Claude. - **Feature Implemented**: Payment integration allowing users to purchase tokens for agent interaction. - **Integration Platform**: ChargeB (a Stripe-like monetization platform). - **Scope**: Required modifications across database, frontend, and backend, representing complex brownfield development. ## Technology Stack - **Frontend**: Next.js (existing codebase). - **Backend**: Next.js API routes. - **Database**: Supabase (authentication and data storage). - **Payment Platform**: ChargeB with pre-configured integration skill. - **Testing**: Vercel Agent Browser CLI for end-to-end validation. ## Database Schema Changes Required - Addition of a transactions table. - Token balance tracking per user. - Support for both single balance tracking and a full transaction ledger. - "The database schema must be updated to support transactions and token balances for users." ## User Authentication Requirements - Chat access requires authentication to associate token balances with users. - Users linked to ChargeB customer IDs via Supabase. - Each authenticated user maintains an independent token balance. 
## Token Mechanics - **Consumption Model**: One token deducted per conversation turn (user message and AI response). - **Pre-deduction**: Tokens deducted prior to sending request to the language model. - **Error Handling**: Automatic token refund if the LLM call fails. - **Display Requirement**: Token balance visible in the chat header at all times. - **Zero Balance State**: Disable message input and display a "Buy Tokens" button when balance reaches zero. ## Billing Page Features - **Navigation Path**: Standalone page at `/billing` or under a dashboard layout. - **Components**: - Token balance display card. - Three token purchase tier cards. - Transaction history list of past purchases. - **Checkout Flow**: Hosted ChargeB checkout with redirect back to a success page. - **Post-Payment Behavior**: Automatic token balance update upon successful payment. ## Token Pricing Structure - **Tier 1**: $5 for 100 tokens. - **Tier 2**: $10 for 250 tokens. - **Tier 3**: $25 for 600 tokens. - Pricing structured to incentivize larger purchases. - New user registration grants 10 complimentary tokens. ## ChargeB Integration Details - **Webhook Endpoint**: POST requests to `/api/webhooks/chargeb`. - **Local Development**: Ngrok tunnel used to expose HTTPS endpoint. - **Webhook Purpose**: Process payment confirmations and credit tokens. - **Environment Configuration**: Variables pre-configured in `.env` file. - **Integration Skill**: ChargeB provides a skill to ensure accurate SDK implementation and prevent hallucinated usage. - "The integration skill serves as authoritative guidance for correct ChargeB SDK usage." ## Pre-Implementation Setup - ChargeB account established. - Webhook endpoint configured in ChargeB dashboard. - Ngrok tunnel active. - Environment variables set. - Test payment environment available. # End-to-End Testing with Vercel Agent Browser CLI ## Testing Philosophy - The agent autonomously executes user journeys as a real user would. 
- Replaces traditional manual validation. - Launches a browser and interacts with the application directly. - More advanced than Playwright or Puppeteer MCP server setups. ## Benefits of Autonomous Testing - Identifies real-world usage issues. - Enables self-validation before presenting results to developers. - Reduces manual QA workload. - Ensures most basic issues are resolved prior to review. - "During end-to-end validation, the system identifies issues and resolves them autonomously." ## User Journeys Validated - User registration and authentication. - Navigation to billing page. - Purchasing token packages. - Verification of token balance updates. - Chat interaction with token consumption validation. - Transaction history display. - Zero-token state handling. - Webhook processing and balance updates. ## Testing Best Practices - Define validation scenarios during planning. - Include detailed end-to-end scenarios in structured plan. - Use headed mode for observation when necessary. - Use headless mode for faster execution. - Allow iterative correction of discovered issues. ## Common Issues Identified During Testing - Browser state management inconsistencies. - Timing discrepancies in balance updates. - Form submission errors. - Webhook processing delays. - Authentication state persistence issues. - UI visibility and rendering problems. # Best Practices and Insights ## Brownfield Development - Brownfield development refers to adding features to existing systems. - Strict adherence to established codebase conventions is required to maintain maintainability. - "It is essential to follow existing conventions to prevent architectural degradation." - Review existing implementation patterns prior to planning new features. - Align new features with established architectural structures. ## Model-Specific Prompt Adjustments - Different Claude versions interpret prompts differently. - Opus 4.6 may require explicit instructions that Opus 4.5 handled implicitly. 
- Prompts and commands should be reviewed and adjusted when new models are released. - Example: Opus 4.6 may omit end-to-end testing unless explicitly instructed. - "When a new model is released, prompt and command adjustments may be necessary." ## Platform Skills and Integrations - SaaS platforms are increasingly providing integration skills tailored for coding agents. - Skills supply up-to-date documentation independent of training data. - Reduces hallucinations and incorrect SDK usage. - Industry trend suggests all major platforms will provide coding agent skills. - These skills are critical for accurate first-attempt implementation. ## Agent Team Workflow Summary - Begin with exploratory planning. - Transition to question-driven clarification. - Formalize a structured plan with defined contracts. - Deploy multiple agents for parallel execution. - Lead agent coordinates task distribution and communication. - Worker agents communicate autonomously. - Conduct end-to-end validation. - Deliver a refined and validated implementation to the developer. # Technical Configuration Details ## Claude Code Settings - Experimental agent teams enabled via `settings.local.json`. - Configuration: `"cloud code experimental agent team": 1`. - Optional use of `dangerously-skip-permissions` flag for accelerated development. - Execution command: `build with agent team [path] [number_of_agents]`. ## Development Environment - WSL recommended for multi-terminal agent coordination. - Tmux for terminal multiplexing. - Ngrok for webhook tunneling. - Hot reload enabled. - Vercel Agent Browser CLI for automated browser validation. ## Skills Utilized - **ChargeB Skill**: Official integration documentation and best practices. - **Vercel Agent Browser CLI**: Automated browser testing. - **Custom Slash Commands**: Prime, plan, and build workflows. # Project Outcomes ## Implementation Results - Complete payment integration implemented and validated. - Three token purchase tiers available. 
- Token balances tracked and deducted per conversation. - ChargeB webhook processes payments successfully. - Transaction history displayed accurately. - Zero-token state handled correctly. - All user journeys validated via end-to-end testing. - Final token balance after testing: 568 tokens. ## Performance Metrics - Parallel agent implementation time: approximately 5–10 minutes (excluding end-to-end testing). - Estimated single-agent implementation time: approximately 30 minutes. - Multiple issues identified and resolved autonomously during testing. - Application achieved production-ready state after validation. - "End-to-end testing identified issues introduced during the agent team build and resolved them effectively." ## Key Learnings - Agent teams substantially accelerate multi-layer implementations. - Rigorous planning and validation are more critical than flawless initial implementation. - Autonomous testing surpasses manual review in identifying defects. - Token overhead remains cost-effective relative to productivity gains. - The trajectory of agentic development is moving toward autonomous multi-agent coordination. --- ## Inside the AI Brain: How Neural Networks, Tokens, and Transformers Work URL: https://pythoughts.com/@moelkholy/posts/inside-the-ai-brain-how-neural-networks-tokens-and-transformers-work Author: Mohamed Elkholy (@moelkholy) Published: 2026-02-25 ## Abstract The shift from rule-based software to modern artificial intelligence represents one of the most consequential paradigm changes in computer science. Large Language Models (LLMs) sit at the center of that transition, showcasing striking advances in natural language understanding, generative reasoning, and—more recently—multimodal synthesis. This post offers a structured, technically grounded tour of the mathematics and architecture behind today’s LLMs. 
We’ll examine tokenization, embedding spaces, neural network fundamentals, the Transformer and its attention mechanism, scaling laws, alignment methods, and hardware-aware inference optimizations. The goal is to bridge conceptual intuition with the concrete mechanics that make these systems work—at a level suitable for both professional and academic readers. --- ## 1. Introduction Contemporary AI systems are no longer built primarily from handcrafted rules or symbolic decision trees. Instead, they learn statistical structure directly from large datasets using highly parameterized neural networks. Large Language Models are the clearest expression of this trend: they combine representation learning, probabilistic modeling, and massively parallel computation to produce fluent, context-aware text generation. To understand LLMs well, it helps to move past metaphors like “digital brains” and focus on the operations that actually run—matrix multiplications, attention-weighted mixtures, and gradient-based optimization. What follows is a practical walkthrough of the major components, beginning with how text becomes numbers and ending with the techniques that make inference fast enough to deploy. --- ## 2. Tokenization and Embeddings: Where Language Becomes Math Before any neural computation happens, raw text must be converted into numerical form. Neural networks don’t process words—they process vectors. That conversion typically happens in two steps: tokenization and embedding. ### 2.1 Subword Tokenization Algorithms Early NLP systems often tokenized at the word level or character level. Word-level tokenization led to exploding vocabularies and frequent “out-of-vocabulary” failures. Character-level tokenization avoided OOV issues but produced long sequences and weakened semantic cohesion. Modern LLMs primarily use **subword tokenization**, which offers a practical balance between vocabulary size and expressive coverage. 
Common methods include: - **Byte Pair Encoding (BPE):** Originally developed for compression, BPE repeatedly merges the most frequent adjacent character pairs until reaching a target vocabulary size. It efficiently encodes common words while still representing rare words by decomposing them. This approach appears in families like GPT-style models and LLaMA. - **WordPiece:** Used in BERT-style encoder models, WordPiece selects merges using likelihood-based criteria rather than raw frequency, and often marks continuation fragments (e.g., `##able`) to preserve morphological structure. - **SentencePiece:** Built for multilingual and language-agnostic processing, SentencePiece treats text as a raw character stream and encodes whitespace explicitly. This makes it robust across languages and scripts; it is common in models such as T5 and ALBERT. ### 2.2 Tokenization Artifacts and Failure Modes Tokenizers reflect the statistics of their training corpora, which means they can introduce quirks. Some rare but recurring substrings may become single tokens with poorly trained embeddings, occasionally producing unstable behavior. Tokenization also influences numerical reasoning. Because BPE-like approaches treat digits as text fragments, values like `9.11` and `9.9` can be split inconsistently, which can contribute to unreliable magnitude comparisons. Typical mitigations include digit-aware tokenization schemes and structured formatting conventions. ### 2.3 From Static to Contextual Embeddings After tokenization, each token ID maps to a dense vector via an embedding matrix. Earlier approaches like Word2Vec or GloVe produced **static embeddings**: each word had one vector no matter the context. This struggled with polysemy—“bank” in finance versus “bank” of a river. Transformers build **contextual embeddings**. Tokens start as lookup vectors, but each self-attention layer refines them based on neighboring tokens, producing representations that shift meaning depending on context.
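The merge loop at the heart of BPE can be sketched in a few lines of Python. This is a toy illustration on a three-word corpus, not a production tokenizer — real implementations work byte-level, handle special tokens, and define explicit tie-breaking rules:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words as character tuples, with word frequencies.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
vocab_merges = []
for _ in range(4):  # learn 4 merge rules
    pair = most_frequent_pair(corpus)
    vocab_merges.append(pair)
    corpus = merge_pair(corpus, pair)
print(vocab_merges)  # [('w', 'e'), ('we', 'r'), ('l', 'o'), ('n', 'e')]
```

At encoding time, a real tokenizer stores this learned merge list and replays it greedily on new text, which is how rare words decompose into known subwords.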
--- ## 3. Neural Networks: The Core Computation At a foundational level, neural networks are parameterized function approximators. A standard layer applies an affine transformation followed by a nonlinearity: $$v_j = \sum_{i=1}^{d} w_{ji} x_i + b_j$$ where $w_{ji}$ are learned weights, $b_j$ is a bias term, and $x_i$ are input activations. The layer output is then $a_j = \phi(v_j)$ for an elementwise activation $\phi$. Activation functions such as **ReLU** or **GELU** introduce nonlinearity—without them, stacking layers collapses into a single linear map. Training is driven by gradient-based optimization: the model produces outputs, a loss function measures error, and gradients propagate backward through the network to update weights. Over many iterations, the model becomes a strong statistical estimator of patterns present in the data. --- ## 4. The Transformer Architecture ### 4.1 From Recurrence to Parallelism Older sequence models (RNNs, LSTMs) processed tokens sequentially, limiting parallelization and struggling with long-range dependencies due to vanishing gradients. Transformers replaced recurrence with **full-sequence parallel processing**, enabling efficient training at scale across GPUs and distributed clusters. ### 4.2 Scaled Dot-Product Self-Attention Self-attention is the key innovation. Each token generates three learned projections: - **Query (Q)** - **Key (K)** - **Value (V)** Attention is computed as: $$\text{Attention}(Q,K,V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ The dot product $QK^T$ measures similarity, the $\sqrt{d_k}$ term stabilizes gradients, and softmax produces normalized weights. Those weights blend value vectors into a context-informed representation. ### 4.3 Multi-Head Attention Instead of computing one attention pattern, Transformers compute many in parallel. Each head can learn different relationships (syntax, coreference, sentiment, discourse structure). The results are concatenated and projected, expanding representational capacity.
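The scaled dot-product formula above maps almost line-for-line onto code. Below is a minimal pure-Python sketch for a single head, with no masking, batching, or learned projection matrices — the random Q, K, V simply stand in for projected token vectors:

```python
import math
import random

def softmax(row):
    m = max(row)                             # subtract max for numerical stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed with plain lists."""
    d_k = len(Q[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, V)                   # context-blended value vectors

random.seed(0)
seq_len, d_k = 4, 8
Q, K, V = ([[random.gauss(0, 1) for _ in range(d_k)] for _ in range(seq_len)]
           for _ in range(3))
out = attention(Q, K, V)
print(len(out), len(out[0]))  # 4 8
```

Multi-head attention repeats this computation with several independent projections of the same inputs, then concatenates the per-head outputs before a final linear projection.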
### 4.4 Positional Encoding Because attention alone is permutation-invariant, the model needs an explicit notion of token order. Common strategies include: - **Absolute positional embeddings (APE)**: add position vectors (often sinusoidal). - **RoPE (Rotary Position Embeddings)**: rotate queries and keys to encode relative position geometrically. - **ALiBi**: add distance-based biases directly into attention scores. RoPE and ALiBi are especially known for improving behavior on longer contexts. --- ## 5. Pretraining Objectives and Scaling Laws ### 5.1 Masked vs. Causal Language Modeling Two widely used objectives dominate: - **Masked Language Modeling (MLM)**: predict masked tokens using bidirectional context (BERT-style). - **Causal Language Modeling (CLM)**: predict the next token autoregressively under a causal mask (GPT, LLaMA). CLM naturally produces fluent generation because it directly trains the model to continue text. ### 5.2 Compute-Optimal Training and Chinchilla-Style Scaling Scaling isn’t just “bigger is better.” Performance depends on balancing parameter count ($N$), dataset size ($D$), and compute. Chinchilla-style findings suggest that compute-optimal training often requires **more data per parameter** than earlier scaling strategies assumed—roughly on the order of tens of tokens per parameter in common regimes. A representative loss approximation is: $$L(N,D) = 406.4\,N^{-0.34} + 410.7\,D^{-0.28} + 1.69$$ The constant term reflects irreducible entropy in the data distribution. --- ## 6. Alignment After Pretraining A pretrained model is fundamentally a next-token predictor—not automatically a helpful or safe assistant. Alignment adapts it for instruction following, safety constraints, and user-facing usefulness. ### 6.1 Supervised Fine-Tuning (SFT) SFT trains on curated instruction–response pairs to teach conversational structure and task completion patterns.
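Concretely, SFT corpora are usually stored as structured instruction–response records, often serialized as one JSON object per line (JSONL). The field names below are illustrative assumptions, not a standard; real datasets vary in schema and typically add system prompts and multi-turn structure:

```python
import json

# Hypothetical SFT records: each pair teaches one instruction-following turn.
sft_examples = [
    {
        "instruction": "Summarize the difference between MLM and CLM in one sentence.",
        "response": "MLM predicts masked tokens from bidirectional context, while "
                    "CLM predicts the next token from left context only.",
    },
    {
        "instruction": "Explain why attention scores are divided by sqrt(d_k).",
        "response": "Scaling by sqrt(d_k) keeps dot-product magnitudes stable, "
                    "which prevents softmax saturation and stabilizes gradients.",
    },
]

# Serialize as JSON Lines: one record per line, easy to stream during training.
jsonl = "\n".join(json.dumps(ex) for ex in sft_examples)
print(len(jsonl.splitlines()))  # 2
```

During fine-tuning, each record is rendered through a chat template into a single token sequence, and the loss is typically computed only on the response tokens.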
### 6.2 RLHF Reinforcement Learning from Human Feedback typically trains a reward model from human preferences, then optimizes the policy using methods like PPO to maximize predicted reward. RLHF can be powerful, but it adds complexity and can be sensitive to training instability. ### 6.3 Direct Preference Optimization (DPO) DPO reframes alignment as a direct optimization problem using preference pairs, avoiding a separate reward model and simplifying the pipeline while preserving strong outcomes. --- ## 7. Inference: Prefill, Decode, and KV Caching Generation generally happens in two phases: - **Prefill**: the model processes the full prompt in parallel (often compute-bound). - **Decode**: the model generates tokens one at a time (often memory-bandwidth-bound). A major optimization is **KV caching**, which stores previous keys and values so the model doesn’t recompute attention history at every step. This can significantly reduce compute during decoding, but it increases memory usage and creates VRAM pressure—especially for long contexts and large batch sizes. --- ## 8. Escaping the Quadratic Cost of Attention Standard self-attention scales as $O(N^2)$ with sequence length, which becomes expensive for long contexts. ### 8.1 FlashAttention FlashAttention is an IO-aware exact attention method that tiles operations into fast on-chip memory, reducing memory traffic and accelerating attention without approximations. ### 8.2 Sliding Window Attention Windowed attention limits each token’s attention scope to a local neighborhood, dramatically lowering compute while sacrificing full global access. ### 8.3 State Space Models (Mamba) Structured State Space Models maintain fixed-size hidden states, enabling sub-quadratic scaling and constant-memory inference in many settings. The tradeoff is that compressing long histories can reduce exact recall. Hybrid designs increasingly combine attention layers with SSM layers to balance precision and efficiency. --- ## 9.
Conclusion Large Language Models aren’t mystical—they’re carefully engineered mathematical systems. Their capabilities emerge from a small set of core ideas executed at massive scale: - efficient subword tokenization, - contextual embedding refinement, - parallel self-attention via Transformers, - compute-aware scaling strategies, - post-training alignment techniques, - and hardware-conscious inference optimizations. Looking forward, the momentum in AI research continues toward hybrid architectures, better data curation, longer and more reliable context handling, and alignment methods that are both safer and more stable. The most enduring progress will come from treating theory and infrastructure as a single design problem—mathematical rigor paired with scalable systems. **Keywords:** Large Language Models, Transformer Architecture, Self-Attention, Tokenization, Scaling Laws, RLHF, FlashAttention, State Space Models