AI can now write code that looks beautiful.

It uses clean variable names.
It adds comments.
It formats everything nicely.
Sometimes it even writes tests.

And that is exactly the problem.

Because polished code is not the same thing as production-ready code.

A large language model can generate a function that looks like it belongs in your codebase, passes the obvious test, and still quietly breaks under real traffic, leaks data, ignores your architecture, or turns a simple endpoint into a future incident.

So the real question is no longer:

“Can AI write code?”

It can.

The better question is:

“How do we know whether AI-written code is actually good?”

Let’s investigate.

The trap: code that looks right

AI-generated code often fails in a very annoying way: it looks correct at first glance.

That makes it different from messy beginner code. Bad human code often looks suspicious immediately. AI code, however, can be confidently wrong while wearing a suit.

It may:

solve the prompt but not the real problem
handle only the happy path
assume tiny data
skip authentication
ignore your existing utilities
create a new pattern when your repo already has one
pass simple tests but fail realistic ones

This is why reviewing AI code requires more than asking, “Does it run?”

A better review asks:

“What happens when this code meets reality?”

Reality includes weird inputs, impatient users, large databases, expired tokens, flaky services, malicious requests, old architecture, naming conventions, logging standards, and the one edge case nobody mentioned in the prompt.

That is where AI code earns trust — or loses it.

The five-question test for AI-generated code

A practical way to review AI-written code is to ask five questions:

Did it solve the real problem?
What happens when things go wrong?
Will it survive real data and real traffic?
Is it secure?
Does it fit your actual codebase?

Think of these as five doors the code must pass through before it gets anywhere near production.

1. Did it solve the real problem?

AI is very good at satisfying the literal prompt.

That sounds great until the prompt is incomplete, vague, or accidentally misleading.

You ask:

“Create a function to list users.”

The AI gives you this:

async function listUsers() {
  return User.find({});
}

Looks simple. Maybe even elegant.

But in production, this is terrifying.

What if you have 3 million users?
What if user records include private fields?
What if the frontend only needs active users?
What if this endpoint gets hit 500 times per minute?

The code solved the sentence.
It did not solve the system problem.

A better version looks like this:

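async function listUsers({ page = 1, pageSize = 50, status } = {}) {
  // Clamp paging so callers cannot request negative or unbounded ranges.
  const limit = Math.min(Math.max(pageSize, 1), 100);
  const safePage = Math.max(page, 1);
  const query = status ? { status } : {};

  return User.find(query)
    .select("_id name email status") // return only the fields the caller needs
    .sort({ _id: 1 }) // stable order so pages do not shuffle
    .skip((safePage - 1) * limit)
    .limit(limit);
}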

Now we have boundaries. Pagination. Field selection. Filtering. Predictable behavior.

The reviewer question is:

“Did the model answer the prompt, or did it solve the production requirement?”

Those are not always the same thing.
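
The same boundaries can be enforced before any query is built. Here is a minimal Python sketch, assuming a hypothetical normalize_paging helper and a hard cap of 100 rows. The names and the cap are illustrative, not taken from any framework.

```python
from dataclasses import dataclass

MAX_PAGE_SIZE = 100  # assumed cap; pick one that matches your own limits


@dataclass(frozen=True)
class Paging:
    offset: int
    limit: int


def normalize_paging(page=1, page_size=50):
    """Clamp caller-supplied paging values into a safe range.

    Hypothetical helper: bad or hostile values fall back to sane
    bounds instead of reaching the database.
    """
    try:
        page = int(page)
        page_size = int(page_size)
    except (TypeError, ValueError):
        # Unparseable input: fall back to the defaults.
        page, page_size = 1, 50

    page = max(page, 1)
    limit = min(max(page_size, 1), MAX_PAGE_SIZE)
    return Paging(offset=(page - 1) * limit, limit=limit)
```

A request for page 3 with a page size of 500 comes back as offset 200 and limit 100, so the caller still gets an answer, just a bounded one.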

2. What happens when things go wrong?

AI loves the happy path.

The happy path is where the user exists, the API responds, the database is healthy, the object has the expected shape, and nothing surprising happens.

In other words, the happy path is the fantasy world where bugs do not live.

Here is a classic AI-generated pattern:

def get_profile(user_id):
    user = repo.get_user(user_id)
    return user["profile"]

This works beautifully until user_id is missing, the user does not exist, or the profile field is absent.

A more production-ready version is boring in the best possible way:

def get_profile(user_id):
    if not user_id:
        raise ValueError("user_id is required")

    user = repo.get_user(user_id)
    if user is None:
        raise LookupError(f"user not found: {user_id}")

    return user.get("profile", {})

Good code is not only code that works when everything goes right.

Good code knows what to do when things go wrong.

Review AI code by asking:

“What happens with empty input, null input, bad input, missing records, timeouts, retries, and partial failures?”

If the answer is “I don’t know,” the code is not done.
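
One way to replace "I don't know" with an answer is to run the function against each failure case on purpose. This is a sketch, assuming a hypothetical in-memory FakeRepo and a get_profile variant that takes the repo as a parameter so it can be swapped out in a test.

```python
class FakeRepo:
    """Hypothetical in-memory stand-in for the real user repository."""

    def __init__(self, users):
        self._users = users

    def get_user(self, user_id):
        return self._users.get(user_id)


def get_profile(repo, user_id):
    if not user_id:
        raise ValueError("user_id is required")

    user = repo.get_user(user_id)
    if user is None:
        raise LookupError(f"user not found: {user_id}")

    return user.get("profile", {})


def check_failure_modes():
    """Exercise each unhappy path and record what actually happens."""
    repo = FakeRepo({
        "u1": {"profile": {"name": "Ada"}},
        "u2": {},  # record exists, profile field absent
    })
    cases = [("empty", ""), ("none", None), ("missing", "nope"),
             ("no_profile", "u2"), ("ok", "u1")]
    outcomes = {}
    for label, user_id in cases:
        try:
            outcomes[label] = get_profile(repo, user_id)
        except (ValueError, LookupError) as exc:
            outcomes[label] = type(exc).__name__
    return outcomes
```

Each entry in the returned dictionary is either a profile or the name of the exception raised, which makes the failure behavior something a reviewer can read instead of guess.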

3. Will it survive real data and real traffic?

AI often writes code as if the database has 12 rows and only one person is using the app.

That is cute.

Production is not cute.

A common performance issue is the N+1 query problem:

orders = session.query(Order).filter_by(user_id=user_id).all()

for order in orders:
    order.product_name = session.get(Product, order.product_id).name

This may work fine for 5 orders.

But with 500 orders, that is up to 501 queries (one for the orders, then one per product lookup), and it can quietly hammer your database.

A better version loads related data intentionally and limits the result:

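from sqlalchemy.orm import selectinload

# Assumes Order.product is a mapped relationship to Product.
# selectinload fetches the related products in one extra query
# instead of one query per order.
orders = (
    session.query(Order)
    .options(selectinload(Order.product))
    .filter_by(user_id=user_id)
    .order_by(Order.id.desc())
    .limit(100)
    .all()
)

for order in orders:
    order.product_name = order.product.name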

The important question is not:

“Does this code work on my laptop?”

The important question is:

“What happens when the data gets large?”

Check for:

unbounded queries
missing pagination
full table scans
N+1 queries
expensive loops
memory-heavy operations
synchronous work that should be async
no caching where caching clearly matters
no timeout around external calls

AI code often passes correctness checks while quietly failing the scale test.
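
The scale test can be made visible without a real database. Here is a small sketch, assuming a hypothetical CountingProductStore that counts simulated round trips. It compares the one-lookup-per-row shape with a batched lookup.

```python
class CountingProductStore:
    """Hypothetical in-memory store that counts simulated round trips."""

    def __init__(self, products):
        self._products = products
        self.queries = 0

    def get(self, product_id):
        self.queries += 1  # one round trip per call
        return self._products[product_id]

    def get_many(self, product_ids):
        self.queries += 1  # one round trip for the whole batch
        return {pid: self._products[pid] for pid in set(product_ids)}


def names_one_by_one(store, orders):
    # N+1 shape: one lookup per order.
    return [store.get(o["product_id"]) for o in orders]


def names_batched(store, orders):
    # Batched shape: collect the ids first, then do a single lookup.
    by_id = store.get_many(o["product_id"] for o in orders)
    return [by_id[o["product_id"]] for o in orders]
```

With 500 orders, the first function makes 500 lookups and the second makes one. Both return the same list, which is exactly why a correctness test alone will not catch the difference.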

4. Is it secure?

Security is where “looks fine” becomes dangerous.

An AI model may generate code that works functionally but creates a vulnerability.

Example:

app.get("/files/:name", (req, res) => {
  const filePath = `./uploads/${req.params.name}`;
  res.sendFile(path.resolve(filePath));
});

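This is risky because user input is used directly to build a file path: a name containing ../ segments can resolve to a file outside the uploads directory. The route also has no authentication.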

A safer version is more defensive:

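const path = require("path");

// requireAuth and UPLOAD_DIR are assumed to be defined elsewhere in the app.
// UPLOAD_DIR must be an absolute path, because res.sendFile rejects
// relative paths unless the root option is set.
app.get("/files/:name", requireAuth, (req, res, next) => {
  // basename drops any directory segments, including "../".
  const safeName = path.basename(req.params.name);
  const filePath = path.join(UPLOAD_DIR, safeName);

  res.sendFile(filePath, err => {
    if (err) next(err);
  });
});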

Now we have authentication, safer path handling, and error forwarding.

When reviewing AI code, security questions should be automatic:

“Can a user abuse this?”

Look for:

missing authentication
missing authorization
unsafe input handling
SQL injection risks
path traversal risks
secrets in logs
sensitive fields returned in responses
weak validation
risky dependencies
insecure defaults
overly broad permissions

AI does not understand your threat model unless you force it to.

And even then, you still need to check.
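
The path traversal check translates to other stacks too. Here is a Python sketch of the same idea, assuming a hypothetical resolve_upload helper and an illustrative upload directory. It resolves the candidate path and rejects anything that lands outside the base directory.

```python
from pathlib import Path

UPLOAD_DIR = Path("/srv/app/uploads")  # assumed location, for illustration


def resolve_upload(name, base=UPLOAD_DIR):
    """Return the resolved path for `name`, or raise if it escapes `base`."""
    if not name or "\x00" in name:
        raise ValueError("invalid file name")

    base = base.resolve()
    candidate = (base / name).resolve()

    # relative_to raises ValueError when candidate is outside base.
    try:
        candidate.relative_to(base)
    except ValueError:
        raise PermissionError(f"path escapes upload dir: {name!r}") from None

    if candidate == base:
        raise ValueError("file name resolves to the upload dir itself")

    return candidate
```

Checking the resolved path, rather than searching the raw string for "..", also covers absolute names such as /etc/passwd, which would otherwise replace the base directory entirely when joined.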

5. Does it fit your actual codebase?

This is the most human part of code review.

AI can write technically valid code that does not belong in your repository.

For example:

import requests

def charge_invoice(invoice):
    r = requests.post("https://billing.internal/pay", json=invoice)

    if r.status_code != 200:
        raise Exception("billing failed")

    return r.json()

This might work.

But what if your codebase already has:

a shared billing client
retry logic
timeout defaults
structured logging
typed errors
tracing
service-specific response handling

Then this AI-generated function bypasses the architecture.

A better version uses the existing system:

from app.clients.billing import billing_client
from app.errors import UpstreamServiceError

def charge_invoice(invoice):
    result = billing_client.charge(invoice)

    if not result.ok:
        raise UpstreamServiceError("billing", result.status, result.body)

    return result.data

The reviewer question is:

“Does this code feel native to the repo?”

Good AI-generated code should use your existing patterns, not invent a second codebase inside your codebase.

Check whether it follows your project’s:

naming conventions
error handling style
logging format
API patterns
shared utilities
dependency rules
test structure
folder organization
security model
observability standards

The code may be correct in isolation and still be wrong for your system.
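
Some fit rules can be checked mechanically. As a sketch, assume a repo rule that HTTP calls must go through shared clients, so importing requests directly is not allowed. The rule and the find_banned_imports helper are hypothetical. The check itself uses only the standard ast module.

```python
import ast

BANNED_IMPORTS = {"requests"}  # assumed rule: HTTP goes through shared clients


def find_banned_imports(source, banned=BANNED_IMPORTS):
    """Return sorted (line, module) pairs for imports the rule disallows."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports; they refer to the repo's own packages.
            modules = [node.module] if node.level == 0 and node.module else []
        else:
            continue
        for module in modules:
            if module.split(".")[0] in banned:
                hits.append((node.lineno, module))
    return sorted(hits)
```

A check like this only covers the rules you can write down. Whether the code feels native to the repo is still a reviewer's call.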

Why benchmarks keep making AI code look less magical

The more realistic the benchmark, the weaker AI coding results tend to look.

Simple coding benchmarks often test small, isolated tasks. But real software engineering is not isolated. It involves context, dependencies, tradeoffs, legacy code, tests, performance, security, and weird requirements written by humans in a hurry.

That is why newer evaluations tend to reveal the same pattern:

AI looks strongest on clean, short, self-contained tasks.
It struggles more as the task becomes long, messy, repo-specific, or architecture-heavy.

This does not mean AI coding tools are bad.

It means they are draft accelerators, not responsibility machines.

They can help you move faster, but they do not remove the need for engineering judgment.

In fact, the better AI gets, the more important review becomes — because the mistakes become harder to spot.

A practical workflow for reviewing AI code

To make AI code safe at scale, do not rely on vibes.

Create gates.

A good workflow looks like this:

flowchart LR
    A[Capture prompt and assumptions] --> B[Run unit and edge tests]
    B --> C[Check performance and data scale]
    C --> D[Run security and dependency scans]
    D --> E[Review codebase fit]
    E --> F[Record reviewer decision]
    F --> G{All gates pass?}
    G -- Yes --> H[Merge]
    G -- No --> I[Revise code or prompt]
    I --> B

Each gate catches a different class of problem.

Gate	What to check	Common AI failure
Requirement fit	Does it solve the real task?	Solves the prompt too literally
Functional behavior	Do tests cover normal and weird cases?	Happy path only
Scale	Are queries, loops, and memory bounded?	Assumes tiny data
Security	Are auth, input, secrets, and dependencies safe?	Expands attack surface
Repo fit	Does it use existing patterns?	Reinvents architecture
Audit trail	Can we reproduce the decision?	No prompt, no context, no notes

This turns code review from “looks okay to me” into a repeatable process.
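
The gates can be expressed as plain functions that each return a pass or fail. This is a minimal sketch, assuming hypothetical gate names and a dictionary of facts already collected about the change. Missing facts count as failures.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GateResult:
    gate: str
    passed: bool
    note: str = ""


def run_gates(change: Dict, gates: List[Callable[[Dict], GateResult]]):
    """Run every gate (no short-circuit) so the reviewer sees all failures."""
    results = [gate(change) for gate in gates]
    return all(r.passed for r in results), results


# Hypothetical gates: each inspects facts already collected about the change.
def requirement_fit(change):
    return GateResult("requirement_fit", bool(change.get("prompt_attached")),
                      "original prompt must be attached")


def scale(change):
    # Default of 1 means "not checked" counts as a failure.
    return GateResult("scale", change.get("unbounded_queries", 1) == 0,
                      "queries must be checked and bounded")


def security(change):
    return GateResult("security", change.get("scan_findings", 1) == 0,
                      "scans must run and come back clean")
```

Running all gates rather than stopping at the first failure gives the author one complete list to fix, which fits the revise-and-retry loop in the flowchart.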

A simple AI-code pull request checklist

Use this before merging AI-generated code:

The original prompt is attached
Assumptions are written down
The code was actually executed
Unit tests pass
Edge cases are tested
Empty, null, wrong-type, and large inputs are tested
Database access is bounded
Pagination or limits exist where needed
N+1 query risk was checked
Sensitive fields are not returned or logged
Authentication and authorization were reviewed
Dependencies were scanned
The code uses shared utilities and existing patterns
Reviewer notes explain the final decision

This may look like extra work, but it is cheaper than debugging a confident AI mistake in production.

Keep the prompt, not just the code

One underrated best practice: save the prompt that produced the code.

Why?

Because the prompt is part of the artifact.

If the model misunderstood the task, skipped a constraint, or invented an assumption, you need to know where that happened.

A lightweight template is enough:

task_id: ENG-1234
repo: payments-service
commit_sha: abcdef123456
model: <model name + version>

user_prompt: |
  <full task prompt>

retrieved_context:
  - path: src/clients/billing.py
    reason: existing billing client and error conventions
  - path: src/errors.py
    reason: shared typed exceptions

assumptions:
  - peak_qps: 1200
  - max_page_size: 100
  - auth_required: true

tests_run:
  - unit
  - edge_case
  - security_scan
  - perf_smoke

review_decision: needs_changes

notes: |
  Missing pagination; returns full user object; bypasses shared client.

This is not bureaucracy.

This is memory.

When a bug appears later, you can answer:

“What did we ask the AI to do, what did it assume, and why did we accept the result?”

That is how teams get better over time.
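
A record like this is only useful if it is complete. Here is a small sketch of a completeness check, assuming the record has already been loaded into a dictionary. The validate_record helper, the choice of required fields, and the set of allowed decisions are illustrative assumptions.

```python
REQUIRED_FIELDS = (
    "task_id", "repo", "commit_sha", "model", "user_prompt",
    "assumptions", "tests_run", "review_decision",
)
ALLOWED_DECISIONS = {"approved", "needs_changes", "rejected"}  # assumed set


def validate_record(record):
    """Return a list of problems; an empty list means the record is usable."""
    problems = [f"missing: {name}" for name in REQUIRED_FIELDS
                if not record.get(name)]
    decision = record.get("review_decision")
    if decision and decision not in ALLOWED_DECISIONS:
        problems.append(f"unknown review_decision: {decision}")
    return problems
```

Empty values count as missing, so a record with a blank prompt fails the check just like a record with no prompt at all.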

Tools that help, but do not replace review

No tool can magically certify that AI code is good.

But the right stack can catch a lot.

Useful categories include:

Tool type	Examples	What it helps catch
Static analysis	Semgrep, CodeQL	insecure patterns, dangerous code, policy violations
Dependency scanning	Trivy, Dependabot-style tools	vulnerable packages, secrets, image issues
Property-based testing	Hypothesis	weird edge cases humans forget
Load testing	k6	latency, throughput, regression risks
Code search	Sourcegraph-style tools	existing patterns, shared utilities, references

A strong AI-code review process uses tools as guardrails.

But tools are not judgment.

They can tell you something is suspicious.
They cannot always tell you whether the code belongs in your architecture.

That part still needs a human.

FAQ

Should better AI models get lighter review?

No.

Better models often produce more convincing code, not necessarily risk-free code.

The review should match the risk of the change, not the confidence of the model.

A small UI copy change does not need the same review as a payment flow, authentication change, database migration, or permissions system.
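
Matching review depth to risk can start from the changed file paths. This is a rough sketch, assuming hypothetical path patterns and three tier names. Real repositories will need their own patterns.

```python
from fnmatch import fnmatch

# Assumed path patterns; adjust to your own repository layout.
HIGH_RISK_PATTERNS = ("*/auth/*", "*/payments/*", "*/migrations/*",
                      "*/permissions/*")
LOW_RISK_PATTERNS = ("*.md", "*/locales/*", "*/copy/*")


def review_tier(changed_paths):
    """Pick the strictest tier triggered by any changed file."""
    if any(fnmatch(p, pat)
           for p in changed_paths for pat in HIGH_RISK_PATTERNS):
        return "high"
    # Low tier only when every changed file matches a low-risk pattern.
    if changed_paths and all(
        any(fnmatch(p, pat) for pat in LOW_RISK_PATTERNS)
        for p in changed_paths
    ):
        return "low"
    return "standard"
```

One high-risk file is enough to raise the tier for the whole change, while the low tier applies only when every file is low-risk.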

Is running tests enough?

No.

Tests are necessary, but they only check what you remembered to test.

AI-generated code also needs review for:

missing requirements
security issues
scalability problems
codebase fit
maintainability
operational behavior

Tests help. They do not replace thinking.

Should teams allow AI-generated code in production?

Yes, but not blindly.

Treat AI code like a junior developer’s first draft: useful, fast, sometimes impressive, but still requiring review.

The goal is not to ban AI.

The goal is to create a process where AI helps without quietly lowering quality.

How much context should we give the AI?

Enough to understand the task, the relevant files, and the existing patterns.

Not the entire universe.

Too little context leads to generic code.
Too much noisy context leads to confusion.

The sweet spot is targeted context: the files, conventions, examples, and constraints that matter for the task.

Final thought: AI code should earn trust

AI-generated code is not magic.

It is also not garbage.

It is a draft — sometimes a very good draft.

But production code needs more than clean syntax. It needs correct behavior, safe failure modes, realistic performance, secure design, and a natural fit inside the existing system.

So the next time AI gives you a beautiful patch, do not ask only:

“Does this look good?”

Ask:

“Can this survive production?”

That is the real test.

And if the code passes that test, then AI did not just write code.

It helped build software.
