The proving ground
for the AI era.
// A practice platform for the skill that actually matters now: judgment over AI-generated system and code design.
AI writes the code.
Judgment is the bottleneck.
The classical coding interview is being defeated in real time. Any candidate with an editor and a model can pass the algorithm screen, the take-home, and most live coding rounds.
What companies actually need to know is whether a candidate can read a thousand lines of model-generated code, spot the system design choice that won't survive scale, find the load-bearing assumption that's wrong, and write a spec the next model won't get wrong again. That's what we test.
Four problem types. One rating.
Find what the model missed.
A subtle bug is hiding in plausible AI-generated code. Spot it, fix it, ship it. Tests will tell you if you're right.
// RAG retriever sorts ascending instead of descending.
Write specs that hold up.
You're given a fuzzy requirement. Author a spec precise enough that three different models all produce a passing implementation.
// Specify a rate limiter that survives clock drift.
Judge the architecture the model shipped.
Read a system design generated by a model. Identify where it breaks at scale, what it got wrong about consistency or cost, and rework the boundaries.
// Webhook delivery system that silently drops on retry storms.
Build a small agent against a behavioral spec.
Wire up tool calls, retries, and stop conditions to satisfy a behavioral contract. We grade behavior, not lines of code.
// Agent that books a meeting through three flaky APIs.
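As a rough illustration (not the actual grading harness), an agent loop that wires together tool calls, retries, and a stop condition might be sketched like this; `call_tool`, the retry budget, and the backoff schedule are hypothetical names for this sketch:

```python
import time

def run_agent(call_tool, steps, max_retries=3, backoff=0.5):
    """Execute a sequence of tool calls, retrying flaky ones.

    call_tool(step) may raise; each step gets up to max_retries
    attempts before the whole run aborts (the stop condition).
    """
    results = []
    for step in steps:
        for attempt in range(max_retries):
            try:
                results.append(call_tool(step))
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise RuntimeError(f"step {step!r} failed after {max_retries} tries")
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
    return results
```

A behavioral grader would assert on the sequence of tool calls and the final state, not on how the loop is written.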
One bug. Four steps.
// Read plausible AI-generated code, find the failing assumption, patch it, and see how the top 1% diagnosed it.
def retrieve_context(query: str, k: int = 5):
    embedding = embed(query)
    results = vector_store.query(embedding, top_k=k)
    docs = [r.document for r in results]
    reranked = sorted(docs, key=lambda d: d.score)
    return reranked[:k]

# downstream:
context = retrieve_context(user_question)
answer = llm.generate(prompt, context)
// 1,247 attempted · 31% caught it · median 4m 12s
Try a sample →
- 01 Read
Skim the code and scenario. The bug is plausible by design.
- 02 Diagnose
Click the suspicious line. Explain the failing assumption in one sentence.
- 03 Fix & submit
Patch it. Hidden tests run against your change.
- 04 Compare
See how the top 1% diagnosed it, and how fast.
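In the sample above, the failing assumption is that `sorted` returns highest-score-first; by default it sorts ascending, so the retriever hands the model the *least* relevant documents. A minimal sketch of the patch, assuming each document carries a numeric `score` where higher means more relevant (as in the sample):

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    score: float  # higher = more relevant

def rerank(docs, k=5):
    # The bug: sorted() is ascending by default, so the least
    # relevant docs came back first. reverse=True fixes it.
    return sorted(docs, key=lambda d: d.score, reverse=True)[:k]

docs = [Doc("a", 0.2), Doc("b", 0.9), Doc("c", 0.5)]
print([d.text for d in rerank(docs, k=2)])  # → ['b', 'c']
```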
The skill stack changed. The test didn't.
AI is the first author.
Most production code and systems are drafted by a model. The valuable engineer is the one who reads, judges, and corrects the design — not the one who types fastest.
The interview signal collapsed.
LeetCode is solved. Take-homes are solved. Live coding is theater. The classical funnel cannot tell a great engineer from a confident prompter.
AI-native hiring has no standard.
There's no Codeforces rating, no Elo, no public ladder for the design judgment that now decides who ships. Wetstone is that standard.
Hiring? We replaced
your take-home.
Wetstone gives you a calibrated signal on AI-generated system and code design in one hour. Custom problems tied to your stack, auto-graded by the same harness our top engineers train against.
See the business offering →
Stop practicing the last war.
// Free forever. Pro unlocks the full library — $19/mo.