Your interview loop is lying to you. CTFs, LeetCode, trivia—frontier LLMs ace them now. Even the security community is admitting it: open CTF formats are getting steamrolled by general‑purpose models. And it’s not just security—research venues like arXiv are tightening policies, with reports of year‑long suspensions for authors who shovel AI‑generated slop into the queue without oversight. The signal you thought you had from standardized puzzles is gone.
If you’re hiring senior engineers—US or nearshore—you don’t need a better puzzle. You need a loop that measures the only two things that still predict outcomes in the age of agents: how a human engineers with tools (AI included), and how they behave under real system constraints.
What changed (and why your loop must)
- Frontier LLMs now reliably solve canned problems. They trace through bytecode puzzles, reverse strings in assembly, and regurgitate canonical solutions with minor variations. That’s not cheating; that’s the new baseline.
- Gatekeepers are reacting. arXiv’s reported crackdown on authors who overuse AI is one example. Institutions are moving the line from “AI is novel” to “AI is a tool; your judgment is the product.” Your hiring loop must evolve the same way.
- Candidate environment is AI‑saturated. IDEs auto‑suggest, browsers ship built‑in assistants, and phones are sidecars. If your loop assumes isolation from AI, it’s measuring museum skills.
None of this means you should shrug and let a bot take your phone screen. It means you must redesign the loop to capture judgment, systems thinking, and collaboration—while explicitly modeling AI as part of the toolchain.
A decision framework for AI‑robust interviews
Before tactics, decide your stance on AI use during interviews. There are only three coherent positions.
1) Ban AI during evaluation
When to choose: You’re hiring for core research security, compilers, safety‑critical code, or roles where provenance and personal authorship are non‑negotiable.
Pros: Clear protocol. Easier to reason about authorship.
Cons: Lower realism. You will reject strong engineers who thrive with tools. You also incentivize covert use and adversarial behavior.
How to make it fair: Provide an instrumented, offline dev environment with complete docs and man pages. State the ban explicitly and explain why. Keep tasks short (≤90 minutes) to avoid “tool fatigue.”
2) Permit AI with disclosure
When to choose: Most product and platform engineering. You expect engineers to use assistants but want to see their judgment and verification loop.
Pros: Realistic signal. You observe how candidates prompt, critique, test, and integrate.
Cons: You must instrument for verification steps and “trust but verify” behavior, not just the final code.
How to make it fair: Ask candidates to narrate how they used AI. Score on validation, not on avoiding AI. Keep local unit tests and linters available. Disallow pasting company‑confidential material into external tools.
3) Require AI
When to choose: Roles explicitly centered on agentic workflows, codebase navigation with assistants, or internal tooling where AI fluency is a core multiplier.
Pros: You test the exact, modern skill: orchestrating AI to do useful work with high confidence.
Cons: You may screen out brilliant low‑tooling engineers. Provide an alternate path for exceptional profiles.
How to make it fair: Standardize on a provided assistant with the same context window and plugins for all candidates. Assess prompt engineering, retrieval use, and guardrail design.
Stop grading puzzles. Start grading engineering.
Here’s the loop we deploy for US and Brazilian senior candidates building production SaaS and AI backends. It’s AI‑robust because it measures how someone builds, reasons, and verifies—tools included. Total candidate time: 3 hours. Total panel time: ~3.5 hours. Target time‑to‑offer: 7 business days.
Stage 1: 30‑minute architecture case (whiteboard, no IDE)
- Prompt: “Design a minimal, resilient ingestion and inference pipeline for streaming JSON events at 5K/sec with 99.9% daily availability and cost caps of $X/day. Support backfill and idempotency.”
- What you score: Clear SLIs/SLOs, backpressure strategy, hot/warm storage choices, exactly‑once effects, failure domains, and cost reasoning under traffic spikes. If the candidate jumps straight to tech logos, steer them back to error budgets and data flow.
- AI posture: Irrelevant. This is about systems thinking under constraints.
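A strong candidate usually turns the prompt's numbers into budgets before drawing boxes. Here is a minimal back‑of‑envelope sketch, assuming a steady 5,000 events/sec and reading "99.9% daily availability" as a time‑based budget. Both are interpretations of the prompt, and the function names are illustrative. The cost cap stays at $X because the prompt leaves it open.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_downtime_budget_seconds(availability: float) -> float:
    """Seconds per day the pipeline may be unavailable at a given availability target."""
    return SECONDS_PER_DAY * (1.0 - availability)


def daily_event_volume(events_per_sec: int) -> int:
    """Events per day at a constant arrival rate (assumes no spikes)."""
    return events_per_sec * SECONDS_PER_DAY


if __name__ == "__main__":
    budget = daily_downtime_budget_seconds(0.999)
    volume = daily_event_volume(5_000)
    print(f"Downtime budget: {budget:.1f} s/day")           # 86.4 s/day
    print(f"Daily volume: {volume:,} events")                # 432,000,000 events
    # Events that arrive while the budget is being spent must be buffered or replayed.
    print(f"Events arriving during budget: {5_000 * budget:,.0f}")  # 432,000
```

The point for scoring: a daily budget of under a minute and a half rules out manual recovery, which should push the candidate toward buffering, replay, and idempotent writes.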
Stage 2: 60‑minute repository comprehension and change request
- Setup: A trimmed but non‑toy repo (1–3 services, 3–5K LOC) with missing edges and a readme that’s 80% accurate. Provide a single issue: “Add a rate‑limited, idempotent endpoint for X, with migration and rollback.”
- Environment: An ephemeral devcontainer with tests, linters, and an optional, instrumented assistant. Record only: test runs, git diffs, and terminal command history. No keystroke or screen capture; respect privacy.
- What you score: Navigation strategy, test‑first (or test‑last, but explicit), ability to find seams for change, safety checks (migrations, feature flags), and a bounded diff. Bonus for tightening a flaky test or improving a doc stub.
- AI posture: Permit with disclosure. Ask, “Show one suggestion you accepted and one you rejected. Why?” You’re measuring judgment, not aversion.
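For calibration, it helps to agree on what a passing answer to the "rate‑limited, idempotent endpoint" issue roughly looks like. The sketch below is one possible shape, not a reference solution. It assumes a token‑bucket limiter and a client‑supplied idempotency key, and keeps state in process memory. `TokenBucket`, `IdempotentEndpoint`, and the status codes are illustrative choices. A real change would use a shared store and would also need the migration and rollback the issue asks for.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Allow `rate` requests/sec on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so tests can control time
        self.tokens = capacity
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class IdempotentEndpoint:
    """Runs `action` at most once per idempotency key (in-memory, single process)."""

    def __init__(self, bucket: TokenBucket, action: Callable[[dict], dict]) -> None:
        self.bucket = bucket
        self.action = action
        self.results: Dict[str, dict] = {}

    def handle(self, idempotency_key: str, payload: dict) -> Tuple[int, dict]:
        # A replay returns the stored result: no new side effect, no token spent.
        if idempotency_key in self.results:
            return 200, self.results[idempotency_key]
        if not self.bucket.allow():
            return 429, {"error": "rate limited"}
        result = self.action(payload)
        self.results[idempotency_key] = result
        return 201, result
```

Design choices worth probing in the debrief: whether replays should count against the limit, what happens if the process dies between the side effect and storing the result, and how long keys are retained.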
Stage 3: 45‑minute production incident drill
- Setup: A contained sandbox with a low‑traffic service that has a latent bug (resource leak, race, or cache stampede). Provide logs, metrics (Grafana snapshots), and a dead‑simple runbook with gaps.
- Prompt: “You’re on call. It’s 11:07 AM ET. Error rate spiked from 0.2% to 3% on the write path. Page fired. Work the problem.”
- What you score: Hypothesis discipline, testable experiments, use of observability, clear comms (“I’m rolling back in 60 seconds unless X”), and containment. Code fix optional; stabilization mandatory.
- AI posture: Permit guardrailed assistants for reading stack traces or docs, but score the hypothesis‑driven loop and rollback discipline above all.
Stage 4: 45‑minute pair session with a future teammate
- Setup: A real bug from your backlog, scoped to fit the session. Your engineer drives 50% of the time.
- What you score: Collaboration style, boundary negotiation (“let’s stub this and circle back”), and code review hygiene. This is where cultural fit appears without culture‑theater.
- AI posture: Candidate’s choice. If they pull in AI, observe how they keep the pair engaged and verify changes.
Rubric: stop averaging vibes
Define weights and stick to them. Here’s a pragmatic split for senior backend or platform roles:
- Systems reasoning (Stage 1): 25%
- Codebase navigation and safe change (Stage 2): 35%
- Operational judgment (Stage 3): 25%
- Collaboration (Stage 4): 15%
Provide specific anchors, e.g., “Idempotency: 0 = not mentioned, 1 = mentioned but wrong, 2 = correct at handler, 3 = correct end‑to‑end with replay protection and dedupe ID.” Avoid catch‑alls like “senior presence.”
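To keep panels from averaging vibes, compute the final number mechanically. This is a minimal sketch using the weights above and the 0–3 anchor scale from the idempotency example. Averaging anchors equally within a stage is an assumption, and the stage keys are illustrative names.

```python
from typing import Dict, List

# Stage weights from the rubric above; they must sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "systems_reasoning": 0.25,     # Stage 1
    "safe_change": 0.35,           # Stage 2
    "operational_judgment": 0.25,  # Stage 3
    "collaboration": 0.15,         # Stage 4
}
MAX_ANCHOR = 3  # anchors are scored 0..3


def loop_score(stage_anchors: Dict[str, List[int]]) -> float:
    """Return a weighted score in [0, 1] from per-stage anchor scores."""
    if set(stage_anchors) != set(WEIGHTS):
        raise ValueError("every stage needs scores, and no extra stages are allowed")
    total = 0.0
    for stage, anchors in stage_anchors.items():
        if not anchors or any(not 0 <= a <= MAX_ANCHOR for a in anchors):
            raise ValueError(f"{stage}: anchors must be non-empty and within 0..{MAX_ANCHOR}")
        # Fraction of available anchor points earned in this stage.
        stage_fraction = sum(anchors) / (MAX_ANCHOR * len(anchors))
        total += WEIGHTS[stage] * stage_fraction
    return total


if __name__ == "__main__":
    candidate = {
        "systems_reasoning": [3, 2],
        "safe_change": [2, 3, 3],
        "operational_judgment": [2],
        "collaboration": [3],
    }
    print(f"{loop_score(candidate):.3f}")  # 0.836
```

Rejecting missing stages on purpose forces the panel to score every stage instead of letting one strong round stand in for the rest.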
Instrument for evidence, not surveillance
You don’t need spyware to make this work. You need provenance you can discuss with a candidate and re‑review internally.
- Collect: git diffs, commit messages, test runs and coverage, terminal history, and a short post‑task reflection (“What did you try? What surprised you? What would you do with 2 more hours?”).
- Don’t collect: keystrokes, screen video, or browser history. Aside from privacy landmines, none of those predict job performance as well as the diff + tests + narration.
- Flag AI‑shaped code: Large, instantly pasted diffs with uncommon style are a signal to ask about verification and understanding. Treat it as a coaching opportunity, not a trap.
Design tasks AI can help with—but not carry
Good tasks look like the work your team actually does under real constraints. They also have features that force human judgment:
- Ambiguity with consequences: A readme that lies in one small but important way. The candidate must notice, test, and correct it.
- Hidden coupling: A migration that breaks a downstream job unless a feature flag is in place. Can the candidate foresee and stage changes?
- Performance cliff: A naive approach passes tests but blows up at 10x load. Offer a simple load harness to surface it.
- Docs archaeology: Intentionally incomplete docs. Borrow from code archaeology: ask the candidate to map intent from code, not from tutorials.
By contrast, avoid puzzles with a single canonical solution an assistant can dump wholesale. If your internal testers can solve it in under 5 minutes with a generic prompt, throw it out.
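The "performance cliff" feature is easy to seed and easy to surface. Below is a sketch of a tiny load harness, assuming the seeded task is event deduplication. `dedupe_naive`, `dedupe_hashed`, and `cliff_ratio` are hypothetical names, and the sizes are arbitrary. Both implementations pass the same functional tests; only the harness shows which one degrades at 10x.

```python
import time
from typing import Callable, List


def dedupe_naive(events: List[str]) -> List[str]:
    """Correct, but `in` on a list is a linear scan, so total work grows as n^2."""
    seen: List[str] = []
    for e in events:
        if e not in seen:
            seen.append(e)
    return seen


def dedupe_hashed(events: List[str]) -> List[str]:
    """Same output, order preserved; set membership keeps total work near linear."""
    seen = set()
    out: List[str] = []
    for e in events:
        if e not in seen:
            seen.add(e)
            out.append(e)
    return out


def best_time(fn: Callable[[List[str]], List[str]], events: List[str], repeats: int) -> float:
    """Best-of-N wall time, which damps scheduler noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(events)
        best = min(best, time.perf_counter() - start)
    return best


def cliff_ratio(fn: Callable[[List[str]], List[str]], base_n: int = 500,
                factor: int = 10, repeats: int = 3) -> float:
    """Slowdown when load grows by `factor`. Near `factor` is healthy; near factor^2 is a cliff."""
    small = [f"evt-{i}" for i in range(base_n)]
    large = [f"evt-{i}" for i in range(base_n * factor)]
    return best_time(fn, large, repeats) / best_time(fn, small, repeats)


if __name__ == "__main__":
    for fn in (dedupe_naive, dedupe_hashed):
        print(f"{fn.__name__}: {cliff_ratio(fn):.0f}x slower at 10x load")
```

Exact ratios depend on the machine, so score whether the candidate runs the harness and interprets the slope, not whether they hit a specific number.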
Nearshore specifics: Brazil and the reality of distributed hiring
Brazil gives you 6–8 hours of overlap with US Eastern and Central time, a deep pool of senior devs, and rates typically 20–30% below US market rates. Your loop has to be just as tight for nearshore candidates as for US ones, with two practical tweaks:
- Language clarity: Keep tasks in clear English, but avoid culture‑ or slang‑heavy prompts. If your product domain is specialized, include a short glossary up front.
- Platform parity: Verify your ephemeral environment runs identically on candidates’ machines across OSes and bandwidth realities. If you provide a browser‑based IDE, test it from São Paulo and Porto Alegre at 2–5 Mbps upstream.
Don’t add extra hoops for nearshore. The point is parity and predictability, not proving worthiness via bureaucracy.
Cost and throughput math (so you can defend this to your CEO)
Assume interviewer time costs $200–$300/hour fully burdened in the US and $90–$150/hour nearshore. A classic three‑round loop (recruiter, coding puzzle, system design) often eats 5–6 hours of panel time per candidate and produces a 10–15% onsite‑to‑offer rate with high false negatives for seniors.
The AI‑robust loop above:
- Panel time: ~3.5 hours per candidate (0.5 + 1.0 + 1.0 + 1.0; each 45‑minute stage is budgeted at a full hour of panel time), tightly scheduled.
- Candidate time: 3 hours, all signal‑dense.
- Expected pass‑through: 20–30% onsite‑to‑offer for well‑sourced senior pipelines.
- False‑negative reduction: We consistently see 25–40% fewer “regret declines” after retro‑calibration, because you’ve stopped filtering out tool‑effective engineers.
Even a 10% improvement in hit rate pays for itself in 1–2 quarters of earlier productivity from the right hire. That’s before you count candidate NPS improvements (which translate to higher acceptance rates) when you cut out soul‑sucking puzzle rounds.
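If you want the arithmetic in a form you can paste into a planning doc, here is a small sketch. It uses the midpoints of the ranges above (US panel rates only) as illustrative inputs and counts only panel time, not sourcing or recruiter cost. `panel_cost_per_offer` is a hypothetical helper; swap in your own numbers.

```python
def panel_cost_per_offer(panel_hours: float, hourly_cost: float,
                         onsite_to_offer_rate: float) -> float:
    """Panel cost spent per offer made: cost per candidate divided by pass-through rate."""
    if not 0 < onsite_to_offer_rate <= 1:
        raise ValueError("onsite_to_offer_rate must be in (0, 1]")
    cost_per_candidate = panel_hours * hourly_cost
    return cost_per_candidate / onsite_to_offer_rate


if __name__ == "__main__":
    # Midpoints of the stated ranges; assumptions for illustration, not measurements.
    classic = panel_cost_per_offer(panel_hours=5.5, hourly_cost=250, onsite_to_offer_rate=0.125)
    robust = panel_cost_per_offer(panel_hours=3.5, hourly_cost=250, onsite_to_offer_rate=0.25)
    print(f"Classic loop:   ${classic:,.0f} of panel time per offer")  # $11,000
    print(f"AI-robust loop: ${robust:,.0f} of panel time per offer")   # $3,500
```

Under these assumptions, the shorter loop and the higher pass‑through rate compound: fewer hours per candidate and fewer candidates per offer.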
Compliance, ethics, and candidate trust
Transparency matters. Publish your AI policy in the interview invite:
- State clearly whether AI use is banned, permitted with disclosure, or required for specific stages.
- List exactly what telemetry you collect and for how long (e.g., 30 days). Promise deletion and honor it.
- Forbid pasting company‑confidential materials into external tools. If you permit AI, provide a walled‑garden assistant or insist on local‑only context.
If your legal team is nervous, don’t default to surveillance. Default to purpose‑limited evidence and a post‑task reflection. It’s hard to cheat understanding, especially when your tasks are complex but bounded and your questions are specific.
Implementation plan: 30/60/90
Day 0–30: Replace puzzles with product‑shaped tasks
- Fork an internal service and trim it to a 3–5K LOC exercise. Seed one migration, one flaky test, and one perf cliff.
- Stand up an ephemeral devcontainer and a browser‑based IDE fallback. Bake in tests and a smoke load harness.
- Write the scoring rubric with anchors. Run three pilots with your own seniors. Kill any task an LLM can ace in under 5 minutes.
Day 31–60: Ship AI policy and instrumentation
- Decide stage‑by‑stage AI posture (ban, permit, or require). Publish it.
- Instrument for diffs, tests, terminal history, and a short reflection form. No keystrokes, no screen capture.
- Train interviewers on the new rubric. Calibrate with recorded dry‑runs.
Day 61–90: Measure and iterate
- Track onsite‑to‑offer, time‑to‑decision, and acceptance rates. Ask every candidate who declines an offer: was the process fair and relevant?
- Review 10 random failed loops per month. Identify rubric drift and ambiguous prompts. Fix them.
- Introduce one variant per quarter (e.g., different bug class in Stage 3) to prevent overfitting and reduce leak risk.
Security and IP hygiene for AI‑permitted loops
- Confidentiality: Never expose real secrets or production data. Use synthetic or scrubbed fixtures. Rotate any sample tokens after each session.
- Model choice: If you provide an assistant, prefer a self‑hosted or vendor‑hosted instance with strict data‑retention controls. Turn off training on prompts and completions.
- Provenance notes: If you accept a take‑home, require a short CHANGELOG with attributions (“I used X assistant for Y; I copied Z from doc ABC”). This encourages honest disclosure and gives reviewers context.
What about pure security roles and broken CTFs?
Yes, frontier AI has punched holes in open CTF formats. If you’re hiring security engineers, stop treating public CTF medals as a proxy for depth. Build tiered, private labs that require multi‑step planning, not pattern‑recall exploitation:
- Tier 1: Basic recon and exploitation against known, patched vulns. Time‑boxed. AI will be helpful—good.
- Tier 2: Unknown service with a logic flaw that reveals itself only through traffic correlation and log timeline analysis. Score the investigation journal.
- Tier 3: Blue‑team drill: propose guardrails (rate limits, IDS rules, canaries) and deploy them. Many attackers can’t defend.
Again: measure judgment, not memorized payloads.
The meta point
arXiv’s stance is a mirror: they’re not banning tools; they’re demanding stewardship. Do the same. Candidates who can leverage AI while preserving correctness, cost, and safety will ship more value, sooner. Candidates who hide behind tools or refuse them outright will struggle. Your interview loop should tell those two apart in three hours, not three weeks.
Key Takeaways
- Standard puzzles are obsolete signals; LLMs ace them. Measure engineering judgment under real constraints instead.
- Choose an explicit AI posture per stage—ban, permit, or require—and publish it to candidates.
- Use a 3‑hour loop: architecture case, repo change request, incident drill, and pair session. Score with anchored rubrics.
- Instrument for diffs, tests, and terminal history—not keystrokes or screen capture. Respect privacy; get better signal.
- Design tasks that AI can help with, but not carry. Force human judgment via ambiguity, coupling, and performance cliffs.
- For nearshore (Brazil), aim for parity: clear English, platform‑tested environments, and 6–8 hours overlap with US ET/CT.
- Expect 20–30% onsite‑to‑offer for senior roles and 25–40% fewer false negatives when you move beyond puzzles.