Preference Model is automating ML engineering, and a critical part of that is models' ability to develop software.
The way we build software is changing fast. Five years ago we wrote every line of code by hand. Today, we don't. What does our work look like five years from now? We are shaping this future.
Recent models work well on narrow tasks but are still brittle on real software work: large codebases with real conventions and technical debt, judgment-heavy design decisions, and multi-step problems. The bottleneck on fixing that is the supply of hard, high-fidelity scenarios that find where the best models still break. That is what we build.
Our founding team previously worked on Anthropic's data team, building the data infrastructure and datasets behind Claude. We are partnering with leading AI labs to push AI closer to achieving its transformative potential.
You will work on frontier AI from day one. You will be finding the limits of the most capable models on earth before the public ever sees them and building the hard problems that push those limits further.
The work: hunt for the specific places the best coding models in the world still fail, then build the self-contained, rigorously graded scenarios that expose those failures. You own each one end to end. There is no permission to ask for and no queue to wait in. If you find a place a frontier model breaks, you build the thing that teaches it to do better.
Each problem is a fresh challenge that you own end to end: a realistic system, a genuinely difficult problem, and verification robust enough that a frontier model can't game it. It's build-the-future work, and the people who do it develop something rare: a deep intuition for how frontier models behave that only a handful of engineers in the world have.
You will work closely with a small team of engineers and directly with our founders, with full ownership and autonomy over what you build. This is independent, high-ownership work with regular feedback.
Hunt for where frontier models break across software, and build the hard, high-fidelity scenarios that expose those failures and push the ceiling of what the best models can do.
Own the hardest problems on the roadmap end to end: multi-step workflows, realistic stakeholder interactions, large codebases with real conventions and technical debt, and challenging system design.
Build verification robust enough that a frontier model can't hack it, and tell genuine capability gaps apart from artifacts of your own setup.
Direct coding agents heavily in your day-to-day work, evaluate their output critically, and recognize when they are failing in subtle ways.
Build the tooling your own work depends on.
Mentor newer engineers on the team as it grows.
Deep software engineering experience across multiple domains, with genuine expertise in at least one specialty: infrastructure, distributed systems, performance, security, compilers, databases, or similar.
Proficiency in Python.
Extensive hands-on experience with coding agents (Claude Code, Cursor, Codex, or similar), including an intuition for where they cut corners and how to direct them well.
Strong intuition for how models behave, even without prior ML or AI experience. You can anticipate where a model will take shortcuts and design around that.
Comfort working independently on complex, ambiguous problems with minimal direction.
Track record of owning work end-to-end in previous roles.
You have been a senior or staff software engineer at a company known for engineering rigor (e.g., a frontier lab, infrastructure startup, or systems-heavy team) and want to apply that experience to model training.
You have deep specialty expertise in an area that current models struggle with (distributed systems, low-level performance, security, compilers) and can build the problems that expose those weaknesses.
You get excited about building a new hard problem from scratch on a regular basis.
You have been an early engineer at a previous startup, shipped independently, and want to do it again in AI.
You have spent significant time building with coding agents, written about their failure modes, or contributed to agent evaluation work.
Competitive cash and equity compensation (>90th percentile)
Ownership and autonomy in a fast-moving startup environment
Opportunity to work alongside senior and staff engineers from frontier labs and infrastructure companies, plus top ML engineers
Health, vision, and dental benefits
401(k) match
Lunch provided every day onsite
Weekly snack orders
Visa sponsorship & relocation support available
We value diverse perspectives and experiences. If you're excited about this role but don't check every box, we still encourage you to apply.