The Lion in a Box

LLMs, enterprise, and the art of fitting powerful things into imperfect containers

Jun 12, 2026

A question worth sitting with

Have you ever fit a lion into a box? You have probably fit an LLM into a workflow, or built an agent around one. How many lions, how many boxes? Are all boxes the same? Did you fit the same lion into the same box every time?

LLMs are lions. Precisely, not just loosely. Powerful across a surprising range of terrain, unpredictable in ways that are hard to anticipate, and capable of both extraordinary performance and confident failure, sometimes in the same breath. Ethan Mollick calls this the Jagged Frontier. AI is dramatically better than humans at some tasks, surprisingly worse at others, and the edge between the two is neither straight nor obvious. You can’t look at a lion and know exactly where its reach ends.

An enterprise workflow is a box. Not a cage, a fit problem. A specific context, a specific user, a specific set of stakes, a specific tolerance for error. Whether to put the lion in the box is almost never the hard question. The hard question is what kind of box this actually is, what this particular lion does inside it, and what happens to everything around the box when it starts moving.

Most enterprise AI decisions are made as if the lion is the variable and the box is a given. It might be the other way around.

Three boxes

Of the agents I’ve built, three sit in my head as a useful illustration, not because they were the most complex, but because of how different the boxes were.

The first box was a scheduling workflow agent. The problem was efficiency: compressing a process that took two hours into five minutes. On that narrow measure, it worked. But the design problem was never really the scheduling. It was the surrounding workflow: the rescheduling friction it could create, the candidate experience it could degrade, the interviewer dissatisfaction it could generate if it got things wrong. The human-in-the-loop (HITL) step wasn’t a safety net bolted on at the end. It was load-bearing architecture, carefully placed around exactly the moments where human judgment was doing invisible work that the agent couldn’t replicate. The goal wasn’t to automate a task. It was to optimise a system without breaking the parts of it that weren’t visible in the brief.
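
To make "load-bearing" a little more concrete, here is a minimal sketch of what a human-in-the-loop gate could look like. It is an illustration, not the agent described above: the action names, the `route` function, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical action kinds that always go to a person, however confident
# the agent is. The post does not say which steps were actually gated.
NEEDS_HUMAN = {"reschedule_confirmed_interview", "cancel_interview"}


@dataclass
class ProposedAction:
    kind: str          # what the agent wants to do
    confidence: float  # agent's self-reported confidence, 0.0 to 1.0


def route(action: ProposedAction, min_confidence: float = 0.8) -> str:
    """Return 'human_review' or 'auto_execute' for a proposed action."""
    # Gate by kind first: these are the moments where human judgment
    # is assumed to be doing work the agent cannot replicate.
    if action.kind in NEEDS_HUMAN:
        return "human_review"
    # Everything else is still reviewed when the agent is unsure.
    if action.confidence < min_confidence:
        return "human_review"
    return "auto_execute"
```

The point of the sketch is the ordering: the kind of action is checked before the confidence score, so a highly confident agent still cannot bypass the gated moments.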

The second was a form-filling agent. A completely different box. Employees were filling a complex form they didn’t fully understand, with fields requiring values they didn’t know, against policies they weren’t always aware of. They’d abandon mid-flow, or complete it incorrectly, and the cost landed downstream on recruiters: corrections, delays, longer time-to-hire. The lion’s job here was to inform, consolidate, and advise. The design principles flipped entirely: how much to surface, in what form, with what confidence signalling, how directive to be.

The third was a research agent. This is the one that keeps me most honest. The users varied enormously across multiple axes: their familiarity with prompting, their understanding of the methodology being applied, their ability to validate whether the data was correct, their awareness of where the research was sourced from. A probabilistic attrition report doesn’t stay in a dashboard. It becomes an input to a campaign, a headcount plan, a retention initiative. The lion’s output propagates through layers of downstream decisions, amplifying as it travels, until something is materially off and nobody can easily trace it back to a confidence interval the model produced six weeks earlier. The box here was about epistemic responsibility across a decision chain you can’t fully see from where you’re standing.
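
One way to picture the traceability problem is a value that carries its range and its origin with it as it moves downstream. The sketch below is a toy under stated assumptions, not how the research agent worked: the `Estimate` type, the `scale` method, and every number in the usage note are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    low: float    # lower bound of the range
    mid: float    # point estimate
    high: float   # upper bound of the range
    source: str   # human-readable trail of where the number came from

    def scale(self, factor: float, step: str) -> "Estimate":
        """Multiply by a non-negative factor, keeping the range and the trail."""
        if factor < 0:
            # A negative factor would flip low and high; out of scope here.
            raise ValueError("factor must be non-negative")
        return Estimate(
            self.low * factor,
            self.mid * factor,
            self.high * factor,
            f"{self.source} -> {step}",
        )
```

For example, `Estimate(0.08, 0.12, 0.16, "attrition model").scale(500, "headcount plan")` yields a range of roughly 40 to 80 expected leavers rather than a bare 60, and its `source` still names the model that produced it.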

Three boxes. Three completely different design philosophies. Three different answers to what the lion was even for.

The chariot

Now scale that by an entire organisation.

Take a single function most people outside the space wouldn’t think twice about: interview management. It is one sub-function within hiring, itself one function within HR. It has easily ten to twelve agents sitting inside it: scheduling, feedback collection, candidate communication, interviewer briefing, offer management, among others. Ten to twelve boxes. Ten to twelve lions, each tamed differently. Multiply that across every sub-function of hiring, then across every HR function, then across every department in the enterprise: finance, legal, support, sales, marketing, IT. Support alone will likely have hundreds of boxes before any of this matures.

The chariot isn’t being pulled by a handful of lions. It’s being pulled by hundreds, eventually thousands, each one the same species, the same raw capability, but coloured differently by how it’s been tamed, what box it’s been built into, what constraints have been designed around it for that specific workflow, that specific user, that specific tolerance for error. And this is just one enterprise. Every organisation running AI at any meaningful depth is assembling the same chariot, independently, simultaneously.

Each of those boxes is a design problem. Each requires someone to understand the workflow deeply enough to know where the lion’s power is useful and where its unpredictability is dangerous. Each requires a different answer to the same set of questions: what is this lion for, what are the walls, what happens when it gets something wrong.

The scale isn’t a detail. It’s the whole problem.

Some might push back on the box as a metaphor. Workflows aren’t rigid containers, they’re fluid, they shift, they overlap. Fair enough. A box at least has fixed walls. If the container is amoeba-like, changing shape, expanding, contracting, then you’re not just fitting a lion into a box. You’re fitting it into something that won’t hold still while you’re trying to design around it.

Colouring the lion

Building one box well is harder than it looks.

You need to know the lion: its jagged frontier, where it performs with surprising competence, where it fails with surprising confidence, and how that edge shifts as the model evolves, as users learn to prompt, as the organisation accumulates context around it. That edge is not fixed. It moves.

You need to know the box too. The workflow, the user, the stakes, the tolerance for error. But the box moves as well. User behaviour adapts as AI becomes familiar. Organisational processes shift around the agent once it’s running. What felt like a well-designed constraint six months ago starts to chafe or gap in ways nobody anticipated at design time.

And you need to know how to fit the lion into the box: which parts of its capability to surface, which to constrain, where to place human judgment, how to communicate confidence, how much to give and in what form. Three agents built for the same organisation required three completely different answers to those questions.

None of this is obvious from the outside. The scheduling agent required understanding not just the workflow but the ripple effects of a poorly timed invite: the candidate’s first impression, the interviewer’s frustration, the recruiter absorbing the fallout. The form-filling agent required understanding why employees abandon forms mid-flow, what they actually don’t know, and at what point guidance tips from useful into overwhelming. The research agent required understanding how differently users across expertise levels would interpret a probabilistic output, and how far that interpretation would travel before anyone thought to question it. In each case, the depth of user knowledge wasn’t a precondition to building. It was the design. Without it, the box would have been the wrong shape entirely.

It is, in the most literal sense, an art. Designing the box and colouring the lion, knowing what this specific lion should be in this specific context, is not a problem that yields to technical expertise alone. The technology is almost the easier part.

The economics

Every enterprise deploying AI at scale is facing the same problem. Hundreds of boxes to design, hundreds of lions to colour, each requiring the kind of depth described above. The instinctive response is to build that capability internally: hire the people, stand up the team, develop the expertise, own the lions.

It’s an understandable instinct and, in most cases, a structurally poor investment.

The expertise required to build one box well is expensive. It doesn’t transfer cleanly across workflow types: what you learn designing a scheduling agent doesn’t map onto what you need to know to design a research agent. It doesn’t compound at the speed the problem demands. And the moment the lion changes (a new model, a new capability, a new failure mode), significant parts of the box need rethinking anyway.

Building boxes is not an enterprise’s purpose. Its purpose is what happens inside the workflows those boxes enable: hiring the right people, serving customers, closing deals, managing finances. The box is infrastructure, not strategy. Like most infrastructure, the economics suggest it’s built better by those for whom it is the purpose.

The organisations for whom box-building is the purpose are the ones operating across enterprises, not within them. A platform that has designed boxes for interview management across hundreds of companies has encountered a range of workflow variations, user behaviours, and failure modes that no single enterprise will ever see. That surface area is not just experience. It’s the raw material for something more durable.

The question for any enterprise isn’t whether to use AI. It’s whether the capability to deploy it well should be built internally or demanded from the platforms it already relies on. The economics suggest it’s almost always the latter.

Who sees the most lions

Surface area alone doesn’t determine who wins this.

A platform that has seen a thousand boxes has an advantage over an enterprise that has seen ten. Two platforms that have seen the same number of boxes can still be in very different positions, depending on what they’ve done with what they’ve seen.

Lion-taming intuition doesn’t live in one person. It lives in the organisation. A single person who has designed fifty boxes has accumulated something valuable but finite. An organisation where what that person learned is absorbed, challenged, refined, and built upon by hundreds of others compounds in a way that individual expertise simply can’t. How efficiently the learning from each lion travels across the people responsible for the next one matters more than how many lions the platform has encountered.

This is harder to build than it sounds. Expertise accumulates in pockets. The team that built the scheduling box carries knowledge that would be genuinely useful to the team building the research box. That knowledge doesn’t move on its own. It lives in decisions that were never documented, in problems that were solved once and then quietly forgotten, in intuitions that took months to develop and have no obvious home in any handoff document. This isn’t a failure of any individual or team. It’s the natural state of how complex knowledge accumulates inside organisations.

The platforms that build a genuine moat here are the ones that treat organisational learning as a product problem, with the same rigour they apply to the boxes themselves. Compounding lion-sightings rather than just accumulating them. The gap that opens between a platform that does this well and one that hasn’t figured it out yet isn’t visible in the short term. It widens in one direction and doesn’t close easily.

Questions worth sitting with

Most enterprises measuring their AI deployments right now are asking whether the lions are running. Adoption rates, task completion, time saved. The lion is moving and that registers as success.

What’s harder to see from that vantage point is whether the box was right. Whether the lion’s jagged edges were understood before the walls were built. Whether the design accounted for the users who would interact with it, the workflows adjacent to it, the decisions that would be made downstream from it. Whether the person who built the box had seen enough other boxes to know what they didn’t know.

These aren’t questions with clean answers. AI is still being figured out by the people building it, by the people deploying it, by the people sitting inside the workflows it’s changing. Anyone who tells you they have this solved is probably describing a very small part of the problem.

The boxes are imperfect. The lions are unpredictable. The chariots are already moving.

How well do you know the box before you build it? How well do you know the lion before you colour it? Who has seen enough of both to know what they don’t know yet? And as the boxes keep shifting, workflows adapting, user behaviours changing, the lions themselves evolving, who has the capability to keep up with all of that, continuously, at this scale?

I don’t have answers to any of those. I’m not sure anyone does yet. But I’ve found that the right questions have a way of pointing somewhere useful, even when the destination isn’t visible.
