When I'm fixing UI, what I want to do is simple. I want to look at the screen and say "here, this, like this." With a person, I'd just point at the monitor and say "tighten up the spacing on this button" and that's it. But with an agent, that doesn't work. I have to write in text: "the spacing on the button in the second section below the header is too wide." In my head, I can see exactly where it is, but the moment I put it into words, the positional information gets blurred. The agent interprets that hazy description in its own way, gets it wrong, I explain again—and we repeat. This was the bottleneck.
Even between people, words are only one part of how we communicate. When we're in the same room, we use gestures, eye contact, facial expressions, and tone all at once. We say "here" while pointing with a finger, and when the other person tilts their head in confusion, we elaborate. Because we can read from their reaction that the message didn't land, communication gets corrected in real time.
With agents, none of these means exist. Since you have to pack location, context, intent, and priority into a single text message, the cost of writing precise instructions becomes abnormally high. This is one reason why fixing it yourself is sometimes faster.
In the previous post, I said the problem was "what goes unspecified." Go one step deeper, and there are things you want to specify but can't easily put into words. Visual location is a prime example. Without knowing the file path or component name, accurately referring to "that thing on the screen" in text is harder than you'd think.
I recently tried a tool called Agentation. In the browser, you hover over the UI element you want to fix, select it with a shortcut key, and it automatically captures the element's class name, selector, and component hierarchy. Attach a note like "reduce the padding" and it becomes structured context the agent can immediately understand. The tool now does the work I used to do myself: translating what I see on screen into a location in the code.
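To make "structured context" concrete, here is a minimal sketch of what such a capture might look like and how it could be flattened into an unambiguous message for an agent. The shape (`ElementContext`, `toAgentMessage`) is hypothetical, not Agentation's actual schema:

```typescript
// Hypothetical shape of the context a tool like Agentation might capture
// when you select an element on screen (not the tool's real schema).
interface ElementContext {
  selector: string;        // CSS selector pinpointing the element
  componentPath: string[]; // component hierarchy, outermost first
  note: string;            // the human's free-form instruction
}

// Render the capture as a precise message, replacing vague references
// like "the button in the second section below the header".
function toAgentMessage(ctx: ElementContext): string {
  return [
    `Element: ${ctx.selector}`,
    `Component path: ${ctx.componentPath.join(" > ")}`,
    `Request: ${ctx.note}`,
  ].join("\n");
}

const msg = toAgentMessage({
  selector: "section:nth-of-type(2) button.cta",
  componentPath: ["App", "Hero", "CtaButton"],
  note: "reduce the padding",
});
```

The point is not the format but the division of labor: the tool pins down *where*, so the human only has to say *what*.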
What I realized using it was this: I wasn't bad at explaining. The medium of text itself was too narrow to carry visual context. When the tool changed, the quality of communication changed.
Looking at how agent tools are evolving today, most focus on improving the agent's capabilities. Larger context windows, better code comprehension, more accurate execution. These matter, of course—but no matter how smart agents get, if text is the only channel for conveying intent, the bottleneck remains.
We need more tools like Agentation that convert visual context into structured data. Widening the means by which humans convey intent is a problem that carries the same weight as improving the agent's capabilities.
Working well with agents requires two things moving in tandem: the ability to structure your intent, and the breadth of means available to convey it. The former is the human's job, the latter is the tooling's job. We're at the point where the tooling's job matters just as much.