Building with AI Agents: Lessons from the Trenches
What we've learned building AI agents for professional networking — the challenges, breakthroughs, and surprising insights.
Everyone is building AI agents right now. Most of them will fail — not because the models are bad, but because the teams building them are solving the wrong problems. I know this because we almost made the same mistakes at ConnectMachine.
Over the past year, we have been building an AI agent that manages professional contacts and networking for our users. Not a chatbot. Not a recommendation engine. An agent that observes your professional world, makes decisions, and takes action on your behalf. The journey from prototype to production taught us lessons that no amount of reading about LLMs could have prepared us for.
This post is a candid look at what we got right, what we got wrong, and the six hard-won lessons that shaped how our agent works today.
What AI Agents Actually Are
Before diving into lessons, it is worth drawing a clear line between AI agents and the chatbots most people are familiar with.
A chatbot is reactive. You type a question, it generates a response. The interaction is bounded by a single exchange. ChatGPT, customer support bots, and most AI features shipping today are chatbots at their core, even sophisticated ones.
An agent is proactive. It has goals, it can observe its environment, it makes plans, and it executes actions — sometimes without being asked. An agent does not just answer your question about a contact. It notices that a contact changed jobs three days ago, cross-references that with the business card you scanned at a conference last month, updates the record, and surfaces a suggestion: “Looks like Sarah moved to Stripe. Want to send a congrats note?”
The difference matters because it changes every engineering decision you make. Chatbots optimize for response quality. Agents optimize for decision quality, and decisions have consequences. When an agent updates a contact record incorrectly, merges two people who should not be merged, or sends a follow-up at the wrong time, the cost is not a bad answer — it is a broken relationship.
That distinction shaped everything we built at ConnectMachine.
Why We Built an AI Agent at ConnectMachine
ConnectMachine exists because professional networking is broken at the point of capture. You meet someone at a conference, scan their LinkedIn QR code, maybe grab a business card, and then… nothing. The contact sits in your phone as a name and number. The context — where you met, what you discussed, what you promised to follow up on — evaporates within days.
We started by solving the capture problem. Our scanner digitizes business cards in under three seconds. Our LinkedIn QR integration adds event context that LinkedIn itself does not provide. But capture is only the first step. The real value is in what happens after.
That is where the agent comes in. We needed a system that could take raw contact data and turn it into a living, contextualized professional network. One that enriches profiles automatically, suggests follow-ups at the right time, keeps records current when people change roles, and does all of this without the user having to think about it.
We could have built this as a series of automated workflows — triggers and rules, the way CRMs have worked for decades. But workflows are rigid. They break when reality does not match the template. An agent can handle ambiguity. It can reason about whether two slightly different email addresses belong to the same person. It can decide whether a job change is worth surfacing immediately or can wait for the weekly digest. It can adapt.
That flexibility is what made us choose the agent approach. But it also made everything harder.
Lesson 1: Start Narrow, Expand Later
Our first prototype could do too much. It could scan cards, enrich profiles, suggest follow-ups, draft emails, and schedule reminders. It was impressive in demos and terrible in practice. The agent made plausible-sounding decisions that were subtly wrong, and because it was doing so many things, users could not figure out where the errors were coming from.
We stripped it back to a single capability: business card scanning and structured data extraction. That is it. The agent would take a photo of a business card, extract the fields, and present them for confirmation. No enrichment, no follow-ups, no drafts.
This was painful. We had investors asking about the AI roadmap and we were shipping a fancy OCR tool. But starting narrow gave us two critical things.
First, it gave us a trust foundation. Users learned that the agent was accurate at its one job. When we later expanded to contact enrichment, users gave the new capability the benefit of the doubt because the agent had earned credibility.
Second, it gave us clean data about failure modes. When the agent only does one thing, every bug is attributable. We built detailed error taxonomies for card scanning — misread characters, incorrect field mapping, language detection failures — that informed how we designed every subsequent capability.
Today the agent handles enrichment, follow-up suggestions, voice queries, and event-based context tagging. But each capability was added one at a time, only after the previous one had reached a quality bar where users stopped checking its work. The sequence matters. You cannot bolt trust on after the fact.
Lesson 2: Trust Is the Hardest Engineering Problem
Speaking of trust — it deserves its own section because it is genuinely the hardest problem we have faced. Harder than model accuracy. Harder than latency optimization. Harder than any infrastructure challenge.
Trust in an AI agent has three components, and you need all of them.
Correctness: The agent must be right. Not most of the time — overwhelmingly so. We found that users have an asymmetric tolerance for errors. If the agent correctly updates 99 contacts and botches one, the user remembers the one. In professional networking, a wrong phone number or a misattributed company can have real consequences. We set our accuracy bar at 99.5% for structured data extraction before we shipped, and we measure it continuously.
Self-awareness: The agent must know when it does not know. This is where most AI agents fail. The model will hallucinate a plausible-looking phone number rather than admitting it cannot read the card. We invested heavily in calibration — training the system to recognize its own uncertainty. When the agent is not confident about a field, it asks the user rather than guessing. Users tell us this is one of the features they trust most. An agent that says “I could not read this clearly — can you check?” feels more competent than one that confidently gets it wrong.
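The "ask rather than guess" behavior can be sketched as a simple confidence gate over extracted fields. This is a minimal illustration, not ConnectMachine's actual implementation; the field names and the threshold value are hypothetical, and in practice the bar would be calibrated per field.

```python
from dataclasses import dataclass

# Hypothetical threshold; a real system would tune this per field.
CONFIDENCE_THRESHOLD = 0.9


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float


def triage_fields(fields):
    """Split extraction results into auto-accepted values and user prompts.

    Fields below the threshold are never guessed at; they are surfaced
    to the user for confirmation instead.
    """
    accepted, needs_review = {}, []
    for field in fields:
        if field.confidence >= CONFIDENCE_THRESHOLD:
            accepted[field.name] = field.value
        else:
            needs_review.append(field.name)
    return accepted, needs_review
```

The key design point is that low-confidence values never silently enter the record; they become a question to the user.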
Transparency: The agent must show its work. Not in the “here is my chain of thought” sense that AI researchers care about, but in a practical, actionable way. When the agent updates a contact record, it shows exactly what changed and why. When it suggests a follow-up, it references the specific event or conversation that triggered the suggestion. The user should never wonder “why did the agent do that?” — the answer should be visible without asking.
We built a decision log that records every action the agent takes, along with the reasoning that led to it. Users rarely read it. But the fact that it exists, and that they can read it any time, changes how they feel about the agent. Transparency is as much about the option of oversight as it is about the act of overseeing.
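An append-only decision log of this kind can be sketched in a few lines. This is an illustrative shape only; the action and target naming conventions below are hypothetical, not ConnectMachine's real schema.

```python
import time


class DecisionLog:
    """Append-only record of every action the agent takes, with reasoning."""

    def __init__(self):
        self._entries = []

    def record(self, action, target, reasoning):
        entry = {
            "timestamp": time.time(),
            "action": action,        # e.g. "update_field" (illustrative)
            "target": target,        # e.g. "contact:sarah/title" (illustrative)
            "reasoning": reasoning,  # plain-language evidence shown to users
        }
        self._entries.append(entry)
        return entry

    def for_target(self, target):
        """Everything the agent ever did to one record, newest first."""
        return [e for e in reversed(self._entries) if e["target"] == target]
```

Because the log is append-only and keyed by target, "why did the agent do that?" is always answerable by a single lookup.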
Lesson 3: Make Every Action Reversible
This lesson sounds obvious but has deep architectural implications. If you tell users “the agent works on your behalf,” you are asking them to give up control. The only way that bargain works is if they can take control back at any moment.
Every action our agent takes is reversible. Updated a contact field? One tap to revert. Merged two duplicate contacts? Undo the merge and both records reappear exactly as they were. Suggested a follow-up and you sent it? Okay, we cannot unsend an email, but we can flag it and draft an apology if needed.
This required us to build an event-sourcing architecture from the start. We do not store the current state of a contact record — we store the sequence of changes that produced it. Any change can be rolled back without side effects. It is more storage, more complexity, and worth every byte.
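The core of this pattern fits in a short sketch: the contact record is the fold of its change events, so undoing a change means dropping the last event and replaying the rest. The event shape here is illustrative, assuming a simple set/remove vocabulary rather than ConnectMachine's actual event schema.

```python
def apply_events(events):
    """Replay change events into the current contact state."""
    state = {}
    for event in events:
        if event["type"] == "set_field":
            state[event["field"]] = event["value"]
        elif event["type"] == "remove_field":
            state.pop(event["field"], None)
    return state


class ContactRecord:
    """Event-sourced record: state is derived, never stored directly."""

    def __init__(self):
        self.events = []

    def set_field(self, field, value):
        self.events.append({"type": "set_field", "field": field, "value": value})

    def undo(self):
        # Reverting is removing the last event; replay yields the prior state.
        if self.events:
            self.events.pop()

    @property
    def state(self):
        return apply_events(self.events)
```

Because no change mutates state in place, rollback has no side effects by construction.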
The undo capability changed user behavior in a way we did not expect. People became more willing to let the agent act autonomously once they knew they could reverse anything. The safety net made them braver. Counterintuitively, building undo made the agent feel more capable, not less, because users let it do more.
Lesson 4: Explain in Plain Language, Not Confidence Scores
Early in development, we showed users the agent’s confidence scores. “87% confident this is the correct email address.” We thought this was transparent and informative. Users hated it.
The problem is that confidence scores create anxiety without enabling action. What is the user supposed to do with 87%? Is that good? Should they check? What if it said 92% — would that be different? The number provides an illusion of information while actually communicating nothing useful.
We replaced confidence scores with plain-language explanations. Instead of “87% confident,” the agent says “I found this email on their company website.” Instead of “62% match,” it says “This might be the same person you met at AWS re:Invent — they share the same first name and company, but the email is different.”
The shift from numbers to narratives changed everything. Users understand context. They can evaluate “I found this on their company website” because they know whether that is a reliable source. They cannot evaluate 87%.
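One way to implement this shift is to key explanations to the evidence source rather than the score. The source names and phrasings below are hypothetical examples of the pattern, not the actual templates we ship.

```python
# Illustrative mapping from evidence source to plain-language narrative.
EXPLANATIONS = {
    "company_website": "I found this email on their company website.",
    "card_scan": "I read this from the business card you scanned.",
    "unverified": "I could not verify this — you might want to double-check.",
}


def explain(source):
    """Return the narrative for an evidence source; hedge when unknown."""
    return EXPLANATIONS.get(source, EXPLANATIONS["unverified"])
```

Unknown or missing sources deliberately fall through to the hedged message, so the agent never presents unverified data as certain.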
This extends to how the agent communicates about its own limitations. We never say “low confidence.” We say “I could not verify this — you might want to double-check.” The message is the same, but the framing treats the user as a collaborator rather than a statistician.
Lesson 5: Design for Graceful Failure
AI agents will be wrong. Not occasionally — regularly, especially at the edges. The question is not whether the agent will fail but what happens when it does.
We designed three tiers of failure handling.
Tier 1: Self-correction. The agent detects its own mistake before the user sees it. For example, if it extracts a phone number from a business card and the number does not match the expected format for the country code it detected, it re-examines the card before presenting the result. Most failures are caught here.
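A Tier 1 check like the phone-number example can be sketched as a plausibility test run before results are presented. The digit-count table below is a deliberately tiny, hypothetical stand-in; real phone validation is far more involved.

```python
import re

# Expected national digit counts; illustrative and far from complete.
EXPECTED_DIGITS = {"US": 10, "UK": 10, "DE": 11}


def plausible_phone(number, country):
    """Cheap sanity check: does the digit count fit the detected country?"""
    digits = re.sub(r"\D", "", number)
    expected = EXPECTED_DIGITS.get(country)
    return expected is None or len(digits) == expected


def self_check(extraction):
    """Return fields that should be re-examined before showing the result."""
    retry = []
    if not plausible_phone(extraction["phone"], extraction["country"]):
        retry.append("phone")
    return retry
```

Any field flagged here triggers a re-read of the card, so the user never sees the implausible value at all.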
Tier 2: Transparent uncertainty. The agent is not sure and tells the user. It presents its best guess alongside the alternatives and lets the user decide. This is where the plain-language explanations from Lesson 4 are critical. The user needs enough context to make a quick decision.
Tier 3: User-reported correction. The agent was wrong and the user catches it. When this happens, the agent does three things. It reverts the action. It apologizes plainly — “Got it, I have reverted that change.” And it logs the correction as a training signal so it is less likely to make the same mistake again.
The third tier is where most teams stop, but the logging step is crucial. Every user correction is a labeled data point. Over time, these corrections compound into significant quality improvements. Our accuracy on field extraction improved by 3.2 percentage points in six months purely from user correction signals: no model retraining, no externally sourced data, just better calibration tuned from real-world feedback.
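One simple way corrections can feed back into calibration is to track per-field correction rates and raise the auto-accept bar for fields users fix most often. The adjustment rule below is a hypothetical sketch of the idea, not our production formula.

```python
from collections import defaultdict


class CorrectionTracker:
    """Turn user corrections into per-field confidence thresholds."""

    def __init__(self, base_threshold=0.9):
        self.base = base_threshold
        self.shown = defaultdict(int)
        self.corrected = defaultdict(int)

    def record(self, field, was_corrected):
        self.shown[field] += 1
        if was_corrected:
            self.corrected[field] += 1

    def threshold(self, field):
        """Raise the auto-accept bar for frequently corrected fields."""
        if self.shown[field] == 0:
            return self.base
        rate = self.corrected[field] / self.shown[field]
        # Illustrative rule: nudge the bar up in proportion to error rate.
        return min(0.99, self.base + rate * 0.05)
```

The effect is that fields the agent gets wrong are routed to the user more often, which is exactly the Tier 2 behavior described above.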
Lesson 6: The Agent Should Feel Like a Thoughtful Assistant, Not a Robot
This is the lesson that is hardest to engineer because it is about feel. The difference between an agent that users love and one they tolerate often comes down to tiny interaction details.
Our agent does not say “Contact record updated successfully.” It says “Done — I have updated Sarah’s title to Head of Partnerships at Stripe.” The first is a system notification. The second is what a human assistant would say.
When the agent surfaces a suggestion, it provides context that shows it has been paying attention. Not “You have a follow-up due” but “You met David at the Founders Dinner three weeks ago and mentioned you would share that article about Series A fundraising. Want me to draft a note?”
This is not just cosmetic. The contextual framing serves a functional purpose — it helps the user recall the interaction and evaluate whether the suggestion is appropriate. But it also communicates that the agent is not a dumb automation. It is tracking the things that matter to you.
We spent more time on the agent’s voice and interaction patterns than on any individual technical feature. Tone, timing, level of detail, when to be proactive versus when to wait for a prompt — these are design decisions that require the same rigor as architecture decisions.
Practical Architecture Decisions
A few technical choices that proved important:
Event-driven architecture. Every external signal — a LinkedIn profile change, a new email signature detected, a calendar event — flows through an event bus. The agent subscribes to relevant events and decides what to act on. This decouples data ingestion from decision-making and makes the system easier to debug and extend.
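The publish/subscribe decoupling described above can be sketched in a minimal in-process event bus. The event type names are illustrative, not ConnectMachine's real topics.

```python
from collections import defaultdict


class EventBus:
    """Minimal pub/sub bus: ingestion publishes, capabilities subscribe."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Publishers never know which capabilities react to an event.
        for handler in self._subscribers[event_type]:
            handler(payload)
```

In a real system the bus would be durable and asynchronous, but the decoupling property is the same: a new capability is just another subscriber, and ingestion code never changes.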
Local-first data. Contact data stays on the user’s device by default. The agent runs inference locally for simple tasks and only reaches out to cloud services for enrichment that requires external data. This is not just a privacy feature — it is a latency feature. The agent responds in milliseconds for local operations, which makes it feel instant.
Privacy constraints as design constraints. We treat privacy requirements not as limitations but as design inputs. The agent cannot read your emails, so it learns to infer context from signals it can access — calendar events, LinkedIn activity, card scans. This constraint forced us to build a more creative and ultimately more respectful system.
Modular capability system. Each agent capability — scanning, enrichment, follow-ups, voice queries — is an independent module with its own accuracy metrics, failure handling, and rollback logic. This lets us ship, monitor, and if necessary disable capabilities independently without affecting the rest of the system.
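The capability contract might look something like the sketch below: each module exposes its own run and rollback hooks plus a kill switch, and the runtime simply skips anything disabled. Class and method names are hypothetical.

```python
from abc import ABC, abstractmethod


class Capability(ABC):
    """Contract every agent capability implements independently."""

    enabled = True  # per-capability kill switch

    @abstractmethod
    def run(self, event):
        ...

    @abstractmethod
    def rollback(self, action_id):
        ...


class AgentRuntime:
    """Dispatches events to whichever capabilities are currently enabled."""

    def __init__(self, capabilities):
        self.capabilities = capabilities

    def dispatch(self, event):
        results = []
        for cap in self.capabilities:
            if cap.enabled:  # disabled modules are skipped, not removed
                results.append(cap.run(event))
        return results
```

Flipping `enabled` on one module takes that capability offline without redeploying or touching the others, which is the independence property the paragraph above describes.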
What’s Next for AI Agents in Professional Tools
We are still early. The current generation of AI agents, ours included, operates in a narrow band — they can handle well-defined tasks with structured data. The next frontier is agents that can navigate ambiguity in professional relationships the way humans do.
Imagine an agent that understands not just that you met someone, but the dynamics of the relationship — that this is a warm lead versus a casual acquaintance, that you have mutual connections who could make an introduction, that the timing is right for a check-in because their company just announced a funding round.
We are building toward that, carefully. The principles remain the same: start narrow, earn trust, make everything reversible, communicate like a human. The capabilities will expand, but the philosophy will not.
The teams that will win in the AI agent space are not the ones with the most powerful models. They are the ones that solve the trust problem first. Because the most capable agent in the world is useless if nobody lets it act on their behalf.
If you are building AI agents — for professional tools, for productivity, for anything that touches real-world decisions — I hope these lessons save you some of the mistakes we made. And if you want to see what a trust-first AI agent looks like in practice, come try ConnectMachine. The agent is waiting to meet you.