Navi: what we are learning from building a personal agent

Memory, tools, and the difference between an agent and a chatbot. Notes from the in-house product that taught us how to build for clients.


Vashishta Mithra · 4 min read · May 12, 2026

Navi is V19's personal agentic operating system. It spans long-horizon memory, device and OS control, calendar and comms orchestration, browser automation, and life-management workflows. One agent, planning and acting across an entire digital life.

It is also our most expensive piece of R&D and our most useful one, because every pattern we ship into client agentic systems lands in Navi first.

These are the things we did not believe before we built it, and now do.

A chatbot is a function call. An agent is a system.

The defining shift is not the model. It is everything around the model.

A chatbot wraps an LLM with a UI. An agent wraps an LLM with tools, memory, planning, error recovery, evals, and an explicit notion of what "succeeded" means. Most projects that fail at "AI agent" are still architected like chatbots, and their teams are surprised when a chatbot cannot reliably book a meeting across three apps.

What separates the two:

Tools, not just prompts. The model decides what to do. The infrastructure decides whether what it did worked.
Memory that is not the context window. Long-horizon agents need a separate persistence layer, with explicit write, read, summarize, and forget operations.
A planning step. For anything that takes more than one tool call, the agent should propose a plan, get it acknowledged (by the user or by an eval gate), then execute. Skipping this is how agents end up doing twelve wrong things in a row.
Observability from day one. You cannot debug what you cannot see.
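The planning step and the observability point above can be sketched together. This is a minimal illustration, not Navi's actual code: the names `Step`, `Plan`, and `run_with_gate`, the `approve` callback (standing in for either the user or an eval gate), and the plain list used as a log are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    tool: str   # name of the tool to call (hypothetical registry key)
    args: dict  # keyword arguments for that tool


@dataclass
class Plan:
    goal: str
    steps: list[Step] = field(default_factory=list)


def run_with_gate(plan: Plan, tools: dict[str, Callable],
                  approve: Callable[[Plan], bool], log: list) -> bool:
    """Run a plan only after it is acknowledged; log every outcome."""
    # Anything longer than one tool call must be acknowledged first.
    if len(plan.steps) > 1 and not approve(plan):
        log.append(("rejected", plan.goal))
        return False
    for i, step in enumerate(plan.steps):
        try:
            result = tools[step.tool](**step.args)
        except Exception as exc:
            # Stop at the first failure instead of compounding errors.
            log.append(("failed", i, step.tool, str(exc)))
            return False
        log.append(("ok", i, step.tool, result))
    return True
```

The gate is deliberately a plain callback, so the same loop works whether the acknowledgment comes from a person clicking "approve" or from an automated check.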

Memory architecture: three layers, not one

We tried single-vector-store memory first. It broke immediately on anything that needed temporal reasoning ("what did I commit to on Tuesday's call") or factual stability ("what is my mother's birthday").

What works is three layers:

Episodic - timestamped log of events, decisions, conversations. Indexed by recency and entity.
Semantic - facts that survive ("I am allergic to peanuts," "my company is registered in Delaware"). Curated, deduplicated, versioned.
Procedural - patterns of behavior the agent has learned ("when the user says 'standup notes', they want this format").

Each layer has its own write policy and its own retrieval. Conflating them produces an agent that forgets your birthday but remembers a typo from six months ago.
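To make the separation concrete, here is a minimal in-memory sketch of the three layers, each with its own write and read path. It is an assumption-laden illustration, not Navi's storage design: the `Memory` class, its method names, and the use of plain lists and dicts in place of a real persistence layer are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Episode:
    at: datetime
    entities: set[str]
    text: str


class Memory:
    def __init__(self) -> None:
        self.episodic: list[Episode] = []          # append-only event log
        self.semantic: dict[str, list[str]] = {}   # key -> version history
        self.procedural: dict[str, str] = {}       # trigger -> learned behavior

    # Episodic: always append, never overwrite.
    def log_event(self, at: datetime, entities, text: str) -> None:
        self.episodic.append(Episode(at, set(entities), text))

    # Semantic: deduplicate, and keep old versions instead of replacing them.
    def set_fact(self, key: str, value: str) -> None:
        history = self.semantic.setdefault(key, [])
        if not history or history[-1] != value:
            history.append(value)

    def get_fact(self, key: str):
        history = self.semantic.get(key)
        return history[-1] if history else None

    # Procedural: the latest learned pattern wins.
    def learn(self, trigger: str, behavior: str) -> None:
        self.procedural[trigger] = behavior

    # Episodic retrieval: filter by entity and time, newest first.
    def recall_events(self, entity: str, since: datetime) -> list[Episode]:
        hits = [e for e in self.episodic
                if entity in e.entities and e.at >= since]
        return sorted(hits, key=lambda e: e.at, reverse=True)
```

A question like "what did I commit to on Tuesday's call" becomes a `recall_events` query over entity and time, while "what is my mother's birthday" becomes a `get_fact` lookup that never competes with event noise.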

Tools are 80% of the build

The model's job is small. It interprets and decides. Everything else is in the tool layer:

Schema and types for every tool, with strict validation in and out.
Idempotency for anything that touches the outside world.
Explicit error surfaces. "It didn't work" is not a useful agent error. "The Gmail API returned a 403, scope is missing" is.
Permissioned execution. Some tools are auto-run. Some require confirmation. Some require a typed phrase. This is product design as much as it is engineering.
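A rough sketch of what such a tool layer might look like follows. Everything here is hypothetical: `Tool`, `ToolRunner`, `ToolError`, and the error codes are invented for illustration, validation covers inputs only (output validation would follow the same shape), and the "typed phrase" permission level is left out for brevity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Permission(Enum):
    AUTO = "auto"        # runs without asking
    CONFIRM = "confirm"  # needs an explicit yes from the user


class ToolError(Exception):
    """A specific, machine-readable failure the model can act on."""
    def __init__(self, code: str, detail: str) -> None:
        super().__init__(f"{code}: {detail}")
        self.code, self.detail = code, detail


@dataclass
class Tool:
    name: str
    params: dict          # parameter name -> required Python type
    permission: Permission
    fn: Callable


class ToolRunner:
    def __init__(self, confirm: Callable[[str, dict], bool]) -> None:
        self.confirm = confirm
        self._done: dict = {}  # idempotency key -> earlier result

    def call(self, tool: Tool, args: dict, idempotency_key: str):
        # Idempotency: a repeated key returns the earlier result unchanged.
        if idempotency_key in self._done:
            return self._done[idempotency_key]
        # Strict input validation: exact parameter set, exact types.
        if set(args) != set(tool.params):
            raise ToolError("invalid_args",
                            f"expected {sorted(tool.params)}, got {sorted(args)}")
        for name, typ in tool.params.items():
            if not isinstance(args[name], typ):
                raise ToolError("invalid_type", f"{name} must be {typ.__name__}")
        # Permissioned execution: ask before running side-effecting tools.
        if tool.permission is Permission.CONFIRM and not self.confirm(tool.name, args):
            raise ToolError("not_confirmed", f"user declined {tool.name}")
        result = tool.fn(**args)
        self._done[idempotency_key] = result
        return result
```

The point of the error codes is that the model receives "invalid_type: to must be str" rather than a bare failure, which gives it something to correct on the next attempt.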

The honest measure of an agentic system is not "how impressive is the demo." It is "how many tools does the model have to choose from, and how often does it pick the right one." Navi has dozens. Picking the right tool is harder than picking the right answer.

Evals: write them first, ship to them, never trust the demo

The first thing we built in Navi after the first tool was an eval harness: inputs we cared about, expected outputs, ground truth where available, LLM-as-judge where not. We run it on every change.

We have shipped exactly zero agent features without an eval. Some of them were small (twelve test cases). Some were larger (three hundred). The point is that an agent that "feels good" in five demos will silently regress the moment you change a prompt, and you will not know it until a user notices. Evals are the only honest signal.
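The shape of a small harness like the one described above can be sketched in a few lines. This is an illustrative assumption, not Navi's harness: `Case`, `run_evals`, and the `judge` callback (which stands in for an LLM-as-judge call) are hypothetical names, and exact string equality is the simplest possible ground-truth check.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Case:
    prompt: str
    expected: Optional[str] = None  # ground truth, when one exists


def run_evals(agent: Callable[[str], str], cases: list[Case],
              judge: Callable[[str, str], bool]):
    """Return (pass_rate, failures) for an agent over a fixed case set."""
    if not cases:
        raise ValueError("an eval suite needs at least one case")
    failures = []
    for case in cases:
        output = agent(case.prompt)
        if case.expected is not None:
            passed = output == case.expected     # ground-truth comparison
        else:
            passed = judge(case.prompt, output)  # judge fallback
        if not passed:
            failures.append((case.prompt, output))
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate, failures
```

Run on every change, a drop in `pass_rate` between two commits is the regression signal that five good demos would have hidden.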

This is also what we ship into client builds, by default. If a client's agent does not have an eval suite at handover, we have not finished building it.

What we are still figuring out

Three things keep us up at night:

The cost curve. An agent that does useful work runs many tool calls, many model calls, many embeddings. Token bills are nontrivial. We are working through a layered model strategy (small fast models for routing, frontier models only on hard steps) and aggressive caching, but no one has nailed this yet.
The trust handoff. When the agent has high autonomy, the user stops paying attention. Then it fails silently and the user is surprised. We are experimenting with "narrate before acting" patterns where the agent says what it is going to do out loud, in advance, with a one-click cancel.
The interface metaphor. A chat box is a bad interface for an agent whose work runs for hours. The right metaphor is closer to a project tracker than a chat. We are still iterating.
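The layered model strategy mentioned under the cost curve could be sketched roughly as follows. This is a toy under stated assumptions, not a solved design: `LayeredModel` is a hypothetical name, `small` and `frontier` are any callables that map a prompt to a completion, `is_hard` stands in for whatever router decides that a step needs the expensive model, and the cache is a plain exact-match dict.

```python
from typing import Callable


class LayeredModel:
    def __init__(self, small: Callable[[str], str],
                 frontier: Callable[[str], str],
                 is_hard: Callable[[str], bool]) -> None:
        self.small, self.frontier, self.is_hard = small, frontier, is_hard
        self._cache: dict[str, str] = {}
        # Counters make the cost split visible per tier.
        self.calls = {"small": 0, "frontier": 0, "cache": 0}

    def complete(self, prompt: str) -> str:
        # Cache first: a repeated prompt costs nothing.
        if prompt in self._cache:
            self.calls["cache"] += 1
            return self._cache[prompt]
        # Route: the frontier model is used only when the step is judged hard.
        tier = "frontier" if self.is_hard(prompt) else "small"
        model = self.frontier if tier == "frontier" else self.small
        self.calls[tier] += 1
        answer = model(prompt)
        self._cache[prompt] = answer
        return answer
```

The `calls` counters are there because the routing rule is the part that needs tuning: without a per-tier count, there is no way to tell whether the router is actually keeping traffic off the expensive model.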

If you are building an agentic product and any of this lands, we should talk. Every client agent we ship is downstream of Navi, and the patterns above are the ones we charge for.