MONA AI: Building a WhatsApp Bot With Gemini, Stability AI, and Puppeteer
Ahad Nawaz · 3 min read
Architecture, prompt design, and the message queue that keeps a WhatsApp AI bot reliable. Why I picked Puppeteer over the official Cloud API, the rate-limiting strategy, and what broke at scale.
MONA AI is a WhatsApp assistant that answers questions, generates images, and routes complex requests to humans. It runs on Puppeteer, Gemini, Stability AI, and a BullMQ queue backed by Redis, alongside Postgres. Here is how it works and what I would do differently next time.
Why Puppeteer Instead of the WhatsApp Cloud API
The official Cloud API is the right answer for most teams: signed templates, business verification, predictable rates. I went with whatsapp-web.js on Puppeteer for one reason: time to first message. The client wanted to validate the idea in days, not weeks of business verification.
The trade-off was clear: Puppeteer means a real browser, real memory, and a real risk of session expiry. I mitigated those risks in three ways:
- Session persistence on disk. The browser session lives on a mounted volume so a restart does not re-pair the device.
- Heartbeat watchdog. A sidecar pings the browser every 30 seconds and restarts the worker if it loses connection.
- Idempotent message handling. WhatsApp can replay incoming messages on reconnect. Every message id is deduped at the queue.
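The third mitigation, deduplicating by message id, can be sketched without any queue library. This is a minimal in-memory illustration, not the bot's actual code: `SeenMessages`, its TTL, and the `now` parameter are hypothetical names, and the production bot dedupes at the queue rather than in process memory.

```typescript
// Hypothetical in-memory dedupe; the real bot dedupes at the queue.
class SeenMessages {
  // message id -> expiry timestamp in milliseconds
  private readonly seen = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns true the first time an id is seen within the TTL, false on replays.
  markIfNew(messageId: string, now: number = Date.now()): boolean {
    // Drop expired entries so the map stays bounded.
    this.seen.forEach((expiresAt, id) => {
      if (expiresAt <= now) this.seen.delete(id);
    });
    if (this.seen.has(messageId)) return false;
    this.seen.set(messageId, now + this.ttlMs);
    return true;
  }
}
```

With BullMQ, one way to get a similar effect is to pass the message id as the job's `jobId` option, since a job whose id already exists in the queue is not added a second time.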
The Message Loop
Every incoming message hits this loop:
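browser.on("message", async (msg) => {
  // The listener only enqueues. Conversation state is loaded later by the
  // queue worker, so nothing slow runs on the WhatsApp side.
  await queue.add("process-message", {
    messageId: msg.id._serialized, // also used to dedupe replayed messages
    chatId: msg.from,
    text: msg.body,
    type: classify(msg),
  });
});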
The Puppeteer worker only ingests. All the AI work happens in BullMQ workers backed by Redis. This separation matters: when Gemini is slow, the WhatsApp listener never blocks.
Classifying the Intent
Before calling an LLM, I run a cheap classifier on the message:
- Image request ("draw", "image of", "generate a picture") → Stability AI
- Question or chat (default fallback) → Gemini
- Operator handoff trigger ("talk to human", "speak with someone") → human queue
The classifier is a regex pass plus a small Gemini call when the regex is ambiguous. The cheap path catches 70% of messages without touching the LLM, which cuts costs meaningfully.
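The regex pass could look roughly like the sketch below. The patterns are built only from the trigger phrases listed above. The `Intent` type, the function name, and the rule that "ambiguous" means both patterns matched are assumptions for illustration; the real classifier may define ambiguity differently.

```typescript
type Intent = "image" | "handoff" | "chat" | "ambiguous";

// Hypothetical patterns derived from the trigger phrases in the list above.
const IMAGE_PATTERN = /\b(draw|image of|generate a picture)\b/i;
const HANDOFF_PATTERN = /\b(talk to (a )?human|speak with someone)\b/i;

// Cheap first pass. Only an "ambiguous" result would need the small LLM call.
function classifyByRegex(text: string): Intent {
  const wantsImage = IMAGE_PATTERN.test(text);
  const wantsHuman = HANDOFF_PATTERN.test(text);
  if (wantsImage && wantsHuman) return "ambiguous"; // conflicting signals (assumed rule)
  if (wantsHuman) return "handoff";
  if (wantsImage) return "image";
  return "chat"; // default fallback
}
```

Keeping the regex pass as a pure function makes it easy to unit test and to measure how many messages it resolves without an LLM call.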
Prompt Design That Stayed Stable
The system prompt has three sections, in this order, every time:
- Identity and tone. Who MONA is, how she speaks, what she will and will not do.
- Context. The last N messages of the conversation, plus any structured data the user has shared (their order id, location, etc).
- Task. The user's latest message, isolated and labeled clearly.
The order matters. Putting identity first anchors the model. Putting context next gives it grounding. Putting the user's input last keeps the model's attention on what to answer.
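A minimal sketch of assembling the three sections in that fixed order follows. The section markers, the `PromptParts` shape, and the `maxHistory` default are illustrative assumptions, not the bot's actual prompt format.

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  text: string;
}

interface PromptParts {
  identity: string;              // who MONA is and how she speaks
  history: ChatMessage[];        // full conversation, oldest first
  facts: Record<string, string>; // structured data such as an order id
  latestMessage: string;         // the message to answer
}

// Builds the prompt as identity, then context, then task, every time.
function buildPrompt(parts: PromptParts, maxHistory: number = 10): string {
  const recent = maxHistory > 0 ? parts.history.slice(-maxHistory) : [];
  const factLines = Object.keys(parts.facts).map((k) => `${k}: ${parts.facts[k]}`);
  const historyLines = recent.map((m) => `${m.role}: ${m.text}`);
  return [
    "[Identity and tone]",
    parts.identity,
    "[Context]",
    ...factLines,
    ...historyLines,
    "[Task]",
    `User message: ${parts.latestMessage}`,
  ].join("\n");
}
```

Because the order is fixed in one function, a change to any section cannot accidentally reorder the others.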
Image Generation Without Hammering Stability
Image requests are slow (4-8 seconds) and expensive. The queue gates them:
- Per-user rate limit: 5 images per 24 hours, enforced in Redis with a sliding window.
- Global concurrency: max 3 in-flight Stability calls. Excess requests wait in BullMQ.
- Result caching: identical prompts return the previously generated image for 24 hours.
The user feels nothing different. The bill goes down by 60%.
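The per-user sliding window can be illustrated in memory. This sketch only shows the logic: the class name and the injected `now` parameter are hypothetical, and the bot itself enforces the limit in Redis (a sorted set of timestamps per user is one common way to do that).

```typescript
// Hypothetical in-memory sliding-window limiter; the real one lives in Redis.
class SlidingWindowLimiter {
  // user id -> timestamps (ms) of accepted requests still inside the window
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the request if the user is under the limit.
  tryAcquire(userId: string, now: number = Date.now()): boolean {
    const windowStart = now - this.windowMs;
    const recent = (this.hits.get(userId) ?? []).filter((t) => t > windowStart);
    if (recent.length >= this.limit) {
      this.hits.set(userId, recent);
      return false;
    }
    recent.push(now);
    this.hits.set(userId, recent);
    return true;
  }
}

// 5 images per 24 hours, matching the limit described above.
const imageLimiter = new SlidingWindowLimiter(5, 24 * 60 * 60 * 1000);
```

Unlike a fixed daily counter, the window slides with each request, so a user cannot double up by sending requests on either side of a reset boundary.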
What Broke at Scale
Puppeteer Memory Leaks
After a few days the worker would consume 4GB of RAM. The fix: schedule a graceful restart every 6 hours. The watchdog handles the cutover so no messages are dropped.
Conversation Context Bloat
Naively appending every message to the context window made Gemini calls progressively more expensive. I now summarize conversations older than 20 messages into a single "memory" block and prepend that instead of the full history. Context size stays bounded.
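One reading of this fix is sketched below: keep the newest 20 messages verbatim and hand everything older to a summarizer. The function names and the `memory:` line format are assumptions, and the summarizer itself (an LLM call in practice) is left out so the split stays a pure function.

```typescript
interface Turn {
  role: "user" | "assistant";
  text: string;
}

interface SplitHistory {
  toSummarize: Turn[]; // older turns, to be condensed into one memory block
  recent: Turn[];      // newest turns, sent to the model verbatim
}

// Splits history so that at most `keepRecent` turns are sent in full.
function splitHistory(history: Turn[], keepRecent: number = 20): SplitHistory {
  if (history.length <= keepRecent) {
    return { toSummarize: [], recent: history.slice() };
  }
  const cut = history.length - keepRecent;
  return { toSummarize: history.slice(0, cut), recent: history.slice(cut) };
}

// Prepends the memory block, if any, to the verbatim recent turns.
function renderContext(memory: string | null, recent: Turn[]): string {
  const lines = recent.map((t) => `${t.role}: ${t.text}`);
  return memory === null ? lines.join("\n") : [`memory: ${memory}`, ...lines].join("\n");
}
```

The context passed to the model is then one summary plus at most 20 turns, regardless of how long the conversation runs.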
Token Bombs
Users would paste huge documents or message dumps. The pre-processor now truncates anything over 4,000 characters and asks the user to clarify what they want done with it. Token usage and latency both stabilized.
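A pre-processor along these lines might look like the following sketch. The 4,000-character threshold comes from the text above; the `Preprocessed` shape and the wording of the follow-up question are made up for illustration.

```typescript
const MAX_CHARS = 4000;

interface Preprocessed {
  text: string;            // the text passed on to the LLM
  truncated: boolean;      // whether the input exceeded the limit
  followUp: string | null; // clarifying question to send back, if any
}

// Truncates oversized input and attaches a clarifying question.
function preprocess(text: string, maxChars: number = MAX_CHARS): Preprocessed {
  if (text.length <= maxChars) {
    return { text, truncated: false, followUp: null };
  }
  return {
    // slice counts UTF-16 code units, so a cut can land inside an emoji
    text: text.slice(0, maxChars),
    truncated: true,
    followUp: "That was a long message. What would you like me to do with it?",
  };
}
```

Running this before the job reaches the LLM worker puts a ceiling on the token cost of any single message.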
What I Would Change
I would migrate to the WhatsApp Cloud API as soon as a project survives validation. Puppeteer is great for prototypes and hostile to operations. The Cloud API has rate limits but never randomly logs out.
I would also push more of the classifier into a small fine-tuned model. The regex catches the obvious cases, but a tiny supervised classifier would catch the ambiguous ones without paying Gemini latency.
The Stack
Node.js + TypeScript, BullMQ + Redis, Postgres, Puppeteer with whatsapp-web.js, Gemini for chat, Stability AI for image generation, Docker on a small VPS. Total monthly cost at the scale I ran it: under $40 including infra.
The lesson worth repeating: separate the listener from the worker. Anything that talks to WhatsApp should do nothing else. Anything that talks to an LLM should never block message intake. With that split, the rest is engineering discipline.