Apify + Claude = Research Agents That Don't Hallucinate

Key takeaway: AI research agents fail in production when they can't distinguish observed data from inferred guesses — the fix is engineering discipline: scrape first, declare every signal's source, and force the model to label its output as grounded or inferred.

The number one reason AI research agents fail in production isn't the model. It's that the model can't tell the difference between "I observed this" and "I guessed this from context." Ask a raw LLM to audit a website and it will confidently report Core Web Vitals it never measured, competitor rankings it never checked, and review counts it pulled from thin air. The fix is not a better model. The fix is engineering discipline: scrape first, declare every signal's source, and force the model to label its own output as grounded or inferred.

This is the Apify plus Claude pattern we run in production on our AI Business Analyst tool. It turns a category of agents known for hallucinating into a category of agents you can put in front of a paying client.

The three-layer rule

Good research agents have three layers. Bad ones collapse all three into a single LLM call.

Layer one is the scrape. Apify, cheerio plus axios, Playwright — pick your poison. The job is to collect a structured snapshot of observable reality: HTML, DOM nodes, script tags, detected frameworks, counted product links, JSON-LD, meta tags, headers.

Layer two is the measurement. Some signals cannot be read off the page markup at all. Core Web Vitals need Google PageSpeed Insights. Keyword volume needs an SEO API. Others exist only when the site publishes them: a review count can be reported only if aggregateRating JSON-LD is actually present on the page. Either you have the measurement or you don't. Be honest about it.
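A minimal sketch of the "either you have the measurement or you don't" rule, applied to review counts. It assumes the JSON-LD blocks have already been parsed into plain objects; the helper name reviewCountFrom and the returned shape are hypothetical, chosen only for illustration.

```javascript
// Hypothetical helper: report a review count only when aggregateRating is present.
function reviewCountFrom(schemaBlocks) {
  for (const block of schemaBlocks) {
    // JSON-LD may nest entities under @graph; treat both shapes the same way.
    const entities = block && Array.isArray(block['@graph']) ? block['@graph'] : [block];
    for (const entity of entities) {
      const rating = entity && entity.aggregateRating;
      if (!rating) continue;
      // schema.org allows reviewCount or ratingCount; values are often strings.
      const count = Number(rating.reviewCount ?? rating.ratingCount);
      if (Number.isFinite(count)) {
        return { value: count, source: 'aggregateRating JSON-LD' };
      }
    }
  }
  // No measurement available: say so instead of guessing.
  return { value: null, source: null };
}

// Sample inputs, invented for the example.
const withRating = [
  { '@type': 'Product', aggregateRating: { ratingValue: '4.6', reviewCount: '128' } },
];
const withoutRating = [{ '@type': 'Organization', name: 'Example Co' }];

console.log(reviewCountFrom(withRating));
console.log(reviewCountFrom(withoutRating));
```

The null result is the important branch: downstream code can turn it into "Not detectable" without the model ever being asked for a number.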

Layer three is the language model. Claude's job is not to invent new facts. Its job is to synthesise the scraped snapshot into sections of a report, label each field with its source, and return "Not detectable" when a signal is missing.

When the three layers stay separate, the agent is trustworthy. When an engineer skips layer one and asks the LLM to "just analyse the site," the report becomes fiction.
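As a rough sketch, the separation can be kept honest by passing each layer in as its own function. Everything named here (runAudit, scrape, measure, synthesise, and the stub return values) is hypothetical, and the stubs stand in for a real scraper, a real PageSpeed call, and a real Claude call.

```javascript
// Layer boundaries as three injected functions. All names here are hypothetical.
async function runAudit(url, { scrape, measure, synthesise }) {
  // Layer one: observable facts from the page.
  const snapshot = await scrape(url);
  // Layer two: external measurements; null means "we do not have it".
  const measurements = await measure(url);
  // Layer three: the model only receives the two objects above, never the URL.
  return synthesise({ snapshot, measurements });
}

// Stub layers so the sketch runs without network access or an API key.
const layers = {
  scrape: async () => ({ title: 'Example Store', h1Count: 1 }),
  measure: async () => ({ performanceScore: null }),
  synthesise: async ({ snapshot, measurements }) => ({
    title: { value: snapshot.title, source: 'scrape' },
    performanceScore: {
      value: measurements.performanceScore ?? 'Not detectable',
      source: measurements.performanceScore === null ? null : 'PageSpeed Insights',
    },
  }),
};

runAudit('https://example.com', layers).then((report) => console.log(report));
```

Because synthesise never receives the URL, it has nothing to "analyse" except what layers one and two actually collected.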

The grounded signals vs. the inferred ones

Our AI Business Analyst declares this split explicitly in the README, because clients ask the question and they deserve a straight answer. Here's the shape:

Core Web Vitals and performance score — Google PageSpeed Insights, live. High confidence.
Tech stack — scraped from the DOM and script tags. High confidence.
SEO checks — computed from the DOM: title, meta description, alt text, schema, H1. High confidence.
SKU count — counted from /products/* links. High confidence when detected.
Review count — aggregateRating JSON-LD, only reported when present. Otherwise: absent.
Blog velocity — we probe /blog and /news, read <time datetime> attributes. High when a blog exists.
E-E-A-T and GEO readiness — LLM-inferred, but only from signals in the snapshot. Medium confidence.
Lean Canvas — LLM-inferred from body copy and tech stack. Medium confidence.
Competitor cards — LLM-suggested, qualitative only, no scraped per-competitor data. Medium confidence.
Brand voice — LLM-inferred from body text style. Medium confidence.
50 ICE-scored actions — LLM-synthesised from every prior agent's output. High because the inputs are grounded.

Notice the competitor cards line: "LLM-suggested, qualitative only." That sentence is in our report, visible to the client, because the alternative is exactly the kind of AI slop that breaks trust: fabricating domain authority numbers, review counts, and rankings for five competitors we never scraped. Better to say what it is.
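One way to make that split enforceable is to declare it in code, so a field cannot reach the report without a source label. The sketch below is illustrative only: the SIGNALS registry, the label helper, and the sample values are hypothetical, with sources and confidence levels copied from the list above.

```javascript
// Hypothetical registry: each report field declares where it comes from.
const SIGNALS = {
  coreWebVitals: { source: 'Google PageSpeed Insights', kind: 'grounded', confidence: 'high' },
  techStack: { source: 'DOM and script tags', kind: 'grounded', confidence: 'high' },
  seoChecks: { source: 'DOM', kind: 'grounded', confidence: 'high' },
  brandVoice: { source: 'LLM, from body text style', kind: 'inferred', confidence: 'medium' },
  competitorCards: { source: 'LLM-suggested, qualitative only', kind: 'inferred', confidence: 'medium' },
};

// Attach the declared label to a value; refuse fields that have no declaration.
function label(field, value) {
  const meta = SIGNALS[field];
  if (!meta) throw new Error(`No declared source for field "${field}"`);
  const missing = value === null || value === undefined;
  return { field, value: missing ? 'Not detectable' : value, ...meta };
}

// Sample values, invented for the example.
console.log(label('techStack', ['Shopify', 'Klaviyo']));
console.log(label('coreWebVitals', null));
```

An undeclared field throws instead of silently passing through, which mirrors the rule that every claim needs a named source.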

Things we deliberately dropped

When we started the project, the brief included several items we could not ground with the tools we had: Share of Voice rank, per-competitor domain authority (DA) and review counts, a 2D positioning matrix, UGC volume, press-tier scoring, revenue-impact predictions, keyword rank claims. Every one of those would have been easy to ask Claude to produce. Every one would have been a guess.

We dropped them. The report is shorter than the original pitch. It's also the first AI audit tool we've shipped that doesn't invent numbers when a client fact-checks it. As the README puts it: the report is only as honest as the signals we can collect.

A worked example: the scraper

Our services/scraper.js uses cheerio and axios — not Apify, but the same pattern. A real Apify actor would replace the axios GET with an actor run, but the downstream contract is identical.

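// Simplified flow. safeParse and probe are minimal stand-ins for helpers not shown here.
const axios = require('axios');
const cheerio = require('cheerio');

// Parse a JSON-LD block defensively; malformed blocks become null and are skipped by .map()
const safeParse = (text) => { try { return JSON.parse(text); } catch { return null; } };

// Resolve to true when the URL answers with a 2xx status, false otherwise
const probe = (target) => axios.head(target).then(() => true).catch(() => false);

async function buildSnapshot(url) {
  // axios resolves to a response object; the HTML string is in .data
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);

  return {
    title: $('title').text() || null,
    metaDescription: $('meta[name="description"]').attr('content') || null,
    h1Count: $('h1').length,
    imagesWithoutAlt: $('img:not([alt])').length,
    schemaBlocks: $('script[type="application/ld+json"]')
      .map((i, el) => safeParse($(el).html()))
      .get(),
    scriptTags: $('script[src]').map((i, el) => $(el).attr('src')).get(),
    productLinks: $('a[href*="/products/"]').length,
    // Resolve /blog against the site origin so a trailing slash on url cannot break the path
    hasBlog: await probe(new URL('/blog', url).href),
  };
}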

Every field in that snapshot is a fact. Not a guess. The LLM never sees the raw HTML — it only sees the snapshot. That constraint alone kills half the hallucinations that plague naive "analyse this URL" prompts.
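Because the snapshot is plain data, checks such as the SEO ones can be computed deterministically before any model call. The sketch below assumes a snapshot with the fields shown above; the function name seoChecks and the specific pass/fail rules (for example, exactly one H1) are hypothetical and may differ from what a given audit applies.

```javascript
// Hypothetical pass/fail rules computed from the snapshot alone, with no LLM involved.
function seoChecks(snapshot) {
  return {
    hasTitle: Boolean(snapshot.title),
    hasMetaDescription: Boolean(snapshot.metaDescription),
    singleH1: snapshot.h1Count === 1,
    allImagesHaveAlt: snapshot.imagesWithoutAlt === 0,
    hasSchema: snapshot.schemaBlocks.length > 0,
  };
}

// Sample snapshot, invented for the example.
const snapshot = {
  title: 'Example Store',
  metaDescription: null,
  h1Count: 2,
  imagesWithoutAlt: 3,
  schemaBlocks: [],
};

console.log(seoChecks(snapshot));
```

The model can then be handed these booleans as inputs and asked to explain them, rather than being asked to decide them.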

A worked example: the prompt

The LLM call is where most teams ruin the pipeline. Here are the rules we enforce in every agent prompt: reference the snapshot explicitly, instruct the model to output "Not detectable" when a signal is missing, demand strict JSON output, and state the confidence level expected for each field.

Our assembler prompt ends with a block that looks like this:

You are synthesising a section of a business audit.
Inputs: scrapeSnapshot, pageSpeedReport.
Rules:
- If a required signal is missing from inputs, output "Not detectable" for that field.
- Do not invent numerical metrics.
- Every claim must be traceable to an input field.
- Return valid JSON matching the provided schema. No prose outside JSON.
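A prompt like the one above can be assembled from data rather than written by hand, so the inputs the model sees are exactly the ones the rules refer to. This is a sketch under assumptions: buildPrompt, the RULES array, and the schema format are hypothetical, and the actual call to Claude is left out.

```javascript
// Hypothetical prompt builder: rules first, then the inputs and schema as JSON.
const RULES = [
  'If a required signal is missing from inputs, output "Not detectable" for that field.',
  'Do not invent numerical metrics.',
  'Every claim must be traceable to an input field.',
  'Return valid JSON matching the provided schema. No prose outside JSON.',
];

function buildPrompt({ scrapeSnapshot, pageSpeedReport, schema }) {
  return [
    'You are synthesising a section of a business audit.',
    'Inputs: scrapeSnapshot, pageSpeedReport.',
    'Rules:',
    ...RULES.map((rule) => `- ${rule}`),
    '',
    // A missing input is serialised as null so the model can see that it is absent.
    `scrapeSnapshot: ${JSON.stringify(scrapeSnapshot ?? null)}`,
    `pageSpeedReport: ${JSON.stringify(pageSpeedReport ?? null)}`,
    `schema: ${JSON.stringify(schema ?? null)}`,
  ].join('\n');
}

// Sample inputs, invented for the example.
const prompt = buildPrompt({
  scrapeSnapshot: { title: 'Example Store', h1Count: 1 },
  pageSpeedReport: null,
  schema: { title: 'string', performanceScore: 'number or "Not detectable"' },
});

console.log(prompt);
```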

We run the response through utils/jsonCall.js which extracts valid JSON even if the model slips in a stray sentence. If the schema still fails, we fail the section loudly, not quietly — a missing section is better than a wrong one.
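The extract-then-validate step might look roughly like this. It is not the contents of utils/jsonCall.js: extractJson and assertSection are hypothetical stand-ins, and the brace-slicing approach is deliberately naive (it would misfire if the stray prose itself contained braces).

```javascript
// Hypothetical stand-in for a JSON extraction step; not the real utils/jsonCall.js.
function extractJson(text) {
  const start = text.indexOf('{');
  const end = text.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in model output');
  }
  // JSON.parse throws on malformed JSON, which is the loud failure we want.
  return JSON.parse(text.slice(start, end + 1));
}

// Fail loudly when a required field is absent; "Not detectable" still counts as present.
function assertSection(section, requiredFields) {
  const missing = requiredFields.filter((field) => !(field in section));
  if (missing.length > 0) {
    throw new Error(`Section failed schema: missing ${missing.join(', ')}`);
  }
  return section;
}

// Sample model output with a stray sentence in front, invented for the example.
const raw = 'Here is the section:\n{"title": "Example Store", "reviewCount": "Not detectable"}';
const section = assertSection(extractJson(raw), ['title', 'reviewCount']);

console.log(section);
```

Throwing here, instead of returning a partial object, is what lets the caller drop the section rather than publish a wrong one.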

Why Apify specifically

Apify is a good fit for this pattern for three reasons.

One, it already has actors for most of the hard targets — Google SERPs, LinkedIn, Instagram, review sites. You don't write the scraper; you wire the output.

Two, actors return structured JSON. That's the exact shape the LLM layer needs. No HTML parsing on the LLM side, ever.

Three, Apify handles proxies, retries, and captchas — the operational boring parts that eat engineering time. You focus on the signal-to-claim contract.

Replace Apify with a homegrown cheerio script if the target is a small company site. Use Apify when you need scale, anti-bot tolerance, or platforms that punish naive scrapers.
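For the Apify route, swapping the axios GET for an actor run could look like the sketch below. It assumes a client object that behaves like an ApifyClient from the apify-client package (actor(id).call(input), then dataset(id).listItems()); the function name scrapeWithActor, the actor ID, and the input are placeholders, and a fake client is used so the example runs offline.

```javascript
// Sketch: run an actor and read its results. `client` is expected to behave
// like an ApifyClient instance from the apify-client package.
async function scrapeWithActor(client, actorId, input) {
  // Start the actor and wait for the run to finish.
  const run = await client.actor(actorId).call(input);
  // Results are stored as JSON items in the run's default dataset.
  const { items } = await client.dataset(run.defaultDatasetId).listItems();
  return items;
}

// Fake client so the sketch runs without a token. A real one would be created
// with `new ApifyClient({ token })` and called with a real actor ID.
const fakeClient = {
  actor: () => ({ call: async () => ({ defaultDatasetId: 'dataset-1' }) }),
  dataset: () => ({
    listItems: async () => ({ items: [{ url: 'https://example.com', title: 'Example' }] }),
  }),
};

scrapeWithActor(fakeClient, 'example/actor', { startUrls: [{ url: 'https://example.com' }] })
  .then((items) => console.log(items));
```

The items array plays the same role as the cheerio snapshot: structured data the LLM layer can cite, with no HTML involved.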

Where this pattern pays off for clients

Anywhere the answer to a business question begins with "it depends on what's actually on the site." We've used it for competitive snapshots inside product strategy for B2B SaaS like Tough Commerce — where the client needed a defensible read on where three competitors actually stood — and for technical audits inside CleanTech work like Enurgen, where precision matters more than fluency. In both cases the value wasn't the LLM prose. The value was the client knowing which numbers came from a real page and which came from a model's best guess.

When you can point at a line in a report and say "that came from PageSpeed, that came from the DOM, that one the model suggested," the conversation stops being about AI and starts being about the business. Which is the only conversation worth having.

Takeaway

Research agents don't hallucinate because of the model. They hallucinate because the pipeline lets them. Put a scraper in front of the LLM. Label every field's source. Force "Not detectable" over invention. You'll ship agents you can actually sell.

If you want a grounded research pipeline built for your vertical, book a call.
