How to Structure Your Website for AI Crawlers
AI crawlers read HTML, not JavaScript. They need static content, clean heading hierarchy, and machine-readable structure. Here's how to build a website AI platforms can actually read.
AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) read HTML source, not rendered JavaScript. Websites that rely on client-side rendering are effectively invisible to AI search. Static HTML, clean heading hierarchy, answer capsule formatting, and proper robots.txt configuration are the foundations of AI-visible website structure.
The fundamental problem: JavaScript rendering
Most AI crawlers do not execute JavaScript. They read raw HTML source code. This creates a massive visibility gap:
| Rendering approach | AI crawler visibility | Common platforms |
|---|---|---|
| Static HTML / SSG | Full visibility | Astro, Hugo, Eleventy, Jekyll |
| Server-side rendered (SSR) | Full visibility | Next.js (SSR mode), Nuxt, Astro |
| Static export from SSR | Full visibility | Next.js (static export), Gatsby |
| Client-side rendered (CSR) | Minimal to zero | React SPA, Vue SPA, Angular SPA |
| Heavy JS WordPress themes | Partial — depends on theme | WordPress with Elementor, Divi, WPBakery |
If your website content only appears after JavaScript executes in the browser, AI crawlers cannot see it. This applies to GPTBot (ChatGPT), ClaudeBot (Claude), PerplexityBot (Perplexity), and most other AI crawlers. Static Site Generation (SSG) or Server-Side Rendering (SSR) are required for AI visibility.
How to test what AI crawlers see
- View page source (not inspect element) — this is what crawlers read
- Disable JavaScript in your browser and reload — this is what crawlers see
- Run curl https://yoursite.com/page in a terminal — this returns the raw HTML
- If your content disappears in any of these tests, AI crawlers cannot see it
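These checks can be scripted. Here is a minimal sketch in Python (the URLs and test phrase are placeholders): it fetches the raw HTML the way a non-rendering crawler would, with no JavaScript execution, and checks whether a phrase from your page is present in the source.

```python
import urllib.request

def fetch_raw_html(url: str, user_agent: str = "GPTBot") -> str:
    """Fetch the raw HTML source, as a non-rendering crawler would."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def content_visible(raw_html: str, phrase: str) -> bool:
    """True only if the phrase exists in the HTML source itself."""
    return phrase.lower() in raw_html.lower()

# A client-side-rendered page ships an empty shell, so its visible
# text is missing from the source; a static page contains it directly.
csr_shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
ssg_page = "<html><body><h1>Pricing</h1><p>Plans start at £500/month.</p></body></html>"

print(content_visible(csr_shell, "Plans start at £500/month"))  # False
print(content_visible(ssg_page, "Plans start at £500/month"))   # True
```

Run the same check against your own pages with `fetch_raw_html("https://yoursite.com/page")` and a phrase you know should appear on the page.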
Robots.txt configuration for AI crawlers
Your robots.txt file controls which AI crawlers can access your content. Many websites block AI crawlers without realising it — either through broad wildcard rules or security plugins.
# Allow all AI search crawlers
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
# Block sensitive directories from all bots
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
Common robots.txt mistakes
- Wildcard blocking: a User-agent: * entry with Disallow: / blocks everything, including AI crawlers
- Security plugin defaults: WordPress security plugins often block unknown user agents
- Forgetting OAI-SearchBot: GPTBot is OpenAI's training crawler, but OAI-SearchBot powers real-time ChatGPT search
- Blocking ClaudeBot: Some sites block ClaudeBot specifically — this keeps your content out of Claude's training data and its answers
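You can verify your rules before deploying with Python's standard-library robots.txt parser. A sketch using a shortened version of the example file above (the URLs are illustrative):

```python
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# GPTBot matches its own entry, so the wildcard /admin/ block
# does not apply to it; unlisted bots fall through to the * entry.
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))     # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/admin/"))  # False
```

Swapping in your live file (via rp.set_url and rp.read) lets you confirm each AI crawler's access without waiting for a visit in your server logs.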
See also: What should my robots.txt look like for AI search?
The answer capsule format
An answer capsule is a 40-60 word factual paragraph placed immediately after a heading. It contains a direct, specific answer to the question implied by the heading. AI platforms extract these capsules as citation-ready content. Pages using this format see significantly higher citation rates across ChatGPT, Gemini, and AI Overviews.
Answer capsule structure
- Placement: Immediately after the H2 or H3 heading
- Length: 40-60 words (concise enough for extraction)
- Content: Direct factual answer with specific data points
- Formatting: Bold the first sentence or the entire capsule
- CSS class: Use .answer-capsule for Speakable schema targeting
Example
After a heading "How much does AI search optimisation cost?", the answer capsule would be:
"AI search optimisation costs between £500-£5,000 per month from specialist agencies. The price depends on scope, competition, and the number of AI platforms targeted. Most UK agencies charge separately for audit, implementation, and ongoing monitoring."
Heading hierarchy for AI extraction
AI crawlers use heading hierarchy to understand content structure and extract relevant sections. Follow these rules:
| Rule | Why it matters |
|---|---|
| One H1 per page | Defines the primary topic for AI extraction |
| H2 for major sections | Each H2 should be independently answerable |
| H3 for subsections | Provides granular extraction targets |
| No skipped levels | Don't jump from H2 to H4 — breaks hierarchy logic |
| Question-format headings | Match user queries directly for citation matching |
| Answer capsule after each H2 | Gives AI a citation-ready extract per section |
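The "no skipped levels" and "one H1" rules above can be checked automatically. A sketch using Python's built-in HTML parser, which flags multiple (or missing) H1s and any jump of more than one heading level:

```python
from html.parser import HTMLParser

class HeadingAudit(HTMLParser):
    """Collect heading levels and report hierarchy problems."""
    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        # Match h1..h9 tags only (ignores <hr> etc.).
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self.levels.append(int(tag[1]))

    def problems(self):
        issues = []
        if self.levels.count(1) != 1:
            issues.append(f"expected exactly one h1, found {self.levels.count(1)}")
        for prev, cur in zip(self.levels, self.levels[1:]):
            if cur > prev + 1:
                issues.append(f"skipped level: h{prev} followed by h{cur}")
        return issues

audit = HeadingAudit()
audit.feed("<h1>Title</h1><h2>Section</h2><h4>Oops</h4>")
print(audit.problems())  # ['skipped level: h2 followed by h4']
```

Feed it the raw HTML of a page (for example, the output of the fetch test earlier in this guide) to audit a live URL.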
One idea per paragraph
AI models process content at the paragraph level. Long paragraphs that cover multiple ideas create extraction confusion. Keep paragraphs focused:
- One claim per paragraph — don't bundle multiple statistics or facts
- 2-4 sentences maximum — shorter is easier to extract
- Lead with the fact — put the key information in the first sentence
- Avoid transition fluff — "As we discussed earlier" adds nothing for AI crawlers
Content freshness signals
76.4% of ChatGPT-cited pages were updated within 30 days. Freshness is a real citation factor. Implement these:
- dateModified in schema — update this whenever you revise content
- Visible "Last updated" date on the page — AI crawlers read this
- Genuine content updates — don't just change the date, actually revise the content
- Regular content audits — review and update key pages at least monthly
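In schema terms, the freshness signal is the dateModified property on your Article markup. A minimal JSON-LD sketch generated in Python (the headline and publication date are placeholders):

```python
import json
from datetime import date

article_schema = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "How to Structure Your Website for AI Crawlers",
    "datePublished": "2026-01-15",  # placeholder: your original publish date
    # Update this field whenever the content genuinely changes,
    # alongside the visible "Last updated" date on the page.
    "dateModified": date.today().isoformat(),
}

print(json.dumps(article_schema, indent=2))
```

The resulting JSON goes in a script type="application/ld+json" tag in the page head, and the dateModified value should match the visible date shown to readers.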
llms.txt — the machine-readable index
llms.txt is an emerging standard that provides AI models with a machine-readable index of your most important content. Similar to how robots.txt tells crawlers what they can access, llms.txt tells AI models what they should prioritise. Place it at your domain root alongside robots.txt and sitemap.xml.
# Example llms.txt
# Your Company Name
# https://example.com
## About
> Brief description of your company and what you do.
## Key Pages
- [Homepage](https://example.com/)
- [About Us](https://example.com/about/)
- [Services](https://example.com/services/)
- [Contact](https://example.com/contact/)
## Expertise Areas
- [Topic Area 1](https://example.com/topic-1/)
- [Topic Area 2](https://example.com/topic-2/)
## FAQs
- [Common Questions](https://example.com/faq/)
IndexNow protocol
IndexNow notifies Bing (and therefore ChatGPT) immediately when you publish or update content. Without IndexNow, you're waiting for Bing to discover changes through normal crawling.
- Supported by: Bing, Yandex, Seznam, Naver
- Not supported by: Google (uses its own systems)
- Impact: Near-instant Bing indexation, which feeds ChatGPT and Copilot
- Implementation: API call or plugin (WordPress, Cloudflare Workers)
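Submission is a single POST to the IndexNow endpoint. A hedged sketch (the host, key, and URLs are placeholders; you generate the key yourself and host it as a text file at the keyLocation):

```python
import json
import urllib.request

def build_indexnow_payload(host, key, urls):
    """Assemble the JSON body the IndexNow endpoint expects."""
    return {
        "host": host,
        "key": key,
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": urls,
    }

def submit(payload):
    """POST the payload to the shared IndexNow endpoint."""
    req = urllib.request.Request(
        "https://api.indexnow.org/indexnow",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status  # 200/202 indicate the submission was accepted

payload = build_indexnow_payload(
    "example.com",
    "your-indexnow-key",
    ["https://example.com/updated-page/"],
)
# submit(payload)  # uncomment once you have a real key in place
```

Calling this from your publish or deploy hook means Bing (and therefore ChatGPT and Copilot) hears about updates within seconds rather than waiting for a recrawl.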
Bing Webmaster Tools submission
Since ChatGPT uses Bing's index, submitting your sitemap to Bing Webmaster Tools is essential. Many businesses only submit to Google Search Console and miss Bing entirely.
- Go to bing.com/webmasters
- Add your site and verify ownership
- Submit your XML sitemap
- Enable IndexNow for instant update notifications
- Monitor crawl errors and coverage
The Astro + Cloudflare advantage
Static site generators like Astro, combined with edge deployment on Cloudflare, create the ideal architecture for AI visibility:
- Pre-rendered HTML — every page is static, fully readable by all crawlers
- No JavaScript dependency — content exists in the HTML source
- Edge caching — fast response times from global CDN
- Markdown for Agents — Cloudflare's feature that serves clean markdown to AI crawlers
- Lighthouse scores 95+ — compared to WordPress average of 40-70
This site is built on Astro and deployed to Cloudflare — you can read about our methodology.
Technical checklist
| Item | Priority | Status check |
|---|---|---|
| Static HTML or SSR rendering | Critical | View source — is content visible? |
| Allow AI crawlers in robots.txt | Critical | Check for GPTBot, ClaudeBot, PerplexityBot |
| Submit sitemap to Bing | High | Bing Webmaster Tools dashboard |
| Implement IndexNow | High | Test with Bing URL Submission API |
| Answer capsules after headings | High | 40-60 word factual paragraphs |
| Clean heading hierarchy | High | H1 > H2 > H3, no skipped levels |
| One idea per paragraph | Medium | 2-4 sentences, lead with the fact |
| Schema markup | High | Google Rich Results Test |
| Create llms.txt | Medium | File at domain root |
| Content freshness dates | Medium | dateModified in schema + visible date |
Oliver Mackman
AI Search Analyst, SEOCompare
Oliver leads SEOCompare's editorial and comparison research. With over a decade in digital marketing, he oversees agency evaluation, tool testing, and AI search data analysis.
Last reviewed: 7 April 2026
Need help with AI search visibility?
Get a free AI visibility audit to see how your business appears across ChatGPT, Gemini, Perplexity, and AI Overviews.
Request your free audit