AI Search is a continually evolving space. LLMs keep improving, with new versions deployed regularly, and AI Overviews are already live in more than 200 countries and more than 40 languages. Google’s AI Mode has officially launched in the US, promising to expand what AI Overviews can do with more advanced reasoning, thinking, and multimodal capabilities to help with the toughest questions. Vector models are enabling semantic search to understand what is being asked, not just how it’s phrased (read our research on Semantic Relevance).
With all of this in mind, our AI research and development team is examining how AI Search is evolving, and how we can help Lumar users understand the performance of their site and content and optimize both for the best results—broadening from traditional Search Engine Optimization (SEO) to Generative Engine Optimization (GEO).
The good news is that GEO builds on a lot of the foundational parts of SEO, like crawling and indexing. As John Mueller said at Search Central Live NYC in March 2025:
“All of the work that you all have been putting in to make it easier for search engines to crawl and index your content, all of that will remain relevant.”
Lumar has several reports that will help you with GEO, which we’ve outlined below (along with a few that are coming soon). And as I said above, our team is developing additional reporting and analysis (including the dedicated GEO section described below), so watch this space!
View our recent webinar on technical SEO in the age of AI search
Crawlability Reports
These Lumar reports identify whether AI bots are blocked from crawling your pages. You can use these reports to verify that valuable content is not blocked from AI (see the robots.txt check sketched after this list):
- ChatGPT Blocked. Pages with a 200 response that are blocked in robots.txt for the GPTBot or ChatGPT-User user-agent tokens. This prevents ChatGPT from training on or referencing your content in responses.
- Google AI Blocked. Pages with a 200 response that are blocked in robots.txt for the Google-Extended user-agent token. This blocks Google’s AI systems (Bard/SGE) from accessing content for generative responses.
- Google SGE Blocked. Pages with a 200 response that are specifically blocked in robots.txt to prevent Google Search Generative Experience from using your content.
- Bing AI Blocked. Pages with a 200 response that are blocked in robots.txt for the Bingbot or MSNBot user-agent tokens. This prevents Bing AI/Copilot from accessing and citing your content.
- Common Crawl Blocked. Pages with a 200 response where the URL is blocked in robots.txt for Common Crawl’s CCBot user-agent token. Common Crawl data feeds many AI training databases, so blocking limits AI model training exposure.
- AI Bot Blocked (coming soon). Pages that are blocked to AI bots so they cannot access content.
- Perplexity Blocked (coming soon). Pages blocked to Perplexity AI search engine, preventing content inclusion in AI responses.
- Pages with Meta Nosnippet (coming soon). These pages prevent AI systems from using page content in generated snippets and responses.
- Pages with HTML data-nosnippet (coming soon). Pages with an HTML-level directive that blocks specific content sections from AI snippet generation.
- Pages with Header Nosnippet (coming soon). Pages with an HTTP header directive (nosnippet via X-Robots-Tag) that prevents AI systems from creating snippets from page content.
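If you want to spot-check this kind of blocking yourself, the standard library’s robots.txt parser makes it straightforward. Here’s a minimal sketch, assuming Python’s built-in `urllib.robotparser` and a placeholder site URL—the user-agent tokens are the ones named in the reports above:

```python
# Minimal sketch: check whether common AI user-agent tokens may fetch a
# page, per the site's robots.txt. example.com URLs are placeholders.
from urllib import robotparser

AI_USER_AGENTS = ["GPTBot", "ChatGPT-User", "Google-Extended", "CCBot", "Bingbot"]

def check_ai_access(robots_url: str, page_url: str) -> dict[str, bool]:
    """Return whether each AI user-agent token may fetch page_url."""
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # fetches and parses the live robots.txt
    return {ua: rp.can_fetch(ua, page_url) for ua in AI_USER_AGENTS}

if __name__ == "__main__":
    results = check_ai_access("https://www.example.com/robots.txt",
                              "https://www.example.com/some-page/")
    for ua, allowed in results.items():
        print(f"{ua}: {'allowed' if allowed else 'BLOCKED'}")
```

Lumar’s reports effectively run this kind of check at scale across a full crawl, combined with each page’s response status.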
Availability Reports
These Lumar reports identify whether AI bots can access your pages. If URLs return 4xx or 5xx errors, are blocked by robots.txt, or are trapped in redirect loops, AI systems won’t be able to crawl them, making discovery impossible.
Use the following Lumar reports to identify issues and make sure all your content is accessible to AI bots (a minimal availability check is sketched after the list):
- Broken Pages (4xx Errors). URLs that return a 400, 404, or 410 status code indicate a page could not be returned by the server because it doesn’t exist. AI bots cannot access or index broken pages, so this is a fundamental availability issue.
- 5xx Errors. URLs that return any HTTP status code in the range 500 to 599 indicate a page is temporarily unavailable and may be removed from the search engine’s index. Server errors prevent AI bots from crawling content.
- Failed URLs. URLs that did not return a response within Lumar’s timeout period. This may indicate a temporary problem caused by poor server performance, or a permanent issue. Failed URLs cannot be processed by AI systems for content understanding.
- Redirect Loops. URLs with redirect chains that redirect back to themselves, creating an indefinite redirection loop, which prevents AI bots from reaching actual content.
- Redirect Chains. URLs redirecting to another URL that is also a redirect, resulting in a redirect chain. Long redirect chains can cause AI bots to abandon crawling before reaching content, and even minor redirect issues (3xx responses) can reduce AI bot efficiency.
- JavaScript Redirects. Pages that redirect to another URL using JavaScript. AI bots that don’t execute JavaScript may never reach the destination content.
- Internal Redirects in Web Crawl. URLs that were found in the web crawl, that redirect to another URL with a hostname that is considered internal, based on the domain scope configured in the project settings.
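To illustrate what these reports check for, here’s a minimal per-URL availability sketch. It assumes the third-party `requests` library and an illustrative redirect-chain threshold (not Lumar’s actual limit):

```python
# Minimal sketch: flag 4xx/5xx responses, timeouts, and long redirect
# chains for a single URL. The threshold below is illustrative only.
import requests

MAX_REDIRECTS_BEFORE_WARN = 3  # assumption, not Lumar's configured limit

def check_availability(url: str, timeout: float = 10.0) -> str:
    try:
        resp = requests.get(url, timeout=timeout, allow_redirects=True)
    except requests.TooManyRedirects:
        return "redirect loop or excessive redirect chain"
    except requests.Timeout:
        return "failed: no response within timeout"
    except requests.RequestException as exc:
        return f"failed: {exc}"

    chain = len(resp.history)  # one entry per redirect hop
    if 400 <= resp.status_code < 500:
        return f"broken page ({resp.status_code})"
    if resp.status_code >= 500:
        return f"server error ({resp.status_code})"
    if chain > MAX_REDIRECTS_BEFORE_WARN:
        return f"ok, but {chain} redirects before content"
    return f"ok ({resp.status_code})"

print(check_availability("https://www.example.com/"))
```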
Renderability Reports
AI systems may rely on rendered (JavaScript-executed) content. If important elements only appear post-render, but bots can’t access them, your content risks being invisible to AI. The following reports help you identify potential issues so they can be addressed (a raw-vs-rendered comparison is sketched after the list):
- Rendered Link Count Mismatch. Pages with a difference between the number of links found in the rendered DOM and the raw HTML source. AI bots need a consistent link structure between raw and rendered content for proper understanding.
- Rendered Canonical Link Mismatch. Pages with a canonical tag URL in the rendered HTML that does not match the canonical tag URL found in the static HTML. Inconsistencies can confuse AI systems about which content version to reference.
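As a rough illustration of how a rendered-vs-raw comparison works, the sketch below fetches the raw HTML with `requests` and the rendered DOM with the third-party `playwright` package (run `playwright install chromium` first), then counts `<a href>` links in each. The tooling is an assumption—Lumar’s own rendering pipeline differs:

```python
# Minimal sketch: compare link counts in raw HTML vs the rendered DOM.
from html.parser import HTMLParser
import requests
from playwright.sync_api import sync_playwright

class LinkCounter(HTMLParser):
    """Counts <a> tags that carry an href attribute."""
    def __init__(self):
        super().__init__()
        self.count = 0
    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.count += 1

def count_links(html: str) -> int:
    counter = LinkCounter()
    counter.feed(html)
    return counter.count

def link_mismatch(url: str) -> tuple[int, int]:
    raw_html = requests.get(url, timeout=10).text
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # let JS finish loading
        rendered_html = page.content()
        browser.close()
    return count_links(raw_html), count_links(rendered_html)

raw, rendered = link_mismatch("https://www.example.com/")
print(f"raw: {raw} links, rendered: {rendered} links, diff: {rendered - raw}")
```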
Indexability Reports
Even if a page is crawlable, it may not appear in AI results if it’s not indexed or lacks visibility in search. The following Lumar reports identify pages that are ignored by search engines or missing from SERPs. Use these to prioritize fixes that improve discoverability (a simple directive check is sketched after the list):
- Non-Indexable Pages. Pages that return a 200 status but are prevented from being indexed via a noindex directive or a canonical tag URL that doesn’t match the page’s URL. Non-indexable pages cannot be included in AI training data or search responses.
- Canonicalized Pages. Pages with a URL that does not match the canonical URL found in the HTML canonical tag or X-Robots-Tag response header. Canonical tags help AI systems understand preferred content versions.
- Noindex Pages. Pages that cannot be indexed because they contain a noindex directive in the robots meta tag or an X-Robots-Tag response header, preventing AI systems from including pages in responses.
- Disallowed Pages. All URLs included in the crawl that were disallowed in the robots.txt, blocking AI bot access to content.
- unavailable_after Scheduled Pages. Pages with a 200 status code and an unavailable_after robots directive in a meta tag or response header, where the date specified is in the future. Future unavailability may impact long-term AI content freshness.
- unavailable_after Non-Indexable Pages. Pages with a 200 status code and an unavailable_after robots directive in a meta tag or response header, where the date specified is in the past. Expired content should not be referenced by AI systems in current responses.
- Pages with AI Bot Hits (coming soon). Shows URLs that have been successfully accessed by AI bots.
- Pages without AI Bot Hits (coming soon). Shows URLs that have not been accessed by AI bots.
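For a sense of what these reports examine on a single page, here’s a minimal sketch assuming the third-party `requests` library; the regex parsing is deliberately simplified and assumes common attribute ordering:

```python
# Minimal sketch: check a page for noindex directives and a mismatched
# canonical. Real HTML parsing should be more robust than these regexes.
import re
import requests

def check_indexability(url: str) -> list[str]:
    issues = []
    resp = requests.get(url, timeout=10)
    # X-Robots-Tag response header, e.g. "noindex, nofollow"
    header = resp.headers.get("X-Robots-Tag", "")
    if "noindex" in header.lower():
        issues.append("noindex in X-Robots-Tag header")
    # robots meta tag in the HTML (assumes name= before content=)
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        resp.text, re.IGNORECASE)
    if meta and "noindex" in meta.group(1).lower():
        issues.append("noindex in robots meta tag")
    # canonical tag pointing at a different URL
    canonical = re.search(
        r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']*)["\']',
        resp.text, re.IGNORECASE)
    if canonical and canonical.group(1).rstrip("/") != url.rstrip("/"):
        issues.append(f"canonicalized to {canonical.group(1)}")
    return issues or ["indexable (no blocking directives found)"]

print(check_indexability("https://www.example.com/"))
```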
Page Content Reports
Page content is crucial for AI systems to understand topics and context, so issues with elements like titles and descriptions can prevent AI systems from understanding, and therefore using, your content. The following reports help you identify content issues and make relevant improvements (the default thresholds are shown in code after the list):
- Missing Titles. Indexable pages with a blank or missing HTML title tag. Titles are crucial for AI systems to understand page topics and context.
- Short Titles. Indexable pages with a title tag of less than 10 characters*. Short titles may not provide enough context for AI content understanding.
- Missing Descriptions. Indexable pages without a description tag. Descriptions help AI systems understand page content and generate relevant responses.
- Short Descriptions. Pages with a description tag less than 50 characters*. Brief descriptions may lack sufficient context for AI content comprehension.
- Thin Pages. Indexable pages with content of less than 3,072 bytes* (thin page threshold), but more than 512 bytes* (empty page threshold). Thin content provides limited value for AI training and response generation.
- Max Content Size. Pages with content of more than 51,200 bytes*. Very large content may be truncated by AI systems with token limits.
- Max HTML Size. Pages that exceed 204,800 bytes*. Large HTML files may impact AI bot crawling efficiency.
- Missing H1 Tags. Pages without any H1 tags. H1 tags help AI systems understand page hierarchy and main topics.
- Rendered Word Count Mismatch. Pages with a word count difference between the static HTML and the rendered HTML. Content discrepancies between raw and rendered versions affect AI content understanding.
*These values can be changed in Lumar’s Advanced Settings if required.
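To make those thresholds concrete, here’s a minimal sketch that applies the default values above to a page’s title, description, and body. The tag-stripping is deliberately simplified, and exactly how Lumar measures content size may differ in detail:

```python
# Minimal sketch: apply the default report thresholds (the values
# marked * above, configurable in Lumar's Advanced Settings).
import re

THIN_PAGE_BYTES = 3_072     # thin page threshold
EMPTY_PAGE_BYTES = 512      # empty page threshold
MAX_CONTENT_BYTES = 51_200
MAX_HTML_BYTES = 204_800
MIN_TITLE_CHARS = 10
MIN_DESCRIPTION_CHARS = 50

def content_issues(html: str, title: str, description: str) -> list[str]:
    issues = []
    # crude tag-strip to approximate visible content size
    content_bytes = len(re.sub(r"<[^>]+>", "", html).encode("utf-8"))
    html_bytes = len(html.encode("utf-8"))
    if not title:
        issues.append("missing title")
    elif len(title) < MIN_TITLE_CHARS:
        issues.append("short title")
    if not description:
        issues.append("missing description")
    elif len(description) < MIN_DESCRIPTION_CHARS:
        issues.append("short description")
    if content_bytes <= EMPTY_PAGE_BYTES:
        issues.append("empty page")
    elif content_bytes < THIN_PAGE_BYTES:
        issues.append("thin page")
    if content_bytes > MAX_CONTENT_BYTES:
        issues.append("max content size exceeded")
    if html_bytes > MAX_HTML_BYTES:
        issues.append("max HTML size exceeded")
    if "<h1" not in html.lower():
        issues.append("missing H1")
    return issues

print(content_issues("<html><h1>Hi</h1>ok</html>", "Hi", ""))
```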
Structured Data Reports
As mentioned above, AI search is a constantly evolving space, and it’s not yet clear how much impact structured data (or schema markup) has on LLMs. AI systems may use schema markup (e.g. FAQ, Product, How To) to understand context and relationships. The following Lumar reports expose missing or invalid markup and highlight high-value structured content already in place, so they may also help you optimize content for AI search (a JSON-LD extraction sketch follows the list):
- Pages with Schema Markup. All pages included in the crawl that have schema markup found in either JSON-LD or Microdata. Highlights any schema.org markup on a page.
- Pages without Structured Data. All pages included in the crawl that do not have schema markup.
- Product Structured Data Pages. All pages in the crawl that were found to have product structured data markup.
- Valid Product Structured Data Pages. All pages with valid product structured data based on Google Search Developer documentation.
- Invalid Product Structured Data Pages. All pages with invalid product structured data based on Google Search Developer documentation.
- Event Structured Data Pages. All pages in the crawl that were found to have event structured data markup.
- News Article Structured Data Pages. All pages in the crawl that were found to have news article structured data markup.
- Valid News Article Structured Data Pages. All pages with valid news article structured data based on Google Search Developer documentation.
- Invalid News Article Structured Data Pages. All pages with invalid news article data based on Google Search Developer documentation.
- Breadcrumb Structured Data Pages. All pages in the crawl that were found to have breadcrumb structured data markup.
- FAQ Structured Data Pages. All pages in the crawl that were found to have FAQ structured data markup.
- How To Structured Data Pages. All pages in the crawl that were found to have How To structured data markup.
- Recipe Structured Data Pages. All pages in the crawl that were found to have recipe structured data markup.
- Video Structured Data Pages. All pages in the crawl that were found to have video structured data markup.
- QA Structured Data Pages. All pages in the crawl that were found to have QA structured data markup.
- Review Structured Data Pages. All pages in the crawl that were found to have review structured data markup.
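As a rough illustration of how schema markup is detected, the sketch below pulls the schema.org types from a page’s JSON-LD blocks (one of the two formats the reports cover; Microdata would need separate handling). It assumes the third-party `requests` library and simplified regex extraction:

```python
# Minimal sketch: list schema.org @type values found in JSON-LD blocks.
import json
import re
import requests

def jsonld_types(url: str) -> list[str]:
    html = requests.get(url, timeout=10).text
    blocks = re.findall(
        r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
        html, re.IGNORECASE | re.DOTALL)
    types = []
    for block in blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            types.append("INVALID JSON-LD")  # would surface in validity reports
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            t = item.get("@type")
            if t:
                types.extend(t if isinstance(t, list) else [t])
    return types

print(jsonld_types("https://www.example.com/"))  # e.g. ['Product', 'BreadcrumbList']
```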
Bot Behavior & Crawl Budget Reports
Insights into bot behavior show how often AI bots hit your pages and which pages are ignored. Pages with no hits or low hit frequency may be under-prioritized by AI crawlers, especially if they aren’t in sitemaps or are disallowed. The following reports help you assess crawl efficiency and make improvements (a log-parsing sketch follows the list):
- AI Discoverability Reporting. Understand the proportion of requests coming from Google vs AI bots.
- AI Bot Breakdown Reporting. Understand the breakdown of which AI bots are hitting the site most often.
- AI Bot Requests. See when AI bots are hitting the site to understand spikes and dips in requests.
- Top Pages By Bot. Discover and sort by which pages are getting the most requests from AI bots, to understand if the pages you want to be discovered are being discovered.
- AI Bot Hits by Response Code. Understand whether the pages being found by AI bots are actually available. Logs are grouped by request over time so you can spot trends in which pages AI bots are hitting.
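These reports are built on server log data. For a sense of the underlying analysis, here’s a minimal sketch over a combined-format access log—the log format and user-agent substrings are assumptions, and real bot identification (e.g. verifying source IPs) is more involved:

```python
# Minimal sketch: count AI bot hits by bot, response code, and page
# from a combined-format access log. Bot tokens are illustrative.
import re
from collections import Counter

AI_BOT_TOKENS = ["GPTBot", "ChatGPT-User", "Google-Extended",
                 "CCBot", "PerplexityBot", "ClaudeBot", "Bingbot"]

# combined format: ip - - [time] "METHOD path HTTP/x" status size "ref" "ua"
LOG_LINE = re.compile(
    r'"\w+ (?P<path>\S+) HTTP/[^"]*" (?P<status>\d{3}) .*"(?P<ua>[^"]*)"$')

def ai_bot_stats(log_path: str):
    hits_by_bot, hits_by_status, top_pages = Counter(), Counter(), Counter()
    with open(log_path) as fh:
        for line in fh:
            m = LOG_LINE.search(line)
            if not m:
                continue
            bot = next((b for b in AI_BOT_TOKENS if b in m["ua"]), None)
            if bot:
                hits_by_bot[bot] += 1
                hits_by_status[m["status"]] += 1
                top_pages[m["path"]] += 1
    return hits_by_bot, hits_by_status, top_pages.most_common(10)
```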
Experience
Duplicate content can confuse AI bots—they struggle to determine which version is the most relevant and authoritative, and may either choose a less desirable version or fail to include any of them effectively. This section highlights duplication issues that should be addressed to ensure efficient crawling of your site by AI systems (a hash-based duplicate check is sketched after the list).
- Duplicate Pages. This report identifies pages that are exact duplicates of other pages, which harms the perception of content quality and expertise and is a strong negative signal to AI systems.
- Duplicate Page Sets. This report groups pages with identical or near-identical content into sets, highlighting issues of content duplication. AI systems may struggle to choose which page to rank or reference, potentially splitting the value between multiple URLs instead of consolidating it.
- Duplicate Title Sets. This report groups pages that have exactly the same title tag, which can suggest duplicated or low-effort content. This can signal to AI that your content may not be unique or tailored to a specific topic.
- Duplicate Description Sets. This report groups pages that share the exact same meta description, which can provide generic, less useful context to AI systems analyzing your content.
- Duplicate Body Sets. This report groups pages that have the same or very similar body content, which is a strong indicator of content duplication issues, and may mean AI systems will struggle to identify the original, authoritative source.
- Pages with Duplicate Body. This report identifies individual pages that have identical body content to another page on the site, which can confuse AI systems.
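Exact-duplicate detection can be illustrated with content hashing: normalize each page’s text and group pages that share a digest. A minimal sketch—Lumar’s near-duplicate detection (for “very similar” content) is necessarily more sophisticated than this exact-match approach:

```python
# Minimal sketch: group URLs whose normalized body text is identical.
import hashlib
import re
from collections import defaultdict

def normalize(html: str) -> str:
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag-strip
    return re.sub(r"\s+", " ", text).strip().lower()

def duplicate_sets(pages: dict[str, str]) -> list[list[str]]:
    """pages maps URL -> HTML; returns groups of URLs with identical bodies."""
    groups = defaultdict(list)
    for url, html in pages.items():
        digest = hashlib.sha256(normalize(html).encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]

pages = {"/a": "<p>Same text</p>", "/b": "<p>Same  text</p>", "/c": "<p>Other</p>"}
print(duplicate_sets(pages))  # [['/a', '/b']]
```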
Authority Reports
AI systems prioritize content from sources that are perceived as knowledgeable, leading to better visibility in AI search. Backlinks are a foundational signal of authority and trust, so pages with backlinks are seen by AI systems as more important and credible. This section includes reports that identify backlink issues, as well as pages with a good number of backlinks that can be used to spread authority through your site.
- Redirecting URLs with Backlinks. Pages that have backlinks but redirect to another URL introduce an unnecessary step for both users and AI crawlers, slightly diminishing the value of the backlink.
- Error Pages with Backlinks. Pages that return an error (4xx or 5xx) but have valuable backlinks pointing to them signal to AI systems that your site is unreliable, damaging trust.
- Disallowed URLs with Backlinks. Pages that have backlinks pointing to them but are disallowed from crawling in the robots.txt file can’t be accessed by AI systems, which means they cannot understand why the page is being linked to.
- Pages with Meta Nofollow and Backlinks. Pages with incoming backlinks but also a ‘nofollow’ meta directive effectively create a dead end for authority flow, limiting the benefit of your backlink profile.
- Indexable Pages with Backlinks. Pages that are both indexable and have backlinks represent your most powerful pages for passing authority. These pages should be fully optimized and strategically linked to other important content on your site.
- Pages with Backlinks but No Links Out. This report identifies pages that receive authority via backlinks, but do not link out to any other pages. This can sometimes be interpreted as a ‘hoarding’ of authority and may not represent a natural, helpful linking pattern. You should review these pages and find relevant opportunities to add links to other important pages, to help spread the authority throughout your site.
- Non-indexable Pages with Backlinks. Pages with backlinks but marked as non-indexable represent a significant missed opportunity to leverage external validation and boost your site’s overall authority.
- Pages with Backlinks. This report shows all the pages on your site that have at least one external backlink, which is a key factor in how both traditional search and AI systems rank and reference content.
Trustworthiness
AI systems also prioritize content that is perceived as trustworthy. This section highlights issues that can affect how trustworthy your site and content appear to AI systems (a mixed-content check is sketched after the list).
- Empty Pages. Pages with no discernible body content are useless to AI systems. These pages should be removed, and you should ensure no internal links point to them. If they’re the result of a bug, it should be investigated and fixed.
- Broken Images. Pages where images fail to load can signal a quality issue that can detract from perceived experience and authority.
- HTTPS Pages. Pages that are correctly served over a secure HTTPS connection are prioritized by AI systems. This report helps confirm your SSL/TLS implementation.
- Mixed Content. This report identifies secure HTTPS pages that are attempting to load insecure resources (like scripts or images) over HTTP. AI systems will view the page as not fully secure, harming its perceived trustworthiness.
- All Broken Links. Pages that link to broken destinations waste crawl budget, preventing AI bots from efficiently discovering your working content, which may also damage the trustworthiness of your site.
- Unique Broken Links. Broken destination URLs prevent AI systems from discovering content, and may harm the trustworthiness of your site.
- External Redirects. Links on your site that point to an external URL that then redirects to another external URL can lead to broken links if the redirect chain breaks, signalling a poorly managed site.
- Fast Fetch Time (<1sec). Fast-loading pages allow AI bots to crawl your site more efficiently, covering more pages in the same amount of time. You can use these pages as a benchmark for performance optimization efforts across the rest of your site.
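Mixed content is one of the easier issues here to check by hand. A minimal sketch, assuming the third-party `requests` library and simplified regex parsing that only looks at `src` attributes (which load resources directly):

```python
# Minimal sketch: find insecure http:// resources embedded in an
# https page (scripts, images, iframes loaded via src attributes).
import re
import requests

def find_mixed_content(url: str) -> list[str]:
    if not url.startswith("https://"):
        return []  # mixed content only applies to secure pages
    html = requests.get(url, timeout=10).text
    return re.findall(r'src=["\'](http://[^"\']+)["\']', html, re.IGNORECASE)

for resource in find_mixed_content("https://www.example.com/"):
    print("insecure resource:", resource)
```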
Inclusion
These reports will bring additional data in via our Google Analytics connector so you can understand which AI systems are referencing your content (a referrer-classification sketch follows the metric lists below).
- Pages with AI Referral Sessions. Content that has been referenced by AI systems.
- Pages without AI Referral Sessions. Indicates content that is not being discovered or referenced by AI systems.
In addition, the following metrics will be available to add to reports (coming soon):
- ChatGPT Bot Hits
- Anthropic Bot Hits
- Google Extended Bot Hits
- Bingbot Bot Hits
- PerplexityBot Bot Hits
- ChatGPT Referral Sessions
- Perplexity Referral Sessions
- Claude Referral Sessions
- Gemini Referral Sessions
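Referral-session reporting like this typically works by mapping referrer hostnames to AI products. A minimal sketch—the hostname list is an assumption based on these products’ public domains, not Lumar’s actual mapping:

```python
# Minimal sketch: classify a referral by AI source from its referrer URL.
from urllib.parse import urlparse

AI_REFERRERS = {  # illustrative hostname -> source mapping
    "chat.openai.com": "ChatGPT",
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "www.perplexity.ai": "Perplexity",
    "claude.ai": "Claude",
    "gemini.google.com": "Gemini",
}

def classify_referrer(referrer_url: str):
    """Return the AI source name, or None for non-AI referrers."""
    host = urlparse(referrer_url).hostname or ""
    return AI_REFERRERS.get(host)

print(classify_referrer("https://chatgpt.com/"))    # ChatGPT
print(classify_referrer("https://www.google.com/")) # None (not AI)
```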
Relevancy Reports (coming soon)
Vector models enable semantic search to understand what is being asked, not just how it’s phrased. These reports will identify content with relevance issues so it can be optimized to improve performance in AI search (an embedding-based scoring sketch follows the list):
- Pages with Search Queries (coming soon). A positive signal indicating that the content matches user search intent.
- Pages with Low / Medium / High Query Relevance (coming soon). Indicating how well queries match content. Poor query matching reduces the likelihood of AI systems referencing content.
- Search Queries with Landing Pages (coming soon). Showing content that successfully matches search intent.
- Search Queries (coming soon). Indicating search demand for content topics.
- Pages with Low / Medium / High Title Relevance (coming soon). Poor relevance may impact AI understanding of the page’s main topic.
- Pages with Low / Medium / High H1 Relevance (coming soon). Poor relevance may impact AI understanding of the page’s main topic.
- Pages with Low / Medium / High Description Relevance (coming soon). Poor relevance may impact AI content summarization quality.
- Pages with Low / Medium / High Content Relevance (coming soon). Poor content relevance reduces the likelihood of AI system content usage.
- Search Queries with Poorly / Moderately / Well Matched Landing Pages (coming soon). Poor query-matching may confuse AI content understanding.
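Relevance scoring of this kind is typically done by embedding the query and the content into vectors and comparing them. A minimal sketch using the open-source `sentence-transformers` library as a stand-in—Lumar hasn’t disclosed which models power these reports, and the low/medium/high cut-offs here are illustrative:

```python
# Minimal sketch: band query-to-content relevance by cosine similarity
# of sentence embeddings. Model choice and thresholds are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def relevance_band(query: str, page_text: str) -> str:
    q_emb, p_emb = model.encode([query, page_text], convert_to_tensor=True)
    score = util.cos_sim(q_emb, p_emb).item()
    if score >= 0.6:      # illustrative "high" cut-off
        return f"high ({score:.2f})"
    if score >= 0.3:      # illustrative "medium" cut-off
        return f"medium ({score:.2f})"
    return f"low ({score:.2f})"

print(relevance_band("how to optimize for ai search",
                     "A guide to Generative Engine Optimization..."))
```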
How Else Does Lumar Help?
Aside from collecting and providing analysis at scale for your site, Lumar also helps you action that data—quickly identifying, prioritizing, and fixing issues, and stopping them from happening again.
- Make the most important, impactful fixes first with logical grouping of issues, health scores and visualizations to avoid data overload. You can even engage our own industry experts to help.
- Save time, action tasks, and improve collaboration with AI-supported processes—like ticket content creation to ensure devs have all the information they need—so issues get properly fixed and technical debt is reduced.
- Stop issues recurring with automated QA tests and dev tools to prevent new code introducing issues.
- Mitigate risk with customizable alerts when issues do return or new issues appear, and customizable dashboards so you can easily monitor multiple domains, geographies, or important site sections in one place.
Find out how Lumar can help you optimize for AI Search
Our New Dedicated GEO Reporting
Our team has built a new, dedicated area for GEO, with dashboards, health scores, and visualizations to show you how your site is performing and where improvements can be made. This new area also includes subcategories for discovery, understanding, and inclusion, so you can drill into the details and make specific improvements to help AI systems find and understand your content. Find out more about our new GEO analysis.
What’s Next in Lumar for GEO
Our experts are undertaking research and tests to identify the factors that impact how AI systems understand content, and use it in AI search results—to ensure the most useful, specific, and actionable analysis. So watch this space!