Soft 404 Errors: How to Find and Fix Them With a Crawler
Soft 404 errors are the ghosts in your machine: pages that claim to be live with a 200 OK status but actually serve ‘not found’ or empty content. They waste crawl budget and confuse search engines. Here’s how to exorcise them.
What Are Soft 404 Errors, and Why Should You Care?
Let’s be direct. Soft 404 errors are the ghosts in your server—pages that tell browsers and bots they’re alive and well with a 200 OK status code, but display content that says ‘Page Not Found’ or is otherwise empty. It’s a classic case of your server being too polite for its own good, and it’s hurting your SEO.
This polite lie causes real damage. Google sees a 200 OK and thinks, ‘Great, a new page to index.’ It then wastes precious crawl budget analyzing these dead ends, which could have been spent crawling and indexing your actual, valuable content.
This confusion leads to serious indexing problems. At best, Google eventually figures out the page is useless and flags it as a soft 404 in Google Search Console. At worst, it indexes a legion of thin, duplicate ‘not found’ pages, diluting your site’s quality signals. Trust me, ‘your search for XYZ returned no results’ is not a keyword you want to rank for.
The Technical Cause of Most Soft 404 Errors
To defeat the enemy, you must understand it. A soft 404 isn’t a simple content mistake; it’s a server configuration failure. The server has been instructed to respond with a 200 OK status for a URL that has no unique content, creating a fundamental mismatch between the technical header and the on-page reality.
Common culprits are lazy CMS configurations. Many platforms, when faced with a request for a non-existent URL, will redirect to an internal search results page. That search page, finding nothing, helpfully displays ‘No results found’ while returning a 200 OK status code. You’ve just generated a soft 404.
Another frequent offender is the improperly configured single-page application (SPA). Client-side routing might correctly show a ‘Not Found’ component in the browser, but if the server isn’t configured for server-side rendering (SSR) or proper status codes, it will serve the same initial 200 OK response for every URL, valid or not. Essentially, your backend is telling search engines ‘Everything is fine!’ while your frontend is screaming ‘This page doesn’t exist!’ That frontend–backend mismatch is what creates soft 404 errors.
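The mismatch can be sketched in a framework-free way. This is a minimal illustration in Python (your actual stack will differ, and the route table here is hypothetical): the broken pattern answers 200 for everything, while the fixed pattern puts an honest 404 in the HTTP header.

```python
# Minimal sketch of the soft 404 mismatch, framework-free for clarity.
KNOWN_PAGES = {"/", "/about", "/pricing"}  # hypothetical set of real routes


def broken_handler(path: str) -> tuple[int, str]:
    """SPA-style catch-all: every path gets the app shell with a 200.

    The frontend may later render 'Not Found', but crawlers only see 200 OK.
    """
    return 200, "<html><body>App shell</body></html>"  # soft 404 for junk URLs


def fixed_handler(path: str) -> tuple[int, str]:
    """Unknown paths get a genuine 404 status in the HTTP header."""
    if path not in KNOWN_PAGES:
        return 404, "<html><body><h1>Page not found</h1></body></html>"
    return 200, f"<html><body>Content for {path}</body></html>"
```

The fix is always the same idea regardless of framework: the status code decision must happen server-side, before the response headers are sent.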
How to Find Soft 404 Errors With a Crawler
You can’t just filter a crawl report by status code and call it a day. You have to be smarter than your server. The only reliable way to find soft 404 errors at scale is to crawl your site and inspect the content of every single page that returns a 200 OK.
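The core check is simple to express. Here is a sketch of the logic any such crawl performs, with a hypothetical footprint list you would tune to your own site’s error copy:

```python
import re

# Hypothetical footprint phrases; tune these to your site's actual error copy.
FOOTPRINTS = [
    r"page not found",
    r"could not be found",
    r"returned no results",
]
FOOTPRINT_RE = re.compile("|".join(FOOTPRINTS), re.IGNORECASE)


def looks_like_soft_404(status_code: int, html: str) -> bool:
    """Flag a soft 404 candidate: server says 200 OK, content says 'not found'."""
    return status_code == 200 and bool(FOOTPRINT_RE.search(html))
```

A real 404 page containing the same phrases is not flagged, because the status code already tells the truth.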
This is a job for a capable crawler, not a human. ScreamingCAT is built for this kind of large-scale analysis. It’s written in Rust, which means it’s ridiculously fast, and it’s open-source, which means it’s free. If you haven’t already, follow our quick start guide to get it installed. Your only expense is a few minutes of your time.
The process is straightforward. First, run a standard crawl on your domain. Once it’s complete, you’ll use the Custom Search or Custom Extraction feature to hunt for common text ‘footprints’ of ‘not found’ pages within the raw HTML of all your 200 OK URLs.
Common ‘not found’ footprints to search for include:

- “page not found”
- “could not be found”
- “this page no longer exists”
- “your search returned no results”
- “0 results for”
- “is no longer available”

For more precision, use Custom Extraction with a regular expression to pull the content of the H1 tag or a specific error-message container. This isolates the exact text and helps avoid false positives from stray phrases in your body copy. For example:

REGEX: `(?s)<div class="error-message".*?>(.*?)</div>`

Configure this in ScreamingCAT before you crawl. The extracted data will appear in a new column in your crawl report, making it painfully obvious which of your ‘OK’ pages are impostors.
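To see what an extraction like this does under the hood, here is a small sketch applying the error-message pattern above alongside an H1 extractor. The `div` class name is only an example; substitute whatever container your site’s error template actually uses.

```python
import re

# The extraction pattern from the article, plus an H1 extractor for comparison.
# The "error-message" class is a hypothetical example; match your own template.
ERROR_DIV_RE = re.compile(r'(?s)<div class="error-message".*?>(.*?)</div>')
H1_RE = re.compile(r"(?s)<h1[^>]*>(.*?)</h1>", re.IGNORECASE)


def extract_error_signals(html: str) -> dict:
    """Pull the H1 text and any error-message container from raw HTML."""
    h1 = H1_RE.search(html)
    err = ERROR_DIV_RE.search(html)
    return {
        "h1": h1.group(1).strip() if h1 else None,
        "error_message": err.group(1).strip() if err else None,
    }
```

A page whose extracted H1 is ‘Search Results’ and whose error container reads ‘No results found’ is a near-certain soft 404.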
Analyzing Crawl Data to Pinpoint True Soft 404s
Having raw data is one thing; interpreting it is another. Once your crawl is finished, export the results to a CSV or work within the ScreamingCAT interface. Filter the report to show only URLs where your custom search or extraction found a match.
Now, put on your detective hat and look for patterns. Do all the flagged URLs have identical, low word counts? Do they share the same H1 tag, like ‘Search Results’ or ‘Error’? These are strong indicators of a systemic issue generating soft 404s.
Cross-reference these URLs with other data points from the crawl, such as the title tag, meta description, and crawl depth. A page titled ‘Page Not Found’ sitting at a crawl depth of 8 is a symptom of a much larger internal linking problem that needs to be addressed at the source.
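That kind of pattern-hunting is easy to script against an exported crawl report. This is a sketch only: the column names (`url`, `h1`, `word_count`) and the 100-word thinness threshold are assumptions, so adapt them to whatever your export actually contains.

```python
from collections import Counter


def summarize_flagged(rows: list[dict]) -> Counter:
    """Group thin, flagged pages by H1 to expose systemic templates.

    `rows` is assumed to come from a crawl export with at least
    'url', 'h1', and 'word_count' columns; the 100-word cutoff for
    'thin' is a hypothetical starting point, not a standard.
    """
    return Counter(
        row["h1"] for row in rows if int(row["word_count"]) < 100
    )
```

A result like `Counter({'Search Results': 240, 'Error': 31})` points straight at the template that is mass-producing your soft 404s.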
Warning
Beware of false positives. A blog post you wrote about fixing 404 errors will naturally contain phrases like ‘page not found’. Always manually verify a sample of the flagged URLs before you start making bulk changes.
Fixing Soft 404 Errors: The Definitive Solutions
Finding the problem is the easy part. Fixing it requires precision. Do not just slap a `noindex` tag on these pages and call it a day—that’s the SEO equivalent of shoving clutter under the bed. It doesn’t solve the core problem of wasted crawl budget.
Solution 1: Return a Real 404 (The Correct Fix). This is non-negotiable. Work with your developers to configure your server to return a proper 404 Not Found or 410 Gone HTTP status code for any request to a non-existent URL. This is the clearest, most direct signal you can send to search engines. For a full refresher, read our guide to HTTP status codes.
Solution 2: Use a 301 Redirect (The Relevance Fix). If a page’s content has moved permanently or a highly relevant alternative exists, implement a server-side 301 redirect. This passes link equity and provides a good user experience. Do not, under any circumstances, bulk redirect all your broken URLs to the homepage. That’s just trading one bad signal for another.
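The decision logic behind Solutions 1 and 2 fits in a few lines. This is an illustrative sketch, not production routing code; the route tables are hypothetical and would normally come from your CMS or router:

```python
# Hypothetical route tables; in practice these come from your CMS or router.
REDIRECTS = {"/old-guide": "/guides/soft-404s"}  # moved content -> 301
GONE = {"/discontinued-product"}                 # removed on purpose -> 410
KNOWN = {"/", "/guides/soft-404s"}               # live pages -> 200


def respond(path: str) -> tuple[int, dict]:
    """Pick the honest response for a path: 200, 301, 410, or 404."""
    if path in KNOWN:
        return 200, {}
    if path in REDIRECTS:
        # 301 only when a genuinely relevant replacement exists.
        return 301, {"Location": REDIRECTS[path]}
    if path in GONE:
        return 410, {}  # Gone: deliberately removed, drops from the index faster
    return 404, {}      # Not Found: the honest default
```

Note the order: a relevant redirect beats a 404, but the homepage is never a fallback target.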
Solution 3: Fix the Source (The Proactive Fix). Your soft 404s are being linked from somewhere. Use your ScreamingCAT crawl data to find the ‘inlinks’ to these dead-end pages. Go to the source pages and update the internal broken links to point to a live, relevant URL. This cleans up your site architecture and stops the bleeding.
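Mapping dead-end URLs back to the pages that link to them is a simple join over an inlinks export. A sketch, assuming rows with `source` and `target` columns (typical of crawler inlinks exports, though your column names may vary):

```python
def broken_link_sources(
    inlinks: list[dict], dead: set[str]
) -> dict[str, list[str]]:
    """Map each soft-404 target URL to the pages that link to it.

    `inlinks` rows are assumed to have 'source' and 'target' keys.
    """
    out: dict[str, list[str]] = {}
    for row in inlinks:
        if row["target"] in dead:
            out.setdefault(row["target"], []).append(row["source"])
    return out
```

The resulting map is your to-do list: every `source` page needs its link updated or removed.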
Automating Detection to Prevent Future Issues
Technical SEO is a process, not a one-time project. Soft 404s can reappear due to CMS updates, site migrations, or simple human error. The only sane way to manage this is through automation.
Schedule your ScreamingCAT crawls to run on a regular basis—weekly or monthly, depending on your site’s volatility. Use the same custom extraction configuration from before to create a continuous monitoring system. If the number of soft 404s suddenly spikes, you’ll know immediately.
For the truly advanced, integrate your crawler into your development CI/CD pipeline. Run a crawl on a staging environment before any new code is pushed to production. Catching soft 404s before they go live is infinitely better than cleaning them up afterward. It also makes you look very, very smart in front of your developers.
Key Takeaways
- A soft 404 is a page that returns a 200 OK status code but displays ‘not found’ or thin content, wasting crawl budget.
- The root cause is almost always a server or CMS misconfiguration, not a content issue.
- Use a crawler like ScreamingCAT with Custom Search or Extraction to find soft 404s by searching for ‘not found’ text footprints within the HTML of 200 OK pages.
- The correct fix is to configure the server to return a proper 404 or 410 status code. Use 301 redirects only for pages with a relevant replacement.
- Automate your crawling and analysis to continuously monitor for new soft 404 errors and prevent them from impacting your SEO.
Ready to audit your site?
Download ScreamingCAT for free. No limits, no registration, no cloud dependency.