
SEO Audit With Python: Scripts for Technical Analysis

Stop drowning in spreadsheet tabs. A proper SEO audit with Python transforms raw crawl data into actionable insights. This guide provides the scripts and concepts you need to automate your technical analysis, find critical issues, and get back to what matters: fixing them.

Crawlers Give You Data; Python Gives You Answers

Let’s be direct. A crawler is a data collection engine. Whether you’re using our own ScreamingCAT or another tool, its primary job is to fetch URLs and report what it finds. This is a critical, but fundamentally incomplete, first step.

The real work of a technical audit begins *after* the crawl finishes. You’re left with thousands, sometimes millions, of rows in a CSV. The challenge isn’t getting the data; it’s extracting meaning from it. This is where performing an SEO audit with Python becomes not just an advantage, but a necessity for any serious technical SEO.

Sure, you could wrangle this data in Google Sheets or Excel. You could spend hours building pivot tables, writing convoluted formulas, and watching your laptop’s fan spin up to escape velocity. Or, you could write a few lines of Python to do the same work in seconds, repeatably and at any scale.

Python bridges the gap between raw data and strategic insight. It allows you to create custom checks, merge disparate data sources (like Google Search Console or server logs), and automate the tedious parts of your workflow. This guide assumes you’ve already run a crawl and are ready to analyze the exports. If you need a refresher, check out our guide on exporting and analyzing crawl data.

Setting Up Your Python Environment for an SEO Audit

Before you can start analyzing, you need the right tools. We’ll skip the basics of installing Python itself; if you’re here, you’ve likely already crossed that bridge. Your focus should be on installing the libraries that do the heavy lifting for data manipulation and analysis.

These libraries are the foundation of nearly every data analysis task in Python. You can install them all with a single command in your terminal. There’s no need to overcomplicate things.

Pro Tip

Use a virtual environment (`python -m venv my_seo_project`). It keeps your project’s dependencies isolated and prevents you from accidentally breaking your system’s Python installation. Don’t be the person who can’t run a script because of a global dependency conflict.

pip install pandas jupyter notebook
  • Pandas: The undisputed champion for data manipulation in Python. It provides data structures called DataFrames, which are essentially super-powered spreadsheets you can control with code. This is non-negotiable.
  • Jupyter Notebook: An interactive environment that lets you run code in blocks and see the output immediately. It’s perfect for exploratory analysis, allowing you to test ideas and visualize data on the fly without re-running an entire script.

Core Script: Finding Thin & Orphaned Pages

Time to get practical. One of the most common tasks in a technical audit is identifying pages that are either thin on content or poorly linked internally. These ‘orphan’ or ‘forgotten’ pages are often dead weight, wasting crawl budget and offering little value to users or search engines.

The following script uses Pandas to load a standard `internal_all.csv` export from a ScreamingCAT crawl. It then filters for valid HTML pages and flags those with low word counts or zero internal follow links. This is a foundational check in any serious SEO audit with Python.

This script is a starting point. You can adjust the `WORD_COUNT_THRESHOLD` or `INLINK_THRESHOLD` based on your site’s specific needs. The goal is to create a repeatable process that instantly flags potential problem areas for further investigation.

# Import the pandas library
import pandas as pd

# --- Configuration ---
CRAWL_FILE = 'internal_all.csv'
WORD_COUNT_THRESHOLD = 300
INLINK_THRESHOLD = 1 # We're looking for pages with 1 or 0 'follow' inlinks

# --- Script ---
# Load the crawl data into a DataFrame
df = pd.read_csv(CRAWL_FILE)

# Filter for HTML pages with a 200 OK status code
# (substring match, so pages served without a charset suffix aren't missed)
html_df = df[df['Content Type'].str.contains('text/html', na=False) & (df['Status Code'] == 200)].copy()

# Find pages with thin content
# Note: We use .copy() above to avoid SettingWithCopyWarning
thin_content_df = html_df[html_df['Word Count'] < WORD_COUNT_THRESHOLD]

# Find pages with few or no internal 'follow' links
# ScreamingCAT's 'Inlinks' column is for all links, we need to be more specific if possible
# For this example, we'll use 'Inlinks' as a proxy. A more advanced script would use a different export.
orphan_pages_df = html_df[html_df['Inlinks'] <= INLINK_THRESHOLD]

# --- Output ---
print(f"--- Thin Content Pages (less than {WORD_COUNT_THRESHOLD} words) ---")
print(thin_content_df[['Address', 'Word Count']])

print(f"\n--- Poorly Linked Pages ({INLINK_THRESHOLD} or fewer inlinks) ---")
print(orphan_pages_df[['Address', 'Inlinks']])

# Export the results to a new CSV for review
thin_content_df.to_csv('thin_content_report.csv', index=False)
orphan_pages_df.to_csv('orphan_pages_report.csv', index=False)

Advanced SEO Audit with Python: Beyond the Basics

Once you’ve mastered the basics of loading and filtering data, you can move on to more complex analyses. An advanced SEO audit with Python involves combining data points to uncover nuanced issues that basic filters would miss.

Consider a canonical tag audit. It’s not enough to know if a canonical tag exists. You need to know if it points to the correct URL, if it’s self-referencing, or if you’ve accidentally created a canonical chain. With Python, you can compare the ‘Address’ column to the ‘Canonical Link Element 1’ column to instantly flag any non-self-referencing canonicals for review.
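Here’s a minimal sketch of that check in Pandas. The column names follow the `internal_all.csv` export used in the script above; the sample rows are invented for illustration.

```python
import pandas as pd

# Two made-up rows standing in for a real crawl export
df = pd.DataFrame({
    'Address': ['https://example.com/a', 'https://example.com/b'],
    'Canonical Link Element 1': ['https://example.com/a', 'https://example.com/canonical-b'],
})

# Flag every page whose canonical does not point back at itself
non_self_canonical = df[df['Address'] != df['Canonical Link Element 1']]
print(non_self_canonical['Address'].tolist())  # → ['https://example.com/b']
```

In practice, normalize trailing slashes, casing, and protocol before comparing, or you’ll flag harmless mismatches as problems.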

Another powerful technique is to merge crawl data with external data sources. Export your performance data from Google Search Console and join it with your crawl data on the URL. Now you can identify pages that are technically sound (Status Code 200, indexable) but receive zero impressions or clicks. These are your highest priority content decay candidates.
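A minimal sketch of that join, assuming your GSC export has `page`, `clicks`, and `impressions` columns (adjust the names to match your actual export; the rows here are sample data):

```python
import pandas as pd

# Sample data standing in for a crawl export and a GSC performance export
crawl = pd.DataFrame({
    'Address': ['https://example.com/a', 'https://example.com/b'],
    'Status Code': [200, 200],
})
gsc = pd.DataFrame({
    'page': ['https://example.com/a'],
    'clicks': [42],
    'impressions': [1000],
})

# Left-join so every crawled URL survives, then treat missing metrics as zero
merged = crawl.merge(gsc, left_on='Address', right_on='page', how='left')
merged[['clicks', 'impressions']] = merged[['clicks', 'impressions']].fillna(0)

# Technically sound pages that earn nothing in search
decay = merged[(merged['Status Code'] == 200) & (merged['impressions'] == 0)]
print(decay['Address'].tolist())  # → ['https://example.com/b']
```

The `how='left'` is the important part: an inner join would silently drop exactly the pages you’re hunting for, the ones GSC has never reported on.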

For the truly adventurous, you can start parsing server logs. By combining log file data showing Googlebot’s activity with your crawl data, you can pinpoint pages Googlebot crawls frequently despite them being non-canonical, or pages in your sitemap that Googlebot never visits. This is the path to truly understanding crawl budget optimization. You’ll likely need some Regex skills for that, but the payoff is enormous.
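As a taste of what that Regex work looks like, here’s a sketch that pulls the requested path out of a combined-format access log line when the user agent claims to be Googlebot. The log line is invented for illustration, and a production version should also verify the IP via reverse DNS, since anyone can fake the user agent string.

```python
import re

# A sample combined-log-format line (invented, not from a real server)
line = ('66.249.66.1 - - [10/Oct/2024:13:55:36 +0000] '
        '"GET /blog/post HTTP/1.1" 200 5316 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

# Capture the requested path and the user agent from the log line
pattern = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) HTTP/[\d.]+" \d+ \d+ "[^"]*" "(?P<ua>[^"]*)"'
)

match = pattern.search(line)
if match and 'Googlebot' in match.group('ua'):
    print(match.group('path'))  # → /blog/post
```

Aggregate those paths into a DataFrame and you can join them against your crawl export exactly like the GSC merge above.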

Automating and Visualizing Your Findings

A script is only useful if you run it. The final step is to move from manual execution to an automated workflow. Your goal should be to run these audit scripts on a schedule—perhaps weekly or after major site changes—to proactively catch issues.

You can use a simple cron job on a server or get more sophisticated with tools like GitHub Actions to run your Python scripts automatically. Imagine running a crawl via the ScreamingCAT command-line interface, then triggering your Python script to analyze the export and email you a report of any new critical issues. That’s the power of a fully automated audit workflow.
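For the cron route, a single crontab entry is enough. The paths and script name below are hypothetical; substitute your own.

```shell
# Hypothetical crontab entry: every Monday at 06:00, re-run the audit
# script against the latest crawl export and append output to a log.
0 6 * * 1 cd /home/seo/audits && python3 audit_crawl.py internal_all.csv >> audit.log 2>&1
```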

Finally, don’t underestimate the power of visualization. Stakeholders rarely want to see a CSV of 500 thin pages. They want to understand the scale of the problem at a glance. Use libraries like Matplotlib or Seaborn to generate charts from your DataFrames.

A simple histogram showing the distribution of word counts across the site is far more impactful than a spreadsheet. A bar chart of non-200 status codes makes the scope of a migration error immediately obvious. A good chart can save you from a thousand-word email and get you the resources you need to fix the problem.
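That histogram takes only a few lines with Matplotlib. The word counts below are sample data; in practice you’d pass the `Word Count` column from your crawl DataFrame.

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render off-screen so this also runs on a headless server
import matplotlib.pyplot as plt

# Sample word counts; swap in html_df['Word Count'] from a real crawl
word_counts = pd.Series([120, 250, 480, 900, 150, 60, 1300, 340])

# Plot the distribution and save it as an image for your report
fig, ax = plt.subplots()
ax.hist(word_counts, bins=5)
ax.set_xlabel('Word Count')
ax.set_ylabel('Pages')
ax.set_title('Distribution of word counts across the site')
fig.savefig('word_count_distribution.png')
```

Swapping `hist` for `bar` with a `value_counts()` of the `Status Code` column gives you the migration-error chart with the same handful of lines.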

Good to know

Remember, the final output of your analysis isn’t a script or a spreadsheet; it’s a prioritized list of actions. Your code should serve one purpose: to get you to that list faster and more accurately than any manual process ever could.

Stop Clicking, Start Scripting

Moving your technical SEO audit process into Python is a force multiplier. It takes you from being a data operator, clicking through interfaces and filtering spreadsheets, to a data strategist, building systems to find problems.

The learning curve is real, but the investment pays for itself the first time you audit a million-page website in minutes instead of days. Start with the simple script provided here, adapt it to your needs, and build from there.

Stop wrestling with pivot tables. Start writing code.

Key Takeaways

  • Python, particularly with the Pandas library, is essential for scalable and repeatable technical SEO audits.
  • Basic scripts can quickly analyze crawl exports to find common issues like thin content and orphan pages.
  • Advanced techniques involve merging crawl data with external sources like GSC or server logs for deeper insights.
  • Automating your Python scripts creates a proactive monitoring system, catching issues before they escalate.
  • Visualizing your findings with libraries like Matplotlib is key to communicating the scale of problems to stakeholders.

ScreamingCAT Team

Building the fastest free open-source SEO crawler. Written in Rust, designed for technical SEOs who value speed, privacy, and no crawl limits.

Ready to audit your site?

Download ScreamingCAT for free. No limits, no registration, no cloud dependency.
