Custom Extraction in ScreamingCAT: XPath, CSS Selectors, and Regex

Stop settling for standard crawl data. This guide details how to use ScreamingCAT custom extraction with XPath, CSS Selectors, and Regex to scrape the exact data you need for truly advanced SEO audits.

Why Your Standard Crawl Data Is Lying to You (By Omission)

Let’s be direct. A standard crawl gives you the basics: status codes, titles, meta descriptions, word count. It’s a fine starting point, but it’s table stakes. It doesn’t tell you if the correct schema is deployed, if the blog post has a publication date, or if a rogue analytics script was hardcoded onto a single page. For that, you need to go deeper. This is where ScreamingCAT custom extraction becomes your most valuable tool.

Custom extraction allows you to scrape virtually any piece of information from a page’s source code during a crawl. Instead of just knowing a page exists, you can know its specific attributes — data that informs real strategy, not just checklist SEO. We’re talking about scraping author names, SKUs, hreflang attributes, structured data values, and anything else you can target in the DOM.

We support the three main methods you’ll need: XPath, CSS Selectors, and Regular Expressions (Regex). Each has its place, its strengths, and its… quirks. We have our opinions on which to use, and we’ll share them. If you’re just getting started with our crawler, you might want to check out our Quick Start guide before diving in.

Configuring Your First ScreamingCAT Custom Extraction

Before you can wield this power, you need to know where the buttons are. The configuration is straightforward, because we believe powerful tools shouldn’t require a PhD to operate. Navigate to `Configuration > Custom > Custom Extraction` to open the main settings window.

Here, you’ll see a list of 100 available ‘Extractors’. Each one represents a unique piece of data you want to collect. For each extractor, you need to define how and what you want to scrape. It’s less intimidating than it sounds.

  • Name: Give your extractor a descriptive name (e.g., ‘Schema Type’, ‘PubDate’). This name becomes the column header in your crawl export, so don’t be lazy.
  • Selector: This is where you choose your weapon: XPath, CSS Selector, or Regex.
  • Expression: The actual query or pattern you’ll use to find the data. This is where the magic happens, and we’ll spend the rest of this article on it.
  • Extract Type: Choose what you want to pull back. Do you want the full HTML of the element, just the text content inside it, or the value of a specific attribute (like the `href` from an `<a>` tag)?
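To make the three extract types concrete, here's an illustrative sketch using Python's stdlib (not ScreamingCAT code — the markup and variable names are hypothetical) showing what each type would return for the same `<a>` element:

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; ElementTree needs well-formed XML.
html = '<div><a class="btn-primary" href="/signup">Sign up free</a></div>'
link = ET.fromstring(html).find('.//a')

# 'Full HTML of the element': the serialized tag and its contents.
full_html = ET.tostring(link, encoding='unicode')
# 'Text content inside it': just the visible string.
text = link.text
# 'Value of a specific attribute': here, href.
href = link.get('href')

print(full_html)  # <a class="btn-primary" href="/signup">Sign up free</a>
print(text)       # Sign up free
print(href)       # /signup
```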

Method 1: XPath – The Surgical Scalpel

XPath is the most powerful and precise way to select elements from an HTML (or XML) document. It navigates the document’s tree structure, allowing for incredibly specific targeting that CSS selectors often can’t match. If you need to find an element based on its content or its relationship to other elements, XPath is your tool.

Frankly, if you’re serious about technical SEO and data extraction, you should learn XPath. It’s excellent for scraping complex structured data, finding meta tags by property attribute, or grabbing the publication date that’s buried three `divs` deep without a unique class name. It’s the professional’s choice.

For example, let’s say you want to extract the content of the `og:image` meta tag to audit your Open Graph data at scale. A simple CSS selector can’t target an attribute’s value. XPath handles this with ease.

`//meta[@property='og:image']/@content`
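If you want to sanity-check an expression like this outside the crawler, here's a minimal sketch using Python's stdlib `xml.etree.ElementTree` (the markup is hypothetical, and ElementTree only speaks a subset of XPath — it can't end a path with `/@content`, so we select the element and then read the attribute):

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed <head> fragment.
head = '''<head>
  <meta property="og:title" content="Custom Extraction Guide" />
  <meta property="og:image" content="https://example.com/cover.png" />
</head>'''

# Same predicate as the XPath above; the /@content step becomes
# .get('content') because ElementTree can't select attributes directly.
tag = ET.fromstring(head).find('.//meta[@property="og:image"]')
content = tag.get('content')
print(content)  # https://example.com/cover.png
```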

Method 2: CSS Selectors – The Familiar Friend

If you’ve ever written a line of CSS or used browser DevTools to inspect an element, you already know CSS Selectors. They’re intuitive, easy to read, and perfect for the majority of common extraction tasks. This is your workhorse for grabbing elements by their ID, class, or tag type.

Want to extract all H1s to check for duplicates? `h1`. Need to find the text inside your primary call-to-action button? `a.btn-primary`. CSS Selectors are fast and efficient for these common use cases. They are, however, less powerful than XPath: they can’t select elements based on their text content, and most selector engines can’t traverse back up the DOM tree.

Think of it this way: CSS Selectors are great for grabbing things you can see. XPath is for grabbing things based on their position and properties within the document’s structure. Both are essential for a complete ScreamingCAT custom extraction toolkit.

Pro Tip

Don’t write selectors from scratch. Right-click an element in your browser’s DevTools, choose ‘Copy’, then ‘Copy selector’. It’s a great starting point you can refine for your crawl.

`div.product-details > span.price`
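That CSS selector maps directly onto tree structure. As a rough illustration (hypothetical markup, Python stdlib, and note that `[@class='price']` is an exact-string match, not true CSS class matching), the equivalent XPath lookup looks like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical product markup, well-formed for ElementTree.
page = '''<main>
  <div class="product-details">
    <span class="sku">CAT-001</span>
    <span class="price">$49.00</span>
  </div>
</main>'''

# div.product-details > span.price, rewritten as simplified XPath.
# Caveat: [@class='price'] won't match multi-class attributes
# like class="price sale", which a real CSS engine would handle.
price = ET.fromstring(page).find(
    ".//div[@class='product-details']/span[@class='price']")
print(price.text)  # $49.00
```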

Method 3: Regex – The Terrifying, Brilliant Nuclear Option

Then there’s Regex. Regular Expressions don’t parse the HTML structure; they parse raw text. This makes them both incredibly powerful and notoriously difficult. Use Regex when the data you need isn’t contained within a clean HTML element or attribute.

When do you reach for this chaotic tool? When you need to pull a Google Analytics ID from an inline JavaScript block. When you want to find any email address formatted as `[text]@[text].[text]` anywhere in the source code. When XPath and CSS Selectors fail because the data you need is a messy string inside a larger text node.

Regex is a last resort, but sometimes it’s the only resort. A word of warning: a poorly written Regex pattern can bring a crawl to its knees or, worse, return garbage data. Test your patterns rigorously before running them on a 10-million-URL site.

`UA-[0-9]{5,}-[0-9]{1,}`
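Because Regex works on raw text, you can test a pattern like this on any string — no parsing required. A quick sketch with Python's stdlib `re` module (the script snippet is invented for illustration):

```python
import re

# Hypothetical inline script, as it might appear in raw page source.
source = '''<script>
  ga('create', 'UA-12345-6', 'auto');
  gtag('config', 'G-ABC123XYZ');
</script>'''

# The pattern above: a classic Universal Analytics property ID.
# Note the GA4 measurement ID (G-...) is correctly not matched.
ids = re.findall(r'UA-[0-9]{5,}-[0-9]{1,}', source)
print(ids)  # ['UA-12345-6']
```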

Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.

Jamie Zawinski

Advanced Use Cases and Automation

Now, let’s connect theory to practice. The real value of ScreamingCAT custom extraction is unleashed when you apply it to solve complex, site-wide problems that are impossible to check manually.

Imagine auditing a multi-national site’s hreflang implementation. You can set up extractors to pull the `href` and `hreflang` attributes from every single `<link rel="alternate">` tag on every page. Export the data, pivot it, and you have a complete map of your international SEO setup, ready for validation.
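In XPath terms, those two extractors could be expressed as `//link[@rel='alternate']/@hreflang` and `//link[@rel='alternate']/@href`. Outside the crawler, the same idea can be sketched with Python's stdlib (hypothetical markup; ElementTree matches the elements, then reads each attribute):

```python
import xml.etree.ElementTree as ET

# Hypothetical <head> with hreflang annotations (well-formed XML).
head = '''<head>
  <link rel="alternate" hreflang="en-gb" href="https://example.com/uk/" />
  <link rel="alternate" hreflang="de" href="https://example.com/de/" />
  <link rel="stylesheet" href="/main.css" />
</head>'''

# Collect (hreflang, href) pairs from every alternate link;
# the stylesheet link is filtered out by the rel predicate.
pairs = [(link.get('hreflang'), link.get('href'))
         for link in ET.fromstring(head).findall('.//link[@rel="alternate"]')]
print(pairs)
# [('en-gb', 'https://example.com/uk/'), ('de', 'https://example.com/de/')]
```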

Other advanced uses include: scraping product prices and SKUs from e-commerce sites for competitor analysis, extracting ‘last updated’ dates from thousands of articles to prioritize content refreshes, or validating that every page has the correct GTM container ID deployed. The possibilities are limited only by your ability to target the data.

Once you’ve perfected your extraction configuration for a specific audit type, you can save it and reuse it. Better yet, integrate it into your CI/CD pipeline for true SEO automation. Run a crawl with your custom configuration on every deployment to catch critical errors before they ever reach production.

Key Takeaways

  • Custom extraction moves beyond basic crawl data, allowing you to scrape any information from a page’s source code.
  • ScreamingCAT supports three extraction methods: XPath (for precision), CSS Selectors (for everyday ease of use), and Regex (for unstructured text).
  • Use XPath for complex tasks like scraping meta attributes or schema, and CSS Selectors for common tasks like finding elements by class or ID.
  • Regex is a powerful last resort for finding patterns in raw text, like tracking codes within script blocks.
  • Combine custom extraction with automation to create powerful, repeatable SEO audits that can be integrated into development workflows.

ScreamingCAT Team

Building the fastest free open-source SEO crawler. Written in Rust, designed for technical SEOs who value speed, privacy, and no crawl limits.

Ready to audit your site?

Download ScreamingCAT for free. No limits, no registration, no cloud dependency.
