ScreamingCAT Regex Filters: 15 URL Patterns for Precise Crawling
Stop wasting time on bloated site crawls. Master precise auditing with these 15 essential ScreamingCAT regex filters to include and exclude specific URL patterns.
In this article
- Why Your Full Site Crawl is a Waste of Resources
- The Basics: Include vs. Exclude with ScreamingCAT Regex Filters
- Foundational Regex Patterns for Everyday Audits
- Intermediate Regex for Surgical Crawling
- Advanced ScreamingCAT Regex Filters for the Masochistic SEO
- Putting It All Together: A Sample Configuration
Why Your Full Site Crawl is a Waste of Resources
Let’s be direct. Clicking ‘Start’ on a full site crawl without any configuration is an amateur move. You’re telling your machine to boil the ocean, wasting CPU cycles, memory, and most importantly, your time.
A full crawl of a large site is a data firehose. You get everything: every tracking parameter, every pointless paginated series, every staging subdomain you forgot existed. The resulting dataset is noisy, bloated, and makes finding actionable insights a tedious exercise in filtering after the fact.
This is where you start working smarter. Instead of crawling everything and filtering later, you define the scope of your audit from the outset. This is the core principle of efficient technical SEO, and the primary reason we built robust filtering directly into ScreamingCAT.
By using ScreamingCAT regex filters, you dictate the exact boundaries of the crawl. You tell the crawler what to look for and what to ignore, transforming a brute-force process into a surgical strike. If you’re new to this, our Quick Start guide will get you up to speed on the basics before you dive in here.
The Basics: Include vs. Exclude with ScreamingCAT Regex Filters
ScreamingCAT provides two primary text boxes for your regular expressions: ‘Include’ and ‘Exclude’. Understanding the fundamental difference is non-negotiable.
Include: This is a whitelist. When you add a regex pattern here, ScreamingCAT will only crawl URLs that match the pattern. It’s an explicit instruction to focus solely on a specific subset of the site.
Exclude: This is a blacklist. The crawler will attempt to crawl everything it finds, except for URLs that match a pattern in this list. This is perfect for trimming the fat, like removing admin sections or known parameter-heavy areas.
You can, and often should, use them together. For example, you might ‘Include’ only the `/blog/` directory and then ‘Exclude’ any URLs within the blog that contain a `?replytocom` parameter. This combination gives you immense control over the crawl scope.
The most powerful audits leverage both lists to create a highly specific crawl configuration. This is the key to using ScreamingCAT regex filters effectively.
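The interplay is easy to prototype outside the tool. Here’s a minimal Python sketch of the include/exclude logic described above; the `in_scope` helper and the sample patterns are illustrative, not ScreamingCAT’s actual implementation:

```python
import re

def in_scope(url, include=None, exclude=None):
    """A URL is crawled only if it matches at least one 'Include'
    pattern (when any are set) and matches no 'Exclude' pattern."""
    if include and not any(re.search(p, url) for p in include):
        return False
    if exclude and any(re.search(p, url) for p in exclude):
        return False
    return True

include = [r"https://example\.com/blog/.*"]
exclude = [r".*\?replytocom.*"]

print(in_scope("https://example.com/blog/post-1/", include, exclude))             # True
print(in_scope("https://example.com/blog/post-1/?replytocom=42", include, exclude))  # False
print(in_scope("https://example.com/shop/", include, exclude))                    # False
```

The blog post stays in scope, the comment-reply URL is carved out by the blacklist, and the shop never enters the crawl because it fails the whitelist.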
Warning
A common mistake is adding a restrictive ‘Include’ pattern without the seed URL matching it. If your start URL is `https://example.com` but your only ‘Include’ pattern is `https://example.com/blog/.*`, the crawl will stop after one URL. Ensure your starting point is within your ‘Include’ scope.
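You can check for this pitfall in two lines of Python before you ever hit ‘Start’ (the pattern below is a hypothetical example):

```python
import re

# The seed URL fails the only 'Include' pattern, so the crawl cannot expand.
include = [r"https://example\.com/blog/.*"]

seed_ok = any(re.match(p, "https://example.com") for p in include)
blog_ok = any(re.match(p, "https://example.com/blog/") for p in include)

print(seed_ok)  # False -- the crawl dies at the start URL
print(blog_ok)  # True -- starting here instead keeps the crawl alive
```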
Foundational Regex Patterns for Everyday Audits
You don’t need to be a regex wizard to get 90% of the value. These five foundational patterns will handle the majority of your daily filtering needs. Master them first.
These patterns are your bread and butter for quick, targeted audits. They are simple to write, easy to understand, and dramatically reduce crawl times on large websites.
- 1. Isolate a Specific Subdirectory: The most common use case. You only want to audit the blog. Use an ‘Include’ rule.
  `https://www\.example\.com/blog/.*`
- 2. Exclude a Specific Subdirectory: You want to crawl the whole site but ignore the noisy, logged-in area. Use an ‘Exclude’ rule.
  `https://www\.example\.com/account/.*`
- 3. Target URLs by File Extension: Useful for finding all PDFs or other non-HTML files for a content audit. The `\.` escapes the dot, which would otherwise match any character. Use an ‘Include’ rule.
  `.*\.pdf$`
- 4. Exclude All Parameterized URLs: A blunt but effective way to eliminate faceted navigation and tracking URLs. The `\?` escapes the question mark, which is a special character in regex. Use an ‘Exclude’ rule.
  `.*\?.*`
- 5. Target URLs Containing a Specific Word: Great for auditing product pages that all have a common word in the URL, like ‘product’. Use an ‘Include’ rule.
  `.*product.*`
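Before pasting any of these into the tool, it’s worth sanity-checking them against a handful of URLs. A quick Python sketch using the five patterns above (the sample URLs are made up, and full-string matching is assumed):

```python
import re

patterns = {
    "1. blog include":      r"https://www\.example\.com/blog/.*",
    "2. account exclude":   r"https://www\.example\.com/account/.*",
    "3. pdf include":       r".*\.pdf$",
    "4. parameter exclude": r".*\?.*",
    "5. word include":      r".*product.*",
}

urls = [
    "https://www.example.com/blog/regex-tips/",
    "https://www.example.com/account/settings/",
    "https://www.example.com/files/guide.pdf",
    "https://www.example.com/shoes?color=red",
    "https://www.example.com/product-widget/",
]

# Show which sample URL each pattern catches.
for name, pattern in patterns.items():
    hits = [u for u in urls if re.fullmatch(pattern, u)]
    print(name, "->", hits)
```

Each pattern should catch exactly one of the five URLs; if one of yours catches nothing, or everything, fix it before launching the crawl.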
Intermediate Regex for Surgical Crawling
Once you’ve mastered the basics, you’ll inevitably face more complex scenarios. Your site architecture isn’t always clean, and your audit requirements can be frustratingly specific.
These intermediate patterns introduce concepts like ‘OR’ conditions, subdomain targeting, and character groups. They allow you to build more nuanced and powerful ScreamingCAT regex filters. For a deeper dive into the syntax, check out our comprehensive Regex for SEO guide.
- 6. Isolate Multiple Subdirectories: You need to crawl the blog and the resources section, but nothing else. The pipe `|` character acts as an ‘OR’. Use an ‘Include’ rule.
  `.*/(blog|resources)/.*`
- 7. Crawl a Specific Subdomain: Your crawl is leaking into `dev.` or `staging.` subdomains. Lock it down to the main subdomain. Use an ‘Include’ rule.
  `^https://www\.example\.com.*`
- 8. Exclude Multiple File Types: Save resources by telling ScreamingCAT not to download images, CSS, or JavaScript files. This is excellent for a pure HTML audit. Use an ‘Exclude’ rule.
  `.*\.(jpg|jpeg|png|gif|css|js)$`
- 9. Find URLs with a Specific Query Parameter: You want to audit all pages tagged with a specific campaign, ignoring others. This looks for ‘utm_campaign=’ after the query string begins. Use an ‘Include’ rule.
  `.*\?.*utm_campaign=.*`
- 10. Target URLs with a Specific Number of Path Segments: Audit only top-level category pages (e.g., `/category/`) and not product pages (`/category/product/`). This pattern matches URLs with exactly one subdirectory. Use an ‘Include’ rule.
  `^https://[^/]+/[^/]+/$`
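Pattern 10 is the trickiest of this batch, so here is a quick Python check of how it behaves on a few hypothetical URLs:

```python
import re

# Item 10: exactly one path segment, ending in a trailing slash.
one_segment = r"^https://[^/]+/[^/]+/$"

for url in [
    "https://www.example.com/category/",          # matches: depth 1
    "https://www.example.com/category/product/",  # no match: depth 2
    "https://www.example.com/",                   # no match: homepage
]:
    print(url, "->", bool(re.match(one_segment, url)))
```

The `[^/]+` character class means “one or more characters that are not a slash”, which is what limits the match to a single path segment.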
Advanced ScreamingCAT Regex Filters for the Masochistic SEO
Sometimes, the problem isn’t simple. You’re dealing with a legacy system, an international site with a byzantine URL structure, or you just have a very, very specific question to answer.
This is where you push regex to its limits. These patterns are less common but are indispensable when you need them. They solve complex problems related to URL formatting, internationalization, and canonicalization. They can also be combined with Custom Extraction for unbelievably powerful data collection.
Good to know
Fair warning: Complex regex is hard to read and easy to get wrong. Always test your advanced patterns on a small sample of URLs before running a multi-million URL crawl.
- 11. Find URLs Missing the Trailing Slash: Enforce URL consistency by finding all directory-level URLs (no file extension) that are missing the trailing slash. Use an ‘Include’ rule.
  `.*/[^./]+$`
- 12. Isolate URLs with Uppercase Characters: A classic technical SEO issue. Find potential duplicate content issues by isolating URLs that contain capital letters. Use an ‘Include’ rule.
  `.*[A-Z].*`
- 13. Target Specific Language/Country Subdirectories: Audit only the English-US and French-Canada sections of an international site. Use an ‘Include’ rule.
  `.*/(en-us|fr-ca)/.*`
- 14. Find URLs with Multiple Query Parameters: Identify URLs that are likely generated by faceted navigation systems by looking for a question mark followed by at least two ampersands (three or more parameters). Use an ‘Include’ rule.
  `.*\?.*&.*&.*`
- 15. Exclude Everything *Except* HTML in a Subdirectory: A powerful combination. This ‘Exclude’ rule uses a negative lookahead to match every URL that is NOT an `.html` page inside the `/products/` directory. It’s a double negative that effectively isolates only the HTML pages within `/products/`.
  `^(?!.*/products/.*\.html$).*`
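Negative lookaheads are notoriously easy to invert by accident, so here is how pattern 15 behaves as an ‘Exclude’ rule, sketched in Python against a few made-up URLs:

```python
import re

# Item 15 as an 'Exclude' rule: match (and therefore drop) everything
# that is NOT an .html page inside /products/.
exclude = r"^(?!.*/products/.*\.html$).*"

for url in [
    "https://www.example.com/about/",                # excluded
    "https://www.example.com/products/photo.png",    # excluded (not .html)
    "https://www.example.com/products/widget.html",  # kept
]:
    print(url, "->", "excluded" if re.match(exclude, url) else "kept")
```

Only the HTML page inside `/products/` survives the filter, which is exactly the double-negative effect described above.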
Putting It All Together: A Sample Configuration
Theory is great, but let’s apply this to a real-world scenario. Imagine you’re auditing a large e-commerce site. Your goal is to analyze all canonical product and category pages, while ignoring faceted navigation, user accounts, and all non-HTML assets.
This is a perfect job for a combined Include/Exclude configuration using ScreamingCAT regex filters. You’ll set a broad ‘Include’ rule and then use ‘Exclude’ rules to carve away everything you don’t need.
Here is what your configuration in ScreamingCAT would look like.
## ScreamingCAT Crawl Configuration
# Start URL: https://www.example-commerce.com
# Include List (one pattern):
https://www\.example-commerce\.com/(products|categories)/.*
# Exclude List (three patterns):
.*\?.*
.*/account/.*
.*\.(jpg|png|pdf|css|js)$
- The Include Rule: `https://www\.example-commerce\.com/(products|categories)/.*` tells ScreamingCAT to only consider URLs that fall within the `/products/` or `/categories/` subdirectories.
- The First Exclude Rule: `.*\?.*` is our trusty catch-all for parameters. This immediately eliminates all faceted navigation, sorting, filtering, and tracking URLs.
- The Second Exclude Rule: `.*/account/.*` is a safety measure. Even though it wouldn’t be included anyway, explicitly excluding sensitive or irrelevant directories is good practice.
- The Third Exclude Rule: `.*\.(jpg|png|pdf|css|js)$` prevents the crawler from downloading asset files, focusing its resources purely on the HTML documents you need to audit.
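You can dry-run the whole configuration the same way. This Python sketch mirrors the sample config; the `should_crawl` helper is illustrative, not part of ScreamingCAT:

```python
import re

# The sample e-commerce configuration, expressed as a filter function.
INCLUDE = [r"https://www\.example-commerce\.com/(products|categories)/.*"]
EXCLUDE = [r".*\?.*", r".*/account/.*", r".*\.(jpg|png|pdf|css|js)$"]

def should_crawl(url):
    """Whitelist first, then carve away with the blacklist."""
    if not any(re.match(p, url) for p in INCLUDE):
        return False
    return not any(re.match(p, url) for p in EXCLUDE)

print(should_crawl("https://www.example-commerce.com/products/blue-widget/"))     # True
print(should_crawl("https://www.example-commerce.com/products/widget?sort=asc"))  # False
print(should_crawl("https://www.example-commerce.com/categories/hero.jpg"))       # False
```

The canonical product page passes, while the parameterized and asset URLs are stripped out before the crawler ever fetches them.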
This configuration transforms a potential 10-hour, 5-million-URL crawl into a 30-minute, 50,000-URL surgical audit. This is the power of precise, upfront filtering.
A Wiser, More Caffeinated SEO
Key Takeaways
- Stop running full, unfiltered site crawls. They waste time, resources, and produce noisy data.
- Use ‘Include’ as a whitelist to define what TO crawl and ‘Exclude’ as a blacklist to define what NOT to crawl.
- Master a handful of foundational regex patterns to handle 90% of your daily audit needs.
- Combine multiple Include and Exclude patterns to perform highly specific, surgical audits on complex websites.
- Always test your regex filters on a small sample before launching a large-scale crawl to avoid misconfigurations.
Ready to audit your site?
Download ScreamingCAT for free. No limits, no registration, no cloud dependency.