SEO Automation | Damian Smilgin | March 28, 2026 | 7 min read

Stop filtering spreadsheets manually. Regular expressions are an SEO superpower for data analysis. Our guide provides 20+ copy-paste patterns for technical SEO.

In this article

What is Regex and Why Should SEOs Care?
The Building Blocks: Core Regex Syntax You'll Actually Use
Practical Regex for SEO: Filtering URLs and Data
Advanced Regex for SEO: Custom Extraction
Putting It All Together: Regex in Your Favorite Tools
Common Regex Pitfalls (and How to Not Look Dumb)

What is Regex and Why Should SEOs Care?

If you’ve ever spent an hour filtering a 100,000-row crawl export to find URLs with a specific parameter, you’ve felt the pain that regex for SEO was born to solve. Regex, short for regular expression, is a sequence of characters that specifies a search pattern. Think of it as ‘Find and Replace’ on a cocktail of caffeine and computer science.

For technical SEOs, it’s not a ‘nice-to-have’—it’s a fundamental tool. It allows you to sift through massive datasets with surgical precision, whether you’re analyzing log files, filtering a crawl, or creating complex segments in analytics.

You’ll find regex support in nearly every tool in your stack: ScreamingCAT, Screaming Frog, Google Search Console, Google Analytics, and every half-decent code editor. Mastering it means you work faster, find deeper insights, and spend less time doing mind-numbing data entry.

The Building Blocks: Core Regex Syntax You’ll Actually Use

Regex can look like a cat walked across your keyboard, but the core syntax is surprisingly simple. You don’t need to be a developer to understand it. Here are the essential building blocks you’ll use 99% of the time.

. (Dot): The wildcard. Matches any single character (except a newline). Lazy but effective.
* (Asterisk): The needy one. Matches the preceding character zero or more times. `ca*t` matches ‘ct’, ‘cat’, and ‘caaat’.
+ (Plus): The slightly less needy one. Matches the preceding character one or more times. `ca+t` matches ‘cat’ and ‘caaat’ but not ‘ct’.
? (Question Mark): The optional one. Matches the preceding character zero or one time. Often used to make a search ‘non-greedy’ (more on that later).
\ (Backslash): The escape artist. Use it before a special character to treat it as a literal character. To find a literal dot, you use `\.`. Essential.
[] (Square Brackets): The character set. Matches any single character inside the brackets. `[abc]` matches ‘a’, ‘b’, or ‘c’. Use a hyphen for a range: `[0-9]` matches any digit.
() (Parentheses): The group. Groups multiple tokens together and creates a ‘capturing group,’ which is critical for custom extraction.
| (Pipe): The OR operator. `cat|dog` matches ‘cat’ or ‘dog’.
^ (Caret): The anchor for the start. Matches the beginning of a string. `^https` will only match URLs that start with ‘https’.
$ (Dollar Sign): The anchor for the end. Matches the end of a string. `/$` will find URLs that end with a trailing slash.
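
If you want to poke at these building blocks before trusting them in a crawl, a few lines of Python will do. This is a minimal sketch using Python's built-in `re` module; the `check` helper and the sample strings are made up for illustration, and other regex engines can differ in small ways.

```python
import re

# Each entry: (pattern, text, should_match). The examples mirror the list above.
CASES = [
    (r"ca*t", "ct", True),                    # * allows zero 'a' characters
    (r"ca+t", "ct", False),                   # + needs at least one 'a'
    (r"ca+t", "caaat", True),
    (r"colou?r", "color", True),              # ? makes the 'u' optional
    (r"example\.com", "exampleXcom", False),  # escaped dot is a literal dot
    (r"[0-9]", "/page-2/", True),             # character set with a range
    (r"cat|dog", "hotdog", True),             # OR operator
    (r"^https", "http://example.com/", False),
    (r"/$", "https://example.com/blog/", True),
]


def check(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


for pattern, text, expected in CASES:
    result = check(pattern, text)
    print(f"{pattern!r:18} on {text!r:30} -> {result}")
    assert result == expected
```

Note that `re.search` looks for a match anywhere in the string, which is roughly how a ‘Matches Regex’ filter tends to behave in crawlers; check your tool's documentation to be sure.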

Practical Regex for SEO: Filtering URLs and Data

This is where the rubber meets the road. The most immediate use of regex for SEO is filtering large URL lists during a site audit. In a crawler like ScreamingCAT, you can use these patterns in the include/exclude configuration to refine your crawl scope or in the URL search to analyze the results.

Let’s look at some common filtering patterns you can copy and paste directly.

Good to know

Most tools have a ‘Matches Regex’ and a ‘Does Not Match Regex’ option. Combine them for powerful, multi-layered filtering.

Task	Regex Pattern	Explanation
Find URLs with query parameters	\?	The question mark is a special character, so we escape it with a backslash.
Find URLs with underscores	_	Underscores are bad for usability and can sometimes cause issues. This simple pattern finds them.
Find non-lowercase URLs	[A-Z]	Finds any URL containing at least one uppercase letter.
Isolate a specific subfolder (e.g., /blog/)	/blog/	Matches the literal string '/blog/' anywhere in the URL.
Isolate URLs within a subfolder	^https?://[^/]+/blog/.*	Matches URLs that start with http or https, followed by the domain, then the /blog/ subfolder.
Find URLs ending in .pdf	\.pdf$	Escapes the dot and uses the dollar sign to anchor the match to the end of the string.
Find URLs with numbers	[0-9]	Finds any URL containing at least one digit.
Find multiple file types (pdf, doc, xlsx)	\.(pdf|doc|xlsx)$	Escapes the dot, then uses grouping and the OR operator to match multiple extensions at the end of a string.
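
The same patterns work outside a crawler too. Here is a rough Python sketch that runs several of them over a small list of URLs; the URLs, the `patterns` dictionary keys, and the `filter_urls` helper are all invented for the example, and in practice the list would come from your crawl export.

```python
import re

# Hypothetical crawl export; real data would be loaded from a CSV.
urls = [
    "https://example.com/blog/regex-guide/",
    "https://example.com/Blog/Old_Post",
    "https://example.com/search?q=shoes",
    "https://example.com/files/report.pdf",
    "https://example.com/files/pricing.xlsx",
    "https://example.com/pdf-tools/",
]

# Patterns from the table, written as raw strings so the backslashes survive.
patterns = {
    "has_query": r"\?",
    "has_underscore": r"_",
    "has_uppercase": r"[A-Z]",
    "in_blog": r"^https?://[^/]+/blog/.*",
    "is_pdf": r"\.pdf$",
    "is_document": r"\.(pdf|doc|xlsx)$",
}


def filter_urls(pattern: str, candidates: list[str]) -> list[str]:
    """Keep only the URLs where the pattern matches somewhere."""
    compiled = re.compile(pattern)
    return [url for url in candidates if compiled.search(url)]


matches = {name: filter_urls(p, urls) for name, p in patterns.items()}
for name, hits in matches.items():
    print(f"{name}: {hits}")
```

Notice that `/pdf-tools/` is not caught by `\.pdf$`: the escaped dot and the end anchor keep the match tied to the file extension.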

Advanced Regex for SEO: Custom Extraction

Filtering is useful, but custom extraction is where you unlock god mode. Custom extraction allows you to pull any piece of information from a page’s source code during a crawl. While you can often use XPath or CSS selectors, sometimes the data you need isn’t wrapped in a clean HTML element. That’s when you use regex.

In ScreamingCAT, you can configure custom extractors to scrape data at scale. This is perfect for auditing structured data, tracking codes, or any other text string present in the HTML. For a full walkthrough, see our guide on Custom Extraction in ScreamingCAT.

The key is the capturing group `()`. Whatever the pattern matches *inside* the parentheses is what gets extracted.

Warning

Regex is ‘greedy’ by default. The pattern `<p>(.*)</p>` will match everything from the first `<p>` opening tag to the last `</p>` closing tag on the page. Use the non-greedy `?` qualifier, as in `<p>(.*?)</p>`, to stop at the first closing tag.
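
To see greedy and non-greedy matching side by side, here is a small Python sketch. It assumes `<p>` tags as the wrapping element and uses a made-up one-line HTML snippet; the variable names are purely illustrative.

```python
import re

# Hypothetical HTML with two paragraphs on a single line.
html = "<p>First paragraph.</p><p>Second paragraph.</p>"

# Greedy: .* grabs as much as it can, so it runs to the LAST </p>.
greedy = re.search(r"<p>(.*)</p>", html).group(1)

# Non-greedy: .*? grabs as little as it can, so it stops at the FIRST </p>.
lazy = re.search(r"<p>(.*?)</p>", html).group(1)

# With the non-greedy version, findall returns each paragraph separately.
paragraphs = re.findall(r"<p>(.*?)</p>", html)

print("greedy:", greedy)
print("lazy:  ", lazy)
print("all:   ", paragraphs)
```

Keep in mind that in Python (and most engines) the dot does not cross line breaks unless you enable a ‘dot matches newline’ option, so multi-line HTML may behave differently from this one-line sample.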

<!-- Example: Extracting the content attribute from a meta description -->

<!-- HTML Source -->
<meta name="description" content="This is our meta description.">

<!-- Regex Pattern -->
<meta name="description" content="(.*?)">

Extract Schema @type: `"@type":\s*"(.*?)"`
Extract Google Tag Manager ID: `(GTM-[A-Z0-9]+)`
Extract hreflang attributes: `hreflang="([a-z]{2}-[A-Z]{2})"`
Extract canonical URL: “
Extract all inline email addresses: `([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})`
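
Before loading extraction patterns into a crawler, it can help to test them against a saved page. The sketch below does that in Python; the HTML sample, the `EXTRACTORS` names, and the `extract` helper are invented for illustration and are not ScreamingCAT's configuration format.

```python
import re

# Hypothetical page source, trimmed down to the parts we want to extract.
html = """<html><head>
<meta name="description" content="This is our meta description.">
<link rel="alternate" hreflang="en-GB" href="https://example.com/uk/">
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "Article"}</script>
<script>loadTagManager("GTM-AB12CD3");</script>
</head><body>Contact: hello@example.com</body></html>"""

# Each pattern has exactly one capturing group: that group is what gets extracted.
EXTRACTORS = {
    "meta_description": r'<meta name="description" content="(.*?)">',
    "schema_type": r'"@type":\s*"(.*?)"',
    "gtm_id": r"(GTM-[A-Z0-9]+)",
    "hreflang": r'hreflang="([a-z]{2}-[A-Z]{2})"',
    "email": r"([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})",
}


def extract(pattern: str, source: str) -> list[str]:
    """Return the captured group for every match in the source."""
    return re.findall(pattern, source)


results = {name: extract(p, html) for name, p in EXTRACTORS.items()}
for name, values in results.items():
    print(f"{name}: {values}")
```

Real pages are messier than this sample (attribute order changes, single quotes appear, whitespace varies), so treat these patterns as starting points and spot-check the output.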

Putting It All Together: Regex in Your Favorite Tools

Regex isn’t confined to your crawler. Its utility extends across the entire SEO toolkit.

In Google Search Console, you can use regex to filter performance reports. When adding a query filter, switch the match type from ‘Queries containing’ to ‘Custom (regex)’ to create powerful views. For example, you can separate brand from non-brand traffic with a pattern like `mybrand|my brand|myproduct`: ‘Matches regex’ isolates branded queries, and ‘Doesn’t match regex’ shows everything else. This is far more efficient than adding dozens of individual ‘Contains’ filters. Dive deeper with our complete guide to Search Console.
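
If you export query data, the same brand pattern can split it offline. This is a rough Python sketch; the query rows and click counts are made up, and `split_brand` is a hypothetical helper. Search Console uses RE2 syntax rather than Python's engine, but a simple alternation like this one behaves the same in both.

```python
import re

# The brand pattern from the example above.
BRAND = re.compile(r"mybrand|my brand|myproduct")

# Hypothetical (query, clicks) rows from a performance report export.
rows = [
    ("mybrand login", 120),
    ("best running shoes", 300),
    ("my brand reviews", 45),
    ("myproduct pricing", 60),
    ("regex for seo", 80),
]


def split_brand(data):
    """Split rows into (brand, non_brand) lists based on the query text."""
    brand = [row for row in data if BRAND.search(row[0])]
    non_brand = [row for row in data if not BRAND.search(row[0])]
    return brand, non_brand


brand_rows, non_brand_rows = split_brand(rows)
brand_clicks = sum(clicks for _, clicks in brand_rows)
non_brand_clicks = sum(clicks for _, clicks in non_brand_rows)

print("brand clicks:", brand_clicks)
print("non-brand clicks:", non_brand_clicks)
```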

In Google Analytics 4, regex is available when building comparisons, audiences, and filters. You can create a segment for traffic to multiple subdirectories using a pattern like `/blog/|/resources/` on the ‘Page path and screen class’ dimension.

Even your code editor (like VS Code) becomes an SEO tool with regex. Need to rewrite 1,000 URLs in a `.htaccess` file? A single find-and-replace operation using capturing groups can do it in seconds. This is a core component of many automation workflows, which you can explore in our Python for SEO guide.
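
Here is what a capturing-group find-and-replace can look like, sketched in Python with `re.sub`. The redirect rules and the goal (dropping a year folder from the destination path) are invented for the example. In VS Code's replace box the groups are written `$1` and `$2`, while Python uses `\1` and `\2`.

```python
import re

# Hypothetical .htaccess rules whose destinations still contain a year folder.
htaccess = """Redirect 301 /news/2019/site-audit /blog/2019/site-audit
Redirect 301 /news/2020/log-files /blog/2020/log-files"""

# Group 1 keeps the directive and the source path; group 2 keeps the slug.
# The four-digit year between them is matched but not captured, so it is dropped.
pattern = re.compile(r"^(Redirect 301 \S+) /blog/\d{4}/(\S+)$", re.MULTILINE)

# Rebuild each line from the captured pieces and add a trailing slash.
rewritten = pattern.sub(r"\1 /blog/\2/", htaccess)

print(rewritten)
```

The `re.MULTILINE` flag makes `^` and `$` anchor to each line instead of the whole file, which is what you want when every rule sits on its own line.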

Not using regex in GSC is like owning a sports car and never taking it out of first gear. You’re missing the entire point.
An SEO who has seen too many 'Query contains' filters

Common Regex Pitfalls (and How to Not Look Dumb)

You’re going to make mistakes. It’s fine. Here are the most common ones so you can make them less often.

The number one error is forgetting to escape special characters. If your pattern isn’t matching, your first check should be for unescaped dots, question marks, or parentheses. `example.com` will match `exampleXcom`, which is probably not what you wanted. `example\.com` is correct.
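
A quick way to see the escaping problem is to test both versions. The Python sketch below uses made-up strings; `re.escape` is a standard-library helper that escapes a literal string for you, which is handy when the text you want to match is full of dots and question marks.

```python
import re

# Unescaped dot: matches ANY character in that position.
loose = re.compile(r"example.com")
# Escaped dot: matches only a literal dot.
strict = re.compile(r"example\.com")

loose_hits_typo = loose.search("exampleXcom") is not None
strict_hits_typo = strict.search("exampleXcom") is not None

# A literal URL fragment containing two special characters: '.' and '?'.
literal = "example.com/page?id=1"
url = "https://example.com/page?id=1"

# Used as-is, the '?' makes the preceding 'e' optional, so the real URL is missed.
raw_matches_url = re.search(literal, url) is not None
# re.escape adds the backslashes, so the literal text is matched exactly.
escaped_matches_url = re.search(re.escape(literal), url) is not None

print(loose_hits_typo, strict_hits_typo)
print(raw_matches_url, escaped_matches_url)
```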

Another classic is the greedy vs. lazy matching issue we mentioned earlier. If your custom extraction is returning a huge chunk of HTML you didn’t want, you almost certainly used `(.*)` instead of `(.*?)`.

Finally, pay attention to case sensitivity. Most regex engines are case-sensitive by default. `[a-z]` will not match ‘A’. Many tools provide a flag or checkbox to ignore case, which is often what you want when dealing with messy, user-generated URLs.
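
In Python, case-insensitive matching is a flag away. This sketch uses a made-up URL; the `re.IGNORECASE` flag and the inline `(?i)` prefix are Python features, and whether your tool accepts `(?i)` or offers a checkbox instead depends on the tool.

```python
import re

# Hypothetical URL with mixed-case path segments.
url = "https://example.com/Blog/Regex-Guide"

# Default behaviour: case-sensitive, so the lowercase pattern misses '/Blog/'.
sensitive = re.search(r"/blog/", url) is not None

# Option 1: pass the IGNORECASE flag.
with_flag = re.search(r"/blog/", url, re.IGNORECASE) is not None

# Option 2: switch it on inside the pattern with an inline (?i) prefix.
with_inline = re.search(r"(?i)/blog/", url) is not None

print(sensitive, with_flag, with_inline)
```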

Key Takeaways

Regex (Regular Expressions) are patterns used to match and manipulate text, acting as a superpower for technical SEO data analysis.
Mastering a few core syntax elements (`.`, `*`, `[]`, `()`, `^`, `$`) unlocks the majority of regex capabilities for SEO tasks.
Use regex for filtering URLs in crawlers like ScreamingCAT to isolate specific URL patterns, find problematic URLs (e.g., with uppercase characters or parameters), and refine crawl scope.
Leverage custom extraction with capturing groups `()` to scrape any text-based data from a website’s HTML, such as schema, tracking codes, or hreflang attributes.
Apply regex beyond crawlers in tools like Google Search Console and Google Analytics to create sophisticated reports and segments that are impossible with standard filters.
