Jeff Moser's How .NET Regular Expressions Really Work

Jeff Moser has done an in-depth study of how regular expressions work in .NET. His article covers the core operating principles of Microsoft’s implementation, such as the internal "machine code" that regular expressions are translated into.

The first thing he reveals is that the last 15 regular expressions used through the static Regex methods are cached. For those little utility applications that only use one or two expressions, this means explicitly creating a Regex object is probably not necessary.

When compiling a regular expression, the first step consists of a scanner that emits a RegexTree. Looking at just the leaf nodes, this resembles the source pattern to a fair extent. Next, the tree is translated into the machine code of the regular expression engine.

The bulk of the work is done by the 250-line switch statement that makes up the EmitFragment function. This function breaks up RegexTree "fragments" and converts them to the simpler RegexCode.
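To make the tree-to-code step concrete, here is a toy sketch in Python of a recursive emitter that flattens a small tree into an integer array plus a string table. It is not the .NET implementation: the `Node` and `Emitter` classes, the node kinds, and the sample tree are all hypothetical. Only the op code numbers are taken from Moser's decoded listing.

```python
from dataclasses import dataclass, field

# Op code numbers as they appear in Moser's decoded listing; everything
# else here (class names, node kinds, tree shape) is hypothetical.
ONELOOP, MULTI, SETMARK, CAPTUREMARK, STOP = 3, 12, 31, 32, 40


@dataclass
class Node:
    kind: str                      # "concat", "multi", "capture", or "oneloop"
    value: object = None           # literal text, group number, or (char, max)
    children: list = field(default_factory=list)


@dataclass
class Emitter:
    codes: list = field(default_factory=list)
    strings: list = field(default_factory=list)

    def intern(self, text):
        """Return the string-table index for text, adding it if needed."""
        if text not in self.strings:
            self.strings.append(text)
        return self.strings.index(text)

    def emit_fragment(self, node):
        """Append the flat instructions for one tree node."""
        if node.kind == "concat":
            for child in node.children:
                self.emit_fragment(child)
        elif node.kind == "multi":
            # String arguments are passed as string-table indexes.
            self.codes += [MULTI, self.intern(node.value)]
        elif node.kind == "capture":
            self.codes.append(SETMARK)
            for child in node.children:
                self.emit_fragment(child)
            self.codes += [CAPTUREMARK, node.value, -1]
        elif node.kind == "oneloop":
            char, maximum = node.value
            self.codes += [ONELOOP, ord(char), maximum]
        else:
            raise ValueError(f"unknown node kind: {node.kind}")


# A made-up tree: literal "http://", a capture group around "www", optional "/".
tree = Node("concat", children=[
    Node("multi", "http://"),
    Node("capture", 1, [Node("multi", "www")]),
    Node("oneloop", ("/", 1)),
])

emitter = Emitter()
emitter.emit_fragment(tree)
emitter.codes.append(STOP)
print(emitter.codes)
print(emitter.strings)
```

The point of the sketch is the shape of the output: a nested structure goes in, and a flat list of integers comes out, with any string operand replaced by its position in a side table.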

[…]

The reward for all this work is an integer array that describes the RegexCode "op codes" and their arguments. You can see that some instructions like "Setrep" take a string argument. These arguments point to offsets in a string table. This is why it was critical to pack everything about a set into the obscure string we saw earlier. It was the only way to pass that information to the instruction.

Decoding the code array, we see:

Index	Instruction	Op Code/Argument	String Table Reference	Description
0	Lazybranch	23		Lazily branch to the Stop instruction at offset 21.
1		21
2	Setmark	31		Push our current state onto a stack in case we need to backtrack later.
3	Multi	12		Perform a multi-character match of string table item 0 which is "http://".
4		0	"http://"
5	Setmark	31		Push our current state onto a stack in case we need to backtrack later.
6	Setrep	2		Perform a set repetition match of length 1 on the set stored at string table position 1, which represents [^\s/].
7		1	"\x1\x2\x1\x2F\x30\x64"
8		1
9	Setloop	5		Match the set [^\s/] in a loop at most Int32.MaxValue times.
10		1	"\x1\x2\x1\x2F\x30\x64"
11		2147483647
12	Capturemark	32		Capture into group #1, the string between the mark set by the last Setmark and the current position.
13		1
14		-1
15	Oneloop	3		Match Unicode character 47 (a '/') in a loop for a maximum of 1 time.
16		47
17		1
18	Capturemark	32		Capture into group #0, the contents between the first Setmark instruction and the current position.
19		0
20		-1
21	Stop	40		Stop the regex.

We can now see that our regex has turned into a simple "program" that will be executed later.
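The decoding itself is mechanical, as the following Python sketch suggests. The integers are copied from the "Op Code/Argument" column of the listing above. The operand counts are inferred from that listing rather than taken from the .NET source, and the `decode` helper is a hypothetical name.

```python
# Op code number -> (name, operand count), inferred from the listing above.
OPCODES = {
    2: ("Setrep", 2),
    3: ("Oneloop", 2),
    5: ("Setloop", 2),
    12: ("Multi", 1),
    23: ("Lazybranch", 1),
    31: ("Setmark", 0),
    32: ("Capturemark", 2),
    40: ("Stop", 0),
}

# The "Op Code/Argument" column of the listing, read top to bottom.
CODES = [23, 21, 31, 12, 0, 31, 2, 1, 1, 5, 1, 2147483647,
         32, 1, -1, 3, 47, 1, 32, 0, -1, 40]


def decode(codes):
    """Return a list of (offset, name, operands) tuples."""
    program = []
    offset = 0
    while offset < len(codes):
        name, operand_count = OPCODES[codes[offset]]
        operands = codes[offset + 1 : offset + 1 + operand_count]
        program.append((offset, name, operands))
        # Skip past the op code and its operands to the next instruction.
        offset += 1 + operand_count
    return program


program = decode(CODES)
for offset, name, operands in program:
    print(f"{offset:2d}  {name:<12} {operands}")
```

Each instruction starts at the offset shown in the Index column of the listing, which is why the Lazybranch operand of 21 lands exactly on the Stop instruction.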

You can read more about this process on Jeff Moser's blog. His article also covers:

Prefix Optimizations
The Interpreter
Backtracking
Known Bugs
