Tracing the Slowest PCRE Regular Expressions in OpenResty or Nginx Processes
Regular expressions are ubiquitous in web applications, including Lua applications running atop OpenResty or Nginx (with OpenResty’s core components like the Lua Nginx module). OpenResty provides the ngx.re.* Lua API functions, such as ngx.re.match and ngx.re.find, to expose the PCRE regular expression library’s C API to Lua land. PCRE provides a very efficient implementation of Perl-compatible regular expressions with a Just-in-Time (JIT) compiler, and it is widely used by well-known systems software, including the Nginx core and the PHP runtime.
However, many web developers pay little attention to the efficiency of their regular expressions. Poorly written regular expressions may exhibit pathological behavior, especially in backtracking implementations like PCRE. Worse, such pathological behavior may only be triggered by relatively large input data with particular characteristics, and DoS attackers may even deliberately target such vulnerable regular expression patterns.
Fortunately, our OpenResty XRay product provides dynamic-tracing tools that can analyze online OpenResty and Nginx processes in real time and quickly locate those vulnerable and slow regular expressions. No special plugins or modules are needed in the target OpenResty or Nginx applications, and absolutely no overhead is incurred when OpenResty XRay is not sampling.
In this tutorial, we will demonstrate such advanced analyzers with real-world examples. In the end, we’ll also look at alternatives with fewer pathological behaviors, including RE2 and the OpenResty Regex engine.
System Environment
Here we use a RedHat Enterprise Linux 7 system as an example. Any Linux distribution supported by OpenResty XRay should work equally well, including Ubuntu, Debian, Fedora, Rocky, and Alpine.
We use an open-source OpenResty binary build as the target application. You can use any OpenResty or Nginx binaries, including those compiled by yourself. No special build options, plugins, or libraries are needed in your existing server installation or processes. That is the beauty of dynamic-tracing technology: it is genuinely non-invasive.
We also have OpenResty XRay’s Agent daemon running on the same system, and the command-line utilities from the openresty-xray-cli package installed and configured.
A Sample OpenResty/Nginx Lua Application
We have an inefficient OpenResty/Nginx Lua application that tops out at 100% CPU usage on a server, as shown by the top command-line utility:
$ top -p 3584441
...
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3584441 nobody 20 0 41284 7184 4936 R 100.0 0.0 1:43.97 nginx
Note that the Nginx process’s CPU usage is 100.0%, and its PID is 3584441. We’ll reference this PID in our later dynamic-tracing analysis.
Narrowing Down the Culprit without Guessing
C-Land CPU Flame Graphs
To analyze any C/C++ processes (including Nginx processes) with high CPU usage, we should first obtain a C-land CPU flame graph[1] from the process. Although OpenResty XRay can automatically detect OpenResty/Nginx applications, sample them, and generate various kinds of flame graphs, here we will run the command-line utilities manually, solely for demonstration purposes.
$ orxray analyzer run lj-c-on-cpu -p 3584441
Start tracing...
Go to https://x5vrki.xray.openresty.com/targets/68/history/664712 for charts.
We use the -p option to specify the target Nginx worker process’s PID, and we run the standard OpenResty XRay tool named lj-c-on-cpu. We don’t use the more generic c-on-cpu tool here because the former can automatically unwind dynamic machine-code frames generated by LuaJIT’s and PCRE’s Just-in-Time (JIT) compilers.
From this C-land CPU flame graph, we can see that

- almost all the CPU time is spent executing Lua code (note the highlighted ngx_http_lua_run_thread C function frame in the flame graph);
- the Lua code is in turn running the PCRE matcher (note the pcre_exec function frame in the graph);
- furthermore, the PCRE JIT’d code is being executed (note the _pcre_jit_exec function frame in the graph).
Note the power of this dynamic-tracing tool: it transparently unwinds JIT’d code frames without any help from the target applications[2].
The next step is sampling and generating a Lua-land CPU flame graph so that we can see what is going on in the Lua code.
Lua-Land CPU Flame Graphs
To generate a Lua-land CPU flame graph, we can use the standard analyzer lj-lua-on-cpu, as in
$ orxray analyzer run lj-lua-on-cpu -p 3584441
Start tracing...
Go to https://x5vrki.xray.openresty.com/targets/68/history/664870 for charts.
The Lua-land flame graph also shows JIT’d Lua code, as evidenced by the trace#3:regex.lua:721 frame in the graph. That frame denotes a LuaJIT trace object with trace ID 3, which starts at line 721 of the source file regex.lua. We can see that almost all the CPU time is spent on a single Lua code path. The most interesting Lua function frame on this dominating Lua code path is regex.lua:re_match. In that Lua file, we have
local re_match = ngx.re.match
So the re_match symbol name points to the ngx.re.match API function.
Then the next question becomes: which regular expression is so slow? We could directly look up the Lua source line at content_by_lua(nginx.conf:69):10, as shown in the flame graph. But if the Lua code is generic and simply loops through a Lua table holding multiple regular expressions, that alone is not sufficient to pinpoint the regular expression in question.
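For illustration only, here is a minimal hypothetical sketch of what such generic matching code could look like. The application’s actual source is not shown in this tutorial, so the function and variable names below are made up; the patterns are simply the ones the tracing tool reports in the next section.

-- Hypothetical sketch only; not the application's real code.
local re_match = ngx.re.match

local regexes = {
    [[\w]],
    [[(?:.*)*css]],
    [[.*?js]],
}

local function scan(content)
    for i = 1, #regexes do
        -- A single source line runs every pattern in the table, so the
        -- flame graph's file:line alone cannot tell which regex is slow.
        local m, err = re_match(content, regexes[i], "jo")
        if err ~= nil then
            ngx.log(ngx.ERR, "match failed, re: ", regexes[i], ", error: ", err)
            return
        end
        -- ... use the match result m here ...
    end
end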
The Slowest Regular Expressions
Knowing that regular expression matching takes most of the CPU time, we can now use another standard analyzer named lj-slowest-ngx-re to find the slowest regular expressions matched in the current live process, as in
$ orxray analyzer run lj-slowest-ngx-re -p 3584441 --y-var threshold_ns=1 --y-var max_hits=1000
Start tracing...
sub_regex_cache_count: 0
http: match
cache: ptr: 0x154bbc0, type: match, pat: "\w"
cache: ptr: 0x1517890, type: match, pat: "(?:.*)*css"
cache: ptr: 0x154a3a0, type: match, pat: ".*?js"
http: substitution
http: function substitution
...WARNING: reach samples count: 1000
=== start ===
max latency: 15844343 ns, ptr: 0x1517890
max latency: 13555 ns, ptr: 0x154a3a0
max latency: 8075 ns, ptr: 0x154bbc0
=== finish ===
Go to https://x5vrki.xray.openresty.com/targets/68/history/799249 for charts.
We open the URL in a web browser to browse the resulting bar chart shown below.
This bar chart shows the maximum execution latency of the slowest regular expressions run inside the sampling time window. The latency is in nanoseconds. Obviously, (?:.*)*css is by far the slowest regular expression, taking more than 20.6 milliseconds (or 20,671,282 nanoseconds), way slower than the other ones. Knowledgeable users can quickly see that a greedy quantifier like .* combined with a constant suffix can result in aggressive backtracking in a backtracking regular expression engine. We can optimize the regular expression by using a non-greedy version of the quantifier instead: (?:.*?)css.
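To see the difference concretely, here is a rough micro-benchmark sketch that can be run with OpenResty’s resty command-line utility. The subject string is made up and deliberately contains no “css”, so a backtracking engine has to explore many ways to split the input before giving up.

-- Rough illustrative micro-benchmark; run with OpenResty's `resty` utility.
-- The subject string is made up; real results depend on input and hardware.
local subject = string.rep("a", 2048)   -- no "css" anywhere, so no match

local function bench(pat)
    local start = os.clock()
    local m, err = ngx.re.match(subject, pat, "jo")  -- "jo": PCRE JIT + compile-once cache
    local elapsed = os.clock() - start
    print(string.format("%-12s matched=%-5s err=%-25s cpu=%.6fs",
                        pat, tostring(m ~= nil), tostring(err), elapsed))
end

bench([[(?:.*)*css]])  -- pathological: heavy backtracking; may even hit
                       -- PCRE's backtracking limit and return an error
bench([[(?:.*?)css]])  -- optimized: fails quickly on this subject

The exact numbers depend on the input and the machine, but the pathological pattern will either take dramatically longer or abort with a pcre_exec() error once PCRE’s backtracking limit is exceeded, while the non-greedy version fails almost instantly.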
After Optimizing the Slowest Regex
After optimizing the slowest regular expression (?:.*)*css to (?:.*?)css, we reload the application and regenerate a new C-land CPU flame graph.
We can see that the shape and characteristics of the new flame graph are very different from the previous one. The bottleneck is no longer the code paths executing Lua code. Instead, it is now the writev system call responsible for writing out the HTTP response data (highlighted in red in the graph above). Note the ngx_http_output_filter C function frame in the graph; this function is responsible for invoking the Nginx output filter chain to send out the HTTP response header and body data.
Let’s rerun the lj-slowest-ngx-re tool against the optimized application. The resulting chart is as follows.
We can see that the maximum latency of the regular expression is much shorter, merely 58,660 nanoseconds, or about 59 microseconds.
Limiting the Execution Overhead of PCRE
OpenResty’s Lua Nginx module provides the lua_regex_match_limit directive to put a hard limit on the number of basic matching operations PCRE may perform for a single regular expression match.
For example, we can put the following line into the nginx.conf file:
lua_regex_match_limit 100000;
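In context, this directive goes at the http {} level of nginx.conf. Below is a minimal, made-up sketch of where it fits; the server details are illustrative only.

http {
    # Cap the basic matching steps PCRE may take per regex match; when the
    # cap is exceeded, ngx.re.* calls return an error instead of burning CPU.
    lua_regex_match_limit 100000;

    server {
        listen 80;
        # ... Lua handlers such as content_by_lua_block go here ...
    }
}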
And make sure we log any error returned by the ngx.re.match API function call:
local m, err = re_match(content, regexes[i], "jo")
if err ~= nil then
    ngx.log(ngx.ERR, "re_match failed, re: ", regexes[i], ", error: ", err)
    return
end
When the match limit is exceeded, we will get the following error log message:
2022/07/13 21:05:13 [error] 3617128#3617128: *1 [lua] content_by_lua(nginx.conf:69):11: re_match failed, re: (.*)*css, error: pcre_exec() failed: -8, client: 127.0.0.1, server: localhost, request: "GET / HTTP/1.1", host: "127.0.0.1"
Non-Backtracking Regular Expression Engines
PCRE supports nearly all of Perl’s regular expression features. That flexibility comes at the cost of backtracking. Users may also want to consider non-backtracking regular expression implementations like Google’s RE2 or OpenResty Inc.’s commercial OpenResty Regex engine, which usually employ algorithms based on automata theory. Unfortunately, RE2’s average matching time is about 50% slower than PCRE’s for typical regular expressions, though it has few pathological cases of slow matching. The OpenResty Regex engine, on the other hand, achieves average performance similar to PCRE’s (sometimes even faster) and is also free of backtracking issues.
Lua’s Built-in String Patterns
The standard Lua 5.1 language defines its own string pattern syntax, supported by its standard API functions string.match and string.find. OpenResty XRay also provides dynamic-tracing tools for analyzing such pattern-matching operations; we will cover this topic in another tutorial.
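For reference, here is a tiny example of those built-in pattern functions; the path string is just made up for illustration.

-- Lua patterns are not PCRE: "%." escapes a literal dot and "$" anchors
-- the match at the end of the subject string.
local path = "/static/app.min.css"            -- made-up sample input
print(string.find(path, "%.css$"))            -- prints: 16   19
print(string.match(path, "([^/]+)%.css$"))    -- prints: app.min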
Tracing Applications inside Containers
OpenResty XRay’s tools support tracing containerized applications transparently; both Docker and Kubernetes (K8s) containers work out of the box. Just as with ordinary processes, the target containers do not need any extra software or special privileges. The OpenResty XRay Agent daemon should run outside the target containers (for example, directly in the host operating system, or in its own privileged container).
Let’s see an example. We first check the container name or container ID with the docker ps command.
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4465297209d9 openresty/openresty:1.19.3.1-2-alpine-fat "/usr/local/openrest…" 18 months ago Up 45 minutes angry_mclaren
Here the container name is angry_mclaren. We can then find out the target process’s PID in this container:
$ docker top angry_mclaren
UID PID PPID C STIME TTY TIME CMD
root 3605106 3605085 0 19:40 ? 00:00:00 nginx: master process /usr/local/openresty/bin/openresty -g daemon off;
nobody 3605692 3605106 0 19:44 ? 00:00:20 nginx: worker process
The PID of the openresty worker process is 3605692. We then run the OpenResty XRay analyzer against this PID as usual.
$ orxray analyzer run lj-slowest-ngx-re -p 3605692
Start tracing...
...
Go to https://x5vrki.xray.openresty.com/targets/68/history/684552 for charts.
OpenResty XRay is also able to automatically detect long-running processes as “applications” of a particular type (like “OpenResty”, “Python”, etc.).
How the Tools Are Implemented
All the tools are implemented in the Y language. OpenResty XRay executes them with either its Stap+[3] or eBPF backend, both of which use 100% non-invasive dynamic-tracing technologies based on the Linux kernel’s uprobes and kprobes facilities.
We don’t require any collaboration from the target applications or processes. No log data or metrics data is used or needed. We directly analyze the running processes’ own memory space in a strictly read-only way, and we never inject any bytecode or other executable code into the target processes. It is 100% clean and safe.
The Overhead of the Tools
The dynamic-tracing tool lj-slowest-ngx-re demonstrated in this tutorial is very efficient and suitable for online execution.
When the tools are not running and actively sampling, the overhead on the system and the target processes is strictly zero. We never inject any extra code or plugins into the target applications and processes; thus, there is no inherent overhead.
During sampling, the request latency only increases by about 7 microseconds (us) on average on typical server hardware. And the reduction in the maximum request throughput for the fastest OpenResty/Nginx server serving tens of thousands of requests per second on each CPU core is not measurable.
About The Author
Yichun Zhang (GitHub handle: agentzh) is the original creator of the OpenResty® open-source project and the CEO of OpenResty Inc.
Yichun is one of the earliest advocates and leaders of “open-source technology”. He has worked at many internationally renowned tech companies, such as Cloudflare and Yahoo!. He is a pioneer of “edge computing”, “dynamic tracing”, and “machine coding”, with over 22 years of programming experience and 16 years of open-source experience. Yichun is well-known in the open-source space as the project leader of OpenResty®, which is adopted by more than 40 million global website domains.
OpenResty Inc., the enterprise software start-up founded by Yichun in 2017, has customers from some of the biggest companies in the world. Its flagship product, OpenResty XRay, is a non-invasive profiling and troubleshooting tool that significantly enhances and utilizes dynamic tracing technology. And its OpenResty Edge product is a powerful distributed traffic management and private CDN software product.
As an avid open-source contributor, Yichun has contributed more than a million lines of code to numerous open-source projects, including the Linux kernel, Nginx, LuaJIT, GDB, SystemTap, LLVM, and Perl. He has also authored more than 60 open-source software libraries.
1. OpenResty XRay’s C-land CPU flame graphs are much more powerful than most open-source solutions. They support unwinding JIT’d machine code without any collaboration from the target processes, and they also support inlined C functions and detailed source file names and line numbers. ↩︎
2. The open-source Linux perf tool chain has limited JIT’d-code unwinding support that requires the target application to create a special perf.map file. That is special help and collaboration from the target application, and it may lead to file data leaks and extra runtime overhead even when the user does not need to sample or profile. ↩︎
3. Stap+ is OpenResty Inc.’s greatly enhanced version of SystemTap. ↩︎