25.3. Regular expressions

Figure 25-4. Pattern matching with a regular expression.

Pattern matching with a regular expression.

Regular expressions are used as “templates” that match patterns of text. For example, the regular expression for the pattern matching in Figure 25-4 is[1]

\([Ii]f \|and \)*\(<i>[AC]\+<\/i>.\)\(and\)\?

To understand any URL manipulation solution to the problem of non-search-engine-friendly URLs, you have to get acquainted with Regular Expressions. To get you started, read Using Regular Expressions and Matching Patterns in Text. We can only touch the basics here, for which we use material taken from A Brief Introduction to Regular Expressions:

An expression is a string of characters. Those characters that have an interpretation above and beyond their literal meaning are called metacharacters. A quote symbol, for example, may denote speech by a person, ditto, or a meta-meaning for the symbols that follow. Regular Expressions are sets of characters and/or metacharacters that UNIX endows with special features.

The main uses for Regular Expressions (REs) are text searches and string manipulation. An RE matches a single character or a set of characters (a substring or an entire string).

What does the above tell us when we encounter a cryptic mod_rewrite directive that looks like the following?

RewriteEngine on
RewriteRule ^page1\.html$ page2.html [R=301,L]

Of course, the first line is easy: mod_rewrite is not enabled by default, so this line starts the “Rewrite Engine”. The second directive is a “Rewrite Rule” that instructs mod_rewrite to translate whatever URL is matched by the regular expression “^page1\.html$” to “page2.html”.

What URLs does the regular expression “^page1\.html$” match?

In this example, adapted from An Introduction to Redirecting URLs on an Apache Server, we have a caret at the beginning of the pattern, and a dollar sign at the end. These are regex special characters called anchors. The caret tells regex to begin looking for a match with the character that immediately follows it, in this case a "p". The dollar sign anchor tells regex that this is the end of the string we want to match. In our simple example, "page1\.html" and "^page1\.html$" are interchangable expressions and match the same string. However, "page1\.html" matches any string containing "page1.html" (apage1.html for example) anywhere in the URL, but "^page1\.html$" matches only a string which is exactly equal to "page1.html". In a more complex redirect, anchors (and other special regex characters) are often essential.

Putting all the above together, we can see that “^page1\.html$” matches URLs that start (the caret -- ^ --) with “page1”, immediately followed by a literal dot (escaped dot --\.--, as opposed to a simple tot, which is a metacharacter that matches any single character except newline), immediately followed by “html” and the end of the URL (dollar sign --$--).

In our example, we also have an "[R=301,L]". These are called flags in mod_rewrite and they're optional parameters. "R=301" instructs Apache to return a 301 status code with the delivered page and, when not included as in [R,L], defaults to 302. The "L" flag tells Apache that this is the last rule that it needs to process, IF the RewriteRule pattern is matched. Experts suggest that you get in the habit of including the "L" flag with every RewriteRule to avoid unpleasant surprises.

One powerful option in creating search patterns is specifying that a subexpression that was matched earlier in a regular expression is matched again later in the expression. We do this using backreferences. Backreferences are named by the numbers 1 through 9, preceded by the backslash/escape character when used in this manner (in mod_rewrite, you have to use the dollar sign instead of the backslash, but in PHP you will use the backslash, so don't get confused, it just depends on the context the regular expression is in). These backreferences refer to each successive group in the match pattern, as in /(one)(two)(three)/\1\2\3/ (or $1, $2 and $3 for mod_rewrite). Each numbered backreference refers to the group that has the word corresponding to the number.

Thus the following URL translation:

#Your Account
RewriteRule ^userinfo-([a-zA-Z0-9_-]*)\.html
modules.php?name=Your_Account&op=userinfo&username=$1

in the .htaccess file (Section 25.4) will match any URL that starts (carret --^--) with “userinfo-”, immediately followed by any number (star --*--) of characters belonging to the alphanumeric class (a-z, A-Z, 0-9), including underscores (_) and dashes (-), followed by a literal dot (an escaped dot --\.--) and “html”. The Rewrite Rule instructs mod_rewrite to translate ther URL to

modules.php?name=Your_Account&op=userinfo&username=$1

where $1 is a backreference, referring to the first matched subexpression, the one inside the parenthesses (). Since inside the parenthesses is a regular expression that matches “any number of characters belonging to the alphanumeric class, including underscores and dashes”, $1 will contain whatever alphanumeric characters were between “userinfo-” and “.html” (including underscores and dashes). In PHP-Nuke, this is the username, so that the URL returned by mod_rewrite will be

modules.php?name=Your_Account&op=userinfo&username=(some matched username)

thus completing the transformation of a static URL (that PHP-Nuke does not understand), to a dynamic one that makes perfectly sense to PHP-Nuke (see Section 25.5.1.3 for the complete picture).

Notes

[1]

The regular expression matches the HTML code for the text shown in Figure 25-4, where the capital letters A and C were enclosed in <i> tags. This makes it look more formidable than it actually is.