morzel.net

.net, js, html, arduino, java... no rants or clickbaits.

Easy way to fix outdated links (URL rewrite rule in Web.config)

INTRO

I’ve recently moved my site from BlogEngine.NET 2.0 to 3.3 – thanks to the great work done by the BlogEngine.NET team the migration was easy... The only serious problem I’ve noticed was with post links ending with .aspx. For example, when Google or CodeProject had a link to a URL like this:

http://en.morzel.net/post/2014/09/24/OoB-Sonar-with-Arduino-C-JavaScript-and-HTML5-(Part-2).aspx

the post was not found. If the .aspx suffix was removed from the address:

http://en.morzel.net/post/2014/09/24/OoB-Sonar-with-Arduino-C-JavaScript-and-HTML5-(Part-2) 

everything was working fine! Fortunately fixing it didn’t require any BlogEngine code changes – all thanks to URL Rewrite Module 2.0 (available since IIS 7) and the system.webServer/rewrite Web.config section.

URL Rewrite is a big topic. Check out the http://www.iis.net/learn/extensions/url-rewrite-module docs if you want to know all the details – you can even do things like setting HTTP headers or server variables! In this post I will focus on how to fix the .aspx link problem and I will also note some issues you might face while trying to set up your own URL rewrite rules…

 

SETTING THE RULES

I’ve added the following rewrite section inside the system.webServer node in the Web.config file:

<rewrite>
    <rules>
        <rule name="FixOldAspxLinks" stopProcessing="true">
            <match url="^(.*post/.+)\.aspx$" />
            <action type="Redirect" url="{R:1}" redirectType="Permanent" />
        </rule>
    </rules>
</rewrite>

It has a single rule that matches every address that contains post/ and ends with .aspx, and triggers a redirect to the same address with the .aspx part dropped.

The rule

The rule has a name (something describing the purpose of the rule is welcome) and the stopProcessing="true" setting, which instructs IIS to skip any further rules for a matched URL (yes, there's only one rule, but having stopProcessing="true" makes our intention clear).

The match

If you are familiar with regular expressions the url="^(.*post/.+)\.aspx$" attribute should be obvious. If not, don’t worry, it’s simpler than it looks:

  • ^ – means beginning of URL
  • $ – means the end of URL
  • .* – means any character zero or more times
  • .+ – means any character at least once
  • \. – means a literal dot (in regexes . stands for any character so if we literally want to look for a dot we need to escape the special meaning by preceding it with backslash)
  • () – parentheses denote a capturing group: the text that we are going to reference in the action element by using {R:1}

The matching expression could be written in many ways but the one I’ve used solves the problem without going overboard with URL pattern recognition…
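The pattern can also be tried outside IIS. Below is a minimal sketch that uses the .NET Regex class as a stand-in for the rewrite module's own regex engine (an assumption: the two should agree on a pattern this simple). The AspxLinkPattern class and GetRedirectTarget method are hypothetical names made up for this illustration. Keep in mind that IIS matches the rule against the URL path without the host name, leading slash or query string, so test inputs should be written the same way.

```csharp
using System.Text.RegularExpressions;

public static class AspxLinkPattern
{
    // Same pattern as in the <match> element of the rule.
    private static readonly Regex OldLink = new Regex(@"^(.*post/.+)\.aspx$");

    // Returns the text of group 1 (what the action calls {R:1}),
    // or null when the pattern does not match and the rule would not fire.
    public static string GetRedirectTarget(string path)
    {
        Match match = OldLink.Match(path);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

Calling AspxLinkPattern.GetRedirectTarget("post/2014/09/24/Some-Title.aspx") gives "post/2014/09/24/Some-Title", while a path without post/ or without the .aspx suffix gives null.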

The action

We want the browser to look for a new address, hence type="Redirect" is set.
The new address is specified with url="{R:1}". The {R:1} is a reference to the group captured by the matching expression – its value is the text found between the parentheses. In our case it’s everything that preceded the .aspx suffix. redirectType="Permanent" instructs the server to issue a 301 Moved Permanently response to the browser. An HTTP client that receives a permanent redirect may remember it and use the new URL each time it sees a link to the old URL…

Ok, so the above rewrite should be all that’s needed to make the .aspx problem disappear! Doesn’t work on your machine? Read on!

 

POSSIBLE ISSUES

No URL Rewrite module installed

Before pushing any changes to the remote server I wanted to check the rewrite settings on my local IIS 7.5 on Windows 7 x64. I did it and instead of a redirect I got HTTP Error 500.19 – Internal Server Error. The error page was useless as it didn’t show any hint on what was wrong with the config... If you face the same issue you probably don’t have the IIS Rewrite module installed (it is not added by default). A quick way to find out if you have the module is to check if this file exists:

%windir%\System32\inetsrv\config\schema\rewrite_schema.xml

I got the installer from here: https://www.microsoft.com/en-us/download/details.aspx?id=7435. After the module was added to IIS the rewrite rule started to work :)
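The file check can be scripted too. This is only a sketch: the RewriteModuleCheck class is a made-up name, and the only thing it knows about the module is the schema file path mentioned above. One caveat to be aware of: a 32-bit process on 64-bit Windows has its System32 access redirected, so the check is best run from a 64-bit process.

```csharp
using System;
using System.IO;

public static class RewriteModuleCheck
{
    // Expands %windir% in the schema file path mentioned in the post.
    public static string GetSchemaPath()
    {
        return Environment.ExpandEnvironmentVariables(
            @"%windir%\System32\inetsrv\config\schema\rewrite_schema.xml");
    }

    // True when the schema file exists, which suggests the module is installed.
    public static bool IsModuleInstalled()
    {
        return File.Exists(GetSchemaPath());
    }
}
```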

Redirect caching

The rewrite rule uses redirectType="Permanent" because we want to teach HTTP clients that the resource has moved for good, right? That's all OK unless you are still developing and changing the rule: if the browser has already received a 301 response for a particular URL, your modified rule will not get a chance to work! To solve this problem you can clear the cache, but I prefer to have Chrome's dev tools open (with caching disabled) or to open the page in a fresh incognito window…
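Another way to look at what the server really answers, without a browser in the way, is to send the request from code and not follow the redirect. The sketch below assumes HttpClient is available (.NET 4.5 or later); RedirectProbe and DescribeAsync are hypothetical names, and the URL you pass in is up to you.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class RedirectProbe
{
    // Sends a single GET without following redirects and describes the answer
    // as "<status code>" or "<status code> -> <Location header>".
    public static async Task<string> DescribeAsync(string url)
    {
        var handler = new HttpClientHandler { AllowAutoRedirect = false };
        using (var client = new HttpClient(handler))
        using (HttpResponseMessage response = await client.GetAsync(url))
        {
            int status = (int)response.StatusCode;
            Uri location = response.Headers.Location;
            return location == null
                ? status.ToString()
                : status + " -> " + location;
        }
    }
}
```

For an old .aspx post link on a site with a working rule, the result should start with 301 and end with the address without the suffix.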

Pattern testing

Regular expressions are a powerful tool but it's very easy to make a mistake while working with them. Fortunately IIS Rewrite Module has its own panel (snap-in) in IIS Manager:

[Screenshot: URL Rewrite module in IIS Manager]

that lists rewrite rules used for the site:

[Screenshot: Rewrite rule in IIS Manager]

If you double click a rule, you will see a window that lets you change rule properties without manual modifications to Web.config. Pressing the "Test pattern..." button opens a window in which you can quickly test your regular expression:

[Screenshot: Pattern test in IIS Manager]

Short but very useful regex – lookbehind, lazy, group and backreference

Recently, I wanted to extract calls to an external system from log files and do some LINQ to XML processing on the obtained data. Here’s a sample log line (simplified, the real log was way more complicated but it doesn’t matter for this post):

Call:<getName seqNo="56789"><id>123</id></getName> Result:<getName seqNo="56789">John Smith</getName>

I was interested in XML data of the call:

<getName seqNo="56789">
  <id>123</id>
</getName>

Quick tip: a super-easy way to get such nicely formatted XML in .NET 3.5 or later is to invoke the ToString method on an XElement object:

var xml = System.Xml.Linq.XElement.Parse(someUglyXmlString);     
Console.WriteLine(xml.ToString());

When it comes to the log, some things were certain:

  • the call’s XML will be logged after the “Call:” text at the beginning of the line
  • the call’s root element name will contain only alphanumeric characters or underscores
  • there will be no line breaks in the call’s data
  • the call’s root element name may also appear in the “Result” section

Getting to the proper information was quite easy thanks to Regex class:

Regex regex = new Regex(@"(?<=^Call:)<(\w+).*?</\1>");
string call = regex.Match(logLine).Value;

This short regular expression has a couple of interesting parts. It may not be perfect but it proved really helpful in log analysis. If this regex is not entirely clear to you, read on; you will need to use something similar sooner or later.

Here’s the same regex with comments (RegexOptions.IgnorePatternWhitespace is required to process an expression commented this way):

string pattern = @"(?<=^Call:) # Positive lookbehind for call marker
                   <(\w+)      # Capturing group for opening tag name
                   .*?         # Lazy wildcard (everything in between)
                   </\1>       # Backreference to opening tag name";   
Regex regex = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);
string call = regex.Match(logLine).Value;
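To tie the pieces together, here is a minimal sketch that runs the regex on a log line and hands the result to LINQ to XML. CallExtractor and ExtractCall are hypothetical names used only for this illustration; returning null for lines without a call is my assumption about how such lines should be handled.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class CallExtractor
{
    // The regex from the post, compiled once and reused for every line.
    private static readonly Regex CallRegex =
        new Regex(@"(?<=^Call:)<(\w+).*?</\1>");

    // Returns the call's XML as an XElement,
    // or null when the line does not start with a logged call.
    public static XElement ExtractCall(string logLine)
    {
        Match match = CallRegex.Match(logLine);
        return match.Success ? XElement.Parse(match.Value) : null;
    }
}
```

For the sample log line, ExtractCall returns a getName element whose id child holds 123; calling ToString on it gives the nicely formatted XML shown earlier.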

Positive lookbehind

(?<=^Call:) is a lookaround or, more precisely, a positive lookbehind. It’s a zero-width assertion that lets us check whether some text is preceded by another text. Here “Call:” at the beginning of the line (that's what ^ stands for) is the preceding text we are looking for. (?<=something) denotes a positive lookbehind. There is also a negative lookbehind, expressed by (?<!something). With a negative lookbehind we can match text that doesn’t have a particular string before it. A lookaround checks a fragment of the text but doesn't become part of the match value. So the result of this:

Regex.Match("X123", @"(?<=X)\d*").Value

Will be "123" rather than "X123".

.NET regex engine has lookaheads too. Check this awesome page if you want to learn more about lookarounds. 

Note: In some cases (like in our log examination example) instead of using a positive lookaround we may use a non-capturing group...
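A short sketch of both kinds of lookbehind, plus one possible reading of the note above. The class and method names are made up for illustration. The third method is my interpretation and not code from the original analysis: it drops the lookbehind, lets the pattern consume "Call:" and wraps the XML in its own capturing group, so the tag name moves to group 2.

```csharp
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Positive lookbehind: digits that directly follow an X.
    public static string DigitsAfterX(string input)
    {
        return Regex.Match(input, @"(?<=X)\d+").Value;
    }

    // Negative lookbehind: a run of digits that is not directly
    // preceded by an X (or by another digit of a run that is).
    public static string DigitsNotAfterX(string input)
    {
        return Regex.Match(input, @"(?<![X\d])\d+").Value;
    }

    // No lookbehind: "Call:" becomes part of Match.Value, so the XML
    // is read from capturing group 1 instead (empty string when no match).
    public static string CallViaGroup(string logLine)
    {
        Match match = Regex.Match(logLine, @"^Call:(<(\w+).*?</\2>)");
        return match.Groups[1].Value;
    }
}
```

For the input "X123 Y456" the first method returns "123" and the second returns "456".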

Capturing group

<(\w+) will match a less-than sign followed by one or more characters from the \w class (letters, digits or underscores). The \w+ part is surrounded with parentheses to create a group containing the XML root name (getName for the sample log line). We later use this group to find the closing tag with the use of a backreference. (\w+) is a capturing group, which means that the text matched by this group is added to the Groups collection of the Match object. If you want to put part of the expression into a group but you don’t want to push results into the Groups collection, you may use a non-capturing group by adding a question mark and colon after the opening parenthesis, like this: (?:something)
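The difference between the two kinds of groups is easy to see in code. In this sketch (GroupsDemo is a hypothetical name) the first method reads the root element name from the Groups collection, and the second one counts the groups a pattern defines, with group 0 standing for the whole match.

```csharp
using System.Text.RegularExpressions;

public static class GroupsDemo
{
    // Reads the text captured by (\w+), i.e. group number 1.
    public static string RootName(string logLine)
    {
        Match match = Regex.Match(logLine, @"(?<=^Call:)<(\w+).*?</\1>");
        return match.Groups[1].Value;
    }

    // Number of groups in a pattern, counting group 0 (the whole match).
    public static int GroupCount(string pattern)
    {
        return new Regex(pattern).GetGroupNumbers().Length;
    }
}
```

GroupCount(@"<(\w+)>") gives 2, while GroupCount(@"<(?:\w+)>") gives 1 because the non-capturing group adds nothing to the collection.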

Lazy wildcard

.*? matches all characters except newline (because we are not using RegexOptions.Singleline) in lazy (or non-greedy) mode thanks to the question mark after the asterisk. By default the * quantifier is greedy, which means that the regex engine will try to match as much text as possible. In our case, the default mode would match too much text:

<getName seqNo="56789"><id>123</id></getName> Result:<getName seqNo="56789">John Smith</getName>
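The two modes can be compared side by side. A minimal sketch, with GreedyVsLazy as a made-up class name; the only difference between the patterns is the question mark after the asterisk.

```csharp
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Lazy: stops at the first closing tag that matches the root name.
    public static string MatchLazy(string logLine)
    {
        return Regex.Match(logLine, @"(?<=^Call:)<(\w+).*?</\1>").Value;
    }

    // Greedy: runs to the last closing tag with that name on the line.
    public static string MatchGreedy(string logLine)
    {
        return Regex.Match(logLine, @"(?<=^Call:)<(\w+).*</\1>").Value;
    }
}
```

For the sample log line the lazy version returns only the call's XML, while the greedy one also swallows the whole Result section.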

Backreference

</\1> matches the XML closing tag, where the element's name is provided by the \1 backreference. Remember the (\w+) group? This group has number 1 and by using the \1 syntax we are referencing the text matched by this group. So for our sample log, </\1> gives us </getName>. If a regex is complex it may be a good idea to ditch numbered references and use named references instead. You can name a group with the (?<name>...) or (?'name'...) syntax and reference it by using \k<name> or \k'name'. So your expression could look like this:

@"(?<=^Call:)<(?<tag>\w+).*?</\k<tag>>"

or like this:

@"(?<=^Call:)<(?'tag'\w+).*?</\k'tag'>"

The latter version is better for our purpose. Using < > signs while matching XML is confusing. In this case regex engine will do just fine with < > version but keep in mind that source code is written for humans…
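A named group can also be read by name from the Groups collection, which keeps the calling code readable. A small sketch using the apostrophe syntax (NamedGroupDemo and TagName are hypothetical names):

```csharp
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Same regex as before, with the numbered group replaced by 'tag'.
    private static readonly Regex CallRegex =
        new Regex(@"(?<=^Call:)<(?'tag'\w+).*?</\k'tag'>");

    // Returns the root element name captured by the 'tag' group
    // (empty string when the line has no call).
    public static string TagName(string logLine)
    {
        return CallRegex.Match(logLine).Groups["tag"].Value;
    }
}
```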

Regular expressions look intimidating, but do yourself a favor and spend a few hours practicing them; they are extremely useful (not only for quick log analysis)!