Html Agility Pack - massive information extraction from WWW pages

by Miłosz Orzeł 1. December 2013 22:56

Recently I needed to acquire a certain database. Unfortunately it was published only as a website that presented 50 records per page, and the whole database had more than 150 thousand records. What to do in such a situation? Click through 3000 pages, manually collecting data in a text file? One week and it's done! ;) It's better to write a program (a so-called scraper) that will do the work for you. The program has to do three things:

  • generate a list of addresses from which data should be collected;
  • visit pages sequentially and extract information from HTML code;
  • dump data to local database and log work progress.

Address generation should be quite easy. For most sites pagination is built with plain links in which the page number is clearly visible, either in the main part of the URL or in the query string. If pagination is done via AJAX calls the situation is a bit more complex, but let's not bother with that in this post... When you know the pattern for the page number parameter, all that's needed is a simple loop with something like:

string url = string.Format("{0}", pageNumber);
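A minimal sketch of such a loop could look like the code below. The address pattern (https://example.com/records?page={0}), the BuildPageUrls helper and the page range are all made up for illustration; substitute the pattern of the site you want to scrape.

```csharp
using System;
using System.Collections.Generic;

static class UrlGenerator
{
    // Builds one address per page by putting the page number into the pattern.
    // The pattern is hypothetical; {0} marks the place for the page number.
    public static List<string> BuildPageUrls(string pattern, int firstPage, int lastPage)
    {
        var urls = new List<string>();
        for (int pageNumber = firstPage; pageNumber <= lastPage; pageNumber++)
            urls.Add(string.Format(pattern, pageNumber));
        return urls;
    }

    static void Main()
    {
        // 150 thousand records at 50 per page gives 3000 pages.
        List<string> urls = BuildPageUrls("https://example.com/records?page={0}", 1, 3000);
        Console.WriteLine(urls.Count);
        Console.WriteLine(urls[0]);
    }
}
```

Keeping the addresses in a list (rather than building them on the fly) makes it easy to log which page is being processed and to resume after a failure.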

Now it's time for something more interesting. How to extract data from a webpage? You can use the WebRequest/WebResponse or WebClient classes from the System.Net namespace to get page content and then obtain information via regular expressions. You can also try to treat the downloaded content as XML and scrutinize it with XPath or LINQ to XML. These are not good approaches, however. For a complicated page structure, writing a correct expression might be difficult, and in most cases webpages are not valid XML documents. Fortunately, there is the Html Agility Pack library. It allows convenient parsing of HTML pages, even those with malformed code (e.g. lacking proper closing tags). HAP goes through the page content and builds a document object model that can later be processed with LINQ to Objects or XPath.

To start working with HAP you should install NuGet package named HtmlAgilityPack (I was using version 1.4.6) and import namespace with the same name. If you don't want to use NuGet (why?) download zip file from project's website and add reference to HtmlAgilityPack.dll file suitable for your platform (zip contains separate versions for .NET 4.5 and Silverlight 5 for example). Documentation in .chm file might be useful too. Attention! When I opened downloaded file (in Windows 7), the documentation looked empty. "Unlock" option from file's properties screen helped to solve the problem.

Retrieving webpage content with HAP is very easy. You have to create HtmlWeb object and use its Load method with page address:

HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument htmlDocument = htmlWeb.Load("");

In return, you will receive object of HtmlDocument class which is the core of HAP library.

HtmlWeb contains a bunch of properties that control how document is retrieved. For example, it is possible to indicate whether cookies should be used (UseCookies) and what should be the value of User Agent header included in HTTP request (UserAgent). For me AutoDetectEncoding and OverrideEncoding properties were especially useful as they let me correctly read document with Polish characters.

HtmlWeb htmlWeb = new HtmlWeb() { AutoDetectEncoding = false, OverrideEncoding = Encoding.GetEncoding("iso-8859-2") };

StatusCode (of type System.Net.HttpStatusCode) is another very useful property of HtmlWeb. With it you can check the result of the latest request.

Having HtmlDocument object ready, you can start to extract data. Here's an example of how to obtain links addresses and texts from previously downloaded webpage (add using System.Linq):

IEnumerable<HtmlNode> links = htmlDocument.DocumentNode.Descendants("a").Where(x => x.Attributes.Contains("href"));
foreach (var link in links)
    Console.WriteLine(string.Format("Link href={0}, link text={1}", link.Attributes["href"].Value, link.InnerText));       

The DocumentNode property of type HtmlNode points to the page's root. The Descendants method is used to retrieve all links (a tags) that contain an href attribute. After that, link addresses and texts are printed on the console. Quite easy, huh? A few other examples:

Getting HTML code of the whole page:

string html = htmlDocument.DocumentNode.OuterHtml;

Getting element with "footer" id:

HtmlNode footer = htmlDocument.DocumentNode.Descendants().SingleOrDefault(x => x.Id == "footer");

Getting children of div with "toc" id and displaying names of child nodes which have type different than Text:

IEnumerable<HtmlNode> tocChildren = htmlDocument.DocumentNode.Descendants().Single(x => x.Id == "toc").ChildNodes;
foreach (HtmlNode child in tocChildren)
    if (child.NodeType != HtmlNodeType.Text)
        Console.WriteLine(child.Name);

Getting list elements (li tag) that have toclevel-1 class:

IEnumerable<HtmlNode> tocLiLevel1 = htmlDocument.DocumentNode.Descendants()
    .Where(x => x.Name == "li" && x.Attributes.Contains("class")
    && x.Attributes["class"].Value.Split().Contains("toclevel-1"));

Notice that Where filter is quite complex. Simple condition:

Where(x => x.Name == "li" && x.Attributes["class"].Value == "toclevel-1")

is not correct! Firstly, there is no guarantee that each li tag has the class attribute set, so we need to check whether the attribute exists to avoid a NullReferenceException. Secondly, the check for toclevel-1 is flawed. An HTML element might have many classes, so instead of using == it's worthwhile to use Contains(). Plain Value.Contains is not enough though. What if we are looking for the "sec" class and the element has the "secret" class? Such an element will be matched too! Rather than Value.Contains you should use Value.Split().Contains. This way an array of whole class names is checked for equality (instead of searching a single string for a substring).
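The difference between the two checks can be tried out without HAP at all, because it boils down to plain string handling. In the sketch below the class and method names are invented for the example, and the attribute value is just a sample string.

```csharp
using System;
using System.Linq;

static class ClassMatching
{
    // Substring check: also matches class names that merely contain the text.
    public static bool HasClassNaive(string classAttribute, string className)
    {
        return classAttribute.Contains(className);
    }

    // Token check: splits on whitespace and compares whole class names.
    public static bool HasClass(string classAttribute, string className)
    {
        return classAttribute.Split().Contains(className);
    }

    static void Main()
    {
        string classAttribute = "secret toclevel-1";
        Console.WriteLine(HasClassNaive(classAttribute, "sec"));    // True (false positive)
        Console.WriteLine(HasClass(classAttribute, "sec"));         // False
        Console.WriteLine(HasClass(classAttribute, "toclevel-1"));  // True
    }
}
```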

Getting texts of all li elements which are nested in minimum one li element:

var nestedLiTexts = from node in htmlDocument.DocumentNode.Descendants()
                    where node.Name == "li" && node.Ancestors("li").Any()
                    select node.InnerText;

Beyond LINQ to Objects, XPath might also be used to extract information. For example:

Getting a tags that have href attribute value starting with # and longer than 15 characters:

IEnumerable<HtmlNode> links = htmlDocument.DocumentNode.SelectNodes("//a[starts-with(@href, '#') and string-length(@href) > 15]");

Finding li elements inside the div with id "toc" which are the third li child of their parent element:

IEnumerable<HtmlNode> listItems = htmlDocument.DocumentNode.SelectNodes("//div[@id='toc']//li[3]");

XPath is a complex tool and it's impossible to show all its great capabilities in this post...

HAP lets you explore page structure and content, but it also allows modifying and saving pages. It has helper methods for detecting document encoding (DetectEncoding), replacing HTML entities with their characters (DeEntitize) and more... It is also possible to gather validation information (e.g. check whether the original document had proper closing tags). These topics are beyond the scope of this post.

While processing consecutive pages, dump useful information to the local storage most suitable for your needs. Maybe a .csv file will be enough for you, maybe an SQL database will be required? For me a plain text file was sufficient.

Last thing worth doing is ensuring that scraper properly logs information about its work progress (for sure you want to know how far your program went and if it encountered any errors). For logging it is best to use specialized library such as log4net. There's a lot of tutorials on how to use log4net so I will not write about it. But I will show you a sample configuration which you can use in console application:

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
    <configSections>
        <section name="log4net" type="log4net.Config.Log4NetConfigurationSectionHandler, log4net"/>
    </configSections>
    <log4net>
        <root>
            <level value="DEBUG"/>
            <appender-ref ref="ConsoleAppender" />
            <appender-ref ref="RollingFileAppender"/>
        </root>
        <appender name="ConsoleAppender" type="log4net.Appender.ColoredConsoleAppender">
            <layout type="log4net.Layout.PatternLayout">
                <conversionPattern value="%date{ISO8601} %level [%thread] %logger - %message%newline" />
            </layout>
            <mapping>
                <level value="ERROR" />
                <foreColor value="White" />
                <backColor value="Red" />
            </mapping>
            <filter type="log4net.Filter.LevelRangeFilter">
                <levelMin value="INFO" />
            </filter>
        </appender>
        <appender name="RollingFileAppender" type="log4net.Appender.RollingFileAppender">
            <file value="Log.txt" />
            <appendToFile value="true" />
            <rollingStyle value="Size" />
            <maxSizeRollBackups value="10" />
            <maximumFileSize value="50MB" />
            <staticLogFileName value="true" />
            <layout type="log4net.Layout.PatternLayout">
                <conversionPattern value="%date{ISO8601} %level [%thread] %logger - %message%newline%exception" />
            </layout>
        </appender>
    </log4net>
</configuration>

Above config contains two appenders: ConsoleAppender and RollingFileAppender. The first logs text to the console window, ensuring that errors are clearly distinguished by color. To reduce the amount of information, LevelRangeFilter is set so only entries with INFO or higher level are presented. The second appender logs to a text file (even entries with DEBUG level go there). Maximum size of a single file is set to 50MB and the number of backup files is limited to 10. The current log is always in the Log.txt file...

And that's all, scraper is ready! Run it and let it labor for you. No dull, long hour work - leave it for people who don't know how to program :)

Additionally you can try a little exercise: instead of creating a list of all pages to visit, determine only the first page and find a link to next page in currently processed one...

P.S. Keep in mind that HAP works on the HTML code that was sent by the server (this code is used by HAP to build the document model). The DOM which you can observe in the browser's developer tools is often a result of script execution and might differ greatly from the one built directly from the HTTP response.

Update 08.12.2013: As requested, I created simple demo (Visual Studio 2010 solution) of how to use Html Agility Pack and log4net. The app extracts some links from wiki page and dumps them to text file. Wiki page is saved to htm file to avoid dependency on web resource that might change. Download

Short but very useful regex – lookbehind, lazy, group and backreference

by Miłosz Orzeł 16. August 2013 10:04

Recently, I wanted to extract calls to external system from log files and do some LINQ to XML processing on obtained data. Here’s a sample log line (simplified, real log was way more complicated but it doesn’t matter for this post):

Call:<getName seqNo="56789"><id>123</id></getName> Result:<getName seqNo="56789">John Smith</getName>

I was interested in XML data of the call:

<getName seqNo="56789">
  <id>123</id>
</getName>

Quick tip: a super-easy way to get such nicely formatted XML in .NET 3.5 or later is to invoke the ToString method on an XElement object:

var xml = System.Xml.Linq.XElement.Parse(someUglyXmlString);
Console.WriteLine(xml.ToString());

When it comes to log, some things were certain: 

  • call’s XML will be logged after the “Call:” text at the beginning of the line
  • call’s root element name will contain only alphanumeric characters or underscores
  • there will be no line breaks in call’s data
  • call’s root element name may also appear in the “Result” section

Getting to the proper information was quite easy thanks to Regex class:

Regex regex = new Regex(@"(?<=^Call:)<(\w+).*?</\1>");
string call = regex.Match(logLine).Value;

This short regular expression has a couple of interesting parts. It may not be perfect but it proved really helpful in log analysis. If this regex is not entirely clear to you, read on; you will need to use something similar sooner or later.
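To see the expression at work, here is a small console sketch that runs it against the sample log line and hands the result to LINQ to XML. The CallExtractor class and its ExtractCall method are names invented for this example; only the regex and the log line come from the post.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class CallExtractor
{
    static readonly Regex CallRegex = new Regex(@"(?<=^Call:)<(\w+).*?</\1>");

    // Returns the call's XML as an XElement, or null when the line has no call.
    public static XElement ExtractCall(string logLine)
    {
        Match match = CallRegex.Match(logLine);
        return match.Success ? XElement.Parse(match.Value) : null;
    }

    static void Main()
    {
        string logLine = "Call:<getName seqNo=\"56789\"><id>123</id></getName> " +
                         "Result:<getName seqNo=\"56789\">John Smith</getName>";

        XElement call = ExtractCall(logLine);
        Console.WriteLine(call.Name.LocalName);              // getName
        Console.WriteLine((string)call.Attribute("seqNo"));  // 56789
        Console.WriteLine((string)call.Element("id"));       // 123
    }
}
```

Returning null for lines without a call is a design choice made for the sketch; it lets the caller skip such lines with a simple check.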

Here’s the same regex with comments (RegexOptions.IgnorePatternWhitespace is required to process expression commented this way):

string pattern = @"(?<=^Call:) # Positive lookbehind for call marker
                   <(\w+)      # Capturing group for opening tag name
                   .*?         # Lazy wildcard (everything in between)
                   </\1>       # Backreference to opening tag name";   
Regex regex = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);
string call = regex.Match(logLine).Value;

Positive lookbehind

(?<=^Call:) is a lookaround or, more precisely, a positive lookbehind. It’s a zero-width assertion that lets us check whether some text is preceded by another text. Here “Call:” at the very beginning of the input (the ^ anchor) is the preceding text we are looking for. (?<=something) denotes positive lookbehind. There is also negative lookbehind expressed by (?<!something). With negative lookbehind we can match text that doesn’t have a particular string before it. A lookaround checks a fragment of the text but doesn't become part of the match value. So the result of this:

Regex.Match("X123", @"(?<=X)\d*").Value

will be "123" rather than "X123".

.NET regex engine has lookaheads too. Check this awesome page if you want to learn more about lookarounds. 

Note: In some cases (like in our log examination example) instead of using positive lookaround we may use non-capturing group...

Capturing group

<(\w+) will match a less-than sign followed by one or more characters from the \w class (letters, digits or underscores). The \w+ part is surrounded with parentheses to create a group containing the XML root name (getName for the sample log line). We later use this group to find the closing tag with the use of a backreference. (\w+) is a capturing group, which means that the text matched by the group is added to the Groups collection of the Match object. If you want to put part of the expression into a group but you don’t want to push results into the Groups collection, you may use a non-capturing group by adding a question mark and colon after the opening parenthesis, like this: (?:something)

Lazy wildcard

.*? matches all characters except newline (because we are not using RegexOptions.Singleline) in lazy (or non-greedy) mode thanks to question mark after asterisk. By default * quantifier is greedy, which means that regex engine will try to match as much text as possible. In our case, default mode will result in too long text being matched:

<getName seqNo="56789"><id>123</id></getName> Result:<getName seqNo="56789">John Smith</getName>
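The effect of that single question mark is easy to check in a console app. The sketch below runs the lazy and the greedy variant against the sample log line; the GreedyVsLazy class and the MatchValue helper are names made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVsLazy
{
    public const string Lazy = @"(?<=^Call:)<(\w+).*?</\1>";
    public const string Greedy = @"(?<=^Call:)<(\w+).*</\1>";

    // Returns the matched text, or an empty string when nothing matches.
    public static string MatchValue(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    static void Main()
    {
        string logLine = "Call:<getName seqNo=\"56789\"><id>123</id></getName> " +
                         "Result:<getName seqNo=\"56789\">John Smith</getName>";

        // Lazy: stops at the first closing tag that fits.
        Console.WriteLine(MatchValue(Lazy, logLine));
        // Greedy: runs to the last closing tag, swallowing the Result section.
        Console.WriteLine(MatchValue(Greedy, logLine));
    }
}
```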


Backreference

</\1> matches the XML closing tag where the element's name is provided with the \1 backreference. Remember the (\w+) group? This group has number 1 and by using the \1 syntax we are referencing the text matched by this group. So for our sample log, </\1> gives us </getName>. If a regex is complex it may be a good idea to ditch numbered references and use named references instead. You can name a group with the (?<name>...) or (?'name'...) syntax and reference it by using \k<name> or \k'name'. So your expression could look like this (the group name tag is chosen here only as an illustration):

@"(?<=^Call:)<(?<tag>\w+).*?</\k<tag>>"

or like this:

@"(?<=^Call:)<(?'tag'\w+).*?</\k'tag'>"

The latter version is better for our purpose. Using < > signs while matching XML is confusing. In this case regex engine will do just fine with < > version but keep in mind that source code is written for humans…
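Both naming syntaxes can be compared side by side in a short console sketch. The group name tag, the NamedGroups class and the RootName helper are arbitrary choices made for this example, not something taken from the original log analysis code.

```csharp
using System;
using System.Text.RegularExpressions;

static class NamedGroups
{
    // The same expression written with both naming syntaxes.
    public const string AngleSyntax = @"(?<=^Call:)<(?<tag>\w+).*?</\k<tag>>";
    public const string QuoteSyntax = @"(?<=^Call:)<(?'tag'\w+).*?</\k'tag'>";

    // Returns the text captured by the group named "tag".
    public static string RootName(string pattern, string logLine)
    {
        return Regex.Match(logLine, pattern).Groups["tag"].Value;
    }

    static void Main()
    {
        string logLine = "Call:<getName seqNo=\"56789\"><id>123</id></getName> " +
                         "Result:<getName seqNo=\"56789\">John Smith</getName>";

        Console.WriteLine(RootName(AngleSyntax, logLine)); // getName
        Console.WriteLine(RootName(QuoteSyntax, logLine)); // getName
    }
}
```

A named group is read from the Groups collection by its name, so the code keeps working even when more groups are added in front of it.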

Regular expressions look intimidating, but do yourself a favor and spend a few hours practicing them; they are extremely useful (not only for quick log analysis)!

Fast pixel operations in .NET (with and without unsafe)

by Miłosz Orzeł 6. July 2013 21:18

Bitmap class has GetPixel and SetPixel methods that let you acquire and change color of chosen pixels. Those methods are very easy to use but are also extremely slow. My previous post gives detailed explanation on the topic, click here if you are interested.

Fortunately you don’t have to use external libraries (or give up .NET altogether) to do fast image manipulation. The Framework contains a class called ColorMatrix that lets you apply many changes to images in an efficient manner. Properties such as contrast or saturation can be modified this way. But what about manipulation of individual pixels? It can be done too, with the help of the Bitmap.LockBits method and the BitmapData class…

Good way to test individual pixel manipulation speed is color difference detection. The task is to find portions of an image that have color similar to some chosen color. How to check if colors are similar? Think about color as a point in three dimensional space, where axes are: red, green and blue. Two colors are two points. The difference between colors is described by the distance between two points in RGB space.

Colors as points in 3D space

diff = \sqrt{(C1_R - C2_R)^2 + (C1_G - C2_G)^2 + (C1_B - C2_B)^2}

This technique is very easy to implement and gives decent results. Color comparison is actually a pretty complex matter though. Color spaces other than RGB are better suited for the task, and human color perception should be taken into account (e.g. our eyes are better at detecting differences in shades of green than in shades of blue). But let’s keep things simple here…

Our test image will be this Ultra HD 8K (7680x4320, 33.1Mpx) picture* (on this blog it’s of course scaled down to save bandwidth):

Color difference detection input image (scaled down for blog)

This is a method that may be used to look for R=253 G=129 B=84 pixels (aka “pink bra”). It sets matching pixels as white (the rest will be black):

static void DetectColorWithGetSetPixel(Bitmap image, byte searchedR, byte searchedG, int searchedB, int tolerance)
{
    // Squared values are compared, so no square root is needed per pixel
    int toleranceSquared = tolerance * tolerance;

    for (int x = 0; x < image.Width; x++)
    {
        for (int y = 0; y < image.Height; y++)
        {
            Color pixel = image.GetPixel(x, y);

            int diffR = pixel.R - searchedR;
            int diffG = pixel.G - searchedG;
            int diffB = pixel.B - searchedB;

            int distance = diffR * diffR + diffG * diffG + diffB * diffB;

            image.SetPixel(x, y, distance > toleranceSquared ? Color.Black : Color.White);
        }
    }
}

Above code is our terribly slow Get/SetPixel baseline. If we call it this way (named parameters for clarity):

DetectColorWithGetSetPixel(image, searchedR: 253, searchedG: 129, searchedB: 84, tolerance: 255);

we will receive following outcome:

Color difference detection output image (scaled down)

Result may be ok but having to wait over 84300ms* is a complete disaster! 

Now check out this method:

static unsafe void DetectColorWithUnsafe(Bitmap image, byte searchedR, byte searchedG, int searchedB, int tolerance)
{
    BitmapData imageData = image.LockBits(new Rectangle(0, 0, image.Width, image.Height), ImageLockMode.ReadWrite, PixelFormat.Format24bppRgb);
    int bytesPerPixel = 3;

    byte* scan0 = (byte*)imageData.Scan0.ToPointer();
    int stride = imageData.Stride;

    byte unmatchingValue = 0;
    byte matchingValue = 255;
    int toleranceSquared = tolerance * tolerance;

    for (int y = 0; y < imageData.Height; y++)
    {
        byte* row = scan0 + (y * stride);

        for (int x = 0; x < imageData.Width; x++)
        {
            // Watch out for actual order (BGR)!
            int bIndex = x * bytesPerPixel;
            int gIndex = bIndex + 1;
            int rIndex = bIndex + 2;

            byte pixelR = row[rIndex];
            byte pixelG = row[gIndex];
            byte pixelB = row[bIndex];

            int diffR = pixelR - searchedR;
            int diffG = pixelG - searchedG;
            int diffB = pixelB - searchedB;

            int distance = diffR * diffR + diffG * diffG + diffB * diffB;

            row[rIndex] = row[bIndex] = row[gIndex] = distance > toleranceSquared ? unmatchingValue : matchingValue;
        }
    }

    // Copies the modified buffer back to the bitmap
    image.UnlockBits(imageData);
}


It does exactly the same thing but runs for only 230ms, over 360 times faster!

Above code makes use of the Bitmap.LockBits method, which is a wrapper for the native GdipBitmapLockBits (GDI+, gdiplus.dll) function. LockBits creates a temporary buffer that contains pixel information in the desired format (in our case RGB, 8 bits per color component). Any changes to this buffer are copied back to the bitmap upon the UnlockBits call (therefore you should always use LockBits and UnlockBits as a pair). Bitmap.LockBits returns a BitmapData object (System.Drawing.Imaging namespace) that has two interesting properties: Scan0 and Stride. Scan0 returns the address of the first pixel data. Stride is the width of a single row of pixels (scan line) in bytes (with optional padding to make it divisible by 4).
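The padding rule can be expressed with a bit of integer arithmetic. The sketch below assumes the common case of a top-down bitmap with a positive stride; ComputeStride and PixelOffset are helper names invented for this example, and in real code you should simply read BitmapData.Stride instead of computing it.

```csharp
using System;

static class StrideMath
{
    // Bytes per scan line: the row's bit count rounded up to a multiple of 32 bits (4 bytes).
    public static int ComputeStride(int width, int bitsPerPixel)
    {
        return ((width * bitsPerPixel + 31) / 32) * 4;
    }

    // Index of the first byte of pixel (x, y) inside the locked buffer.
    public static int PixelOffset(int x, int y, int stride, int bytesPerPixel)
    {
        return y * stride + x * bytesPerPixel;
    }

    static void Main()
    {
        Console.WriteLine(ComputeStride(7680, 24)); // 23040, no padding needed
        Console.WriteLine(ComputeStride(5, 24));    // 16, i.e. 15 bytes of pixels + 1 byte of padding
        Console.WriteLine(PixelOffset(2, 1, 16, 3)); // 22
    }
}
```

For the 7680 pixel wide test image a row takes 23040 bytes, which is already divisible by 4, so there is no padding at all. Narrow or odd-sized images are where the padding bytes start to matter.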

BitmapData layout

Please notice that I don’t use calls to Math.Pow and Math.Sqrt to calculate distance between colors. Writing code like this: 

double distance = Math.Sqrt(Math.Pow(pixelR - searchedR, 2) + Math.Pow(pixelG - searchedG, 2) + Math.Pow(pixelB - searchedB, 2));

to process millions of pixels is a terrible idea. Such line can make our optimized method about 25 times slower! Using Math.Pow with integer parameters is extremely wasteful and we don’t have to calculate square root to determine if distance is longer than specified tolerance.
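That the square root can be skipped is easy to verify: for non-negative numbers, comparing squares gives the same answer as comparing the values themselves. The sketch below puts both variants next to each other; the class and method names are made up for this example and tolerance is assumed to be non-negative.

```csharp
using System;

static class ColorDistance
{
    // Straightforward version: real distance via Math.Pow and Math.Sqrt.
    public static bool IsTooFarSlow(int diffR, int diffG, int diffB, int tolerance)
    {
        double distance = Math.Sqrt(Math.Pow(diffR, 2) + Math.Pow(diffG, 2) + Math.Pow(diffB, 2));
        return distance > tolerance;
    }

    // Integer-only version: squared distance against squared tolerance.
    public static bool IsTooFarFast(int diffR, int diffG, int diffB, int tolerance)
    {
        int distanceSquared = diffR * diffR + diffG * diffG + diffB * diffB;
        return distanceSquared > tolerance * tolerance;
    }

    static void Main()
    {
        Console.WriteLine(IsTooFarSlow(3, 4, 0, 5)); // False (distance is exactly 5)
        Console.WriteLine(IsTooFarFast(3, 4, 0, 5)); // False
        Console.WriteLine(IsTooFarSlow(3, 4, 1, 5)); // True
        Console.WriteLine(IsTooFarFast(3, 4, 1, 5)); // True
    }
}
```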

Previously presented method uses code marked with the unsafe keyword. It allows a C# program to take advantage of pointer arithmetic. Unfortunately, unsafe mode has some important restrictions. Code must be compiled with the /unsafe option and executed in a fully trusted assembly.

Luckily there is a Marshal.Copy method (from System.Runtime.InteropServices namespace) that can move data between managed and unmanaged memory. We can use it to copy image data into a byte array and manipulate pixels very efficiently. Look at this method:

static void DetectColorWithMarshal(Bitmap image, byte searchedR, byte searchedG, int searchedB, int tolerance)
{
    BitmapData imageData = image.LockBits(new Rectangle(0, 0, image.Width, image.Height), ImageLockMode.ReadWrite, PixelFormat.Format24bppRgb);

    byte[] imageBytes = new byte[Math.Abs(imageData.Stride) * image.Height];
    IntPtr scan0 = imageData.Scan0;

    Marshal.Copy(scan0, imageBytes, 0, imageBytes.Length);

    byte unmatchingValue = 0;
    byte matchingValue = 255;
    int toleranceSquared = tolerance * tolerance;

    // Walks the buffer pixel by pixel; this assumes rows have no padding
    // (stride == width * 3), which holds for the 7680 pixel wide test image
    for (int i = 0; i < imageBytes.Length; i += 3)
    {
        // Watch out for actual order (BGR)!
        byte pixelB = imageBytes[i];
        byte pixelG = imageBytes[i + 1];
        byte pixelR = imageBytes[i + 2];

        int diffR = pixelR - searchedR;
        int diffG = pixelG - searchedG;
        int diffB = pixelB - searchedB;

        int distance = diffR * diffR + diffG * diffG + diffB * diffB;

        imageBytes[i] = imageBytes[i + 1] = imageBytes[i + 2] = distance > toleranceSquared ? unmatchingValue : matchingValue;
    }

    Marshal.Copy(imageBytes, 0, scan0, imageBytes.Length);

    image.UnlockBits(imageData);
}


It runs for 280ms, so it is only slightly slower than the unsafe version. It is CPU efficient but uses more memory than the previous method: almost 100 megabytes for our test Ultra HD 8K image in RGB 24 format.

If you want to make pixel manipulation even faster you may process different parts of the image in parallel. You need to do some benchmarking first because for small images the cost of threading may be bigger than the gains from concurrent execution. Here’s a quick sample of code that uses 4 threads to process 4 parts of the image simultaneously. It yields a 30% time improvement on my machine. Treat it as a quick and dirty hint, this post is already too long…

static unsafe void DetectColorWithUnsafeParallel(Bitmap image, byte searchedR, byte searchedG, int searchedB, int tolerance)
{
    BitmapData imageData = image.LockBits(new Rectangle(0, 0, image.Width, image.Height), ImageLockMode.ReadWrite, PixelFormat.Format24bppRgb);
    int bytesPerPixel = 3;

    byte* scan0 = (byte*)imageData.Scan0.ToPointer();
    int stride = imageData.Stride;

    byte unmatchingValue = 0;
    byte matchingValue = 255;
    int toleranceSquared = tolerance * tolerance;

    Task[] tasks = new Task[4];
    for (int i = 0; i < tasks.Length; i++)
    {
        int ii = i;
        tasks[i] = Task.Factory.StartNew(() =>
            {
                // Each task gets one quarter of the image: top/bottom half, left/right half
                int minY = ii < 2 ? 0 : imageData.Height / 2;
                int maxY = ii < 2 ? imageData.Height / 2 : imageData.Height;

                int minX = ii % 2 == 0 ? 0 : imageData.Width / 2;
                int maxX = ii % 2 == 0 ? imageData.Width / 2 : imageData.Width;

                for (int y = minY; y < maxY; y++)
                {
                    byte* row = scan0 + (y * stride);

                    for (int x = minX; x < maxX; x++)
                    {
                        int bIndex = x * bytesPerPixel;
                        int gIndex = bIndex + 1;
                        int rIndex = bIndex + 2;

                        byte pixelR = row[rIndex];
                        byte pixelG = row[gIndex];
                        byte pixelB = row[bIndex];

                        int diffR = pixelR - searchedR;
                        int diffG = pixelG - searchedG;
                        int diffB = pixelB - searchedB;

                        int distance = diffR * diffR + diffG * diffG + diffB * diffB;

                        row[rIndex] = row[bIndex] = row[gIndex] = distance > toleranceSquared ? unmatchingValue : matchingValue;
                    }
                }
            });
    }

    // Wait for all four parts before releasing the buffer
    Task.WaitAll(tasks);

    image.UnlockBits(imageData);
}
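Another way to split the work, sketched below under a few assumptions, is to hand whole rows to Parallel.For instead of starting four tasks by hand. This is a different technique than the method shown above: it works on a managed byte array (like the Marshal.Copy version) rather than on pointers, it assumes 24bpp data in BGR order, and the ParallelRows class and its DetectColor method are names made up for the example.

```csharp
using System;
using System.Threading.Tasks;

static class ParallelRows
{
    // Marks pixels close to the searched color as white and the rest as black.
    // pixels holds 24bpp data in BGR order, with stride bytes per row.
    public static void DetectColor(byte[] pixels, int width, int height, int stride,
        byte searchedR, byte searchedG, byte searchedB, int tolerance)
    {
        int toleranceSquared = tolerance * tolerance;

        // Rows do not overlap, so they can be processed concurrently.
        Parallel.For(0, height, y =>
        {
            int rowStart = y * stride;
            for (int x = 0; x < width; x++)
            {
                int i = rowStart + x * 3;
                int diffB = pixels[i] - searchedB;
                int diffG = pixels[i + 1] - searchedG;
                int diffR = pixels[i + 2] - searchedR;
                int distance = diffR * diffR + diffG * diffG + diffB * diffB;

                byte value = distance > toleranceSquared ? (byte)0 : (byte)255;
                pixels[i] = pixels[i + 1] = pixels[i + 2] = value;
            }
        });
    }

    static void Main()
    {
        // 2x2 image, stride 8: two pixels (6 bytes) plus 2 padding bytes per row.
        byte[] pixels =
        {
            84, 129, 253,   0, 0, 0,         7, 7,
            90, 129, 250,   255, 255, 255,   7, 7
        };
        DetectColor(pixels, 2, 2, 8, 253, 129, 84, 10);
        Console.WriteLine(string.Join(",", pixels));
    }
}
```

Because the loop walks each row only up to width, padding bytes at the end of a row are left untouched. As with the four-task version, benchmark before assuming it is faster for your image sizes.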



* Originally I had some triangles and squares as an illustration, but Victoria's Secret models (source) are better, huh? :) 

* .NET 4 console app, executed  on MSI GE620 DX laptop: Intel Core i5-2430M 2.40GHz (2 cores, 4 threads), 4GB DDR3 RAM, NVIDIA GT 555M 2GB DDR3, HDD 500GB 7200RPM, Windows 7 Home Premium x64.

Radio buttons for list items in MVC 4 – problem with id uniqueness

by Miłosz Orzeł 24. June 2013 20:37

Let's suppose that we have some model that has a list property and we want to render some radio buttons for items of that list. Take the following basic setup as an example.

Main model class with list:

using System.Collections.Generic;

public class Team
{
    public string Name { get; set; }
    public List<Player> Players { get; set; }
}

List item class:

public class Player
{
    public string Name { get; set; }
    public string Level { get; set; }
}

There are three accepted values for player’s skill Level property: BEG (Beginner), INT (Intermediate) and ADV (Advanced) so we want three radio buttons (with labels) for each player in a team. Yup, normally we would rather use enum instead of a string for Level property, but let’s skip it here for the sake of simplicity…

Controller action method that returns sample data:

public ActionResult Index()
{
    var team = new Team() {
        Name = "Some Team",
        Players = new List<Player> {
               new Player() {Name = "Player A", Level="BEG"},
               new Player() {Name = "Player B", Level="INT"},
               new Player() {Name = "Player C", Level="ADV"}
        }
    };

    return View(team);
}

Here is our Index.cshtml view:

@model Team


    @Html.EditorFor(model => model.Players)            

Notice that markup for Players is not created inside a loop. Instead, EditorTemplate is used. It’s a good practice since it makes code more clear and maintainable. Framework is smart enough to use code from template for each player on a list, because Team.Players property implements IEnumerable interface...

And here is the Player.cshtml EditorTemplate (by convention the template is named after the item type):

@model Player

    @Html.RadioButtonFor(model => model.Level, "BEG")
    @Html.LabelFor(model => model.Level, "Beginner")

    @Html.RadioButtonFor(model => model.Level, "INT")
    @Html.LabelFor(model => model.Level, "Intermediate")

    @Html.RadioButtonFor(model => model.Level, "ADV")
    @Html.LabelFor(model => model.Level, "Advanced")        

The code looks fine, with nice strongly typed helpers that rely on lambda expressions (for compile-time checking and easier refactoring)... but there’s a catch: the HTML markup generated by such code is actually seriously flawed. Check this snippet of web page source generated for the first player:

    <strong>Player A:</strong>  
    <input name="Players[0].Level" id="Players_0__Level" type="radio" checked="checked" value="BEG">
    <label for="Players_0__Level">Beginner</label>
    <input name="Players[0].Level" id="Players_0__Level" type="radio" value="INT">
    <label for="Players_0__Level">Intermediate</label>
    <input name="Players[0].Level" id="Players_0__Level" type="radio" value="ADV">
    <label for="Players_0__Level">Advanced</label>        

Players_0__Level id is used for three different radio buttons! Lack of uniqueness not only violates HTML specification and makes scripting hard but also causes label tags to not work properly (clicking on them doesn’t check their corresponding input element). 

Fortunately MVC framework contains TemplateInfo class that has GetFullHtmlFieldId method. This method returns id for DOM element. That id is constructed by appending name provided as method argument to an automatically determined prefix. This prefix takes into account nesting level and list item's index. Internally, GetFullHtmlFieldId uses TemplateInfo.HtmlFieldPrefix property and TagBuilder.CreateSanitizedId method so even if you pass some illegal characters to id suffix they will be replaced.
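To get a feeling for where an id such as Players_0__Level comes from, here is a simplified approximation written in plain C#. It is not the framework's code: FieldIdSketch and its GetFullFieldId method are invented for this example, and the real implementation handles more cases (for instance names starting with an indexer or ids starting with a non-letter). The sketch only joins the prefix with the name and replaces characters that are not letters, digits, '-', '_' or ':' with an underscore.

```csharp
using System;
using System.Text;

static class FieldIdSketch
{
    // Approximates id creation: prefix + "." + name, then sanitization.
    public static string GetFullFieldId(string prefix, string partialName)
    {
        string fullName = string.IsNullOrEmpty(prefix) ? partialName : prefix + "." + partialName;

        var id = new StringBuilder(fullName.Length);
        foreach (char c in fullName)
        {
            bool allowed = char.IsLetterOrDigit(c) || c == '-' || c == '_' || c == ':';
            id.Append(allowed ? c : '_');
        }
        return id.ToString();
    }

    static void Main()
    {
        // "Players[0]" plays the role of the automatically determined prefix.
        Console.WriteLine(GetFullFieldId("Players[0]", "Level"));      // Players_0__Level
        Console.WriteLine(GetFullFieldId("Players[0]", "rbBeginner")); // Players_0__rbBeginner
    }
}
```

The square brackets and the dot each turn into an underscore, which explains the double underscore after the index.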

Here is the modified EditorTemplate:

@model Player

    @{string rbBeginnerId = ViewContext.ViewData.TemplateInfo.GetFullHtmlFieldId("rbBeginner"); }
    @Html.RadioButtonFor(model => model.Level, "BEG", new { id = rbBeginnerId })
    @Html.LabelFor(model => model.Level, "Beginner",  new { @for = rbBeginnerId} )

    @{string rbIntermediateId = ViewContext.ViewData.TemplateInfo.GetFullHtmlFieldId("rbIntermediate"); }
    @Html.RadioButtonFor(model => model.Level, "INT", new { id = rbIntermediateId })
    @Html.LabelFor(model => model.Level, "Intermediate",  new { @for = rbIntermediateId })

    @{string rbAdvancedId = ViewContext.ViewData.TemplateInfo.GetFullHtmlFieldId("rbAdvanced"); }
    @Html.RadioButtonFor(model => model.Level, "ADV", new { id = rbAdvancedId })
    @Html.LabelFor(model => model.Level, "Advanced",  new { @for = rbAdvancedId })

Calls to the ViewContext.ViewData.TemplateInfo.GetFullHtmlFieldId method let us obtain ids for radio buttons, which are also used to set the for attributes of labels. In MVC 3 there was no overload of the LabelFor extension method that accepted an htmlAttributes object. Luckily version 4 has it built in.

Above code produces such markup:

    <strong>Player A:</strong>
    <input name="Players[0].Level" id="Players_0__rbBeginner" type="radio" checked="checked" value="BEG">
    <label for="Players_0__rbBeginner">Beginner</label>
    <input name="Players[0].Level" id="Players_0__rbIntermediate" type="radio" value="INT">
    <label for="Players_0__rbIntermediate">Intermediate</label>
    <input name="Players[0].Level" id="Players_0__rbAdvanced" type="radio" value="ADV">
    <label for="Players_0__rbAdvanced">Advanced</label>

Now input ids are unique and labels properly reference radio buttons via the for attribute. Alright :)

BTW: the weird name “radio buttons” for mutually exclusive option elements comes from buttons on radio receivers that were used to switch between stations (pushing one in automatically pushed the others out).

Coordinate system in HTML5 Canvas, drawing with y-axis value increasing upwards

by Miłosz Orzeł 19. May 2013 10:06

Coordinate system in HTML5 Canvas is set up in such a way that its origin (0,0) is in the upper-left corner. This solution is nothing new in the world of screen graphics (e.g. the same goes for Windows Forms and SVG). CRT monitors, which were standard in the past, displayed picture lines from top to bottom and image within a line was created from left to right. So locating origin (0,0) in the upper-left corner was intuitive and it made creating hardware and software for handling graphics easier.

Unfortunately sometimes default coordinate system in canvas is a bit impractical. Let’s assume that you want to create projectile motion animation. It seems natural that for ascending projectile, the value of y coordinate should increase. But it will result in a weird effect of inverted trajectory:

Default coordinate system (y value increases downwards)

You can get rid of this problem by modifying y value that is passed to drawing function:

context.fillRect(x, offsetY - y, size, size);

For y = 0, projectile will be placed in a location determined by offsetY (to make y = 0 be the very bottom of the canvas, set offsetY equal to height of the canvas). The bigger the value of y the higher a projectile will be drawn. The problem is that you can have hundreds of places in your code that use y coordinate. If you forget to use offsetY just once the whole image may get destroyed. 

Luckily, canvas lets you change the coordinate system by means of transformations. Two transformation methods will be useful for us: translate(x, y) and scale(x, y). The former moves the origin to an arbitrary place; the latter changes the size of drawn objects, but it may also be used to invert coordinates.

Single execution of the following code will move origin of coordinate system to point (0, offsetY) and establish y-axis values as increasing towards the top of the screen:

context.translate(0, offsetY);
context.scale(1, -1);

Translation and scaling of coordinate system.
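To see why these two calls are enough, it can help to model the transformation matrix by hand. The sketch below is not the canvas API: createTransform, translate, scale and apply are hypothetical helpers that mimic how a canvas context is assumed to combine transformations (each new one is applied on top of the current matrix), and the offsetY value of 400 is just an example canvas height.

```javascript
// Hypothetical model of a 2D transformation matrix:
// | a c e |
// | b d f |
// | 0 0 1 |
function createTransform() {
    return { a: 1, b: 0, c: 0, d: 1, e: 0, f: 0 }; // identity
}

// Combines the current matrix with a translation by (tx, ty).
function translate(m, tx, ty) {
    return {
        a: m.a, b: m.b, c: m.c, d: m.d,
        e: m.a * tx + m.c * ty + m.e,
        f: m.b * tx + m.d * ty + m.f
    };
}

// Combines the current matrix with scaling by (sx, sy).
function scale(m, sx, sy) {
    return { a: m.a * sx, b: m.b * sx, c: m.c * sy, d: m.d * sy, e: m.e, f: m.f };
}

// Maps a point from user coordinates to pixel coordinates.
function apply(m, x, y) {
    return { x: m.a * x + m.c * y + m.e, y: m.b * x + m.d * y + m.f };
}

var offsetY = 400; // assumed canvas height
var m = scale(translate(createTransform(), 0, offsetY), 1, -1);
var ground = apply(m, 10, 0);   // projectile lying on the "ground"
var raised = apply(m, 10, 100); // projectile 100 units up
```

In this model the point (10, 0) ends up in pixel row 400 and the point (10, 100) in pixel row 300, which is exactly the offsetY - y mapping that otherwise has to be repeated in every drawing call.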

But there’s a catch: the result of providing -1 as the second argument of the scale method is that the whole image is created with an inverted y coordinate. This applies to text too (calling fillText will render letters upside-down). Therefore, before writing any text, you have to restore the default y-axis configuration. Because restoring canvas state manually is awkward, the save() and restore() methods exist. They push canvas state onto the stack and pop it from the stack, respectively. It is recommended to call save before doing transformations. Canvas state includes not only transformations but also values such as fill style or line width...

context.save();
context.fillStyle = 'red';
context.scale(2, 2);
context.fillRect(0, 0, 10, 10);
context.restore();
context.fillRect(0, 0, 10, 10);

Above code draws 2 squares: 

The first square is red and is drawn with 2x scale. The second square is drawn with default canvas settings (black color and 1x scale). This happens because canvas state was saved on the stack right before any changes to scale and color, and it was restored before the second square was drawn.
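The same pair of methods can take care of the upside-down text problem. Below is a minimal sketch, assuming the context already uses the inverted y-axis; fillTextUpright is a hypothetical helper name, not part of the canvas API. It flips the axis back only for the duration of a single fillText call.

```javascript
// Hypothetical helper: draws upright text at (x, y) while the context
// uses the inverted coordinate system (y value increasing upwards).
function fillTextUpright(context, text, x, y) {
    context.save();                // remember current transformations and styles
    context.scale(1, -1);          // flip the y-axis back, only for this text
    context.fillText(text, x, -y); // y is negated because the axis was flipped again
    context.restore();             // go back to the inverted coordinate system
}
```

Because the flip happens between save and restore, the rest of the drawing code can keep using y values that increase upwards.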

TortoiseSVN pre-commit hook in C# - save yourself some troubles!

by Miłosz Orzeł 13. January 2013 19:24

Probably everyone who creates or debugs a program happens to make temporary changes to the code that make current task easier but should never get into the repository. And probably everyone has accidentally put such code into next revision. If you are lucky enough, mistake will be revealed quickly and the only result will be a bit of shame, if not...

If only there was a way to mark “uncommitable” code...

You can do it and it’s pretty simple!

TortoiseSVN lets you set so-called pre-commit hook. It’s a program (or script) that is run when user clicks “OK” button in “SVN Commit” window. Hook can for example check content of modified files and block commit when deemed appropriate. Tortoise hooks differ from Subversion hooks in that they are executed locally and not on the server that hosts the repository. You therefore don’t have to worry whether your hook will be accepted by the admin or if it works on the server (e.g. server may not have .NET installed), you also don’t affect the experience of other users of the repository. Client-side hooks are quicker too.

A detailed description of hooks can be found in the “4.30.8. Client Side Hook Scripts” chapter of Tortoise’s help file.

Tortoise supports 7 kinds of hooks: start-commit, pre-commit, post-commit, start-update, pre-update, post-update and pre-connect. We are concerned with pre-commit action. The essence of the hook is to check whether one of added or modified files contains temporary code marker. Our marker may be a “NOT_FOR_REPO” text put into a comment placed above temporary code.

This is whole hook’s code – simple console application, that may save your ass :)

using System;
using System.IO;
using System.Text.RegularExpressions;

namespace NotForRepoPreCommitHook
{
    class Program
    {
        const string NotForRepoMarker = "NOT_FOR_REPO";

        static void Main(string[] args)
        {
            // First argument: path to a *.tmp file listing files affected by the commit
            string[] affectedPaths = File.ReadAllLines(args[0]);

            Regex fileExtensionPattern = new Regex(@"^.*\.(cs|js|xml|config)$", RegexOptions.IgnoreCase);

            foreach (string path in affectedPaths)
            {
                if (fileExtensionPattern.IsMatch(path) && File.Exists(path))
                {
                    if (ContainsNotForRepoMarker(path))
                    {
                        string errorMessage = string.Format("{0} marker found in {1}", NotForRepoMarker, path);
                        Console.Error.WriteLine(errorMessage);
                        Environment.Exit(1); // non-zero exit code blocks the commit
                    }
                }
            }
        }

        static bool ContainsNotForRepoMarker(string path)
        {
            using (StreamReader reader = File.OpenText(path))
            {
                string line = reader.ReadLine();

                while (line != null)
                {
                    if (line.Contains(NotForRepoMarker))
                    {
                        return true;
                    }

                    line = reader.ReadLine();
                }
            }

            return false;
        }
    }
}

TSVN calls the pre-commit hook with four parameters. We are interested only in the first one: it contains a path to a *.tmp file that lists the files affected by the current commit, one path per line. After loading the list, files are filtered by extension (useful if you don’t want to process files of all types). Checking whether a file exists is also important, because the list from the *.tmp file contains paths of deleted files too! Detection of the marker represented by the NotForRepoMarker constant is done by the ContainsNotForRepoMarker method. Despite its simplicity it provides good performance: on my (mid-range) laptop, a 100 MB file takes less than a second to process. If the marker is found, the program exits with an error code (a value different than 0). Before quitting, information about which file contains the marker is sent to standard error output (via Console.Error). This message will be displayed in the Tortoise window.

The code is simple, isn’t it? In addition, hook installation is also trivial!

To attach hook, choose “Settings” item from Tortoise’s context menu. Then select “Hook scripts” element and click “Add…” button. Such window will appear:

TSVN hooks configuration window

Set „Hook Type” to „Pre-Commit Hook”. Fill “Working Copy Path” field with a path to the directory that contains local copy of the repo (different folders can have different hooks). In “Command Line To Execute” field, set path to the application that implements the hook. Check “Wait for the script to finish” and “Hide the script while running” options (the latter will prevent console window from showing). Press “OK” button and voila, hook is installed!

Now mark some code with “NOT_FOR_REPO” comment and try to execute commit. You should see something like that:

Operation blocked by pre-commit hook

Notice the „Retry without hooks” button – it allows commit to be completed by ignoring hooks.

We now have a hook that prevents submission of temporary code. One may also want to create a hook that enforces a non-empty log message, blocks commits of *.log files etc. Your private hooks, you decide! And if some of the hooks turn out to be useful for the whole team, you can always remake them as Subversion hooks.

Tested on TortoiseSVN 1.7.8/Subversion 1.7.6.

Update 17.09.2013 (additional info): You may set the hook on a parent folder that contains checkouts of multiple repositories. If you are willing to sacrifice a bit of performance for added protection, you may skip filtering files by extension before checking for the NotForRepoMarker marker.

How to close pop-ups upon main window closure or logout?

by Miłosz Orzeł 4. November 2012 14:58

Imagine you have to provide support for some really old web application. The app has one main window and pop-up windows that show some sensitive information (for example payroll list). Client wants to ensure that all pop-ups are closed when user leaves main window or clicks “logout” button in this window...

So... how to close all the windows opened with window.open?

On the web this question comes up very often. Unfortunately, most common answer is really naive. Proposed solution is based on keeping references to opened pop-ups and subsequent invocation of close method: 

var popups = [];
function openPopup() {
    var wnd = window.open('Home/Popup', 'popup' + popups.length, 'height=300,width=300');
    popups.push(wnd); // keep the reference for later
}
function closePopups() {
    for (var i = 0; i < popups.length; i++) {
        popups[i].close();
    }
    popups = [];
}

In practice this doesn’t work because the array of references is cleared at full page reload (for example after clicking on a link or upon postback)...

Other suggested solution is to give the pop-up a unique name (using the second parameter of the open method) and later acquisition of a reference to the window:

var wnd = window.open('', 'popup0');
wnd.close();

This is based on the fact that the window.open method works in two modes:

  1. If a window with a given name doesn’t exist, it is created.
  2. If a window with a given name does exist, it will not be recreated, instead a reference to that window will be returned (if non empty URL is passed to the open method pop-up will be reloaded).

The problem lies at point no. 1. If pop-up window with given name wasn’t previously opened, the call to open and close methods will cause the pop-up to be briefly visible. It sucks…

But maybe a reference to pop-up can be retained between page reloads?

If there is no need to support older browsers (unlikely for the old application) we can try to put reference to the pop-up window into localStorage. However, this will not work:

var popup = window.open('', 'test');
localStorage.setItem('key', JSON.stringify(popup)); 
TypeError: Converting circular structure to JSON
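The error is easier to understand with a plain object in place of a real window. In the sketch below, fakeWindow and canBeStoredAsJson are hypothetical names used only for illustration; the object refers to itself in the same way a real window does (window.window === window), and that self-reference is assumed to be what JSON.stringify trips over.

```javascript
// A plain object standing in for a window: just like a real window object,
// it contains a reference to itself.
var fakeWindow = { name: 'test' };
fakeWindow.window = fakeWindow;

// Returns true when the value survives JSON.stringify, false otherwise.
function canBeStoredAsJson(value) {
    try {
        JSON.stringify(value);
        return true;
    } catch (ex) {
        return false; // e.g. TypeError: Converting circular structure to JSON
    }
}
```

Even if serialization succeeded, localStorage keeps only strings, so what would come back after a reload is a piece of text and not a live reference to the pop-up.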

Old tricks for keeping page state between reloads, for example ones based on cookies, will not work either.


So… what to do?

Even if you can’t afford to have a major change such as introducing frames, don’t give up :)

Pop-up windows have an opener property that points to the parent window (that is, the window in which the call to window.open was placed). Pop-ups can therefore periodically check whether the main window still remains open. Additionally, pop-ups can access variables from the parent window. This can be used to enforce pop-up closure when the main window is closed or when the user clicks the “logout” button in the parent window. When the user is logged in (and only then!), a marker variable (e.g. loggedIn) should be set in the main window.

Here is the JS code that should be placed on a page displayed in a pop-up:

window.setInterval(function () {
    try {
        if (!window.opener || window.opener.closed === true || window.opener.loggedIn !== true) {
            window.close();
        }
    } catch (ex) {
        window.close(); // FF may throw security exception when you try to access loggedIn (for external site)
    }
}, 1000);

Checking variable from the opener window has another advantage. If user moves away from our application in main window (for example by clicking back button or a link to an external website), then the pop-up window will detect the lack of monitored variable in window.opener and close automatically.
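The decision made inside the timer can also be pulled out into a separate function, which makes it possible to try it out without opening any windows. This is only a sketch: shouldClosePopup is a hypothetical name, and the loggedIn marker is the one assumed in the text above.

```javascript
// Hypothetical helper: decides whether a pop-up should close itself,
// based on the state of the window that opened it.
function shouldClosePopup(opener) {
    try {
        return !opener || opener.closed === true || opener.loggedIn !== true;
    } catch (ex) {
        return true; // access to the opener was denied, so treat it as gone
    }
}

// Possible usage inside the pop-up page:
// window.setInterval(function () {
//     if (shouldClosePopup(window.opener)) {
//         window.close();
//     }
// }, 1000);
```

Passing simple objects such as { closed: false, loggedIn: true } is enough to check all the cases: missing opener, closed opener, missing marker and the security exception.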

Well, it's not the kind of code you enjoy writing, but it achieves the desired result despite the painful gaps in the browser API. If only they provided us with a window.exists('name') method...

Ref modifier for reference types and a bit of SOS

by Miłosz Orzeł 9. April 2012 23:30

Take a look at the following code and think what value will be displayed on the console (note that string is a reference type)?

using System;
class Program
{
    static void Test(string y)
    {
        y = "bbb"; // changes only the local copy of the reference
    }
    static void Main()
    {
        string x = "aaa";
        Test(x);
        Console.WriteLine(x);
    }
}

The correct answer (aaa) is not all that obvious. You will see aaa because, without a ref modifier, a program written in C# passes a copy of the parameter value (for value types) or a copy of a reference (for reference types).

When parameter y in method Test receives a new text value, the CLR does not modify the array of chars. Instead, a new string is created and a reference to it is assigned to variable y (more info here). Variable y in method Test is, however, just a copy of the reference held in the x variable of the Main method.

To actually change the text hidden under x variable, use the ref modifier (you have to set it both in the method declaration and its invocation - C# enforces such behavior for clarity):

using System;
class Program
{
    static void Test(ref string y)
    {
        y = "bbb"; // y is an alias for x, so x changes too
    }

    static void Main()
    {
        string x = "aaa";
        Test(ref x);
        Console.WriteLine(x);
    }
}

After this change, console will show bbb text.



The way in which parameters are passed to a method can be examined with a tool called SOS (Son of Strike). We will use the CLRStack -a command, which displays information about parameters and local variables on the managed code stack (if you don't know how to use SOS look here and here, if you wonder where the name "Son of Strike" came from, click here)...

Below are the results of CLRStack -a command executed at the time of entry to the Test method.

For code without ref modifier:

!CLRStack -a
OS Thread Id: 0x176c (5996)
Child SP IP       Call Site
0031f114 00390104 Program.Test(System.String)
        y (0x0031f114) = 0x025cb948

0031f158 003900af Program.Main()
        0x0031f158 = 0x025cb948

0031f3c0 656721bb [GCFrame: 0031f3c0]

For code with ref modifier:

!CLRStack -a
OS Thread Id: 0x934 (2356)
Child SP IP       Call Site
001dee34 002f00f4 Program.Test(System.String ByRef)
        y (0x001dee34) = 0x001dee78

001dee78 002f00aa Program.Main()
        0x001dee78 = 0x027fb948

001df0ec 656721bb [GCFrame: 001df0ec]

An important difference that is exhibited by these results is the value of y parameter. In the case of code without ref modifier, it is the address of aaa string (0x025cb948). For the code with ref modifier, the value of y parameter is the address of x variable (0x001dee78) from Main method (that variable points to aaa string).

View State for TextBox and other controls that implement IPostBackDataHandler

by Miłosz Orzeł 8. January 2012 21:00

While reading the official training kit for 70-515 exam I came across this text: "With view state, data is stored within controls on a page. For example, if a user types an address into a TextBox and view state is enabled, the address will remain in the TextBox between requests.". If such statements can be found in recommended study guide, it should not come as a surprise, that confusion about the way ASP.NET Web Forms tries to cope with inherent statelessness of HTTP is so common… ;)

TextBox control from ASPX page:

<asp:TextBox ID="TextBox1" runat="server"></asp:TextBox>

is rendered on HTML page as an input tag:

<input name="TextBox1" type="text" id="TextBox1" />

If so, then the preservation of TextBox value between requests does not require any use of __VIEWSTATE hidden field. To illustrate this, create a simple page that contains TextBox and Button controls:

    <form id="form1" runat="server">
        <asp:TextBox ID="TextBox1" runat="server"></asp:TextBox>
        <asp:Button ID="Button1" runat="server" Text="Button" onclick="Button1_Click" />
    </form>

and add a handler for the button’s Click event, whose only task is to extend the text in the TextBox1 control:

protected void Button1_Click(object sender, EventArgs e)
{
    TextBox1.Text += "X";
}

Then, run the page and activate a tool for monitoring communication between browser and server. We are interested in testing form data that is sent to the server at postback... If you are using IE, I can recommend you a debugging proxy called Fiddler. Under Firefox, use Firebug. You can also use built-in ASP.NET Trace feature – to do so, add Trace = "true" to @Page directive. I performed my tests using development tools provided with Chrome browser ("Network" tab).

The following screenshot shows what form data (HTTP POST request) was sent after first button press:

Form data at first postback

And here is data from second postback:

Form data at second postback

If you compare data from first and second requests, you will see that a change in the value of TextBox1.Text does not affect the value of __VIEWSTATE field. Expanding the field would be a waste of network resources if text is being sent to server in a separate field called TextBox1.

System.Web.UI.WebControls.TextBox class is one of several classes that implement IPostBackDataHandler interface. This interface requires LoadPostData method. After page initialization is completed (but before the Load event) loading of View State data is invoked (LoadViewState) and then (if the control implements IPostBackDataHandler), loading of form data is invoked (LoadPostData). Text property of a TextBox control can therefore be set even if View State mechanism is disabled (via EnableViewState = "false" setting).

So... Can we completely disable View State mechanism for TextBox controls and the like?

No. For example, View State is useful when TextChanged event is handled (for comparison between current and previous value). It can also be used when the value that is being set is other than the one related to control’s value (e.g. ForeColor).

Detection of loading an iframe created in Ext JS

by Miłosz Orzeł 5. November 2011 19:31

Suppose that you need to execute a block of code when iframe's content is loaded. In case when iframe is created statically in HTML markup, the solution is really simple. All you have to do is to connect some JavaScript function with load event:

<iframe src="" width="600" height="400" onload="someFunction();" ></iframe>

Note: The load event (onload) is invoked when the entire contents of the document is loaded (including its external elements such as images). If you need to act earlier, that is at a time when the DOM is ready, use the other methods...

But what if the iframe is created with Ext JS code?

A simple way to set it up is to use Ext.BoxComponent with the correct autoEl property value. This gives you the ability to easily use an iframe in an Ext JS layout (e.g. as a child item of Ext.Window), without extending the document tree with redundant elements.

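var iframeContainer = new Ext.BoxComponent({
    autoEl: {
        tag: 'iframe',
        frameborder: '0',
        src: ''
    },
    listeners: {
        afterrender: function () {
            // the iframe element exists now, so its load event can be hooked
            this.getEl().on('load', function () {
                // code to execute when iframe's content is loaded
            });
        }
    }
});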

In the above code (Ext JS 3.2.1), the really important thing is the time when the iframe's load event is hooked. You can do it only after the control (BoxComponent) is rendered. If you try this earlier, getEl() will return undefined and the code will fail. Prior to rendering, an Ext JS control is just a JavaScript object for which no document tree elements exist. Below are two screenshots showing the HTML snippets created by Ext.Window in which the only item was the BoxComponent creating the iframe tag...


DOM beforerender


DOM afterrender

You can clearly see that premature connecting to the load event is futile, because you simply cannot listen to events on something that does not exist.

Those screenshots come from Elements window of Chrome Developer Tools. A quick way to show that tool (of course in Google's browser) is to press F12 or Ctrl+Shift+I. Nice feature of CDT is the ability to show events being listened on a DOM element. To see the list you have to select DOM element and, on the right side menu, choose "Event Listeners" tab. On the screenshot below, you can see that iframe's load event is indeed used:

CDT Event Listeners

What for?

I can’t imagine working as a programmer without hundreds of web pages on which people "wasting" their free time share what they managed to find out. Therefore I will try to add a bit of useful information to the web’s resources myself...


This blog is my first attempt to write in English so if you see any language mistakes please let me know. I didn’t have enough time to translate most of my old posts but I will try to make new ones both in Polish and in English.