2010-07-26

Playing with PowerShell

PowerShell becomes more appealing as you learn it. For example, I had to look up what different values of “encoding” people use in XML posts to a site. To do this, you have to loop through every line in the logfile, extract the value and group by a count. You can do this with Log Parser, but Log Parser lacks support for easily matching parts of a string. You have to work with INDEX_OF and SUBSTRING and it's just tedious. You can do it with LINQPad if you write the .NET code yourself, but the framework doesn't have convenience methods for everything (like processing multiple files). Enter PowerShell, where it's trivial to combine command-line tools with .NET code. Here was my first attempt.

findstr /i /c:encoding 07-23-2010.log | foreach { [text.regularexpressions.regex]::match($_, "encoding *= *['`"](?<encoding>[^'`"]+)") } |? { $_.success } | foreach { $_.groups["encoding"].value } | Group-Object -NoElement | Sort-Object Count -Descending | Format-Table -AutoSize

Result:

Count Name
----- ----
10071 UTF-8
  908 ISO-8859-15
   49 ISO-8859-1
   16 UTF8


(OK, this wasn’t quite my first attempt - the stuff after Group-Object was added later when it became clear it was necessary. If you leave it out, the output should be the same by coincidence (Group-Object doesn’t sort and Format-Table is the default, but without -AutoSize).)

The nice thing here is that PowerShell actually has built-in support for regular expressions, but if you’re still learning about it and you don’t know the cmdlets, then your existing knowledge of findstr and .NET transfers just fine. Here is the fully PowerShell way of doing it:

select-string -pattern "encoding *= *['`"](?<encoding>[^'`"]+)" 07-23-2010.log |% { $_.matches[0].groups["encoding"] } | group -NoElement | sort Count -Descending | ft -AutoSize

PowerShell has its own findstr called select-string, which yields MatchInfo objects that can be further manipulated. The % is short for Foreach-Object, incidentally. Even with IntelliSense, it’s quicker. I’m using other aliases as well.

The really neat thing here is Group-Object. If you were using a Unix command line, you’d probably have to puzzle a bit on this - the trick is uniq -c, which can produce counts of sorted lines. So all you need to do is produce a list of encodings, one for each line, sort them, count them, then sort by count. You can reduce a line to its encoding with sed or awk, for example. I’m going to use sed because I’m hardcore. (Note that this command line is Windows-specific, and we're using the built-in sort rather than the Unix sort.)

      sed -n "s/^.*encoding *= *['\""]\([^'\""]*\).*$/\U\1/p" 07-23-2010.log|sort|uniq -c|sort /r

Result:

  10071 UTF-8
    908 ISO-8859-15
     49 ISO-8859-1
     16 UTF8

PowerShell’s object-oriented approach is much nicer than having to combine sort and uniq in creative ways. You can, of course, have your cake and eat it too: why not call sed from within PowerShell?

sed -n "s/^.*encoding *= *['\`"]\([^'\`"]*\).*$/\U\1/p" 07-23-2010.log|group -NoElement|sort Count -Descending|ft -AutoSize

Now you’re playing with power… albeit power that’s pretty hard to use if you don’t know sed’s rather spartan regex syntax. Of course, the kids these days use Perl or Python for tasks like this rather than a dinosaur like sed, but I’m not digging into that here. You might as well do it in C# if you're going that way. In .NET 4, improvements were made to the File class so you can easily read and write the lines incrementally through an IEnumerable instead of having to slurp everything into memory.
    var encodings = 
        from line in File.ReadLines("07-23-2010.log")
        let match = Regex.Match(line, @"encoding *= *['""](?<encoding>[^'""]+)")
        where match.Success
        select match.Groups["encoding"].Value
    ;
    foreach (var encoding in encodings.GroupBy(encoding => encoding, StringComparer.OrdinalIgnoreCase)) {
       Console.WriteLine("{0}\t{1}", encoding.Key, encoding.Count());
    }

Result:
UTF-8   10071
ISO-8859-15     908
ISO-8859-1      40
UTF8    16
This is a lot more code. On the plus side, if you already know .NET and LINQ well you probably have to spend less time thinking about it. Once you know PowerShell, though, the ability to do these things in a single line and incrementally starts saving time. Especially because none of the alternatives provided so far can handle multiple files gracefully, like the PowerShell version can:
select-string -pattern "encoding *= *[^'`"](?<encoding>[^'`"]+)" 07-*-2010.log |% { $_.matches[0].groups["encoding"] } | group -NoElement | sort Count -Descending | ft -AutoSize
To search all files for this month, all I have to do is insert a wildcard. The alternatives all operate on single files and can’t do this as easily. Try extending the .NET version to see what I mean. I won’t even go into how you do this in Unix - sufficive to say it requires the horrible find command, xargs and cat and is by no stretch of the imagination easy, unless you do it every day.

2 comments:

Anonymous said...

In Unix you would probably use Perl:

echo "Count Name"

cat 07-23-2010.log |
perl -ne '/encoding *= *(['"'"'`"]?)([-a-zA-Z0-9_]*)/ and $a{$2}++; END {print map { "$a{$_} $_\n" } keys %a}' | sort -n


BTW: This part ('|`") of your regexp looks wrong. I think you really mean: (['`"]*)

Jeroen Mostert said...

Whoops, you're absolutely right about the | part -- this was an earlier version. The "right" expression even appears elsewhere. Of course the pipe works, but it's unnecessarily complicated. Corrected. (Note that the starting quote cannot be optional or appear any number of times, though -- that's invalid XML.)

The perl solution actually looks more complicated than the awk solution I worked out. I've always managed to somehow resist learning perl, though, so maybe there's a solution that's shorter but less readable, as I'm told there always is in perl. :-)