Parsing Windows Media Services Log Files

 

Christopher Diggins

Microsoft Corporation

March 2007

 

Applies to:

   Microsoft® Windows Media® Services 9

 

Summary: Discusses the Windows Media Services 9 logging model, explains how to use the free Microsoft utility Log Parser 2.2 to parse Windows Media Services log files, and provides some sample scripts to accomplish common tasks. This document assumes some knowledge of Windows Media Services and server-side logging.

Contents

Introduction

Windows Media Services Log Files

   Windows Media Services Log Entries

Fast Caching

Analyzing Log Files

About Log Parser

   Writing Log Parser Queries

   Basic Log Parser Usage

   Aggregating Data

   Analyzing Aggregated Data

   Using Log Parser COM Objects

For More Information

Introduction

This article introduces the topic of parsing Windows Media Services 9 log files. The Windows Media Services 9 logging model is briefly introduced. Special issues pertaining to the parsing of Windows Media Services log files are explained, and techniques for parsing log files using the Log Parser 2.2 utility from Microsoft are demonstrated with sample scripts.

Most online content servers generate log files that contain specific details about the successful delivery of content or errors that might have occurred. The information contained in a log file is usually too unwieldy to be analyzed without special tools. Fortunately, Microsoft provides a free utility for parsing log files called Log Parser, which makes it much easier to manage the data.

Not only is Log Parser a useful utility, it comes with a dynamically linked library (DLL) that allows you to access it from any COM-enabled language.

Windows Media Services Log Files

Log files generated by Windows Media Services 9.0 follow the W3C extended log file format. There are 52 different fields for each log entry. These are each described in detail in the technical article Windows Media Services 9 Logging Model (https://go.microsoft.com/fwlink/?LinkId=85570).

Windows Media Services Log Entries

In a typical HTTP log file, a single log entry represents a single attempt to download a specific file. If successful, a single entry is commonly referred to as a hit. In a Windows Media Services log file, a single log entry can contain information pertaining to streaming, or rendering, or both. A Windows Media Services log entry is one of the following types:

  • Client render log entry
  • Client stream log entry
  • Combination log entry
  • Multicast log entry
  • Distribution server log entry
  • Cache/proxy server log entry
  • Cache/proxy propagated client log entry
  • Web proxied log entry

Client Render Log Entry

The client render log entry contains information about the playback experience of a user (for example, how long the file was played, etc.).

Client render log entries are available only when a client is using Windows Media Player 9 or later (or a player built using the Windows Media Player 9 or later SDK), and the fast caching feature is enabled on both the server and the client.

Traditionally, the client rendering and streaming data was contained in a single log-entry. However, with the introduction of fast caching, a client would finish downloading the data before it finished rendering, so the rendering and streaming data was separated.

A client render log can be identified by the fact that the protocol field is equal to "cache".

For more information, see Fast Caching.

Client Stream Log Entry

A client stream log entry contains information about the delivery of content from the server to a client. Client stream log entries have the following characteristics:

  • The audiocodec field is empty.
  • The videocodec field is empty.
  • The c-hostexe field is not "WMServer.exe".
  • The cs(User-Agent) field starts with one of the following values: "WMFSDK/9", "WMPlayer/9", "NSPlayer/9".

Combination Log Entry

Combination log entries (also known as legacy log entries) contain both streaming and rendering data. Combination log entries can be generated in one of two situations:

  • The client is using a version of Windows Media Player prior to version 9 (or a player based on an earlier version of Windows Media Player SDK than version 9).
  • The fast caching feature is disabled on the Windows Media Player or the server.

The combination log entries have the following characteristics:

  • The protocol field is not equal to "cache".
  • The audiocodec field is not empty.
  • The videocodec field is not empty.

Multicast Log Entry

Multicasting is a form of connectionless broadcasting, which means that there is no direct connection between clients and the Windows Media server. The logging is performed by an Internet Server API (ISAPI) logging application by using wmsiilog.dll.

If a multicast publishing point is using a server-side playlist as a source, Windows Media Player versions 7 and later send one log entry for each playlist item. In addition, if the Windows Media Player buffers the entry while playing a playlist item, then the Player will send a log entry. This happens each time the player enters the buffering state. However, Windows Media Player 6.4 sends only one log entry when the user stops the stream. The logging information contains the accumulated data for all playlist items; Windows Media Player 6.4 and earlier versions do not send a log entry for each playlist item.

The ISAPI logging application does not have any information about the streams; thus, it cannot fill in the server-side information such as the number of bytes sent or the number of packets sent. In addition, you cannot see server-generated error log entries.

Multicast log entries have the following characteristics:

  • The protocol field is "asfm".
  • The channelURL field is not empty.

Distribution Server Log Entry

A distribution server log is generated when the downstream client is a distribution server. The distribution configuration is a Windows Media server (the distribution server) with a publishing point that is configured to source from an upstream server (the origin server). When the distribution server stops receiving data (by either the distribution or origin server's behavior), it sends a distribution server log entry.

Log entries sent by distribution servers do not have rendering data because the distribution server only streams content to its clients. This kind of log entry can be distinguished by the following characteristics:

  • The c-hostexe field is "WMServer.exe".
  • The audiocodec field is empty.
  • The videocodec  field is empty.
  • The cs(User-Agent) field starts with WMServer/9 or NSServer/4.1.
  • The c-playerid field is "{00000000-0000-0000-0000-000000000000}".

The totalClients field will contain the number of clients connected to the distribution server when the log was sent. This might differ from the total number of clients that played from the distribution server.

Cache/Proxy Server Log Entry

This kind of log entry has the following characteristics:

  • The c-hostexe field is "WMServer.exe".
  • The audiocodec field empty.
  • The videocodec field is empty.
  • The s-proxied field is "1".
  • The cs(User-Agent) field starts with "WMCacheProxy/9".
  • The c-playerid field is "{00000000-0000-0000-0000-000000000000}".

Cache/Proxy Propagated Client Log Entry

This kind of field has the following characteristics:

  • The s-cpu-util field is empty.
  • The s-totalclients field is empty.
  • The s-proxied field is "1".

Web Proxied Log Entry

A log entry that occurs due to proxying by a Web server can be identified by the following characteristics:

  • The c-hostexe field is not "WMServer.exe".
  • The s-proxied field is "1".

Fast Caching

To understand and parse Windows Media Services log files effectively, you must be familiar with the scenario known as fast caching. This occurs when Windows Media Player 9.0 or later versions download streaming content at a fast speed and cache it locally to play back at normal speed. Fast caching is not supported for live scenarios and must be enabled on both the server (the default) and the player to occur.

There are two possible log entries when fast caching is enabled: cache hits and cache misses.

Cache Hit Scenario

In this scenario, only a single client-rendering log entry is generated by Windows Media Services. This log entry can be identified because the protocol field is equal to "cache".

Cache Miss Scenario

In this scenario, a separate streaming log entry and a client rendering log entry will be generated.

The implication of the cache scenario means that it is common for later versions of Windows Media Player to generate two log entries for what appears to the user as a single action of downloading content.

Analyzing Log Files

When parsing a log file, you are usually trying to answer some simple questions. For example:

  • On average, how long was specific content rendered?
  • How often was specific content downloaded by different people?
  • Which streaming content was most popular?

Unfortunately, the answers are not as easily resolved as one might hope, due to the complex layout of Windows Media Services log files. Consider the following example of computing rendering times:

Computing Rendering Times

The rendering time is stored in the log file under the x-duration field. An unfortunate issue with the x-duration field is the fact that it is also used to represent streaming times, depending on the type of log entry. For example, when the client streams content from the server and caches it locally for viewing. For more information, see Fast Caching.

Another problem can arise because a single user can be connected and disconnected multiple times while downloading a stream. This might or might not affect the user experience.

When you examine the rendering duration time for specific content, it happens occasionally that the total rendering time might be significantly greater than the length of the content. This might indicate the content was looped and watched repeatedly. However, it is more likely that the player was left on while the user performed another task. Using this number as-is can skew your statistics, so the best solution is to restrict any rendering time so that it is not longer than the actual length of the file.

The general solution to the problem of computing rendering times, and other similar issues, is to pre-process the data and to store it in an intermediate form. This enables you to easily perform statistical computations on the aggregated data.

One obvious approach to storing the aggregated data in an intermediate form is to use a database, such as SQL Server. This solution has its merits, but is expensive, complex, and relatively slow for simple calculations.

There is a free solution from Microsoft called Log Parser, which is ideal for parsing log files, in single or multiple steps. Not only is Log Parser free, it is very efficient.

About Log Parser

Log Parser 2.2 is a free command-line utility (and COM object implementation) from Microsoft for executing Structured Query Language (SQL) queries on tabular data. This makes it very powerful, but the learning curve can be a bit steep. Some knowledge of SQL Server is helpful in learning how to use Log Parser, but this article does not assume such knowledge.

Log Parser serves two primary roles in the analysis of log files:

  1. It enables you to convert all or part of a log file into a more manageable format, such as a comma separated values (CSV) file, which can then be read by spreadsheet programs (such as Microsoft Excel).
  2. Log Parser enables you to query log files for information of particular interest.

Log Parser cannot be used from the command prompt. You can use its functionality from any COM-enabled language. This is not covered in this article because, for most purposes, it is sufficiently flexible and more efficient to use Log Parser from the command prompt. However, anything you can do at the command prompt can easily be accomplished by using Log Parser COM objects. For more information, see the documentation included with Log Parser.

This section of the documentation is intended only as a brief overview of the Log Parser functionality. For more information, see the documentation included with Log Parser or see For More Information.

You can download Log Parser for free at this Microsoft Web site Microsoft TechNet (https://go.microsoft.com/fwlink/?LinkId=85571)

Writing Log Parser Queries

Log Parser uses a near-subset of SQL to construct queries. A query is an expression that is designed to extract information from a structured arrangement of data. This section briefly explains the components of a Log Parser query, but you can skip ahead without reading or fully understanding this section.

A query comprises up to eight separate clauses. The order of the clauses is important:

  • SELECT clause. Specifies the fields of the output records to be returned by the query. Select clauses may also include aggregate functions, which perform a calculation on a set of values.
  • USING clause (optional). Declares aliased field-expressions that do not appear in the output records but can be referenced anywhere in the query. The using clause is intended to improve the readability of a query.
  • INTO clause (optional). Specifies the output target(s) to which the query output records are to be written.
  • FROM clause. Specifies the input format source(s) from which the query input records are to be read.
  • WHERE clause (optional). Specifies a Boolean condition that must be satisfied by an input record for that record to be output. Input records that do not satisfy the condition are discarded.
  • GROUP BY clause (optional). Specifies the groups into which output rows are to be placed and, if aggregate functions are included in the SELECT or HAVING clauses, calculates the aggregate functions values for each group.
  • HAVING clause (optional). Specifies a Boolean condition that must be satisfied by a group for the group record to be output.
  • ORDER BY clause (optional). Specifies the field-expressions to use for sorting the output records.

For more information about each of the clauses and construction of SQL queries specific to Log Parser, see the documentation that accompanies Log Parser 2.2.

Basic Log Parser Usage

The following examples are Log Parser queries that can be passed as either a command-line argument to Log Parser, or preferably placed in a query file. You must have installed LogParser.exe to use these examples. If you save the query to a file named query.sql, you can invoke the examples as follows:

LogParser.exe file:query.sql?infile=*.log+outfile=output.csv -q:on -i:W3C -o:CSV

The -q:on option tells Log Parser to execute in a non-verbose mode. The -i:W3C option tells Log Parser that the input files will conform to the W3C extended log file format (this is the default for Windows Media Services). The -o:CSV option tells Log Parser to output the result of the query in a comma separated value (CSV) format. The additional characters after the SQL file name, "?infile=input.log+outfile=output.csv", tell Log Parser to replace instances of "%infile%" and "%outfile%" in the query file (.SQL) with the values "input.log" and "output.csv," respectively.

Converting from a Log file to a CSV File

The most basic yet useful technique for using Log Parser is to convert Windows Media Services Log Files into CSV files which can be loaded and manipulated directly from Excel or other spreadsheet software. To convert an entire Windows Media Services  log file directly into a CSV file, the script is as follows:

SELECT * INTO '%outfile%' FROM '%infile%'

Note that selecting * will include additional fields not in the original log file, specifically, the log file name (LogFilename) and row number (RowNumber).

Selecting Specific Fields

Selecting all of the data in a log file often yields more column data than you actually need. To output only specific fields, you can list them in the SELECT clause as follows:

SELECT 
    c-ip, date, time, cs-media-name
INTO 
    '%outfile%'
FROM 
    '%infile%'

Restricting Output Records

It is not uncommon that you search only for data that satisfies some specific criteria. You can achieve this by using the WHERE clause, to express the condition that must be true for a field to be output. The following example outputs only records for those log entries that have a status code of 200.

SELECT 
    c-ip, date, time, cs-media-name, c-status
INTO 
    '%outfile%'
FROM 
    '%infile%'
WHERE
    c-status = 200

Ordering Output Records

It is very useful to output records in order. Do this by using the ORDER BY clause as in the following example:

SELECT 
    c-ip, date, time, cs-media-name, c-status
INTO 
    '%outfile%'
FROM 
    '%infile%'
WHERE
    c-status = 200
ORDER BY
    cs-media-name ASC

The ASC stands for ascending, and can be replaced with DESC.

Counting Groups of Records

Sometimes, you might count groups of records, such as those for a specific IP address as in the following example:

SELECT TOP 25
    c-ip, COUNT(*) as cnt
INTO 
    '%outfile%'
FROM 
    '%infile%'
GROUP BY
    c-ip
ORDER BY
    cnt DESC

Notice the addition of "TOP 25" to the SELECT clause. This restricts the number of output records to a specified number. This makes sense only when the ORDER BY clause is included.

The data output by this query counts log entries regardless of status code. So the numbers represent successful connections, cache hits, automatic reconnections, and numerous other error conditions.

Quickly Identifying Troublesome Content

A simple way to identify what content might be causing the most problems for your customers is to identify the media files that generate the most error codes by using the following query:

SELECT TOP 25
    cs-media-name, COUNT(*) as cnt
INTO 
    '%outfile%'
FROM 
    '%infile%'
WHERE
    c-status != 200
GROUP BY
    cs-media-name
ORDER BY
    cnt DESC

This query restricts the counting of log entries to status codes that are different than 200, which indicates success.

Identifying Missing Content

Sometimes, content is simply not accessible. The following query identifies content that has produced an status code of 404, indicating that it could not be found.

SELECT TOP 25
    cs-media-name, COUNT(*) as cnt
INTO 
    '%outfile%'
FROM 
    '%infile%'
WHERE
    c-status = 404
GROUP BY
    cs-media-name
ORDER BY
    cnt DESC

Often 404 status codes are generated because of a badly formed universal resource indicator (URI) or content has moved and links have not been updated.

The following query generates a very rough estimate of which content is the most popular:

SELECT TOP 25
    cs-media-name, COUNT(*) as cnt
INTO 
    '%outfile%'
FROM 
    '%infile%'
WHERE
    c-status = 200
ORDER BY
    cnt DESC

This is only a crude measure because sometimes Web crawlers (also known as bots) or other automated programs can skew your statistics, by generating hundreds or thousands of log entries for otherwise unvisited content. Furthermore, this query does not take into account the fast caching scenario.

Identifying Skewed Statistics

The following query can tell you which IP addresses are downloading specific content most often and, as a result, skewing the popularity data:

SELECT TOP 25
    cs-media-name, c-ip, COUNT(*) as cnt
INTO 
    '%outfile%'
FROM 
    '%infile%'
WHERE
    c-status = 200
GROUP BY
    cs-media-name, c-ip
ORDER BY
    cnt DESC

More Accurate Popularity

Counting unique IP addresses for each piece of content yields more accurate data about the popularity of specific content. Unfortunately, this is not easy to do because there are often multiple records for a single hit. To avoid this problem, first aggregate the data and then analyze the aggregated data.

Aggregating Data

When analyzing a Windows Media Services log file, you can compute sophisticated statistics more easily by first aggregating the log entry data and then computing additional fields. It is much easier to compute more specific statistics from this data.

One way to do this is to group the data according to client IP address and the media file name. This way, a single client accessing a single file multiple times can be treated as a single data record (a row in a spreadsheet).

Simple Data Aggregation

The following query yields an approximation of the number of successful hits and errors, by using the SUM and COUNT aggregate functions:

SELECT
    c-ip,
    cs-media-name,
    COUNT(*) as cnt,
    SUM(CASE c-status WHEN 200 THEN 1 ELSE 0 END) as hits,
    SUM(CASE c-status WHEN 200 THEN 0 ELSE 1 END) as errors
INTO 
    '%outfile%'
FROM 
    '%infile%'
GROUP BY
    c-ip,
    cs-media-name

This example was only an approximation because it counts both server log entries and rendering log entries.

Separating Server and Rendering Log Entries

In the fast caching scenario, two separate log entries will be generated for a successful download of a single piece of media. One contains delivery information (the server log entry) while the other contains rendering information (the client log entry). The following query separates the two kinds of hits by using nested aggregate functions:

SELECT
    c-ip,
    cs-media-name,
    COUNT(*) as logentries,
    SUM(CASE protocol WHEN 'Cache' 
        THEN CASE c-status WHEN 200 THEN 1 ELSE 0 END 
        ELSE 0 
        END) as cache_hits,
    SUM(CASE protocol WHEN 'Cache' 
        THEN 0 
        ELSE CASE c-status WHEN 200 THEN 1 ELSE 0 END 
        END) as server_hits
INTO 
    '%outfile%'
FROM 
    '%infile%'
GROUP BY
    c-ip,
    cs-media-name

By nesting the two CASE functions, you can assure that you count successful hits for rendering data or server data in separate columns.

In the preceding example, the sub-expression was used twice for symmetry. However, you can exploit the fact that a cache hit will always have a status code of 200. The simplified form of the query is:

SELECT
    c-ip,
    cs-media-name,
    COUNT(*) as logentries,
    SUM(CASE protocol WHEN 'Cache' 
        THEN 1 
        ELSE 0 
        END) as cache_hits,
    SUM(CASE protocol WHEN 'Cache' 
        THEN 0 
        ELSE CASE c-status WHEN 200 THEN 1 ELSE 0 END 
        END) as server_hits
INTO 
    '%outfile%'
FROM 
    '%infile%'
GROUP BY
    c-ip,
    cs-media-name

Computing Actual Hits

Choosing a single number for the correct representation of a hit is difficult. Choosing to count only client rendering log entries will not count hits for versions of Windows Media Player earlier than version 9, so choosing to count server log entries is preferable.

Counting only server log entries might result in the following anomalous situation. Sometimes, a client viewed cached data that was not downloaded from the server recently, so it might not be in the log files examined or it might be out of the range of the query. This means that you can end up with zero for the number of hits, which can cause errors in statistical calculations. This is rare, so the simplest solution—to ignore cache hits without accompanying server hits—is the best solution for most circumstances.

Analyzing Aggregated Data

This section covers examples of how to extract useful information from the output generated by the following query:

SELECT
    c-ip as agg_ip,
    cs-uri-stem as agg_uri,
    cs-media-name as agg_media_name,
    date as agg_date,
    c-playerversion as agg_player_version,
    c-playerlanguage as agg_player_language,
    c-os as agg_os,
    c-osversion as agg_os_version,
    c-cpu as agg_cpu,
    filelength as agg_file_length,
    filesize as agg_file_size,
    COUNT(*) as agg_log_entries,
    SUM(CASE protocol WHEN 'Cache' THEN 1 ELSE 0 END) as agg_cache_hits,
    SUM(CASE protocol WHEN 'Cache' THEN 0 ELSE 
        CASE c-status WHEN 200 THEN 1 ELSE 0 END 
        END) as agg_server_hits,
    SUM(CASE c-status WHEN 200 THEN 1 ELSE 0 END) as agg_200,
    SUM(CASE c-status WHEN 210 THEN 1 ELSE 0 END) as agg_210,
    SUM(CASE c-status WHEN 401 THEN 1 ELSE 0 END) as agg_401,
    SUM(CASE c-status WHEN 404 THEN 1 ELSE 0 END) as agg_404,
    SUM(CASE c-status WHEN 408 THEN 1 ELSE 0 END) as agg_408,
    SUM(CASE c-status WHEN 500 THEN 1 ELSE 0 END) as agg_500,
    SUM(CASE c-status WHEN 200 THEN 0 ELSE 1 END) as agg_errors,
    SUM(CASE protocol WHEN 'Cache' THEN x-duration ELSE 0 END) as agg_rendering_duration,
    SUM(CASE protocol WHEN 'Cache' THEN 0 ELSE 
        CASE c-status WHEN 200 THEN x-duration ELSE 0 END 
        END) as agg_streaming_duration
INTO 
    '%outfile%'
FROM 
    '%infile%'
GROUP BY
    c-ip,
    cs-uri-stem,
    cs-media-name,
    date,
    c-playerversion,
    c-playerlanguage,
    c-os,
    c-osversion,
    c-cpu,
    filelength,
    filesize

In the examples that follow, it is assumed that you will have used this query (or one very similar) to output the data to a file named output.csv.

After you have done so, you will then want to execute the following examples on the aggregated data (output.csv) into a new output file (say final.csv) by using a new batch (.bat) file containing the following text:

LogParser.exe file:myquery.sql%infile=output.csv+outfile=final.csv -q:off -i:CSV -o:CSV

Identifying the Most Visited Content

After you have aggregated the data, you can then get accurate numbers about the most popular content, by using the following query:

SELECT TOP 25
    agg_media_name, COUNT(*) as cnt
INTO 
    '%outfile%'
FROM 
    '%infile%'
GROUP BY
    agg_media_name
ORDER BY
    cnt DESC

Each row in the input CSV file corresponds to all of the visits to a single piece of content from a single IP address. This is referred to as a visit. Computing the most popular content by unique IP addresses usually produces more useful information because it avoids counting bots or spiders, which can download a single piece of content thousands of times. This method also has the advantage that multiple automated reconnects are also not counted uniquely.

Identifying Error Prone Content

One way to identify content that produces the most errors is to count all of the visits that have at least one error, by using the following query:

SELECT TOP 25
    agg_media_name,
    SUM(CASE agg_errors WHERE 0 THEN 0 ELSE 1 END) as err_cnt
INTO 
    '%outfile%'
FROM 
    '%infile%'
GROUP BY
    agg_media_name
ORDER BY
    err_cnt DESC

Using Log Parser COM Objects

The Log Parser COM objects can be used from any programming environment supporting automation, including C++, C#, Visual Basic, JScript and VBScript.

Before using the Log Parser scriptable COM objects on your computer, the LogParser.dl binary must be registered with the computer's COM infrastructure. This should be done automatically by the setup package, but, if it is not, you can also do this manually by executing the following command in the directory containing the LogParser.dll binary:

C:\LogParser>regsvr32 LogParser.dll

The Log Parser COM objects are fully documented in the Log Parser Help file.

For More Information

Web addresses can change, so you might be unable to connect to the Web site or sites mentioned here.