SQL Server 2005 Integration Services: Developing Custom Components for Project REAL

 

Jyoti Jacob, Scalability Experts, Inc.
Donald Farmer, Microsoft Corp.
Len Wyatt, Microsoft Corp.

October 2005

Applies to:
   Microsoft SQL Server 2005 Beta 2
   IDW 9 Community Preview

Summary: The first phase of Project REAL included the migration of existing SQL Server 2000 DTS packages to SQL Server 2005 Integration Services (SSIS) for a large electronic retailer. One source of information for the data warehouse was TLog files, which contain data from point-of-sale cash registers in stores. TLog files store data in packed decimal format. To extract and transform this data into a format suitable for loading into the relational data warehouse tables, it was necessary to unpack and parse the files. To achieve this, custom pipeline components were implemented to read and transform the data by extending the SSIS object model. This paper discusses what we learned while implementing the custom pipeline components that extract and parse data from the TLog files. (44 printed pages)

You can download the Microsoft Word version of this document.

Contents

Introduction: Project REAL
   Phase One Implementation
   Core ETL Processing
Why Custom Components?
Proposed Design for Loading the TLog Data into the Data Warehouse
   SSIS Object Model and Pipeline Components
Implementation of the Custom Components
   The Custom Source Adapter
   The Custom TLog Transformation
Deployment
   Generating a Strong Name for the Assembly
   Compiling the Task
   Moving the Task to the Appropriate Directories
   Adding the Task to the SSIS Toolbox
   Creating the Package Using the Custom Components
Debugging
Lessons Learned
Conclusion
   For more information

Introduction: Project REAL

Project REAL is an effort to discover best practices for creating business intelligence applications based on Microsoft SQL Server 2005 by creating reference implementations that are based on actual customer scenarios. This means that customer data is brought in-house and is used to work through the same issues that the customers face during deployment. These issues include:

  • The design of schemas
  • The implementation of a data extraction, transformation, and loading (ETL) process
  • The sizing of systems for production
  • The management and maintenance of the systems on an ongoing basis

By working with real deployment scenarios, we gain a complete understanding of how to work with the tools. Our goal is to attempt to address the full range of concerns that a large company would face during its own real-world deployment. This paper focuses on the experience of developing custom components while working on the ETL portion of Phase 1 of Project REAL.

Project REAL uses data from two Microsoft business intelligence customers. Phase 1 of the project was modeled on a large electronics retailer that keeps sales and inventory data in an SQL Server 2000 data warehouse. SQL Server 2000 Data Transformation Services (DTS) is used to manage the flow of data into the relational database and then into SQL Server 2000 Analysis Services cubes for reporting and interactive querying. This customer maintains approximately 200 GB of data in their relational store. All of this data is subsequently processed into Analysis Services cubes. The Phase 1 implementation focuses primarily on the concerns that an existing SQL Server 2000 customer might have when carrying out a migration to SQL Server 2005. Our results largely represent the migration of existing functionality, with a few new capabilities used where appropriate. In the area of ETL, there was a substantial amount of work to do to create packages with SQL Server 2005 Integration Services (SSIS) based on the existing SQL Server 2000 DTS packages.

Note   SQL Server Integration Services (SSIS) is the new name assigned to the component formerly branded DTS. The product was renamed after Beta 2; thus many screen shots of the new SSIS in this paper still use the old name.

Phase 2 of Project REAL is based on a larger data set from a different customer, and exercises more of the new capabilities of SQL Server 2005 than does Phase 1. This is because Phase 2 is primarily a new implementation of a SQL Server 2005 solution. Look for more papers about Project REAL on the Project REAL Web site.

Project REAL is a joint venture between Microsoft Corporation, Unisys, EMC, Scalability Experts, Panorama, Proclarity, and Intellinet. Scalability Experts and Microsoft Corporation performed the work described in this paper.

Phase One Implementation

The Phase 1 customer has two primary sources of information that flow into the data warehouse. These are:

  • TLog files, which are the output of point-of-sale (POS) cash registers in stores (not to be confused with SQL Server transaction log files).
  • Flat file extracts from a JDA system (a packaged retail software application), which are used to manage the business.

The overall flow of data, which is managed by SQL Server 2000 DTS, is shown in Figure 1.

Figure 1. Phase 1 customer data flow

Core ETL Processing

The core ETL processing that is performed to load this customer's data warehouse is as follows.

  1. TLog files come from point-of-sale (POS) cash registers in a special highly compressed format that must be decoded before the files can be loaded into a database. The customer's application compresses these POS transactions into a packed decimal format based on predefined specifications.
  2. The application assigns each file a name by using a numeric sequence and stores the file in the appropriate directory according to the store number. This naming convention is necessary to support files that span days as well as situations where several files are written per day.
  3. The customer uses Perl scripts to parse the TLog binary files into multiple text files before the data is loaded into the database. The scripts unpack the data using a predefined template and then uncompress the data according to a set of rules defined in a format specification (.ini) file.
  4. The output from the scripts is stored in flat files. The flat files are then read by a DTS package and the data is loaded into the database. Thus, two extra steps are required before the data can be processed: one to parse the data and one to load the flat file output.

To take advantage of the new features and performance improvements in SSIS, we decided to redesign the existing DTS package in SQL Server 2000. This would allow us to measure performance improvement over a simple migration of the existing methodology. As part of this migration, Scalability Experts created a TLog parser that runs in the SSIS pipeline. This effectively eliminates a costly extra step and helps reduce the storage requirements for processing TLog files. Each TLog file, containing compressed data, is read directly into the SSIS pipeline by using a custom source component and parsed by using a custom transformation component. The next section details the advantages of changing the architecture.

Note   This paper describes what we learned from writing the custom components that were required to implement the TLog parser. For general high-level lessons that were learned about SSIS in Phase 1 of Project REAL, see SQL Server 2005 Integration Services: Lessons from Project REAL.

So that we could fully exercise the system, the customer provided not only the initial state of the data warehouse, but also three months of daily incremental updates from both the sales and inventory systems. This allowed us to simulate a full processing cycle, as if run over time, including automated partition management in both the relational and OLAP databases.

In addition to the work on ETL that is described in this paper, the team executed various other activities in Phase 1 of Project REAL including:

  • Migrating the relational data from SQL Server 2000 to SQL Server 2005, while preserving the schema precisely
  • Masking the customer's data to protect confidential information
  • Migrating the Analysis Services 2000 database to SQL Server 2005 Analysis Services (SSAS)
  • Verifying client connectivity using the customer's preferred front-end tools (Microsoft Excel and Proclarity)
  • Creating sample reports against the new cubes using SQL Server 2005 Reporting Services (SSRS)
  • Fully implementing a new Inventory cube in SSAS. The customer previously had encountered difficulties working with semiadditive measures because of the high volume of data. Because of this, the customer had stopped work in this area. New capabilities in Analysis Services 2005 now make working with semiadditive measures feasible even with large data volumes.

The remainder of this paper documents what we learned during the process of moving the customer's ETL processing to SQL Server 2005 Integration Services. The paper also highlights numerous best practices.

Why Custom Components?

It is common for sales and distribution applications to compress raw transactions into compact binary files to save space. In Project REAL, as discussed in Core ETL Processing, processing one batch of data consisted of reading compressed TLog files from point-of-sale (POS) applications. To see why it was necessary to implement custom components, it is helpful to understand the requirements for loading the POS data. The POS data is stored in packed decimal files as shown in Figure 2.

Figure 2. Sample TLog packed file

This file can be unpacked by using the template with which it was packed. The template is a sequence of characters that defines the order and type of the values that were packed into the binary file. Following is a sample portion of the template file that was used for unpacking the packed decimal files in this implementation.

"00" => ["H2","H4","H5","H6","H4","H8"]

"01" => ["H2","A12","H4"]

"03" => ["H2","A5","A12"]

Where

H: A hex string (high nybble first)

A: An ASCII string

Digit: number of bytes that contain the particular encoding

The first two hexadecimal values of each row of the packed data represent the row type of the template. That is, these first two numbers define the template for the rest of the row. In the sample template there are three row types defined. These are "00", "01", and "03". For rows starting with "03", the first 2 bytes will be a hexadecimal value, followed by 5 bytes of ASCII characters, followed by 12 bytes of ASCII characters.

Using this template, the file is unpacked.
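
For illustration only, the template can be represented in managed code as a lookup from the two-character row type to its list of format specifiers. The following sketch is not the project's actual code; the class and method names are hypothetical, and it simply captures the sample template shown above (it requires System.Collections.Generic).

// Illustrative sketch: represent the sample template as a lookup from the
// two-character row-type code to its list of format specifiers.
static class TLogTemplateSketch
{
    public static Dictionary<string, string[]> Load()
    {
        Dictionary<string, string[]> rowFormats = new Dictionary<string, string[]>();
        rowFormats["00"] = new string[] { "H2", "H4", "H5", "H6", "H4", "H8" };
        rowFormats["01"] = new string[] { "H2", "A12", "H4" };
        rowFormats["03"] = new string[] { "H2", "A5", "A12" };
        return rowFormats;
    }
}

The first two hexadecimal characters read from a packed row ("03", for example) index into this table to obtain the specifiers (H2, then A5, then A12) that drive the decoding of the rest of the row.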

After the file is unpacked, the data needs to be parsed further to extract detailed information about the transactions. Parsing the file is a challenge because there are multiple row delimiters, and each row is a separate transaction with an unequal number of columns. So, the field types that are required to parse the file are defined in a specification file. Following is a sample specification file for parsing the TLog file.

[00]Filename="Header Item"DelimiterCharacter=","Outputs="%store,0-2-2,0-1-0,0-2,-------------------

[03]Filename="Sales Receipt"

DelimiterCharacter=","Outputs="%store,3-0,3-1,............................

The first two hexadecimal values of each unpacked transaction define the type of transaction information. This might be payment information such as discounts and rebates applied, or it might be user information. For example, in the sample template, transaction type "00" defines the transaction as a Header. Based on the type of transaction, the specification file contains the values and methods necessary to parse the data and extract meaningful data.

The customer used Perl scripts to process this information. The extensible SSIS object model and support for .NET languages presented us the opportunity to implement the ETL process of unpacking, parsing, and extracting as a set of custom components in the SSIS pipeline itself. This in turn helped us to improve performance and manageability. This would not have been possible had we used the existing Perl scripts to parse the TLog files outside of the SSIS pipeline.

The anticipated advantages of this approach were as follows.

  • Efficient and robust ETL because the extraction of the data from packed decimal TLog files would be included in the same SSIS pipeline instead of in a separate manual process. The TLog data that would be extracted in this process could be further processed using built-in SSIS transformation tasks and stored in any destination of choice (such as SQL Server or flat files) in the same pipeline.

  • Better manageability because the Perl parser was replaced by the custom components in the SSIS pipeline. Hence, it was not necessary to implement or execute any additional applications outside of the SSIS pipeline. Also, additional storage is not required to store the interim parsed files before loading them into the relational tables in the data warehouse.

  • Faster ETL process because additional I/O is not incurred as the data does not need to be written to or read from an intermediate format on disk between each transformation.

  • More flexibility and control over the data flow because the end user determines the specific outputs (subtables) that are to be extracted.

    The TLog files contain approximately 24 different kinds of output information such as Rebates, User Description, Discounts, and Header Information. Previously, the Perl parser extracted all of these subtables into different files (24 files for each TLog file), irrespective of whether or not this was required by the data warehouse application. The proposed design allows the user to pick and choose the required outputs before the data is transferred to the downstream destination. Hence, unnecessary data flow is avoided, leading to improved performance and manageability.

  • Ease of modification because the custom components were implemented in managed languages such as Visual C# .NET, which offers better debugging and easier modification of code than low-level languages like Perl.

    In addition, the proposed design provides the template file for unpacking the compressed files and the specifications for parsing the data as separate inputs to the custom components. Hence, any changes to the format specifications can easily be applied without modifying the custom components themselves.

Because the performance and manageability benefits of implementing custom components far outweighed the development cost, we decided to go ahead with this approach. Because this process would manage the flow of data from packed decimal files to the data warehouse, it was implemented as pipeline components in the data flow. For better manageability and more modular programming, the unpacking of the packed decimal file and the parsing of the file to extract meaningful data were implemented as separate components. The following two custom components were implemented.

  • A source adapter that would unpack the compressed file and convert it into its associated encoding.
  • A custom transformation that would parse the file based on a given specification.

Proposed Design for Loading the TLog Data into the Data Warehouse

This section discusses the design that was developed to extract and transform the point-of-sales data stored in the TLog files that originated from different stores.

The TLog data files for each store are grouped together in a directory. Directories are named according to the store number. In each directory, the TLog files are arranged in numerical sequence. An additional Start file in each directory contains the name of the last processed file. To access a TLog file, the package first loops through the collection of store directories, and then loops through the TLog files in each store directory. Each TLog file must be unpacked and parsed before it can be further transformed into a format that is suitable for loading into the SQL Server data warehouse.

Before the data in the TLog files can be processed, the files themselves must be located: 1) loop through the store directories, and 2) for each store directory, loop through its TLog files. In SSIS, these two loops are handled by separate control flow tasks.
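
Purely as an illustration of this nesting (and not of how the SSIS package itself is implemented), the logic of the two loops corresponds to the following managed code. The root path, file pattern, and class name are assumptions made for the sketch.

using System.IO;

class StoreLoopIllustration
{
    // Illustrative only: the logical nesting that the Foreach Directory and
    // Foreach File containers implement in the package's control flow.
    static void EnumerateTLogFiles(string rootPath)
    {
        // Outer loop: one directory per store.
        foreach (string storeDirectory in Directory.GetDirectories(rootPath))
        {
            // Inner loop: the TLog files for that store, in numeric sequence.
            foreach (string tlogFile in Directory.GetFiles(storeDirectory))
            {
                // In the package, this file path is assigned to a variable and
                // consumed by the custom source adapter in the Data Flow task.
            }
        }
    }
}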

Once the TLog file is accessed, it is necessary to unpack the data. The unpacked data must be further parsed based on defined specifications to extract meaningful information. This must be done before additional transformations can be applied and the data loaded into the destination. Currently, there are no built-in tasks available to unpack and parse packed decimal files. Hence, we created custom components to unpack and parse the TLog files based on the defined specifications in the SSIS pipeline. For better manageability, two separate custom components were implemented—one to read and unpack the file and another to parse the file.

To achieve the above design, the following control flow and Data Flow tasks were used.

Control flow tasks

The following control flow tasks were used to access the source TLog files.

  1. A custom For Each Directory container that looped through the different store directories containing the TLog files.

    Note   A custom For Each Directory task is included in the sample SSIS applications included in September CTP. (Filepath: <installedsamplesdirectory>\Integration Services\Programming Samples\Control Flow\ForEachDirectory Sample directory)

  2. For each extracted directory, the For Each File built-in container was used to loop over the TLog files in each directory.

  3. The file path of each TLog file was extracted and set dynamically as the source for the data flow source adapter.

  4. A Data Flow task read, unpacked, and parsed the data and loaded it into the required relational destination tables in the data warehouse.

Data Flow tasks

The following Data Flow tasks were used to extract, transform, and load data.

  1. A custom source adapter was implemented to read the packed decimal TLog file and unpack it into its equivalent encoding based on the provided template.
  2. A custom transformation task was used to transform the data using the specification provided in the .ini input file.
  3. Additional transformations were done as required.
  4. SQL Server Destination Adapter, which is the built-in destination component, was used to load the data into SQL Server.

SSIS Object Model and Pipeline Components

An understanding of the new SSIS object model is a prerequisite to a discussion of the implementation of custom components. This section discusses the SSIS object model and its various components.

In the SSIS object model, the main runtime components are tasks and containers. Data management and processing are handled by individual tasks. As shown in Figure 3, these tasks are grouped together in containers. Containers provide iterative and branching control flow in the package. The order in which the DTS Runtime Engine executes tasks and containers is defined by precedence constraints. The precedence constraints define the conditions for the execution of the tasks and containers.

Figure 3. SSIS object model

Data flow tasks are those tasks that extract data from different sources and then transform, cleanse, and modify the data and store it in the appropriate destination. You can think of a Data Flow task as an acyclic directed graph, and the source adapters, transformations, and destination adapters as the graph nodes. Each of these components is connected by directed paths, which determine the flow of data.

One of the biggest improvements that SSIS 2005 has achieved over DTS 2000 is in performance. For Data Flow tasks, in-memory data stores called buffers handle most of the data flow between components. The data flow engine creates the execution plan and defines the required buffers for each component. When the buffers are full, the data is pushed to the downstream component. Data flow is managed in memory, rather than through expensive I/O operations, and the result is faster data movement and transformation.

One of the challenges in this project was extending the object model to implement the custom components. However, the entire underlying SSIS object model has been reengineered with extensibility in mind. The new SSIS object model provides support for the core Microsoft .NET Framework. It leverages Microsoft Visual Studio® to provide a powerful integrated development environment (IDE) for both design and debugging purposes.

To extend the SSIS object model, the developer needs to consider two stages of development for an SSIS package, because the SSIS architecture handles the design time and the runtime of the package separately.

Note   In this paper, the person who implements the source code of the custom components is referred to as the developer and the person who creates the SSIS package by using these custom components in Business Intelligence Development Studio is the user or end user.

Each component in the Data Flow task is connected in a graphical format with data paths that control the flow of data as shown in Figure 4. Each upstream component extracts and processes the data and pushes it to the next downstream component. Each component consists of two important subelements: the user interface (UI) and metadata.

Metadata defines the role of the component in the layout. The metadata consists of component information such as custom properties, the input and output collection, and the columns in each input and output collection. The data is transferred between the components using in-memory stores called pipeline buffers as shown in Figure 4.

Figure 4. Data Flow task

In the design state, the user selects the data flow components and the appropriate data paths. The user is allowed to change the metadata of each component in this stage. For instance, a user might define custom properties for the different components and determine the flow of data. The user can also decide the number of inputs and outputs for each component and their associated column collection. In addition, the user can specify the flow of data from each output to the next downstream component. During the design stage, the component must be able to react to these metadata changes. For instance, the components may allow or disallow adding or removing inputs and outputs. The components should be flexible enough to accommodate these changes.

In the runtime phase, the data flow components are executed. During this phase, the data flow engine defines the execution plan for the data flow, then creates and assigns the data buffers. It assigns the worker threads and calls the different components such as source, transform, and destination in the order in which the user arranged them in the graphical format. In addition, the data flow engine handles errors and completion of the tasks.

To create a custom data flow component, you need to derive from the PipelineComponent base class. PipelineComponent implements the IDTSDesigntimeComponent90 interface, which defines the design-time behavior of the component, and the IDTSRuntimeComponent90 interface, which defines the runtime behavior. Both interfaces expose methods and events that you can override for your component. To implement and override virtual methods for the data flow components, add the following references to your project.

  • Assembly: microsoft.sqlserver.dtspipelinewrap
  • Default path: %program files%\microsoft sql server\90\sdk\assemblies\microsoft.sqlserver.dtspipelinewrap.dll
  • Assembly: microsoft.sqlserver.dtsruntimewrap
  • Default path: %program files%\microsoft sql server\90\sdk\assemblies\microsoft.sqlserver.dtsruntimewrap.dll
  • Assembly: microsoft.sqlserver.manageddts
  • Default path: %program files%\microsoft sql server\90\sdk\assemblies\microsoft.sqlserver.manageddts.dll
  • Assembly: microsoft.sqlserver.pipelinehost
  • Default path: %program files%\microsoft sql server\90\sdk\assemblies\microsoft.sqlserver.pipelinehost.dll

Before the package is executed, the pipeline component is in the design state, in which the metadata can change continuously. In this state, the user determines the layout by adding or removing components, and by setting the data flow paths and the metadata for each component. The SSIS designer in Business Intelligence Development Studio interacts with the component through the design-time IDTSDesigntimeComponent90 interface. Each node or component that the user selects contains a Microsoft.SqlServer.Dts.Pipeline.Wrapper.IDTSComponentMetaData90 interface and a reference to the design-time interface of the corresponding user component. The data paths, defined by a collection of Microsoft.SqlServer.Dts.Pipeline.Wrapper.IDTSPath90 objects, determine the movement of data from the upstream component to the downstream components.

The runtime methods are executed by the worker threads, which are also managed by the Data Flow task. The worker threads call the runtime IDTSRuntimeComponent90 interface to execute the runtime methods.

During runtime, the Data Flow task generates an execution plan for the execution of the data flow components. The execution plan contains an execution tree for each sequence of components that start from either a source or an asynchronous component and continue until either the destination or another asynchronous component is encountered. Based on the execution plan, a pipeline buffer is created for the output column collection for each sequence of components. This pipeline buffer contains the column collection of all the components until an asynchronous component or destination component is encountered. If an asynchronous component is encountered, a pipeline buffer that contains the columns of the next sequence of components is created. Hence, the buffer is reused for a sequence of components in a particular execution tree.

This provides optimization by allowing multiple components to reuse the same buffer objects, which increases efficiency by reducing the number of expensive memory copies of the buffer.

To achieve this, the engine maintains a constant buffer definition in the runtime execution plan. When the upstream component creates the pipeline buffer, it allocates space for the columns that will be created by the downstream components. Since upstream components only write to the space that is reserved for their columns, there can be more columns at the component level at runtime than are selected by the end user at design time.

The runtime engine assigns the pipeline buffers. The output rows are stored in the buffer until it is full; then the data is pushed to the next component. The buffer cells hold the value of each column. For auditing purposes, the columns are provided with unique lineage identifiers, which map to the column indexes in the buffers. During execution, the columns are located by using methods such as FindColumnByLineageID.

The initial programming involves deriving from the PipelineComponent base class, defining the namespace, and applying the DtsPipelineComponent class attribute. For the custom source adapter, the namespace and class were defined as follows.

#region Usings
using System;
using Microsoft.SqlServer.Dts.Pipeline;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using Microsoft.SqlServer.Dts.Runtime;
#endregion

namespace TLogSourceComponent
{
    #region ClassAttribute
    [
        DtsPipelineComponent
        (
            DisplayName = "TLOG Source",
            ComponentType = ComponentType.SourceAdapter,
            IconResource = "TLogSourceComponent.TLogSource.ico",
            Description = "This source component unpacks the packed decimal " +
                "fields in a POS TLOG and returns a single column as a " +
                "record/string/substring structure."
        )
    ]
    #endregion
    public class TLGSrc : PipelineComponent
    {
        // The component implementation goes here.
    }
}

Implementation of the Custom Components

The object model can be extended to create custom tasks by using any .NET Framework language. This particular solution was implemented in managed code by using C#. The project was implemented in the Visual Studio 2005 development environment. The rest of the implementation and API usage are explained in C#.

The custom components were created as separate projects for better manageability and debugging. Each project was compiled to generate the required assembly (.dll) files. These assemblies can be shared with hosting applications such as Business Intelligence Development Studio.

The remainder of this section discusses the implementation details of the custom source adapter and custom transformation.

The Custom Source Adapter

The source adapter allows the user to extract data from different data sources and can contain several data and error outputs. Because the source adapter is the first component in the Data Flow task, it does not contain any inputs; instead, it connects directly to the data source and extracts the required data.

The output can contain one or more columns of a defined data type to transfer the data downstream in the data flow pipeline. The same is true for the error output columns. However, in the implementation, the developer can explicitly deny the user the ability to add or remove columns or outputs by overriding methods such as DeleteOutputColumn() and InsertOutputColumn() and displaying error messages. If the implementation allows the removal of columns, at design time the end user can choose the required outputs and their associated columns.

For the TLog source adapter, the following behavior was required.

  • Read from the packed decimal source file.
  • Unpack the file based on the specification specified in the associated input template file.
  • Combine the unpacked data into a text data type column and send it to the next component.

As discussed earlier, there are two important phases for creating custom components: the design-time methods and the runtime methods. Following is a discussion of the virtual methods that need to be overridden for design time and runtime.

Designing a source adapter

In the design phase of a source adapter, the first step is to define and initialize the metadata of the component. The metadata of a component consists of:

  • The name, description, and custom properties of the component.
  • The input and output collection of the component, including the name, description, and properties of each output and input.
  • The column collection that is included in each input and output of the component, including the data type, size, precision, and length of each column, as well as any column-level custom properties that have been defined.

During design time, the metadata for a component is defined by using the ProvideComponentProperties() method. This is the first method that is called when the end user adds the component to a Data Flow task in the designer, and it initializes the input and output collections and the custom properties of the component. Although this method may sound like a constructor, it is not one: it is not called each time a design-time or runtime instance of the component is instantiated.

Following is the code for the ProvideComponentProperties method.

public override void ProvideComponentProperties()
{
    /// Support resetting the component.
    ComponentMetaData.RuntimeConnectionCollection.RemoveAll();
    RemoveAllInputsOutputsAndCustomProperties();

    // Add the custom properties.
    IDTSCustomProperty90 tlogFile = ComponentMetaData.CustomPropertyCollection.New();
    tlogFile.Description = "Specify the source TLOG file name including the full path.";
    tlogFile.Name = "TLogFile";

    IDTSCustomProperty90 templateFile = ComponentMetaData.CustomPropertyCollection.New();
    templateFile.Description = "Specify the template file name including the full path.";
    templateFile.Name = "TemplateFile";

    // Add the single output.
    IDTSOutput90 output = ComponentMetaData.OutputCollection.New();
    output.Name = "TLOGOutput";
    output.ExternalMetadataColumnCollection.IsUsed = true;
}

The first two calls, ComponentMetaData.RuntimeConnectionCollection.RemoveAll() and RemoveAllInputsOutputsAndCustomProperties(), reset an instance of the component by deleting any previously defined inputs, outputs, custom properties, and connections.

To connect to the external source, new connection information needs to be defined. To achieve this, custom properties were defined. The TLog source adapter contains the following two custom properties.

  • The connection string of the TLog packed decimal file.
  • The path to the template files that contain the specification that is required to unpack the packed decimal file.

A source adapter connects directly to the data source to extract the data. This is usually the first component in the Data Flow graph layout. Hence, there are no inputs associated with an upstream component.

The unpacked file is sent from the custom source adapter to the downstream component as a single row with a single column of the DT_TEXT data type. The entire unpacked output is sent as a single row because it contains different row delimiters and an unequal number of columns per row; transferring the data as a single row made it easy to process in the transformation phase. Hence, there is only one output defined. It is important that you define the DataType, Length, Precision, and CodePage properties of any output columns. All of these properties need to be set together because they are read-only and each depends on the setting of the others. They can be set by calling the SetDataTypeProperties() method.

Note   If you are not sure of the correct data type for the output, the base class provides the following three helper methods.

  • ConvertBufferDataTypeToFitManaged: Gets the DTS data type that is best used from managed code.
  • BufferTypeToDataRecordType: Returns a managed data type based on a DTS data type.
  • DataRecordTypeToBufferType: Gets the DTS data type that corresponds to a managed type.

Two other methods that needed to be overridden for the custom source adapter were Validate() and ReinitializeMetadata().

The Validate() method of the PipelineComponent class provides the validation necessary to ensure that the component has been configured correctly. It checks the input and output objects, including the data types of their columns, and it checks the custom properties to ensure that input values and connection strings have been assigned as required. This method is called every time the user edits the component in the design state, which ensures the validity of the component, and it is also called during the pre-execute phase at runtime. To see whether there are errors, check the DTSValidationStatus enumeration value that the method returns. You can then inform the end user if the input values or configuration are not correct.

For the TLog source adapter component, this method checks to see if correct connection paths are defined for the TLog packed decimal file and the template file. It also checks to make sure these files exist. Based on the type of error it finds, the Validate method can return one of the following values. The end user can then be warned or the errors can be rectified.

  • VS_NEEDSNEWMETADATA: An error exists in the component metadata and can be fixed by the component itself.
  • VS_ISBROKEN: The component contains an error, which the end user can rectify in the SSIS designer in Business Intelligence Development Studio, by editing the components.
  • VS_ISCORRUPT: Errors exist that are due to modifications to the source code or to the package XML. To rectify such errors, the developer must delete and re-create the component.
  • VS_ISVALID: The component is properly configured.

The developer can post warnings and errors back to the user by using the FireWarning() and FireError() methods.

The following code is used to validate the TLog file custom property.

#region Validate
[CLSCompliant(false)]
public override DTSValidationStatus Validate()
{
    bool Cancel;
    IDTSOutput90 output = ComponentMetaData.OutputCollection[0];
    IDTSCustomProperty90 tlogFile = ComponentMetaData.CustomPropertyCollection["TLogFile"];

    if (tlogFile.Value == null || ((String)tlogFile.Value).Length == 0)
    {
        ComponentMetaData.FireError(0, ComponentMetaData.Name,
            "TLOG file is required.", "", 0, out Cancel);
        return DTSValidationStatus.VS_ISBROKEN;
    }
    else if (!File.Exists((String)tlogFile.Value))
    {
        // TLog file name specified; verify that it exists and is accessible.
        ComponentMetaData.FireError(0, ComponentMetaData.Name,
            String.Format("TLOG file '{0}' does not exist or is not accessible.",
                (String)tlogFile.Value), "", 0, out Cancel);
        return DTSValidationStatus.VS_ISBROKEN;
    }

    // Validate the other properties ...
}
#endregion

Similar code validates the path for the template file. The code also checks to see if the metadata is correct by checking the mapping of the output column.

if (ComponentMetaData.OutputCollection[0].OutputColumnCollection.Count == 0)
{
    return DTSValidationStatus.VS_NEEDSNEWMETADATA;
}

/// Validate the output columns against the external metadata.
if (ComponentMetaData.ValidateExternalMetadata)
{
    if (!ComponentValidation.DoesExternalMetaDataMatchOutputMetaData(output))
    {
        ComponentMetaData.FireWarning(0, ComponentMetaData.Name,
            "The ExternalMetaDataColumns do not match the output columns.", "", 0);
        return DTSValidationStatus.VS_NEEDSNEWMETADATA;
    }
}

If the metadata is not correct and can be rectified by the component, then the VS_NEEDSNEWMETADATA enumeration value is returned. Remember that the Validate() method should be used only to raise errors; the state of the component should not be changed in this method. Because the designer calls this method frequently, changing the component here can result in unpredictable behavior. Instead, errors should be repaired in the ReinitializeMetaData() method. For example, if the output column properties are not correct, the ReinitializeMetaData() method gets the IDTSOutputColumn90 object, sets the appropriate data type, and maps it correctly to the metadata of the external columns. The code is as follows.

public override void ReinitializeMetaData()
        {
            IDTSOutput90 output = ComponentMetaData.OutputCollection[0];

            ///   Start clean.
            output.OutputColumnCollection.RemoveAll();
            output.ExternalMetadataColumnCollection.RemoveAll();

            IDTSOutputColumn90 outColumn = ComponentMetaData.OutputCollection[0].OutputColumnCollection.New();
            IDTSCustomProperty90 dataSourceColumnName = outColumn.CustomPropertyCollection.New();
            dataSourceColumnName.Name = "DataSourceColumnName";
            dataSourceColumnName.Value = "DecodedString";
            dataSourceColumnName.Description = "The name of the column at the data source.";

            /// Set the output column's properties.
            outColumn.Name = "DecodedString";
            outColumn.SetDataTypeProperties(DataType.DT_TEXT, 0, 0, 0, 1252);

            IDTSExternalMetadataColumn90 eColumn = output.ExternalMetadataColumnCollection.New();
            eColumn.Name = outColumn.Name;
            eColumn.DataType = outColumn.DataType;
            eColumn.Precision = outColumn.Precision;
            eColumn.Length = outColumn.Length;
            eColumn.Scale = outColumn.Scale;

            outColumn.ExternalMetadataColumnID = eColumn.ID; 
        }

Runtime

The data flow engine moves the data between components by using in-memory buffers. The data flow engine creates and manages the pipeline buffers that are used for moving data. At execution time, the input and output collections are assigned and the associated buffers are created. For the custom source adapter, because there are no inputs, only the outputs are added. The output is assigned by the Data Flow task, which allocates the Microsoft.SqlServer.Dts.Pipeline.PipelineBuffer objects that contain all of the required columns in the data flow.

PreExecute, PrimeOutput, and ProcessInput are the runtime methods that are used to manage the outputs and inputs. PreExecute is called before PrimeOutput and ProcessInput, and it is your first chance to access the input and output collections. Each input and output collection contains a column collection, and the columns in the collection are identified by their unique lineage identifiers (lineage IDs). The pipeline buffers store the column data in buffer columns, which are identified by their column index. Because the order of the columns in the buffer is not necessarily the same as the order of the columns in the input or output collection, you must map the lineage ID of each column in the column collection to the column index of the associated buffer. To do this, pass the lineage ID and the buffer to the Microsoft.SqlServer.Dts.Pipeline.Wrapper.IDTSBufferManager90.FindColumnByLineageID(System.Int32,System.Int32) method of the Microsoft.SqlServer.Dts.Pipeline.PipelineComponent.BufferManager property, which returns the column index.
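
As an example of this mapping, the following sketch caches the buffer index of each output column in PreExecute. This is the general pattern rather than the project's exact code, and the member array name is hypothetical.

private int[] outputBufferColumnIndex;   // hypothetical member that caches the mapping

public override void PreExecute()
{
    IDTSOutput90 output = ComponentMetaData.OutputCollection[0];
    outputBufferColumnIndex = new int[output.OutputColumnCollection.Count];

    for (int i = 0; i < output.OutputColumnCollection.Count; i++)
    {
        IDTSOutputColumn90 column = output.OutputColumnCollection[i];

        // Map the column's lineage ID to its index in the pipeline buffer.
        outputBufferColumnIndex[i] =
            BufferManager.FindColumnByLineageID(output.Buffer, column.LineageID);
    }
}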

As there is no input associated with this component, the ProcessInput() method is generally not used when creating source adapters.

The next important method for the custom source adapter is the PrimeOutput() method. This method calls external methods to process and manage the outputs, and it adds rows to the output buffers to send data to the downstream component. In this component, the method initializes an object of the TLOGReader class to call the methods that unpack the file. The TLog file and the template file that are defined through the custom properties of the source adapter are passed as input parameters to the methods of the TLOGReader class, which unpacks the packed decimal file. The returned output is inserted into the output buffer. As the buffer becomes full, the data is pushed to the downstream component and processed further.

public override void PrimeOutput(int outputs, int[] outputIDs, PipelineBuffer[] buffers)
{
    IDTSOutput90 output = ComponentMetaData.OutputCollection[0];
    PipelineBuffer buffer = buffers[0];

    try
    {
        buffer.AddRow();
        TLOGReader reader = new TLOGReader();
        IDTSCustomProperty90 tlogFile = ComponentMetaData.CustomPropertyCollection["TLogFile"];
        IDTSCustomProperty90 templateFile = ComponentMetaData.CustomPropertyCollection["TemplateFile"];

        // Unpack the TLog file by using the template, then write the decoded
        // string into the single output column.
        reader.ProcessTLogFile((String)tlogFile.Value, (String)templateFile.Value);
        buffer[0] = reader.ReaderOutput.ToString();
    }
    catch
    {
    }
    finally
    {
        /// Notify the data flow that we are finished adding rows to the output.
        buffer.SetEndOfRowset();
    }
}

The TLOGReader class contains the methods that unpack the packed decimal file. The data in the packed decimal file is extracted as a stream of binary data. The Perl language contains functions called Pack() and Unpack(). Pack() takes an array or list of values and packs it into a binary structure, returning a string that contains the structure. Unpack() does the reverse: it takes the binary structure and unpacks it by using the template that was used to pack the data. The template is a sequence of characters that gives the order and type of the values that were packed into the binary structure. Unfortunately, the .NET Framework does not contain equivalent functions for unpacking binary files by using a template. Hence, a custom method called Unpack was implemented. This method accepts the binary data and the template specification file as input and, based on the encoding that is provided in the template file, converts the binary data into the appropriate values.

The template file that was provided as input to the TLog source adapter for unpacking the packed decimal file is as follows.

"00" => ["H2","H4","H5","H6","H4","H8"]

"01" => ["H2","A12","H4"]

"03" => ["H2","A5","A12"]

Where

H: A hex string (high nybble first).

A: An ASCII string.

Digit: Number of bytes that contain the particular encoding.

As discussed earlier, the first two hexadecimal numbers of each row represent the row type and define the rest of the specification to parse the file. Based on this specification, the file is unpacked. For example, if the template file specifies that the first two bytes are hexadecimal (high nybble first), the unpack function unpacks the file in the following manner.

switch (FormatSpecifier[0])
{
    case 'H': // Hex (high nybble first)
        String CurrentHex;
        int HexStrLen = 0;
        foreach (Byte CurrentByte in Bytes)
        {
            if (ExpectedCharCount != -1 && HexStrLen >= ExpectedCharCount)
                break;
            CurrentHex = String.Format("{0:X2}", CurrentByte).ToLower();
            Result.Append(CurrentHex);
            HexStrLen += 2;
        }
        break;

Similarly, ASCII characters are also unpacked by using the proper encoding. The result set is appended as a DT_TEXT data type. The complete file is unpacked in this manner and the result set is sent to the output pipeline buffer.
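
For comparison, the ASCII branch of the same switch statement might look like the following. This is a sketch that assumes the same local variables (Bytes, ExpectedCharCount, Result) as the hexadecimal case shown above; it is not the exact production code.

    case 'A': // ASCII string
        int AsciiLen = 0;
        foreach (Byte CurrentByte in Bytes)
        {
            if (ExpectedCharCount != -1 && AsciiLen >= ExpectedCharCount)
                break;
            // Append the byte as its ASCII character equivalent.
            Result.Append((char)CurrentByte);
            AsciiLen++;
        }
        break;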

A sample output of the unpacked file looks like the following:

00:0006:0006:0408021033:00:32:101254:f42587:3297::f9:38::f433:","","99:ZIPCODE:11999","90:DESC:XYZ-24C BW","01:288206:1999:f130:000000:f128::00","90:DESC:XYZ ITEM,"01:112811:f790:f130:000011:f128::00", "20:0023:0011:0408020945:10:101112:I:"

"20:0019:0011:0405020945:10:1012:I:"

The Custom TLog Transformation

The custom source adapter reads the data from the packed decimal file and unpacks it. The unpacked data is then sent to the custom transformation component, where it is parsed. The custom transformation receives a single row with a single column as its input and, based on the specification file, parses the data into a format that is meaningful and can be understood. The specification file contains the template that is used to parse the hexadecimal and ASCII characters into a recognizable format, which can be processed or cleansed further in the data flow pipeline. This specification file is loaded as an .ini configuration file.

The unpacked input to the TLog transformation component contains rows with the following characteristics.

  • The rows are delimited by either a comma or an EndOfLine character.
  • The columns are delimited by colons.
  • The first two hexadecimal digits in each row of the unpacked data define the type of transaction and hence the format of the rest of the row.

Based on the specification, the data is read and the transactions are separated into several outputs. Example outputs might be header information, rebates information, sales information, or user information.
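
A simplified sketch of this separation step is shown below. It is illustrative only and uses hypothetical names; the actual parser also applies the field-level rules from the specification file before the rows are written to the outputs. (It requires System.Collections.Generic.)

// Illustrative only: split the single unpacked string into transactions and
// group them by their two-character transaction type code.
private static Dictionary<string, List<string[]>> GroupTransactions(string unpackedData)
{
    Dictionary<string, List<string[]>> rowsByType = new Dictionary<string, List<string[]>>();

    // Rows are delimited by commas; columns within a row are delimited by colons.
    foreach (string row in unpackedData.Split(','))
    {
        string[] columns = row.Split(':');
        if (columns.Length == 0 || columns[0].Length < 2)
            continue;

        // The first two characters classify the transaction ("00" = header, "03" = sales receipt, ...).
        string transactionType = columns[0].Substring(0, 2);
        if (!rowsByType.ContainsKey(transactionType))
            rowsByType[transactionType] = new List<string[]>();
        rowsByType[transactionType].Add(columns);
    }
    return rowsByType;
}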

As discussed previously, the POS application can keep a file open for several days, or several files can be opened in a single process. The close of a file is indicated by a close transaction.

Designing an asynchronous transformation component

The transformation component can be either synchronous or asynchronous. A transformation component is asynchronous if it cannot output rows until it has received all of its input rows, or if it does not produce exactly one output row for each row received as input. In this project, the custom source adapter sends the data as a single row with a single column. This data needs to be parsed into multiple outputs based on the specification. The entire input is received from the custom TLog source adapter and processed to generate multiple rows, which are redirected to separate outputs. Hence, the outputs of the TLog transformation component are asynchronous.

The input, output, and custom properties are defined in the ProvideComponentProperties method. By default, the base class creates an input and an output and sets the output to be synchronous with the input. Because the output of this component is asynchronous, the SynchronousInputID property of the output is set to 0, as shown in the following code.

  public override void ProvideComponentProperties()
        {
            /// Support resettability.
            /// The base class calls RemoveAllInputsOutputsAndCustomProperties to reset the
            /// component. Used here to highlight the functionality of the base class.
            base.RemoveAllInputsOutputsAndCustomProperties();

            //   Let the base component add the input and output.
            base.ProvideComponentProperties();

            //   Name the input and output, and make the output asynchronous.
            ComponentMetaData.InputCollection[0].Name = "TLOG Input";

            IDTSOutput90 headerOutput = ComponentMetaData.OutputCollection[0];
            headerOutput.Name = "HeaderOutput";
            headerOutput.Description = "Header rows are directed to this output.";
            headerOutput.SynchronousInputID = 0;


            ///   Add the other outputs.
            IDTSOutput90 closeOutput = ComponentMetaData.OutputCollection.New();
            closeOutput.Name = "CloseOutput";
            closeOutput.Description = "Close rows are directed to this output.";

            // ... the remaining outputs are defined in the same way ...

            /// Add the custom properties to the component.
            /// Config File is the .ini template file. The TLog parser will parse
            /// the files based on this template.
            IDTSCustomProperty90 INIproperty = ComponentMetaData.CustomPropertyCollection.New();
            INIproperty.Name = "Config File";
            INIproperty.Description = "Ini file to process the tlog information";

            /// Each TLog file is associated with a store.
            IDTSCustomProperty90 StoreName = ComponentMetaData.CustomPropertyCollection.New();
            StoreName.Name = "Store Name";
            StoreName.Description = "Name of the store to be processed";
           
        }

For each output, the processed rows contain different numbers of columns. Because of this, for ease of implementation, each row is pushed with all of its columns appended into a single string, so there is a single column per output. The output data can be further transformed by using other built-in components, such as the Derived Column component, before the data is loaded into the data warehouse relational tables.

Because there is only one column per output, the user is not allowed to add or remove columns from the outputs at design time. However, users are allowed to process any number of outputs that they want, so they can delete any outputs that they do not require. To prevent users from adding or removing columns, methods such as InsertOutputColumn() and DeleteOutputColumn() were overridden to throw appropriate error messages.

public override void DeleteOutputColumn(int outputID, int outputColumnID)
        {
            throw new Exception("Columns cannot be deleted. Remove the output instead");
        }
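
The corresponding override that blocks adding columns follows the same pattern. The following is a sketch of how it can be written rather than the exact production code.

public override IDTSOutputColumn90 InsertOutputColumn(int outputID, int outputColumnIndex, string name, string description)
{
    // Each output exposes exactly one predefined column, so adding columns is not supported.
    throw new Exception("Columns cannot be added. Each output has a single predefined column.");
}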

At design time, the output columns are defined based on the input column properties. The PipelineComponent.SetUsageType() method is used to create a mapping between the input and output columns, as well as to define the data type and other properties of the output column.

public override IDTSInputColumn90 SetUsageType(int inputID, IDTSVirtualInput90 virtualInput, int lineageID, DTSUsageType usageType)
        {
            IDTSVirtualInputColumn90 vCol = virtualInput.VirtualInputColumnCollection.GetVirtualInputColumnByLineageID(lineageID);
            IDTSInputColumn90 col = null;

            /// Only support for text columns.
            if ((vCol.DataType != DataType.DT_TEXT))
               throw new Exception("Only Text datatypes are supported.");

            ///   If the usageType is UT_IGNORED, then the column is being removed.
            /// So throw the appropriate error
            if (usageType == DTSUsageType.UT_IGNORED)
            {
                throw new Exception("columns cannot be removed");
            }
            else if (usageType == DTSUsageType.UT_READWRITE)
            {
                throw new Exception("Read write not supported.");
            }
            else
            {
                ///   Let the base class add the input column.
                col = base.SetUsageType(inputID, virtualInput, lineageID, usageType);

                /// Add the corresponding output column to each of the outputs.
                AddOutputColumn(ComponentMetaData.OutputCollection[0].ID, col);
                AddOutputColumn(ComponentMetaData.OutputCollection["CloseOutput"].ID, col);
                // ... additional output columns are added to the remaining outputs ...
            }

            return col;
        }

Runtime methods for an asynchronous component

The runtime methods are called when the package is executed. The first time that the output and input collection can be accessed during execution is in the PreExecute() method. The difference between the custom source adapter and the custom transformation component is in the input rows from the upstream component. There are no inputs for the source adapter. It directly connects to the external source. The custom transformation component receives inputs from the upstream component, processes the data, and sends it on to the downstream component(s). If the output is synchronous, it follows the simple 'row in and row out' rule. For asynchronous outputs, because the input collection is different from the output collection, different pipeline buffers are created for the inputs and outputs.

Note   If the transformation component is synchronous, each input row is sent on as an output row after the data is transformed. Because there is an output row for every input row, the input buffer can be reused to store the output: a single buffer, containing the columns for both the input and the output, is shared between them.

For an asynchronous component, the input rows (the output rows from the upstream component) are received by the input buffer and handed to the ProcessInput() method whenever the upstream component's output buffer fills. The difference between the ProcessInput() and PrimeOutput() methods is that ProcessInput() is called repeatedly, once for each buffer that is passed to the component, while PrimeOutput() is called only once. The data is pushed to the downstream component whenever the pipeline buffer is full. The EndOfRowset property is set to true after the last row; this indicates to the downstream component that the input rows are finished. Similarly, the input rows are processed until EndOfRowset is reached. For the TLog transformation, there is only one input, which comes from the upstream source adapter.

Because the EndOfRowset property defines the end of the input rows, the developer needs to set this property on the output rows when all the rows have been processed. This is achieved by calling the SetEndOfRowset() method. The following code shows the implementation of the ProcessInput() method.

public override void ProcessInput(int inputID, PipelineBuffer buffer)
{
    try
    {
        // Read all the rows of the input before processing.
        if (!buffer.EndOfRowset)
        {
            IDTSInput90 input = ComponentMetaData.InputCollection.GetObjectByID(inputID);

            while (buffer.NextRow())
                TinputBuffer = buffer.GetString(0);
        }

        // When the end of the rowset is reached, parse the data and
        // close out the output buffers.
        if (buffer.EndOfRowset)
        {
            // Extract the custom property for the config file.
            IDTSCustomProperty90 INIFile = ComponentMetaData.CustomPropertyCollection["Config File"];

            // Extract the custom property of the store name.
            // Call the TLog parser.

            PipelineBuffer[] buffers = new PipelineBuffer[_Buffers.Count];
            IDictionaryEnumerator myEnumerator = _Buffers.GetEnumerator();

            // Set the end of rowset for all the output buffers.
            while (myEnumerator.MoveNext())
            {
                PipelineBuffer NextBuffer = (PipelineBuffer)myEnumerator.Value;
                if (NextBuffer != null)
                    NextBuffer.SetEndOfRowset();
            }
        }
    }
    catch
    {
        throw new Exception("Error in processing the file");
    }
}

The TLog parser accepts the unpacked string from the source adapter as its input, and takes the .ini configuration file as an additional input through the Config File custom property. It processes the rows based on the specifications provided in the .ini configuration file.

Following is a sample specification in the .ini file.

[00]Filename="Header Item"DelimiterCharacter=","Outputs="%store,0-2-2,0-1-0,0-2,----------------------------

[03]Filename="Sales Receipt"

DelimiterCharacter=","Outputs="%store,3-0,3-1,.............................

The second parameter in the .ini file (DelimiterCharacter in the sample) indicates that the row delimiter is a comma. Based on this delimiter, several rows are extracted from the single row that is received from the upstream source adapter. Each row is further divided based on the end-of-line character, and each column within a row is delimited by a colon. The hexadecimal digits in the first column of each row classify the type of transaction, such as header or sales. Based on the transaction type, additional specifications such as subtypes or field values are extracted and used to parse the row. These specifications, including the type and field information of the data, are given in the Outputs option for each transaction type.
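
To make the parsing logic concrete, the following is a simplified sketch of the dispatch described above. The variable names and the per-transaction handling are hypothetical, and the sketch ignores the end-of-line subdivision; the actual TLog parser is driven entirely by the .ini specification.

// Illustrative sketch only (not the Project REAL implementation).
string[] rows = unpackedData.Split(',');    // unpackedData: the string received from the source adapter
foreach (string row in rows)
{
    string[] columns = row.Split(':');      // columns are delimited by a colon
    switch (columns[0])                     // the first column classifies the transaction
    {
        case "00":                          // header item
            // parse according to the [00] specification and route to the Header output
            break;
        case "03":                          // sales receipt
            // parse according to the [03] specification and route to the Sales output
            break;
        default:
            // other transaction types as defined in the .ini file
            break;
    }
}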

Once the rows are parsed by the TLog custom transformation, the PrimeOutput() method is called to assign an output buffer to each of the selected outputs. For the TLog Parser transformation, the output is pushed to the downstream component as a single column. As discussed earlier in Designing an asynchronous transformation component, this design was adopted because each row contains an unequal number of output columns.

#region PrimeOutput
/// <summary>
/// The data flow task provides a buffer for each output
/// that is connected to a downstream component.
/// </summary>
public override void PrimeOutput(int outputs, int[] outputIDs, PipelineBuffer[] buffers)
{
    _Buffers = new Hashtable();
    for (int x = 0; x < outputIDs.Length; x++)
    {
        IDTSOutput90 output = ComponentMetaData.OutputCollection.GetObjectByID(outputIDs[x]);
        switch (output.Name)
        {
            case "HeaderOutput":
                HeaderBuffer = buffers[x];
                _Buffers.Add("HeaderOutput", (PipelineBuffer)this.HeaderBuffer);
                break;
            // ... TODO: add a case for each of the other outputs selected by the user
            // at design time and assign the corresponding buffer ...
        } // end of switch
    } // end of loop
} // end of PrimeOutput
#endregion

The unpacked and transformed data looks like the following:

1,00,0004,0402030853,03,6,0,0,0,0,0,0,2628,0,0,1,0,0,0

1,1,01,0004,0005,0402030910,222222,1500,000000,128,0,0,0,0
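
For context, after a row has been parsed it is written to the pipeline buffer that was assigned to its output in PrimeOutput(). The following is a minimal, hypothetical sketch of that step for the Header output; parsedHeaderRow is an assumed variable holding the parsed row, and the single output column is assumed to be at index 0.

// Hypothetical sketch: push one parsed row to the Header output buffer.
if (HeaderBuffer != null)
{
    HeaderBuffer.AddRow();
    HeaderBuffer.SetString(0, parsedHeaderRow);   // e.g., "1,00,0004,0402030853,..."
}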

Deployment

Once the coding is complete and the code has been verified for correctness, it is time to deploy the components in SQL Server Integration Services. Each component is compiled into a separate DLL. It is important to ensure that developers have the security permissions required to access the reference libraries, to copy the compiled assemblies into the global assembly cache and the pipeline components directory, and to create packages in SSIS. Because each assembly needs to be saved in the global assembly cache, it must be assigned a strong name before deployment.

Following are the requirements for deploying a pipeline component.

  1. Generate a strong name for the assembly.
  2. Compile the task.
  3. Move the task to the appropriate directories.
  4. Add the task to the SSIS Toolbox.
  5. Create the package using the custom components.

This section describes each of these requirements.

Generating a Strong Name for the Assembly

As this assembly needs to be shared among applications, apart from the component directory, it must also be copied into the global assembly cache. Assemblies can be copied into the global assembly cache only if they have a strong name. This must be a fully qualified name that includes the assembly's name, culture, public key, and version number. This guarantees that the assembly is globally unique. You can use the Strong name tool (sn.exe) to generate a strong name.

The Strong name tool generates a public/private key pair. This key pair is used during compilation to create a strong-named assembly. The tool is run from the command prompt, specifying the name of the output file that will contain the key pair.
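
For example, using the key file name that appears in the AssemblyInfo.cs sample below, the command would be:

sn.exe -k TLogprocess.snk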

You can specify the location of the key pair file in two ways: by editing the AssemblyInfo.cs file or by using the Assembly Linker tool (Al.exe). The following line from the AssemblyInfo.cs file specifies the path of the file that contains the key pair.

[assembly:AssemblyKeyFileAttribute(@"..\..\TLogprocess.snk")]

In addition, the version number must not change when the project is recompiled. To achieve that, specify a constant version number in the AssemblyInfo.cs file. The following attribute keeps the version number constant.

[assembly:AssemblyVersion("1.0.0.0")]

Compiling the Task

The projects can be compiled either in Visual Studio 2005 or from the command prompt.

To compile the TLog transformation task in Visual Studio 2005

  1. On the File menu, select Open, and then select Project.

  2. Open the TLogProcess.csproj project.

    Note: The TLog transformation task project is named TLogProcess.

  3. Press F5, or click Start on the Debug menu to compile and run the project.

To compile the file using the command prompt

  1. Use the Change Directory command to change to the directory containing the custom component projects.

  2. Type the following command.

    for /r %f in (*.sln) do msbuild.exe "%f" 
    

Moving the Task to the Appropriate Directories

During installation of SSIS, separate directories are created to hold the assemblies of tasks and pipeline components. For pipeline components, the default directory is %SystemDrive%\Program Files\Microsoft SQL Server\90\DTS\PipelineComponents. After the tasks are compiled, copy the assembly files (.dll) into this directory.
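
For example, assuming the transformation assembly is named TLogProcess.dll (your output file name may differ), the copy can be made from the command prompt:

copy TLogProcess.dll "%SystemDrive%\Program Files\Microsoft SQL Server\90\DTS\PipelineComponents"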

As these assemblies must be shared, it is also necessary to save a copy of the assembly files in the global assembly cache. The assemblies have already been signed with a strong name. To copy a file into the global assembly cache, you can use the command-line tool gacutil.exe or drag the assembly into the %windir%\assembly directory.

Use the following command to copy the file with gacutil.exe.

gacutil.exe -iF "<assembly path>" 

Adding the Task to the SSIS Toolbox

Once the assemblies have been compiled and copied to the appropriate directories, the last step is to add them to the SSIS Toolbox so that users can access the custom components. Placing the components in the directories does not automatically make them visible in the Toolbox. SQL Server Integration Services (formerly the DTS Server) caches information about tasks and components because enumerating them is expensive, so the service must be restarted to pick up the new list of components. Once the service has been restarted, the components can be added to the Toolbox.
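
For example, assuming the default service name for SQL Server 2005 Integration Services (MsDtsServer), the service can be restarted from a command prompt:

net stop MsDtsServer
net start MsDtsServer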

To access the custom components

  1. On the main menu, click Tools and then select Choose Toolbox Items.

  2. In the Choose Toolbox Items dialog box, select the SSIS Data Flow Items tab.

  3. Select the check boxes for the data flow components that you want to appear in the SSIS Designer in Business Intelligence Development Studio.

    As shown in Figure 5, selecting the TLogSource (source adapter) and TLogProcess (transformation) components allows the end user to view these components in the SSIS Designer.

    Figure 5. Selecting the custom components in Toolbox for display

Now users can see the components in the Toolbox when they open the SSIS Designer, provided they have the appropriate permissions to work with the SSIS Designer and to access tasks on the machine. They can drag and use these components just like any other built-in component. Once the custom properties are assigned, the components are ready to execute.

Creating the Package Using the Custom Components

This section covers how to create a Data Flow task by using the custom components. In this example, we create a sample package by using the custom source adapter to unpack a packed decimal file, parse it with the custom TLog transformation component, and write it to flat files.

To add the custom components in a package

  1. To start Business Intelligence Development Studio, open the Programs menu and select Microsoft SQL Server 2005 CTP. Select Business Intelligence Development Studio.

  2. On the File menu, select New, and then select Project. For the project type, select Business Intelligence Projects.

  3. In the New Project Templates dialog box, select Integration Services Project.

  4. Provide the Name, Location, and Solution Name of the project. The Package Design Editor will open.

  5. Drag the Data Flow task from the Toolbox and drop it onto the SSIS designer surface in Business Intelligence Development Studio.

  6. Open the Data Flow Task Editor. From the Toolbox in the Data Flow Task Editor, select and add the TLog Source Adapter to the window.

  7. Right-click the TLog Source Adapter component and select Show Advanced Editor, as shown in Figure 6.

    Figure 6. Adding the TLog source component

  8. In the Advanced Editor for TLog Source window, assign the custom properties. Set the connection path of the packed decimal file and the path of the template file as shown in Figure 7.

    Figure 7. Setting custom properties for the TLog source adapter

  9. Ensure that the settings on the Column Mappings tab and the Input and Output Properties tab of the TLog Source Adapter are correct. Click the Column Mappings tab of the Source Adapter and select the input columns to include. (See Figure 8.)

    Figure 8. Mapping the input columns to the output collection

  10. Next, select the ProcessTLog task (ProcessTLog is the name given to the TLog transformation component). Connect the output of the TLog Source Adapter to the ProcessTLog component.

  11. Open the advanced property editor for the ProcessTLog component. Under Custom Properties, set the Config File property to the path of the .ini configuration file and set the Store Name property, as shown in Figure 9.

    Figure 9. Properties for the TLog transformation task

  12. Select the Input and Output Properties tab. This tab contains all the outputs. The custom transformation component has been implemented in such a way as to prevent users from removing output columns. If the user tries to remove an output column, the error message in Figure 10 is displayed. Note, however, that although you are not allowed to remove columns from the defined outputs, you are allowed to remove a complete output.

    Figure 10. Error message thrown when the user attempts to remove a column

    As shown in Figure 11, out of 24 outputs, only the Header and Close outputs are selected to send data downstream. Also shown in Figure 11, the property window of the output column displays the data type of the column and its other properties.

    Figure 11. Properties of the output column for the transformation component

  13. To extract the data from the packed decimal file and save it to destination flat files, the Flat File destination adapter is used. When the output of the custom TLog transformation is connected to the Flat File destination, the user can select which output to connect, as shown in Figure 12.

    Figure 12. Selecting the output to map to the destination

  14. To store the Header information in a file, assign the file connections and the mapping for the Flat File Destination.

  15. Run the package. The output is stored in the Header flat file, as shown in Figure 13.

    Figure 13. Successful completion of the data flow for TLog packed decimal files

Debugging

It is imperative that your code is tested thoroughly before it is put into production. The developer should initially test the code for syntactic correctness and general logic. Compiling the project ensures that the code is syntactically correct. For semantic validation, the developer needs to verify the correctness of the code by attaching to the process that is hosting the task and setting appropriate breakpoints. The implemented components are hosted by applications such as Business Intelligence Development Studio and can be debugged by associating the source code with the process of the environment that is hosting the task. Breakpoints can be set at different locations in the Visual Studio development environment, and output data can be checked to ensure that the implemented logic works as expected.

One way to debug the runtime methods of implemented custom components is to use the DTEXEC command utility to run the package.

To use DTEXEC to run the package

  1. Create an SSIS package using the implemented custom components.
  2. Open Visual Studio and open the project for the custom component.
  3. Select Project, and then select <projectname> Properties.
  4. Set the project to Debug mode.
  5. Select the Debug property page.
  6. Set the following properties.

Mode: Debug

Start external program: <path to the exe>\DTEXEC.exe

Command line arguments: /f "<pathtoyourSSISpackage>"

Figure 14. Using DTEXEC to run package

DTEXEC is the command-line utility for running SSIS packages. Set breakpoints and watch points on the appropriate methods in the implemented code. When the code is executed using F5 or Start in Debug mode, DTEXEC launches the associated SSIS package. You can then inspect the output values to verify the validity and correctness of the code.
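
For reference, the command that the debugger launches is equivalent to running the package directly from a command prompt, as shown below (the package path is hypothetical):

DTEXEC.exe /f "C:\SSISPackages\TLogPackage.dtsx"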

You can also debug the runtime methods of a running package by attaching the source code to the DtsDebugHost.exe process. For this approach, a pause needs to be introduced during execution of the package, which can be done by setting an OnPreExecute breakpoint on the task. Once the package execution is paused, you can attach to the DtsDebugHost.exe process and debug the source code by inserting breakpoints at the appropriate places.

You can also set debugging properties in Business Intelligence Development Studio. For example, you can place data viewers between two data flow components to monitor the data flowing to the downstream component and the number of rows and columns that are transferred. In this case, the TLog Process transformation task parses and sends around 338 rows to the Header file. As shown in Figure 15, you can also see the contents of the rows to validate the data. In addition, the data viewer indicates the number of rows transferred to the next component, which makes it a useful debugging tool.

Figure 15. Output displayed in the data viewer

When the custom components have been coded, tested, and deployed properly, they can be used as any other built-in data flow component.

Lessons Learned

For Project REAL, to improve the efficiency and performance of the ETL package, we decided to include the parsing phase of the packed decimal files in the SSIS pipeline itself. To achieve this, it was necessary to implement custom components.

For designing the custom components, the initial challenge was to understand the new SSIS object model and programmability features. Understanding which interfaces and methods to override is the first key to developing a custom component. One striking thing that we learned while implementing these custom components was how easily you can extend the current model to create your own custom tasks and components. In addition to understanding and extending the SSIS object model, for the successful implementation of these custom components, it was necessary to decompress and parse the packed decimal TLog files by using complex specifications. We had to implement custom methods to unpack and parse the data into the required format.

Following is a summary of some of the lessons we learned during the process.

  • To ensure proper execution, the component must be configured correctly. The developer should make sure that the proper input and output collections are set up, and any additional settings that the component requires should be exposed as custom properties. These properties should be defined clearly in the ProvideComponentProperties() method and validated in the Validate() method (a minimal sketch appears after this list).
  • The external data source for the source adapter needs to be defined by using either the ConnectionManagers class or a custom property.
  • Implementing a custom component involves designing for both the design-time and the runtime methods. During design time, the metadata may change frequently. Hence, flexibility is an important criterion to include in the design when you implement design-time methods. Adding or removing input or output columns or changing the metadata should not break the component. Proper validation methods and error messages must be incorporated in the design to ensure that the input and external sources are validated and that users do not remove the required outputs when they create the package.
  • One of the critical functions is the Validate() method, as it ensures that the component is configured correctly for execution. One important thing to remember is that the Validate() method should only be used to check the configuration and to raise error events; errors should not be fixed in this method. The rule to remember is that the state of the component must not be changed in the Validate() method. This rule is necessary because the designer assumes that this method only raises events on error and calls it at will; if the component changes state during validation, the behavior of the application can be unpredictable.
  • Pipeline buffers are a very important factor to consider when designing runtime methods. The first step is to determine whether your component is synchronous or asynchronous. If the component is synchronous, then the same input buffer is used as output. If the component is asynchronous, then the PrimeOutput() method should specify the pipeline buffers that are associated with the outputs. Hence, an asynchronous component is the only data flow component to receive the input buffer from the ProcessInput() method and the output buffers from the PrimeOutput() method.
  • The main challenge that we faced in this project was the fact that the TLog transformation component has 24 outputs (if the user chooses to use all 24 outputs). To ensure proper data flow, 24 pipeline buffers had to be initialized and assigned to the appropriate outputs. In addition, in order to extract the correct data, it is imperative that the columns of the required outputs are matched correctly with the associated column index in the buffers.
  • SQL Server Integration Services (formerly the DTS Server) caches information about all the tasks and components in the Toolbox to avoid costly enumeration. For deployment, if the SQL Server Integration Services service is running, it must be restarted to pick up newly deployed custom components.
  • For deployment, make sure that the developer of the custom component has the necessary privileges to copy the assembly into the global assembly cache and into the pipeline components directory. The custom component needs to be copied to both directories.
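
As a minimal illustration of the first bullet, exposing a custom property such as the transformation's Config File property might look like the following sketch (assumed to run in ProvideComponentProperties(); the full Project REAL implementation also sets up the inputs and outputs):

public override void ProvideComponentProperties()
{
    // Start from a clean component definition (sketch only).
    base.RemoveAllInputsOutputsAndCustomProperties();

    // Expose the path of the .ini specification file as a custom property.
    IDTSCustomProperty90 configFile = ComponentMetaData.CustomPropertyCollection.New();
    configFile.Name = "Config File";
    configFile.Description = "Path to the .ini specification file used by the TLog parser";
}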

Conclusion

The new SSIS object model has been re-engineered with extensibility in mind. Because the object model can be extended using managed languages, you can easily create your own custom tasks and components, or customize any object in the SSIS model.

This paper illustrates one application of the extensibility mechanisms: loading data from an unusual format directly into a production data warehouse. The new mechanism eliminates the intermediate storage and processing steps involved in handling the data, which results in higher performance and a more maintainable system. These mechanisms are easy to learn, and you can optimize data flow performance by making optimal use of the pipeline buffers. The result is a more manageable, flexible, and robust ETL process for loading data into the data warehouse.

For More Information

To learn more about SQL Server 2005, including Integration Services, visit the Microsoft SQL Server 2005 Developer Center.