Because other ETL tools view the migration of data from their sources to various levels of storage, they view the ETL process as a series of data paths. This colors the way the tools implement their design process. Most of the code and toolset they provide are slanted toward data manipulation only. They are also often geared toward complex data warehousing environments, so they have a steep learning curve.
Because Microsoft views the data as a single path, the toolset focuses not only on the data manipulation, but the extract process, the operating system, Web services, and more. Although it has many powerful BI features, it contains many wizards that simplify tasks such as copying a database or importing a text file to a table.
SSIS has two primary areas that you will work with: the control flow and the data flow. I explain these concepts as we explore the components of the system and then I demonstrate them fully in the section describing the Business Intelligence Development Studio, one of the tools you can use to work with Integration Services.
Integration Services Components
The components within SSIS include the outer objects such as packages, which hold other objects, and two primary objects to process and work with the data. The first is the control flow, which contains tasks you want to perform before, in between, or after the data is moved. The second is the data flow, which has connections, sources, transforms, and destinations to work with the data.
Within the design environment, you will see two sequential tabs with those objects as the title. You will actually create the control flow first, and it will contain at least one data flow object, even though you work with them on what appear to be equal tabs. We come back to that in a little while, but it is important to remember the distinction that a control flow contains data flow objects.
The first object we will explore is the package. This is the collection of the control flow and data flow objects in a single unit. The problem with this term is that it was also used in the previous version of SQL Server to refer to the data movement tool there, called Data Transformation Services (DTS). To confuse matters even further, when you design a SSIS system, the largest container is called a solution, which is made up of one or more projects. Projects contain packages.
You can store a package in a SQL Server database, the SSIS Package Store, or even as a file.
To help separate these terms, keep in mind that solutions and projects are part of the design effort; an SSIS package runs on a server and is separate from a DTS package. I will make sure I am clear when I use these terms as I explain the SSIS objects. For now, think of the package as the outermost container of the rest of the objects.
The control flow view of the process involves the tasks performed before the data is imported or exported and the tasks performed after the transfer occurs. Although you can work with data in this process, it is normally done in another area.
Available tasks in this area include downloading a file from a foreign system using File Transfer Protocol (FTP), sending a notification that a transfer process is starting, or making a database call to trigger a process on another server. You can also create tasks that link into an Analysis Services system to prepare a transfer there. As we work through the various scenarios, I explain more about these components.
There are three types of objects within the control flow. The first are the containers, which hold units of work. Containers deal with loops and sequences. If you want to iterate through a series of tasks, you can create either a Foreach loop, which works through a defined number of steps, or a For loop, which works through a series of steps until some condition is true or false. Both of these are types of containers.
Another container is the sequence. This container provides order to the processes it contains and can even act like an entire package within a package.
The second type of object in the control flow is the task. Tasks are the individual units of work for the control flow. Remember, the control flow contains the work your package does around the data movements.
Tasks include items such as scripting, Analysis Services, XML manipulations, downloads, and more. Tasks are the building blocks of your package. By putting tasks together, you can create complex operations that are easy to follow because they are represented graphically.
One of the most important tasks is also the second major component within SSIS: the data flow task. This task moves data from one place to another. I cover the data flow task separately because it has its own types of tasks within it.
You can create one or more tasks within a single control flow. That means you could load data from one location, receive a downloaded XML file, read it in, and transform all that data into a final form.
Here are the control flow tasks you have available and a quick synopsis of what they do.
The third type of objects are the precedence constraints. They connect containers and tasks in packages into an ordered control flow. Using precedence constraints, you can control the sequence execution for tasks and containers and specify conditions that determine whether tasks and containers run.
Inside the control flow (although viewed separately in the design environment) is the data flow. The data flow contains connections to data systems and three basic objects and tasks: sources, transformations, and destinations.
Each source object represents various places that you can pull data. You can read from databases, files, Web services, Excel files, XML documents, and more.
You can also apply transformation tasks, which include things such as aggregations, merges, splits, various mathematical and string manipulations, and more. You can place these tasks directly after the source system objects to create the pipeline architecture, or after data arrives at the destination. You can have as many transformation tasks as you want.
When you are finished importing and transforming the data, you place it into a destination. The destination does not have to be a SQL Server database. You can read data from an FTP site, transform it into all lowercase, and then record the data out into an XML file. The data never has to pass through a SQL Server database at all.
The Connection Manager
Before you can read data from a source or write it to a destination, you must have a connection to work with. Connections involve the libraries that talk across protocols to systems and the security authentication (if any) that those connections require.
If you have never programmed before, this might be a new concept, because it seems unnecessary to tell the system you want to extract data and also that you want to connect to that data. In practice, connections are useful because you can reuse them. Not only that, but connections need to be abstracted from the data transfer because you might want to make two connections to a system with different security or configuration information but perform the same data operations on both.
There are a lot of data sources you can access, right out of the box. Here is a quick list:
In the following sections, I show you how to create connections for both source and destination data.
With the connection created, SSIS can read from several types of data sources. Here is a chart of the types you have available.
If you are connecting to a SQL Server database, you will most often use the OLE DB data source.
In a moment, I put all this together in a simple package so that you can see the interactions between the connections and the data sources.
When you bring data from another system into a database, you will often need to change it. Most often, it is a data-type issue.
For instance, when you are reading from a flat file, the data is necessarily all text. There is no notion of dates, numbers, or binary representation because the ASCII set does not support a "type" attribute. When you put the data into a database, however, it will have a type. In some cases, the database can just "figure out" what to do, such as importing the string "10/23/1963" as a date. That is called an implicit conversion, and other than the convenience factor, it is not always a good thing. The system might implicitly convert a telephone number string into a numeric format, and subtract the phone number from the area code.
To prevent this kind of thing, you should at least implicitly transform the data types from the source system to the destination. In other cases, you might want to completely change the string, even if the data types match. In applications such as BI systems, you might have a need to change the case of text or add ten days to a date.
So far, I have been describing changing the contents of a single field within the extract. In more complex applications, you need to do much more than that. SSIS includes several types of transformation operations, covering everything from the single-field changes to creating fields that do not even exist in the source system.
I am describing these transformations so that you can plan your ETL process. Even if you are not setting up a complex BI system, you still need to have a plan for the data path. The reason for that plan is that you do not own the data. The minute you change the data from what it originally was in the source system you have added meaning, which is not something the DBA should do. I explain that concept in greater depth in Chapter 10, "Analysis Services," but for now you should ensure that you present your plan to change data to the data owners for their sign-off. Only the owners can set the meaning for the data.
Let's take a look at some of the more common types of transformations you are likely to use and then compare that to the transformation tasks available in SSIS.
One of the most common transformations is to homogenize data. That simply means making the data from multiple sources look the same by changing case, order, data types, and so forth.
Assume that you have three source systems at various plants. One might represent part numbers separated by dashes, another might use forward slashes, and still another might use periods to separate the numbers. When you bring that data in, you may want to strip out all of those characters and simply run the numbers together. That is a homogenization transformation.
Another more difficult transformation is cleansing. If you import much data, you will certainly encounter sources that have allowed "dirty" data into their systems. This can include foreign characters, strings where numbers are required, or any number of data corruptions. In this case, a judicious review of what your goals are is essential. Do you ignore the rows with the bad data? If you do, there is a possibility you will violate the integrity of the system you are designing. For instance, if you delete a "bad" part number from an import, you might not be able to import the larger assembly that depends on it.
In addition to ignoring the data, which really is not a transform anyway, you could set up a logical rule that changes the bad data to an acceptable form. This is possible if you know what the possible errors are and what they should map to, but this is not always possible.
The last choice you have if you cannot cleanse the data is to write it to an "exceptions table" or to a logging file. Although technically a transform, you should at least employ some sort of detection logic so that you do not stop an entire import process because of a single row of bad data.
An aggregation transformation is used with numeric values. In this case, you add numbers or counts of data elements and store the result into a single field. For instance, instead of recording every part from the source system, you might record that there were ten part number 12345s, five number 34215s, and so on.
Aggregations are common in analytical systems because by definition they are more concerned with strategic numbers than the detail.
A merging transformation type is used when you need to combine fields. You may find that your source systems have broken a single field into multiple parts. This is quite common in project management systems, where the Work Breakdown Structure (WBS) numbers are used as a "database in a database," meaning that each character in the number has positional relevance. In that case, you may want to combine the fields into one field.
In global systems, you may need to convert data from one language or format to another. This might be as simple as changing one character to another or as complex as looking up a currency rate on a Web service and changing the inputs to dollars or euros.
Much like the merge transformation, joining transformations involves combining fields. The difference is that a joining transformation adds two or more fields together or even appends data to an existing field and does not make any other changes to it.
One of the most interesting transformations is when you create a new field from the data that is imported. This is called a materialized transformation. Possible uses include creating totals by performing mathematical operations on merged or joined columns or adding a location field to the import based on where it came from.
Most every BI extract includes this type of transform because they include a timestamp field on the import. Because that data was not in the original file, it is a materialized transform.
You can achieve the types of transformations I have just described by using one of the transformations that SQL Server provides or by combining the output of one to the input of another.
It is beyond the scope of this chapter to describe each of the transformations and how you can use them, but here is a chart that gives you a quick rundown of the variety of capabilities SSIS transformations provides.
The destinations for the data are similar to what I explained for sources, but SSIS includes a few more destinations than sources. The reason is that SSIS is used a great deal for BI applications, so you will see destinations for dimensions, data models, and partitions. I cover those objects more in Chapter 10. Here is a list to help you with the type of destinations you have available:
To connect containers to tasks or tasks to other tasks, you use the next object in the control flow, precedence constraints. Precedence constraints are conditions of failure, success, completion, or any mix of all three.
Essentially, you "tie" a container or task to another container or task. In this way, you can create your logic for the workflow, such as "Do A if this task succeeds, B if it fails, or C when it completes." By using the proper constraint precedence arrangements, you can create AND and OR conditions on a task. This creates the flow for the package.
You have a few choices for the precedence constraints. Most of the time, you will use the execution results from the task to enforce the constraint. For instance, to enforce the following workflow: "If the execute T-SQL command is successful, move on to the next task, but if it fails, send an e-mail," you would create two precedence constraints: one for the success tied to the next task, and the other for the failure tied to a Send Mail task.
You can also code your precedence constraints to use an expression evaluation. This would mean something like "If A = B, then True" to evaluate whether to branch off to one task over another. The variables I discuss in a moment are useful for this type of workflow requirement.
Precedence constraints attach to either an input or an output of another object. Although you might think a destination would not have an output, it does. Many objects, such as destinations, have error outputs.
Not all inputs can be linked to all inputs, but don't worry: The interface will stop you if you try to attach to the wrong place.
Another concept that is borrowed from programming that you may not be familiar with is the event. Conceptually, the idea of an event is not hard to follow. When something happens, that is an event. The more interesting part is that certain SSIS objects can "raise" events, which you can watch for and respond to. You can watch events from a package, a task, or a container.
To watch and respond to the events, you create an event handler. Here are the types of event handlers and what they watch for.
Event handlers are great for debugging your process, too. You can watch for status changes, errors, and so on when the package is running, and when you have got everything worked out, you can remove them.
Even with all those objects, there is still more that you can do with SSIS. To really extend the Package into an enterprise process, you need two other facets: logging and extensibility. I cover the logging options in the "Management" section a little later, because it allows you to maintain your system as well as develop against it.
Another handy concept borrowed from programming is the variable. A variable is a name set aside to hold a value that can change while the package is running. It is similar to a notepad off to the side that you can write on and read while your workflow completes.
Variables are stored in a central location, called a namespace. SSIS has some built-in variables and stores them in the namespace called System Variables, which you can also read. The variables you create for yourself are normally stored in a namespace called User Variables. In this section, I focus on the user variables, but do not overlook the valuable information that the system variables hold. I show you how to get to those later.
When you create a variable, it applies to only the area that contains it. For instance, if you create a variable in a task, only that task and any objects contained within it can see or change the variable. This is called scope.
If you create a variable in a container, all the objects within that container can reference the variable. Creating a variable in the package essentially creates a "global" variable that all objects can see. Although my developer friends cringe when they see me do it, I normally create variables with the scope in their name.
To set a variable, you can right-click an object and select that option from the menu that appears. You can also work with variables in the Expressions property page in the Business Intelligence Development Studio, which I describe next.
Because the SSIS feature has so many facets, you can use more than one tool to work with it. You can use several command-line tools to manage and deploy packages. SSIS also includes a full programming model, so your developers can create entire applications using only code. In this chapter, I focus on the tools that you will be asked to use as the DBA.
Creating a Project with the Business Intelligence Development Studio
To design SSIS packages, you will use the Business Intelligence Development Studio. This is a Visual Studio-inspired application that has an integrated graphical design interface, a debugger, context-sensitive help, and even source control.
You will find the Business Intelligence Development Studio in the Start menu in the SQL Server 2005 item. In this section, I open the interface and explain the environment. You can follow along on your own testing system.
You have already seen much of this interface when we explored the SQL Server Management Studio, so I do not cover things such as the pushpins on the menus or the panel placements. Those work here the same as in the SQL Server Management Studio. There are, however, a few other items that are unique to a design environment.
In this section, I am not creating a real project; we are just exploring the interface. While you read through this section, watch for all the objects I explained earlier in the chapter.
When you first open the Business Intelligence Development Studio (which I call the Development Studio from here on), you are placed in the Start page, which is actually in the main section of the environment, similar to the Summary area in SQL Server Management Studio. You can see the screen layout in Figure 8-1.
[View full size image]
This HTML-enabled area has hyperlinked items you can click, showing the latest news from the Microsoft development world and introductions to various parts of the tool. One of the main panels within this screen allows you to create various projects, such as a Reporting application, a BI application, and an Integration Services application. Click the Create: Project link on the Start page to bring up the dialog shown in Figure 8-2.
[View full size image]
Here you are able to set the directory for the project and create a subdirectory for the project automatically. This is a practice that I recommend, along with storing your projects in a central location.
After you make your selections, you are dropped right into the interface without connecting to a server first. This is because you are working in a development environment. You could create an entire project without connecting to a database at all.
In Figure 8-3, you can see the screen I am presented with on my test system.
[View full size image]
If your screen does not look like this one, you can click the Window menu bar item and then select Reset Window Layout.
On the far-right side of the screen are the Solution Explorer and Properties panels, just as I showed you earlier when we spoke about the SQL Server Management Studio. In a moment, I show you how to add data sources and data source views and how you use those, but for now "unpin" those panels to hide them. Next slide out and "pin" the Toolbox on the left side of the screen.
As you can see in Figure 8-4, this is where having multiple monitors really comes in handy. With a large-enough monitor, you could set all of the menus to "Float," and then have a large central area to work with the package. In this case, just keep the menus in the tabbed interface.
[View full size image]
The large area in the center-right of this screen has four tabs, containing the control flow, the data flow, event handlers, and the Package Explorer view. That is all encompassed inside one tab referenced by the name of the project. At the bottom of the screen is the Connection Manager area. This is where we will make the connections to the databases that are used by the other objects. On the left-hand panel is the Toolbox for the control flow items.
To begin the project, simply drag items from the Toolbox into the panel on the right that you are interested in working with. Because the control flow is what is showing here, drag a File System task into that part of the screen.
In Figure 8-5, you can see that there is an error on this task, indicated by the X in a red circle. It also has a precedent constraint highlighted. In this case, it is green, which means it is the success precedent. I come back to that in a moment.
[View full size image]
Right-click the object and select the Properties item from the menu that appears. The tab on the right shown in Figure 8-6 slides out to show all the properties of the object and allows you to set or edit them.
[View full size image]
Close that panel and double-click the File System Task object. You can also right-click an object and select Edit from the menu that appears to do the same thing. The results are shown in Figure 8-7.
[View full size image]
In this case, the File System task expects information about what it needs to do. As I explained earlier, the File System task can copy, move, delete, or create files and folders on the operating system. This example I have created copies a file called test.txt in the c:\temp directory to a file called new.txt in the c:\temp directory (and overwrites it if it already exists). Most of those instructions are visible in this panel, but the directories are not.
The reason is that when you click the SourceConnection entry box to set the start file, you have to either specify a file connection or create a new one. Although I'm not showing that dialog box here, selecting the item marked <…new connection .. > brings up a panel with a browse button to set the path or filename. Once those are created, they show up in the bottom part of the environment and are available to any of the other objects you work with.
Click the OK button to close that item and create the next item in the workflow we are developing. In this case, we will send an e-mail that the process is complete. Drag the Send Mail task onto the control flow.
You can see the object in Figure 8-8. With the Send Mail task created, it must be configured for use. Double-click the object to bring up the editor for it, shown in Figure 8-9.
[View full size image]
[View full size image]
In this case, the Mail item on the left is selected to show that the system will attempt to mail a message containing the text of the file called new.txt from the file connection created a moment ago; notice how we are already reusing connections. Clicking the name of the SMTP Connection allows you to set up the SMTP server the task connects to. Click OK to close out the task.
Next we turn our attention to the precedence constraints. We need to tell the system to create the new.txt file and then send out the mail only if that is a successful event.
To set the constraint, click once on the File System task and then click once on the green arrow pointing downward from it. You do not have to hold the mouse button down; the line is now "locked" to your pointer until you click another object. Move the arrow until it connects to the Send Mail task, and then click once more. You can see the results in Figure 8-10.
[View full size image]
Now double-click the green line, which brings up the dialog box shown in Figure 8-11.
This is the Precedence Constraint Editor. Here you can set whether the step evaluates to a constraint such as success, failure, or completion; or whether the step evaluates an expression, which contains the variables I explained earlier in the chapter. You can also set the evaluation to both the constraint and the expression or the constraint or the expression. Selecting the Expression option opens up the Expression text box below. Click OK to continue.
If you want to add another precedence constraint, right-click the object and then select Add Precedence Constraint from the menu that appears.
Figure 8-12 shows the complete picture of the package as it stands now.
[View full size image]
In that screen, you can see the two tasks I have created, the success precedent constraint, and the connections for the files and the mail system. Although this seems perfectly clear now, if you begin to add more File System tasks or Send Mail tasks, it will not be obvious what you are trying to accomplish in the package.
To add more meaning to the graphics, click the File System object in the text that is displayed, wait a moment, and then click it again. The text will highlight, which allows you to type a more meaningful title, as shown in Figure 8-13.
[View full size image]
With everything complete, it is time to save the project. I normally click the "multiple disks" icon in the icon bar, but you can also use the menu bar to save your work. The project is now saved, the objects are all configured, and we are ready to run the package. In the Development Studio, this is known as debugging, and you can start it by pressing the F5 key or you can click Debug and then Start Debugging in the menu bar.
On my test system, the File System task worked properly and is colored green. The Send Mail task has failed, however, and is indicated with a red box, which you can see as the darker box in Figure 8-14.
[View full size image]
To find out why it has failed, I click the Progress tab at the top of the package.
You can see in Figure 8-15 that my test system is behind a firewall that blocks SMTP traffic. Although the actual failure message text is not helpful in this case, the steps taken by the package are clearly diagrammed so that you can trace each activity within SSIS. This is useful when trying to find an issue with your package's execution. You will also get feedback on how long the process takes, which is valuable in performance monitoring.
[View full size image]
The graphical tools are by far the easiest method to work with SSIS, but sometimes you will need to use the command line to access the system. In that case, you have four tools at your disposal.
Although you have a powerful environment to work with for designing projects, you might only have modest data transport needs, such as moving a table to a file or reading data from an Excel spreadsheet into a database. You will be happy to learn that you do not have to learn an entire new development environment to do that. Microsoft provides a simple wizard-driven interface that walks you through the process to move data to or from your database. In fact, if you followed the Maintenance Wizard process I demonstrated earlier in the book, you have already used SSIS to set up a package. In your role as a DBA, you will often create a package using a wizard and then expand the pac