Understanding the basics of ADF pipelines isn’t hard. Figure 4 shows the components of a simple example.
One way to start a pipeline running is to execute it on demand. You can do this through PowerShell, by calling a RESTful API, through .NET, or by using Python.
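As a sketch of the Python option, the azure-mgmt-datafactory package can start a run with a single call. The subscription, resource group, factory, and pipeline names below are placeholders, and the snippet assumes the azure-identity and azure-mgmt-datafactory packages are installed:

```python
# A minimal sketch of starting an ADF pipeline run on demand from Python.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# create_run starts the pipeline and returns an identifier for this run.
run = client.pipelines.create_run(
    resource_group_name="my-resource-group",   # placeholder names throughout
    factory_name="my-data-factory",
    pipeline_name="CopyAndProcessPipeline",
)
print("Started pipeline run:", run.run_id)
```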
A pipeline can also start executing because of some trigger.
For example, ADF provides a scheduler trigger that starts a pipeline running at a specific time. However it starts, a pipeline always runs in some Azure data center.
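As a rough sketch of the scheduler trigger just mentioned, the same Python package exposes trigger management. The model and parameter names below follow azure-mgmt-datafactory, but treat the details as an approximation; the trigger, pipeline, and factory names are placeholders:

```python
# Sketch: a schedule trigger that starts the pipeline once an hour.
from datetime import datetime

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference,
    ScheduleTrigger,
    ScheduleTriggerRecurrence,
    TriggerPipelineReference,
    TriggerResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

trigger = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Hour", interval=1, start_time=datetime(2024, 1, 1)),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="CopyAndProcessPipeline"))],
)

# Register the trigger with the factory; ADF then starts the pipeline on schedule.
client.triggers.create_or_update(
    "my-resource-group", "my-data-factory", "HourlyTrigger",
    TriggerResource(properties=trigger),
)
```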
The activities a pipeline uses might run either on the Azure IR, which also runs in an Azure data center, or on the Self-Hosted IR, which runs on-premises or on another cloud platform. The pipeline shown in Figure 4 uses both options.
Pipelines are the operation's boss, but activities do the actual work. Which activities a pipeline invokes depends on what the pipeline needs to do. For example, the pipeline in Figure 4 carries out several steps, using an activity for each one. Those steps, illustrated by the configuration sketch that follows the list, are:
1. Copy data from AWS Simple Storage Service (S3) to Azure Blobs. This uses ADF’s Copy activity, which runs on an instance of the Self-Hosted IR installed on AWS.
2. If this copy fails, the pipeline invokes ADF's Web activity to send an email reporting the failure. The Web activity can call an arbitrary REST endpoint, so in this case it invokes an email service to send the failure message.
3. If the copy succeeds, the pipeline invokes ADF’s Spark activity. This activity runs a job on an HDInsight Spark cluster. In this example, that job does some processing on the newly copied data, then writes the result back to Blobs.
4. Once the processing is complete, the pipeline invokes another Copy activity, this time to move the processed data from Blobs into SQL Data Warehouse.
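Here is the sketch promised above. ADF pipelines are defined in JSON (discussed later in this section), so the structure of Figure 4's pipeline is shown here as a Python dictionary mirroring that JSON. The activity names and most typeProperties are illustrative placeholders, not a complete working definition:

```python
# Sketch of how the Figure 4 pipeline might be described. Activity names,
# linked services, and most typeProperties are illustrative placeholders.
pipeline_definition = {
    "name": "CopyAndProcessPipeline",
    "properties": {
        "activities": [
            {   # Step 1: copy from AWS S3 to Azure Blobs (runs on the Self-Hosted IR).
                "name": "CopyFromS3",
                "type": "Copy",
                "typeProperties": {"source": {"type": "AmazonS3Source"},
                                   "sink": {"type": "BlobSink"}},
            },
            {   # Step 2: on failure, call an email service via the Web activity.
                "name": "NotifyOnFailure",
                "type": "WebActivity",
                "dependsOn": [{"activity": "CopyFromS3",
                               "dependencyConditions": ["Failed"]}],
                "typeProperties": {"method": "POST",
                                   "url": "https://example.com/send-mail"},
            },
            {   # Step 3: on success, run a Spark job on an HDInsight cluster.
                "name": "ProcessWithSpark",
                "type": "HDInsightSpark",
                "dependsOn": [{"activity": "CopyFromS3",
                               "dependencyConditions": ["Succeeded"]}],
                "typeProperties": {"rootPath": "jobs",
                                   "entryFilePath": "process.py"},
            },
            {   # Step 4: copy the processed data from Blobs to SQL Data Warehouse.
                "name": "CopyToWarehouse",
                "type": "Copy",
                "dependsOn": [{"activity": "ProcessWithSpark",
                               "dependencyConditions": ["Succeeded"]}],
                "typeProperties": {"source": {"type": "BlobSource"},
                                   "sink": {"type": "SqlDWSink"}},
            },
        ]
    },
}
```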
The example in Figure 4 gives you an idea of what activities can do, but it’s pretty simple. Activities can do much more.
For example, the Copy activity is a general-purpose tool to move data efficiently from one place to another. It provides built-in support for dozens of data sources and sinks—it’s data movement as a service.
The supported options include virtually all Azure data technologies, along with AWS S3 and Redshift, SAP HANA, Oracle, DB2, MongoDB, and many more. Copy operations can also scale as needed, speeding up data transfers by running them in parallel, with speeds up to one gigabit per second.
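As a small illustration of those knobs, a Copy activity's typeProperties can carry settings that control parallelism. The property names below follow ADF's Copy activity JSON, shown as a Python dictionary; the values are arbitrary:

```python
# Sketch: Copy activity settings that influence parallelism and throughput,
# as they might appear in the CopyFromS3 activity from the earlier pipeline
# sketch. The numbers are arbitrary illustrations.
copy_type_properties = {
    "source": {"type": "AmazonS3Source"},
    "sink": {"type": "BlobSink"},
    "parallelCopies": 8,          # how many parallel copy streams to use
    "dataIntegrationUnits": 16,   # how much compute ADF allocates to this copy
}
```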
ADF also supports a much broader range of activities than Figure 4 shows. Along with the Spark activity, for example, it provides activities for other approaches to data transformation, including Hive, Pig, U-SQL, and stored procedures.
ADF also provides a range of control activities, including If Condition for branching, Until for looping, and ForEach for iterating over a collection. These control activities can also scale out, letting you run loop iterations and other work in parallel for better performance.
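As a sketch of one of those control activities, a ForEach activity can run its iterations in parallel. The JSON shape is shown as a Python dictionary; the parameter name, inner activity, and batch size are illustrative:

```python
# Sketch: a ForEach control activity that copies a list of files in parallel.
# The "items" expression reads a pipeline parameter; batchCount caps how many
# iterations run at once.
for_each_activity = {
    "name": "CopyEachFile",
    "type": "ForEach",
    "typeProperties": {
        "items": {"value": "@pipeline().parameters.fileList", "type": "Expression"},
        "isSequential": False,   # allow iterations to run in parallel
        "batchCount": 10,        # at most 10 iterations at a time
        "activities": [
            {"name": "CopyOneFile", "type": "Copy",
             "typeProperties": {"source": {"type": "BlobSource"},
                                "sink": {"type": "BlobSink"}}},
        ],
    },
}
```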
Pipelines are described using JavaScript Object Notation (JSON), and anyone using ADF is free to author a pipeline by writing JSON directly. But many people who work with data integration aren’t developers; they prefer graphical tools. For this audience, ADF provides a web-based tool for authoring and monitoring pipelines. There’s no need to use Visual Studio. Figure 5 shows an example of authoring a simple pipeline.
This example shows the same simple pipeline illustrated earlier in Figure 4. Each of the pipeline’s activities — the two Copies, Spark, and Web — is represented by a rectangle, with arrows defining the connections between them. Some other available activities are shown on the left, ready to be dragged and dropped into a pipeline as needed.
The first Copy activity is selected, bringing up a pane at the bottom where you can give it a name (used in monitoring the pipeline's execution), a description, and values for its parameters.
Note: It’s possible to pass parameters into a pipeline, such as the name of the AWS S3 bucket to copy from and to pass the state from one activity to another within a pipeline.
Every pipeline also exposes a REST interface, which is what an ADF trigger uses to start the pipeline.
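That interface is the same createRun call the earlier Python snippet used through the SDK. As a rough sketch of the raw REST version, with the IDs, names, and bearer token as placeholders:

```python
# Sketch: starting a pipeline run through ADF's REST interface directly.
import requests

url = (
    "https://management.azure.com/subscriptions/<subscription-id>"
    "/resourceGroups/my-resource-group/providers/Microsoft.DataFactory"
    "/factories/my-data-factory/pipelines/CopyAndProcessPipeline/createRun"
    "?api-version=2018-06-01"
)
response = requests.post(url, headers={"Authorization": "Bearer <token>"}, json={})
print(response.json()["runId"])   # identifier of the run that was started
```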
The tool generates JSON, which is stored in a Git repository and can be examined directly. Examining it isn't necessary to create a pipeline, though: the graphical tool lets an ADF user build fully functional pipelines with no knowledge of how those pipelines are described under the covers.
In a perfect world, all pipelines would complete successfully, and there would be no need to monitor their execution.
In the real world, however, pipelines can fail. One reason is that a single pipeline might interact with multiple cloud services, each of which has its own failure modes.
For example, it’s possible to pause a SQL Data Warehouse instance, something that might cause an ADF pipeline using this instance to fail.
But whatever the reason, the reality is the same: We need an effective tool for monitoring pipelines. ADF provides this as part of the authoring and monitoring tool. Figure 6 shows an example.
As this example shows, the tool lets you monitor the execution of individual pipelines. For example, you can see when each one started, how it was started, whether it succeeded or failed, and more. A primary goal of this tool is to help you find and fix failures. To help do this, the tool lets you look further into the execution of each pipeline.
For example, clicking on the Actions column for a specific pipeline brings up an activity-by-activity view of that pipeline's status, including any errors that have occurred, which ADF Integration Runtime it's using, and other information.
If an activity failed because someone paused the SQL Data Warehouse instance it depended on, for example, you’ll be able to see this directly.
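The same run and activity information is available programmatically as well. Here's a hedged sketch using the azure-mgmt-datafactory package, where the resource names and run identifier are placeholders and the run identifier is the one returned by create_run:

```python
# Sketch: checking on a pipeline run and its activities from Python.
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Overall status of one run: InProgress, Succeeded, Failed, and so on.
run = client.pipeline_runs.get("my-resource-group", "my-data-factory", "<run-id>")
print(run.status)

# Drill into the run activity by activity, much as the monitoring tool does.
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow(),
)
activities = client.activity_runs.query_by_pipeline_run(
    "my-resource-group", "my-data-factory", "<run-id>", filters)
for activity in activities.value:
    print(activity.activity_name, activity.status, activity.error)
```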
The tool also pushes all of its monitoring data to Azure Monitor, the common clearinghouse for monitoring data on Azure.