An effective data integration service must provide several components:
A way to perform specific actions. You might need to copy data from one datastore to another, for example, or to run a Spark job to process data. To allow this, ADF provides activities, each focused on carrying out a specific task.
A mechanism to specify the overall logic of your data integration process. This is what an ADF pipeline does, calling activities to carry out each step in the process.
A tool for authoring and monitoring the execution of pipelines and the activities they depend on.
Figure 1 illustrates how these aspects of ADF fit together.
As the figure shows, you can create and monitor a pipeline using the pipeline authoring and monitoring tool. This browser-based graphical environment lets you create new pipelines without being a developer.
People who prefer to write code can do so as well: ADF also provides SDKs for creating pipelines in several languages. Each pipeline runs in the Azure data center you choose, calling on one or more activities to carry out its work.
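The relationship between a pipeline and its activities can be sketched in a few lines of plain Python. This is only an illustrative model, not the ADF SDK: the class names `Activity` and `Pipeline`, the `execute` method, and the activity names are all hypothetical, and a real pipeline can express richer logic than a simple ordered list.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Activity:
    """One unit of work, such as a copy or a Spark job (hypothetical model)."""
    name: str
    run: Callable[[], str]


@dataclass
class Pipeline:
    """Holds the overall logic; here, simply an ordered list of activities."""
    name: str
    activities: List[Activity] = field(default_factory=list)

    def execute(self) -> List[str]:
        # Call each activity in order and collect a short status line for it.
        return [f"{a.name}: {a.run()}" for a in self.activities]


# Hypothetical two-step process: copy data, then process it.
pipeline = Pipeline(
    name="nightly_load",
    activities=[
        Activity("copy_sales_data", lambda: "copied"),
        Activity("spark_transform", lambda: "processed"),
    ],
)
results = pipeline.execute()
print(results)
```

The point of the sketch is the division of labor: each activity knows how to do one thing, while the pipeline decides which activities run and in what order.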
If an activity runs in Azure (either in the same data center as the pipeline or in another Azure data center), it relies on the Azure Integration Runtime (Azure IR).
An activity can also run on-premises or in another public cloud, such as AWS. In this case, the activity relies on the Self-Hosted Integration Runtime. This is essentially the same code as the Azure IR, but you must install it wherever you need it to run. But why bother with the Self-Hosted IR?
The most common answer is that activities on Azure may not be able to directly access on-premises data sources, such as those that sit behind firewalls.
It’s often possible to configure the connection between Azure and on-premises data sources so that there is a direct connection (if you do, you don’t need to use the Self-Hosted IR), but not always. For example, setting up a direct connection from Azure to an on-premises data source might require working with your network administrator to configure your firewall in a specific way, something admins aren’t always happy to do.
The Self-Hosted IR exists for situations like this. It provides a way for an ADF pipeline to use an activity that runs outside Azure while still giving that activity a direct connection back to the cloud.
A single pipeline can use many different Self-Hosted IRs, along with the Azure IR, depending on where its activities need to execute. It’s entirely possible, for example, that a single pipeline uses activities running on Azure, on AWS, inside your organization, and in a partner organization. All but the activities on Azure could run on instances of the Self-Hosted IR.
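The rule described above (activities on Azure use the Azure IR, while everything else uses a Self-Hosted IR installed where the activity runs) can be expressed as a small lookup. The following is a minimal sketch in plain Python, not ADF code: `PlacedActivity`, `runtime_for`, `plan_runtimes`, and the location labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PlacedActivity:
    """An activity plus the place it must execute (hypothetical model)."""
    name: str
    location: str  # assumed labels: "azure", "aws", "on_premises", "partner"


def runtime_for(activity: PlacedActivity) -> str:
    # Activities on Azure rely on the Azure IR; anything else needs a
    # Self-Hosted IR instance installed at that location.
    if activity.location == "azure":
        return "Azure IR"
    return f"Self-Hosted IR ({activity.location})"


def plan_runtimes(activities: List[PlacedActivity]) -> Dict[str, str]:
    """Map each activity name to the runtime it would rely on."""
    return {a.name: runtime_for(a) for a in activities}


# One pipeline whose activities span four locations, as in the text.
activities = [
    PlacedActivity("transform_in_cloud", "azure"),
    PlacedActivity("read_s3_export", "aws"),
    PlacedActivity("copy_from_sql_server", "on_premises"),
    PlacedActivity("fetch_partner_feed", "partner"),
]
plan = plan_runtimes(activities)
for name, runtime in plan.items():
    print(f"{name} -> {runtime}")
```

In this model only the first activity maps to the Azure IR; the other three each map to their own Self-Hosted IR instance.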