To get a sense of how you can use ADF pipelines, it's helpful to look at real scenarios. This section describes two: building a modern data warehouse on Azure, and providing the data analysis back end for a Software as a Service (SaaS) application.
Data warehouses let an organization store large amounts of historical data, then analyze that data to better understand its customers, its revenue, and other aspects of its business. Most data warehouses today run on-premises, using technology such as SQL Server.
Going forward, however, data warehouses are moving into the cloud. There are some excellent reasons for this, including low-cost data storage (which means you can store more data) and massive amounts of processing power (which lets you do more analysis on that data).
In any case, creating a modern data warehouse in the cloud requires a way to automate data integration throughout your environment. ADF pipelines are designed to do precisely this.
Figure 2 shows an example of data movement and processing that can be automated using ADF pipelines.
In this scenario, data is first extracted from an on-premises Oracle database and Salesforce.com (step 1).
This data isn’t moved directly into the data warehouse, however. Instead, it’s copied into a data lake, a much less expensive form of storage implemented using either Blob Storage or Azure Data Lake.
Unlike a relational data warehouse, a data lake typically stores data in its original form. If this data is relational, the data lake can store traditional tables. But if it’s not relational (you might be working with a stream of tweets, for example, or clickstream data from a web application), the data lake stores your data in whatever form it’s in.
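To make step 1 a little more concrete, here's a minimal sketch of what one of these ingestion steps could look like as an ADF copy activity built with the azure-mgmt-datafactory Python SDK. The dataset names are purely illustrative, and the datasets and linked services they reference are assumed to already exist in the factory.

```python
# A hedged sketch of step 1: copy rows from the on-premises Oracle database
# into the data lake (Blob Storage in this sketch). The dataset names are
# hypothetical and assumed to be defined in the factory already; depending on
# the SDK version, DatasetReference may also need type="DatasetReference".
from azure.mgmt.datafactory.models import (
    BlobSink, CopyActivity, DatasetReference, OracleSource)

copy_to_lake = CopyActivity(
    name="CopyOracleToLake",
    inputs=[DatasetReference(reference_name="OracleCallRecords")],
    outputs=[DatasetReference(reference_name="LakeRawCalls")],
    source=OracleSource(),   # read from the Oracle-backed dataset
    sink=BlobSink(),         # land the data, unchanged, in the lake
)
```

A second copy activity with a Salesforce source could bring in the Salesforce.com data in the same way.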
Why do this?
Rather than using a data lake, why not transform the data as needed and dump it directly into a data warehouse?
The answer stems from the fact that organizations are storing ever-larger amounts of increasingly diverse data. Some of that data might be worth processing and copying into a data warehouse, but much of it might not.
Because data lake storage is so much less expensive than data warehouse storage, you can afford to dump large amounts of data into your lake, then decide later which of it is worth processing and copying to your more expensive data warehouse.
In this era of big data, using a data lake and your cloud data warehouse together gives you more options at a lower cost.
Suppose you’d like to prepare some of the data just copied into the data lake to get it ready to load into a relational data warehouse for analysis.
Doing this might require cleaning the data, such as by removing duplicate records. It might also require transforming it, such as by shaping it into tables. If there's a lot of data to process, you'll want this work done in parallel so that it doesn't take too long.
On Azure, you might run your prepare and transform application on an HDInsight Spark cluster (step 2). In some situations, an organization might copy the resulting data directly into Azure SQL Data Warehouse.
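To give a feel for what step 2's prepare-and-transform work might involve, here's a minimal PySpark sketch that deduplicates raw call records landed in the lake and shapes them into a simple table; the lake paths and column names are hypothetical.

```python
# A minimal sketch of the prepare step: deduplicate raw call records from the
# lake and shape them into a tabular form. Paths and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prepare-call-data").getOrCreate()

raw = spark.read.json("abfss://lake@example.dfs.core.windows.net/raw/calls/")

prepared = (
    raw.dropDuplicates(["call_id"])                       # remove duplicate records
       .withColumn("call_date", F.to_date("call_start"))  # normalize timestamps
       .select("customer_id", "call_id", "call_date", "duration_seconds")
)

# Write the cleaned, tabular result back to the lake for the next step.
prepared.write.mode("overwrite").parquet(
    "abfss://lake@example.dfs.core.windows.net/prepared/calls/")
```

Because Spark distributes this work across the cluster's nodes, the same code handles large volumes of data in parallel.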
But it can also be helpful to do some more work on the prepared data first. For example, suppose the data contains calls made by customers of a mobile phone company. Using machine learning, the company can use this call data to estimate how likely each customer is to churn (switch to a competitor).
In the scenario shown in Figure 2, the organization uses Azure Machine Learning to do this (step 3).
If each row in the table produced in step 2 represents a customer, for example, this step could add a column to the table containing the estimated probability that the customer will churn.
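The details of step 3 depend on the machine learning service used, but the underlying idea is simple: score each prepared row and append the result as a new column. The sketch below illustrates that idea with pandas and a locally loaded scikit-learn model; the model file and feature columns are purely hypothetical stand-ins for whatever Azure Machine Learning would actually provide.

```python
# Illustrative only: append a churn-probability column to the prepared
# customer table. The model file and feature columns are hypothetical.
import joblib
import pandas as pd

customers = pd.read_parquet("prepared/customers.parquet")   # output of step 2
model = joblib.load("churn_model.pkl")                      # previously trained model

features = customers[["calls_last_30d", "avg_call_minutes", "support_tickets"]]
customers["churn_probability"] = model.predict_proba(features)[:, 1]

customers.to_parquet("scored/customers.parquet", index=False)
```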
The critical thing to realize is that, along with traditional analysis techniques, you’re also free to use data science tools on the contents of your Azure data lake.
Now that the data has been prepared and enriched with some initial analysis, it's finally time to load it into SQL Data Warehouse (step 4).
(While this technology focuses on relational data, it can also access non-relational data using PolyBase.)
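Expressed as another ADF copy activity (again with the azure-mgmt-datafactory SDK and hypothetical dataset names), step 4 might look roughly like this; the allow_poly_base flag asks the service to bulk-load through PolyBase.

```python
# A hedged sketch of step 4: copy the prepared, scored data from the lake into
# SQL Data Warehouse, loading through PolyBase. Dataset names are hypothetical
# and assumed to exist in the factory.
from azure.mgmt.datafactory.models import (
    BlobSource, CopyActivity, DatasetReference, SqlDWSink)

load_to_dw = CopyActivity(
    name="LoadWarehouse",
    inputs=[DatasetReference(reference_name="LakeScoredCustomers")],
    outputs=[DatasetReference(reference_name="WarehouseCustomerTable")],
    source=BlobSource(),
    sink=SqlDWSink(allow_poly_base=True),  # bulk-load via PolyBase
)
```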
Most likely, the warehoused data will be accessed by Azure Analysis Services, which allows scalable interactive queries from users via Power BI, Tableau, and other tools (step 5).
This complete process has several steps. If it needed to be done just once, you might choose to do each step manually. In most cases, though, the process will run over and over, regularly feeding new data into the warehouse.
This implies that the entire process should be automated, which is precisely what ADF allows.
You can create one or more ADF pipelines to orchestrate the process, with an ADF activity for each step. Even though ADF isn’t shown in Figure 2, it is nonetheless the cloud service driving every step in this scenario.
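As a rough sketch of that orchestration, the fragment below gathers the activity objects from the earlier sketches into a single pipeline, publishes it with the azure-mgmt-datafactory management client, and starts a run. The resource group, factory, and pipeline names are hypothetical, and a real pipeline would also include activities for steps 2 and 3 (for example, an HDInsight Spark activity and an Azure Machine Learning activity), each depending on the step before it.

```python
# A hedged sketch of wiring the steps into one ADF pipeline. copy_to_lake and
# load_to_dw are the activity objects from the earlier sketches; activities
# for the Spark and machine learning steps would sit between them.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import ActivityDependency, PipelineResource

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Run the warehouse load only after the ingestion step has succeeded.
load_to_dw.depends_on = [
    ActivityDependency(activity=copy_to_lake.name,
                       dependency_conditions=["Succeeded"])]

pipeline = PipelineResource(activities=[copy_to_lake, load_to_dw])
adf.pipelines.create_or_update("my-rg", "my-factory", "WarehouseLoad", pipeline)

# Kick off a run on demand; in practice a schedule or event trigger would
# normally start the pipeline instead.
adf.pipelines.create_run("my-rg", "my-factory", "WarehouseLoad", parameters={})
```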
Most enterprises today use data analysis to guide their internal decisions. Increasingly, however, data analysis is also crucial to independent software vendors (ISVs) building SaaS applications.
For example, suppose an application connects its users with one another, including recommending new people to connect with. Doing this requires regularly processing a significant amount of data, then making the results available to the SaaS application.
Even simpler scenarios, such as providing detailed customization for each app user, can require significant back-end data processing.
This processing looks much like what’s required to create and maintain an enterprise data warehouse, and ADF pipelines can be used to automate the work.
Figure 3 shows an example of how this might look.
This scenario looks much like the previous example. It begins with data extracted from various sources into a data lake (step 1).
This data is then prepared, such as with a Spark application (step 2), and perhaps processed using data science technologies such as Azure Machine Learning (step 3).
The resulting data isn’t typically loaded into a relational data warehouse, however.
Instead, this data is a fundamental part of the service the application provides to its users.
Accordingly, it’s copied into the operational database this application uses, which in this example is Azure Cosmos DB (step 4).
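Within ADF, step 4 would typically be one more copy activity, this time with a Cosmos DB sink. The sketch below skips that plumbing and simply illustrates the kind of per-user document the copy might produce, using the azure-cosmos client library with a hypothetical account, database, container, and field names.

```python
# Illustrative only: the kind of per-user document the SaaS application might
# read from Cosmos DB after step 4. Account URL, key, and names are hypothetical.
from azure.cosmos import CosmosClient

client = CosmosClient("https://example-account.documents.azure.com:443/",
                      credential="<account-key>")
container = client.get_database_client("saasapp").get_container_client("recommendations")

container.upsert_item({
    "id": "user-4711",
    "userId": "user-4711",                     # partition key in this sketch
    "suggestedConnections": ["user-0042", "user-1337"],
    "generatedAt": "2018-06-01T00:00:00Z",
})
```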
Unlike the scenario shown in Figure 2, the primary goal here isn’t to allow interactive queries on the data through standard BI tools (although an ISV might also provide that for its internal use).
Instead, it’s to give the SaaS application the data it needs to support its users, who access this app through a browser or device (step 5). And as in the previous scenario, an ADF pipeline can be used to automate this entire process.
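On the read side (step 5), the application's back end can fetch a user's precomputed results directly from Cosmos DB. A minimal sketch, using the same hypothetical account and container as above:

```python
# Illustrative only: the SaaS back end fetching one user's precomputed
# recommendations (same hypothetical account and container as above).
from azure.cosmos import CosmosClient

client = CosmosClient("https://example-account.documents.azure.com:443/",
                      credential="<account-key>")
container = client.get_database_client("saasapp").get_container_client("recommendations")

doc = container.read_item(item="user-4711", partition_key="user-4711")
suggestions = doc["suggestedConnections"]
```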
Several applications already use ADF for scenarios like these, including Adobe Marketing Cloud and Lumdex, a healthcare data intelligence company.
As big data becomes increasingly important, expect to see others follow suit.