Azure Synapse Analytics for SQL Server Gurus Part One: Overview

David Norton

0/5 (0 vote)

Jul 13, 2021

CPOL

6 min read

4674

In this article we outline the problem: our fictional CEO wants to predict taxi use to ensure taxis are available where and when customers need them.

We’ve all been there.

We’re thrust into a project of paramount importance to the business. This project will test our SQL and analytical skills as never before. Although the CEO summarizes the problem in one short sentence, we know that the solution is much bigger than the problem they describe.

The CEO says that idle taxis cost the company. Moving empty vehicles from one location to another is also expensive. They want to cut costs by knowing what vehicles to deploy ahead of time. A car should be in the right place at the right time.

The catch, as always, is there’s no budget, and the CEO wants results tomorrow morning.

This series of three articles will explore how to use Azure Synapse Analytics and its comprehensive T-SQL support to solve this challenge.

Problem Outline

Since we have a short turnaround time, we’ll solve this problem using what we know best: an SQL server. Although building an application that uses time series analysis would take weeks, we can instead put our data into an Azure Spark pool and use Microsoft Azure’s automated machine learning (ML).

We need two forecasts for our solution: a prediction of which trip will occur and a projection of the number of calls for that trip.

From the outset we knew that we could solve the problem using a logistical model. To solve this problem, we’ll use:

Calls for a taxi: This is the demand.
Taxi availability: This is the supply.
Trips between two locations: These are the products or SKUs.
Trip length: This is the cost. The longer the trip, the higher the cost.
Zones: These are the stocking points in our logistical model.

You could also use this model to fit other circumstances. For example, replace the phrase "A call for a taxi to zone 1 at zone 50" with the words "A call for a pair of pliers at the Denver branch."

The problem consists of two main parts. First, we must forecast where the demand and desired destinations will be. We must also deploy excess vehicle supply in the most cost-effective way, where a "hole" exists in a zone’s supply.

Typical seasonal, monthly, quarterly, and annual patterns may exist in the data. Patterns may also be weekly, daily, and hourly. For this model, we total our demand into hourly buckets. We assume that any vehicle ending a trip in an hour bucket is available as vehicle supply in that hour, which is not necessarily true, but a good enough approximation for our purposes.

Our Aim

Armed with a demand forecast and our T-SQL skills, we'll use this taxi logistics model to generate a supply model. Combining the two models enables us to create a deployment schedule.

We’ll create a forecast for the forecast period and zone. Based on this forecast, we make the future supply available. We’ll know which locations are under-supplied and which are over-supplied. It is then a simple matter to deploy the closest available taxi to bring the supply in balance with the demand.

At the end of a forecast period, we are in a position to create a suggested deployment plan for the subsequent few periods.

What Tools Will We Use?

We employ several Azure resources to solve this problem:

Azure Synapse Analytics
Azure Machine Learning Studio
Azure SQL database (serverless)
Azure SQL database (dedicated)
Azure Spark pool

Let’s assume we have an on-premises SQL database containing the trip data. We want to use Microsoft Azure Synapse to analyze our data, so we need to upload the data to our Azure Blob Storage.

We won’t walk you through the basics in this article, but if you need more information about setting up Azure Synapse Analytics, check the resources at the end of this article or register for the Hands-on Training Series for Azure Synapse Analytics. Note that you can access other Azure services outside the Azure Synapse workspace by creating a linked service.

To follow this article series, you’ll need to have an Azure account. You can create a free Azure account, and you’ll get a $200 credit to explore everything Azure has to offer for 30 days.

For this solution, we only need our trip data. The raw trip data contains start time, end time, start location, end location, and trip distance. We can derive all the inputs for our model from these. We are not interested in the physical location — only the location ID stored in the trip data. We’ll categorize trips by averaging the trip distance and duration, and group by the start and end zones. The zone numbers combine to create a unique trip index.

We can put data into our Azure Synapse workspace efficiently through Azure Data Studio. This application ships with SQL Server Management Studio (SSMS) after version 18.

We can connect to our on-premises SQL server using Azure Data Studio. We can query our data using both SQL scripts and Jupyter notebooks. In our case, we extracted our data from SQL using Python in a Jupyter notebook and saved it in Parquet format, a compressed format database file — we’ll explain this process later in this series. We can then upload this data to an Azure Spark container directory in our Azure Blob storage.

In Azure Synapse Analytics, we can query the uploaded data from an external Serverless SQL table and query it as a Spark table. Whether your comfort zone is SQL, Python, Scala, or C#, you are in an excellent position to use Azure Synapse Analytics.

Next Steps

All the project code, SQL scripts, and notebooks are available on GitHub through the links below. The list also includes links to resources with details and knowledge we rely on in this article.

In the following two articles, we’ll dig into this project in detail. For our tutorial, simple SQL scripts turn our raw taxi data into usable demand data. Then, Azure Machine Learning creates the demand forecast. Up until that point, all data remains in Azure Blob storage containers. We’ll next save the demand forecast to a dedicated SQL instance then generate a deployment report to help our CEO ensure taxis are in the right place at the right time.

Continue to the second article of this series to explore how to get data into Azure Synapse Analytics and build a machine learning model.

To learn more about making the most of your organization’s data, you can also register for Microsoft’s Hands-on Training Series for Azure Synapse Analytics.

Resources

Code: Notebook for Azure Data Studio - Get SQL Data to Parquet
Step 1 - SQL Script to import NYC taxi trip data
Step 2 - Create views in Serverless SQL
Step 3 - Load demand from views
Step 3.5 - Create forecast tables in Spark
Step 3.6 - Get forecast data into Spark Pool
Step 4 - Demo import of uploaded data
Step 5 - Deployment server definition
Step 6 - Deployment report
Deployment report data generated
Azure Analytics Demo Miniseries
Create a Synapse workspace (Ensure you create it with VPN enabled, and while developing, keep Interactive authoring enabled in your integration runtime)
Create an Azure Machine Learning workspace
Create a Linked Service
Create a Managed private endpoint
Azure best practices
Explore Spark databases with Azure Synapse Analytics
Microsoft Tutorial: Train a machine learning model without code