For the workshop, we will need to provision multiple Azure resources/services; many of these are needed for the primer only, as labeled below. The following is a step-by-step provisioning guide.
- Azure Resource Group
- Azure Virtual Network
- Azure Blob Storage
- Azure Databricks
- Azure Data Lake Storage Gen1 (for the primer only)
- Azure Data Lake Storage Gen2 (for the primer only)
- Azure Event Hub (for the primer only)
- Azure HDInsight Kafka (for the primer only)
- Azure SQL Database
- Azure SQL Data Warehouse (for the primer only)
- Azure Cosmos DB (for the primer only)
- Azure Data Factory v2 (for the primer only)
- Azure Key Vault (for the primer only)
- A Linux VM for the Databricks CLI
Note: All resources should be provisioned in the same Azure region (datacenter).
Create a resource group called "gws-rg" into which we will provision all other Azure resources.
https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-portal
Provision an Azure VNet in #1.
https://docs.microsoft.com/en-us/azure/virtual-network/quick-create-portal#create-a-virtual-network
Provision an Azure Blob Storage account (gen1) in #1
https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?toc=%2fazure%2fstorage%2fblobs%2ftoc.json
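As a preview of how this storage account will be used from Databricks, here is a minimal sketch of mounting a blob container in a Python notebook; the container name, mount point, and key placeholder are illustrative assumptions, not part of the provisioning step itself.

```python
# Minimal sketch: mount a blob container in Databricks (run in a Python notebook, where
# dbutils is available). The container name "gws" and the mount point are assumptions.
dbutils.fs.mount(
  source = "wasbs://gws@<storage-account-name>.blob.core.windows.net/",
  mount_point = "/mnt/gws-blob",
  extra_configs = {
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net": "<storage-account-key>"
  }
)
```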
(i) Provision an Azure Data Lake Store Gen1 account in #1
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-get-started-portal#create-a-data-lake-storage-gen1-account
(ii) Create a service principal via app registration
This step is crucial for ADLS Gen1, ADLS Gen2, and Azure Key Vault.
https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal
Capture the Application ID, also referred to as the Client ID.
(iii) Go to the app registration/service principal, click on Settings, and create a key (access token). Capture the value - it is exposed only temporarily; we will refer to this as the access token.
(iv) Create a new directory in ADLS Gen1 called gwsroot from the portal
(v) Grant the service principal (SPN) from (ii) super-user access to the gwsroot directory and its child items.
We will mount the ADLS Gen 1 in Databricks with this SPN's credentials.
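For reference, a minimal sketch of that mount from a Databricks Python notebook follows; the tenant/directory ID placeholder, account name, and mount point are assumptions.

```python
# Minimal sketch: mount the gwsroot directory in ADLS Gen1 using the SPN credentials
# captured above. All angle-bracket placeholders are assumptions.
configs = {
  "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
  "dfs.adls.oauth2.client.id": "<application-id>",
  "dfs.adls.oauth2.credential": "<access-token>",
  "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
  source = "adl://<adls-gen1-account-name>.azuredatalakestore.net/gwsroot",
  mount_point = "/mnt/gwsroot",
  extra_configs = configs
)
```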
Provision an Azure Data Lake Store Gen2 account in #1 - with the hierarchical namespace (HNS) enabled.
https://docs.microsoft.com/en-us/azure/storage/data-lake-storage/quickstart-create-account
Grant the same service principal the Storage Blob Data Contributor role (RBAC) on the ADLS Gen2 account.
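As with Gen1, here is a minimal sketch of mounting ADLS Gen2 from a Databricks Python notebook using the same SPN; the file system name "gwsroot", the mount point, and the placeholders are assumptions.

```python
# Minimal sketch: mount an ADLS Gen2 file system (abfss) with the same SPN credentials.
# The file system name "gwsroot" and all angle-bracket placeholders are assumptions.
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-id>",
  "fs.azure.account.oauth2.client.secret": "<access-token>",
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
  source = "abfss://gwsroot@<adls-gen2-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/gwsroot-gen2",
  extra_configs = configs
)
```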
Provision an Azure Databricks premium tier workspace in the VNet we created in #2, in the same resource group as #1, and in the same region as #1. Then peer the Databricks service-provisioned VNet with the VNet from #2 for access if you are trying out Kafka; otherwise, this is not required. We will discuss VNet injection in the classroom.
Provision a workspace & cluster
Peering networks
The peering aspect is specific to Kafka and assumes no VNet injection with Databricks. It is NOT required unless working with Kafka.
Provision an Azure Event Hub (standard tier, three throughput units) in #1, along with a consumer group. Set up SAS policies for access, and capture the credentials required for access from Spark.
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-create
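A minimal sketch of consuming the Event Hub from Spark Structured Streaming follows, assuming the azure-event-hubs-spark connector library is attached to the Databricks cluster; the namespace, policy, and consumer group names are placeholders.

```python
# Minimal sketch: read the Event Hub as a streaming source. Assumes the
# azure-event-hubs-spark connector is installed on the cluster; placeholders are assumptions.
conn_str = (
  "Endpoint=sb://<eventhub-namespace>.servicebus.windows.net/;"
  "SharedAccessKeyName=<sas-policy-name>;"
  "SharedAccessKey=<sas-policy-key>;"
  "EntityPath=<event-hub-name>"
)

eh_conf = {
  # Recent connector versions expect the connection string to be encrypted
  "eventhubs.connectionString":
    sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str),
  "eventhubs.consumerGroup": "<consumer-group>"
}

stream_df = spark.readStream.format("eventhubs").options(**eh_conf).load()
```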
Provision an HDInsight Kafka cluster in the resource group from #1 and the VNet from #2. Enable Kafka to advertise private IPs, and configure it to listen on all network interfaces.
Provisioning: https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-get-started#create-an-apache-kafka-cluster
Broadcast IP addresses, configure listener: https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-connect-vpn-gateway#configure-kafka-for-ip-advertising
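With IP advertising enabled and peering in place, a minimal sketch of reading a Kafka topic from Spark Structured Streaming looks like the following; the broker private IPs and topic name are placeholders.

```python
# Minimal sketch: read a Kafka topic from Spark Structured Streaming. Broker private IPs
# and topic name are assumptions; assumes peering between the Databricks and Kafka VNets.
kafka_df = (spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "<broker-private-ip-1>:9092,<broker-private-ip-2>:9092")
  .option("subscribe", "<topic-name>")
  .option("startingOffsets", "earliest")
  .load())

# Kafka keys/values arrive as bytes; cast to strings before parsing
messages_df = kafka_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```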
Provision a logical SQL database server in #1, and an Azure SQL Database (Standard S0, 10 DTUs) within the server. Configure the firewall to allow access from our machine, and also enable access from the Databricks VNet. Capture the credentials for access from Databricks.
https://docs.microsoft.com/en-us/azure/sql-database/sql-database-get-started-portal#create-a-sql-database
We will work on the firewall aspect in the classroom.
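A minimal sketch of reading from this database over JDBC from Databricks follows; the server, database, table, and credential placeholders are assumptions.

```python
# Minimal sketch: read an Azure SQL Database table over JDBC from Databricks.
# All angle-bracket placeholders are assumptions.
jdbc_url = (
  "jdbc:sqlserver://<server-name>.database.windows.net:1433;"
  "database=<database-name>;encrypt=true;loginTimeout=30;"
)

connection_properties = {
  "user": "<sql-user>",
  "password": "<sql-password>",
  "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

sql_df = spark.read.jdbc(url=jdbc_url, table="dbo.<table-name>", properties=connection_properties)
```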
Provision an Azure SQL Data Warehouse (Gen1, 100 DWU) in #1, on the same database server created in #8.
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/create-data-warehouse-portal#create-a-data-warehouse
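A minimal sketch of reading from the warehouse with the Databricks "sqldw" connector (which stages data through a Blob Storage container) follows; the tempDir container, table name, and credentials are assumptions.

```python
# Minimal sketch: read a SQL Data Warehouse table via the Databricks sqldw connector.
# The connector stages data through Blob Storage, so the staging account key must be
# available to the Spark session. All angle-bracket placeholders are assumptions.
spark.conf.set(
  "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
  "<storage-account-key>"
)

dw_df = (spark.read
  .format("com.databricks.spark.sqldw")
  .option("url", "jdbc:sqlserver://<server-name>.database.windows.net:1433;"
                 "database=<dw-name>;user=<sql-user>;password=<sql-password>;encrypt=true")
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("tempDir", "wasbs://<container>@<storage-account-name>.blob.core.windows.net/tmp")
  .option("dbTable", "dbo.<table-name>")
  .load())
```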
Provision a Cosmos DB account in #1, with a database and three collections - one for batch load and two for streaming. Provisioning specs are below.
Database: gws_db (don't provision throughput at the database level)
Within gws_db, the following collections:
- Name: chicago_crimes_curated_batch; Partition key: /case_id; Throughput: 1000 RU/s
- Name: chicago_crimes_curated_stream; Partition key: /case_id; Throughput: 1000 RU/s
- Name: chicago_crimes_curated_stream_aggr; Partition key: /case_type; Throughput: 1000 RU/s
Complete steps 1, 2, and 3 from the link - https://docs.microsoft.com/en-us/azure/cosmos-db/sql-api-get-started
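For reference, a minimal sketch of writing a curated batch DataFrame to the batch collection follows, assuming the azure-cosmosdb-spark connector library is attached to the Databricks cluster; the account URI/key placeholders and the DataFrame name are assumptions.

```python
# Minimal sketch: write a DataFrame to the chicago_crimes_curated_batch collection using
# the azure-cosmosdb-spark connector (must be installed on the cluster). The endpoint,
# key, and the curated_batch_df DataFrame are assumptions.
write_config = {
  "Endpoint": "https://<cosmosdb-account-name>.documents.azure.com:443/",
  "Masterkey": "<cosmosdb-account-key>",
  "Database": "gws_db",
  "Collection": "chicago_crimes_curated_batch",
  "Upsert": "true"
}

(curated_batch_df.write
  .format("com.microsoft.azure.cosmosdb.spark")
  .options(**write_config)
  .save())
```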
Provision an Azure Data Factory v2 instance in #1.
Complete steps 1-9 from the link - https://docs.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-portal#create-a-data-factory
Provision an Azure Key Vault instance in #1.
https://docs.microsoft.com/en-us/azure/key-vault/quick-create-portal
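Once a Key Vault-backed secret scope is created in Databricks (via the Databricks UI or CLI), credentials can be retrieved in notebooks without hard-coding them; a minimal sketch follows, where the scope name "gws-kv" and the secret names are assumptions.

```python
# Minimal sketch: retrieve secrets from a Key Vault-backed Databricks secret scope.
# The scope name "gws-kv" and the secret keys are assumptions; the scope itself is
# created separately after the Key Vault is provisioned.
sql_password = dbutils.secrets.get(scope="gws-kv", key="sql-password")
adls_access_token = dbutils.secrets.get(scope="gws-kv", key="spn-access-token")
```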
Provision a Linux VM (Ubuntu Server 18.10) on Azure from the portal.
We will install the Databricks CLI on this VM, due to the ephemeral nature of the Azure Cloud Shell.