Friday, December 27, 2019

Data Platform Tips 30 - Integrate Azure Data Lake Storage Gen2 with Azure Databricks using Spark

In this post, we will look at how to connect to Azure Data Lake Storage Gen2 from Azure Databricks and access its data using Spark.

Pre-requisites

  1. Go to the Research and Innovative Technology Administration, Bureau of Transportation Statistics website.
  2. Select the Prezipped File check box to select all data fields.
  3. Select the Download button and save the results to your computer.
  4. Rename the .csv file to "On_Time.csv" and use Azure Storage Explorer to upload it to the Azure Data Lake Storage Gen2 account you provisioned in the Azure Portal.
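
Once the file is uploaded, Spark addresses it through an `abfss://` URI. The snippet below is a minimal sketch of how that URI is composed. The container name `data` and the storage account name `mystorageaccount` are hypothetical placeholders, since this post does not name them.

```python
def abfss_uri(container: str, storage_account: str, path: str) -> str:
    """Build the abfss:// URI that Spark uses for a file in ADLS Gen2."""
    # Strip any leading slash so the path joins cleanly after the host name.
    clean_path = path.lstrip("/")
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{clean_path}"


# Hypothetical names for illustration; substitute your own container and account.
on_time_path = abfss_uri("data", "mystorageaccount", "On_Time.csv")
print(on_time_path)
```

The same URI form applies to folders, so a whole directory of CSV files could be addressed by passing the folder path instead of a single file name.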

Access Azure Data Lake Storage Gen2 data using Azure Databricks via Azure Portal

a) Log on to the Azure Portal.

b) Provision an Azure Databricks service.

c) Once it is provisioned, click "Launch Workspace" from the Azure Databricks service.

d) Create a new cluster named "ADLSDatabricksDemo" and provide the required configuration for the cluster.

e) Create a new notebook to run some Spark queries.

f) Also create an App Registration in your Azure Active Directory, grant it the required access to your Azure Data Lake Storage Gen2 account, and generate a client secret. [https://docs.microsoft.com/azure/active-directory/develop/howto-create-service-principal-portal]
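
The application (client) ID, tenant ID, and client secret from the App Registration are what Spark needs to authenticate against the storage account over OAuth. The sketch below builds the Hadoop ABFS configuration keys for a service principal and applies them to a Spark session. The helper names `adls_oauth_conf` and `apply_conf`, the storage account name, and the secret scope and key are all hypothetical and are not taken from this post.

```python
def adls_oauth_conf(storage_account, client_id, client_secret, tenant_id):
    """Return the Spark settings for service-principal (OAuth) access to ADLS Gen2."""
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def apply_conf(spark, conf):
    """Apply each setting to the given Spark session."""
    for key, value in conf.items():
        spark.conf.set(key, value)


# In a Databricks notebook, where `spark` and `dbutils` are predefined
# (the scope, key, and account names below are hypothetical):
# secret = dbutils.secrets.get(scope="demo-scope", key="adls-client-secret")
# apply_conf(spark, adls_oauth_conf("mystorageaccount", "<application-id>", secret, "<tenant-id>"))
```

Reading the client secret from a Databricks secret scope, rather than pasting it into the notebook, keeps it out of the notebook's revision history.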

g) Finally, run Spark queries in the newly created Azure Databricks notebook to query the data loaded into Azure Data Lake Storage Gen2.
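
As an illustration of what such queries might look like, here is a minimal PySpark sketch. It is not the original notebook code. It assumes the Spark session is already configured to authenticate against the storage account, that the CSV has a header row, and that the file sits at the root of the container. The helper names and the `on_time` view name are hypothetical.

```python
def load_on_time(spark, csv_path, view_name="on_time"):
    """Read the CSV into a DataFrame and register it as a temporary view."""
    df = (
        spark.read
        .option("header", "true")       # first row holds the column names
        .option("inferSchema", "true")  # let Spark guess the column types
        .csv(csv_path)
    )
    df.createOrReplaceTempView(view_name)
    return df


def count_rows_sql(view_name="on_time"):
    """Return a SQL statement that counts the rows in the view."""
    return f"SELECT COUNT(*) AS row_count FROM {view_name}"


# In a Databricks notebook, where `spark` is predefined (placeholders are yours to fill in):
# df = load_on_time(spark, "abfss://<container>@<storage-account>.dfs.core.windows.net/On_Time.csv")
# df.printSchema()                    # inspect the inferred columns
# spark.sql(count_rows_sql()).show()  # run a SQL query against the view
```

Registering a temporary view lets the same data be queried with either the DataFrame API or SQL cells in the notebook.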
