Friday, January 31, 2020

Data Platform Tips 65 - Apache Sqoop with Hadoop in Azure HDInsight

Apache Sqoop in HDInsight can be used to import and export data between an HDInsight cluster and an Azure SQL Database.

While Apache Hadoop is mainly used for processing unstructured and semi-structured data, such as logs and files, there may also be a need to process structured data that is stored in relational databases.

Apache Sqoop is a tool used to transfer data between Hadoop clusters and relational databases. It can be used to import data from a relational database management system (RDBMS) such as SQL Server, MySQL, or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop with MapReduce or Apache Hive, and then export the data back into an RDBMS.
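
A typical round trip looks like the following sketch. The JDBC connection string, credentials, table name, and HDFS directory are placeholder values; adjust them for your environment.

    # Import a table from Azure SQL Database into HDFS (placeholder values)
    sqoop import \
      --connect "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb" \
      --username sqluser --password 'MyPassword!' \
      --table log4jlogs \
      --target-dir /tutorials/usesqoop/data -m 1

    # Export the (transformed) HDFS data back into a table
    sqoop export \
      --connect "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb" \
      --username sqluser --password 'MyPassword!' \
      --table log4jlogs \
      --export-dir /tutorials/usesqoop/data -m 1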

Things to consider

  • Both Hadoop and SQL Server must be on the same virtual network.
  • SQL Server must be configured to allow SQL Authentication and remote connections.

Limitations

  • Bulk export - With Linux-based HDInsight, the Sqoop connector used to export data to Microsoft SQL Server or Azure SQL Database doesn't support bulk inserts.
  • Batching - With Linux-based HDInsight, when using the -batch switch to perform inserts, Sqoop performs multiple insert statements instead of batching the insert operations.

Thursday, January 30, 2020

Data Platform Tips 64 - Recommended node configuration and VM sizes for HDInsight Clusters

The default and minimum recommended node configuration and virtual machine sizes for Azure HDInsight Clusters are listed below.

Size | vCPU | Memory (GiB) | Temp storage (SSD, GiB) | Max temp storage throughput (IOPS / Read MBps / Write MBps) | Max data disks / throughput (IOPS) | Max NICs / Expected network bandwidth (Mbps)
Standard_D3_v2 | 4 | 14 | 200 | 12000 / 187 / 93 | 16 / 16x500 | 4 / 3000
Standard_D4_v2 | 8 | 28 | 400 | 24000 / 375 / 187 | 32 / 32x500 | 8 / 6000
Standard_D5_v2 | 16 | 56 | 800 | 48000 / 750 / 375 | 64 / 64x500 | 8 / 12000
Standard_D12_v2 | 4 | 28 | 200 | 12000 / 187 / 93 | 16 / 16x500 | 4 / 3000
Standard_D13_v2 | 8 | 56 | 400 | 24000 / 375 / 187 | 32 / 32x500 | 8 / 6000
Standard_D14_v2 | 16 | 112 | 800 | 48000 / 750 / 375 | 64 / 64x500 | 8 / 12000
Standard_A1_v2 | 1 | 2 | 10 | 1000 / 20 / 10 | 2 / 2x500 | 2 / 250
Standard_A2_v2 | 2 | 4 | 20 | 2000 / 40 / 20 | 4 / 4x500 | 2 / 500
Standard_A4_v2 | 4 | 8 | 40 | 4000 / 80 / 40 | 8 / 8x500 | 4 / 1000

If you need more than 32 worker nodes in a cluster, select a head node size with at least 8 cores and 14 GB of RAM.

The only cluster types that have data disks are Kafka and HBase clusters with the Accelerated Writes feature enabled.

Wednesday, January 29, 2020

Data Platform Tips 63 - Azure HDInsight and VM Sizes

Virtual machines in Azure are optimized to suit different use cases. The VM size and type is determined by CPU processing power, RAM size, and network latency. Azure publishes separate lists of supported VM sizes for Linux and for Windows machines.

Type | Sizes | Description
Entry-level | A, Av2 | CPU performance and memory configurations best suited for entry-level workloads such as development and test. Economical, and provide a low-cost option to get started with Azure.
General purpose | D, DSv2, Dv2 | Balanced CPU-to-memory ratio. Ideal for testing and development, small to medium databases, and low to medium traffic web servers.
Compute optimized | F | High CPU-to-memory ratio. Good for medium traffic web servers, network appliances, batch processes, and application servers.
Memory optimized | Esv3, Ev3 | High memory-to-CPU ratio. Great for relational database servers, medium to large caches, and in-memory analytics.

Tuesday, January 28, 2020

Data Platform Tips 62 - Comparing Azure HDInsight cluster Storage options

When you plan to provision storage for HDInsight clusters, you can select from three storage options:

a) Azure Storage
b) Azure Data Lake Storage Gen2
c) Azure Data Lake Storage Gen1

Storage service | Account type | Namespace type | Supported services | Supported performance tiers | Supported access tiers | HDInsight version | Cluster type
Azure Data Lake Storage Gen2 | General-purpose V2 | Hierarchical (filesystem) | Blob | Standard | Hot, Cool, Archive | 3.6+ | All
Azure Storage | General-purpose V2 | Object | Blob | Standard | Hot, Cool, Archive | 3.6+ | All
Azure Storage | General-purpose V1 | Object | Blob | Standard | N/A | All | All
Azure Storage | Blob Storage** | Object | Block Blob | Standard | Hot, Cool, Archive | All | All
Azure Data Lake Storage Gen1 | N/A | Hierarchical (filesystem) | N/A | N/A | N/A | 3.6 only | All except HBase
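
When referencing data from the cluster, each storage option uses its own URI scheme. As an illustrative sketch (the container, filesystem, and account names are placeholders):

    # Azure Storage (WASB driver)
    wasbs://<container>@<account>.blob.core.windows.net/<path>

    # Azure Data Lake Storage Gen2 (ABFS driver)
    abfss://<filesystem>@<account>.dfs.core.windows.net/<path>

    # Azure Data Lake Storage Gen1
    adl://<account>.azuredatalakestore.net/<path>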

Monday, January 27, 2020

Data Platform Tips 61 - Create a SAS to download file from Azure Storage

In this post, let us look at how to provision an Azure Storage service and a blob container, and create a SAS to download a file from Azure Storage using the Azure Portal.

a) Log on to the Azure Portal.

b) Search for "Azure Storage" in the Azure Marketplace.

c) Create a Storage account as shown below.

d) Once provisioned, you can see the Azure Storage account service created.

e) Create a new container named "democontainer" and upload a file.

f) Right-click the file, select "Generate SAS", provide the necessary attributes, and click on "Generate SAS token and URL".

g) Now copy the SAS URL and paste it into a new browser window, and you will be able to download the file from Azure Storage. Note: The SAS URL is valid only for the duration you specified in the Expiry attribute when you generated the SAS.
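
The same SAS can also be generated from the command line with the Azure CLI. A minimal sketch, assuming placeholder account, key, container, and blob values:

    az storage blob generate-sas \
      --account-name mystorageaccount \
      --account-key "<storage-account-key>" \
      --container-name democontainer \
      --name myfile.txt \
      --permissions r \
      --expiry 2020-02-01T00:00Z \
      --https-only \
      --full-uri

The --full-uri switch returns the complete blob URL with the SAS token appended, ready to paste into a browser.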


Sunday, January 26, 2020

Data Platform Tips 60 - Shared Access Signatures (SAS) in Azure Storage

Access to resources in Azure Storage can be granted to your end users and customers using Shared Access Signatures (SAS). A SAS is a signed URI that includes a token containing a set of query parameters that grants access to the resource.

Azure Storage supports three types of Shared Access Signatures.

User Delegation SAS - It is secured by Azure Active Directory credentials along with the permissions specified on the SAS. This is applicable to Blob storage only.

Service SAS - It is secured with the storage account key and delegates access to a resource in exactly one of the storage services - Blob, Queue, Table, or Azure Files.

Account SAS - It is also secured with the storage account key and delegates access to resources in one or more of the storage services - Blob, Queue, Table, and Azure Files.

A SAS can take one of two forms:

Ad-hoc SAS - When you create an ad hoc SAS, the start time, expiry time, and permissions for the SAS are all specified in the SAS URI.

SAS with a stored access policy - A stored access policy can be created on a blob container, queue, table, or file share, and can be used to manage constraints for one or more service Shared Access Signatures.

You can create a SAS via the Azure Portal, the Azure CLI, or Azure Storage Explorer.
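
For example, an ad-hoc account SAS can be created with the Azure CLI roughly as follows (the account name and expiry are placeholders; --services b scopes the token to the Blob service, and --resource-types co to container and object resources):

    az storage account generate-sas \
      --account-name mystorageaccount \
      --services b \
      --resource-types co \
      --permissions rl \
      --expiry 2020-02-01T00:00Z \
      --https-only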

Saturday, January 25, 2020

Data Platform Tips 59 - Azure Storage Access Tiers

Azure Storage general-purpose v2 accounts and legacy BlobStorage accounts provide multiple access tiers, which allow you to store data in a cost-effective way. The access tiers include:

a) Hot - Holds data that is accessed frequently. The hot access tier has higher storage costs than the cool and archive tiers, but the lowest access costs. Can be set at the account level.

b) Cool - Holds data that is infrequently accessed and stored for at least 30 days. The cool access tier has lower storage costs and higher access costs compared to the hot tier. Can be set at the account level.

c) Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days. The archive access tier has the lowest storage cost, but higher data retrieval costs compared to the hot and cool tiers. Data in the archive tier can take several hours to retrieve.

You can change the access tier using the Azure Portal, Azure Storage Explorer, or the Azure CLI.
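
For instance, a sketch of moving a single blob to the archive tier with the Azure CLI (account, container, and blob names are placeholders):

    az storage blob set-tier \
      --account-name mystorageaccount \
      --container-name democontainer \
      --name myfile.txt \
      --tier Archive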


Friday, January 24, 2020

Data Platform Tips 58 - Create Apache Kafka Cluster in Azure HDInsight

Apache Kafka is an open-source, distributed streaming platform used to build real-time pipelines. Kafka on HDInsight provides Kafka as a managed, highly scalable, and highly available service in Azure.

In this post we will look at how to provision an Azure HDInsight Kafka cluster using the Azure Portal.

a) Log on to the Azure Portal.

b) Search for HDInsight in the Azure Marketplace.

c) Select the Cluster type and in this case we are going to provision the "Kafka" Cluster.

d) Choose the storage type and in this case we are provisioning Azure Storage.

e) Once provisioned, you can see the Azure HDInsight Kafka cluster, which provides functionality similar to a publish-subscribe message queue.
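
The same cluster can also be provisioned from the Azure CLI. A rough sketch with placeholder names and passwords; note that Kafka clusters require managed data disks, hence the --workernode-data-disks-per-node parameter:

    az hdinsight create \
      --name demo-kafka-cluster \
      --resource-group demo-rg \
      --type kafka \
      --http-password 'MyPassword!2020' \
      --ssh-password 'MyPassword!2020' \
      --storage-account demostorageaccount \
      --workernode-data-disks-per-node 2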

Thursday, January 23, 2020

Data Platform Tips 57 - Create an Interactive Query Cluster in HDInsight

Interactive Query supports in-memory caching, which makes Apache Hive queries faster and much more interactive. Interactive Query makes it easy for developers and data scientists to work with big data stored in Azure Storage or Azure Data Lake Storage using BI tools.

An Interactive Query cluster differs from an Apache Hadoop cluster in that it contains only the Hive service, with in-memory caching enabled.

In this post we will look at how to provision an Azure HDInsight Interactive Query cluster using the Azure Portal.

a) Log on to the Azure Portal.

b) Search for HDInsight in the Azure Marketplace.

c) Select the Cluster type and in this case we are going to provision the "Interactive Query" Cluster.

d) Choose the storage type and in this case we are provisioning Azure Storage.

e) Once provisioned, you can see the Azure HDInsight Interactive Query cluster and can execute Apache Hive queries using the cluster.
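
From an SSH session on the cluster head node, you can connect to the interactive Hive endpoint with Beeline. A minimal sketch; the JDBC URL below is the usual in-cluster HiveServer2 Interactive endpoint, stated here as an assumption:

    # Connect to HiveServer2 Interactive from the head node
    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http'

Once connected, queries such as SHOW TABLES; run against the in-memory Hive daemons.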

Wednesday, January 22, 2020

Data Platform Tips 56 - Create Apache Spark Cluster in Azure HDInsight

In this post we will look at how to provision an Azure HDInsight Spark cluster using the Azure Portal.

a) Log on to the Azure Portal.

b) Search for HDInsight in the Azure Marketplace.

c) Select the Cluster type and in this case we are going to provision the "Spark" Cluster.

d) Choose the storage type and in this case we are provisioning Azure Storage.

e) Once provisioned, you can see the Azure HDInsight Spark cluster; go ahead and create a Jupyter Notebook by clicking on "Jupyter Notebook".
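
Inside the notebook, a few lines of PySpark are enough to verify the cluster. The sample path below refers to the HVAC sample data that HDInsight ships in the cluster's default storage; treat the path as an assumption if your cluster image differs:

    # Read the HVAC sample CSV from the cluster's default storage
    hvac = spark.read.csv('/HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv',
                          header=True, inferSchema=True)

    # Show the first few rows
    hvac.show(5)

The spark session object is pre-created by the PySpark kernel in HDInsight Jupyter notebooks, so no setup code is needed.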

Tuesday, January 21, 2020

Data Platform Tips 55 - Create Azure HDInsight Cluster using Azure Portal

In this post we will look at how to provision an Azure HDInsight Hadoop cluster using the Azure Portal.

a) Log on to the Azure Portal.

b) Search for HDInsight in the Azure Marketplace.

c) Select the Cluster type and in this case we are going to provision the "Hadoop" Cluster.

d) Choose the storage type and in this case we are using Azure Storage.

e) Once provisioned, you can see the Azure HDInsight Hadoop cluster. Similarly, you can create different types of clusters using Azure HDInsight, as sketched below.
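
The portal steps above can also be scripted. A minimal Azure CLI sketch with placeholder names and passwords; the --type parameter selects the cluster type (for example hadoop, spark, kafka, or hbase):

    az hdinsight create \
      --name demo-hadoop-cluster \
      --resource-group demo-rg \
      --type hadoop \
      --http-password 'MyPassword!2020' \
      --ssh-password 'MyPassword!2020' \
      --storage-account demostorageaccount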


Monday, January 20, 2020

Data Platform Tips 54 - Azure HDInsight

Azure HDInsight is a managed cloud service that comprises the complete spectrum of open-source analytics frameworks such as Apache Hadoop, Apache Spark, Apache Kafka, Apache Storm, Apache Hive, and R.

Components and versions available on Azure HDInsight - https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning

Azure HDInsight supports various business scenarios in Big Data Processing and Analytics like Batch processing, Real time stream processing, IoT, Data Science and other Advanced Analytics capabilities.

Azure HDInsight Ecosystem

Cluster types supported in Azure HDInsight

  • Apache Hadoop - Framework that has HDFS, YARN and MapReduce programming model
  • Apache Spark - Open source parallel and in-memory processing framework
  • Apache HBase - NoSQL database built on Hadoop
  • Apache Storm - Framework to process large volumes of streaming datasets
  • Apache Interactive Query - In-memory caching for Interactive Hive queries
  • Apache Kafka - Open-source platform for developing streaming pipelines and applications.
  • ML Services - Platform for hosting distributed R processes.

Storage for Azure HDInsight clusters

Any of the following storage services can be used with Azure HDInsight Clusters.

  • Azure Storage
  • Azure Data Lake Storage Gen2
  • Azure Data Lake Storage Gen1

Sunday, January 19, 2020

Data Platform Tips 53 - Time to Live (TTL) in Azure Cosmos DB

Azure Cosmos DB provides the ability to remove items from the container after a defined period of time by configuring the Time to Live setting. TTL can be set at container or individual item level. The time to live value is set in seconds, and calculated as a delta from the time that an item was last modified.

Configure TTL on Container


a) Log on to the Azure Portal.

b) Select the provisioned Azure Cosmos DB or provision a new one.

c) Navigate to Data Explorer, select the container as shown below, and click on "Settings" to find the TTL setting.

d) Set the TTL to 3600 seconds and click on "Save".

Configure TTL on an item


a) Log on to the Azure Portal.

b) Select the provisioned Azure Cosmos DB or provision a new one.

c) Set the TTL for the container as shown above.

d) Navigate to Data Explorer, select the container and the item for which you need to set TTL, and include "ttl": 10 in the item JSON (where 10 is the TTL value in seconds).
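
For example, an item that should expire 10 seconds after its last modification would look roughly like the following (the id and message fields are placeholder values; the per-item ttl only takes effect when Time to Live is enabled on the container):

    {
      "id": "1",
      "message": "demo item",
      "ttl": 10
    }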