A data lake is a repository in which vast amounts of raw data are stored in their native format, unlike a data warehouse, which stores data in a hierarchical structure of files or folders. Data lakes provide effectively unlimited storage space, no fixed limit on file size and a number of different ways to access data, as well as the tools necessary for analysing, querying and processing it. In a data lake each data item is assigned a unique identifier and a set of metadata tags. The data lake can then be queried for relevant data, and that smaller set of relevant data can be analysed. Data can also be held in a data lake before being curated and moved to a data warehouse.
Examples of the types of data that can be stored in a data lake include:
· Data generated by humans (e.g. blogs, emails, Tweets)
· Data generated by machines (e.g. log files, Internet of Things, sensor readings)
· Operational data (e.g. ticketing, inventory, sales)
· Images, audio and video
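The identifier-and-tags model described above can be illustrated with a small in-memory sketch. It is only an illustration: the class TinyDataLake, its put and query methods and the tag names source and kind are all hypothetical, and a real data lake would persist the data in distributed storage rather than a Python dictionary.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeItem:
    """One raw object in the lake, kept in its native format."""
    item_id: str
    payload: bytes
    tags: dict = field(default_factory=dict)


class TinyDataLake:
    """Hypothetical in-memory stand-in for a data lake store."""

    def __init__(self):
        self._items = {}  # flat namespace: item_id -> LakeItem

    def put(self, payload, **tags):
        """Store raw bytes and return the unique identifier assigned to them."""
        item_id = str(uuid.uuid4())
        self._items[item_id] = LakeItem(item_id, payload, dict(tags))
        return item_id

    def query(self, **wanted):
        """Return every item whose tags match all of the requested values."""
        return [
            item for item in self._items.values()
            if all(item.tags.get(k) == v for k, v in wanted.items())
        ]


lake = TinyDataLake()
lake.put(b"2018-11-16 12:00:01 GET /index.html 200", source="machine", kind="log")
lake.put(b"temperature=21.4", source="machine", kind="sensor")
lake.put(b"Great service today!", source="human", kind="tweet")

# Query by tag to obtain the smaller, relevant subset for analysis.
machine_items = lake.query(source="machine")
sensor_items = lake.query(source="machine", kind="sensor")
print(len(machine_items), len(sensor_items))
```

Nothing is transformed on the way in; the tags alone are what make the raw items findable later.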
Prior to the development of Hadoop, a set of open source programs and procedures which can be used in big data operations, it was only possible for resource-rich companies such as Google and Facebook to reap the benefits of data lakes. However, with the emergence of Hadoop, data lakes became much more accessible to a wide range of organisations which could now store and process their big data.
Data lakes are used to provide large amounts of detailed source data which is then used for a variety of data analytics including mining, graphing, clustering and statistics. The outputs from data analytics include churn models, estimates, visualisations and identification of customer segments, all of which can be valuable to businesses and organisations.
Azure Data Lake overview
Azure Data Lake is compatible with the Hadoop Distributed File System (HDFS), which enables Microsoft services such as Azure HDInsight and Revolution R Enterprise, as well as industry Hadoop distributions such as Hortonworks and Cloudera, to connect to it. Azure Data Lake supports Azure Active Directory features including Multi-Factor Authentication, conditional access, role-based access control, application usage monitoring, security monitoring and alerting.
Azure Data Lake has no fixed limits to how much data can be stored in a single account. It can also store very large files with no fixed limits to size. This means that Azure Data Lake can support massively parallel queries so that Hadoop and advanced analytics can be run on all the data in the data lake. Furthermore, Azure Data Lake can handle high volumes of small writes at low latency which means it is ideal for scenarios such as Internet of Things (IoT), website analytics and analytics from sensors, among others.
Azure Data Lake includes three core services:
· Azure Data Lake Store
· Azure Data Lake Analytics
· Azure HDInsight
Azure Data Lake Store
Azure Data Lake Store is a fully distributed, scalable and cost-effective solution for big data analytics, allowing processing and analytics to be carried out across platforms and languages with data of any shape, size and speed. It keeps data separate from compute and allows access to data whether or not a cluster is running. Multiple clusters can access the same storage, so data can be easily shared.
It integrates with other Azure data services, including Azure Databricks and Azure Data Factory and also works with existing IT investments for identity, management and security, thus enabling organisations to easily build end-to-end big data and advanced analytics solutions. Azure Data Lake Store has high levels of security including encryption of data at rest and storage account firewalls. It also uses Azure Active Directory for authentication and Access Control Lists to manage access to data held in the data lake.
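To make the idea of an Access Control List concrete, here is a deliberately simplified sketch of a permission check. It assumes POSIX-style read/write/execute permission strings; the AccessControlList class, the is_allowed function and the precedence rules are hypothetical and are not the service's actual evaluation algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class AccessControlList:
    """Hypothetical ACL: maps a user or group name to a permission string."""
    user_entries: dict = field(default_factory=dict)   # e.g. {"alice": "rwx"}
    group_entries: dict = field(default_factory=dict)  # e.g. {"analysts": "r-x"}
    other: str = "---"                                 # everyone else


def is_allowed(acl, user, groups, needed):
    """Return True if `user` holds the `needed` permission ('r', 'w' or 'x')."""
    if user in acl.user_entries:
        # In this sketch a user-specific entry takes precedence over groups.
        return needed in acl.user_entries[user]
    matching = [acl.group_entries[g] for g in groups if g in acl.group_entries]
    if matching:
        # Any one matching group entry is enough to grant access.
        return any(needed in perms for perms in matching)
    return needed in acl.other


acl = AccessControlList(
    user_entries={"alice": "rwx"},
    group_entries={"analysts": "r-x"},
)
print(is_allowed(acl, "alice", [], "w"))
print(is_allowed(acl, "bob", ["analysts"], "r"))
print(is_allowed(acl, "bob", ["analysts"], "w"))
```

In the real service the identities would come from Azure Active Directory rather than plain strings.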
You can carry out the following operations using Azure Data Lake Store’s available languages and interfaces:
• Account Management Operations – Azure PowerShell, .NET SDK, REST API, Python.
• Filesystem Operations – Azure PowerShell, Java SDK, .NET SDK, REST API, Python.
• Load and move data – Azure PowerShell, Azure Data Factory, AdlCopy (Storage Blob to Data Lake Store), DistCp (HDInsight cluster storage), Sqoop (Azure SQL Database), Azure Import/Export Service (for large offline files), SSIS (using the Azure Feature Pack).
The diagram below shows some of the options possible with Data Lake Store.
The advantages of Azure Data Lake Storage over Azure Storage Blobs include:
- Optimised for parallel processing
- No file size or storage limits
- Security is integrated with Azure Active Directory.
Azure Data Lake Analytics
Azure Data Lake Analytics is a cloud-based, distributed data processing architecture built on YARN, as is the Hadoop platform. It allows processing of very large data sets, integration with existing data warehousing and parallel processing of both structured and unstructured data. Data Lake Analytics works with Azure Data Lake Store, Azure Storage blobs, Azure SQL Database and Azure SQL Data Warehouse. Microsoft offers Azure Data Lake Analytics only as a platform service, which means that you don’t have to provision or maintain clusters and you don’t have to manage security separately.
Azure Data Lake Analytics can be used for the following, among others:
- Processing data scraped from websites.
- Preparing data to be inserted into a data warehouse.
- Processing unstructured imaging data.
Azure Data Lake Analytics uses the U-SQL query language. U-SQL allows you to efficiently analyse data in the store as well as in relational stores, such as Azure SQL Database, and it works with any kind of data, whether structured or unstructured. For example, it can handle:
- Operations over a set of files with patterns.
- Using Partitioned Tables.
- Federated Queries against Azure SQL DB.
- Encapsulating your U-SQL code with Views, Table-Valued Functions and Procedures.
- SQL Windowing Functions.
- Programming with C# User-defined Operators (custom extractors, processors).
- Complex Types (MAP, ARRAY).
- Using U-SQL in data processing pipelines.
- U-SQL in a lambda architecture for IoT analytics.
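As an illustration of the first item in the list, operating over a set of files selected by a pattern, the following Python sketch imitates the idea with a regular expression. U-SQL has its own syntax for file sets, which is not shown here; the folder layout, the PATTERN expression and the match_file_set helper are all hypothetical.

```python
import re

# Hypothetical layout: year/month folders with one CSV file per region.
PATTERN = re.compile(
    r"^/logs/(?P<year>\d{4})/(?P<month>\d{2})/(?P<region>[a-z]+)\.csv$"
)


def match_file_set(paths, **filters):
    """Return (path, captured_fields) for paths matching PATTERN and the filters."""
    selected = []
    for path in paths:
        m = PATTERN.match(path)
        if m is None:
            continue  # not part of the file set at all
        fields = m.groupdict()
        # The captured path parts act like extra columns that can be filtered on.
        if all(fields.get(k) == v for k, v in filters.items()):
            selected.append((path, fields))
    return selected


paths = [
    "/logs/2018/10/emea.csv",
    "/logs/2018/11/emea.csv",
    "/logs/2018/11/apac.csv",
    "/logs/readme.txt",
]
november = match_file_set(paths, year="2018", month="11")
print([path for path, _ in november])
```

Filtering on the captured parts of the path means only the files that match are read, rather than the whole folder tree.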
Data Lake Analytics can also be a cost-effective option, as you pay on a per-job basis only when data is being processed. You can choose a pay-as-you-go plan or a monthly pre-pay plan. For regular usage, a monthly plan is usually the more cost-effective of the two.
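Whether the monthly plan really is cheaper depends on how much processing you do. The sketch below compares the two plans for a given month, assuming usage is metered in hours of compute (called au_hours here). Every rate in it is invented for illustration and is not an Azure price; the compare_plans function is likewise hypothetical.

```python
def compare_plans(au_hours, payg_rate, monthly_fee, included_au_hours, overage_rate):
    """Return (cheaper_plan_name, payg_cost, monthly_cost) for one month of usage."""
    payg_cost = au_hours * payg_rate
    # Hours beyond the pre-paid allowance are charged at the overage rate.
    extra_hours = max(0, au_hours - included_au_hours)
    monthly_cost = monthly_fee + extra_hours * overage_rate
    cheaper = "pay-as-you-go" if payg_cost < monthly_cost else "monthly"
    return cheaper, payg_cost, monthly_cost


# All rates below are invented for illustration; they are not Azure prices.
RATES = dict(payg_rate=2.0, monthly_fee=100.0, included_au_hours=100, overage_rate=1.5)

occasional = compare_plans(20, **RATES)
regular = compare_plans(80, **RATES)
print(occasional)
print(regular)
```

With these made-up numbers, 20 hours a month is cheaper on pay-as-you-go, while 80 hours a month is cheaper on the monthly plan.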
Setting up a Data Lake Analytics operation involves the following steps:
- Create a Data Lake Analytics account
- Prepare the source data. You need to have either an Azure Data Lake Store account or Azure Blob storage account.
- Develop a U-SQL script.
- Submit a job (U-SQL script) to your Data Lake Analytics account. The job reads from the source data, processes the data as instructed in the U-SQL script, and then saves the output to either a Data Lake Store or Blob storage account.
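The read, process and save flow of a job can be sketched in a few lines of Python. This is only an analogy for what a U-SQL script does, run locally on an in-memory string instead of a Data Lake Store or Blob storage account; the run_job function and the column names region and duration are hypothetical.

```python
import csv
import io
from collections import defaultdict


def run_job(source_text):
    """Read CSV rows, total the duration per region and return the result as CSV text."""
    # Step 1: read the rows from the source data.
    reader = csv.DictReader(io.StringIO(source_text))

    # Step 2: process the rows as the script instructs (here: a sum per region).
    totals = defaultdict(int)
    for row in reader:
        totals[row["region"]] += int(row["duration"])

    # Step 3: save the output (here: to a string rather than a storage account).
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["region", "total_duration"])
    for region in sorted(totals):
        writer.writerow([region, totals[region]])
    return out.getvalue()


source = "region,duration\nen-gb,30\nen-us,45\nen-gb,12\n"
result = run_job(source)
print(result)
```

In the real service the three steps are declared in the U-SQL script itself, and the platform decides how to parallelise them.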
Azure HDInsight
Azure HDInsight is a Hadoop service hosted in Azure that provides clusters of managed Hadoop instances, delivering Hadoop on top of the Azure platform. Azure HDInsight provides a software framework which is designed to manage, analyse and report on big data. You can create multiple clusters to meet the needs of different jobs, and each can be scaled up and down as needed.
Azure HDInsight has four main types of workload: ETL/ELT, which uses a Hadoop cluster; Internet of Things or data in motion, which uses a Storm cluster; transactional processing, which uses HBase; and data science or data analytics, which uses a Spark or R Server with Spark cluster type.
Azure HDInsight offers guaranteed high availability at large scale, with an SLA of 99.9%. HDInsight monitors the health of your big data applications and recovers automatically from failures. You can also pick from more than 30 popular Hadoop and Spark applications, which HDInsight then deploys to the cluster. Alternatively, you can build Hadoop or Spark applications using development tools such as Visual Studio, Eclipse or IntelliJ; notebooks such as Jupyter or Zeppelin; languages including Scala, Python, R or C#; and frameworks such as Java or .NET.
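To put the 99.9% figure in perspective, the following sketch converts an availability percentage into the downtime it permits. The allowed_downtime_minutes function and the 30-day measurement window are assumptions made for illustration; the SLA's own terms define how availability is actually measured.

```python
def allowed_downtime_minutes(sla_percent, days=30):
    """Return the downtime, in minutes, that an availability SLA permits over `days`."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


monthly = allowed_downtime_minutes(99.9)
print(f"{monthly:.1f} minutes per 30 days")
```

Under these assumptions, 99.9% availability corresponds to roughly 43 minutes of permitted downtime in a 30-day month.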
HDInsight also integrates with other Azure services such as Data Factory and Data Lake Storage, which enables you to build comprehensive analytics pipelines. Furthermore, HDInsight can enable you to easily meet compliance standards as it includes encryption and integration with Azure Active Directory.
You can use Azure HDInsight to:
- Create big data solutions and services which are powered by Hadoop.
- Monitor and manage Hadoop clusters.
- Provide report statistics on the availability and use of big data.
Version 1 - 16 Nov 2018