Did you know ADLS Gen2 is a core element of data analytics architectures?

Vijay Borkar (VBCloudboy)
11 min read · Mar 18, 2023

Surprisingly, my learning journey toward Azure Synapse hit a speed bump: I came to know that data lakes are a core element of data analytics architectures, and that Azure Data Lake Storage Gen2 provides a scalable, secure, cloud-based solution for data lake storage.

Q. What gave birth to ADLS — Azure Data Lake Storage?

Azure Data Lake Storage (ADLS) was introduced by Microsoft in 2016 as a cloud-based repository for big data analytics workloads. It was designed to support both structured and unstructured data, and to provide scalable and reliable storage for large amounts of data.

ADLS was created to address the challenges that enterprises face when managing and analyzing large amounts of data. Traditional data storage solutions often struggle to handle the volume, variety, and velocity of big data, and can be costly and complex to manage. ADLS provides a more efficient and cost-effective solution by leveraging the scalability and flexibility of the cloud.

Q. How did Microsoft build this ADLS — Azure Data Lake Storage?

The first generation of ADLS was built as a dedicated, Hadoop-compatible file system optimized for analytics (building on Azure Blob Storage came later, with Gen2). It supports the Hadoop Distributed File System (HDFS) interface and is fully compatible with the Apache Hadoop ecosystem, allowing customers to use their existing Hadoop tools and applications.

With ADLS, customers can store and analyze petabytes of data, and take advantage of advanced analytics capabilities such as machine learning and AI. It also provides enterprise-grade security and compliance features, including data encryption and access controls.

Q. Then why bring in ADLS — Azure Data Lake Storage Gen2?

Microsoft introduced ADLS Gen2 in 2018 to provide an even more scalable, cost-effective, and feature-rich solution for big data analytics workloads. ADLS Gen2 is built on top of Azure Blob Storage, with the addition of a hierarchical namespace and Hadoop-compatible access for Azure analytics services, providing better performance, easier management, and more powerful analytics capabilities.

Q. Most important: why should I choose ADLS Gen2?

One of the main reasons to choose ADLS Gen2 for your big data storage and analytics needs is that it provides a scalable, cost-effective, and feature-rich solution, with a hierarchical namespace, Hadoop-compatible access for Azure analytics services, and improved performance and scalability. It enables you to manage and analyze large amounts of data more efficiently and effectively, and to extract valuable insights from your data with ease.

Q. Which important, top-notch features of ADLS Gen2 are you describing?

One of the most important and rich features of ADLS Gen2 is its support for a hierarchical namespace, which allows for more efficient data management and faster data processing. Data can be organized into directories and subdirectories, making it easier for users to find and access what they need. Additionally, ADLS Gen2's integration with analytics services such as Azure Synapse Analytics allows complex analytics jobs to be run over large datasets using SQL-like syntax, making it easier for data analysts and data scientists to extract insights from large amounts of data.

Q. How do we know that ADLS Gen2 supports a hierarchical namespace?

In ADLS Gen2, data is organized into a hierarchical namespace, which means that files and directories can be organized into a tree-like structure with directories and subdirectories. This allows for more efficient data management and faster data processing, as data can be accessed more quickly and easily.

For example, suppose you have a large dataset that includes data from different regions and different departments. You could organize the data into a hierarchical namespace with directories for each region, and subdirectories for each department. This would allow users to quickly and easily find and access the data they need, without having to search through a large, unorganized dataset.

Here is an example of what the hierarchical namespace for a hypothetical company might look like:

company-data
├── region1
│   ├── department1
│   │   ├── sales.csv
│   │   └── expenses.csv
│   ├── department2
│   │   ├── sales.csv
│   │   └── expenses.csv
│   └── department3
│       ├── sales.csv
│       └── expenses.csv
└── region2
    ├── department1
    │   ├── sales.csv
    │   └── expenses.csv
    ├── department2
    │   ├── sales.csv
    │   └── expenses.csv
    └── department3
        ├── sales.csv
        └── expenses.csv

In this example, the data is organized into a hierarchical namespace with a top-level directory called “company-data”, which contains subdirectories for each region, and subdirectories within each region directory for each department. Each department directory contains files with sales and expenses data. This makes it easy for users to navigate the data and find the information they need.
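As a small sketch of this idea, the snippet below generates the paths in the hypothetical tree above (all the names are illustrative, not a real storage account) and shows why directory-style prefixes make data easy to locate:

```python
from itertools import product

# Hypothetical layout mirroring the tree above.
regions = ["region1", "region2"]
departments = ["department1", "department2", "department3"]
files = ["sales.csv", "expenses.csv"]

paths = [
    f"company-data/{r}/{d}/{f}"
    for r, d, f in product(regions, departments, files)
]

# With a hierarchical namespace, "give me everything in this directory"
# is a single metadata operation on the directory, not a scan over
# every object name in the account.
dept_files = [p for p in paths if p.startswith("company-data/region1/department2/")]
print(dept_files)
```

In a flat (non-hierarchical) blob store, the same prefix filtering is possible, but renaming or securing a "directory" means touching every matching blob; with the hierarchical namespace it is one atomic operation on the directory itself.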

Q. Can you define Data Lake?

A data lake is a repository of data that is stored in its natural format, usually as blobs or files. Azure Data Lake Storage is a comprehensive, massively scalable, secure, and cost-effective data lake solution for high performance analytics built into Azure. Azure Data Lake Storage combines a file system with a storage platform to help you quickly identify insights into your data. Data Lake Storage builds on Azure Blob storage capabilities to optimize it specifically for analytics workloads.

Q. Does this integration bring any benefits at the service level?

This integration enables analytics performance, the tiering and data lifecycle management capabilities of Blob storage, and the high-availability, security, and durability capabilities of Azure Storage.

A benefit of Data Lake Storage is that you can treat the data as if it's stored in a Hadoop Distributed File System. With this feature, you can store the data in one place and access it through compute technologies including Azure Databricks, Azure HDInsight, and Azure Synapse Analytics without moving the data between environments.
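Those compute services all address the same data through the `abfss://` (Azure Blob File System, secure) URI scheme. A minimal sketch, with placeholder container and account names:

```python
def abfss_uri(container: str, account: str, path: str) -> str:
    """Build the HDFS-compatible URI for a file in an ADLS Gen2 container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"

# "company-data" and "mystorageaccount" are made-up names for illustration.
uri = abfss_uri("company-data", "mystorageaccount", "region1/department1/sales.csv")
print(uri)
```

In Spark on Databricks, HDInsight, or Synapse, a call along the lines of `spark.read.csv(uri)` can then read the file in place, with no copy between environments.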

Q. How much data can ADLS handle?

Data Lake Storage is designed to deal with this variety and volume of data at exabyte scale while securely handling hundreds of gigabytes of throughput. As a result, you can use Data Lake Storage Gen2 as the basis for both real-time and batch solutions.

Q. Awesome. If this much data can be stored, how does the data engineer take cost into account?

The data engineer can also use storage formats such as Parquet, which is highly compressed and performs well across multiple platforms thanks to its internal columnar storage.

Q. Just give me a glimpse of the major differences between Azure Blob Storage and ADLS?

Azure Data Lake Storage Gen2 builds on blob storage and optimizes I/O of high-volume data by using a hierarchical namespace that organizes blob data into directories, and stores metadata about each directory and the files within it.

Q. They say that Azure Data Lake Storage Gen2 isn't a standalone Azure service! What does that mean?

A standalone Azure service is a service in Azure that can be used independently of other Azure services, and is not dependent on any other services to function. Standalone Azure services are designed to provide specific functionalities, such as data storage, compute resources, or network services, and can be used in a variety of scenarios, depending on the needs of the user. ADLS Gen2 is built on Azure Storage services, hence it isn't a standalone Azure service.

Q. How should I plan my big data solutions?

There are four stages for processing big data solutions (ISPTMS: Ingest, Store, Prep and train, Model and serve) that are common to all architectures:

  • Ingest: capture the raw data from source systems
  • Store: persist the ingested data, for example in ADLS Gen2
  • Prep and train: prepare the data and train machine learning models
  • Model and serve: present the data to analysts and applications

Q. Okay, how about security in ADLS?

Data Lake Storage supports access control lists (ACLs) and Portable Operating System Interface (POSIX) permissions that don’t inherit the permissions of the parent directory. In fact, you can set permissions at a directory level or file level for the data stored within the data lake, providing a much more secure storage system. This security is configurable through technologies such as Hive and Spark or utilities such as Azure Storage Explorer, which runs on Windows, macOS, and Linux. All data that is stored is encrypted at rest by using either Microsoft or customer-managed keys.
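The POSIX permissions mentioned above are the classic read/write/execute triads for owner, owning group, and other, written in short form like `rwxr-x---`. A small sketch of how that short form maps to the octal mode many tools display:

```python
def posix_to_octal(perms: str) -> str:
    """Convert a POSIX short form like 'rwxr-x---' to octal like '750'."""
    assert len(perms) == 9, "expect three rwx triads"
    digits = []
    for i in range(0, 9, 3):
        triad = perms[i:i + 3]
        value = (4 if triad[0] == "r" else 0) \
              + (2 if triad[1] == "w" else 0) \
              + (1 if triad[2] == "x" else 0)
        digits.append(str(value))
    return "".join(digits)

# Owner has full access, group can read and traverse, others have none.
print(posix_to_octal("rwxr-x---"))
```

In ADLS Gen2 these triads appear in ACL entries that can be set per directory or per file, which is what enables the fine-grained, non-inherited security model described above.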

Q. Alright. Then how about performance?

Azure Data Lake Storage organizes the stored data into a hierarchy of directories and subdirectories, much like a file system, for easier navigation. As a result, data processing requires less computational resources, reducing both the time and cost.

Q. Got it. A very important one: how about data redundancy?

Data Lake Storage takes advantage of the Azure Blob replication models that provide data redundancy in a single data center with locally redundant storage (LRS), or to a secondary region by using the Geo-redundant storage (GRS) option. This feature ensures that your data is always available and protected if catastrophe strikes.

Q. Do data lakes solve enterprise problems?

Data lakes have become a common solution to many enterprise problems. A data lake provides file-based storage, usually in a distributed file system that supports high scalability for massive volumes of data. Organizations can store structured, semi-structured, and unstructured files in the data lake and then consume them from there in big data processing technologies, such as Apache Spark. Azure Data Lake Storage Gen2 provides a cloud-based solution for data lake storage in Microsoft Azure, and underpins many large-scale analytics solutions built on Azure.

Q. How can you use Azure Data Lake Storage Gen2 in big data processing and analytics?

Big data scenarios usually refer to analytical workloads that involve massive volumes of data in a variety of formats that need to be processed at a fast velocity — the so-called "three v's". Azure Data Lake Storage Gen2 provides a scalable and secure distributed data store on which big data services such as Azure Synapse Analytics, Azure Databricks, and Azure HDInsight can apply data processing frameworks such as Apache Spark, Hive, and Hadoop. The distributed nature of the storage and the processing compute enables tasks to be performed in parallel, resulting in high performance and scalability even when processing huge amounts of data.

Q. How can you use Azure Data Lake Storage Gen2 in data warehousing?

Data warehousing has evolved in recent years to integrate large volumes of data stored as files in a data lake with relational tables in a data warehouse. In a typical example of a data warehousing solution, data is extracted from operational data stores, such as Azure SQL database or Azure Cosmos DB, and transformed into structures more suitable for analytical workloads. Often, the data is staged in a data lake in order to facilitate distributed processing before being loaded into a relational data warehouse. In some cases, the data warehouse uses external tables to define a relational metadata layer over files in the data lake and create a hybrid “data lakehouse” or “lake database” architecture. The data warehouse can then support analytical queries for reporting and visualization.

There are multiple ways to implement this kind of data warehousing architecture. The diagram shows a solution in which Azure Synapse Analytics hosts pipelines to perform extract, transform, and load (ETL) processes using Azure Data Factory technology. These processes extract data from operational data sources and load it into a data lake hosted in an Azure Data Lake Storage Gen2 container. The data is then processed and loaded into a relational data warehouse in an Azure Synapse Analytics dedicated SQL pool, from where it can support data visualization and reporting using Microsoft Power BI.

Q. How can you use Azure Data Lake Storage Gen2 in real-time data analytics?

Increasingly, businesses and other organizations need to capture perpetual streams of data and analyze them in real time (or as near to real time as possible). These streams of data can be generated from connected devices (often referred to as internet-of-things, or IoT, devices) or by users in social media platforms or other applications. Unlike traditional batch processing workloads, streaming data requires a solution that can capture and process a boundless stream of data events as they occur.

Streaming events are often captured in a queue for processing. There are multiple technologies you can use to perform this task, including Azure Event Hubs as shown in the image. From here, the data is processed, often to aggregate data over temporal windows (for example, to count the number of social media messages with a given tag every five minutes, or to calculate the average reading of an internet-connected sensor per minute). Azure Stream Analytics enables you to create jobs that query and aggregate event data as it arrives, and write the results to an output sink. One such sink is Azure Data Lake Storage Gen2, from which the captured real-time data can be analyzed and visualized.
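The temporal-window aggregation described above (a "tumbling" window in Stream Analytics terms) can be sketched in plain Python. The event stream here is invented for illustration; each event falls into exactly one five-minute window:

```python
from collections import Counter

WINDOW = 5 * 60  # a five-minute tumbling window, in seconds

# Invented event stream: (timestamp-in-seconds, tag) pairs.
events = [(10, "#iot"), (70, "#iot"), (290, "#iot"),
          (310, "#iot"), (650, "#iot"), (899, "#iot")]

# Floor-dividing a timestamp by the window length assigns each event
# to a single, non-overlapping window.
counts = Counter(ts // WINDOW for ts, _tag in events)

for window, n in sorted(counts.items()):
    print(f"window starting at {window * WINDOW}s: {n} events")
```

A real Stream Analytics job expresses the same idea declaratively with a SQL-like `GROUP BY` over a tumbling window, writing each window's aggregate to the configured sink as the window closes.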

Q. How can you use Azure Data Lake Storage Gen2 in data science and machine learning?

Data science involves the statistical analysis of large volumes of data, often using tools such as Apache Spark and scripting languages such as Python. Azure Data Lake Storage Gen2 provides a highly scalable cloud-based data store for the volumes of data required in data science workloads.

Machine learning is a subarea of data science that deals with training predictive models. Model training requires huge amounts of data, and the ability to process that data efficiently. Azure Machine Learning is a cloud service in which data scientists can run Python code in notebooks using dynamically allocated distributed compute resources. The compute processes data in Azure Data Lake Storage Gen2 containers to train models, which can then be deployed as production web services to support predictive analytical workloads.

Q. As a data engineer, what should I take into consideration when planning to use ADLS?

Whenever planning for a data lake, a data engineer should give thoughtful consideration to structure, data governance, and security. This should include consideration of factors that can influence lake structure and organization, such as:

  • Types of data to be stored
  • How the data will be transformed
  • Who should access the data
  • What the typical access patterns are

Establishing a baseline and following best practices for Azure Data Lake will help ensure a proper and robust implementation that allows the organization to grow and gain insight to achieve more, since it's a comprehensive data lake solution.

Written by Vijay Borkar (VBCloudboy)

Assisting Microsoft partners in elevating their technical capabilities in AI, Analytics, and Cybersecurity.