HDFS Storage Policies and Real-Life Applications

by Thomas Memenga on 2022-10-03

Hadoop Distributed File System (HDFS) has evolved significantly to accommodate the diverse storage needs of large-scale data processing. The introduction of HDFS storage policies is a testament to this evolution, offering a flexible and cost-effective solution for managing data storage.

Archival Storage and Its Importance:

Archival Storage in HDFS decouples ever-growing storage capacity from compute capacity. Nodes with high storage density but modest compute power serve as cold storage within the cluster, and data is moved from hot to cold storage based on defined policies, allowing storage to be expanded independently of compute capacity.

Types of Storage in HDFS:

HDFS supports several storage types: ARCHIVE, DISK, SSD, and RAM_DISK. ARCHIVE storage offers high storage density and is ideal for rarely accessed archival data, while RAM_DISK is used for writing single-replica files in memory. These options cater to different performance and cost requirements.

Storage Policies and Their Applications:

HDFS storage policies allow for flexible data management by specifying how data should be stored across different types of storage media. Here’s a detailed look at each policy:

Hot:

Ideal for data that is frequently accessed and used for processing. All replicas of a block are stored in DISK.

Cold:

Suitable for data that is rarely accessed or needs archiving. All replicas are stored in ARCHIVE storage.

Warm:

Partially hot and partially cold: one replica is stored in DISK and the remaining replicas in ARCHIVE.

All_SSD:

All replicas of a block are stored in SSD, offering high performance.

One_SSD:

One replica is stored in SSD for better read performance, while the remaining replicas are stored in DISK.

Lazy_Persist:

Designed for data that can initially reside in memory (RAM_DISK) and is later lazily persisted to DISK. This is used for single replica blocks.

Provided:

This policy is for data stored outside HDFS, on externally provided storage.

These policies enable efficient use of storage resources, optimizing cost and performance based on the data’s access patterns and importance.

Managing Storage Space:

When a policy’s preferred storage type runs out of space, HDFS falls back to the fallback storage types defined for that policy, both for file creation and for replication. This keeps writes flowing without manual intervention.

Setting and Unsetting Storage Policies:

Users can set or unset storage policies on files and directories. Changing a storage policy only updates the policy recorded in the NameNode’s namespace; it does not physically move existing blocks across storage media. The HdfsAdmin API’s satisfyStoragePolicy() call or a data migration tool such as hdfs mover must be used to move blocks in accordance with the new policy.

Managing storage policies in HDFS involves a few key commands:

Set a Storage Policy:

Use hdfs storagepolicies -setStoragePolicy to assign a specific storage policy to a file or directory.
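
For example, to assign the COLD policy to a directory (the path /data/reports is a placeholder):

    hdfs storagepolicies -setStoragePolicy -path /data/reports -policy COLD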

Get the Storage Policy:

With hdfs storagepolicies -getStoragePolicy, you can find out the current storage policy applied to a file or directory.
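
For example, to query the same placeholder directory:

    hdfs storagepolicies -getStoragePolicy -path /data/reports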

List Storage Policies:

To see all available storage policies, use hdfs storagepolicies -listPolicies.
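
This command takes no path argument:

    hdfs storagepolicies -listPolicies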

Unset a Storage Policy:

The command hdfs storagepolicies -unsetStoragePolicy removes the storage policy from a file or directory; the path then inherits the policy of its nearest ancestor, or the default (HOT) if none is set.
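
For example:

    hdfs storagepolicies -unsetStoragePolicy -path /data/reports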

Change a Storage Policy and move the data:

To change a storage policy and move the data in HDFS, first use hdfs storagepolicies -setStoragePolicy to assign the new policy to the desired file or directory. This updates the policy in the namespace but does not physically move the data. To initiate the movement, call the HdfsAdmin API’s satisfyStoragePolicy() method. This triggers the Storage Policy Satisfier (SPS) to scan for storage mismatches and schedule block-movement tasks on the DataNodes. The SPS, which can run as an external service, ensures that the data is moved in accordance with the newly set policy.
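
A minimal sketch of that API call (the NameNode URI and path are placeholders; this assumes a Hadoop 3.1+ cluster with the SPS enabled):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.client.HdfsAdmin;

    public class SatisfyPolicy {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode URI - adjust to your cluster.
            HdfsAdmin admin = new HdfsAdmin(URI.create("hdfs://namenode:8020"), conf);
            // Ask the SPS to move any blocks under this path whose storage
            // types no longer match the path's storage policy.
            admin.satisfyStoragePolicy(new Path("/data/reports"));
        }
    }

The same trigger is also exposed on the command line (Hadoop 3.1+) as hdfs storagepolicies -satisfyStoragePolicy -path /data/reports; on clusters without the SPS, the hdfs mover tool performs the migration instead.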

Real-Life Usage Scenarios:

Temporal Locality in Datasets:

For datasets such as time-series data, where the latest data is of prime importance, a common practice is to keep fresh data on SSD and migrate it to disk storage as it ages. This optimizes performance for recent data while remaining cost-effective for older data.
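
A sketch of that lifecycle, using placeholder month-partitioned directories under /data/events: the current partition is pinned to SSD, and a periodic job demotes partitions as they age and then migrates their blocks:

    # Current partition: keep all replicas on SSD for fast access.
    hdfs storagepolicies -setStoragePolicy -path /data/events/2022-10 -policy ALL_SSD

    # Last month's partition: demote to ordinary disks and move the blocks.
    hdfs storagepolicies -setStoragePolicy -path /data/events/2022-09 -policy HOT
    hdfs mover -p /data/events/2022-09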

Archiving Cold Data:

Moving rarely accessed cold data to denser archival storage is a cost-saving strategy. It is particularly useful for data that will not be accessed frequently, allowing organizations to keep such data on cheaper hardware.
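
For example, demoting a placeholder year-old dataset to archival storage:

    hdfs storagepolicies -setStoragePolicy -path /data/events/2021 -policy COLD
    hdfs mover -p /data/events/2021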

Configuring Heterogeneous Storage in HDFS:

Hadoop provides tools for configuring heterogeneous storage, allowing users to assign a storage type to each DataNode data directory. This flexibility is crucial for optimizing data placement based on specific access patterns and frequency.
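
As an illustration, storage types are declared as prefixes on the DataNode data directories in hdfs-site.xml (the paths are placeholders; directories without a prefix default to DISK):

    <property>
      <name>dfs.datanode.data.dir</name>
      <value>[DISK]file:///grid/0/hdfs/data,[SSD]file:///ssd/0/hdfs/data,[ARCHIVE]file:///archive/0/hdfs/data</value>
    </property>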