Hadoop Distributed Filesystem (HDFS)

HDFS is a distributed file system designed to store large datasets reliably and provide high-throughput access to application data. It's the storage foundation of the Apache Hadoop ecosystem.

Repository activity:

  • Stars: 15,115
  • Forks: 9,049
  • Watchers: 976
  • Open Issues: 1,145
  • Last commit: about 20 hours ago

Details:

  • Estimated Popularity: 81
  • Pricing Model: Free
  • Hosting Type: Self-Hosted
  • License: Apache-2.0
  • Deployment Difficulty: Advanced
  • Language: Java

HDFS provides reliable, scalable distributed storage built for big data workloads. It is optimized for large files and the high-throughput, mostly sequential access patterns common in data analytics, rather than for low-latency access to many small files.
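To make the large-file orientation concrete, here is a minimal sketch of how a file maps onto fixed-size blocks and how much raw capacity its replicas consume. It assumes the stock defaults of a 128 MiB block size (`dfs.blocksize`) and a replication factor of 3 (`dfs.replication`), both of which a cluster may override. The function name `block_footprint` is hypothetical and not part of any Hadoop API.

```python
# Assumed stock defaults; real clusters may configure different values.
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024  # dfs.blocksize, in bytes
DEFAULT_REPLICATION = 3                 # dfs.replication


def block_footprint(file_size, block_size=DEFAULT_BLOCK_SIZE,
                    replication=DEFAULT_REPLICATION):
    """Return (block_count, raw_bytes) for one file of `file_size` bytes.

    The final block only occupies as much disk as the data it holds, so raw
    usage is file_size * replication, not block_count * block_size.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    # Integer ceiling division; an empty file has no blocks.
    block_count = -(-file_size // block_size)
    raw_bytes = file_size * replication
    return block_count, raw_bytes


if __name__ == "__main__":
    one_gib = 1024 ** 3
    one_mib = 1024 ** 2
    # The same 1 GiB of data stored as one large file vs. 1024 small files.
    large_blocks, _ = block_footprint(one_gib)
    small_blocks = sum(block_footprint(one_mib)[0] for _ in range(1024))
    print(large_blocks, small_blocks)
```

Under these assumptions, a single 1 GiB file needs 8 blocks, while the same data split into 1024 files of 1 MiB needs 1024 blocks. Each block is tracked in NameNode metadata, which is one reason HDFS favors fewer, larger files.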

Key Features

  • Big Data Optimized:

    • Designed for large files (gigabytes to terabytes)
    • High-throughput data access
    • Write-once, read-many model (files can be appended to, but not edited in place)
    • Optimized for batch processing
    • Streaming data access
    • MapReduce integration
  • Fault Tolerance:

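    • Automatic block replication across DataNodes
    • Configurable replication factor (default 3)
    • Rack-aware replica placement
    • Automatic failure detection via DataNode heartbeats
    • Self-healing through re-replication of under-replicated blocks
    • Checksum verification of block data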
  • Scalability:

    • Petabyte-scale storage
    • Thousands of nodes
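    • Near-linear capacity scaling as DataNodes are added
    • Metadata managed by NameNodes, separately from block data on DataNodes
    • Namespace federation across multiple independent NameNodes
    • High-availability NameNode (active/standby)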
  • Hadoop Integration:

    • Native MapReduce support
    • Works with YARN resource management
    • Apache Spark compatibility
    • Storage layer for HBase
    • Storage for Hive data warehouse tables
    • Integration with the wider Hadoop ecosystem
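
The rack-aware placement listed under Fault Tolerance can be illustrated with a small sketch. It follows the commonly documented default for a replication factor of 3: the first replica on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. This is a simplified model, not Hadoop's actual placement code. The `topology` dictionary and the `place_replicas` function are hypothetical names, the writer is assumed to run on a DataNode, and the sketch raises an error where a real cluster would fall back to a looser placement.

```python
import random


def place_replicas(writer, topology, rng=None):
    """Choose three DataNodes for one block.

    `topology` maps rack name -> list of node names (illustrative structure).
    Returns [first, second, third] replica locations.
    """
    if rng is None:
        rng = random.Random()

    # Reverse lookup: which rack does each node live in?
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    if writer not in rack_of:
        raise ValueError("writer must be a node in the topology")
    local_rack = rack_of[writer]

    # Replicas 2 and 3 share one remote rack, so it needs at least two nodes.
    remote_racks = sorted(
        rack for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    )
    if not remote_racks:
        raise ValueError("need a remote rack with at least two nodes")

    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(sorted(topology[remote_rack]), 2)
    return [writer, second, third]


if __name__ == "__main__":
    topology = {
        "rack1": ["dn1", "dn2"],
        "rack2": ["dn3", "dn4"],
        "rack3": ["dn5"],
    }
    print(place_replicas("dn1", topology, random.Random(0)))
```

In this example topology only `rack2` has two nodes outside the writer's rack, so the result is `dn1` followed by `dn3` and `dn4` in some order. Placing two replicas in one remote rack keeps a copy available if the writer's whole rack fails, while limiting cross-rack write traffic to a single transfer.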
