Hadoop, the most popular open-source distributed framework, has arrived with a new 3.x release. It brings promising features and enhancements, and here we will demystify the Hadoop 3.0 architecture in detail. The differences between Hadoop 3.0 & Hadoop 2.0 have already been discussed a lot, but seeing how those changes fit into the Hadoop 3.0 architecture will give you better insight and make you a more aware developer.
Let's see how the Hadoop architecture evolved from its initial release in 2006 through the Hadoop 2.x versions. Hadoop 2.x has a much improved architecture with YARN, and its building blocks are more flexible.
As data kept growing and enterprises started working on Enterprise Data Lake (EDL) solutions, optimizing the cost of storage became one of the key concerns. The underlying development language (Java) also moved forward to version 1.8 with many enhanced features, and adopting it is a must for the Hadoop community. YARN improvements, task-level native optimization, automatic derivation of heap size, scheduling enhancements, a change of default ports, and client-side classpath isolation are the other changes that shaped the new architecture for Hadoop 3.0.
Hadoop 3.0 Architecture for HDFS
The current HDFS 2.x implementation has a 200% space overhead: with the default replication factor of 3, each data block is copied to two other DataNodes. This is a very simple, scalable and robust architecture, but its space overhead is too high.
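The 200% figure follows directly from the replication factor. Here is a minimal sketch of the arithmetic in Python, assuming the HDFS default replication factor of 3 (one original block plus two copies). The function names are illustrative only and are not part of any Hadoop API.

```python
def replication_overhead_percent(replication_factor: int) -> float:
    """Extra storage, as a percentage of the logical data size."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    # Every copy beyond the first one is pure overhead.
    return (replication_factor - 1) * 100.0


def raw_storage_tb(logical_tb: float, replication_factor: int) -> float:
    """Raw disk space consumed across the cluster for the given logical data."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return logical_tb * replication_factor


if __name__ == "__main__":
    # Assumed default: replication factor 3.
    print(replication_overhead_percent(3))  # 200.0
    print(raw_storage_tb(100.0, 3))         # 300.0
```

In other words, under this assumption 100 TB of data occupies 300 TB of raw disk.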
The HDFS 3.0 architecture addresses this space overhead with Erasure Coding.
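To see why erasure coding reduces storage cost, it helps to compare its overhead with that of replication. The sketch below assumes a Reed-Solomon layout with 6 data units and 3 parity units, written RS(6,3). That layout is an assumption chosen for illustration and is not stated in the text above, and the function names are hypothetical rather than Hadoop APIs.

```python
def ec_overhead_percent(data_units: int, parity_units: int) -> float:
    """Extra storage for an erasure-coded stripe, as a percentage of the data."""
    if data_units < 1 or parity_units < 0:
        raise ValueError("need at least one data unit and no negative parity")
    # Only the parity units are overhead; the data units hold the data itself.
    return 100.0 * parity_units / data_units


def ec_tolerated_failures(parity_units: int) -> int:
    """A Reed-Solomon stripe survives the loss of as many units as it has parity."""
    return parity_units


def replication_tolerated_failures(replication_factor: int) -> int:
    """With N copies of a block, up to N - 1 copies can be lost."""
    return replication_factor - 1


if __name__ == "__main__":
    # Assumed layout RS(6,3) versus assumed 3-way replication.
    print(ec_overhead_percent(6, 3))          # 50.0
    print(ec_tolerated_failures(3))           # 3
    print(replication_tolerated_failures(3))  # 2
```

Under these assumptions, RS(6,3) stores 1.5 times the logical data and tolerates the loss of any three units in a stripe. Three-way replication stores 3 times the data and tolerates the loss of two copies.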
Hadoop 3.0 Downstream Compatibility
The following version compatibility matrix indicates the versions of different Apache projects and their unit test status, including basic functionality testing. This was done as part of the Hadoop 3.0 Beta 1 release in October 2017.