Concept: He Initialization
Core Intuition
He initialization is designed to address the vanishing/exploding gradient problem in deep networks that use ReLU activations. It keeps the variance of activations roughly constant across layers, which is what allows very deep architectures to learn effectively.
Mathematical Foundation
- Variance Preservation: For deep networks to learn, the variance of activations should stay roughly constant across layers.
- The ReLU Problem: ReLU zeros out half of the input distribution (negative values), which roughly halves the variance of the signal at every layer.
- The Compensation: Whereas Xavier initialization scales the weight variance as 1/n_in (or 2/(n_in + n_out) in Glorot's symmetric form), He initialization doubles it to 2/n_in to compensate for the ~50% signal loss caused by ReLU (see the sketch below).
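The effect of that factor of 2 can be checked empirically by pushing a random batch through a single linear + ReLU layer and tracking the mean square of the signal (the quantity the He derivation preserves). The layer sizes and batch size below are arbitrary illustration values:

import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 512, 512, 4096
relu = lambda z: np.maximum(z, 0.0)

x = rng.standard_normal((batch, n_in))                                # unit-variance input signal

W_xavier = rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)   # fan-in Xavier scaling
W_he = rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)       # He scaling (extra factor of 2)

print(np.mean(x ** 2))                      # ~1.0  incoming signal
print(np.mean(relu(x @ W_xavier.T) ** 2))   # ~0.5  ReLU has halved the signal
print(np.mean(relu(x @ W_he.T) ** 2))       # ~1.0  the factor of 2 restores it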
Key Equation
W ~ N(0, 2 / n_in), i.e. each weight is drawn with standard deviation sqrt(2 / n_in), where n_in is the layer's fan-in.
Code:
import numpy as np  # n_in / n_out: the layer's fan-in and fan-out
W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)  # std = sqrt(2 / fan_in)
Analogy
Like adjusting a volume knob to be twice as loud because you know half of your speakers are muted—keeping the total sound level consistent across the room.
Component of
Insights
- Xavier vs. He: Xavier assumes symmetric activations (Tanh/Sigmoid), whereas He is tailored for asymmetric rectifiers.
- Standard for ReLU: He initialization is the industry standard for ReLU, Leaky ReLU, and GELU activations.
- Dead Neurons: Proper scaling ensures that in the early stages of training, enough neurons are in the “active” region of the ReLU.
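To make the Xavier-vs-He contrast concrete, the sketch below (depth, width, and batch size are arbitrary illustration values) pushes a signal through a stack of ReLU layers under each scheme and reports the root-mean-square activation that comes out the other end:

import numpy as np

rng = np.random.default_rng(1)
depth, width, batch = 30, 256, 1024

def output_rms(weight_std):
    """Root-mean-square activation after a stack of `depth` linear + ReLU layers."""
    x = rng.standard_normal((batch, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_std
        x = np.maximum(x @ W.T, 0.0)
    return np.sqrt(np.mean(x ** 2))

print("Xavier (fan-in):", output_rms(np.sqrt(1.0 / width)))  # shrinks toward 0 as depth grows
print("He:             ", output_rms(np.sqrt(2.0 / width)))  # stays on the order of 1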
Pitfalls
- Using He initialization with Tanh or Sigmoid supplies more variance than those bounded activations are calibrated for: pre-activations are pushed into the saturated regions, which can destabilize training and stall learning.
Connections
Implementation Notes
- The “He Normal” variant is commonly implemented with a truncated normal distribution (values more than two standard deviations from the mean are re-drawn), as in Keras's he_normal initializer.
- Can also be implemented as “He Uniform”, drawing from a uniform distribution over [-sqrt(6 / n_in), +sqrt(6 / n_in)], which has the same variance of 2 / n_in (see the sketch below).
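A minimal NumPy sketch of the two variants, assuming the common two-standard-deviation re-draw rule for truncation (exact conventions vary by framework, so check your library's initializer):

import numpy as np

def he_normal(n_out, n_in, rng=None):
    """He Normal: truncated normal with std = sqrt(2 / fan_in); samples beyond
    two standard deviations are re-drawn (a common truncation convention)."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / n_in)
    W = rng.normal(0.0, std, size=(n_out, n_in))
    tails = np.abs(W) > 2.0 * std
    while tails.any():                                  # re-draw any tail samples
        W[tails] = rng.normal(0.0, std, size=tails.sum())
        tails = np.abs(W) > 2.0 * std
    return W

def he_uniform(n_out, n_in, rng=None):
    """He Uniform: U(-limit, +limit) with limit = sqrt(6 / fan_in), which has
    the same variance, 2 / fan_in."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / n_in)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

print(he_uniform(256, 512).var())  # ≈ 2 / 512 ≈ 0.0039
print(he_normal(256, 512).var())   # slightly below 2 / 512, since truncation trims the
                                   # tails; frameworks often rescale to correct for this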