Training with more parameters than seemingly necessary makes deep networks easier to optimize. The extra parameters smooth the loss landscape, improve gradient flow, and provide many redundant pathways, so gradient descent is less likely to get stuck in poor local minima. Even though such large models can interpolate the training data, fitting it exactly, optimizers like SGD tend to find simple solutions that generalize well. After training, the redundant parameters can be pruned, leaving a smaller, more efficient model with little or no loss in accuracy.
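
To make the pruning step concrete, here is a minimal sketch of post-training magnitude pruning: rank all trained weights by absolute value and zero out the smallest ones, keeping only the strongest connections. The function name, layer shapes, and the 90% sparsity target are illustrative assumptions, not details from the text.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of smallest-magnitude entries,
    pooled across all weight arrays (global magnitude pruning)."""
    all_mags = np.concatenate([np.abs(w).ravel() for w in weights])
    threshold = np.quantile(all_mags, sparsity)  # cutoff below which weights are removed
    return [np.where(np.abs(w) >= threshold, w, 0.0) for w in weights]

# Hypothetical trained weights for a small two-layer network (assumed shapes).
rng = np.random.default_rng(0)
trained = [rng.normal(size=(784, 256)), rng.normal(size=(256, 10))]

# Remove 90% of the weights, keeping the 10% with the largest magnitudes.
pruned = magnitude_prune(trained, sparsity=0.9)
kept = sum(int((w != 0).sum()) for w in pruned)
total = sum(w.size for w in pruned)
print(f"kept {kept}/{total} weights ({kept / total:.1%})")
```

In practice the pruned model is usually fine-tuned briefly afterward, and the zeroed weights can be stored in a sparse format to realize the size savings.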