In the past several years we have developed a comprehensive theory of large scale learning with Deep Neural Networks (DNN), when optimized with Stochastic Gradient Decent (SGD). The theory is built on three theoretical components: (1) Rethinking the standard (PAC like) distribution independent worse case generalization bounds - turning them to problem dependent typical (in the Information Theory sense) bounds that are independent of the model architecture.(2)The Information Plane theorem: For large scale typical learning the sample-complexity and accuracy tradeoff is characterized by only two numbers: the mutual information that the representation (a layer in the network) maintain on the input patterns, and the mutual information each layer has on the desired output label. The Information Theoretic optimal tradeoff between thees encoder and decoder information values is given by the Information Bottleneck (IB) bound for the rule specific input-output distribution.(3)The layers of the DNN reach this optimal bound via standard SGD training, in high (input&layers) dimension.
In this talk, I will discuss two new surprising outcomes of this theory: (1) The computational benefit of the hidden layers, (2) the emerging understanding of the features encoded by each layers which follows from the convergence to the IB bound.