Foundations of deep learning
September 29, 2023
- resurgence in neural networks
- data, hardware, software
- architecture
- perceptron
- building block of neural networks
- includes multi-output perceptrons
- connected via dense layers (every input connected to every output) and stacked into hidden layers
- algorithm
- forward pass / propagation
- input values (including bias)
- weights
- sum
- activation function
- sigmoid, hyperbolic tangent, rectified linear unit
- introduce non-linearities into the network, allowing it to approximate arbitrarily complex functions
- output value
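A minimal sketch of the single-perceptron forward pass described above, in NumPy (the function names and example values are illustrative, and sigmoid is just one of the activations listed):

```python
import numpy as np

def sigmoid(z):
    # squash any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_forward(x, w, b):
    # weighted sum of the inputs plus the bias, passed through a non-linear activation
    z = np.dot(w, x) + b
    return sigmoid(z)

# illustrative inputs, weights and bias (values chosen arbitrarily)
x = np.array([1.0, 2.0, -1.0])
w = np.array([0.5, -0.3, 0.8])
b = 0.1
print(perceptron_forward(x, w, b))
```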
- objective function / loss function
- types
- empirical loss (total loss over the dataset)
- binary cross entropy loss (models that output probabilities between 0 and 1)
- mean squared error loss (regression models that output continuous real numbers)
- measure the cost incurred from incorrect predictions
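Sketches of the two loss functions named above; the empirical loss is simply the mean over the dataset, which is what both helpers return (the helper names are mine, not from the lecture):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # for models that output probabilities between 0 and 1
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mean_squared_error(y_true, y_pred):
    # for regression models that output continuous real numbers
    return np.mean((y_true - y_pred) ** 2)
```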
- backward pass / backpropagation
- computing gradients via the chain rule
- how does a small change in one weight affect the final loss
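A sketch of the chain rule on a single sigmoid perceptron with a squared-error loss, to make "how does a small change in one weight affect the final loss" concrete (the tiny example values are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# forward pass for one example (made-up values)
x, y = np.array([1.0, 2.0]), 1.0
w, b = np.array([0.1, -0.2]), 0.0
z = np.dot(w, x) + b        # weighted sum
y_hat = sigmoid(z)          # prediction
loss = (y_hat - y) ** 2     # squared error

# backward pass: one chain-rule factor per step of the forward pass
dloss_dyhat = 2 * (y_hat - y)              # d(loss)/d(y_hat)
dyhat_dz = y_hat * (1 - y_hat)             # derivative of the sigmoid
dz_dw = x                                  # d(z)/d(w)
dloss_dw = dloss_dyhat * dyhat_dz * dz_dw  # gradient of the loss w.r.t. each weight
print(dloss_dw)
```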
- training
- goal is to find the network weights that achieve the lowest loss (the loss J(W) is a function of the weights)
- pick initial weights (usually random, but there are optimisations that can be made)
- compute gradient
- update weights (take small step in opposite direction of gradient)
- repeat until convergence
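A minimal gradient-descent loop following the steps above, assuming a single sigmoid unit trained with binary cross entropy (the analytic gradient below is specific to that choice, and a fixed step count stands in for "repeat until convergence"):

```python
import numpy as np

def train(X, y, lr=0.1, steps=1000):
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])  # pick initial weights (random)
    b = 0.0
    for _ in range(steps):
        y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # forward pass
        grad_w = X.T @ (y_hat - y) / len(y)          # gradient of the cross-entropy loss
        grad_b = np.mean(y_hat - y)
        w -= lr * grad_w                             # small step opposite the gradient
        b -= lr * grad_b
    return w, b

# usage: X is an (n_examples, n_features) array, y a 0/1 label vector
```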
- in practice, the basic loop is refined by the optimisation techniques below
- optimisation
- learning rate
- too small (converges slowly, stuck in local minima)
- too big (overshoots, unstable and diverges)
- setting the right rate
- iterate through different rates
- adaptive learning rate
- how large is the gradient
- how fast is learning happening
- size of particular weights
- gradient descent optimisers
- SGD, Adam, Adadelta, Adagrad, RMSProp
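A sketch of one Adam-style update, showing how an adaptive optimiser scales the step per weight using running averages of the gradient and its square (simplified; the hyperparameter names follow common convention and are not from the lecture):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # running mean of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2      # running mean of the squared gradient
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-weight effective learning rate
    return w, m, v
```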
- batches and mini-batches
- more accurate estimate of gradient
- smoother convergence
- allows for larger learning rate
- faster training (computation can be parallelised across the batch, giving speed increases)
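A sketch of mini-batch iteration: each batch gives a cheaper gradient estimate than the full dataset and a less noisy one than a single example (the helper name is mine):

```python
import numpy as np

def iterate_minibatches(X, y, batch_size=32, seed=0):
    # shuffle once, then yield successive mini-batches of the data
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]
```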
- regularisation
- technique to constrain optimisation problem to discourage complex models
- improve generalisation of model on unseen data
- underfitting
- model is too simple and doesn't capture the structure of the data
- overfitting
- model is too complex (too many parameters), fits noise in the training data, does not generalise well
- techniques
- dropout
- during training, randomly set some activations to 0
- forces network to not rely on one node
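A sketch of (inverted) dropout as described above, assuming plain NumPy activations; the scaling by 1/(1-p) keeps the expected activation unchanged and is a common implementation detail, not something stated in the notes:

```python
import numpy as np

def dropout(activations, p=0.5, training=True, seed=None):
    # during training, randomly set a fraction p of activations to 0
    if not training:
        return activations
    rng = np.random.default_rng(seed)
    mask = rng.random(activations.shape) >= p
    # rescale the surviving activations so their expected value is unchanged
    return activations * mask / (1.0 - p)
```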
- early stopping
- stop before overfitting
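A sketch of early stopping around a generic training loop; `run_epoch` and `validation_loss` are hypothetical callbacks standing in for one epoch of training and an evaluation on held-out data:

```python
def train_with_early_stopping(run_epoch, validation_loss, max_epochs=100, patience=5):
    # stop once the validation loss has not improved for `patience` epochs
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        run_epoch()                  # one pass over the training data
        loss = validation_loss()     # loss on held-out data
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break                    # stop before overfitting gets worse
    return best_loss
```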
References:
- MIT Introduction to Deep Learning | 6.S191 | Foundations of Deep Learning