Foundations of deep learning
September 29, 2023
- resurgence in neural networks
- data, hardware, software
- architecture
- perceptron
- building block of neural networks
- includes multi-output perceptrons
- connected via dense layers (every input connected to every output) and stacked into hidden layers
- algorithm
- forward pass / propagation
- input values (including bias)
- weights
- sum
- activation function
- sigmoid, hyperbolic tangent, rectified linear unit
- introduce non-linearities into the network, allowing it to approximate arbitrarily complex functions
- output value
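A minimal sketch of the single-perceptron forward pass described above, in NumPy (the function names and example values are illustrative, and sigmoid is just one of the activations listed):

```python
import numpy as np

def sigmoid(z):
    # squash any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_forward(x, w, b):
    # weighted sum of the inputs plus the bias, passed through a non-linear activation
    z = np.dot(w, x) + b
    return sigmoid(z)

# illustrative inputs, weights and bias (values chosen arbitrarily)
x = np.array([1.0, 2.0, -1.0])
w = np.array([0.5, -0.3, 0.8])
b = 0.1
print(perceptron_forward(x, w, b))
```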
- objective function / loss function
- types
- empirical loss (total loss over the dataset)
- binary cross entropy loss (models that output probabilities between 0 and 1)
- mean squared error loss (regression models that output continuous real numbers)
- measure the cost incurred from incorrect predictions
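Sketches of the two loss functions named above; the empirical loss is simply the mean over the dataset, which is what both helpers return (the helper names are mine, not from the lecture):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # for models that output probabilities between 0 and 1
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mean_squared_error(y_true, y_pred):
    # for regression models that output continuous real numbers
    return np.mean((y_true - y_pred) ** 2)
```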
- backward pass / backpropagation
- computing gradients via the chain rule
- how does a small change in one weight affect the final loss
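A sketch of the chain rule on a single sigmoid perceptron with a squared-error loss, to make "how does a small change in one weight affect the final loss" concrete (the tiny example values are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# forward pass for one example (made-up values)
x, y = np.array([1.0, 2.0]), 1.0
w, b = np.array([0.1, -0.2]), 0.0
z = np.dot(w, x) + b        # weighted sum
y_hat = sigmoid(z)          # prediction
loss = (y_hat - y) ** 2     # squared error

# backward pass: one chain-rule factor per step of the forward pass
dloss_dyhat = 2 * (y_hat - y)              # d(loss)/d(y_hat)
dyhat_dz = y_hat * (1 - y_hat)             # derivative of the sigmoid
dz_dw = x                                  # d(z)/d(w)
dloss_dw = dloss_dyhat * dyhat_dz * dz_dw  # gradient of the loss w.r.t. each weight
print(dloss_dw)
```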
- training
- goal is to find the network weights that achieve the lowest loss (the loss J(W) is a function of the weights)
- pick initial weights (usually random, but there are optimisations that can be made)
- compute gradient
- update weights (take small step in opposite direction of gradient)
- repeat until convergence
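A minimal gradient-descent loop following the steps above, assuming a single sigmoid unit trained with binary cross entropy (the analytic gradient below is specific to that choice, and a fixed step count stands in for "repeat until convergence"):

```python
import numpy as np

def train(X, y, lr=0.1, steps=1000):
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])  # pick initial weights (random)
    b = 0.0
    for _ in range(steps):
        y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # forward pass
        grad_w = X.T @ (y_hat - y) / len(y)          # gradient of the cross-entropy loss
        grad_b = np.mean(y_hat - y)
        w -= lr * grad_w                             # small step opposite the gradient
        b -= lr * grad_b
    return w, b

# usage: X is an (n_examples, n_features) array, y a 0/1 label vector
```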
- in practice, the basic loop is refined by the optimisation techniques below
- optimisation
- learning rate
- too small (converges slowly, stuck in local minima)
- too big (overshoots, unstable and diverges)
- setting the right rate
- iterate through different rates
- adaptive learning rate
- how large is the gradient
- how fast is learning happening
- size of particular weights
- gradient descent optimisers
- SGD, Adam, Adadelta, Adagrad, RMSProp
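A sketch of one Adam-style update, showing how an adaptive optimiser scales the step per weight using running averages of the gradient and its square (simplified; the hyperparameter names follow common convention and are not from the lecture):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # running mean of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2      # running mean of the squared gradient
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-weight effective learning rate
    return w, m, v
```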
- batches and mini-batches
- more accurate estimate of gradient
- smoother convergence
- allows for larger learning rate
- faster training (computation can be parallelised across the batch, giving speed increases)
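A sketch of mini-batch iteration: each batch gives a cheaper gradient estimate than the full dataset and a less noisy one than a single example (the helper name is mine):

```python
import numpy as np

def iterate_minibatches(X, y, batch_size=32, seed=0):
    # shuffle once, then yield successive mini-batches of the data
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]
```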
- regularisation
- technique to constrain optimisation problem to discourage complex models
- improve generalisation of model on unseen data
- underfitting
- model is too simple and doesn't capture the structure of the data
- overfitting
- model is too complex (too many parameters), fits noise in the training data, does not generalise well
- techniques
- dropout
- during training, randomly set some activations to 0
- forces network to not rely on one node
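A sketch of (inverted) dropout as described above, assuming plain NumPy activations; the scaling by 1/(1-p) keeps the expected activation unchanged and is a common implementation detail, not something stated in the notes:

```python
import numpy as np

def dropout(activations, p=0.5, training=True, seed=None):
    # during training, randomly set a fraction p of activations to 0
    if not training:
        return activations
    rng = np.random.default_rng(seed)
    mask = rng.random(activations.shape) >= p
    # rescale the surviving activations so their expected value is unchanged
    return activations * mask / (1.0 - p)
```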
- early stopping
- stop before overfitting
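A sketch of early stopping around a generic training loop; `run_epoch` and `validation_loss` are hypothetical callbacks standing in for one epoch of training and an evaluation on held-out data:

```python
def train_with_early_stopping(run_epoch, validation_loss, max_epochs=100, patience=5):
    # stop once the validation loss has not improved for `patience` epochs
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        run_epoch()                  # one pass over the training data
        loss = validation_loss()     # loss on held-out data
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break                    # stop before overfitting gets worse
    return best_loss
```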
References:
- MIT Introduction to Deep Learning | 6.S191 | Foundations of Deep Learning