Bias Term
Backpropagation
Vanishing and Exploding Gradients
NN Weight Initialization
What are the differences between DNNs and Logistic Regression?
Hyper-Parameter Tuning (Random Search, Grid Search)
Prevent Overfitting
Dropout
Batch Norm and Layer Norm
Learning Rate
Plateau and Saddle Point
Transfer Learning
Activation Functions (sigmoid, tanh, ReLU, Leaky ReLU, maxout, ELU)
Why Non-Linear Activation Functions?
Optimizers (SGD, RMSprop, Momentum, Adagrad, Adam, AdamW)
Batch GD and SGD
Original Self-Attention