Optimisers
SGD
Stochastic Gradient Descent. The basic gradient-based optimiser; it tends to converge slowly in production cases.
SGD uses the basic weight update rule:
w = w - r*g
where w is the weight, r the learning rate, and g the gradient of the loss with respect to w.
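A minimal sketch of this update in NumPy; the names r and g follow the formula above, and the concrete values are only illustrative.

    import numpy as np

    # One SGD step: move the weights against the gradient, scaled by the learning rate r.
    def sgd_step(w, g, r=0.01):
        return w - r * g

    w = np.array([0.5, -0.3])      # current weights (illustrative)
    g = np.array([0.1, 0.2])       # gradient of the loss w.r.t. w (illustrative)
    w = sgd_step(w, g)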
Momentum
A decent optimiser, but still not the usual choice in production; Adam typically converges faster and adapts better across use cases.
Momentum adds an extra coefficient to the weight update: in one common formulation, a velocity v accumulates past gradients, v = m*v - r*g, and the weights move by that velocity, w = w + v, so updates keep pushing in a consistent direction (see the sketch below).
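A minimal sketch of a momentum step under that formulation, again in NumPy with illustrative values.

    import numpy as np

    # Momentum keeps a velocity v that accumulates past gradients,
    # damped by the momentum coefficient m (commonly around 0.9).
    def momentum_step(w, g, v, r=0.01, m=0.9):
        v = m * v - r * g          # blend previous velocity with the current gradient
        return w + v, v

    w = np.array([0.5, -0.3])
    v = np.zeros_like(w)           # velocity starts at zero
    g = np.array([0.1, 0.2])
    w, v = momentum_step(w, g, v)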
Adam
Adaptive Moment Estimation. The most common default optimiser in production; it converges quickly on many kinds of problems.
Adam adds further coefficients to the weight update: it keeps running estimates of the first and second moments of the gradient and scales each parameter's step by them (see the sketch below).
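A sketch of one Adam step; the coefficient values below (b1, b2, eps) follow commonly used defaults and are not taken from these notes.

    import numpy as np

    # Adam keeps running estimates of the first moment (mean) and second moment
    # (uncentred variance) of the gradient, with bias correction for early steps.
    def adam_step(w, g, m, v, t, r=0.001, b1=0.9, b2=0.999, eps=1e-8):
        m = b1 * m + (1 - b1) * g            # first-moment estimate
        v = b2 * v + (1 - b2) * g**2         # second-moment estimate
        m_hat = m / (1 - b1**t)              # bias correction (t is the step count, starting at 1)
        v_hat = v / (1 - b2**t)
        w = w - r * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v

    w = np.array([0.5, -0.3])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    g = np.array([0.1, 0.2])
    w, m, v = adam_step(w, g, m, v, t=1)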
Training Process
Data Preparation
Split the data into a training set and a test set.
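A simple way to do this split, sketched in NumPy; split_train_test is a hypothetical helper and the 80/20 ratio is only an example.

    import numpy as np

    # Shuffle the indices and hold out a fraction of the data as the test set.
    def split_train_test(X, y, test_ratio=0.2, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        n_test = int(len(X) * test_ratio)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

    X = np.arange(20).reshape(10, 2).astype(float)   # toy features
    y = np.arange(10)                                # toy labels
    X_train, y_train, X_test, y_test = split_train_test(X, y)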
Training
Train the network for multiple epochs (full passes over the training set).
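A toy training loop as a sketch: it fits a small linear model with plain SGD over several epochs; the data, learning rate, and epoch count are made up for illustration.

    import numpy as np

    # Fit y = X @ w with mean-squared error, running several full passes (epochs)
    # over the training data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    true_w = np.array([2.0, -1.0])
    y = X @ true_w + 0.1 * rng.normal(size=100)

    w = np.zeros(2)
    r = 0.1                                   # learning rate (illustrative)
    for epoch in range(20):                   # epoch count (illustrative)
        g = 2 * X.T @ (X @ w - y) / len(X)    # gradient of the MSE loss
        w = w - r * g                         # plain SGD update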
Inference
Evaluate the model on the training data, the held-out test data, and new unseen data.
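A minimal sketch of such an evaluation on a toy linear model, comparing error on the training and test splits; a large gap between the two suggests over-fitting.

    import numpy as np

    def mse(w, X, y):
        # Mean-squared error of a linear model y = X @ w.
        return float(np.mean((X @ w - y) ** 2))

    rng = np.random.default_rng(1)
    w = np.array([2.0, -1.0])                      # pretend these are the trained weights
    X_train, X_test = rng.normal(size=(80, 2)), rng.normal(size=(20, 2))
    y_train = X_train @ w + 0.1 * rng.normal(size=80)
    y_test = X_test @ w + 0.1 * rng.normal(size=20)

    print("train MSE:", mse(w, X_train, y_train))
    print("test MSE:", mse(w, X_test, y_test))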
Hardware Utilisation
Training can be done on a CPU (slow) or a GPU (fast). Training on a GPU is fast because GPUs are designed for parallel matrix computation, with thousands of simple cores rather than the handful of complex cores in a CPU.
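A small sketch of picking the device at run time, assuming a PyTorch setup (these notes do not name a framework).

    import torch  # assumption: PyTorch is the training framework

    # Use the GPU when one is available, otherwise fall back to the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = torch.nn.Linear(2, 1).to(device)   # move model parameters to the device
    x = torch.randn(4, 2, device=device)       # allocate the input on the same device
    y = model(x)                               # forward pass runs on that device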
Distributed Training
Distributed training across multiple GPUs or machines is essential in modern ML solutions to keep training time manageable.
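A conceptual sketch of synchronous data-parallel training in NumPy: it simulates several workers on one machine, each computing gradients on its own shard of the batch, and averages them; real systems do this across GPUs or machines with an all-reduce, which is not shown here.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(128, 2))
    true_w = np.array([2.0, -1.0])
    y = X @ true_w

    w = np.zeros(2)
    r = 0.1
    n_workers = 4
    for step in range(50):
        # Each simulated worker gets one shard of the batch.
        shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
        grads = [2 * Xs.T @ (Xs @ w - ys) / len(Xs) for Xs, ys in shards]
        # Average the workers' gradients and apply the same update everywhere.
        w = w - r * np.mean(grads, axis=0)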