Gradient Descent

[Figure: gradient descent diagram]
The gradient descent method (GD) for finding the minimum of a function was invented by the French mathematician Augustin-Louis Cauchy. It uses the gradient to find a minimum point (a local minimum only; there is no guarantee of reaching the global minimum) by driving the gradient toward zero.

GD Diagram

See the figure above.

GD Diagram Description

Consider a function fe (in ML, the function whose local minimum we want to find is called the loss function), which is convex (concave upward, shaped like a bowl), and an arbitrary coordinate w on the x axis. Draw a vertical line from w up to the curve, and draw the tangent line to the function at the crossing point.
The gradient of the function fe at the tangent point T is the slope of that tangent line; call this value g.
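To make this concrete, here is a minimal Python sketch (not from the original text): it estimates the gradient g of a hypothetical convex loss function fe(w) = (w − 3)² at a point w with a central finite difference, which approximates the slope of the tangent line at that point.

    def fe(w: float) -> float:
        # Hypothetical convex "loss" function with its minimum at w = 3.
        return (w - 3.0) ** 2

    def gradient(f, w: float, eps: float = 1e-6) -> float:
        # Central finite difference: an estimate of the slope of the tangent line at w.
        return (f(w + eps) - f(w - eps)) / (2.0 * eps)

    g = gradient(fe, 5.0)   # positive, because w = 5 lies to the right of the minimum
    print(g)                # about 4.0, since d/dw (w - 3)^2 = 2(w - 3)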

The Maths

The farther to the right of the minimum, the larger the gradient is (positive).
The farther to the left of the minimum, the smaller the gradient is (negative).
Point T is on the right side of the minimum, so in order to reach the local minimum, w should be reduced.
The larger g is, the faster w is moved toward the local minimum by this assignment:
w ← w − r·g
where r is called the learning rate, and w and g are as described above. Later, in backpropagation, the value w will be a weight of a neuron.
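Below is a minimal Python sketch (not part of the original text) of this update applied repeatedly to the same hypothetical function fe(w) = (w − 3)², with an assumed learning rate r = 0.1 and starting point w = 5.

    def grad_fe(w: float) -> float:
        # Exact gradient of the hypothetical loss fe(w) = (w - 3)^2.
        return 2.0 * (w - 3.0)

    w = 5.0                  # arbitrary starting coordinate, to the right of the minimum
    r = 0.1                  # assumed learning rate
    for step in range(100):
        g = grad_fe(w)       # gradient at the current w
        w = w - r * g        # the gradient descent update
    print(w)                 # close to 3.0, the local minimum

Starting to the right of the minimum, g is positive, so each update decreases w; the sequence converges toward w = 3.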

 