{
    "byline": null,
    "dir": null,
    "excerpt": "The big idea here is to use the geometric mean instead of the arithmetic mean across samples in the batch when computing the gradient for SGD. This overcomes the situation where averaging produces optima that are not actually optimal for any individual samples, as demonstrated in their toy example below:",
    "length": 895,
    "siteName": null,
    "title": "Learning explanations that are hard to vary"
}