Deep Learning: Why is the energy function of the Restricted Boltzmann Machine defined the way it is defined?

I know the energy function of the RBM has its roots in Hopfield networks and Ising models. Considering that there are many kinds of RBMs (binary, Gaussian, mcRBM, convolutional RBM, conditional RBM) and all of them have different energy functions, how did people come up with those energy functions? How can we create another energy function?
Answer:

Statistical mechanics has long played a large role in computational neuroscience and its application to machine learning research. The Ising model is the workhorse of statistical mechanics, and it is actually an OK model of the dynamics of actual neurons at certain stable points. This is seen through Jack Cowan's work at the University of Chicago and the Wilson-Cowan model of the dynamics of neurons (http://en.wikipedia.org/wiki/Wilson%E2%80%93Cowan_model). See:

Hugh R. Wilson and Jack D. Cowan, "Excitatory and Inhibitory Interactions in Localized Populations of Model Neurons" (1972), http://www.cell.com/biophysj/retrieve/pii/S0006349572860685

and this semi-recent review: "The Wilson-Cowan model, 36 years later" (2009), https://papers.cnl.salk.edu/PDFs/The%20Wilson-Cowan%20Model,%2036%20Years%20Later%202009-4146.pdf

It was Cowan, at the University of Chicago, who first proposed the sigmoid function that we see appearing in deep learning networks, although even he admits he did not see the connection to machine learning way back in the 1960s. The Wilson-Cowan paper was a key study: it was one of the first to use numerical methods to study model neurons, and it first demonstrated the existence of both multiple stable states and hysteresis, which are characteristics of non-equilibrium systems.

Soon after the Wilson-Cowan model was presented, it was recognized (by Little, 1974) that a simpler model could be employed at the stationary points (where the model satisfies detailed balance). That is, it was recognized in the early 70s that the Ising model, the classic physics model for these kinds of systems, would make a good model for memory.

So, early on, there were 2 major stat-mech inspired ML models:

the Hopfield network (http://en.wikipedia.org/wiki/Hopfield_network): J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities", Proceedings of the National Academy of Sciences of the USA, vol. 79 no. 8 pp. 2554-2558, April 1982.

and the Self-Organizing Map (http://en.wikipedia.org/wiki/Self-organizing_map): Kohonen, Teuvo (1982). "Self-Organized Formation of Topologically Correct Feature Maps". Biological Cybernetics 43 (1): 59-69.

These are like 'spherical cow' models of neurons, in that they don't actually describe real neurons but are mathematical abstractions designed to capture the 'essence' of the learning function.

This was also the time that chaos theory was becoming popular, and the minima of these models were taken to represent chaotic attractors. It was argued that the brain was a self-organized, chaotic system. This idea persists today.

It was soon recognized, however, that the associated computational models of ML were too complicated and did not converge well in numerical simulations. So efforts were made to improve convergence directly and/or to find approximate solutions that could be used to pre-train (i.e. seed the non-convex optimization problem). For example, another energy function that had been explored is the Neural Gas, introduced in 1991 by Martinetz and Schulten. This method changes the energy function as a means to speed up convergence of the Self-Organizing Map.

It was, however, always suspected that once a neural network got very large, it would behave like a convex function (i.e. the spin glass of minimal frustration): https://charlesmartin14.wordpress.com/2015/03/25/why-does-deep-learning-work/

Likewise, RBMs were recognized as much easier problems to solve than full-blown backprop networks, and they became very popular as a method for pre-training larger networks.

As of 2015, it appears that there is also a deep connection between deep learning and renormalization group theory (https://charlesmartin14.wordpress.com/2015/04/01/why-deep-learning-works-ii-the-renormalization-group/), and that the choice of the energy function may allow for an RG fixed point. (IMHO this is quite interesting, especially since recent work by Cowan suggests that the brain itself is operating at a subcritical point just below a phase transition.) That said, even a simple analysis shows that the RBM is equivalent to a Hopfield network in the thermodynamic limit: "On the equivalence of Hopfield Networks and Boltzmann Machines" (2012), http://arxiv.org/pdf/1105.2790v3.pdf

[Convolutional Neural Networks build on this early work, but are inspired by our current understanding of how the brain processes visual information. I assume this is beyond the question, which asks only about the energy function and not the structure of the network.]

Charles H Martin at Quora

Other answers

This is a good question, since the original papers on the Boltzmann Machine (BM) by Sejnowski et al. never showed the mathematical relation between the Boltzmann distribution and the corresponding activation function. The short answer is that the activation functions can be derived simply by Bayesian analysis. For a BM,

$$P(x) = \frac{\exp(-E(x))}{\sum_{x'} \exp(-E(x'))} \qquad (1)$$

where:
x = state of the BM
P(x) = probability of x generated by the BM (when running at equilibrium)
E(x) = the energy function, $E(x) = -\sum_{i<j} w_{ij} x_i x_j$ (ignoring biases, with symmetric weights $w_{ij} = w_{ji}$)

If you assume the x units are binary and derive $P(x_i \mid x_j : j \neq i)$ from equation 1, you obtain the sigmoidal stochastic activation function:

$$P(x_i = 1 \mid x_j : j \neq i) = \mathrm{sigmoid}\Big(\sum_{j \neq i} w_{ij} x_j\Big) \qquad (2)$$

i.e., the standard activation function of a BM unit (ignoring biases), obtained by using Bayes' rule with eq. 1 to get eq. 2.

A good paper on generalizing the BM, or really on generalizing Restricted Boltzmann Machines, is by Hinton, Rosen-Zvi, and Welling, called "Exponential Family Harmoniums with an Application to Information Retrieval." The different forms of RBMs derived in that paper can be reached by using Bayes' rule as above.

From eqs. 1 and 2 it's not hard to show that the transition dynamics of a BM obey "detailed balance", where the probability of being on the transition from x to x' is the same as being on the transition in reverse, i.e., $P(x) P(x \to x') = P(x') P(x' \to x)$, where P(x) refers to the probability of being in state x under the equilibrium distribution of the BM. This means that the BM has no information in it with respect to the sequence of the states, at least in equilibrium (it asserts acausality). From a practical point of view it might be easier to use the detailed balance equation to derive, e.g., the activation functions than the method described in the previous paragraph.
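
The two claims above (that the Boltzmann distribution implies a sigmoid conditional, and that single-unit resampling satisfies detailed balance) can be checked numerically on a tiny network. The sketch below is only an illustration, not anything from the original papers: it assumes binary units in {0, 1}, symmetric weights with zero diagonal, no biases, and the energy $E(x) = -\sum_{i<j} w_{ij} x_i x_j$. The names `W`, `states`, `index`, `max_gap` and `db_gap` are made up for this example.

```python
import itertools
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n = 4  # small enough to enumerate all 2**n states
rng = np.random.default_rng(0)
W = np.triu(rng.normal(size=(n, n)), k=1)
W = W + W.T  # symmetric weights, zero diagonal (assumption)

# Enumerate every state and build the Boltzmann distribution of eq. 1.
states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
energies = -0.5 * np.einsum("si,ij,sj->s", states, W, states)  # -sum_{i<j} w_ij x_i x_j
P = np.exp(-energies)
P /= P.sum()

powers = 2 ** np.arange(n - 1, -1, -1)
def index(x):
    """Row of `states` that holds the binary vector x."""
    return int(x @ powers)

max_gap = 0.0  # largest |conditional from the joint - sigmoid(...)|
db_gap = 0.0   # largest violation of detailed balance
for x in states:
    for i in range(n):
        x1, x0 = x.copy(), x.copy()
        x1[i], x0[i] = 1.0, 0.0
        from_joint = P[index(x1)] / (P[index(x1)] + P[index(x0)])
        p_on = sigmoid(W[i] @ x)  # eq. 2; W[i, i] = 0, so x_i drops out
        max_gap = max(max_gap, abs(from_joint - p_on))

        # Move that resamples unit i and lands on y (x with unit i flipped).
        y = x.copy()
        y[i] = 1.0 - x[i]
        t_xy = p_on if y[i] == 1.0 else 1.0 - p_on  # P(x -> y)
        t_yx = 1.0 - t_xy                           # P(y -> x)
        db_gap = max(db_gap, abs(P[index(x)] * t_xy - P[index(y)] * t_yx))

print(max_gap, db_gap)
```

Both gaps should come out at the level of floating-point round-off. The probability of picking unit i is the same in both directions, so it cancels and is left out of `t_xy` and `t_yx`.
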
It is also interesting to consider what kind of distribution will be generated by a fully-connected network like the BM but with some arbitrarily concocted unit activation function(s). Generally such a distribution will not be in detailed balance, and in fact there may not even be an equilibrium distribution.

I am wondering if the author of this question has a mindset somewhat like mine, since this question loomed large in my mind when I first started studying this. Being a mechanical engineer and more apt to think about the "visceral" operation of the BM, it amazed me that its behavior can be reduced to a very simple equation. But at least now I see how it works more viscerally than I used to. The key is that the equilibrium distribution of the BM is implied by the summation over all states in the partition function (the denominator of eq. 1), i.e., the latter is performing an expectation. Perhaps the more interesting part is how the summation over all the states reduces to a summation over all local states when finding the activation function (eq. 2).
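
One way to see how the sum over all states collapses to a local sum is to write the conditional out explicitly. The short derivation below is a sketch under the same assumptions as eq. 1 (binary units, symmetric weights, no biases, $E(x) = -\sum_{i<j} w_{ij} x_i x_j$). The symbols $E_1$ and $E_0$ are introduced here for convenience: they are the energies of the two states that agree on every unit except $x_i$, with $x_i = 1$ and $x_i = 0$ respectively.

```latex
\begin{aligned}
P(x_i = 1 \mid x_j : j \neq i)
  &= \frac{P(x_i = 1,\, x_{j \neq i})}{P(x_i = 1,\, x_{j \neq i}) + P(x_i = 0,\, x_{j \neq i})} \\
  &= \frac{e^{-E_1}/Z}{e^{-E_1}/Z + e^{-E_0}/Z}
   = \frac{1}{1 + e^{E_1 - E_0}}, \\
E_1 - E_0 &= -\sum_{j \neq i} w_{ij} x_j
  \quad \text{(every term without } x_i \text{ cancels)}, \\
P(x_i = 1 \mid x_j : j \neq i)
  &= \frac{1}{1 + \exp\!\big(-\sum_{j \neq i} w_{ij} x_j\big)}
   = \mathrm{sigmoid}\Big(\sum_{j \neq i} w_{ij} x_j\Big).
\end{aligned}
```

The partition function $Z$, which is the sum over all states, appears in both numerator and denominator and cancels, and so does every energy term that does not involve $x_i$. Only the local sum over the neighbours of unit $i$ survives.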

John Jameson

All our efforts in machine learning are devoted to designing discriminators of some kind. That is, we want to design systems which discriminate good inputs from bad ones. We also want to make sure that the system learns to discriminate between the inputs in a predictable way. One such system is the energy-based model. In an energy-based model, we draw an analogy with thermodynamic systems. Learning is done by lowering the system energy for desirable inputs and not lowering it for undesirable inputs. Initial efforts at creating energy-based models tried to mimic thermodynamic systems. As always, learning from nature is the best way. You can define your own energy function (or whatever you call it); as long as you can define an efficient learning procedure for it, you are doing it right.

Anonymous

Your question suggests 2 different ways to modify the energy function: (1) different types of units and (2) more types of connections. For the first, you can use any distributions that can be described in the exponential family and then combine them multiplicatively. The combined distribution will also belong to the exponential family. Then you can decompose the gradient of the log-likelihood into data-dependent and data-independent expectations. For the second, you can also add connections among units, but now the model is "unrestricted" and more complicated (e.g., conditional RBM and Boltzmann machine). You will have to resort to approximation to estimate the data-dependent expectation. The computational cost also increases.
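
The decomposition into data-dependent and data-independent expectations can be made concrete on an RBM small enough to enumerate exactly. The sketch below is an illustration under stated assumptions, not the procedure of any particular paper: it assumes a binary RBM with energy $E(v, h) = -v^T W h$ and no biases, computes the gradient of the mean log-likelihood as (expectation under the data) minus (expectation under the model), and compares it with a finite-difference estimate. All function and variable names are made up for this example. Real RBMs are far too large for the exact model expectation, which is why sampling-based approximations are used in practice.

```python
import itertools
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def all_binary(n):
    """All 2**n binary vectors of length n, one per row."""
    return np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def log_unnorm(vs, W, H):
    # log sum_h exp(-E(v, h)) with E(v, h) = -v^T W h, for each row v of vs
    return np.log(np.exp(vs @ W @ H.T).sum(axis=1))

def log_likelihood(W, data, V, H):
    log_Z = np.log(np.exp(log_unnorm(V, W, H)).sum())  # exact partition function
    return (log_unnorm(data, W, H) - log_Z).mean()

def gradient(W, data, V, H):
    # Data-dependent term: <v h^T> with v from the data and h ~ P(h | v).
    positive = data.T @ sigmoid(data @ W) / len(data)
    # Data-independent term: <v h^T> under the model's own distribution.
    p_v = np.exp(log_unnorm(V, W, H))
    p_v /= p_v.sum()
    negative = (V * p_v[:, None]).T @ sigmoid(V @ W)
    return positive - negative

n_visible, n_hidden = 3, 2
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(n_visible, n_hidden))
data = rng.integers(0, 2, size=(6, n_visible)).astype(float)  # toy data
V, H = all_binary(n_visible), all_binary(n_hidden)

grad = gradient(W, data, V, H)

# Central finite differences as an independent check of the formula.
eps = 1e-5
numeric = np.zeros_like(W)
for i in range(n_visible):
    for j in range(n_hidden):
        dW = np.zeros_like(W)
        dW[i, j] = eps
        numeric[i, j] = (log_likelihood(W + dW, data, V, H)
                         - log_likelihood(W - dW, data, V, H)) / (2 * eps)

max_err = np.abs(grad - numeric).max()
print(max_err)
```

Changing the unit type changes `sigmoid(...)` into the conditional mean of whatever exponential-family distribution the hidden units follow; the "data term minus model term" shape of the gradient stays the same.
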

Tu Dinh Nguyen

I think that although the RBM has its roots in Hopfield nets and Ising models, it is also motivated by the Boltzmann distribution. If we think about the RBM in the context of the Boltzmann distribution, finding the energy function is just solving a functional equation, of which the current energy function is the simplest solution.

Anh T. Hoang
