
Compression and modularity of probabilistic models

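Consider a probability space with sample space \( I \times G \times S \), where intelligence \( I = \{i^0, i^1\} \), grade \( G = \{g^0, g^1, g^2 \} \) and SAT score \( S = \{s^0, s^1\} \). The sample space has \( 2 \times 3 \times 2 = 12 \) outcomes, so defining a probability measure on it in general requires 11 independent parameters: one probability per outcome, less one because the probabilities must sum to 1.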

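[From here on, we abuse the notation for random variables and probability distributions: \( I \), \( G \) and \( S \) denote both the sets above and the random variables taking values in them.]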

If we know that grade and SAT score are conditionally independent given intelligence (once the intelligence is known, also knowing the grade gives no extra information about the SAT score), then we can write:
\[ P(I, S, G) = P(S \vert I)P(G \vert I)P(I) \]
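
As a short sketch of where this comes from, using only the chain rule and the independence assumption stated above: the chain rule always gives
\[ P(I, S, G) = P(I) P(G \vert I) P(S \vert I, G) \]
and conditional independence of \( S \) and \( G \) given \( I \) means
\[ P(S \vert I, G) = P(S \vert I) \]
Substituting the second equation into the first yields \( P(I, S, G) = P(S \vert I)P(G \vert I)P(I) \).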
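This is a factorization of the joint distribution into a product of three conditional probability distributions. The parameterization involves three Bernoulli distributions, \( P(I) \), \( P(S \vert i^0) \) and \( P(S \vert i^1) \), and two three-valued multinomial distributions, \( P(G \vert i^0) \) and \( P(G \vert i^1) \). Each Bernoulli distribution needs one independent parameter and each three-valued multinomial needs two, so the total is \( 3 \times 1 + 2 \times 2 = 7 \) independent parameters, compared with 11 for the explicit joint. The factored representation is therefore more compact.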
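
The parameter counts can be checked with a minimal Python sketch. The probability values and the names `P_I`, `P_S_given_I`, `P_G_given_I`, `free_params` and `p_s_given_i_g` are assumptions made up for illustration; only the shapes of the distributions come from the text above.

```python
from itertools import product

# Hypothetical CPD values, chosen only for illustration.
P_I = {"i0": 0.7, "i1": 0.3}
P_S_given_I = {
    "i0": {"s0": 0.95, "s1": 0.05},
    "i1": {"s0": 0.2, "s1": 0.8},
}
P_G_given_I = {
    "i0": {"g0": 0.2, "g1": 0.34, "g2": 0.46},
    "i1": {"g0": 0.74, "g1": 0.17, "g2": 0.09},
}


def free_params(dist):
    """A distribution over k values has k - 1 independent parameters."""
    return len(dist) - 1


# One Bernoulli for I, one per value of I for S, one multinomial per value of I for G.
factored_count = (
    free_params(P_I)
    + sum(free_params(d) for d in P_S_given_I.values())
    + sum(free_params(d) for d in P_G_given_I.values())
)

# Build the explicit joint P(I, S, G) = P(I) P(S | I) P(G | I).
joint = {
    (i, s, g): P_I[i] * P_S_given_I[i][s] * P_G_given_I[i][g]
    for i, s, g in product(P_I, P_S_given_I["i0"], P_G_given_I["i0"])
}
explicit_count = len(joint) - 1  # 12 entries, minus the sum-to-one constraint


def p_s_given_i_g(s, i, g):
    """P(S = s | I = i, G = g), computed from the explicit joint."""
    denom = sum(joint[(i, s2, g)] for s2 in P_S_given_I[i])
    return joint[(i, s, g)] / denom


print(factored_count, explicit_count)  # prints: 7 11
```

Because the joint is built from the factorization, `p_s_given_i_g(s, i, g)` agrees with `P_S_given_I[i][s]` for every grade `g` (up to floating-point error), which is the conditional independence assumption restated numerically.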

Another advantage of this way of representing the joint is modularity. Suppose we had started with a model over \( I \) and \( S \) only and then added the new variable \( G \). The joint distribution changes entirely: had we used the explicit representation of the joint, we would have had to write down twelve new numbers. In the factored representation, we can reuse our local probability models for the variables \( I \) and \( S \), and specify only the probability model for \( G \), namely the conditional probability distribution (CPD) \( P(G \vert I) \). This property will turn out to be invaluable in modeling real-world systems.
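
A small Python sketch of the modularity argument, again with invented probability values and hypothetical helper names (`joint_IS`, `joint_ISG`): the two-variable model is defined first, and extending it with \( G \) only requires one new CPD while the existing local models are reused unchanged.

```python
from itertools import product

# Existing local models for I and S (hypothetical values for illustration).
P_I = {"i0": 0.7, "i1": 0.3}
P_S_given_I = {
    "i0": {"s0": 0.95, "s1": 0.05},
    "i1": {"s0": 0.2, "s1": 0.8},
}
S_VALUES = ["s0", "s1"]


def joint_IS(i, s):
    """Original two-variable model: P(I, S) = P(I) P(S | I)."""
    return P_I[i] * P_S_given_I[i][s]


# Adding G: the only new ingredient is the CPD P(G | I) (values are made up).
P_G_given_I = {
    "i0": {"g0": 0.2, "g1": 0.34, "g2": 0.46},
    "i1": {"g0": 0.74, "g1": 0.17, "g2": 0.09},
}
G_VALUES = ["g0", "g1", "g2"]


def joint_ISG(i, s, g):
    """Extended model: reuses joint_IS and multiplies in the new CPD."""
    return joint_IS(i, s) * P_G_given_I[i][g]


# An explicit table for the extended joint would need every entry rewritten.
explicit_table = {
    (i, s, g): joint_ISG(i, s, g)
    for i, s, g in product(P_I, S_VALUES, G_VALUES)
}

# Summing G out of the extended model recovers the original P(I, S).
marginal_IS = {
    (i, s): sum(joint_ISG(i, s, g) for g in G_VALUES)
    for i, s in product(P_I, S_VALUES)
}
```

Here `explicit_table` has twelve entries, none of which appears in the original four-entry table for \( P(I, S) \), whereas the factored model gained only the single dictionary `P_G_given_I`.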