In statistics, the matrix normal distribution or matrix Gaussian distribution is a probability distribution that is a generalization of the multivariate normal distribution to matrix-valued random variables.
The probability density function for the random matrix $\mathbf{X}$ ($n \times p$) that follows the matrix normal distribution $\mathcal{MN}_{n,p}(\mathbf{M}, \mathbf{U}, \mathbf{V})$ has the form:
$$p(\mathbf{X} \mid \mathbf{M}, \mathbf{U}, \mathbf{V}) = \frac{\exp\left(-\frac{1}{2}\,\operatorname{tr}\left[\mathbf{V}^{-1}(\mathbf{X}-\mathbf{M})^{T}\mathbf{U}^{-1}(\mathbf{X}-\mathbf{M})\right]\right)}{(2\pi)^{np/2}\,|\mathbf{V}|^{n/2}\,|\mathbf{U}|^{p/2}}$$
where $\operatorname{tr}$ denotes the trace, $\mathbf{M}$ is $n \times p$, $\mathbf{U}$ is $n \times n$ and $\mathbf{V}$ is $p \times p$, and the density is understood as the probability density function with respect to the standard Lebesgue measure in $\mathbb{R}^{n \times p}$, i.e. the measure corresponding to integration with respect to $dx_{11}\, dx_{21} \dots dx_{n1}\, dx_{12} \dots dx_{n2} \dots dx_{np}$.
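For concreteness, the density can be evaluated directly from this formula. The following is a minimal NumPy sketch (the function name and the use of linear solves in place of explicit inverses are our own choices, not a standard API):

```python
import numpy as np

def matrix_normal_logpdf(X, M, U, V):
    """Log-density of MN_{n,p}(M, U, V) at X, following the formula above."""
    n, p = X.shape
    D = X - M
    # tr[V^{-1} (X-M)^T U^{-1} (X-M)], via solves rather than explicit inverses
    quad = np.trace(np.linalg.solve(V, D.T) @ np.linalg.solve(U, D))
    # log of the normalizing constant (2*pi)^{np/2} |V|^{n/2} |U|^{p/2}
    _, logdet_U = np.linalg.slogdet(U)
    _, logdet_V = np.linalg.slogdet(V)
    log_norm = 0.5 * (n * p * np.log(2 * np.pi) + n * logdet_V + p * logdet_U)
    return -0.5 * quad - log_norm
```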
The matrix normal is related to the multivariate normal distribution in the following way:
$$\mathbf{X} \sim \mathcal{MN}_{n \times p}(\mathbf{M}, \mathbf{U}, \mathbf{V}),$$
if and only if
$$\operatorname{vec}(\mathbf{X}) \sim \mathcal{N}_{np}(\operatorname{vec}(\mathbf{M}), \mathbf{V} \otimes \mathbf{U})$$
where $\otimes$ denotes the Kronecker product and $\operatorname{vec}(\mathbf{M})$ denotes the vectorization of $\mathbf{M}$.
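This equivalence is easy to check numerically. The sketch below reuses the hypothetical matrix_normal_logpdf helper from above and compares it against SciPy's multivariate normal on the vectorized problem (note that $\operatorname{vec}$ stacks columns, i.e. column-major flattening):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n, p = 3, 2
M = rng.normal(size=(n, p))
A = rng.normal(size=(n, n)); U = A @ A.T + n * np.eye(n)  # SPD among-row covariance
B = rng.normal(size=(p, p)); V = B @ B.T + p * np.eye(p)  # SPD among-column covariance
X = rng.normal(size=(n, p))

# vec() stacks columns, which is column-major ('F') flattening in NumPy
lhs = matrix_normal_logpdf(X, M, U, V)
rhs = multivariate_normal(M.flatten('F'), np.kron(V, U)).logpdf(X.flatten('F'))
assert np.isclose(lhs, rhs)
```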
The equivalence between the above matrix normal and multivariate normal density functions can be shown using several properties of the trace and Kronecker product, as follows. We start with the argument of the exponent of the matrix normal PDF:
$$\begin{aligned}
&\quad -\frac{1}{2}\operatorname{tr}\left[\mathbf{V}^{-1}(\mathbf{X}-\mathbf{M})^{T}\mathbf{U}^{-1}(\mathbf{X}-\mathbf{M})\right]\\
&= -\frac{1}{2}\operatorname{vec}(\mathbf{X}-\mathbf{M})^{T}\operatorname{vec}\left(\mathbf{U}^{-1}(\mathbf{X}-\mathbf{M})\mathbf{V}^{-1}\right)\\
&= -\frac{1}{2}\operatorname{vec}(\mathbf{X}-\mathbf{M})^{T}\left(\mathbf{V}^{-1}\otimes\mathbf{U}^{-1}\right)\operatorname{vec}(\mathbf{X}-\mathbf{M})\\
&= -\frac{1}{2}\left[\operatorname{vec}(\mathbf{X})-\operatorname{vec}(\mathbf{M})\right]^{T}(\mathbf{V}\otimes\mathbf{U})^{-1}\left[\operatorname{vec}(\mathbf{X})-\operatorname{vec}(\mathbf{M})\right]
\end{aligned}$$
which is the argument of the exponent of the multivariate normal PDF with respect to Lebesgue measure in $\mathbb{R}^{np}$. (The three steps use, respectively, the identity $\operatorname{tr}(\mathbf{A}^{T}\mathbf{B}) = \operatorname{vec}(\mathbf{A})^{T}\operatorname{vec}(\mathbf{B})$ together with the cyclic property of the trace, the identity $\operatorname{vec}(\mathbf{ABC}) = (\mathbf{C}^{T} \otimes \mathbf{A})\operatorname{vec}(\mathbf{B})$, and the fact that $(\mathbf{V} \otimes \mathbf{U})^{-1} = \mathbf{V}^{-1} \otimes \mathbf{U}^{-1}$.) The proof is completed by using the determinant property:
$$|\mathbf{V} \otimes \mathbf{U}| = |\mathbf{V}|^{n}\,|\mathbf{U}|^{p}.$$
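This determinant identity (here $\mathbf{V} \otimes \mathbf{U}$ is $np \times np$) can likewise be spot-checked numerically, for example:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 4, 3
U = np.diag(rng.uniform(1.0, 2.0, size=n))  # simple SPD examples; any SPD U, V work
V = np.diag(rng.uniform(1.0, 2.0, size=p))
assert np.isclose(np.linalg.det(np.kron(V, U)),
                  np.linalg.det(V) ** n * np.linalg.det(U) ** p)
```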
If $\mathbf{X} \sim \mathcal{MN}_{n \times p}(\mathbf{M}, \mathbf{U}, \mathbf{V})$, then we have the following properties:[1][2]
The mean, or expected value, is:
$$E[\mathbf{X}] = \mathbf{M}$$
and we have the following second-order expectations:
$$E[(\mathbf{X}-\mathbf{M})(\mathbf{X}-\mathbf{M})^{T}] = \mathbf{U}\operatorname{tr}(\mathbf{V})$$
$$E[(\mathbf{X}-\mathbf{M})^{T}(\mathbf{X}-\mathbf{M})] = \mathbf{V}\operatorname{tr}(\mathbf{U})$$
where $\operatorname{tr}$ denotes the trace.
More generally, for appropriately dimensioned matrices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$:
$$\begin{aligned}
E[\mathbf{X}\mathbf{A}\mathbf{X}^{T}] &= \mathbf{U}\operatorname{tr}(\mathbf{A}^{T}\mathbf{V}) + \mathbf{M}\mathbf{A}\mathbf{M}^{T}\\
E[\mathbf{X}^{T}\mathbf{B}\mathbf{X}] &= \mathbf{V}\operatorname{tr}(\mathbf{U}\mathbf{B}^{T}) + \mathbf{M}^{T}\mathbf{B}\mathbf{M}\\
E[\mathbf{X}\mathbf{C}\mathbf{X}] &= \mathbf{U}\mathbf{C}^{T}\mathbf{V} + \mathbf{M}\mathbf{C}\mathbf{M}
\end{aligned}$$
(In the third identity $\mathbf{C}$ is $p \times n$; note that $\mathbf{U}\mathbf{C}^{T}\mathbf{V}$, of size $n \times p$, is the dimensionally consistent form, which follows from $\operatorname{Cov}(X_{ij}, X_{kl}) = U_{ik}V_{jl}$.)
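These identities can be verified by Monte Carlo, sampling via the Cholesky-based recipe described under "Drawing values from the distribution" below. A sketch for the first identity, where the specific $\mathbf{U}$, $\mathbf{V}$ and sample size are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 4, 3, 200_000
M = rng.normal(size=(n, p))
U = np.diag([1.0, 2.0, 0.5, 1.5])              # illustrative SPD covariances
V = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 0.5]])
A = rng.normal(size=(p, p))
LU, LV = np.linalg.cholesky(U), np.linalg.cholesky(V)
Z = rng.normal(size=(k, n, p))                 # k draws from MN(0, I, I)
Xs = M + LU @ Z @ LV.T                         # k draws from MN(M, U, V)
mc = (Xs @ A @ np.swapaxes(Xs, 1, 2)).mean(axis=0)
exact = U * np.trace(A.T @ V) + M @ A @ M.T
print(np.max(np.abs(mc - exact)))              # small, shrinking as k grows
```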
Transpose transform:
$$\mathbf{X}^{T} \sim \mathcal{MN}_{p \times n}(\mathbf{M}^{T}, \mathbf{V}, \mathbf{U})$$
Linear transform: let $\mathbf{D}$ ($r \times n$) be of full rank $r \le n$ and $\mathbf{C}$ ($p \times s$) be of full rank $s \le p$; then:
$$\mathbf{D}\mathbf{X}\mathbf{C} \sim \mathcal{MN}_{r \times s}(\mathbf{D}\mathbf{M}\mathbf{C}, \mathbf{D}\mathbf{U}\mathbf{D}^{T}, \mathbf{C}^{T}\mathbf{V}\mathbf{C})$$
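One way to see this is through the vectorized form: $\operatorname{vec}(\mathbf{DXC}) = (\mathbf{C}^{T} \otimes \mathbf{D})\operatorname{vec}(\mathbf{X})$, so the covariance of the vectorization becomes $(\mathbf{C}^{T}\mathbf{V}\mathbf{C}) \otimes (\mathbf{D}\mathbf{U}\mathbf{D}^{T})$. A quick numerical confirmation of that Kronecker algebra:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, r, s = 4, 3, 2, 2
Au = rng.normal(size=(n, n)); U = Au @ Au.T + np.eye(n)  # SPD examples
Av = rng.normal(size=(p, p)); V = Av @ Av.T + np.eye(p)
D = rng.normal(size=(r, n))
C = rng.normal(size=(p, s))
T = np.kron(C.T, D)                     # vec(D X C) = T vec(X)
assert np.allclose(T @ np.kron(V, U) @ T.T,
                   np.kron(C.T @ V @ C, D @ U @ D.T))
```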
Suppose we have a sample of $n$ independent $p$-dimensional random variables identically distributed according to a multivariate normal distribution:
$$\mathbf{Y}_i \sim \mathcal{N}_{p}({\boldsymbol{\mu}}, {\boldsymbol{\Sigma}}) \text{ with } i \in \{1, \ldots, n\}.$$
When defining the $n \times p$ matrix $\mathbf{X}$ for which the $i$th row is $\mathbf{Y}_i$, we obtain:
$$\mathbf{X} \sim \mathcal{MN}_{n \times p}(\mathbf{M}, \mathbf{U}, \mathbf{V})$$
where each row of $\mathbf{M}$ is equal to ${\boldsymbol{\mu}}$, that is $\mathbf{M} = \mathbf{1}_n {\boldsymbol{\mu}}^{T}$; $\mathbf{U}$ is the $n \times n$ identity matrix, reflecting the independence of the rows; and $\mathbf{V} = {\boldsymbol{\Sigma}}$.
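In code, this construction is simply stacking i.i.d. multivariate normal draws as rows; a small sketch with illustrative values of ${\boldsymbol{\mu}}$ and ${\boldsymbol{\Sigma}}$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 5, 3
mu = np.array([1.0, -2.0, 0.5])                 # illustrative mean and covariance
Sigma = np.array([[1.0, 0.2, 0.0],
                  [0.2, 2.0, 0.3],
                  [0.0, 0.3, 0.5]])
# Rows are i.i.d. Y_i ~ N_p(mu, Sigma), so X ~ MN_{n x p}(1_n mu^T, I_n, Sigma)
X = rng.multivariate_normal(mu, Sigma, size=n)  # shape (n, p)
M = np.ones((n, 1)) * mu                        # every row of M equals mu
```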
Maximum likelihood parameter estimation
Given $k$ matrices, each of size $n \times p$, denoted $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_k$, which we assume have been sampled i.i.d. from a matrix normal distribution, the maximum likelihood estimate of the parameters can be obtained by maximizing:
$$\prod_{i=1}^{k} \mathcal{MN}_{n \times p}(\mathbf{X}_i \mid \mathbf{M}, \mathbf{U}, \mathbf{V}).$$
The solution for the mean has a closed form, namely
$$\mathbf{M} = \frac{1}{k} \sum_{i=1}^{k} \mathbf{X}_i,$$
but the covariance parameters do not. However, these parameters can be iteratively maximized by zeroing their gradients at:
$$\mathbf{U} = \frac{1}{kp} \sum_{i=1}^{k} (\mathbf{X}_i - \mathbf{M})\,\mathbf{V}^{-1}\,(\mathbf{X}_i - \mathbf{M})^{T}$$
and
$$\mathbf{V} = \frac{1}{kn} \sum_{i=1}^{k} (\mathbf{X}_i - \mathbf{M})^{T}\,\mathbf{U}^{-1}\,(\mathbf{X}_i - \mathbf{M}).$$
See, for example,[3] and references therein. The covariance parameters are non-identifiable in the sense that for any scale factor $s > 0$ we have:
$$\mathcal{MN}_{n \times p}(\mathbf{X} \mid \mathbf{M}, \mathbf{U}, \mathbf{V}) = \mathcal{MN}_{n \times p}(\mathbf{X} \mid \mathbf{M}, s\mathbf{U}, \tfrac{1}{s}\mathbf{V}).$$
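The two coupled updates above suggest an alternating scheme (sometimes called "flip-flop" estimation in this literature; see also the EM algorithm of [3]). A minimal NumPy sketch, where the function name is hypothetical and normalizing $\operatorname{tr}(\mathbf{U}) = n$ is just one arbitrary way to pin down the $s$ versus $1/s$ ambiguity:

```python
import numpy as np

def matrix_normal_mle(Xs, iters=50):
    """Alternating ('flip-flop') MLE for i.i.d. samples Xs of shape (k, n, p)."""
    k, n, p = Xs.shape
    M = Xs.mean(axis=0)                  # closed-form MLE of the mean
    D = Xs - M
    Dt = np.swapaxes(D, 1, 2)            # residuals transposed, shape (k, p, n)
    U, V = np.eye(n), np.eye(p)
    for _ in range(iters):
        # U-step: (1/kp) sum_i (X_i - M) V^{-1} (X_i - M)^T
        U = (D @ np.linalg.solve(V, Dt)).sum(axis=0) / (k * p)
        U *= n / np.trace(U)             # resolve the scale ambiguity: tr(U) = n
        # V-step: (1/kn) sum_i (X_i - M)^T U^{-1} (X_i - M)
        V = (Dt @ np.linalg.solve(U, D)).sum(axis=0) / (k * n)
    return M, U, V
```

In practice one would iterate until $\mathbf{U}$ and $\mathbf{V}$ stabilize rather than for a fixed number of sweeps.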
Drawing values from the distribution
Sampling from the matrix normal distribution is a special case of the sampling procedure for the multivariate normal distribution. Let $\mathbf{X}$ be an $n \times p$ matrix of $np$ independent samples from the standard normal distribution, so that
$$\mathbf{X} \sim \mathcal{MN}_{n \times p}(\mathbf{0}, \mathbf{I}, \mathbf{I}).$$
Then let
$$\mathbf{Y} = \mathbf{M} + \mathbf{A}\mathbf{X}\mathbf{B},$$
so that
$$\mathbf{Y} \sim \mathcal{MN}_{n \times p}(\mathbf{M}, \mathbf{A}\mathbf{A}^{T}, \mathbf{B}^{T}\mathbf{B}),$$
where A and B can be chosen by Cholesky decomposition or a similar matrix square root operation.
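A direct implementation of this recipe (the function name is hypothetical; np.linalg.cholesky returns a lower-triangular factor, so the column factor is transposed to satisfy $\mathbf{B}^{T}\mathbf{B} = \mathbf{V}$):

```python
import numpy as np

def matrix_normal_sample(M, U, V, rng):
    """Draw Y ~ MN(M, U, V) as Y = M + A Z B with A A^T = U and B^T B = V."""
    n, p = M.shape
    A = np.linalg.cholesky(U)        # lower triangular: A @ A.T == U
    B = np.linalg.cholesky(V).T      # upper triangular: B.T @ B == V
    Z = rng.normal(size=(n, p))      # Z ~ MN(0, I, I)
    return M + A @ Z @ B
```

For example, `Y = matrix_normal_sample(M, U, V, np.random.default_rng(0))` draws one matrix.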
Relation to other distributions
Dawid (1981) provides a discussion of the relation of the matrix-valued normal distribution to other distributions, including the Wishart distribution, inverse-Wishart distribution and matrix t-distribution, but uses different notation from that employed here.
^ Gupta, A. K.; Nagar, D. K. (1999). "Chapter 2: Matrix Variate Normal Distribution". Matrix Variate Distributions. CRC Press. ISBN 978-1-58488-046-2.
^ Ding, Shanshan; Cook, R. Dennis (2014). "Dimension folding PCA and PFC for matrix-valued predictors". Statistica Sinica. 24 (1): 463–492. JSTOR 26432553.
^ Glanz, Hunter; Carvalho, Luis (2013). "An Expectation-Maximization Algorithm for the Matrix Normal Distribution". arXiv:1309.6609 [stat.ME].