Notation

This section provides a concise reference describing the notation used throughout this book. If you are unfamiliar with any of the corresponding mathematical concepts, we describe most of these ideas in chapters 2–4.

Numbers and Arrays

$a$  A scalar (integer or real)

$\boldsymbol{a}$  A vector

$\boldsymbol{A}$  A matrix

$\mathsf{A}$  A tensor

$\boldsymbol{I}_n$  Identity matrix with $n$ rows and $n$ columns

$\boldsymbol{I}$  Identity matrix with dimensionality implied by context

$\boldsymbol{e}^{(i)}$  Standard basis vector $[0, \dots, 0]$ with a 1 at position $i$

$\operatorname{diag}(\boldsymbol{a})$  A square, diagonal matrix with diagonal entries given by $\boldsymbol{a}$

$\mathrm{a}$  A scalar random variable

$\mathbf{a}$  A vector-valued random variable

$\mathbf{A}$  A matrix-valued random variable
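To make the array entries above concrete, here is a minimal sketch in Python with NumPy. NumPy is our choice for illustration and is not part of the notation itself; the helper name `standard_basis` is hypothetical. Note that NumPy counts from 0, while the notation counts positions from 1.

```python
import numpy as np

n = 3
I_n = np.eye(n)  # identity matrix with n rows and n columns


def standard_basis(i, n):
    """Return a length-n vector of zeros with a 1 at position i (1-based)."""
    e = np.zeros(n)
    e[i - 1] = 1.0  # shift by one because NumPy indexing starts at 0
    return e


a = np.array([2.0, 5.0, 7.0])  # a vector
D = np.diag(a)                 # square, diagonal matrix with diagonal entries given by a
e2 = standard_basis(2, n)      # 1 at position 2
```

Multiplying by the identity leaves a vector unchanged, so `I_n @ a` should equal `a`.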


CONTENTS

Sets and Graphs

$\mathbb{A}$  A set

$\mathbb{R}$  The set of real numbers

$\{0, 1\}$  The set containing 0 and 1

$\{0, 1, \dots, n\}$  The set of all integers between 0 and $n$

$[a, b]$  The real interval including $a$ and $b$

$(a, b]$  The real interval excluding $a$ but including $b$

$\mathbb{A} \backslash \mathbb{B}$  Set subtraction, i.e., the set containing the elements of $\mathbb{A}$ that are not in $\mathbb{B}$

$\mathcal{G}$  A graph

$Pa_{\mathcal{G}}(\mathrm{x}_i)$  The parents of $\mathrm{x}_i$ in $\mathcal{G}$

Indexing

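$a_i$  Element $i$ of vector $\boldsymbol{a}$, with indexing starting at 1

$a_{-i}$  All elements of vector $\boldsymbol{a}$ except for element $i$

$A_{i,j}$  Element $i, j$ of matrix $\boldsymbol{A}$

$\boldsymbol{A}_{i,:}$  Row $i$ of matrix $\boldsymbol{A}$

$\boldsymbol{A}_{:,i}$  Column $i$ of matrix $\boldsymbol{A}$

$\mathsf{A}_{i,j,k}$  Element $(i, j, k)$ of a 3-D tensor $\mathsf{A}$

$\mathsf{A}_{:,:,i}$  2-D slice of a 3-D tensor

$\mathrm{a}_i$  Element $i$ of the random vector $\mathbf{a}$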
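The indexing conventions above map fairly directly onto array slicing. The following is a small sketch using NumPy (our choice of library, with made-up example arrays); because the notation starts at 1 and NumPy starts at 0, every index is shifted by one.

```python
import numpy as np

a = np.array([10, 20, 30, 40])
i = 2                                # 1-based index, as in the notation

a_i = a[i - 1]                       # element i of vector a
a_minus_i = np.delete(a, i - 1)      # all elements of a except element i

A = np.arange(1, 7).reshape(2, 3)    # [[1, 2, 3], [4, 5, 6]]
A_12 = A[0, 1]                       # element 1, 2 of matrix A
row_1 = A[0, :]                      # row 1 of matrix A
col_3 = A[:, 2]                      # column 3 of matrix A

T = np.arange(24).reshape(2, 3, 4)   # a 3-D tensor
T_123 = T[0, 1, 2]                   # element (1, 2, 3) of the tensor
slice_1 = T[:, :, 0]                 # 2-D slice obtained by fixing the last index to 1
```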

Linear Algebra Operations



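$\boldsymbol{A}^\top$  Transpose of matrix $\boldsymbol{A}$

$\boldsymbol{A}^+$  Moore-Penrose pseudoinverse of $\boldsymbol{A}$

$\boldsymbol{A} \odot \boldsymbol{B}$  Element-wise (Hadamard) product of $\boldsymbol{A}$ and $\boldsymbol{B}$

$\det(\boldsymbol{A})$  Determinant of $\boldsymbol{A}$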
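As an illustration of the four operations above, here is a short NumPy sketch with two small example matrices of our own choosing. Note that `*` on NumPy arrays is the element-wise product, while `@` is the ordinary matrix product.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

A_T = A.T                    # transpose of A
A_pinv = np.linalg.pinv(A)   # Moore-Penrose pseudoinverse of A
hadamard = A * B             # element-wise (Hadamard) product of A and B
det_A = np.linalg.det(A)     # determinant of A, here 1*4 - 2*3 = -2
```

One defining property of the pseudoinverse is that `A @ A_pinv @ A` recovers `A`, which gives a quick sanity check.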


Calculus

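$\frac{dy}{dx}$  Derivative of $y$ with respect to $x$

$\frac{\partial y}{\partial x}$  Partial derivative of $y$ with respect to $x$

$\nabla_{\boldsymbol{x}} y$  Gradient of $y$ with respect to $\boldsymbol{x}$

$\nabla_{\boldsymbol{X}} y$  Matrix derivatives of $y$ with respect to $\boldsymbol{X}$

$\nabla_{\mathsf{X}} y$  Tensor containing derivatives of $y$ with respect to $\mathsf{X}$

$\frac{\partial f}{\partial \boldsymbol{x}}$  Jacobian matrix $\boldsymbol{J} \in \mathbb{R}^{m \times n}$ of $f : \mathbb{R}^n \to \mathbb{R}^m$

$\nabla_{\boldsymbol{x}}^2 f(\boldsymbol{x})$ or $\boldsymbol{H}(f)(\boldsymbol{x})$  The Hessian matrix of $f$ at input point $\boldsymbol{x}$

$\int f(\boldsymbol{x}) \, d\boldsymbol{x}$  Definite integral over the entire domain of $\boldsymbol{x}$

$\int_{\mathbb{S}} f(\boldsymbol{x}) \, d\boldsymbol{x}$  Definite integral with respect to $\boldsymbol{x}$ over the set $\mathbb{S}$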
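The shape convention for the Jacobian entry above (m rows for the outputs, n columns for the inputs) can be checked numerically. The sketch below approximates the Jacobian with central finite differences; the helper `numerical_jacobian` and the example function `f` are hypothetical names introduced only for this illustration, and the step size `eps` is an arbitrary assumption.

```python
import numpy as np


def numerical_jacobian(f, x, eps=1e-6):
    """Approximate J[i, j] = d f_i / d x_j with central differences."""
    x = np.asarray(x, dtype=float)
    m = np.asarray(f(x)).size
    J = np.zeros((m, x.size))            # m rows (outputs), n columns (inputs)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps                    # perturb only input coordinate j
        J[:, j] = (np.asarray(f(x + step)) - np.asarray(f(x - step))) / (2 * eps)
    return J


# Example function from R^2 to R^3, so the Jacobian has shape (3, 2).
def f(x):
    return np.array([x[0] * x[1], x[0] ** 2, np.sin(x[1])])


J = numerical_jacobian(f, [1.0, 2.0])
```

For a scalar-valued function (m = 1), the single row of this matrix holds the same entries as the gradient.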

Probability and Information Theory

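$\mathrm{a} \perp \mathrm{b}$  The random variables a and b are independent

$\mathrm{a} \perp \mathrm{b} \mid \mathrm{c}$  They are conditionally independent given c

$P(\mathrm{a})$  A probability distribution over a discrete variable

$p(\mathrm{a})$  A probability distribution over a continuous variable, or over a variable whose type has not been specified

$\mathrm{a} \sim P$  Random variable a has distribution $P$

$\mathbb{E}_{\mathrm{x} \sim P}[f(x)]$ or $\mathbb{E} f(x)$  Expectation of $f(x)$ with respect to $P(\mathrm{x})$

$\mathrm{Var}(f(x))$  Variance of $f(x)$ under $P(\mathrm{x})$

$\mathrm{Cov}(f(x), g(x))$  Covariance of $f(x)$ and $g(x)$ under $P(\mathrm{x})$

$H(\mathrm{x})$  Shannon entropy of the random variable x

$D_{\mathrm{KL}}(P \| Q)$  Kullback-Leibler divergence of $P$ and $Q$

$\mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$  Gaussian distribution over $\boldsymbol{x}$ with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$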
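For a discrete variable, several of the quantities above reduce to weighted sums. The following sketch uses NumPy and two small distributions that we made up for illustration; it assumes both distributions assign nonzero probability to every value, so the logarithms are finite. Logarithms are natural, so the entropy and the divergence are measured in nats.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])      # values the variable can take
P = np.array([0.5, 0.25, 0.25])    # probabilities under P
Q = np.array([0.25, 0.25, 0.5])    # a second distribution over the same values


def f(x):
    return x ** 2


expectation = np.sum(P * f(x))                     # expectation of f(x) under P
variance = np.sum(P * (f(x) - expectation) ** 2)   # variance of f(x) under P
entropy = -np.sum(P * np.log(P))                   # Shannon entropy of x under P
kl_pq = np.sum(P * np.log(P / Q))                  # KL divergence of P and Q
```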


Functions

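$f : \mathbb{A} \to \mathbb{B}$  The function $f$ with domain $\mathbb{A}$ and range $\mathbb{B}$

$f \circ g$  Composition of the functions $f$ and $g$

$f(\boldsymbol{x}; \boldsymbol{\theta})$  A function of $\boldsymbol{x}$ parametrized by $\boldsymbol{\theta}$. (Sometimes we write $f(\boldsymbol{x})$ and omit the argument $\boldsymbol{\theta}$ to lighten notation.)

$\log x$  Natural logarithm of $x$

$\sigma(x)$  Logistic sigmoid, $\frac{1}{1 + \exp(-x)}$

$\zeta(x)$  Softplus, $\log(1 + \exp(x))$

$\|\boldsymbol{x}\|_p$  $L^p$ norm of $\boldsymbol{x}$

$\|\boldsymbol{x}\|$  $L^2$ norm of $\boldsymbol{x}$

$x^+$  Positive part of $x$, i.e., $\max(0, x)$

$\mathbf{1}_{\mathrm{condition}}$  is 1 if the condition is true, 0 otherwise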

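Sometimes we use a function $f$ whose argument is a scalar but apply it to a vector, matrix, or tensor: $f(\boldsymbol{x})$, $f(\boldsymbol{X})$, or $f(\mathsf{X})$. This denotes the application of $f$ to the array element-wise. For example, if $\mathsf{C} = \sigma(\mathsf{X})$, then $\mathsf{C}_{i,j,k} = \sigma(\mathsf{X}_{i,j,k})$ for all valid values of $i$, $j$ and $k$.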
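The element-wise convention just described is how NumPy treats scalar functions applied to arrays, so it can be sketched directly. The function names below are our own, and the direct formulas are kept for readability; they can overflow for inputs of large magnitude, so this is an illustration rather than a numerically robust implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # logistic sigmoid


def softplus(x):
    return np.log1p(np.exp(x))          # log(1 + exp(x))


def positive_part(x):
    return np.maximum(0.0, x)           # max(0, x), applied element-wise


X = np.array([[[-1.0, 0.0], [1.0, 2.0]]])   # a 3-D tensor of shape (1, 2, 2)
C = sigmoid(X)                               # sigmoid applied to every element of X

v = np.array([3.0, -4.0])
l2 = np.linalg.norm(v)                 # L^2 norm of v
l1 = np.linalg.norm(v, ord=1)          # L^p norm of v with p = 1
indicator = (v > 0).astype(float)      # 1 where the condition holds, 0 otherwise
```

Here `C` has the same shape as `X`, and each entry of `C` is the sigmoid of the entry of `X` at the same index.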

Datasets and Distributions

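$p_{\mathrm{data}}$  The data generating distribution

$\hat{p}_{\mathrm{data}}$  The empirical distribution defined by the training set

$\mathbb{X}$  A set of training examples

$\boldsymbol{x}^{(i)}$  The $i$-th example (input) from a dataset

$y^{(i)}$ or $\boldsymbol{y}^{(i)}$  The target associated with $\boldsymbol{x}^{(i)}$ for supervised learning

$\boldsymbol{X}$  The $m \times n$ matrix with input example $\boldsymbol{x}^{(i)}$ in row $\boldsymbol{X}_{i,:}$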
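As a small illustration of the last entry, the sketch below stacks a toy set of examples into an m × n matrix with one example per row. The data values and the variable names `examples` and `targets` are invented for this example.

```python
import numpy as np

# Toy dataset: m = 3 examples, each an n = 2 dimensional input.
examples = [np.array([1.0, 2.0]),
            np.array([3.0, 4.0]),
            np.array([5.0, 6.0])]
targets = np.array([0, 1, 1])   # one target per example (supervised learning)

X = np.stack(examples)          # m x n matrix, one example per row
m, n = X.shape

second_example = X[1, :]        # row 2 of X in the 1-based notation
```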
