Abstract
Independent component analysis (ICA) is a widely-used blind source separation technique. Although ICA has been applied in many fields, it is usually utilized as a black box, without understanding its internal details. Therefore, in this paper, the basics of ICA are provided to show how it works, so that the paper can serve as a comprehensive source for researchers who are interested in this field. The paper starts by introducing the definition and underlying principles of ICA. Different numerical examples are then demonstrated, in a step-by-step approach, to explain the preprocessing steps of ICA and the mixing and unmixing processes. Moreover, different ICA algorithms, challenges, and applications are presented.
Citation
Tharwat, A. (2021), "Independent component analysis: An introduction", Applied Computing and Informatics, Vol. 17 No. 2, pp. 222-249. https://doi.org/10.1016/j.aci.2018.08.006
Publisher
Emerald Publishing Limited
Copyright © 2018, Alaa Tharwat
License
Published in Applied Computing and Informatics. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) license. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this license may be seen at http://creativecommons.org/licences/by/4.0/legalcode
1. Introduction
Measurements cannot be isolated from noise, which has a great impact on measured signals. For example, the recorded sound of a person in a street also contains the sounds of footsteps, pedestrians, etc. Hence, it is difficult to record a clean measurement, because (1) source signals are always corrupted with noise, and (2) other independent signals (e.g. car sounds) generated from different sources are mixed in [31]. Thus, the measurements can be defined as a combination of many independent sources. The topic of separating these mixed signals is called blind source separation (BSS). The term blind indicates that the source signals can be separated even if little information is known about them.
One of the most widely-used examples of BSS is the separation of the voice signals of people speaking at the same time; this is called the cocktail party problem [31]. The independent component analysis (ICA) technique is one of the most well-known algorithms used for solving this problem [23]. The goal is to detect or extract the sound of a single source even though different sounds in the environment are superimposed on one another [31]. Figure 1 shows an example of the cocktail party problem. In this example, two voice signals are recorded from two different individuals, i.e., two independent source signals. Moreover, two sensors, i.e., microphones, are used for recording the signals, and the outputs from these sensors are two mixtures. The goal is to extract the original signals1 from the mixtures. This problem can be solved using the ICA technique [23].
ICA was first introduced in the 1980s by J. Hérault, C. Jutten, and B. Ans, who proposed an iterative real-time algorithm [15]. However, that paper presented no theoretical explanation, and the proposed algorithm was not applicable in a number of cases. The ICA technique remained mostly unknown until 1994, when the name ICA appeared and the method was introduced as a new concept [9]. The aim of ICA is to extract useful information or source signals from data (a set of measured mixture signals). These data can be in the form of images, stock markets, or sounds. Hence, ICA has been used for extracting source signals in many applications such as medical signals [7,34], biological assays [3], and audio signals [2]. ICA is also considered a dimensionality reduction algorithm, since it can delete or retain a single source; this is also called a filtering operation, where some signals can be filtered or removed [31].
ICA is considered an extension of the principal component analysis (PCA) technique [9,33]. However, PCA optimizes the covariance matrix of the data, which represents second-order statistics, while ICA optimizes higher-order statistics such as kurtosis. Hence, PCA finds uncorrelated components while ICA finds independent components [21,33]. As a consequence, PCA can extract independent sources when the higher-order correlations of the mixture data are small or insignificant [21].
ICA has many algorithms such as FastICA [18], projection pursuit [21], and Infomax [21]. The main goal of these algorithms is to extract independent components by (1) maximizing the non-Gaussianity, (2) minimizing the mutual information, or (3) using maximum likelihood (ML) estimation method [20]. However, ICA suffers from a number of problems such as over-complete ICA and under-complete ICA.
Many studies treat the ICA technique as a black box without understanding its internal details. In this paper, the basic definitions of ICA and how to use it for extracting independent signals are introduced in a step-by-step approach. This paper is divided into eight sections. In Section 2, an overview of the definition and main idea of ICA and its background is introduced. This section begins by explaining, with illustrative numerical examples, how signals are mixed to form mixture signals, and then the unmixing process is presented. Section 3 introduces, with visualized steps and numerical examples, the two preprocessing steps of ICA, which greatly help in extracting source signals. Section 4 presents the principles of how ICA extracts independent signals using different approaches such as maximizing the likelihood, maximizing the non-Gaussianity, or minimizing the mutual information; this section explains the steps of each approach mathematically. Different ICA algorithms are highlighted in Section 5. Section 6 lists some applications that use ICA for recovering independent sources from a set of sensed signals that result from mixing a set of source signals. In Section 7, the most common problems of ICA are explained. Finally, concluding remarks are given in Section 8.
2. ICA background
2.1 Mixing signals
Each signal varies over time, and a signal can be represented as a sequence of samples over $T$ time steps, i.e., $s_i = \{s_i(1), s_i(2), \ldots, s_i(T)\}$.
2.1.1 Illustrative example
The goal of this example is to show the properties of source and mixture signals. Given two source signals $s_1$ and $s_2$ (see Figure 3) and their mixtures (see Figure 4), the following properties can be observed:
1. Independence: if the source signals are independent (as in Figure 3(a and b)), their mixture signals are not (see Figure 4(a and b)). This is because the source signals are shared between both mixtures.
2. Gaussianity: the histograms of the mixed signals are bell-shaped (see Figure 4(e and f)), i.e., Gaussian or normal. This property can be used to search for the non-Gaussian signals within the mixture signals in order to extract the source or independent signals. In other words, the source signals must be non-Gaussian, and this assumption is a fundamental restriction in ICA. Hence, the ICA model cannot estimate Gaussian independent components.
3. Complexity: it is clear from the previous example that the mixed signals are more complex than the source signals.
From these properties we can conclude that if the signals extracted from the mixture signals are independent, have non-Gaussian histograms, or have lower complexity than the mixture signals, then these signals represent the source signals.
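To make the Gaussianity property concrete, the following short Python sketch (an illustration that is not part of the original paper; the signal choices and the mixing matrix are arbitrary assumptions) mixes two non-Gaussian sources and compares the kurtosis of the sources with that of the mixtures. The mixtures' kurtosis values are pulled toward zero, the value for a Gaussian signal.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 10_000)

# Two non-Gaussian, independent source signals (arbitrary choices).
s1 = np.sign(np.sin(2 * np.pi * 5 * t))   # square wave (sub-Gaussian)
s2 = rng.laplace(size=t.size)             # Laplacian noise (super-Gaussian)
S = np.vstack([s1, s2])

# An arbitrary mixing matrix A; each mixture is a weighted sum of the sources.
A = np.array([[1.0, 0.5],
              [0.7, 1.2]])
X = A @ S

def excess_kurtosis(v):
    """Kurtosis of a standardized signal: E[v^4] - 3, which is 0 for a Gaussian."""
    v = (v - v.mean()) / v.std()
    return np.mean(v**4) - 3

print("sources :", [round(excess_kurtosis(s), 2) for s in S])   # far from 0
print("mixtures:", [round(excess_kurtosis(x), 2) for x in X])   # closer to 0 (more Gaussian)
```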
2.1.2 Numerical example: Mixing signals
The goal of this example5 is to explain how source signals are mixed to form mixture signals. Figure 5 shows the two source signals $s_1$ and $s_2$ that are used in this example.
The new axes in the
The four points are transformed in the
Assumed the second source
A related point to consider is that the number of red points in Figure 6(a) which represent the projected points onto
2.2 Unmixing signals
In this section, the unmixing process for extracting source signals will be presented. Given a mixing matrix $\mathbf{A}$, the source signals can be recovered from the mixture signals by estimating the unmixing matrix $\mathbf{W}$, which inverts the mixing process (when $\mathbf{A}$ is known exactly, $\mathbf{W}$ is simply the inverse of $\mathbf{A}$).
Given two mixture signals $x_1$ and $x_2$, each source signal is extracted by taking the inner product of a weight vector with the mixture signals, i.e., $y_i = \mathbf{w}_i^T\mathbf{x}$, where the weight vectors $\mathbf{w}_1$ and $\mathbf{w}_2$ form the rows of the unmixing matrix $\mathbf{W}$.
In practice, changing the length or orientation of the weight vectors has a great influence on the extracted signals, as explained below.
Length: The length of the weight vector $\mathbf{w}_1$ is $|\mathbf{w}_1|$; assume that this length is changed by a factor $\lambda$, i.e., the weight vector becomes $\lambda\mathbf{w}_1$. The extracted signal, or the best approximation of $s_1$, is denoted by $y_1$, and it is estimated as in Eq. (12), $y_1 = (\lambda\mathbf{w}_1)^T\mathbf{x} = \lambda(\mathbf{w}_1^T\mathbf{x})$. Hence, the extracted signal is a scaled version of the source signal, and the length of the weight vector affects only the amplitude of the extracted signal.
Orientation: As mentioned before, the source signals $s_1$ and $s_2$ in the source space $S$ are transformed to the new axes $S_1'$ and $S_2'$ (see Eqs. (4) and (5)), respectively, where $S_1'$ and $S_2'$ span the mixture space $X$. The signal $s_1$ is extracted only if $\mathbf{w}_1$ is orthogonal to $S_2'$; hence, at different orientations, different signals are extracted. This is because the inner product of any two orthogonal vectors is zero, i.e., $\mathbf{w}_1^T S_2' = 0$ since $\mathbf{w}_1$ is orthogonal to $S_2'$, while the inner product of $\mathbf{w}_1$ and $S_1'$ is $\mathbf{w}_1^T S_1' = |\mathbf{w}_1||S_1'|\cos\theta = k$, where $\theta$ is the angle between $\mathbf{w}_1$ and $S_1'$ as shown in Figure 5, and $k$ is a constant. The value of $k$ depends on the lengths of $\mathbf{w}_1$ and $S_1'$ and the angle $\theta$. The extracted signal will therefore be $y_1 = \mathbf{w}_1^T\mathbf{x} = k s_1$. The extracted signal ($y_1$) is a scaled version of the source signal ($s_1$), and $s_1$ is extracted from the mixtures by taking the inner product of all mixture signals with $\mathbf{w}_1$, which is orthogonal to $S_2'$. Thus, it is difficult to recover the amplitude of the source signals.
Figure 8 displays the mixing and unmixing steps of ICA. As shown, the first mixture signal
2.2.1 Numerical examples: Unmixing signals
The goal of this example is to extract the source signals that were mixed in the numerical example in Section 2.1.2. Since the mixing matrix $\mathbf{A}$ is known in this example, the unmixing matrix is simply $\mathbf{W} = \mathbf{A}^{-1}$, and the source signals are estimated by projecting the mixture signals onto the rows of $\mathbf{W}$.
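As a small illustration of this recovery step (the source values and the mixing matrix below are made-up numbers, not the ones used in the paper's example), the following sketch mixes two short signals with a known matrix A and recovers them exactly with W = A^{-1}:

```python
import numpy as np

# Made-up source signals: one source per row, one sample per column.
S = np.array([[1.0, 2.0, 0.5, -1.0],
              [0.0, -1.0, 1.5, 2.0]])

# Made-up mixing matrix A; the mixtures are X = A S.
A = np.array([[0.8, 0.3],
              [0.4, 0.9]])
X = A @ S

# When A is known, the unmixing matrix is its inverse, W = A^{-1},
# and the sources are recovered exactly as Y = W X.
W = np.linalg.inv(A)
Y = W @ X
print(np.allclose(Y, S))   # True: perfect recovery when A is known

# In ICA, A is unknown, so W must be estimated from X alone,
# up to the order and sign ambiguities discussed in Section 2.3.
```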
2.3 Ambiguities of ICA
ICA has some ambiguities such as:
The order of independent components: In ICA, the weight vector $\mathbf{w}$ is initialized randomly and then rotated to find one independent component. During the rotation, the value of $\mathbf{w}$ is updated iteratively. Thus, $\mathbf{w}$ extracts the source signals, but not in a specific order.

The sign of independent components: Changing the sign of independent components has no influence on the ICA model. In other words, we can multiply the weight vectors in $\mathbf{W}$ by $-1$ without affecting the extracted signal. In our example in Section 2.2.1, multiplying the weight vector $\mathbf{w}_1$ by $-1$, i.e., using $-\mathbf{w}_1$, has no influence because $-\mathbf{w}_1$ still lies along the same line with the same magnitude; hence the magnitude of $k$ is not changed, and the extracted signal has the same values but with a different sign, i.e., $-y_1$. As a result, in the $n$-dimensional space there are $2n$ local maxima, i.e., two local maxima for each independent component, corresponding to $s_i$ and $-s_i$ [21]. This problem is insignificant in many applications [16,19].
3. ICA: Preprocessing phase
This section explains the preprocessing steps of the ICA technique. This phase has two main steps: centering and whitening.
3.1 The centering step
The goal of this step is to center the data by subtracting the mean from all signals. Given $n$ mixture signals $\mathbf{X}$, the centered data are obtained as $\mathbf{D} = \mathbf{X} - \bar{\mathbf{X}}$, where $\bar{\mathbf{X}}$ contains the mean of each mixture signal.
3.2 The whitening data step
This step aims to whiten the data which means transforming signals into uncorrelated signals and then rescale each signal to be with a unit variance. This step includes two main steps as follows.
1. Decorrelation: The goal of this step is to decorrelate the signals; in other words, make each signal uncorrelated with each other. Two random variables are considered uncorrelated if their covariance is zero.
In ICA, the PCA technique can be used for decorrelating signals. In PCA, the eigenvectors which form the new PCA space are calculated. First, the covariance matrix is calculated. The covariance of any two variables $x$ and $y$ is defined as follows, $\mathrm{cov}(x, y) = E\{(x - \mu_x)(y - \mu_y)\}$. With many variables, the covariance matrix is calculated as follows, $\boldsymbol{\Sigma} = \frac{1}{N}\mathbf{D}\mathbf{D}^T$, where $\mathbf{D}$ is the centered data (see Figure 9(B)) and $N$ is the number of samples. The covariance matrix is solved by calculating the eigenvalues and eigenvectors as follows, $\boldsymbol{\Sigma}\mathbf{V} = \mathbf{V}\boldsymbol{\Lambda}$, where the eigenvectors (the columns of $\mathbf{V}$) represent the principal components, which represent the directions of the PCA space, and the eigenvalues (the diagonal entries of $\boldsymbol{\Lambda}$) are scalar values which represent the magnitude of the eigenvectors. The eigenvector which has the maximum eigenvalue is the first principal component and it has the maximum variance [33]. For decorrelating the mixture signals, they are projected onto the calculated PCA space as follows, $\mathbf{U} = \mathbf{V}^T\mathbf{D}$.

2. Scaling: the goal here is to scale each decorrelated signal to have a unit variance. Hence, each decorrelated signal in $\mathbf{U}$ is rescaled as follows, $\mathbf{Z} = \boldsymbol{\Lambda}^{-\frac{1}{2}}\mathbf{U} = \boldsymbol{\Lambda}^{-\frac{1}{2}}\mathbf{V}^T\mathbf{D}$, where $\mathbf{Z}$ is the whitened or sphered data and $\boldsymbol{\Lambda}^{-\frac{1}{2}}$ is calculated by a simple component-wise operation, i.e., by replacing each eigenvalue $\lambda_i$ with $1/\sqrt{\lambda_i}$. After the scaling step, the data becomes rotationally symmetric like a sphere; therefore, the whitening step is also called sphering [32].
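The centering and whitening steps can be written in a few lines of code. The sketch below (an illustration under the assumption that X holds one mixture signal per row) follows the two sub-steps above: decorrelation using the eigenvectors of the covariance matrix, then rescaling by the inverse square roots of the eigenvalues.

```python
import numpy as np

def whiten(X):
    """Center and whiten mixture signals.

    X : array of shape (n_signals, n_samples), one mixture per row.
    Returns the whitened data Z, whose covariance is (close to) the identity.
    """
    # Centering step: subtract the mean of each signal.
    D = X - X.mean(axis=1, keepdims=True)

    # Decorrelation: eigen-decomposition of the covariance matrix.
    cov = (D @ D.T) / D.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)     # columns of eigvecs = principal components
    U = eigvecs.T @ D                          # project onto the PCA space (decorrelated)

    # Scaling: rescale each decorrelated signal to unit variance (sphering).
    Z = U / np.sqrt(eigvals)[:, None]
    return Z

# Example with two made-up mixtures.
rng = np.random.default_rng(1)
S = np.vstack([np.sign(np.sin(np.linspace(0, 40, 500))), rng.laplace(size=500)])
X = np.array([[0.8, 0.3], [0.4, 0.9]]) @ S
Z = whiten(X)
print(np.round((Z @ Z.T) / Z.shape[1], 2))     # approximately the identity matrix
```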
3.3 Numerical example
Given eight mixture signals
In the centering step, the data are centered by subtracting the mean of each signal from that signal.
The covariance matrix
From Eq. (16) it can be seen that the two eigenvectors are orthogonal, as shown in Figure 10, i.e., $\mathbf{v}_1^T\mathbf{v}_2 = 0$.
The two signals are decorrelated by projecting the centered data onto the PCA space as follows, $\mathbf{U} = \mathbf{V}^T\mathbf{D}$.
The matrix
From Eq. (18) it can be seen that the two mixture signals are decorrelated by projecting them onto the PCA space. Thus, the covariance matrix is diagonal, and the off-diagonal elements, which represent the covariance between the two mixture signals, are zeros. Figure 10 shows that the contour of the two mixtures is an ellipse centered at the mean. The projection of the mixture signals onto the PCA space rotates the ellipse so that the principal components are aligned with the axes.
The covariance matrix for the whitened data is the identity matrix, which indicates that the whitened signals are uncorrelated and each has a unit variance.
Figure 11 displays the scatter plot for the two mixtures, where each mixture signal is represented by 500 time steps. As shown in Figure 11(a), the scatter of the original mixtures forms an ellipse centered at the origin. Projecting the mixture signals onto the PCA space rotates the scatter so that the principal components are aligned with the axes.
4. Principles of ICA estimation
In ICA, the goal is to find the unmixing matrix $\mathbf{W}$ and then project the whitened data onto it to extract the independent signals. This can be achieved by maximizing the non-Gaussianity of the extracted signals, minimizing the mutual information between them, or using maximum likelihood estimation, as explained in the following subsections.
4.1 Measures of non-Gaussianity
Searching for independent components can be achieved by maximizing the non-Gaussianity of extracted signals [23]. Two measures are used for measuring the non-Gaussianity, namely, Kurtosis and negative entropy.
4.1.1 Kurtosis
Kurtosis can be used as a measure of non-Gaussianity, and the extracted signal can be obtained by finding the unmixing vector which maximizes the kurtosis of the extracted signal [4]. In other words, the source signals can be extracted by finding the orientation of the weight vectors which maximize the kurtosis.
Kurtosis is simple to calculate; however, it is sensitive to outliers. Thus, it is not robust enough for measuring non-Gaussianity [21]. The kurtosis (K) of any probability density function (pdf) is defined as follows, $K(x) = E\{x^4\} - 3(E\{x^2\})^2$.
For whitened data $z$, where $E\{z^2\} = 1$, the kurtosis simplifies to $K(z) = E\{z^4\} - 3$.
As reported in [20], the fourth moment for Gaussian signals is $3(E\{x^2\})^2$; hence, the kurtosis of Gaussian signals is zero, while it is positive for super-Gaussian (spiky) signals and negative for sub-Gaussian (flat) signals.
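As a quick numerical check of this formula (an illustrative sketch with arbitrary sample sizes and distributions, not taken from the paper), the kurtosis of standardized Gaussian samples is close to zero, while super-Gaussian and sub-Gaussian samples give positive and negative values, respectively:

```python
import numpy as np

rng = np.random.default_rng(0)

def kurtosis_whitened(x):
    """K = E[x^4] - 3 for a zero-mean, unit-variance (whitened) signal."""
    x = (x - x.mean()) / x.std()
    return np.mean(x**4) - 3

print(round(kurtosis_whitened(rng.normal(size=100_000)), 2))    # ~ 0    (Gaussian)
print(round(kurtosis_whitened(rng.laplace(size=100_000)), 2))   # ~ +3   (super-Gaussian)
print(round(kurtosis_whitened(rng.uniform(size=100_000)), 2))   # ~ -1.2 (sub-Gaussian)
```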
Kurtosis has an additivity property: for two independent signals $x_1$ and $x_2$, $K(x_1 + x_2) = K(x_1) + K(x_2)$ (Eq. (24)),
and for any scalar parameter $\alpha$, $K(\alpha x) = \alpha^4 K(x)$ (Eq. (25)).
These properties can be used for interpreting one of the ambiguities of ICA that were mentioned in Section 2.3, namely the sign of independent components. Given two source signals $s_1$ and $s_2$, the extracted signal can be written as $y = \mathbf{w}^T\mathbf{x} = \mathbf{w}^T\mathbf{A}\mathbf{s} = \mathbf{q}^T\mathbf{s} = q_1 s_1 + q_2 s_2$, where $\mathbf{q} = \mathbf{A}^T\mathbf{w}$.
Using the kurtosis properties in Eqs. (24) and (25), we have $K(y) = K(q_1 s_1) + K(q_2 s_2) = q_1^4 K(s_1) + q_2^4 K(s_2)$.
Assume that the extracted signal has a unit variance, i.e., $E\{y^2\} = q_1^2 + q_2^2 = 1$, so that $\mathbf{q}$ lies on the unit circle. Under this constraint, the maxima of $|K(y)|$ are at the points where exactly one of $q_1$ and $q_2$ is $\pm 1$ and the other is zero, i.e., where the extracted signal equals one of the source signals up to its sign ($y = \pm s_i$), which explains the sign ambiguity.
The ICs can therefore be obtained by finding the weight vectors which maximize the absolute kurtosis of the extracted signals.
The gradient of the kurtosis of the extracted signal $y = \mathbf{w}^T\mathbf{z}$ with respect to $\mathbf{w}$ is proportional to $E\{\mathbf{z}(\mathbf{w}^T\mathbf{z})^3\} - 3\|\mathbf{w}\|^2\mathbf{w}$ for whitened data [21]; hence, $\mathbf{w}$ can be updated iteratively by gradient ascent on $|K(y)|$, followed by normalizing $\mathbf{w}$ to unit length, until it converges to one of the independent components.
4.1.2 Negative entropy
Negative entropy is termed negentropy, and it is defined as follows, $J(y) = H(y_{gauss}) - H(y)$, where $y_{gauss}$ is a Gaussian random variable with the same variance as $y$ and $H$ denotes the (differential) entropy.
The negentropy is always non-negative, and it is zero when all variables are Gaussian, i.e., $J(y) = 0$ if and only if $y$ has a Gaussian distribution. Since computing negentropy requires an estimate of the pdf of $y$, it is computationally difficult, and in practice it is approximated using non-quadratic functions $G$ [17,21].
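One widely used approximation is, up to a constant, $J(y) \approx [E\{G(y)\} - E\{G(\nu)\}]^2$, where $\nu$ is a standard Gaussian variable and $G$ is a non-quadratic function such as $G(u) = \log\cosh u$ [17,21]. The sketch below is an illustration only (the Monte Carlo estimate of $E\{G(\nu)\}$ and the sample sizes are arbitrary choices); it shows that the approximation is near zero for Gaussian samples and clearly larger for non-Gaussian samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def negentropy_approx(y, n_gauss=1_000_000):
    """Approximate negentropy, (E[G(y)] - E[G(nu)])^2, with G(u) = log cosh(u).

    y is standardized first; nu is a standard Gaussian reference variable,
    whose expectation is estimated here by simple Monte Carlo sampling.
    """
    y = (y - y.mean()) / y.std()
    G = lambda u: np.log(np.cosh(u))
    e_gauss = G(rng.normal(size=n_gauss)).mean()     # E[G(nu)], roughly 0.375
    return (G(y).mean() - e_gauss) ** 2

print(round(negentropy_approx(rng.normal(size=50_000)), 6))    # ~ 0 for Gaussian data
print(round(negentropy_approx(rng.laplace(size=50_000)), 6))   # clearly larger for non-Gaussian data
```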
4.2 Minimization of mutual information
Minimizing mutual information between independent components is one of the well-known approaches for ICA estimation. In ICA, maximizing the entropy of
The pdf of the linear transformation
This can be substituted in Eq. (29) and the entropy will be
In Eq. (33), increasing the matching between the extracted and source signals, the ratio
Hence, from Eq. (34),
Mutual information measures the independence between random variables. Thus, independent components can be obtained by minimizing the mutual information between different components [6]. Given two random variables x and y, the mutual information is denoted by I, and it is given by $I(x, y) = H(x) + H(y) - H(x, y)$, where $H(x, y)$ is the joint entropy.
In ICA, where
From Eq. (39), it is clear that maximizing negentropy is related to minimizing mutual information; they differ only by a sign and a constant C. Moreover, non-Gaussianity measures enable the deflationary (one-by-one) estimation of the ICs, which is not possible with the mutual information or likelihood approaches.8 Further, with the non-Gaussianity approach, all signals are enforced to be uncorrelated, while this constraint is not necessary when using the mutual information approach.
4.3 Maximum Likelihood (ML)
Maximum likelihood (ML) estimation is used for estimating the parameters of statistical models given a set of observations. In ICA, this method is used for estimating the unmixing matrix $\mathbf{W}$.
The likelihood is formulated in the noise-free ICA model ($\mathbf{x} = \mathbf{A}\mathbf{s}$) as follows: the density of the mixture vector is $p_x(\mathbf{x}) = |\det\mathbf{W}|\, p_s(\mathbf{s}) = |\det\mathbf{W}| \prod_i p_i(\mathbf{w}_i^T\mathbf{x})$, where $\mathbf{W} = \mathbf{A}^{-1}$, the $\mathbf{w}_i$ are its rows, and $p_i$ denotes the density of the $i$th source signal.
Given T observations of $\mathbf{x}$, the likelihood of the unmixing matrix is $L(\mathbf{W}) = \prod_{t=1}^{T}\prod_{i=1}^{n} p_i(\mathbf{w}_i^T\mathbf{x}(t))\,|\det\mathbf{W}|$.
Practically, the likelihood is usually simplified using the logarithm; this is called the log-likelihood, which makes Eq. (41) simpler: $\log L(\mathbf{W}) = \sum_{t=1}^{T}\sum_{i=1}^{n} \log p_i(\mathbf{w}_i^T\mathbf{x}(t)) + T\log|\det\mathbf{W}|$.
The mean of any random variable $x$ can be calculated as $E\{x\} = \frac{1}{T}\sum_{t=1}^{T} x(t)$; hence, dividing the log-likelihood by T gives $\frac{1}{T}\log L(\mathbf{W}) = E\{\sum_{i=1}^{n}\log p_i(\mathbf{w}_i^T\mathbf{x})\} + \log|\det\mathbf{W}|$.
The first term, $E\{\sum_{i=1}^{n}\log p_i(\mathbf{w}_i^T\mathbf{x})\}$, equals $-\sum_{i=1}^{n} H(\mathbf{w}_i^T\mathbf{x})$ when the assumed densities $p_i$ match the actual distributions of the extracted signals; thus, maximizing the likelihood is closely related to minimizing the entropies of the extracted signals, i.e., to the negentropy and mutual information approaches described above [21].
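A common way to maximize this log-likelihood in practice is (natural) gradient ascent on $\mathbf{W}$ (cf. [1,6]). The sketch below is a minimal, hedged illustration of that idea, not the exact algorithm of the paper: it assumes whitened mixtures of two made-up super-Gaussian sources and uses the source model $(\log p_i)'(u) = -\tanh(u)$, which gives the update $\mathbf{W} \leftarrow \mathbf{W} + \eta\,(\mathbf{I} - \tanh(\mathbf{Y})\mathbf{Y}^T/T)\,\mathbf{W}$ with $\mathbf{Y} = \mathbf{W}\mathbf{Z}$.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5000

# Two made-up super-Gaussian sources (the -tanh source model below assumes
# super-Gaussian sources); the mixtures are centered and whitened as in Section 3.
S = np.vstack([rng.laplace(size=T), rng.standard_t(df=5, size=T)])
X = np.array([[0.8, 0.3], [0.4, 0.9]]) @ S
D = X - X.mean(axis=1, keepdims=True)
vals, vecs = np.linalg.eigh((D @ D.T) / T)
Z = (vecs.T @ D) / np.sqrt(vals)[:, None]

# Natural-gradient ascent on the log-likelihood with (log p)' = -tanh:
#   W <- W + lr * (I - tanh(Y) Y^T / T) W,   where Y = W Z.
n = Z.shape[0]
W = np.eye(n)
lr = 0.1
for _ in range(500):
    Y = W @ Z
    W = W + lr * (np.eye(n) - np.tanh(Y) @ Y.T / T) @ W

Y = W @ Z   # approximates the sources up to order, sign, and scale
print(np.round(np.corrcoef(np.vstack([S, Y]))[:2, 2:], 2))
```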
5. ICA algorithms
In this section, different ICA algorithms are introduced.
5.1 Projection pursuit
Projection pursuit (PP) is a statistical technique for finding interesting projections of multi-dimensional data [13]. In basic one-dimensional projection pursuit, the aim is to find the directions where the projections of the data onto these directions have distributions which deviate from the Gaussian distribution, and this is exactly the goal of ICA [13]. Hence, ICA can be considered a variant of projection pursuit.
In PP, one source signal is extracted from each projection, which is different from ICA algorithms that extract p signals simultaneously from n mixtures. Simply, in PP, after finding the first projection which maximizes the non-Gaussianity, the same process is repeated to find new projections for extracting the next source signal(s) from the reduced set of mixture signals; this sequential process is called deflation [17].
Given n mixture signals, which represent the axes of an n-dimensional space, projection pursuit searches this space for the weight vector (direction) onto which the projection of the data is maximally non-Gaussian.
Consider three source signals, each with 10,000 time steps, as shown in Figure 12. These signals represent sound signals obtained from Matlab, where the first signal is called chirp, the second is called gong, and the third is called train. Figure 12(d, e, and f) shows the histogram of each signal. As shown, the histograms are non-Gaussian. These three signals were then mixed using a mixing matrix $\mathbf{A}$.
Figure 13 shows the mixed signals and the histograms of these mixture signals. As shown in the figure, the mixture signals follow all the properties that were mentioned in Section 2.1.1: (1) the source signals are more independent than the mixture signals, (2) the histograms of the mixture signals in Figure 13 are much more Gaussian than the histograms of the source signals in Figure 12, and (3) the mixture signals (see Figure 13) are more complex than the source signals (see Figure 12).
In the projection pursuit algorithm, the mixture signals are first whitened, and then the values of the first weight vector $\mathbf{w}_1$ are initialized randomly and updated iteratively so as to maximize the kurtosis of the extracted signal $y_1 = \mathbf{w}_1^T\mathbf{z}$.
Figure 15 shows the histogram of the extracted signal during the iterations. As shown in Figure 15(a), the extracted signal is initially approximately Gaussian; hence, its kurtosis value, which represents the measure of non-Gaussianity in the projection pursuit algorithm, is small (0.18). The kurtosis value of the extracted signal increased to 0.21, 3.92, and 4.06 after the 10th, 100th, and 1000th iterations, respectively. This reflects that the non-Gaussianity of the extracted signal increased during the iterations, i.e., the extracted signal converged to one of the source signals.
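The core of this procedure can be sketched in a few lines. The example below is illustrative only: the three sources are synthetic stand-ins for the chirp, gong, and train sounds, and the mixing matrix is an arbitrary assumption. After whitening, one weight vector is updated by gradient ascent on the absolute kurtosis and renormalized to unit length; a second component would then be sought in the subspace orthogonal to the first (deflation).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000

# Three made-up non-Gaussian sources (stand-ins for the chirp, gong and train sounds).
S = np.vstack([np.sign(np.sin(np.linspace(0, 300, T))),   # square wave (sub-Gaussian)
               rng.laplace(size=T),                        # impulsive noise (super-Gaussian)
               (np.linspace(0, 7, T) % 1.0) * 2 - 1])      # sawtooth (sub-Gaussian)
X = np.array([[1.0, 0.5, 0.4],
              [0.6, 1.0, 0.3],
              [0.2, 0.7, 1.0]]) @ S

# Whitening (Section 3).
D = X - X.mean(axis=1, keepdims=True)
vals, vecs = np.linalg.eigh((D @ D.T) / T)
Z = (vecs.T @ D) / np.sqrt(vals)[:, None]

def kurt(y):
    return np.mean(y**4) - 3 * np.mean(y**2)**2

# Projection pursuit for one component: gradient ascent on |kurtosis|.
w = rng.normal(size=3)
w /= np.linalg.norm(w)
lr = 0.05
for _ in range(1000):
    y = w @ Z
    grad = 4 * np.sign(kurt(y)) * ((Z * y**3).mean(axis=1) - 3 * w)   # uses ||w|| = 1
    w += lr * grad
    w /= np.linalg.norm(w)

y = w @ Z
print("kurtosis of extracted signal:", round(kurt(y), 2))
print("correlation with each source:", np.round([np.corrcoef(y, s)[0, 1] for s in S], 2))
# Deflation: to extract the next component, repeat the search with w constrained
# to be orthogonal to the vector found above.
```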
5.2 FastICA
The FastICA algorithm extracts independent components by maximizing the non-Gaussianity, i.e., by maximizing the negentropy of the extracted signals, using a fixed-point iteration scheme [18]. FastICA has cubic or at least quadratic convergence and hence is much faster than gradient-based algorithms, which have linear convergence. Additionally, FastICA has no learning rate or other adjustable parameters, which makes it easy to use.
FastICA can be used for extracting one IC; this is called the one-unit case, where FastICA finds the weight vector $\mathbf{w}$ that maximizes the negentropy of the extracted signal $y = \mathbf{w}^T\mathbf{z}$.
The derivative of the function G in Eq. (31) is denoted by g; commonly used choices are $g_1(u) = \tanh(a_1 u)$, with $1 \le a_1 \le 2$, and $g_2(u) = u\exp(-u^2/2)$ [21].
Eq. (46) can be further simplified by multiplying both sides by $\beta - E\{g'(\mathbf{w}^T\mathbf{z})\}$, which gives the FastICA fixed-point update rule, $\mathbf{w}^+ = E\{\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})\} - E\{g'(\mathbf{w}^T\mathbf{z})\}\,\mathbf{w}$; the new vector is then normalized, $\mathbf{w} = \mathbf{w}^+/\|\mathbf{w}^+\|$, and the two steps are repeated until convergence.
Several units of FastICA can be used for extracting several independent components; after each iteration, the output weight vectors $\mathbf{w}_1, \ldots, \mathbf{w}_n$ are decorrelated (e.g., by a Gram-Schmidt-like deflation scheme) to prevent different units from converging to the same component.
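The following is a minimal sketch of the one-unit FastICA update with Gram-Schmidt-style deflation, written for whitened data Z (one mixture per row); it assumes $g(u) = \tanh(u)$ and $g'(u) = 1 - \tanh^2(u)$ and is an illustration rather than a substitute for a tested library implementation such as scikit-learn's FastICA.

```python
import numpy as np

def fastica(Z, n_components, max_iter=200, tol=1e-6, seed=0):
    """One-unit FastICA with deflation on whitened data Z of shape (n_signals, T)."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n_components, Z.shape[0]))

    g = np.tanh                                  # contrast derivative g(u) = tanh(u)
    g_prime = lambda u: 1 - np.tanh(u) ** 2      # g'(u)

    for i in range(n_components):
        w = rng.normal(size=Z.shape[0])
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            y = w @ Z
            # Fixed-point update: w+ = E{z g(w^T z)} - E{g'(w^T z)} w.
            w_new = (Z * g(y)).mean(axis=1) - g_prime(y).mean() * w
            # Deflation: remove projections onto previously found components.
            w_new -= W[:i].T @ (W[:i] @ w_new)
            w_new /= np.linalg.norm(w_new)
            if np.abs(np.abs(w_new @ w) - 1) < tol:   # converged: direction unchanged
                w = w_new
                break
            w = w_new
        W[i] = w
    return W   # estimated unmixing matrix for the whitened data

# Usage (Z from the whitening sketch in Section 3): Y = fastica(Z, Z.shape[0]) @ Z
```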
6. Applications
ICA has been used in many applications for extracting source signals from a set of mixed signals. These applications include:
Biomedical applications: ICA has been used for removing artifacts which are mixed with different biomedical signals such as electroencephalogram (EEG), functional magnetic resonance imaging (fMRI), and magnetoencephalography (MEG) signals [5]. Also, ICA has been used for removing the electrocardiogram (ECG) interference from EEG signals, or for differentiating between the brain signals and the other signals that are generated from different activities, as in [29].
7. Challenges of ICA
ICA is used for estimating the unknown mixing matrix and the source signals given only the observed mixture signals. This leads to a number of challenges, such as over-complete ICA, where the number of sources exceeds the number of mixtures [1,26], and under-complete ICA, where the number of mixtures exceeds the number of sources.
8. Conclusions
ICA is a widely-used statistical technique for estimating independent components (ICs) by maximizing the non-Gaussianity of the ICs, maximizing the likelihood, or minimizing the mutual information between the ICs. These approaches are approximately equivalent; however, each approach has its own limitations.
This paper followed the approach of not only explaining the steps for estimating ICs, but also presenting illustrative visualizations of the ICA steps to make it easy to understand. Moreover, a number of numerical examples are introduced and graphically illustrated to explain (1) how signals are mixed to form mixture signals, (2) how to estimate source signals, and (3) the preprocessing steps of ICA. Different ICA algorithms are introduced with detailed explanations. Moreover, ICA common challenges and applications are briefly highlighted.
Figures

Figure 1
Example of the cocktail party problem. Two source signals (e.g. sound signals) are generated from two individuals and then recorded by two sensors, e.g., microphones. Two microphones mixed the two source signals linearly. The goal of this problem is to recover the original signals from the mixed signals.

Figure 5
An example of the mixing process.

Figure 6
An example of the mixing process.

Figure 10
Visualization for our whitening example. In the left panel, the mixture signals are plotted as red stars; this panel also shows the principal components.
Results of the projection pursuit algorithm in terms of the correlation between the extracted signal and each of the three source signals, and the kurtosis of the extracted signal, at different iterations.

| Results | iter = 1 | iter = 10 | iter = 100 | iter = 1000 |
| --- | --- | --- | --- | --- |
| Correlation with source signal 1 | 0.3 | 0.33 | 0.99 | 1.00 |
| Correlation with source signal 2 | 0.94 | 0.93 | 0.14 | 0.00 |
| Correlation with source signal 3 | 0.14 | 0.13 | 0.01 | 0.01 |
| Kurtosis of the extracted signal | 0.18 | 0.21 | 3.92 | 4.06 |
| | 1.22 | 1.18 | 0.07 | |
Notes
1 In this paper, original signals, source signals, or independent components (ICs) are the same.
2 In this paper, source and mixture signals are represented as random variables instead of time series or time signals, i.e., the time index is dropped.
3 Two signals s1 and s2 are independent if the amplitude of s1 is independent of the amplitude of s2.
4 In this paper, all bold lowercase letters denote vectors and bold uppercase letters indicate matrices.
5 In all numerical examples, the numbers are rounded up to the nearest hundredths (two numbers after the decimal point).
7 Two vectors x and y are orthonormal if they are orthogonal, i.e., their dot product is zero, and each vector has unit length.
8 Maximum Likelihood approach will be introduced in the next section.
9 Assume
References
[1]S.-I. Amari, Natural gradient learning for over-and under-complete bases in ICA, Neural Comput. 11 (8) (1999) 1875–1883.
[2]A. Asaei, H. Bourlard, M.J. Taghizadeh, V. Cevher, Computational methods for underdetermined convolutive speech localization and separation via model-based sparse component analysis, Speech Commun. 76 (2016) 201–217.
[3]R. Aziz, C. Verma, N. Srivastava, A fuzzy based feature selection from independent component subspace for machine learning classification of microarray data, Genomics data 8 (2016) 4–15.
[4]E. Bingham, A. Hyvärinen, A fast fixed-point algorithm for independent component analysis of complex valued signals, Int. J. Neural Syst. 10 (01) (2000) 1–8.
[5]V.D. Calhoun, J. Liu, T. Adali, A review of group ICA for fMRI data and ICA for joint inference of imaging, genetic, and ERP data, Neuroimage 45 (1) (2009) S163–S172.
[6]J.-F. Cardoso, Infomax and maximum likelihood for blind source separation, IEEE Sig. Process. Lett. 4 (4) (1997) 112–114.
[7]R. Chai, G.R. Naik, T.N. Nguyen, S.H. Ling, Y. Tran, A. Craig, H.T. Nguyen, Driver fatigue classification with independent component by entropy rate bound minimization analysis in an eeg-based system, IEEE J. Biomed. Health Inf. 21 (3) (2017) 715–724.
[8]J.-W. Cho, H.-M. Park, Independent vector analysis followed by hmm-based feature enhancement for robust speech recognition, Sig. Process. 120 (2016) 200–208.
[9]P. Comon, Independent component analysis, a new concept?, Sig. Process. 36 (3) (1994) 287–314.
[10]I. Dagher, R. Nachar, Face recognition using ipca-ica algorithm, IEEE Trans. Pattern Anal. Machine Intell. 28 (6) (2006) 996–1000.
[11]N. Delfosse, P. Loubaton, Adaptive blind separation of independent sources: a deflation approach, Sig. Process. 45 (1) (1995) 59–83.
[12]S. Derrode, G. Mercier, W. Pieczynski, Unsupervised multicomponent image segmentation combining a vectorial hmc model and ica, in: Proceedings of International Conference on Image Processing (ICIP), Vol. 2, IEEE, 2003, pp. II–407.
[13]J.H. Friedman, J.W. Tukey, A projection pursuit algorithm for exploratory data analysis, IEEE Trans. Comput. 100 (9) (1974) 881–890.
[14]S.S. Haykin, Neural Networks and Learning Machines, Vol. 3, Pearson, Upper Saddle River, NJ, USA, 2009.
[15]J. Hérault, C. Jutten, B. Ans, Détection de grandeurs primitives dans un message composite par une architecture de calcul neuromimétique en apprentissage non supervisé, in: 10e Colloque sur le traitement du signal et des images, GRETSI, Groupe d'Etudes du Traitement du Signal et des Images, 1985.
[16]A. Hyvärinen, Independent component analysis in the presence of gaussian noise by maximizing joint likelihood, Neurocomputing 22 (1) (1998) 49–67.
[17]A. Hyvärinen, New approximations of differential entropy for independent component analysis and projection pursuit. In: Advances in neural information processing systems. (1998b) pp. 273–279.
[18]A. Hyvarinen, Fast and robust fixed-point algorithms for independent component analysis, IEEE Trans. Neural Networks 10 (3) (1999) 626–634.
[19]A. Hyvarinen, Gaussian moments for noisy independent component analysis, IEEE Signal Process. Lett. 6 (6) (1999) 145–147.
[20]A. Hyvärinen, J. Karhunen, E. Oja, Independent Component Analysis, Vol. 46, John Wiley & Sons, 2004.
[21]A. Hyvärinen, E. Oja, Independent component analysis: algorithms and applications, Neural Networks 13 (4) (2000) 411–430.
[22]D. Langlois, S. Chartier, D. Gosselin, An introduction to independent component analysis: Infomax and fastica algorithms, Tutorials Quantit. Methods Psychol. 6 (1) (2010) 31–38.
[23]T.-W. Lee, Independent component analysis, in: Independent Component Analysis, Springer, 1998, pp. 27–66.
[24]T.-W. Lee, M.S. Lewicki, Unsupervised image classification, segmentation, and enhancement using ica mixture models, IEEE Trans. Image Process. 11 (3) (2002) 270–279.
[25]T.-W. Lee, M.S. Lewicki, T.J. Sejnowski, Ica mixture models for unsupervised classification of non-gaussian classes and automatic context switching in blind signal separation, IEEE Trans. Pattern Anal. Mach. Intell. 22 (10) (2000) 1078–1089.
[26]M.S. Lewicki, T.J. Sejnowski, Learning overcomplete representations, Neural Comput. 12 (2) (2000) 337–365.
[27]F. Long, B. Kong, Independent component analysis and its application in the fingerprint image preprocessing, in: Proceedings. International Conference on Information Acquisition, IEEE, 2004, pp. 365–368.
[28]B.A. Pearlmutter, L.C. Parra, Maximum likelihood blind source separation: A context-sensitive generalization of ica. In: Advances in neural information processing systems. 1997, pp. 613–619.
[29]M.B. Pontifex, K.L. Gwizdala, A.C. Parks, M. Billinger, C. Brunner, Variability of ica decomposition may impact eeg signals when used to remove eyeblink artifacts, Psychophysiology 54 (3) (2017) 386–398.
[30]S. Shimizu, P.O. Hoyer, A. Hyvärinen, A. Kerminen, A linear non-gaussian acyclic model for causal discovery, J. Mach. Learn. Res. 7 (Oct) (2006) 2003–2030.
[31]J. Shlens, A tutorial on independent component analysis. arXiv preprint arXiv:1404.2986, 2014.
[32]J.V. Stone, Independent Component Analysis: A Tutorial Introduction, A Bradford Book, MIT Press, Cambridge, MA, 2004.
[33]A. Tharwat, Principal component analysis-a tutorial, Int. J. Appl. Pattern Recognit. 3 (3) (2016) 197–240.
[34]J. Xie, P.K. Douglas, Y.N. Wu, A.L. Brody, A.E. Anderson, Decoding the encoding of functional brain networks: an fmri classification comparison of non-negative matrix factorization (nmf), independent component analysis (ica), and sparse coding algorithms, J. Neurosci. Methods 282 (2017) 81–94.
[35]H.-J. Zhang, Z.-C. Mu, W. Qu, L.-M. Liu, C.-Y. Zhang, A novel approach for ear recognition based on ica and rbf network, in: Proceedings of International Conference on Machine Learning and Cybernetics, Vol. 7, IEEE, 2005, pp. 4511–4515.
[36]M. Zibulevsky, B.A. Pearlmutter, Blind source separation by sparse decomposition in a signal dictionary, Neural Computat. 13 (4) (2001) 863–882.