Gabor Frames and Deep Scattering Networks in Audio Processing

Roswitha Bammer 1,*, Monika Dörfler 1 and Pavol Harar 1,2

1 NuHAG, Faculty of Mathematics, University of Vienna, 1090 Wien, Austria; monika.doerfler@univie.ac.at (M.D.); pavol.harar@univie.ac.at (P.H.)
2 Department of Telecommunications, Brno University of Technology, 60190 Brno, Czech Republic
* Correspondence: roswitha.bammer@univie.ac.at; Tel.: +43-1-4277-50460

Received: 16 April 2019; Accepted: 20 September 2019; Published: 26 September 2019

Abstract: This paper introduces Gabor scattering, a feature extractor based on Gabor frames and Mallat's scattering transform. By using a simple signal model for audio signals, specific properties of Gabor scattering are studied. It is shown that, for each layer, specific invariances to certain signal characteristics occur. Furthermore, deformation stability of the coefficient vector generated by the feature extractor is derived by using a decoupling technique which exploits the contractivity of general scattering networks. Deformations are introduced as changes in spectral shape and frequency modulation. The theoretical results are illustrated by numerical examples and experiments. Numerical evidence is given by evaluation on a synthetic and a "real" dataset that the invariances encoded by the Gabor scattering transform lead to higher performance in comparison with just using the Gabor transform, especially when few training samples are available.

Keywords: machine learning; scattering transform; Gabor transform; deep learning; time-frequency analysis; CNN

1. Introduction

During the last two decades, enormous amounts of digitally encoded and stored audio have become available. For various purposes, the audio data, be it music or speech, need to be structured and understood.
Recent machine learning techniques, known as (deep) convolutional neural networks (CNNs), have led to state-of-the-art results for several tasks such as classification, segmentation, or voice detection, cf. [1,2]. CNNs were originally proposed for images [1], which may be directly fed into a network. Audio signals, on the other hand, commonly undergo some preprocessing to extract features that are then used as input to a trainable machine. Very often, these features consist of one or several two-dimensional arrays, so that the image processing situation is mimicked in a certain sense. However, the question of the impact of this very first processing step is important, and it is not entirely clear whether a short-time Fourier transform (STFT), here based on Gabor frames, which is the most common representation system used in the analysis of audio signals, leads to optimal feature extraction. The convolutional layers of a CNN can themselves be seen as a feature extractor, often followed by a classification stage, either in the form of one or several dense network layers or of classification tools such as a support vector machine (SVM). Stéphane Mallat gave a first mathematical analysis of CNNs as feature extractors, thereby introducing the so-called scattering transform, based on wavelet transforms and a modulus nonlinearity in each layer [3]. The basic structure thus parallels that of CNNs, as these networks are equally composed of multiple layers of local convolutions, followed by a nonlinearity and, optionally, a pooling operator, cp. Section 1.1.

Axioms 2019, 8, 106; doi:10.3390/axioms8040106

In the present contribution, we consider an approach inspired by Mallat's scattering transform, but based on Gabor frames and the corresponding Gabor transform (GT). The resulting feature extractor is called Gabor scattering (GS).
Our approach is a special case of the extension of Mallat's scattering transform proposed by Wiatowski and Bölcskei [4,5], which introduces the possibility to use different semi-discrete frames, Lipschitz-continuous nonlinearities, and pooling operators in each layer. In [3,6,7], invariance and deformation stability properties of the scattering transform with respect to operators defined via some group action were studied. In the more general setting of [4,5], vertical translation invariance, depending on the network depth, and deformation stability for band-limited functions have been proved. In this contribution, we study the same properties for the GS and a particular class of signals which model simple musical tones (Section 2.2). Due to this concrete setting, we obtain quantitative invariance statements and deformation stability with respect to specific, musically meaningful signal deformations. Invariances are studied considering the first two layers, where the feature extractor extracts certain signal features of the signal model (i.e., frequency and envelope information), cp. Section 3.1.1. By using a low-pass filter and pooling in each layer, the temporal fine structure of the signal is averaged out. This results in invariance with respect to the envelope in the first layer output and frequency invariance in the second layer output. To compute deformation bounds for the GS feature extractor, we assume more specific restrictions than band-limitation and use the decoupling technique, first presented in [4,8]. Deformation stability is proven by only computing the robustness of the signal class w.r.t. spectral shape and frequency modulation, see Section 3.1.2. The robustness results, together with the contractivity of the feature extractor, which is given by the network architecture, yield deformation stability. To empirically demonstrate the benefits of the GS time-frequency representation for classification, we have conducted a set of experiments.
In a supervised learning setting, where the main aim is the multiclass classification of generated sounds, we have utilized a CNN as a classifier. In these numerical experiments, we compare the GS to an STFT-based representation. We demonstrate the benefits of GS in a quasi-ideal setting on a self-implemented synthetic dataset, and we also investigate whether it benefits the performance on a real dataset, namely GoodSounds [9]. Moreover, we focus on comparing these two time-frequency representations in terms of performance for limited sizes of training data, see Section 4.

1.1. Convolutional Neural Networks (CNNs) and Invariance

CNNs are a specific class of neural network architectures which have shown extremely convincing results on various machine learning tasks in the past decade. Most of the problems addressed using CNNs are based on often big amounts of annotated data, in which case one speaks of supervised learning. When learning from data, the intrinsic task of the learning architecture is to gradually extract useful information and suppress redundancies, which always abound in natural data. More formally, the learning problem of interest may be invariant to various changes of the original data, and the machine or network must learn these invariances in order to avoid overfitting. As, given a sufficiently rich architecture, a deep neural network can practically fit arbitrary data, cp. [10,11], good generalization properties depend on the systematic incorporation of the intrinsic invariances of the data. Generalization properties hence suffer if the architecture is too rich given the amount of available data. This problem is often addressed by using data augmentation. Here, we raise the hypothesis that using prior representations which encode some potentially useful invariances will increase the generalization quality, in particular when using a restricted size of data set.
The evaluation of the performance on validation data in comparison to the results on test data strengthens our hypothesis for the experimental problem presented in Section 4.

To understand the mathematical construction used within this paper, we briefly introduce the principal idea and structure of a CNN. We shall see that scattering transforms in general, and the GS in particular, follow a similar concept of concatenating various processing steps, which ultimately leads to rather flexible grades of invariance in dependence on the chosen parameters. Usually, a CNN consists of several layers, namely an input layer, several hidden layers (as we consider the case of deep CNNs, the number of hidden layers is supposed to be ≥ 2), and one output layer. A hidden layer consists of the following steps: first, the convolution of the data with a small weighting matrix, often referred to as a kernel, which can be interpreted as a localization of certain properties of the input data. The main advantage of this setup is that only the size and number of these (convolutional) kernels are fixed, while their coefficients are learned during training. So they reflect the structure of the training data in the best way w.r.t. the task being solved. The next building block of the hidden layer is the application of a nonlinearity, also called an activation function, which signals whether the information of a neuron is relevant enough to be transmitted. Finally, to reduce redundancy and increase invariance, pooling is applied. Due to these building blocks, invariances to specific deformations and variations in the dataset are generated in dependence on the specific filters used, whether they are learned, as in the classical CNN case, or designed, as in the case of scattering transforms [13].
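The three building blocks just described (a convolution with a small kernel, a pointwise nonlinearity, and pooling) can be sketched in a few lines of numpy. This is a minimal illustration only; the input size, kernel size, and the random kernel are purely illustrative choices, not parameters from the experiments in this paper.

```python
import numpy as np

def conv2d_valid(x, k):
    """Valid 2D correlation of input x with kernel k (no padding)."""
    H, W = x.shape
    h, w = k.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + h, j:j + w] * k)
    return out

def relu(x):
    """Pointwise nonlinearity (rectified linear unit)."""
    return np.maximum(x, 0.0)

def max_pool(x, p):
    """Non-overlapping p x p max pooling; input is cropped to a multiple of p."""
    H, W = (x.shape[0] // p) * p, (x.shape[1] // p) * p
    x = x[:H, :W]
    return x.reshape(H // p, p, W // p, p).max(axis=(1, 3))

# One hidden layer: convolution, then nonlinearity, then pooling.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # toy input "image"
k = rng.standard_normal((3, 3))   # in a CNN, these coefficients would be learned
y = max_pool(relu(conv2d_valid(x, k)), 2)  # (8-3+1)//2 = 3, so shape (3, 3)
```

In a trained CNN the kernel coefficients are fitted to the data; in a scattering network they are fixed in advance, but the layer structure is the same.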
In this work, we will derive concrete qualitative statements about invariances for a class of music signals and will show by numerical experiments that these invariances indeed lead to a better generalization of the CNNs used to classify data. Note that in a neural network, in particular in CNNs, the output, e.g., classification labels, is obtained after several concatenated hidden layers. In the case of a scattering network, the outputs of all layers are stacked together into a feature vector, and further processing is necessary to obtain the desired result. Usually, after some kind of dimensionality reduction, cf. [14], this vector can be fed into an SVM or a dense NN, which performs the classification task.

1.2. Invariance Induced by Gabor Scattering

In this section, we give a motivation for the theory and the underlying aim of this paper. In Figure 1, we see sound examples from different classes, where Class 0 is a pure tone with 5 harmonics, Class 1 is an amplitude-modulated version thereof, Class 2 is the frequency-modulated version, and Class 3 contains the amplitude- and frequency-modulated signal. So, we have classes with different amplitudes, as is clearly visible in the waveforms shown in the left-most plots. In this paper, we introduce GS as a new feature extractor that introduces certain invariances. GS has several layers, denoted by OutA, OutB, and OutC, and each layer is invariant with respect to some features. The first layer, here OutA, is the spectrogram of the waveform, so we see the time-frequency content of the four classes. OutB can be seen to be invariant with respect to amplitude changes, whereas the last layer, OutC, is invariant with respect to frequency content while encoding the amplitude information. With GS it is therefore possible to separate different qualities of information contained in a spectrogram. We introduce GS mathematically in Section 2.1 and elaborate on the resulting invariances in different layers in Section 3.1.1.
Numerical experiments, showing the benefit of GS, are discussed in Section 4.

We point out that the term kernel as used in this work always means convolutional kernels in the sense of filterbanks. Both the fixed kernels used in the scattering transform and the kernels used in the CNNs, whose size is fixed but whose elements are learned, should be interpreted as convolutional kernels in a filterbank. This should not be confused with the kernels used in classical machine learning methods based on reproducing kernel Hilbert spaces, e.g., the famous support vector machine, cf. [12].

Figure 1. Waveform and Outputs A (i.e., GT), B, and C of Gabor scattering (GS) for all four classes (Class 0 to Class 3) of generated sound.

2. Materials and Methods

2.1. Gabor Scattering

As Wiatowski and Bölcskei used general semi-discrete frames to obtain a wider class of window functions for the scattering transform (cp. [4,5]), it seems natural to consider specific frames used for audio data analysis. Therefore, we use Gabor frames for the scattering transform and study the corresponding properties. We next introduce the basics of Gabor frames; refer to [15] for more details.

A sequence $(g_k)_{k=1}^{\infty}$ of elements in a Hilbert space $\mathcal{H}$ is called a frame if there exist positive frame bounds $A, B > 0$ such that for all $f \in \mathcal{H}$
$$A \|f\|_2^2 \le \sum_{k=1}^{\infty} |\langle f, g_k\rangle|^2 \le B \|f\|_2^2. \quad (1)$$
If $A = B$, then we call $(g_k)_{k=1}^{\infty}$ a tight frame.

Remark 1. In our context, the Hilbert space $\mathcal{H}$ is either $L^2(\mathbb{R})$ or $\ell^2(\mathbb{Z})$.

To define Gabor frames we need to introduce two operators, namely the translation and the modulation operator.

The translation (time shift) operator:
– for a function $f \in L^2(\mathbb{R})$ and $x \in \mathbb{R}$ is defined as $T_x f(t) := f(t - x)$ for all $t \in \mathbb{R}$;
– for a function $f \in \ell^2(\mathbb{Z})$ and $k \in \mathbb{Z}$ is defined as $T_k f(j) := (f(j - k))_{j \in \mathbb{Z}}$.

The modulation (frequency shift) operator:
– for a function $f \in L^2(\mathbb{R})$ and $\omega \in \mathbb{R}$ is defined as $M_\omega f(t) := e^{2\pi i \omega t} f(t)$ for all $t \in \mathbb{R}$;
– for a function $f \in \ell^2(\mathbb{Z})$ and $\omega \in [-\frac{1}{2}, \frac{1}{2}]$ is defined as $M_\omega f(j) := (e^{2\pi i \omega j} f(j))_{j \in \mathbb{Z}}$.

We use these operators to express the STFT of a function $f \in \mathcal{H}$ with respect to a given window function $g \in \mathcal{H}$ as $V_g f(x, \omega) = \langle f, M_\omega T_x g\rangle$. To reduce redundancy, we sample $V_g f$ on a separable lattice $\Lambda = \alpha\mathbb{Z} \times I$, where $I = \beta\mathbb{Z}$ in case of $\mathcal{H} = L^2(\mathbb{R})$, and $I = \{0, \ldots, \frac{M-1}{M}\}$ with $\beta = \frac{1}{M}$ in case $\mathcal{H} = \ell^2(\mathbb{Z})$. The sampling is done in time by $\alpha > 0$ and in frequency by $\beta > 0$. The resulting samples correspond to the coefficients of $f$ with respect to a "Gabor system".

Definition 1 (Gabor System). Given a window function $0 \neq g \in \mathcal{H}$ and lattice parameters $\alpha, \beta > 0$, the set of time-frequency shifted versions of $g$,
$$G(g, \alpha, \beta) = \{M_{\beta j} T_{\alpha k}\, g : (\alpha k, \beta j) \in \Lambda\},$$
is called a Gabor system. This Gabor system is called a Gabor frame if it is a frame, see Equation (1).

We proceed to introduce a scattering transform based on Gabor frames. We base our considerations on [4] by using a triplet sequence $\Omega = (\Psi_\ell, \sigma_\ell, S_\ell)_{\ell \in \mathbb{N}}$, where $\ell$ is associated to the $\ell$-th layer of the network. Note that in this contribution we will deal with the Hilbert spaces $L^2(\mathbb{R})$ and $\ell^2(\mathbb{Z})$; more precisely, in the input layer, i.e., the 0th layer, we have $\mathcal{H}_0 = L^2(\mathbb{R})$ and, due to the discretization inherent in the GT, $\mathcal{H}_\ell = \ell^2(\mathbb{Z})$ for all $\ell > 0$. We recall the elements of the triplet:

– $\Psi_\ell := \{g_\lambda\}_{\lambda \in \Lambda_\ell}$ with $g_\lambda = M_{\beta_\ell j} T_{\alpha_\ell k}\, g_\ell$, $\lambda = (\alpha_\ell k, \beta_\ell j)$, is a Gabor frame indexed by a lattice $\Lambda_\ell$.
– A nonlinearity function (e.g., rectified linear units, modulus function, see [4]) $\sigma_\ell : \mathbb{C} \to \mathbb{C}$ is applied pointwise and is chosen to be Lipschitz-continuous, i.e., $\|\sigma_\ell f - \sigma_\ell h\|_2 \le L_\ell \|f - h\|_2$ for all $f, h \in \mathcal{H}$. In this paper we only use the modulus function, with Lipschitz constant $L_\ell = 1$ for all $\ell \in \mathbb{N}$.
– Pooling depends on a pooling factor $S_\ell > 0$, which leads to dimensionality reduction. Mostly used are max- or average-pooling; some more examples can be found in [4].
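In the finite, discrete setting, the Gabor coefficients of Definition 1 can be computed by folding and an FFT, and the frame inequality (1) can be checked directly. The sketch below uses a rectangular window whose time-frequency shifts tile $\mathbb{Z}_N$ disjointly, so the system is a tight frame with $A = B = M$; the window and lattice parameters are illustrative choices, not taken from the paper.

```python
import numpy as np

def gabor_coeffs(f, g, a, M):
    """Gabor coefficients C[j, k] = <f, M_{j/M} T_{ak} g> on ell^2(Z_N).

    Time hop a, M frequency channels; assumes len(f) % a == 0 and len(f) % M == 0.
    C[j, k] = sum_t f(t) * conj(g(t - ak)) * exp(-2 pi i j t / M).
    """
    N = len(f)
    C = np.empty((M, N // a), dtype=complex)
    for k in range(N // a):
        x = f * np.conj(np.roll(g, a * k))    # f(t) * conj(T_{ak} g(t)), cyclic shift
        s = x.reshape(N // M, M).sum(axis=0)  # fold t modulo M (the exponential is M-periodic in t)
        C[:, k] = np.fft.fft(s)               # one length-M FFT yields all channels j
    return C

# Rectangular window of length M with hop a = M: disjoint blocks, tight frame A = B = M.
rng = np.random.default_rng(1)
N, M, a = 32, 8, 8
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)
g = np.zeros(N)
g[:M] = 1.0
C = gabor_coeffs(f, g, a, M)
lhs = np.sum(np.abs(C) ** 2)  # sum over the lattice of |<f, M T g>|^2; equals M * ||f||_2^2 here
```

For overlapping windows (e.g., a Hann window with hop a < M) the same routine applies, with frame bounds A < B in general.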
In our context, pooling is covered by choosing specific lattices $\Lambda_\ell$ in each layer. To explain the interpretation of GS as a CNN, we write $I(g)(t) = g(-t)$ and have
$$|\langle f, M_{\beta j} T_{\alpha k}\, g\rangle| = \big| \big(f * I(M_{\beta j}\, g)\big)(\alpha k) \big|. \quad (2)$$
Thus, the Gabor coefficients can be interpreted as samples of a convolution. We start by defining "paths" on index sets $q := (q_1, \ldots, q_\ell) = (\beta_1 j_1, \ldots, \beta_\ell j_\ell) \in \beta_1\mathbb{Z} \times \cdots \times \beta_\ell\mathbb{Z} =: B_\ell$, $\ell \in \mathbb{N}$.

Definition 2 (Gabor Scattering). Let $\Omega = (\Psi_\ell, \sigma_\ell, \Lambda_\ell)_{\ell \in \mathbb{N}}$ be a given triplet sequence. Then the components of the $\ell$-th layer of the GS transform are defined to be the output of the operator $U_\ell[q_\ell] : \mathcal{H}_{\ell-1} \to \mathcal{H}_\ell$, $q_\ell \in \beta_\ell\mathbb{Z}$:
$$f_\ell^{(q_1,\ldots,q_\ell)}(k) = U_\ell[\beta_\ell j_\ell]\, f_{\ell-1}^{(q_1,\ldots,q_{\ell-1})}(k) := \sigma_\ell\Big(\big\langle f_{\ell-1}^{(q_1,\ldots,q_{\ell-1})}, M_{\beta_\ell j_\ell} T_{\alpha_\ell k}\, g_\ell\big\rangle_{\mathcal{H}_{\ell-1}}\Big), \quad k \in \mathbb{Z}, \quad (3)$$
where $f_{\ell-1}$ is some output vector of the previous layer and $f_\ell \in \mathcal{H}_\ell$ for all $\ell \in \mathbb{N}$. The GS operator is defined as $U[q] f = U[(q_1, \ldots, q_\ell)] f := U_\ell[q_\ell] \cdots U_1[q_1] f$.

Similar to [4], for each layer we use one atom of the Gabor frame in the subsequent layer as output-generating atom, i.e., $\phi_{\ell-1} := g_\ell$. Note that convolution with this element corresponds to low-pass filtering. (In general, one could take $\phi_{\ell-1} := g_\lambda$, $\lambda \in \Lambda_\ell$. As this element is used in the $\ell$-th convolution, it is an element of the $\ell$-th frame, but because it belongs to the $(\ell-1)$-th layer, its index is $\ell - 1$.) We next introduce a countable set $Q := \bigcup_{\ell=0}^{\infty} B_\ell$, which is the union of all possible paths of the net, and the space $(\ell^2(\mathbb{Z}))^Q$ of sets of $Q$ elements from $\ell^2(\mathbb{Z})$. Now we define the feature extractor $\Phi_\Omega(f)$ of a signal $f \in L^2(\mathbb{R})$ as in ([4], Definition 3), based on chosen (not learned) Gabor windows.

Definition 3 (Feature Extractor). Let $\Omega = (\Psi_\ell, \sigma_\ell, \Lambda_\ell)_{\ell \in \mathbb{N}}$ be a triplet sequence and $\phi_\ell$ the output-generating atom for layer $\ell$. Then the feature extractor $\Phi_\Omega : L^2(\mathbb{R}) \to (\ell^2(\mathbb{Z}))^Q$ is defined as
$$\Phi_\Omega(f) := \bigcup_{\ell=0}^{\infty} \big\{(U[q] f) * \phi_\ell\big\}_{q \in B_\ell}. \quad (4)$$
In the following section we introduce the signal model considered in this paper.

2.2. Musical Signal Model

Tones are one of the smallest units and simplest models of an audio signal, consisting of one fundamental frequency $\xi_0$, corresponding harmonics $n\xi_0$, and a shaping envelope $A_n$ for each harmonic, providing a specific timbre. Further, as our ears are limited to frequencies below 20 kHz, we develop our model over finitely many harmonics, i.e., $\{1, \ldots, N\} \subset \mathbb{N}$. The general model has the following form,
$$f(t) = \sum_{n=1}^{N} A_n(t)\, e^{2\pi i \eta_n(t)}, \quad (5)$$
where $A_n(t) \ge 0$ for all $n \in \{1, \ldots, N\}$ and all $t$. For one single tone we choose $\eta_n(t) = n\xi_0 t$. Moreover, we consider the space of tones $\mathcal{T} = \big\{\sum_{n=1}^{N} A_n(t)\, e^{2\pi i n\xi_0 t} \mid A_n \in C_c^\infty(\mathbb{R})\big\}$ and assume $\|A_n\|_\infty \le \frac{1}{n}$.

3. Theoretical Results

3.1. Gabor Scattering of Music Signals

3.1.1. Invariance

In [6] it was already stated that, due to the structure of the scattering transform, the energy of the signal is pushed towards low frequencies, where it is then captured by a low-pass filter as output-generating atom. The current section explains how GS separates relevant structures of signals modeled by the signal space $\mathcal{T}$. Due to the smoothing action of the output-generating atom, each layer expresses certain invariances, which will be illustrated by numerical examples in Section 3.2.

In Proposition 1, inspired by [6], we add some assumptions on the analysis window $g_1$ in the first layer: $|\hat g_1(\omega)| \le C_{\hat g_1} (1 + |\omega|^s)^{-1}$ for some $s > 1$ and $\|t\, g_1(t)\|_1 = C_{g_1} < \infty$.

Proposition 1 (Layer 1). Let $f \in \mathcal{T}$ with $\|A_n'\|_\infty \le C_n < \infty$ for all $n \in \{1, \ldots, N\}$. For fixed $j$, for which $n_0 = \operatorname{argmin}_{n \in \{1,\ldots,N\}} |\beta_1 j - \xi_0 n|$ with $|\beta_1 j - \xi_0 n_0| \le \frac{\xi_0}{2}$ can be found, we obtain
$$U_1[\beta_1 j](f)(k) = |\langle f, M_{\beta_1 j} T_{\alpha_1 k}\, g_1\rangle| = A_{n_0}(\alpha_1 k)\, |\hat g_1(\beta_1 j - n_0 \xi_0)| + E_1(k), \quad (6)$$
$$|E_1(k)| \le C_{g_1} \sum_{n=1}^{N} \big\|A_n' \cdot T_{\alpha_1 k}\chi_{[-\alpha_1, \alpha_1]}\big\|_\infty + C_{\hat g_1} \sum_{n=2-n_0}^{N-n_0} \frac{1}{n_0 + n - 1} \Big(1 + \big|\xi_0\big(n - \tfrac{1}{2}\big)\big|^s\Big)^{-1}, \quad (7)$$
where $\chi$ is the indicator function.

Remark 2.
Equation (6) shows that for slowly varying amplitude functions $A_n$, the first layer mainly captures the contributions near the frequencies of the tone's harmonics. Obviously, for time sections during which the envelopes $A_n$ undergo faster changes, such as during a tone's onset, energy will also be found outside a small interval around the harmonics' frequencies, and thus the error estimate Equation (7) becomes less stringent. The second term of the error in Equation (7) depends only on the window $g_1$, and its behavior is governed by the frequency decay of $\hat g_1$. Note that the error bound increases for lower frequencies, as the separation of the fundamental frequency and corresponding harmonics by the analysis window deteriorates.

Proof. Step 1: Using the signal model for tones as input, interchanging the finite sum with the integral and performing the substitution $u = t - \alpha_1 k$, we obtain
$$\langle f, M_{\beta_1 j} T_{\alpha_1 k}\, g_1\rangle = \sum_{n=1}^{N} \langle M_{n\xi_0} A_n, M_{\beta_1 j} T_{\alpha_1 k}\, g_1\rangle = \sum_{n=1}^{N} \langle A_n, M_{\beta_1 j - n\xi_0} T_{\alpha_1 k}\, g_1\rangle = \sum_{n=1}^{N} \int_{\mathbb{R}} A_n(u + \alpha_1 k)\, \overline{g_1(u)}\, e^{-2\pi i (\beta_1 j - n\xi_0)(u + \alpha_1 k)}\, du.$$
After performing a Taylor series expansion locally around $\alpha_1 k$, i.e., $A_n(u + \alpha_1 k) = A_n(\alpha_1 k) + u R_n(\alpha_1 k, u)$, where the remainder can be estimated by $|R_n(\alpha_1 k, u)| \le \|A_n' \cdot T_{\alpha_1 k}\chi_{[-\alpha_1, \alpha_1]}\|_\infty$, we have
$$\langle f, M_{\beta_1 j} T_{\alpha_1 k}\, g_1\rangle = \sum_{n=1}^{N} \Big( e^{-2\pi i (\beta_1 j - n\xi_0)\alpha_1 k}\, A_n(\alpha_1 k) \int_{\mathbb{R}} \overline{g_1(u)}\, e^{-2\pi i (\beta_1 j - n\xi_0)u}\, du + \int_{\mathbb{R}} u\, R_n(\alpha_1 k, u)\, \overline{g_1(u)}\, e^{-2\pi i (\beta_1 j - n\xi_0)(u + \alpha_1 k)}\, du \Big).$$
Therefore, we choose $n_0 = \operatorname{argmin}_n |\beta_1 j - \xi_0 n|$, set
$$E_n(k) = \int_{\mathbb{R}} u\, R_n(\alpha_1 k, u)\, \overline{g_1(u)}\, e^{-2\pi i (\beta_1 j - n\xi_0)(u + \alpha_1 k)}\, du, \quad (8)$$
$$E(k) = \sum_{\substack{n=1 \\ n \neq n_0}}^{N} e^{-2\pi i (\beta_1 j - n\xi_0)\alpha_1 k}\, A_n(\alpha_1 k)\, \hat g_1(\beta_1 j - n\xi_0), \quad (9)$$
and split the sum to obtain
$$\langle f, M_{\beta_1 j} T_{\alpha_1 k}\, g_1\rangle = A_{n_0}(\alpha_1 k)\, e^{-2\pi i (\beta_1 j - n_0\xi_0)\alpha_1 k}\, \hat g_1(\beta_1 j - n_0 \xi_0) + E(k) + \sum_{n=1}^{N} E_n(k).$$
Step 2: We bound the error terms, starting with Equation (8):
$$\Big|\sum_{n=1}^{N} E_n(k)\Big| = \Big|\sum_{n=1}^{N} \int_{\mathbb{R}} u\, R_n(\alpha_1 k, u)\, \overline{g_1(u)}\, e^{-2\pi i (\beta_1 j - n\xi_0)(u + \alpha_1 k)}\, du\Big|.$$
Using the triangle inequality and the estimate for the Taylor remainder, we obtain, together with the assumption on the analysis window,
$$\Big|\sum_{n=1}^{N} E_n(k)\Big| \le \sum_{n=1}^{N} \big\|A_n' \cdot T_{\alpha_1 k}\chi_{[-\alpha_1, \alpha_1]}\big\|_\infty \int_{\mathbb{R}} |u\, g_1(u)|\, du = C_{g_1} \sum_{n=1}^{N} \big\|A_n' \cdot T_{\alpha_1 k}\chi_{[-\alpha_1, \alpha_1]}\big\|_\infty.$$
For the second bound, i.e., the bound of Equation (9), we use the decay condition on $\hat g_1$; thus
$$|E(k)| \le C_{\hat g_1} \sum_{\substack{n=1 \\ n \neq n_0}}^{N} |A_n(\alpha_1 k)|\, \big(1 + |\beta_1 j - \xi_0 n|^s\big)^{-1}.$$
Next we split the sum into $n > n_0$ and $n < n_0$. We estimate the error term for $n > n_0$:
$$\sum_{n=n_0+1}^{N} |A_n(\alpha_1 k)|\, \big(1 + |\beta_1 j - \xi_0 n|^s\big)^{-1} = \sum_{n=1}^{N-n_0} |A_{n_0+n}(\alpha_1 k)|\, \big(1 + |\beta_1 j - \xi_0 n_0 - \xi_0 n|^s\big)^{-1}. \quad (10)$$
As $n_0 = \operatorname{argmin}_n |\beta_1 j - \xi_0 n|$, we have $|\beta_1 j - \xi_0 n_0| \le \frac{\xi_0}{2}$ and, also using $\|A_n\|_\infty \le \frac{1}{n}$, we obtain
$$\sum_{n=1}^{N-n_0} |A_{n_0+n}(\alpha_1 k)| \Big(1 + \big|\xi_0\big(n - \tfrac{1}{2}\big)\big|^s\Big)^{-1} \le \sum_{n=1}^{N-n_0} \frac{1}{n_0 + n} \Big(1 + \big|\xi_0\big(n - \tfrac{1}{2}\big)\big|^s\Big)^{-1}. \quad (11)$$
Further, we estimate the error for $n < n_0$:
$$\sum_{n=1}^{n_0-1} |A_n(\alpha_1 k)|\, \big(1 + |\beta_1 j - \xi_0 n|^s\big)^{-1} = \sum_{n=1}^{n_0-1} |A_n(\alpha_1 k)|\, \big(1 + |\beta_1 j - \xi_0 n_0 + \xi_0 n_0 - \xi_0 n|^s\big)^{-1},$$
where we added and subtracted the term $\xi_0 n_0$. Due to the reverse triangle inequality and $|\beta_1 j - \xi_0 n_0| \le \frac{\xi_0}{2}$, we obtain $|\beta_1 j - \xi_0 n_0 + \xi_0 (n_0 - n)| \ge \xi_0 (n_0 - n) - \frac{\xi_0}{2}$. For convenience, we set $m = n_0 - n$, so that the bound reads $\xi_0 \big(m - \frac{1}{2}\big)$. Shifting the sum, i.e., taking $n = 1 - m$, and using $\|A_n\|_\infty \le \frac{1}{n}$, we get
$$\sum_{m=1}^{n_0-1} |A_{n_0-m}(\alpha_1 k)| \Big(1 + \big|\xi_0\big(m - \tfrac{1}{2}\big)\big|^s\Big)^{-1} \le \sum_{n=2-n_0}^{0} \frac{1}{n_0 + n - 1} \Big(1 + \big|\xi_0\big(n - \tfrac{1}{2}\big)\big|^s\Big)^{-1}. \quad (12)$$
Combining the two sums, Equations (11) and (12), and observing that $\frac{1}{n_0+n} < \frac{1}{n_0+n-1}$, we obtain
$$|E(k)| \le C_{\hat g_1} \sum_{n=2-n_0}^{N-n_0} \frac{1}{n_0 + n - 1} \Big(1 + \big|\xi_0\big(n - \tfrac{1}{2}\big)\big|^s\Big)^{-1}. \quad (13)$$
Summing up the error terms, we obtain Equation (7). □

To obtain the GS coefficients, we need to apply the output-generating atom as in Equation (4).
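The content of Proposition 1 can be checked numerically: for a tone from $\mathcal{T}$ with slowly varying envelopes, the modulus of the first-layer Gabor coefficients at a fixed time concentrates in the channels closest to the harmonics $n\xi_0$. The sketch below builds a tone following the signal model (5) and inspects one windowed FFT; the sampling rate, Gaussian window, and tone parameters are illustrative choices, not taken from the paper.

```python
import numpy as np

# A tone per the signal model (5): f(t) = sum_n A_n(t) exp(2 pi i n xi0 t),
# with a slowly varying envelope A_n shared up to a 1/n amplitude factor.
fs = 8000
t = np.arange(fs) / fs                      # 1 second of samples
xi0, N_harm = 200.0, 3
env = np.hanning(len(t))                    # smooth, slowly varying envelope
f = sum((env / n) * np.exp(2j * np.pi * n * xi0 * t) for n in range(1, N_harm + 1))

# First-layer coefficients at one fixed time: a windowed FFT, i.e., inner
# products with Gabor atoms M T g_1 for a Gaussian window g_1 centred at 0.5 s.
L = 1024
g1 = np.exp(-0.5 * ((np.arange(L) - L / 2) / (L / 8)) ** 2)
seg = f[fs // 2 : fs // 2 + L] * g1
spec = np.abs(np.fft.fft(seg))
freqs = np.fft.fftfreq(L, 1 / fs)

# As in (6), the modulus should peak in a channel near one of the harmonics n*xi0.
peak = freqs[np.argmax(spec)]
```

Away from the harmonics, the coefficients decay at the rate of $\hat g_1$, which is the content of the second error term in (7).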
Corollary 1 (Output of Layer 1). Let $\phi_1 \in \Psi_2$ be the output-generating atom; then the output of the first layer is
$$\big(U_1[\beta_1 j] f * \phi_1\big)(k) = |\hat g_1(\beta_1 j - n_0 \xi_0)|\, (A_{n_0} * \phi_1)(k) + \varepsilon_1(k),$$
where $|\varepsilon_1(k)| \le \|E_1\|_\infty \|\phi_1\|_1$. Here $E_1$ is the error term of Proposition 1.

Remark 3. Note that we focus here on an unmodulated Gabor frame element $\phi_1$, and the convolution may be interpreted as a low-pass filter. Therefore, in dependence on the pooling factor $\alpha_2$, the temporal fine structure of $A_{n_0}$ corresponding to higher frequency content is averaged out.

Proof. For this proof we use the result of Proposition 1; the calculations for the first layer output are as follows:
$$\Big|\sum_{k} |\langle f, M_{\beta_1 j} T_{\alpha_1 k}\, g_1\rangle|\, \phi_1(l - k) - |\hat g_1(\beta_1 j - \xi_0 n_0)| \sum_{k} A_{n_0}(k)\, \phi_1(l - k)\Big| = \Big|\sum_{k} E_1(k)\, \phi_1(l - k)\Big| \le \|E_1\|_\infty \|\phi_1\|_1, \quad (14)$$
where $E_1$ is as in Proposition 1. □

We introduce two more operators: first, the sampling operator $S_a f(x) = f(ax)$ for all $x \in \mathbb{R}$, and second, the periodization operator $P_{\frac{1}{a}} \hat f(\omega) = \sum_{k \in \mathbb{Z}} \hat f\big(\omega - \frac{k}{a}\big)$ for all $\omega \in \mathbb{R}$. These operators have the following relation: $\mathcal{F}\big(S_a(f)\big)(\omega) = P_{\frac{1}{a}}(\hat f)(\omega)$. In order to see how the second layer captures relevant signal structures, depending on the first layer, we propose the following Proposition 2. Recall that $g_\ell \in \mathcal{H}_\ell$ for all $\ell \in \mathbb{N}$.

Proposition 2 (Layer 2). Let $f \in \mathcal{T}$, $\sum_{k \neq 0} \big|\hat A_{n_0}\big(\cdot - \frac{k}{\alpha_1}\big)\big| \le \varepsilon_{\alpha_1}$ and $|\hat g_2(\eta)| \le C_{\hat g_2} (1 + |\eta|^s)^{-1}$. Then the elements of the second layer can be expressed as
$$U_2[\beta_2 h]\, U_1[\beta_1 j] f(m) = |\hat g_1(\beta_1 j - \xi_0 n_0)|\, \big|\big\langle M_{-\beta_2 h} A_{n_0}, T_{\alpha_2 m}\, g_2\big\rangle\big| + E_2(m), \quad (15)$$
where $|E_2(m)| \le \varepsilon_{\alpha_1} C_{\hat g_2}\, |\hat g_1(\beta_1 j - \xi_0 n_0)| \sum_{r} \big(1 + |\beta_2 h - r|^s\big)^{-1} + \|E_1\|_\infty \|g_2\|_1$.

Remark 4. Note that, as the envelopes $A_n$ are expected to change slowly except around transients, their Fourier transforms concentrate their energy in the low frequency range.
Moreover, the modulation term $M_{-\beta_2 h}$ pushes the frequencies of $A_{n_0}$ down by $\beta_2 h$, and therefore they can be captured by the output-generating atom $\phi_2$ in Corollary 2. In Section 3.2 we show, by means of the analysis of example signals, how the second layer output distinguishes tones that have a smooth onset (transient) from those that have a sharp attack, which leads to broadband characteristics of $A_n$ around this attack. Similarly, if $A_n$ undergoes an amplitude modulation, the frequency of this modulation can be clearly discerned, cf. Figure 5 and the corresponding example. This observation is clearly reflected in expression Equation (15).

Proof. Using the outcome of Proposition 1, we obtain
$$U_2[\beta_2 h]\, U_1[\beta_1 j] f(m) = \big|\big\langle S_{\alpha_1}(A_{n_0})\, |\hat g_1(\beta_1 j - \xi_0 n_0)| + E_1,\; M_{\beta_2 h} T_{\alpha_2 m}\, g_2\big\rangle_{\ell^2(\mathbb{Z})}\big| \le \big|\big\langle S_{\alpha_1}(A_{n_0})\, |\hat g_1(\beta_1 j - \xi_0 n_0)|,\; M_{\beta_2 h} T_{\alpha_2 m}\, g_2\big\rangle_{\ell^2(\mathbb{Z})}\big| + \big|\langle E_1, M_{\beta_2 h} T_{\alpha_2 m}\, g_2\rangle_{\ell^2(\mathbb{Z})}\big|.$$
For the error term $E_1(k)$, we use the global estimate $|\langle E_1, M_{\beta_2 h} T_{\alpha_2 m}\, g_2\rangle_{\ell^2(\mathbb{Z})}| \le \|E_1\|_\infty \|g_2\|_1$. Moreover, using the notation above and ignoring the constant term $|\hat g_1(\beta_1 j - \xi_0 n_0)|$, we proceed as follows:
$$\langle S_{\alpha_1}(A_{n_0}), M_{\beta_2 h} T_{\alpha_2 m}\, g_2\rangle_{\ell^2(\mathbb{Z})} = \sum_{k \in \mathbb{Z}} S_{\alpha_1}(A_{n_0})(k)\, \overline{T_{\alpha_2 m}\, g_2(k)}\, e^{-2\pi i \beta_2 h k} = \mathcal{F}\big(S_{\alpha_1}(A_{n_0}) \cdot T_{\alpha_2 m}\, g_2\big)(\beta_2 h) = \big(\mathcal{F}(S_{\alpha_1} A_{n_0}) * \mathcal{F}(T_{\alpha_2 m}\, g_2)\big)(\beta_2 h) = \big(P_{\frac{1}{\alpha_1}}(\hat A_{n_0}) * M_{-\alpha_2 m}\, \hat g_2\big)(\beta_2 h) = \sum_{k \in \mathbb{Z}} \Big(\hat A_{n_0}\big(\cdot - \tfrac{k}{\alpha_1}\big) * M_{-\alpha_2 m}\, \hat g_2\Big)(\beta_2 h). \quad (16)$$
As $\hat g_2$ is concentrated around 0, the right-hand term in Equation (16) can only contain significant values if $\hat A_{n_0}$ has frequency components concentrated around $\beta_2 h$; therefore we consider the case $k = 0$ separately and obtain
$$\langle S_{\alpha_1}(A_{n_0}), M_{\beta_2 h} T_{\alpha_2 m}\, g_2\rangle_{\ell^2(\mathbb{Z})} = \big(\hat A_{n_0} * M_{-\alpha_2 m}\, \hat g_2\big)(\beta_2 h) + \sum_{k \in \mathbb{Z}\setminus\{0\}} \Big(\hat A_{n_0}\big(\cdot - \tfrac{k}{\alpha_1}\big) * M_{-\alpha_2 m}\, \hat g_2\Big)(\beta_2 h). \quad (17)$$
It remains to bound the sum of aliases, i.e., the second term of Equation (17):
$$\Big|\sum_{k \in \mathbb{Z}\setminus\{0\}} \Big(\hat A_{n_0}\big(\cdot - \tfrac{k}{\alpha_1}\big) * M_{-\alpha_2 m}\, \hat g_2\Big)(\beta_2 h)\Big| = \Big|\sum_{r} \sum_{k \in \mathbb{Z}\setminus\{0\}} \hat A_{n_0}\big(r - \tfrac{k}{\alpha_1}\big)\, M_{-\alpha_2 m}\, \hat g_2(\beta_2 h - r)\Big| \le \sum_{r} \sum_{k \in \mathbb{Z}\setminus\{0\}} \big|\hat A_{n_0}\big(r - \tfrac{k}{\alpha_1}\big)\big|\, |\hat g_2(\beta_2 h - r)|. \quad (18)$$
Using the assumption $\sum_{k \neq 0} |\hat A_{n_0}(\cdot - \frac{k}{\alpha_1})| \le \varepsilon_{\alpha_1}$ and also the assumption on the analysis window $g_2$, namely the fast decay of $\hat g_2$, we obtain
$$\sum_{r} \sum_{k \in \mathbb{Z}\setminus\{0\}} \big|\hat A_{n_0}\big(r - \tfrac{k}{\alpha_1}\big)\big|\, |\hat g_2(\beta_2 h - r)| \le \varepsilon_{\alpha_1} \sum_{r} |\hat g_2(\beta_2 h - r)| \le \varepsilon_{\alpha_1} C_{\hat g_2} \sum_{r} \big(1 + |\beta_2 h - r|^s\big)^{-1}. \quad (19)$$
We rewrite the first term in Equation (17) and make use of the operator $I$ introduced in Equation (2):
$$\big(\hat A_{n_0} * M_{-\alpha_2 m}\, \hat g_2\big)(\beta_2 h) = \sum_{r} \hat A_{n_0}(r)\, M_{-\alpha_2 m}\, \hat g_2(\beta_2 h - r) = \big\langle \hat A_{n_0}, T_{\beta_2 h}\, I(M_{-\alpha_2 m}\, \hat g_2)\big\rangle = \big\langle A_{n_0}, M_{\beta_2 h} T_{\alpha_2 m}\, g_2\big\rangle. \quad (20)$$
The last equality in Equation (20) uses Plancherel's theorem. Rewriting the last term, we obtain $\langle A_{n_0}, M_{\beta_2 h} T_{\alpha_2 m}\, g_2\rangle = \langle M_{-\beta_2 h} A_{n_0}, T_{\alpha_2 m}\, g_2\rangle$. □

Remark 5. For sufficiently big $s$, the sum $\sum_{r} (1 + |\beta_2 h - r|^s)^{-1}$ decreases fast; e.g., taking $s = 5$, the sum is approximately 2.

The second layer output is obtained by applying the output-generating atom as in Equation (4).

Corollary 2 (Output of Layer 2). Let $\phi_2 \in \Psi_3$; then the output of the second layer is
$$\big(U_2[\beta_2 h]\, U_1[\beta_1 j] f * \phi_2\big)(m) = |\hat g_1(\beta_1 j - \xi_0 n_0)|\, \Big(\big|\langle M_{-\beta_2 h} A_{n_0}, T_{\alpha_2 m}\, g_2\rangle\big| * \phi_2\Big)(m) + \varepsilon_2(m),$$
where $|\varepsilon_2(m)| \le \|E_2\|_\infty \|\phi_2\|_1$. Here $E_2$ is the error of Proposition 2.

Remark 6. Note that in the second layer, applying the output-generating atom $\phi_2 \in \Psi_3$ removes the fine temporal structure, and thus the second layer output reveals information contained in the envelopes $A_n$.

Proof. The proof is similar to that of the first layer output, see Corollary 1. □

3.1.2. Deformation Stability

In this section, we study the extent to which GS is stable with respect to certain small deformations. This question is interesting, as we may often intuitively assume that the classification of natural signals, be it sound or images, is preserved under mild and possibly local deformations.
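The second-layer analysis above (cf. Remark 4) can be illustrated numerically: for an amplitude-modulated tone, the modulus of one first-layer channel traces the envelope, and a second transform along time recovers the modulation rate. All parameters in the following sketch are illustrative stand-ins, not the paper's implementation values.

```python
import numpy as np

# Amplitude-modulated tone: envelope A(t) = 1 + 0.5 cos(2 pi fmod t) on a
# single complex carrier at xi0 (one harmonic of the model (5)).
fs, dur = 8000, 2.0
t = np.arange(int(fs * dur)) / fs
xi0, fmod = 400.0, 10.0
f = (1.0 + 0.5 * np.cos(2 * np.pi * fmod * t)) * np.exp(2j * np.pi * xi0 * t)

# Layer 1: modulus of the Gabor coefficients in the channel at xi0, sampled
# with hop a1 (a plain sliding windowed inner product with a Hann window).
L, a1 = 256, 64
g1 = np.hanning(L)
carrier = np.exp(-2j * np.pi * xi0 * np.arange(L) / fs)
n_frames = (len(f) - L) // a1
ch = np.array([abs(np.sum(f[k * a1:k * a1 + L] * g1 * carrier))
               for k in range(n_frames)])

# Layer 2: spectrum of the channel envelope; the frame rate is fs / a1, so the
# peak of this spectrum should sit at the modulation rate fmod.
spec = np.abs(np.fft.rfft(ch - ch.mean()))
rate = np.fft.rfftfreq(len(ch), a1 / fs)
peak = rate[np.argmax(spec)]
```

A sharp attack instead of a sinusoidal modulation would spread this second-layer spectrum broadly, which is how the GS separates smooth onsets from percussive ones.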
For the signal class $\mathcal{T}$, we consider musically meaningful deformations and show stability of GS with respect to these deformations. We consider changes in spectral shape as well as frequency modulations. Note that, as opposed to the invariance properties derived in Section 3.1.1 for the output of specific layers, the derived stability results pertain to the entire feature vector obtained from the GS along all included layers, cp. the definition and derivation of deformation stability in [3]. The method we apply is inspired by the authors of [8] and uses the decoupling technique; i.e., to prove stability of the feature extractor, we first take the structural properties of the signal class into account and search for an error bound for deformations of the signals in $\mathcal{T}$. In combination with the contractivity property $\|\Phi_\Omega(f) - \Phi_\Omega(h)\|_2 \le \|f - h\|_2$ of $\Phi_\Omega$, see ([4], Proposition 4), which follows from $B_\ell \le 1$ for all $\ell \in \mathbb{N}$, where $B_\ell$ is the upper frame bound of the Gabor frame $G(g_\ell, \alpha_\ell, \beta_\ell)$, this yields deformation stability of the feature extractor.

Simply deforming a tone would correspond to deformations of the envelopes $A_n$, $n = 1, \ldots, N$. This corresponds to a change in timbre, for example, by playing a note on a different instrument. Mathematically this can be expressed as $D_{A,\tau}(f)(t) = \sum_{n=1}^{N} A_n\big(t + \tau(t)\big)\, e^{2\pi i n \xi_0 t}$.

Lemma 1 (Envelope Changes). Let $f \in \mathcal{T}$ and $|A_n'(t)| \le C_n (1 + |t|^s)^{-1}$, for constants $C_n > 0$, $n = 1, \ldots, N$, and $s > 1$. Moreover let $\|\tau\|_\infty < \frac{1}{2}$. Then
$$\|f - D_{A,\tau}(f)\|_2 \le D\, \|\tau\|_\infty \sum_{n=1}^{N} C_n,$$
for $D > 0$ depending only on $\|\tau\|_\infty$.

Proof. Setting $h_n(t) = A_n(t) - A_n\big(t + \tau(t)\big)$, we obtain
$$\|f - D_{A,\tau}(f)\|_2 \le \sum_{n=1}^{N} \|h_n\|_2.$$
We apply the mean value theorem for the continuously differentiable function $A_n$ and get
$$|h_n(t)| \le \|\tau\|_\infty \sup_{y \in B_{\|\tau\|_\infty}(t)} |A_n'(y)|.$$
Applying the 2-norm to $h_n$ and using the assumption on $A_n'$, we obtain
$$\int_{\mathbb{R}} |h_n(t)|^2\, dt \le \|\tau\|_\infty^2 \int_{\mathbb{R}} \sup_{y \in B_{\|\tau\|_\infty}(t)} |A_n'(y)|^2\, dt \le C_n^2\, \|\tau\|_\infty^2 \int_{\mathbb{R}} \sup_{y \in B_{\|\tau\|_\infty}(t)} (1 + |y|^s)^{-2}\, dt.$$
Splitting the integral into $B_1(0)$ and $\mathbb{R}\setminus B_1(0)$, we obtain
$$\|h_n\|_2^2 \le C_n^2\, \|\tau\|_\infty^2 \Big( \int_{B_1(0)} 1\, dt + \int_{\mathbb{R}\setminus B_1(0)} \sup_{y \in B_{\|\tau\|_\infty}(t)} (1 + |y|^s)^{-2}\, dt \Big).$$
Using the monotonicity of $(1 + |y|^s)^{-1}$ and, in order to remove the supremum, shifting by $\|\tau\|_\infty$, we have
$$\|h_n\|_2^2 \le C_n^2\, \|\tau\|_\infty^2 \Big( \int_{B_1(0)} 1\, dt + \int_{\mathbb{R}\setminus B_1(0)} \big(1 + \big||t| - \|\tau\|_\infty\big|^s\big)^{-2}\, dt \Big).$$
Moreover, for $t \notin B_1(0)$ we have $\big|\big(1 - \frac{\|\tau\|_\infty}{|t|}\big) t\big| \ge \big|(1 - \|\tau\|_\infty)\, t\big|$. This leads to
$$\|h_n\|_2^2 \le C_n^2\, \|\tau\|_\infty^2 \Big( 2 + \int_{\mathbb{R}\setminus B_1(0)} \big(1 + |(1 - \|\tau\|_\infty)\, t|^s\big)^{-2}\, dt \Big).$$
Performing a change of variables, i.e., $x = (1 - \|\tau\|_\infty)\, t$ with $\frac{dx}{dt} = 1 - \|\tau\|_\infty > \frac{1}{2}$, we obtain
$$\|h_n\|_2^2 \le C_n^2\, \|\tau\|_\infty^2 \Big( 2 + 2 \int_{\mathbb{R}} (1 + |x|^s)^{-2}\, dx \Big) = C_n^2\, \|\tau\|_\infty^2 \Big( 2 + 2\, \big\|(1 + |x|^s)^{-1}\big\|_2^2 \Big).$$
Setting $D^2 := 2\big(1 + \|(1 + |x|^s)^{-1}\|_2^2\big)$ and summing up, we obtain
$$\|f - D_{A,\tau}(f)\|_2 \le D\, \|\tau\|_\infty \sum_{n=1}^{N} C_n. \qquad \square$$

Remark 7. The harmonics' energy decreases with increasing frequency, hence $C_n \le C_{n-1}$, and the sum $\sum_{n=1}^{N} C_n$ can be expected to be small.

Another kind of sound deformation results from frequency modulation of $f \in \mathcal{T}$. This corresponds to, for example, playing a higher or lower pitch, or producing a vibrato. This can be formulated as
$$D_\tau : f(t) \mapsto \sum_{n=1}^{N} A_n(t)\, e^{2\pi i (n \xi_0 t + \tau_n(t))}.$$

Lemma 2 (Frequency Modulation). Let $f \in \mathcal{T}$. Moreover let $\|\tau_n\|_\infty \le \frac{\arccos(1 - \frac{\varepsilon^2}{2})}{2\pi}$. Then
$$\|f - D_\tau(f)\|_2 \le \varepsilon \sum_{n=1}^{N} \|A_n\|_2.$$

Proof. We have
$$\|f - D_\tau f\|_2 \le \sum_{n=1}^{N} \|h_n\|_2,$$
with $h_n(t) = A_n(t)\, \big(1 - e^{2\pi i \tau_n(t)}\big)$. Computing the 2-norm of $h_n$, we obtain
$$\int_{\mathbb{R}} |h_n(t)|^2\, dt = \int_{\mathbb{R}} \big|A_n(t)\, \big(1 - e^{2\pi i \tau_n(t)}\big)\big|^2\, dt \le \big\|1 - e^{2\pi i \tau_n}\big\|_\infty^2\, \|A_n\|_2^2.$$
We rewrite
$$\big|1 - e^{2\pi i \tau_n(t)}\big|^2 = \big|1 - \cos(2\pi \tau_n(t)) - i \sin(2\pi \tau_n(t))\big|^2 = 2\big(1 - \cos(2\pi \tau_n(t))\big).$$
Setting $\|1 - e^{2\pi i \tau_n}\|_\infty^2 \le \varepsilon^2$, this term gets small if $\|\tau_n\|_\infty \le \frac{\arccos(1 - \frac{\varepsilon^2}{2})}{2\pi}$. Using the assumptions of our signal model on the envelopes, i.e., $\|A_n\|_2 < \infty$, we obtain
$$\|f - D_\tau(f)\|_2 \le \varepsilon \sum_{n=1}^{N} \|A_n\|_2. \qquad \square$$

Proposition 3 (Deformation Stability). Let $\Phi_\Omega : L^2(\mathbb{R}) \to (\ell^2(\mathbb{Z}))^Q$, $f \in \mathcal{T}$, and $|A_n'(t)| \le C_n (1 + |t|^s)^{-1}$, for constants $C_n > 0$, $n = 1, \ldots$
, N and s > 1. Moreover let ktk < and kt k < . n ¥ n ¥ 2 2p Then the feature extractor F is deformation stable with respect to 2pinx t envelope changes D f (t) = å A t + t(t) e : A n n=1 F f F D ( f ) Dktk C , W W A ¥ n n=1 for D > 0 depending only on ktk . 2pi nx t+t (t) N 0 n frequency modulation D f (t) = å A (t)e : t n n=1 F f F D ( f ) # . W W å n=1 Proof. The Proof follows directly from a result of ([4] Proposition 4), called contractivity property kF ( f )F (h)k k f hk of F , which follows from B 1 8` 2 N, where B is the upper frame W W 2 2 W ` ` bound of the Gabor frame G(g , a , b ) and deformation stability of the signal class in Lemmas 1 and 2. ` ` ` 3.2. Visualization Example In this section, we present some visualizations based on two implementations, one in M ATLAB, which we call the GS implementation, and the other one in Python, which is the channel averaged GS implementation. The main difference between these implementations is an averaging step of Layer 2 in the case of the Python implementation; averaging over channels is introduced in order to obtain a 2D representation in each layer. Furthermore, the averaging step signiﬁcantly accelerates the computation of the second layer output. Referring to Figure 2, the following nomenclature will be used. Layer 1 (L1) is the GT, which, after resampling to the desired size, becomes Out A. Output 1 (O1) is the output of L1, i.e., after applying the output-generating atom. Recall that this is done by a low-pass ﬁltering step. Again, Out B is obtained by resampling to the desired matrix size. Axioms 2019, 8, 106 14 of 25 Out A O1 Out B signal L1 L2 L2avg Gabor transform reassignment resampling time averaging O2 Out C channel averaging Figure 2. Diagram explaining the naming of the GS building blocks of the Python implementation in the following sections. Layer 2 (L2) is obtained by applying another GT for each frequency channel. 
In the MATLAB code, Output 2 (O2) is then obtained by low-pass filtering the separate channels of each resulting spectrogram. In the case of the Python implementation (see Figure 2), we average all the GTs of L2 into one spectrogram (for the sake of speed) and then apply a time-averaging step in order to obtain O2. Resampling to the desired size yields Out C. As input signals for this section, we generate single tones following the signal model from Section 2.2.

3.2.1. Visualization of Different Frequency Channels within the GS Implementation

Figures 3 and 4 show two tones, both having a smooth envelope but different fundamental frequencies and numbers of harmonics. The first tone has fundamental frequency ξ = 800 Hz and 15 harmonics, and the second tone has fundamental frequency ξ = 1060 Hz and 10 harmonics.

Figure 3. First layer (i.e., GT) and Output 1 of two tones with different fundamental frequencies.

Figure 4. Output 2 of two tones with different fundamental frequencies, at different fixed frequency channels of Layer 1 (evaluated at 800 Hz, i.e., frequency channel 38; at 1060 Hz, i.e., frequency channel 50; and at 3200 Hz, i.e., frequency channel 150).

Content of Figures 3 and 4:

Layer 1: The first spectrogram of Figure 3 shows the GT. Observe the difference in the fundamental frequencies and that these two tones have a different number of harmonics, i.e., tone one has more than tone two.

Output 1: The second spectrogram of Figure 3 shows Output 1, which is a time-averaged version of Layer 1.

Output 2: For the second-layer output (see Figure 4), we take a fixed frequency channel from Layer 1 and compute another GT to obtain a Layer 2 element. By applying an output-generating atom, i.e., a low-pass filter, we obtain Output 2.
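To make the two-layer pipeline concrete, the following is a minimal sketch of the channel-averaged GS computation using `scipy.signal.stft` (the same STFT routine the Python implementation is based on). This is an illustrative reimplementation, not the authors' gabor-scattering code; the parameter values and the simple moving-average low-pass filter are assumptions for the example.

```python
# Hypothetical sketch of the channel-averaged Gabor scattering pipeline
# (illustrative, not the authors' gabor-scattering library).
import numpy as np
from scipy.signal import stft

def gabor_scattering(x, fs=44100, n1=500, o1=250, n2=50, o2=40, avg_len=5):
    # Layer 1: Gabor transform (STFT) of the signal, followed by the modulus.
    _, _, Z1 = stft(x, fs=fs, nperseg=n1, noverlap=o1)
    L1 = np.abs(Z1)
    # Output 1: low-pass filtering (here a simple moving average) along time.
    kernel = np.ones(avg_len) / avg_len
    O1 = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, L1)
    # Layer 2: another Gabor transform for each frequency channel, then modulus.
    L2 = np.abs(np.stack([stft(ch, nperseg=n2, noverlap=o2)[2] for ch in L1]))
    # Channel averaging (the Python implementation's L2avg step) collapses
    # Layer 2 to one spectrogram; time averaging then yields Output 2.
    L2avg = L2.mean(axis=0)
    O2 = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, L2avg)
    return L1, O1, O2

t = np.arange(44100) / 44100.0          # 1 s at 44.1 kHz
tone = np.sin(2 * np.pi * 800 * t)      # 800 Hz test tone
L1, O1, O2 = gabor_scattering(tone)
```

In a real pipeline, each of L1, O1, and O2 would additionally be resampled to a common matrix size to form Out A, Out B, and Out C.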
Here, we show how different frequency channels of Layer 1 affect Output 2. The first spectrogram of Figure 4 shows Output 2 with respect to the fundamental frequency of tone one, i.e., ξ = 800 Hz; therefore, the second tone is not visible in this output. On the other hand, if we take the fundamental frequency of the second tone, i.e., ξ = 1060 Hz, as the fixed frequency channel in Layer 1, the first tone is not visible in Output 2 (second spectrogram). If we consider a frequency that both tones share, i.e., ξ = 3200 Hz, both tones are present in Output 2 (third spectrogram). As GS focuses on one frequency channel in each layer element, the frequency information in this layer is lost; in other words, Layer 2 is invariant with respect to frequency.

3.2.2. Visualization of Different Envelopes within the GS Implementation

Here, Figure 5 shows two tones, played sequentially, having the same fundamental frequency ξ = 800 Hz and 15 harmonics, but different envelopes. The first tone has a sharp attack, sustains, and decays softly to zero; the second starts with a soft attack and has some amplitude modulation. Such an amplitude-modulated signal would, for example, correspond to $f(t) = \sum_{n=1}^N \sin(2\pi \cdot 20\, t)\, e^{2\pi i n \xi t}$; here, the signal is modulated by 20 Hz. The GS outputs of these signals are shown in Figure 5.

Figure 5. Layer 1 (i.e., GT), Output 1, and Output 2 of the signal having a sharp attack and afterwards some modulation.

Layer 1: In the spectrogram showing the GT, we see the difference between the envelopes, and we see that the signals have the same pitch and the same harmonics.

Output 1: The output of the first layer is invariant with respect to the envelope of the signals. This is due to the output-generating atom and the subsampling, which removes temporal information of the envelope.
In this output, no information about the envelope (neither the sharp attack nor the amplitude modulation) is visible; therefore, the spectrograms of the different signals look almost the same.

Output 2: For the second-layer output, we took as input a time vector at the fixed frequency of 800 Hz (i.e., frequency channel 38) of the first layer. Output 2 is invariant with respect to the pitch, but differences on larger scales are captured. Within this layer, we are able to distinguish the different envelopes of the signals: we first see the sharp attack of the first tone, and then the modulation with a second frequency becomes visible. The source code of the MATLAB implementation and further examples can be found in [16].

3.2.3. Visualization of How Frequency and Amplitude Modulations Influence the Outputs Using the Channel Averaged Implementation

To visualize the resampled transformation in a more structured way, we created an interactive plot (see Figure 6), which shows 25 different synthetic audio signals side by side, transformed into Out A, Out B, and Out C with chosen GS parameters. Each signal consists of one or more sine waves modulated in amplitude and frequency in 5 Hz steps.

Figure 6. Example output of the interactive plot. The parameters can be adjusted by sliders and the plot changes accordingly.

The parameters to be adjusted were the number of frequency channels in Layer 1, the number of frequency channels in Layer 2, the sampling rate, and the number of harmonics of the signal. The code for the interactive plot is available as part of the repository [16].

4. Experimental Results

In the numerical experiments, we compare GS to a GT representation, which is one of the standard time-frequency representations used in a preprocessing phase for training neural networks applied to audio data. We compare these two time-frequency representations with respect to performance on training datasets of limited size.
To convert the raw waveforms into the desired representations (GS and GT), we used the Gabor-scattering v0.0.4 library [17], our Python implementation of the GS transform based on the SciPy v1.2.1 [18–20] implementation of the STFT. To demonstrate the beneficial properties of GS, we first create synthetic data, for which we have the data generation under full control. In this case, we generate four classes of data that reflect the discriminating properties of GS. Second, we investigate whether the GS representation is beneficial when using a "real" dataset for training. For this purpose, we utilized the GoodSounds dataset [9].

4.1. Experiments with Synthetic Data

For the synthetic dataset, we created four classes containing 1 s long signals, sampled at 44.1 kHz with 16 bit precision. All signals consist of a fundamental sine wave and four harmonics. The whole process of generating sounds is controlled by fixed random seeds for reproducibility.

4.1.1. Data

We describe the sound generator model for one component of the final signal by the following equation,

$$f(t) = A \sin\big(2\pi \xi t + \mathrm{cw}_{fm}(t, A_{fm}, \xi_{fm}, \varphi_{fm}) + \varphi\big) \cdot \mathrm{cw}_{am}(t, A_{am}, \xi_{am}, \varphi_{am}), \qquad (21)$$

where $\mathrm{cw}_{fm}(t, A_{fm}, \xi_{fm}, \varphi_{fm}) = A_{fm} \sin(2\pi \xi_{fm} t + \varphi_{fm})$ is the frequency-modulation carrier wave and

$$\mathrm{cw}_{am}(t, A_{am}, \xi_{am}, \varphi_{am}) = \begin{cases} A_{am} \sin(2\pi \xi_{am} t + \varphi_{am}) & \text{if } A_{am} > 0 \text{ and } (\varphi_{am} > 0 \text{ or } \xi_{am} > 0), \\ 1 & \text{else} \end{cases}$$

is the amplitude-modulation carrier wave. Here, $A$ is the amplitude, $\xi$ denotes the frequency, and $\varphi$ denotes the phase. Furthermore, the amplitude, frequency, and phase of the frequency-modulation carrier wave are denoted by $A_{fm}$, $\xi_{fm}$, and $\varphi_{fm}$, respectively, and for the amplitude-modulation carrier wave we have $A_{am}$, $\xi_{am}$, and $\varphi_{am}$. To generate five component waves using the sound generator described above, we needed to decide upon the parameters of each component wave.
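The generator of Equation (21) can be sketched in a few lines of Python. This is a hypothetical illustration: the function names and all parameter values below are made up for the example and are not the dataset's actual settings.

```python
# Illustrative sketch of the sound generator in Equation (21); parameter
# values here are assumptions, not the actual dataset settings.
import numpy as np

def cw_fm(t, A_fm, xi_fm, phi_fm):
    # frequency-modulation carrier wave
    return A_fm * np.sin(2 * np.pi * xi_fm * t + phi_fm)

def cw_am(t, A_am, xi_am, phi_am):
    # amplitude-modulation carrier wave; the constant 1 switches modulation off
    if A_am > 0 and (phi_am > 0 or xi_am > 0):
        return A_am * np.sin(2 * np.pi * xi_am * t + phi_am)
    return np.ones_like(t)

def component(t, A, xi, phi, fm=(0, 0, 0), am=(0, 0, 0)):
    # one component of Eq. (21): a sine with optional FM, times the AM carrier
    return A * np.sin(2 * np.pi * xi * t + cw_fm(t, *fm) + phi) * cw_am(t, *am)

t = np.arange(44100) / 44100.0                  # 1 s at 44.1 kHz
# Class 3 example: both amplitude and frequency modulation switched on;
# fundamental at 440 Hz plus four harmonics with decaying amplitudes.
f = sum(component(t, 1.0 / n, n * 440.0, 0.0,
                  fm=(0.5, 5.0, 0.1), am=(1.0, 5.0, 0.1))
        for n in range(1, 6))
```

Setting `fm` and `am` to all-zero tuples disables the respective modulation, which corresponds to the class definitions in Table 1.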
We started by randomly generating the frequencies and phases of the signal and of the carrier waves for frequency and amplitude modulation from given intervals. These parameters describe the fundamental sine wave of the signal. Next, we create harmonics by taking multiples (from 2 to 5) of the fundamental frequency ξ, where the amplitude A of each successive harmonic is divided by a factor. Afterwards, by permuting two parameters, namely by turning the amplitude modulation and frequency modulation on and off, we defined four classes of sound. These classes are indexed starting from zero: the 0th class has neither amplitude nor frequency modulation, Class 1 is amplitude modulated only, Class 2 is frequency modulated only, and Class 3 is modulated in both amplitude and frequency, as seen in Table 1. At the end, we used those parameters to generate each harmonic separately and then summed them together to obtain the final audio file.

Table 1. Overview of classes.

             A_am = 0   A_am = 1
A_fm = 0     class 0    class 1
A_fm = 1     class 2    class 3

The following parameters were used to obtain GS: for Layer 1, i.e., the GT, n_fft = 500 (number of frequency channels), n_perseg = 500 (window length), and n_overlap = 250 (window overlap); for Layer 2, n_fft = 50, n_perseg = 50, and n_overlap = 40; the window_length of the time-averaging window for Output 2 was set to 5 with mode set to "same". The shapes of Output A, Output B, and Output C were all 240 × 160. Bilinear resampling [21] was used to adjust the shape if necessary. The same shape of all outputs allows stacking the matrices into shape 3 × 240 × 160, which is convenient for a CNN, because it can be treated as a 3-channel image. An illustration of the generated sounds from all four classes transformed into GT and GS can be seen in Section 1.2 and Figure 1. With the aforementioned parameters, the mean time necessary to compute the GS was 17.4890 ms, whereas the mean time necessary to compute the GT was 5.2245 ms, approximately 3 times less.
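The resampling-and-stacking step just described could be sketched as follows, using `scipy.ndimage.zoom` with `order=1` for bilinear resampling. The input array sizes below are arbitrary placeholders standing in for the raw output shapes; only the target shape 240 × 160 comes from the text.

```python
# Sketch of resampling the three GS outputs to a common 240x160 shape and
# stacking them into a 3-channel "image" (input sizes are made up).
import numpy as np
from scipy.ndimage import zoom

def to_shape(out, rows=240, cols=160):
    # bilinear (order=1) resampling to the target matrix size
    return zoom(out, (rows / out.shape[0], cols / out.shape[1]), order=1)

out_a = np.random.rand(251, 177)   # Layer 1 (GT), arbitrary raw size
out_b = np.random.rand(251, 177)   # Output 1
out_c = np.random.rand(26, 18)     # Output 2 (channel-averaged), arbitrary
stacked = np.stack([to_shape(o) for o in (out_a, out_b, out_c)])
```

The resulting `stacked` array has shape (3, 240, 160) and can be fed to the CNN as a 3-channel image.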
Note that this timing comparison is only indicative, because the time is highly dependent on the chosen parameters; hence, the final time depends on the specific settings.

4.1.2. Training

To compare the discriminating power of GS and GT, we generated 10,000 training samples (2500 from each class) and 20,000 validation samples (5000 from each class). As the task at hand is not as challenging as some real-world datasets, we assume these sizes to be sufficient for both time-frequency representations to converge to very good performance. To compare the performance of GS and GT on a limited set of training data, we created four scenarios in which the training set was limited to 400, 1000, 4000, and 10,000 samples. In all of these scenarios, the size of the validation set remained at its original 20,000 samples, and we split the training set into smaller batches, each containing 100 samples with the same number of samples from each class. Batches were used to calculate the model error, based on which the model weights were updated.

The CNN consisted of a batch normalization layer, which acted upon the input data separately for each channel of the image (we have three channels, namely Out A, Out B, and Out C), followed by four stacks of 2D convolution with average pooling. The first three convolutional layers were identical in the number of kernels, which was set to 16 of size 3 × 3 with stride 1 × 1. The last convolutional layer was also identical apart from using just 8 kernels. Each convolutional layer was initialized by Glorot uniform initialization [22] and followed by a ReLU nonlinearity [23] and an average pooling layer with a 2 × 2 pool size. After the last average pooling, the feature maps were flattened and fully connected to an output layer with 4 neurons and a softmax activation function [24]. For more details about the network's architecture, the reader should consult the repository [16].
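As a quick sanity check of the architecture just described, the feature-map sizes can be traced through the four conv/pool stacks. This assumes "same" convolution padding (which the text does not state explicitly), so that only the 2 × 2 average pooling changes the spatial size.

```python
# Back-of-the-envelope check of feature-map sizes in the synthetic-data CNN,
# assuming 'same' convolution padding (an assumption, not stated in the text).
def shape_after(rows, cols, n_stacks, kernels_last):
    for _ in range(n_stacks):
        rows, cols = rows // 2, cols // 2   # each 2x2 average pooling halves both axes
    return rows, cols, kernels_last

# Four conv/pool stacks on a 240x160 input, 8 kernels in the last conv layer:
r, c, k = shape_after(240, 160, 4, 8)
flat = r * c * k                            # flattened size feeding the dense layer
```

Under this assumption, the 240 × 160 input shrinks to 15 × 10 with 8 feature maps, i.e., 1200 values feed the final 4-neuron softmax layer.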
The repository also contains the exact code needed to reproduce the experiment. The network's categorical cross-entropy loss function was optimized using the Adam optimizer [25] with lr = 0.001, β₁ = 0.9, and β₂ = 0.999. To have a fair comparison, we limited each of the experiments in terms of computational effort, measured by the number of weight updates during the training phase. One weight update is made after each batch. Each experiment with synthetic data was limited to 2000 weight updates. To create the network, we used the Python 3.6 programming language with the Keras framework v2.2.4 [26] on the TensorFlow backend v1.12.0 [27]. To train the models, we used two GPUs, namely an NVIDIA Titan XP and an NVIDIA GeForce GTX 1080 Ti, on an Ubuntu 18.04 based system. The experiments are fully reproducible by running the code in the repository [16].

4.1.3. Results

The results are shown in Table 2, listing the accuracies of the model's best weight update on the training and validation sets. The best weight update was chosen based on performance on the validation set. More detailed tables of the results can be found in the aforementioned repository. In this experiment, we did not use a testing set, because of the synthetic nature of the data. Accuracy is computed as the fraction of correct predictions among all predictions. The most important observation is visible in Figure 7: in the earlier phases of training, GS reaches higher accuracies after fewer weight updates than GT. This effect diminishes with bigger training sets and vanishes completely in the case of 100 training batches. In the case of very limited data, i.e., with only 400 training samples, the results show that GS even outperformed GT. With more training samples, i.e., 1000 and 4000, the best performances of GT and GS are nearly the same.
In this case, we could hypothesize that the prior knowledge of the intrinsic properties of a time-series signal encoded by GS (in the invariances of Layer 1 and Layer 2) is no longer needed, and the network is able to learn the necessary transformation itself.

Table 2. Performance of the convolutional neural network (CNN) trained using GS and GT data.

TF   N Train   N Valid   BWU    Train    Valid
GS       400    20,000    280   1.0000   0.9874
GT       400    20,000    292   1.0000   0.9751
GS      1000    20,000    650   0.9990   0.9933
GT      1000    20,000   1640   1.0000   0.9942
GS      4000    20,000   1640   0.9995   0.9987
GT      4000    20,000   1720   0.9980   0.9943
GS    10,000    20,000   1800   0.9981   0.9968
GT    10,000    20,000   1800   0.9994   0.9985

Table notation: TF—Time-frequency representation. N Train and N Valid—Number of samples in training and validation sets. BWU—Weight update after which the highest performance was achieved on the validation set. Train and Valid—accuracy on training and validation sets.

Figure 7. CNN performance milestone reached over number of weight updates—Synthetic data. (Panels show validation accuracy versus weight updates on a log scale: 4 train batches (400 samples; best GT w.u. 292: 97.51%, best GS w.u. 280: 98.74%), 10 train batches (1000 samples; best GT w.u. 1640: 99.42%, best GS w.u. 650: 99.33%), 40 train batches (4000 samples; best GT w.u. 1720: 99.43%, best GS w.u. 1640: 99.87%), and 100 train batches (10,000 samples; best GT w.u. 1800: 99.85%, best GS w.u. 1800: 99.68%).) Figure notation: Valid acc—Accuracy performance metric measured on the validation set. Best w.u.—Weight update after which the highest performance was reached.

4.2. Experiments with GoodSounds Data

In the second set of experiments, we used the GoodSounds dataset [9]. It contains monophonic audio recordings of single tones or scales played by 12 different musical instruments.
The main purpose of this second set of experiments is to investigate whether GS shows superior performance to GT in a classification task using real-life data.

4.2.1. Data

To transform the data into the desired form for training, we removed the silent parts using the SoX v14.4.2 library [28,29]; next, we split all files into 1 s long segments sampled at a rate of 44.1 kHz with 16 bit precision. A Tukey window was applied to all segments to smooth the onset and offset of each, with the aim of preventing undesired artifacts after applying the STFT.

The dataset contains 28.55 h of recordings, which is a reasonable amount of audio data for training deep neural networks, considering the nature of this task. Unfortunately, the data are distributed unevenly across classes: half of the classes are extremely underrepresented, i.e., half of the classes together contain only 12.6% of all the data. To alleviate this problem, we decided upon an equalization strategy based on a variable stride. To avoid extensive equalization techniques, we discarded all classes that spanned less than 10% of the data. In total, we used six classes, namely clarinet, flute, trumpet, violin, sax alto, and cello. To equalize the number of segments between these classes, we introduced the aforementioned variable stride when creating the segments: the less data a particular class contains, the bigger the overlap between segments, and thus the more segments are generated, and vice versa. The whole process of generating segments is controlled by fixed random seeds for reproducibility. Detailed information about the available and used data, the stride settings for each class, the obtained number of segments, and their split can be seen in Table 3.

Table 3. Overview of available and used data.
                          All Available Data              Obtained Segments
          Class          Files   Dur      Ratio    Stride   Train    Valid   Test
Used      Clarinet        3358   369.70   21.58%   37,988   12,134   4000    4000
          Flute           2308   299.00   17.45%   27,412   11,796   4000    4000
          Trumpet         1883   228.76   13.35%   22,826   11,786   4000    4000
          Violin          1852   204.34   11.93%   19,836   11,707   4000    4000
          Sax alto        1436   201.20   11.74%   19,464   11,689   4000    4000
          Cello           2118   194.38   11.35%   15,983   11,551   4000    4000
Not used  Sax tenor        680    63.00    3.68%
          Sax soprano      668    50.56    2.95%
          Sax baritone     576    41.70    2.43%
          Piccolo          776    35.02    2.04%
          Oboe             494    19.06    1.11%
          Bass             159     6.53    0.38%
          Total          16,308  1713.23  100.00%           70,663   24,000  24,000

Table notation: Files—Number of available audio files. Dur—Duration of all recordings within one class in minutes. Ratio—Ratio of the duration to total duration of all recordings in the dataset. Stride—Step size (in samples) used to obtain segments of the same length. Train, Valid, Test—Number of segments used to train (excluding the leaking segments), validate, and test the model.

As seen from the table, the testing and validation sets were of the same size, comprising the same number of samples from each class. The remaining samples were used for training. To prevent leaking of information from the validation and testing sets into the training set, we excluded from training all segments originating from audio excerpts already used in the validation or testing set. More information can be found in the repository [16].

The following parameters were used to obtain GS: for Layer 1, i.e., the GT, n_fft = 2000 (number of frequency channels), n_perseg = 2000 (window length), and n_overlap = 1750 (window overlap); for Layer 2, n_fft = 25, n_perseg = 25, and n_overlap = 20; the window_length of the time-averaging window for Output 2 was set to 5 with mode set to "same". The shapes of Output A, Output B, and Output C were all 480 × 160. Bilinear resampling [21] was used to adjust the shape if necessary. The same shape of all outputs allows stacking the matrices into shape 3 × 480 × 160.
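The variable-stride segmentation and Tukey windowing described in this section can be sketched as follows. The function and the example strides are illustrative (the real strides are the per-class values in Table 3); the Tukey window comes from `scipy.signal.windows`.

```python
# Minimal sketch of variable-stride segmentation with Tukey windowing
# (illustrative; the example strides below are not the Table 3 values).
import numpy as np
from scipy.signal.windows import tukey

def segment(x, seg_len=44100, stride=22050):
    # The smaller the stride, the larger the overlap and the more segments
    # a recording yields; a Tukey window smooths each segment's onset/offset.
    win = tukey(seg_len)
    starts = range(0, len(x) - seg_len + 1, stride)
    return [x[s:s + seg_len] * win for s in starts]

x = np.random.randn(5 * 44100)           # a 5 s mock recording
few = segment(x, stride=44100)           # no overlap between segments
many = segment(x, stride=11025)          # 75% overlap -> more segments
```

Choosing a smaller stride for underrepresented classes therefore yields more (overlapping) 1 s segments, which is exactly the equalization effect described above.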
Illustrations of the sounds from all six classes of musical instruments transformed into GT and GS can be found in the repository [16].

4.2.2. Training

To make the experiments on synthetic data and on GoodSounds data comparable, we again used a CNN as classifier, trained in a similar way as described in Section 4.1.2. We also preprocessed the data so that the audio segments are of the same duration and sampling frequency. However, musical signals have a different distribution of frequency components than the synthetic data; therefore, we had to adjust the parameters of the time-frequency representations, which changed the input dimension to 3 × 480 × 160. These changes and the more challenging nature of the task led to slight modifications of the architecture in comparison to the experiment with synthetic data: the number of kernels in the first three convolutional layers was raised to 64, the number of kernels in the last convolutional layer was raised to 16, the output dimension was set to 6 (the number of classes), the batch size was changed to 128 samples per batch, and the number of weight updates was set to 11,000. To prevent unnecessary training, this set of experiments was set to terminate after 50 consecutive epochs without an improvement in validation loss, as measured by categorical cross-entropy. The loss function and optimization algorithm remained the same, as did the programming language, framework, and hardware. The experiments are fully reproducible by running the code in the repository [16]; consult this repository also for more details about the network's architecture. In this set of experiments, we trained 10 models in total: five scenarios with a limited training set (5, 11, 55, 110, and 550 batches, each containing 128 samples) for each time-frequency representation.
In all of these scenarios, the validation and testing sets remained at their full sizes, each consisting of 188 batches containing 24,000 samples.

4.2.3. Results

Table 4 shows the accuracies of the model's best weight update on the training, validation, and testing sets. The best weight update was chosen based on performance on the validation set. As before, more details can be found in the aforementioned repository. In this experiment using GoodSounds data, a similar trend as for the synthetic data is visible: GS performs better than GT if the training set size is limited, i.e., with 640 training samples, GS outperformed GT.

Table 4. Performance of CNN: GoodSounds data.

TF   N Train   N Valid   N Test    BWU     Train    Valid    Test
GS       640    24,000   24,000     485   0.9781   0.8685   0.8748
GT       640    24,000   24,000     485   0.9766   0.8595   0.8653
GS      1408    24,000   24,000    1001   0.9773   0.9166   0.9177
GT      1408    24,000   24,000    1727   0.9943   0.9194   0.9238
GS      7040    24,000   24,000    9735   0.9996   0.9846   0.9853
GT      7040    24,000   24,000    8525   0.9999   0.9840   0.9829
GS    14,080    24,000   24,000  10,780   0.9985   0.9900   0.9900
GT    14,080    24,000   24,000    9790   0.9981   0.9881   0.9883
GS    70,400    24,000   24,000  11,000   0.9963   0.9912   0.9932
GT    70,400    24,000   24,000    8800   0.9934   0.9895   0.9908

Table notation: TF—Time-frequency representation. N Train, N Valid, and N Test—Number of samples in training, validation, and testing sets. BWU—Weight update after which the highest performance was achieved on the validation set. Train, Valid, and Test—accuracy on training, validation, and testing sets.

In Figure 8, we again see that in the earlier phases of training, GS reaches higher accuracies after fewer weight updates than GT. This effect diminishes with bigger training sets and vanishes in the case of 550 training batches.
Figure 8. CNN performance milestone reached over number of weight updates—GoodSounds data. (Panels show testing accuracy versus weight updates on a log scale: 5 train batches (640 samples; best GT w.u. 485: 86.53%, best GS w.u. 485: 87.48%), 11 train batches (1408 samples; best GT w.u. 1727: 92.38%, best GS w.u. 1012: 91.81%), 55 batches (7040 samples; best GT w.u. 10,010: 98.42%, best GS w.u. 9735: 98.53%), 110 batches (14,080 samples; best GT w.u. 11,000: 98.95%, best GS w.u. 10,780: 99.00%), and 550 batches (70,400 samples; best GT w.u. 9900: 99.15%, best GS w.u. 11,000: 99.32%).) Figure notation: Test acc—Accuracy performance metric measured on the testing set. Best w.u.—Weight update after which the highest performance was reached.

5. Discussion and Future Work

In the current contribution, a scattering transform based on Gabor frames has been introduced, and its properties were investigated by relying on a simple signal model. Thereby, we were able to mathematically express the invariances introduced by GS within the first two layers. The hypothesis raised in Section 1.1, that explicit encoding of invariances by using an adequate feature extractor is beneficial when a restricted amount of data is available, was substantiated in the experiments presented in the previous section. It was shown that, in the case of a limited dataset, the application of a GS representation improves the performance in classification tasks in comparison to using the GT.

In the current implementation, and with the parameters described in Section 4.1.1, GS is approximately 3 times more expensive to compute than GT. However, this transformation needs to be done only once, in the preprocessing phase. Therefore, the majority of the computational effort is still spent during training; e.g., in the case of the GoodSounds experiment, training with GS takes 2.5 times longer than with GT.
Note that this is highly dependent on the data handling pipeline, network architecture, software framework, and hardware used, all of which can be optimized to alleviate this limitation. Although GS is more computationally expensive, the obtained improvement justifies its use in certain scenarios, in particular for classification tasks that can be expected to benefit from the invariances introduced by GS. In these cases, the numerical experiments have shown that, by using GS instead of GT, the negative effect of a limited dataset can be compensated for.

Hypothetically, with enough training samples, both GS and GT should perform equally well, assuming sufficient training, i.e., enough weight updates. This is shown in the results of both numerical experiments presented in this article (see Tables 2 and 4) and is justified by the fact that GS comprises exclusively the information contained within the GT, only separated into three different channels. We assume it is easier for the network to learn from such a separated representation. The evidence supporting this assumption is visible in the earlier phases of training, where GS reaches higher accuracies after fewer weight updates than GT (see Figures 7 and 8). This effect increases with smaller datasets, and with very limited data GS even surpasses GT in performance. This property can be utilized in restricted settings, e.g., in embedded systems with limited resources or in medical applications, where sufficiently large datasets are often too expensive or impossible to gather, whereas the highest possible performance is crucial. We believe that GT would eventually reach the same performance as GS, even on the smallest feasible datasets, but the network would need more trainable parameters, i.e., a more complex architecture, to do the additional work of finding the features that GS already provides. Unfortunately, in such a case, it remains problematic to battle the overfitting problem.
This opens a new question: whether the performance boost of GS would be amplified by lowering the number of trainable parameters of the CNN. This is out of the scope of this article and will be addressed in future work. In another paper [30], we extended GS to mel-scattering (MS), where we used GS in combination with a mel-filterbank. The MS representation reduces the dimensionality and is therefore computationally less expensive than GS. It remains to be said that the parameters for computing GS coefficients have to be carefully chosen to exploit the beneficial properties of GS by systematically capturing data-intrinsic invariances. Future work will consist of implementing GS on the GPU, to allow for fast parallel computation. At the same time, more involved signal models, in particular those concerning long-term correlations, will be studied analytically with the aim of achieving results in the spirit of the theoretical results presented in this paper.

Author Contributions: Equal contribution of authors.

Funding: This work was supported by the Uni:docs Fellowship Programme for Doctoral Candidates in Vienna, the Vienna Science and Technology Fund (WWTF) project SALSA (MA14-018), the International Mobility of Researchers (CZ.02.2.69/0.0/0.0/16 027/0008371), and the project LO1401. Infrastructure of the SIX Center was used for computation. Open Access Funding by the University of Vienna.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25; Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012; pp. 1097–1105.
2. Grill, T.; Schlüter, J. Music Boundary Detection Using Neural Networks on Combined Features and Two-Level Annotations.
In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR 2015), Malaga, Spain, 26–30 October 2015.
3. Mallat, S. Group Invariant Scattering. Comm. Pure Appl. Math. 2012, 65, 1331–1398.
4. Wiatowski, T.; Bölcskei, H. A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction. IEEE Trans. Inf. Theory 2017, 64, 1845–1866.
5. Wiatowski, T.; Bölcskei, H. Deep Convolutional Neural Networks Based on Semi-Discrete Frames. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Hong Kong, China, 14–19 June 2015; pp. 1212–1216.
6. Andén, J.; Mallat, S. Deep Scattering Spectrum. IEEE Trans. Signal Process. 2014, 62, 4114–4128.
7. Andén, J.; Lostanlen, V.; Mallat, S. Joint time-frequency scattering for audio classification. In Proceedings of the 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), Boston, MA, USA, 17–20 September 2015; pp. 1–6.
8. Grohs, P.; Wiatowski, T.; Bölcskei, H. Deep convolutional neural networks on cartoon functions. In Proceedings of the 2016 IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 1163–1167.
9. Romani Picas, O.; Parra Rodriguez, H.; Dabiri, D.; Tokuda, H.; Hariya, W.; Oishi, K.; Serra, X. A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138; Audio Engineering Society: New York, NY, USA, 2015.
10. Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv 2016, arXiv:1611.03530.
11. Kawaguchi, K.; Kaelbling, L.P.; Bengio, Y. Generalization in Deep Learning. arXiv 2017, arXiv:1710.05468.
12. Hofmann, T.; Schölkopf, B.; Smola, A. Kernel Methods in Machine Learning. Ann. Stat. 2008, 36, 1171–1220.
13. Mallat, S. Understanding deep convolutional networks. Philos. Trans. R. Soc. Lond. A Math. Phys. Eng. Sci.
2016, 374, doi:10.1098/rsta.2015.0203. 14. Wiatowski, T.; Tschannen, M.; Stanic, A.; Grohs, P.; Bölcskei, H. Discrete deep feature extraction: A theory and new architectures. In Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016. 15. Gröchenig, K. Foundations of Time-Frequency Analysis; Applied and Numerical Harmonic Analysis; Birkhäuser: Boston, MA, USA; Basel, Switzerland; Berlin, Germany, 2001. 16. Harar, P.; Bammer, R. gs-gt. Available online: https://gitlab.com/hararticles/gs-gt (accessed on 20 June 2019). 17. Harar, P. Gabor Scattering v0.0.4. Available online: https://gitlab.com/paloha/gabor-scattering (accessed on 20 June 2019). 18. Jones, E.; Oliphant, T.; Peterson, P. SciPy: Open Source Scientiﬁc Tools for Python. 2001. Available online: http://www.scipy.org/ (accessed on 1 Feburary 2019). 19. Oppenheim, A.V. Discrete-Time Signal Processing; Pearson Education India: Uttar Pradesh, India, 1999. 20. Grifﬁn, D.; Lim, J. Signal estimation from modiﬁed short-time Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 1984, 32, 236–243. 21. Kirkland, E.J. Bilinear interpolation. In Advanced Computing in Electron Microscopy; Springer: Berlin/Heidelberg, Germany, 2010; pp. 261–263. 22. Glorot, X.; Bengio, Y. Understanding the difﬁculty of training deep feedforward neural networks. Aistats 2010, 9, 249–256. 23. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectiﬁers: Surpassing Human-Level Performance on ImageNet Classiﬁcation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. 24. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. 25. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. 26. Chollet, F. Keras. Available online: https://keras.io (accessed on 19 August 2019). 27. 
Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, C.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. Available online: https://www.tensorﬂow.org (accessed on 1 February 2019 ) 28. Bagwell, C. SoX—Sound Exchange the Swiss Army Knife of Sound Processing. Available online: https: //launchpad.net/ubuntu/+source/sox/14.4.1-5 (accessed on 31 October 2018). 29. Navarrete, J. The SoX of Silence Tutorial. Available online: https://digitalcardboard.com/blog/2009/08/ 25/the-sox-of-silence (accessed on 31 October 2018). 30. Bammer, R.; Breger, A.; Dörﬂer, M.; Harar, P.; Smékal, Z. Machines listening to music: The role of signal representations in learning from music. arXiv 2019, arXiv:1903.08950. c 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).