August 2019 Meeting Report: Machine Learning/Artificial Intelligence and Audio - Audio Engineering Society

Following the Section’s Annual General Meeting (see AGM Report at: http://www.aesmelbourne.org.au/2019-agm-report/ ), Graeme Huon introduced our speaker Guillaume Potard who presented to us on Machine Learning/Artificial Intelligence and its relevance to the world of audio.

Guillaume started with a brief introduction to ML and AI, distinguishing artificial narrow intelligence (designed for a single narrow speciality, such as playing a specific game of skill like Go) from artificial general intelligence, as in the human brain.
He went on to contrast the estimated performance of the human brain, at approx. 1000 petaflops (10^18, or a million million million, floating point operations per second) operating at 200 Hz with a power consumption of just 20 watts, with the most powerful current supercomputers (e.g. Tianhe-2) at only 32 petaflops, operating in the GHz range with a power consumption of 25 megawatts.
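
The contrast is easier to appreciate as efficiency, that is, operations per second per watt. The following is a rough back-of-the-envelope check that uses only the figures quoted in the talk, which are estimates rather than measurements; the variable names are illustrative.

```python
# Figures as quoted in the talk; treat them as rough estimates.
PETA = 1e15

brain_flops = 1000 * PETA   # ~1000 petaflops
brain_watts = 20.0          # ~20 W

super_flops = 32 * PETA     # ~32 petaflops
super_watts = 25e6          # ~25 MW

# Efficiency in floating point operations per second per watt.
brain_eff = brain_flops / brain_watts
super_eff = super_flops / super_watts

print(f"brain:         {brain_eff:.2e} flops/W")
print(f"supercomputer: {super_eff:.2e} flops/W")
print(f"ratio:         {brain_eff / super_eff:.1e}")
```

On these numbers the brain comes out roughly 40 million times more efficient per watt.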

Guillaume explains Machine Learning/Artificial Intelligence for audio – photo Rod Staples

He then moved on to a brief history of AI/ML from the 1950s to the present day, and followed that with some typical applications like fraud detection, image recognition, spam detection, recommendation systems, medical diagnosis and speech recognition and synthesis.

Guillaume then covered the concepts and principles, starting with shallow learning via supervised learning, unsupervised learning, and reinforcement learning. He cited regression as an example, working through a linear regression, then explained gradient descent as a technique for iteratively converging on the best-fit parameters. He also described how high bias (underfitting) and overfitting can degrade the results.
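
To illustrate how gradient descent arrives at a linear regression fit, here is a minimal sketch in plain Python. It is not taken from the talk: the toy data, the learning rate, the step count and the function name `fit_line` are all assumptions chosen for illustration.

```python
# Toy data generated from y = 2x + 1 (hypothetical, noise-free for clarity).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors for the current parameters.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Take a small step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = fit_line(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

With this data the fitted slope and intercept should end up very close to 2 and 1. A learning rate that is too large makes the updates diverge, while one that is too small makes convergence slow.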

Moving on to audio-specific topics, he covered classification for audio, with feature engineering using features such as tonality, pitch, spectral flatness, and mel-frequency cepstral coefficients (MFCCs) derived from the power spectrum.
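
As an example of one such feature, spectral flatness is commonly defined as the geometric mean of the power spectrum divided by its arithmetic mean: values near 0 indicate a tonal signal and higher values indicate a noise-like one. The sketch below is an illustration rather than anything shown in the talk. It uses a naive DFT to stay dependency-free (a real pipeline would use an FFT library), and the frame length, test signals and `eps` floor are assumptions.

```python
import cmath
import math
import random


def power_spectrum(signal):
    """Naive DFT power spectrum for bins 1..N/2-1 (DC is skipped)."""
    n = len(signal)
    spectrum = []
    for k in range(1, n // 2):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                    for t, s in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def spectral_flatness(signal, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum."""
    # eps is a small floor (an assumption) that avoids log(0) on empty bins.
    power = [p + eps for p in power_spectrum(signal)]
    geometric = math.exp(sum(math.log(p) for p in power) / len(power))
    arithmetic = sum(power) / len(power)
    return geometric / arithmetic


n = 256
# Hypothetical test frames: a pure tone and seeded white noise.
tone = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]

tone_flatness = spectral_flatness(tone)
noise_flatness = spectral_flatness(noise)
print(f"tone:  {tone_flatness:.4f}")
print(f"noise: {noise_flatness:.4f}")
```

The tone should score close to zero and the noise markedly higher, which is what makes the feature useful for separating tonal from noisy material.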

He then described some classification algorithms – specifically k nearest neighbours (kNN), decision trees & random forests, and logistic regression.
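
Of these, kNN is the simplest to sketch: a new sample takes the majority label of its k closest training samples. The example below is a minimal illustration, not code from the talk. The two-dimensional feature vectors, the labels and the function name `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # train is a list of (feature_vector, label) pairs.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D features, e.g. (spectral flatness, normalised pitch).
training_set = [
    ((0.05, 0.80), "tonal"),
    ((0.10, 0.75), "tonal"),
    ((0.08, 0.90), "tonal"),
    ((0.60, 0.10), "noisy"),
    ((0.55, 0.20), "noisy"),
    ((0.70, 0.15), "noisy"),
]

print(knn_predict(training_set, (0.07, 0.85)))
print(knn_predict(training_set, (0.65, 0.12)))
```

The first query sits among the "tonal" points and the second among the "noisy" ones, so they are labelled accordingly. In practice features are usually scaled first so that no single dimension dominates the distance.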

He then described methods for testing the performance of an algorithm, using k-fold cross-validation and checks for overfitting and underfitting.
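
In k-fold validation the data is split into k parts. Each part is held out once for testing while the model is trained on the rest, and the scores are then averaged. The sketch below shows the idea in plain Python under stated assumptions: the toy one-dimensional data, the helper names and the stand-in 1-nearest-neighbour model are hypothetical and not from the talk.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def nearest_label(train_points, x):
    """1-nearest-neighbour on a single feature (stand-in for any model)."""
    return min(train_points, key=lambda p: abs(p[0] - x))[1]


# Hypothetical 1-D data: low values are class 0, high values are class 1.
data = [(v / 10, 0) for v in range(0, 10)] + [(v / 10, 1) for v in range(20, 30)]

scores = []
for train_idx, test_idx in k_fold_splits(len(data), k=5):
    train_points = [data[i] for i in train_idx]
    correct = sum(nearest_label(train_points, data[i][0]) == data[i][1]
                  for i in test_idx)
    scores.append(correct / len(test_idx))

mean_accuracy = sum(scores) / len(scores)
print(f"fold accuracies: {scores}")
print(f"mean accuracy:   {mean_accuracy:.2f}")
```

The toy classes are deliberately well separated, so every fold scores perfectly. On real data, a large gap between training accuracy and the held-out fold accuracy is a typical sign of overfitting.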

Guillaume then described examples of unsupervised learning and reinforcement learning, moving on to describe the use of reinforcement learning in gameplay by AI agents, with examples from Atari 2600 games to Google’s AlphaGo player.

He then moved on to Deep Learning, covering time-delay neural networks, recurrent neural networks, and long short-term memory (LSTM) units.

He then went on to describe the limits of Machine Learning with neural nets, namely: the model is only as good as the data, it can be slow to train (needing very large datasets), is opaque (hard to debug), is not magical (cannot learn noise), and finally it is not a cure-all solution to every problem.

He then covered machine learning and audio – describing the challenges in speech recognition, as well as the applications beyond simple speech recognition, such as auditory scene analysis, fault analysis, virtual sensing, robotics and autonomous vehicles.

He then displayed and demonstrated a SparkFun speech recognition board (https://www.sparkfun.com/products/15170), which uses Google’s TensorFlow ML library for speech recognition.

The SparkFun speech recognition board – photo Rod Staples (hand model Rod Staples)

He went on to describe DSP applications adapted to machine learning such as de-noising using classification, and sound source separation with Convolutional Neural Networks (CNNs).

Moving on from audio recognition applications, he then described audio synthesis using tools such as Google’s cloud-based WaveNet text-to-speech, on-line voice cloning with other services like lyrebird.ai, as well as using GANs (Generative Adversarial Networks).

He then described using Unsupervised Learning for music retrieval and classification, and other music library management tasks.

He went on to describe his latest project, Audio Cortex, which combines audiology-inspired feature engineering with deep learning. He suggested that it could offer Super Hearing, not limited to two ears and extending into the ultrasonic range, allowing, in theory, all conversations in a crowd to be transcribed simultaneously.

Guillaume then described some tools and hardware available for ML/AI, starting with Python libraries such as scikit-learn, TensorFlow, Keras, and Torch, then moving on to MATLAB and Weka.
He then described the on-line resources available like Google’s Colab or Amazon/Google cloud and compared them with using a local Graphics Processing Unit (GPU) accelerated rig (desktop PC + NVIDIA graphics cards). Finally he highlighted Google’s recently released Tensor Processing Unit (TPU), and its benefits over GPU-based accelerators.

He then gave us a glimpse into the near future – like the increasing use of photonics to improve performance (200THz, with no heat and low power), as used in Google’s TPUs, and ultimately the promise of Quantum Computing.

Concluding, he noted that the AI genie is out of the bottle, promising exciting times for ML/AI, and suggested that audio engineering can benefit a lot from AI and ML.

A lively Q&A session followed his talk, with many attendees showing a keen interest in the topic and a willingness to dig deeply into the subject, taking advantage of Guillaume’s knowledge and experience.

We thank Guillaume for educating us on this important new field.

The video below shows the slides with audio.

The video can also be viewed directly on YouTube at:
https://youtu.be/SqT4lvB8nEw

The audio-only recording can be heard or downloaded here

A PDF copy of Guillaume’s slide deck can be viewed or downloaded here

References:
Atari Games:
https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
https://deepmind.com/research/publications/mastering-game-go-deep-neural-networks-tree-search
Free book:
https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf

History:
https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf
Audio Synthesis:
https://deepmind.com/research/publications/efficient-neural-audio-synthesis
Speech Synthesis:
https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
Photonics Processor:
https://medium.com/lightmatter/the-story-behind-lightmatters-tech-e9fa0facca30

Learning Resources:
Great Introduction:
https://medium.com/machine-learning-for-humans
Large Free Audio Dataset:
https://research.google.com/audioset/
Coursera courses:
Machine Learning (Andrew Ng), TensorFlow specialisation, Mathematics for Machine Learning: Linear Algebra + many others
Deep Learning Book – Ian Goodfellow et al.
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow – Aurélien Géron

We thank Graham Haynes, and his trusty Tascam for the audio recording.

We especially thank the SAE Institute for the use of their excellent facilities for our meetings.