MareArts Computer Vision Study: A brief explanation of "Audio → Spectrogram → Mel-spectrogram → MFCC"

10/30/2024

Audio → Spectrogram → Mel-spectrogram → MFCC

Spectrogram = raw photo

Mel-spectrogram = photo adjusted for human vision

MFCC = compressed, essential features extracted from that photo

Audio → Spectrogram
- Start with raw audio waveform
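- Apply pre-emphasis to boost higher frequencies: y[n] = x[n] - a * x[n-1], with a typically 0.95-0.97
- Frame the signal into short overlapping segments (typically 20-40 ms long, e.g. 25 ms frames taken every 10 ms)
- Apply a window function (usually Hamming) to each frame to reduce spectral leakage from the frame edges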
- Perform FFT on each frame
- Calculate power spectrum (|FFT|²)
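The steps above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not a reference implementation: the function name power_spectrogram, the 25 ms frame with a 10 ms hop, the 0.97 pre-emphasis coefficient, and the 512-point FFT are illustrative choices that do not come from the original post.

```python
import numpy as np

def power_spectrogram(signal, sample_rate, frame_ms=25.0, hop_ms=10.0,
                      pre_emphasis=0.97, n_fft=512):
    """Return a (num_frames, n_fft // 2 + 1) power spectrogram.

    All default parameter values are assumptions for illustration.
    """
    signal = np.asarray(signal, dtype=np.float64)

    # Pre-emphasis: y[n] = x[n] - a * x[n-1] boosts higher frequencies.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(emphasized) < frame_len:
        # Zero-pad very short signals up to one full frame.
        emphasized = np.pad(emphasized, (0, frame_len - len(emphasized)))
    num_frames = 1 + (len(emphasized) - frame_len) // hop_len

    # Index matrix: each row selects the samples of one overlapping frame.
    idx = (np.arange(frame_len)[None, :]
           + hop_len * np.arange(num_frames)[:, None])

    # A Hamming window tapers each frame toward zero at its edges.
    frames = emphasized[idx] * np.hamming(frame_len)

    # FFT of each frame (zero-padded to n_fft), then power = |FFT|^2.
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    return np.abs(spectrum) ** 2


# Usage: one second of a 1 kHz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
spec = power_spectrogram(tone, sr)
print(spec.shape)  # (98, 257)
```

Each row of the result is one frame and each column one FFT bin, so the 1 kHz tone shows up as a ridge near bin 32 (1000 Hz divided by the 31.25 Hz bin spacing). Note that n_fft should be at least the frame length, otherwise np.fft.rfft truncates the frame.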
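
Spectrogram → Mel-spectrogram
- Convert the frequency range to the mel scale using the formula: mel = 2595 * log10(1 + f/700), with f in Hz
- Create mel filter banks: triangular, overlapping filters whose center frequencies are evenly spaced on the mel scale (commonly 26-40 filters)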
- Apply mel filter banks to power spectrum
- Sum up the energy in each mel band
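A rough sketch of the mel filter bank step in Python with NumPy follows. The helper names (hz_to_mel, mel_to_hz, mel_filterbank), the choice of 26 filters, and the unnormalized triangle heights are assumptions for illustration; real libraries differ in details such as filter normalization. The mel formula is the one given in the list.

```python
import numpy as np

def hz_to_mel(f):
    # mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=np.float64) / 700.0)

def mel_to_hz(m):
    # Inverse of the formula above.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=np.float64) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate, f_min=0.0, f_max=None):
    """Return an (n_mels, n_fft // 2 + 1) matrix of triangular filters."""
    if f_max is None:
        f_max = sample_rate / 2.0

    # n_mels + 2 edge points, evenly spaced in mel, mapped back to Hz.
    mel_points = np.linspace(float(hz_to_mel(f_min)),
                             float(hz_to_mel(f_max)), n_mels + 2)
    hz_points = mel_to_hz(mel_points)

    # Center frequency of every FFT bin.
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)

    fbank = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        # Triangle: rises from left to center, falls to right, zero elsewhere.
        fbank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank


# Usage: apply the filters to a placeholder power spectrogram (3 frames).
fbank = mel_filterbank(n_mels=26, n_fft=512, sample_rate=16000)
power_spec = np.ones((3, 257))   # stand-in for real |FFT|^2 frames
mel_spec = power_spec @ fbank.T  # weighted sum of energy per mel band
print(fbank.shape, mel_spec.shape)  # (26, 257) (3, 26)
```

The matrix product does the "apply" and "sum" steps at once: each mel band value is the filter-weighted sum of the power spectrum bins it covers.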
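
Mel-spectrogram → MFCC
- Take the logarithm of the mel filter bank energies (human loudness perception is roughly logarithmic)
- Apply the Discrete Cosine Transform (DCT) to decorrelate the log energies
- Keep the first N coefficients (typically 12-13; the often-quoted 39 dimensions are 13 coefficients plus their delta and delta-delta features)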
- Optionally:
  - Calculate delta (velocity) features
  - Calculate delta-delta (acceleration) features
  - Apply cepstral mean normalization (CMN)
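
The final stage can be sketched the same way. This is an illustrative sketch in Python with NumPy, assuming mel filter bank energies are already available as a (frames, bands) array; the function names (dct_ii, deltas, mfcc_from_mel), the orthonormal DCT-II scaling, the small eps added before the logarithm, and the delta window of 2 frames are all assumptions, not details from the original post.

```python
import numpy as np

def dct_ii(x, n_out):
    """Orthonormal DCT-II along the last axis, keeping n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    scale = np.full((n_out, 1), np.sqrt(2.0 / n))
    scale[0, 0] = np.sqrt(1.0 / n)
    return x @ (basis * scale).T

def deltas(features, width=2):
    """Regression-style delta over +/- width frames (edge frames repeated)."""
    num_frames = len(features)
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    numerator = sum(
        n * (padded[width + n: width + n + num_frames]
             - padded[width - n: width - n + num_frames])
        for n in range(1, width + 1)
    )
    return numerator / (2.0 * sum(n * n for n in range(1, width + 1)))

def mfcc_from_mel(mel_energies, n_coeffs=13, eps=1e-10):
    """Return (frames, 3 * n_coeffs): static, delta, and delta-delta."""
    # eps avoids log(0) for silent bands (an assumed safeguard).
    log_mel = np.log(mel_energies + eps)
    coeffs = dct_ii(log_mel, n_coeffs)
    # Cepstral mean normalization: subtract each coefficient's mean over time.
    coeffs = coeffs - coeffs.mean(axis=0, keepdims=True)
    d1 = deltas(coeffs)   # velocity
    d2 = deltas(d1)       # acceleration
    return np.hstack([coeffs, d1, d2])


# Usage with synthetic mel energies: 50 frames, 26 mel bands.
rng = np.random.default_rng(0)
mel = rng.random((50, 26)) + 0.1
feats = mfcc_from_mel(mel)
print(feats.shape)  # (50, 39)
```

With 13 static coefficients, stacking the delta and delta-delta features gives the familiar 39-dimensional vector per frame.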