Audio → Spectrogram → Mel-spectrogram → MFCC
- Spectrogram
  - Raw time-frequency representation on a linear frequency (Hz) axis
  - Shows energy at each frequency over time
  - Doesn't account for human perception of pitch
- Mel-spectrogram
  - Spectrogram mapped onto the mel scale
  - Mimics human frequency perception (finer resolution at low frequencies, coarser at high frequencies)
  - Still keeps the energy of every mel band, though with fewer bands than the linear-frequency spectrogram
- MFCC
  - Derived FROM the mel-spectrogram
  - Additional steps: take the log of the mel energies, then apply the DCT (Discrete Cosine Transform)
  - Keeps only the lower coefficients (dimensionality reduction)
  - Decorrelates the features
.
- Audio → Spectrogram
  - Start with the raw audio waveform
  - Apply pre-emphasis to boost higher frequencies: y[n] = x[n] - a * x[n-1], with a typically around 0.95-0.97
  - Frame the signal into short segments (typically 20-40 ms, with overlap between consecutive frames)
  - Apply a window function (usually Hamming) to each frame to reduce edge effects (spectral leakage)
  - Perform an FFT on each frame
  - Calculate the power spectrum (|FFT|²)
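
The Audio → Spectrogram steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the Python standard library with a naive O(N²) DFT in place of a real FFT, and the function names, the 25 ms frame, the 10 ms hop, and the 0.97 pre-emphasis coefficient are choices made for this example, not part of the original notes. In practice a numerical library's FFT would normally be used.

```python
import cmath
import math


def pre_emphasis(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; boosts the higher frequencies."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def frame_signal(signal, frame_len, hop_len):
    """Split into overlapping frames; a trailing partial frame is dropped."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop_len)]


def hamming(n_points):
    """Hamming window coefficients (assumes n_points > 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def power_spectrum(frame):
    """|DFT|^2 for bins 0..N/2, computed with a naive DFT (stand-in for an FFT)."""
    n_points = len(frame)
    spectrum = []
    for k in range(n_points // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * n / n_points)
                    for n, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def power_spectrogram(signal, sample_rate, frame_ms=25, hop_ms=10, alpha=0.97):
    """Pre-emphasis -> framing -> Hamming window -> power spectrum per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = hamming(frame_len)
    frames = frame_signal(pre_emphasis(signal, alpha), frame_len, hop_len)
    return [power_spectrum([x * w for x, w in zip(frame, window)])
            for frame in frames]


# Example: 0.1 s of a 1 kHz tone sampled at 8 kHz (illustrative values).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(800)]
spec = power_spectrogram(tone, sample_rate)  # list of frames, each a list of bin powers
```

With these example settings each frame has 200 samples, so the bins are 40 Hz apart and the 1 kHz tone should peak in bin 25.
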
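- Spectrogram → Mel-spectrogram
  - Convert frequencies to the mel scale using the formula: mel = 2595 * log10(1 + f/700), with f in Hz
  - Create mel filter banks: triangular, overlapping filters whose centers are evenly spaced on the mel scale
  - Apply the mel filter banks to the power spectrum of each frame
  - Sum the weighted energy in each mel band to get one value per band per frame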
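
As a rough sketch of the Spectrogram → Mel-spectrogram steps above, the code below builds triangular filters from the mel formula and applies them to one power-spectrum frame. The function names and the example settings (10 filters, a 200-point transform, 8 kHz sampling) are assumptions made for illustration. The triangles here are evaluated directly at each bin's frequency in Hz, which is one of several common ways to construct them.

```python
import math


def hz_to_mel(f_hz):
    """mel = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(n_filters, n_fft, sample_rate, f_min=0.0, f_max=None):
    """Triangular filters with edges evenly spaced on the mel scale.

    Returns n_filters rows, each holding n_fft // 2 + 1 weights in [0, 1].
    """
    f_max = sample_rate / 2 if f_max is None else f_max
    mel_lo, mel_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    # n_filters triangles need n_filters + 2 edge points (left, center, right).
    edges_hz = [mel_to_hz(mel_lo + (mel_hi - mel_lo) * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_filters + 1):
        left, center, right = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - left) / (center - left)
            falling = (right - f) / (right - center)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def apply_filter_bank(power_frame, bank):
    """Weighted sum of the power spectrum inside each mel band."""
    return [sum(w * p for w, p in zip(row, power_frame)) for row in bank]


# Example: a flat power spectrum pushed through 10 mel bands (illustrative values).
bank = mel_filter_bank(n_filters=10, n_fft=200, sample_rate=8000)
mel_energies = apply_filter_bank([1.0] * 101, bank)
```

Because the filters get wider toward high frequencies, a flat input spectrum produces larger sums in the upper bands unless the filters are area-normalized, which this sketch does not do.
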
- Mel-spectrogram → MFCC
  - Take the logarithm of the mel filter bank energies (compresses the dynamic range, roughly matching how humans perceive loudness)
  - Apply the Discrete Cosine Transform (DCT)
  - Keep the first N coefficients (typically 12-13; the common 39-dimensional vector is 13 coefficients plus their deltas and delta-deltas)
  - Optionally:
    - Calculate delta (velocity) features
    - Calculate delta-delta (acceleration) features
    - Apply cepstral mean normalization (CMN)
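
The Mel-spectrogram → MFCC steps above might look like the following minimal sketch. Several choices here are assumptions for illustration: the function names, the small floor applied before the logarithm, the unnormalized DCT-II, and the simple central-difference delta (many toolkits use a regression over a wider window instead).

```python
import math


def log_energies(mel_frame, floor=1e-10):
    """Natural log of the mel energies; the floor avoids log(0)."""
    return [math.log(max(e, floor)) for e in mel_frame]


def dct_ii(values, n_keep):
    """First n_keep coefficients of the unnormalized DCT-II:
    c[k] = sum_m x[m] * cos(pi * k * (m + 0.5) / M)."""
    n = len(values)
    return [sum(x * math.cos(math.pi * k * (m + 0.5) / n)
                for m, x in enumerate(values))
            for k in range(n_keep)]


def mfcc(mel_frames, n_coeffs=13):
    """Log -> DCT -> keep the first n_coeffs coefficients, per frame."""
    return [dct_ii(log_energies(frame), n_coeffs) for frame in mel_frames]


def delta(frames):
    """Central difference over time; edge frames reuse the first/last frame."""
    last = len(frames) - 1
    return [[(frames[min(t + 1, last)][i] - frames[max(t - 1, 0)][i]) / 2.0
             for i in range(len(frames[t]))]
            for t in range(len(frames))]


def cepstral_mean_normalize(frames):
    """Subtract each coefficient's mean over all frames (CMN)."""
    n = len(frames)
    means = [sum(f[i] for f in frames) / n for i in range(len(frames[0]))]
    return [[v - mu for v, mu in zip(f, means)] for f in frames]


# Example: 5 frames of 20 made-up mel energies (illustrative data only).
mel_frames = [[1.0 + 0.1 * t * b for b in range(20)] for t in range(5)]
static = cepstral_mean_normalize(mfcc(mel_frames))
d1 = delta(static)   # velocity
d2 = delta(d1)       # acceleration
features = [s + a + b for s, a, b in zip(static, d1, d2)]  # 13 + 13 + 13 values per frame
```

Stacking the 13 static coefficients with their deltas and delta-deltas is what yields the 39 values per frame.
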