10/30/2024

A brief explanation of "Audio → Spectrogram → Mel-spectrogram → MFCC"

 Audio → Spectrogram → Mel-spectrogram → MFCC

  • Spectrogram = raw photo
  • Mel-spectrogram = photo adjusted for human vision
  • MFCC = compressed, essential features extracted from that photo
    1. Spectrogram
    • Raw time-frequency representation
    • Shows energy at each frequency over time
    • Doesn't account for human perception
    2. Mel-spectrogram
    • Spectrogram mapped onto the mel scale
    • Mimics human frequency perception (finer resolution at low frequencies, coarser at high frequencies)
    • Pools FFT bins into mel bands, but still keeps the energy of every band
    3. MFCC
    • Derived FROM the mel-spectrogram
    • Additional steps: take the log of the mel energies, then apply the DCT (Discrete Cosine Transform)
    • Keeps only the lower coefficients (dimensionality reduction)
    • Decorrelates the features
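To make the "keeps only lower coefficients" point concrete, here is a minimal NumPy sketch. It assumes an orthonormal DCT-II (the variant commonly used for MFCCs), an illustrative 40 mel bands and 13 coefficients, and a synthetic log-mel frame; none of these values or names come from the post itself.

```python
import numpy as np

def dct2_basis(n_coeffs, n_bands):
    """Orthonormal DCT-II basis, shape (n_coeffs, n_bands)."""
    k = np.arange(n_coeffs)[:, None]   # coefficient index
    m = np.arange(n_bands)[None, :]    # mel band index
    basis = np.sqrt(2.0 / n_bands) * np.cos(np.pi * k * (2 * m + 1) / (2 * n_bands))
    basis[0] /= np.sqrt(2.0)           # extra scaling for the constant (DC) row
    return basis

n_mels, n_mfcc = 40, 13                # illustrative sizes (assumption)

# Synthetic log-mel frame: a smooth envelope plus small band-to-band ripple.
rng = np.random.default_rng(0)
band = np.arange(n_mels)
log_mel = np.cos(2 * np.pi * band / n_mels) + 0.05 * rng.standard_normal(n_mels)

full = dct2_basis(n_mels, n_mels)
mfcc = full[:n_mfcc] @ log_mel         # keep only the lower coefficients
smoothed = full[:n_mfcc].T @ mfcc      # 40 bands rebuilt from 13 numbers
exact = full.T @ (full @ log_mel)      # using all 40 coefficients is lossless

print(mfcc.shape, smoothed.shape)
print(np.allclose(exact, log_mel))
```

Rebuilding the frame from the first 13 coefficients gives a smoothed version of the spectral envelope: the fine ripple lives in the higher coefficients that were dropped.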


    1. Audio → Spectrogram
      • Start with raw audio waveform
      • Apply pre-emphasis to boost higher frequencies
      • Frame the signal into short segments (typically 20-40 ms, with overlapping frames)
      • Apply window function (usually Hamming) to reduce edge effects
      • Perform FFT on each frame
      • Calculate power spectrum (|FFT|²)
    2. Spectrogram → Mel-spectrogram
      • Create mel filter banks: triangular, overlapping filters whose center frequencies are evenly spaced on the mel scale
      • Convert between Hz and mel using the formula: mel = 2595 * log10(1 + f/700)
      • Apply mel filter banks to power spectrum
      • Sum up the energy in each mel band
    3. Mel-spectrogram → MFCC
      • Take logarithm of mel filter bank energies (to match human perception)
      • Apply Discrete Cosine Transform (DCT)
      • Keep first N coefficients (typically 12-13; the often-quoted 39 is 13 coefficients plus their delta and delta-delta features)
      • Optionally:
        • Calculate delta (velocity) features
        • Calculate delta-delta (acceleration) features
        • Apply cepstral mean normalization (CMN)
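The three stages above can be sketched end to end with NumPy alone. This is a minimal illustration, not a reference implementation: the function names, the 25 ms / 10 ms framing, the 0.97 pre-emphasis coefficient, the 512-point FFT, the 26 mel bands, and the division of the power spectrum by the FFT size are all assumptions chosen for the example, and the delta features are left out.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def power_spectrogram(signal, sr, frame_ms=25.0, hop_ms=10.0, n_fft=512, pre_emph=0.97):
    # Pre-emphasis: y[t] = x[t] - a * x[t-1]
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    frame_len = int(round(sr * frame_ms / 1000.0))
    hop = int(round(sr * hop_ms / 1000.0))
    # Assumes the signal is at least one frame long.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)   # window each frame
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)    # zero-pads to n_fft
    return np.abs(spectrum) ** 2 / n_fft               # (n_frames, n_fft // 2 + 1)

def mel_filterbank(sr, n_fft, n_mels=26):
    # n_mels + 2 edge points, evenly spaced on the mel scale from 0 Hz to Nyquist.
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fbank = np.zeros((n_mels, len(bin_freqs)))
    for i in range(n_mels):
        left, center, right = hz_points[i:i + 3]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        fbank[i] = np.maximum(0.0, np.minimum(rising, falling))  # triangle
    return fbank

def mfcc(signal, sr, n_mels=26, n_mfcc=13, n_fft=512, cmn=True):
    power = power_spectrogram(signal, sr, n_fft=n_fft)
    mel_energies = power @ mel_filterbank(sr, n_fft, n_mels).T   # sum energy per band
    log_mel = np.log(mel_energies + 1e-10)                       # floor avoids log(0)
    # Orthonormal DCT-II over the mel bands, keeping the first n_mfcc rows.
    k = np.arange(n_mfcc)[:, None]
    m = np.arange(n_mels)[None, :]
    basis = np.sqrt(2.0 / n_mels) * np.cos(np.pi * k * (2 * m + 1) / (2 * n_mels))
    basis[0] /= np.sqrt(2.0)
    coeffs = log_mel @ basis.T                                   # (n_frames, n_mfcc)
    if cmn:
        coeffs = coeffs - coeffs.mean(axis=0)                    # cepstral mean normalization
    return coeffs

# Usage on a synthetic 1-second signal: a 440 Hz tone plus a little noise.
sr = 16000
t = np.arange(sr) / sr
rng = np.random.default_rng(0)
audio = np.sin(2 * np.pi * 440.0 * t) + 0.01 * rng.standard_normal(sr)
feats = mfcc(audio, sr)
print(feats.shape)
```

Libraries differ in DCT scaling, log base, and filter normalization, so the values from this sketch will not match a given toolkit number for number even though the steps are the same.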
