10/30/2024

Brief explanation of "Audio → Spectrogram → Mel-spectrogram → MFCC"

 Audio → Spectrogram → Mel-spectrogram → MFCC

  • Spectrogram = raw photo
  • Mel-spectrogram = photo adjusted for human vision
  • MFCC = compressed, essential features extracted from that photo
    1. Spectrogram
    • Raw time-frequency representation
    • Shows energy at each frequency over time
    • Doesn't account for human perception
    2. Mel-spectrogram
    • Spectrogram mapped to mel scale
    • Mimics human frequency perception
    • Keeps every mel band (no coefficients are discarded yet)
    3. MFCC
    • Derived FROM the mel-spectrogram
    • Additional step: DCT (Discrete Cosine Transform) is applied
    • Keeps only lower coefficients (dimensionality reduction)
    • Decorrelates features

    .

    1. Audio → Spectrogram
      • Start with raw audio waveform
      • Apply pre-emphasis to boost higher frequencies
      • Frame the signal into short segments (typically 20-40ms with overlap)
      • Apply window function (usually Hamming) to reduce edge effects
      • Perform FFT on each frame
      • Calculate power spectrum (|FFT|²)
    2. Spectrogram → Mel-spectrogram
      • Create mel filter banks (triangular overlapping windows)
      • Convert frequencies to mel scale using formula: mel = 2595 * log10(1 + f/700)
      • Apply mel filter banks to power spectrum
      • Sum up the energy in each mel band
    3. Mel-spectrogram → MFCC
      • Take logarithm of mel filter bank energies (to match human perception)
      • Apply Discrete Cosine Transform (DCT)
      • Keep first N coefficients (typically 12-13; 39 features once delta and delta-delta are appended)
      • Optionally:
        • Calculate delta (velocity) features
        • Calculate delta-delta (acceleration) features
        • Apply cepstral mean normalization (CMN)
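The three stages above can be sketched with NumPy alone. This is a minimal illustration, not a reference implementation: the frame length (25 ms), hop (10 ms), FFT size (512), 26 mel bands, 13 coefficients, the pre-emphasis factor 0.97, and all function names are assumptions chosen for the example. The DCT is an unnormalized DCT-II, and the delta here is a simple `np.gradient` rather than the regression formula some toolkits use.

```python
import numpy as np

def hz_to_mel(f):
    # Formula from the notes: mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def power_spectrogram(signal, sr, frame_ms=25, hop_ms=10, n_fft=512, pre_emph=0.97):
    # Pre-emphasis boosts higher frequencies
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    # Index matrix that slices the signal into overlapping frames
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    return np.abs(np.fft.rfft(frames, n_fft)) ** 2  # (n_frames, n_fft // 2 + 1)

def mel_filterbank(sr, n_fft, n_mels=26):
    # Triangular filters whose centers are evenly spaced on the mel scale
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank

def dct_ii(x, n_keep):
    # Unnormalized DCT-II along the last axis, keeping the first n_keep coefficients
    n = x.shape[-1]
    k = np.arange(n_keep)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return x @ basis.T

def mfcc(signal, sr, n_mels=26, n_mfcc=13, n_fft=512):
    power = power_spectrogram(signal, sr, n_fft=n_fft)         # Audio -> spectrogram
    mel_energy = power @ mel_filterbank(sr, n_fft, n_mels).T   # -> mel-spectrogram
    log_mel = np.log(mel_energy + 1e-10)                       # log compression
    return dct_ii(log_mel, n_mfcc)                             # -> MFCC

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)            # one second of a 440 Hz tone
coeffs = mfcc(tone, sr)
normalized = coeffs - coeffs.mean(axis=0)     # cepstral mean normalization
delta = np.gradient(normalized, axis=0)       # simple delta (velocity) features
print(coeffs.shape)                           # (98, 13)
```

Each row of `coeffs` is one frame and each column one cepstral coefficient; stacking `normalized`, `delta`, and a second gradient side by side would give the 39-dimensional feature vector.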

    ..

    10/26/2024

    Download a YouTube Video in the Best Quality

     code..

    import yt_dlp
    import os
    from typing import Optional

    def format_size(num_bytes: float) -> str:
        """Convert a byte count to a human-readable string."""
        for unit in ['B', 'KB', 'MB', 'GB']:
            if num_bytes < 1024:
                return f"{num_bytes:.2f} {unit}"
            num_bytes /= 1024
        return f"{num_bytes:.2f} TB"

    def download_video(url: str, output_path: Optional[str] = None) -> str:
        """
        Download a YouTube video in the best quality using yt-dlp.

        Args:
            url (str): The URL of the YouTube video
            output_path (str, optional): Directory to save the video

        Returns:
            str: Path of the saved file, or "" if the download failed
        """
        try:
            if not output_path:
                output_path = os.getcwd()
            os.makedirs(output_path, exist_ok=True)
            # Configure yt-dlp options for best quality
            ydl_opts = {
                # Best video + audio, falling back to the best single file
                'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best',
                'outtmpl': os.path.join(output_path, '%(title)s.%(ext)s'),
                'merge_output_format': 'mp4',  # Merge to MP4
                # .get() keeps the hook from raising KeyError when a field is missing
                'progress_hooks': [lambda d: print(f"\rDownloading: {d.get('_percent_str', '?')} of {d.get('_total_bytes_str', '?')}", end="") if d['status'] == 'downloading' else None],
                'postprocessor_hooks': [lambda d: print("\nMerging video and audio...") if d['status'] == 'started' else None],
                'quiet': False,
                'no_warnings': False,
                # Additional options for best quality
                'format_sort': ['res:2160', 'res:1440', 'res:1080', 'res:720'],
                'video_multistreams': True,
                'audio_multistreams': True,
                'prefer_free_formats': True,
                'postprocessors': [{
                    'key': 'FFmpegVideoConvertor',
                    'preferedformat': 'mp4',  # key spelling as expected by yt-dlp
                }],
            }
            print("Fetching video information...")
            # Create yt-dlp object and download the video
            with yt_dlp.YoutubeDL(ydl_opts) as ydl:
                # Get video info first
                info = ydl.extract_info(url, download=False)
                video_title = info.get('title', 'video')
                duration = info.get('duration')
                formats = info.get('formats', [])
                # Find best quality format ("or 0" guards against None values)
                best_video = max(
                    (f for f in formats if f.get('vcodec') != 'none'),
                    key=lambda f: (
                        f.get('height') or 0,
                        f.get('filesize') or 0
                    ),
                    default=None
                )
                # Print video details
                print("\nVideo details:")
                print(f"Title: {video_title}")
                if duration is not None:  # duration can be missing, e.g. for live streams
                    minutes, seconds = divmod(int(duration), 60)
                    print(f"Duration: {minutes}:{seconds:02d}")
                if best_video:
                    print(f"Best quality available: {best_video.get('height') or 'N/A'}p")
                    if best_video.get('filesize'):
                        print(f"Approximate size: {format_size(best_video['filesize'])}")
                print("\nStarting download in best quality...")
                # Download the video
                ydl.download([url])
                # Get the output filename from yt-dlp itself, since it sanitizes the title
                output_file = os.path.splitext(ydl.prepare_filename(info))[0] + '.mp4'
            print("\nDownload completed successfully!")
            print(f"Saved to: {output_file}")
            return output_file
        except Exception as e:
            print(f"\nError: {str(e)}")
            print("\nTroubleshooting steps:")
            print("1. Check if the video URL is correct")
            print("2. Check your internet connection")
            print("3. Make sure yt-dlp is up to date: pip install -U yt-dlp")
            print("4. Install or update ffmpeg (required for best quality):")
            print(" - On macOS: brew install ffmpeg")
            print(" - On Ubuntu/Debian: sudo apt-get install ffmpeg")
            print(" - On Windows: download from https://ffmpeg.org/download.html")
            return ""

    def main():
        """
        Main function to handle user input for video download.
        """
        print("YouTube Video Downloader (Best Quality)")
        print("-------------------------------------")
        print("This will download videos in the highest available quality")
        print("Note: Higher quality downloads may take longer and use more disk space")
        while True:
            url = input("\nEnter the YouTube video URL (or 'q' to quit): ").strip()
            if url.lower() == 'q':
                print("Goodbye!")
                break
            if not url:
                print("Please enter a valid URL")
                continue
            download_video(url)
            choice = input("\nWould you like to download another video? (y/n): ").strip().lower()
            if choice != 'y':
                print("Goodbye!")
                break

    if __name__ == "__main__":
        main()
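
The one-line progress lambda in the script is compact but hard to extend. A standalone hook is sketched below as a possible replacement. It assumes the documented progress-hook keys `status`, `downloaded_bytes`, `total_bytes`, and `total_bytes_estimate`; the function names `format_progress` and `progress_hook` are made up for this example.

```python
from typing import Optional

def format_progress(d: dict) -> Optional[str]:
    """Build a progress line from a yt-dlp progress-hook dict, tolerating missing keys."""
    if d.get("status") != "downloading":
        return None
    done = d.get("downloaded_bytes") or 0
    # total_bytes may be absent; fall back to the estimate when there is one
    total = d.get("total_bytes") or d.get("total_bytes_estimate")
    if not total:
        return f"Downloading: {done / 1_048_576:.1f} MiB"
    return f"Downloading: {100 * done / total:.1f}% of {total / 1_048_576:.1f} MiB"

def progress_hook(d: dict) -> None:
    """Print the progress line in place; pass as 'progress_hooks': [progress_hook]."""
    line = format_progress(d)
    if line:
        print("\r" + line, end="", flush=True)

sample = {"status": "downloading", "downloaded_bytes": 524288, "total_bytes": 1048576}
print(format_progress(sample))  # Downloading: 50.0% of 1.0 MiB
```

Keeping the formatting in a pure function makes it easy to check without starting a download.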

    ..


    That's it.

    But install this first (ffmpeg is also needed to merge the video and audio streams):

    pip install yt-dlp      


    Thank you!!!



    10/18/2024

    Sequence Parallel (SP)

    toy model

    class ToyModel(nn.Module):
        """MLP based model"""
        def __init__(self):
            super().__init__()
            self.in_proj = nn.Linear(10, 32)
            self.relu = nn.ReLU()
            self.out_proj = nn.Linear(32, 5)

        def forward(self, x):
            return self.out_proj(self.relu(self.in_proj(x)))

     .

    configuration

    sp_model = parallelize_module(
        module=model,
        device_mesh=device_mesh,
        parallelize_plan={
            "in_proj": ColwiseParallel(input_layouts=Shard(0)),
            "out_proj": RowwiseParallel(output_layouts=Shard(0)),
        },
    )

    ..






    1. Input Sharding:
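      • The input sequence (shape [4 x 12 x 10], i.e. [batch x sequence x hidden]) is initially split along the sequence length dimension across 3 GPUs.
      • Each GPU receives a [4 x 4 x 10] shard of the input.
      • Note: in this 3-D layout the sequence dimension is dim 1, which would be Shard(1). The script's input is 2-D ([20 x 10] per rank), so there Shard(0) is the sequence dimension.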
    2. All-Gather Operation:
      • An all-gather operation is performed to reconstruct the full input on each GPU.
      • After this, each GPU has the full [4 x 12 x 10] input.
    3. First Layer - in_proj (ColwiseParallel):
      • The weight matrix [10 x 32] is split column-wise across GPUs: [10 x 11], [10 x 11], [10 x 10].
      • Each GPU processes the full input [4 x 12 x 10] with its portion of the weight matrix.
      • The output on each GPU is [4 x 12 x 11], [4 x 12 x 11], and [4 x 12 x 10] respectively.
    4. ReLU Activation:
      • Applied element-wise to the output of the first layer on each GPU.
      • Shapes remain [4 x 12 x 11], [4 x 12 x 11], and [4 x 12 x 10] on the respective GPUs.
    5. Second Layer - out_proj (RowwiseParallel):
      • The weight matrix [32 x 5] is split row-wise across GPUs: [11 x 5], [11 x 5], [10 x 5].
      • Each GPU processes its input ([4 x 12 x 11], [4 x 12 x 11], [4 x 12 x 10]) with its portion of the weight matrix.
      • The output on each GPU is [4 x 12 x 5], representing partial sums for the full sequence.
    6. Reduce-Scatter Operation:
      • A reduce-scatter operation is performed to sum the partial results and distribute them across GPUs.
      • This results in each GPU having a portion of the final output, sharded along the sequence dimension.
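
The six steps can be checked on a single machine by simulating the three GPUs with plain NumPy arrays. This is only a sketch of the arithmetic, not of the real collectives: it omits biases, uses the [in x out] weight layout from the explanation above (not `nn.Linear`'s stored [out x in]), and stands in concatenation and summation for all-gather and reduce-scatter.

```python
import numpy as np

rng = np.random.default_rng(0)
world_size = 3
x = rng.standard_normal((4, 12, 10))   # [batch, sequence, hidden]
w_in = rng.standard_normal((10, 32))   # in_proj weight, biases omitted
w_out = rng.standard_normal((32, 5))   # out_proj weight, biases omitted

# 1. Input sharding along the sequence dimension: three [4 x 4 x 10] shards
x_shards = np.split(x, world_size, axis=1)

# 2. All-gather: every simulated GPU ends up with the full [4 x 12 x 10] input
x_full = np.concatenate(x_shards, axis=1)

# 3. in_proj, column-wise split of the weight: [10 x 11], [10 x 11], [10 x 10]
w_in_shards = np.array_split(w_in, world_size, axis=1)
hidden = [x_full @ w for w in w_in_shards]

# 4. ReLU is element-wise, so it can run on each shard independently
hidden = [np.maximum(h, 0.0) for h in hidden]

# 5. out_proj, row-wise split of the weight: [11 x 5], [11 x 5], [10 x 5]
w_out_shards = np.array_split(w_out, world_size, axis=0)
partial = [h @ w for h, w in zip(hidden, w_out_shards)]  # each [4 x 12 x 5]

# 6. Reduce-scatter: sum the partial results, then re-shard along the sequence
summed = sum(partial)
out_shards = np.split(summed, world_size, axis=1)         # three [4 x 4 x 5]

# The sharded result matches the unsharded two-layer computation
reference = np.maximum(x @ w_in, 0.0) @ w_out
assert np.allclose(np.concatenate(out_shards, axis=1), reference)
print([w.shape for w in w_in_shards], out_shards[0].shape)
```

The final assertion is the point of the exercise: a column-wise split followed by a row-wise split needs no communication between the two layers, only the gather before and the reduce after.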

    Key points:

    • There are two collective operations: an all-gather at the beginning and a reduce-scatter at the end.
    • The GPUs do not hold equally sized first-layer outputs, because the 32 columns of the weight matrix do not divide evenly across 3 GPUs (11, 11, 10).
    • The sequence dimension (12 in this example) is not sharded during the middle layers; it is reconstructed by the all-gather and re-sharded by the reduce-scatter.

    In short: the input is gathered, both linear layers are computed in parallel on weight shards, and the output is scattered again, so each GPU ends up with its own slice of the sequence.



    full source code
    .
    import os
    import sys
    import torch
    import torch.nn as nn
    from torch.distributed._tensor import Shard
    from torch.distributed.tensor.parallel import (
        parallelize_module,
        ColwiseParallel,
        RowwiseParallel,
    )
    from log_utils import rank_log, get_logger, verify_min_gpu_count
    import torch.profiler

    # ---- GPU check ------------
    _min_gpu_count = 2
    if not verify_min_gpu_count(min_gpus=_min_gpu_count):
        print(f"Unable to locate sufficient {_min_gpu_count} gpus to run this example. Exiting.")
        sys.exit()
    # ---------------------------
    from torch.distributed._tensor.device_mesh import init_device_mesh



    """
    This is the script to test Sequence Parallel (SP) on a toy model in a
    Megatron-LM SPMD style. We show an E2E working flow from forward,
    backward and optimization.

    We use the example of two `nn.Linear` layers with an element-wise `nn.ReLU`
    in between to show an example of sequence parallel, which was proposed in paper:

    https://arxiv.org/pdf/2205.05198.pdf.

    Like tensor parallel, we parallelize the first linear layer by column
    and also parallelize the second linear layer by row. But the input in each rank
    now is different so that we need one all-gather for input and one reduce-scatter
    in the end of the second linear layer.
    """

    class ToyModel(nn.Module):
        """MLP based model"""
        def __init__(self):
            super().__init__()
            self.in_proj = nn.Linear(10, 32)
            self.relu = nn.ReLU()
            self.out_proj = nn.Linear(32, 5)

        def forward(self, x):
            return self.out_proj(self.relu(self.in_proj(x)))

    def main():
        logger = get_logger()
        # create a device mesh based on the given world_size.
        device_mesh = init_device_mesh(
            device_type="cuda", mesh_shape=(int(os.environ["WORLD_SIZE"]),)
        )
        _rank = device_mesh.get_rank()
        print(f"Starting PyTorch Sequence Parallel example on rank {_rank}.")
        rank_log(_rank, logger, f"Device Mesh created: {device_mesh=}")

        # create model and move it to GPU. Init_device_mesh has already assigned gpu ids...
        model = ToyModel().to("cuda")

        # Custom parallelization plan for the model
        sp_model = parallelize_module(
            module=model,
            device_mesh=device_mesh,
            parallelize_plan={
                "in_proj": ColwiseParallel(input_layouts=Shard(0)),
                "out_proj": RowwiseParallel(output_layouts=Shard(0)),
            },
        )

        # Create an optimizer for the parallelized module.
        lr = 0.25
        optimizer = torch.optim.AdamW(sp_model.parameters(), lr=lr, foreach=True)

        # Perform a number of iterations of forward/backward
        # and optimizations for the sharded module.
        num_iters = 10
        rank_log(_rank, logger, "Sequence Parallel training starting...")

        with torch.profiler.profile(
            activities=[
                torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA,
            ],
            # 2 cycles of (1 wait + 1 warmup + 3 active) steps = the 10 iterations below
            schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
            on_trace_ready=torch.profiler.tensorboard_trace_handler(f'./log/tensorboard/rank_{_rank}'),
            record_shapes=True,
            profile_memory=True,
            with_stack=True
        ) as prof:
            for i in range(num_iters):
                # For SP, input can be different across all ranks.
                inp = torch.rand(20, 10, device="cuda")
                # Clear gradients so they do not accumulate across iterations
                optimizer.zero_grad()
                output = sp_model(inp)
                output.sum().backward()
                optimizer.step()
                rank_log(_rank, logger, f"Sequence Parallel iter {i} completed")
                prof.step()

        rank_log(_rank, logger, "Sequence Parallel training completed!")

        # Print profiler results
        print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

    if __name__ == "__main__":
        main()
    ..

    Thank you!

    10/11/2024

    10/10/2024

    FSDP and TP explained for a 2-layer model

     FSDP and TP are complementary parallelism techniques:

    1. FSDP (Fully Sharded Data Parallelism):
      • Shards model parameters across GPUs
      • Each GPU holds a portion of each layer's parameters
      • During the forward/backward pass, it all-gathers the parameters a layer needs just before use and reduce-scatters the gradients afterwards
      • Reduces memory usage per GPU, allowing larger models
    2. TP (Tensor Parallelism):
      • Splits individual tensors (layers) across GPUs
      • Each GPU computes a portion of a layer's operations
      • Useful for very large layers that don't fit on a single GPU

    When combined:

    • FSDP handles overall model distribution
    • TP handles distribution of large individual layers
    • This allows for even larger models and better GPU utilization
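
To get a feel for the memory effect, the sketch below counts how many weight elements each GPU stores for a 2-layer model. It assumes one particular layout: every weight is split `tp_degree` ways, and each TP shard is then sharded `fsdp_degree` ways. The layer shapes (10 x 32 and 32 x 5) are borrowed from the toy MLP in the sequence-parallel note; biases, activations, and optimizer state are ignored, and `weights_per_gpu` is a made-up helper name.

```python
def weights_per_gpu(layer_shapes, tp_degree, fsdp_degree):
    """Weight elements stored per GPU under the assumed TP x FSDP layout."""
    total = sum(rows * cols for rows, cols in layer_shapes)
    return total / (tp_degree * fsdp_degree)

layers = [(10, 32), (32, 5)]  # assumed 2-layer MLP: 320 + 160 = 480 weights

for tp, fsdp in [(1, 1), (2, 1), (1, 2), (2, 2)]:
    per_gpu = weights_per_gpu(layers, tp, fsdp)
    print(f"TP={tp} FSDP={fsdp} GPUs={tp * fsdp}: {per_gpu:.0f} weights per GPU")
```

With 4 GPUs arranged as TP=2 and FSDP=2, each GPU stores a quarter of the weights (120 of 480) under this assumption.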

    Textual Representation:

      GPU 1      GPU 2      GPU 3      GPU 4
    +--------+ +--------+ +--------+ +--------+
    | L1 P1  | | L1 P2  | | L2 P1  | | L2 P2  |
    | TP1    | | TP2    | | TP1    | | TP2    |
    +--------+ +--------+ +--------+ +--------+
         |          |          |          |
         +----------+          +----------+
           Layer 1               Layer 2

    L1, L2: Layers 1 and 2
    P1, P2: Parameter shards (FSDP)
    TP1, TP2: Tensor Parallel splits