    Efficient Metric Collection in PyTorch: Avoiding the Performance Pitfalls of TorchMetrics

    February 7, 2025

    Metric collection is an essential part of every machine learning project, enabling us to track model performance and monitor training progress. Ideally, metrics should be collected and computed without introducing any additional overhead to the training process. However, just like other components of the training loop, inefficient metric computation can introduce unnecessary overhead, increase training-step times, and inflate training costs.

    This post is the seventh in our series on performance profiling and optimization in PyTorch. The series has aimed to emphasize the critical role of performance analysis and optimization in machine learning development. Each post has focused on a different stage of the training pipeline, demonstrating practical tools and techniques for analyzing and boosting resource utilization and runtime efficiency.

    In this installment, we focus on metric collection. We will demonstrate how a naïve implementation of metric collection can negatively impact runtime performance and explore tools and techniques for its analysis and optimization.

    To implement our metric collection, we will use TorchMetrics, a popular library designed to simplify and standardize metric computation in PyTorch. Our goals will be to:

    1. Demonstrate the runtime overhead caused by a naïve implementation of metric collection.
    2. Use PyTorch Profiler to pinpoint performance bottlenecks introduced by metric computation.
    3. Demonstrate optimization techniques to reduce metric collection overhead.

    To facilitate our discussion, we will define a toy PyTorch model and assess how metric collection can affect its runtime performance. We will run our experiments on an NVIDIA A40 GPU, with a PyTorch 2.5.1 Docker image and TorchMetrics 1.6.1.

    It is important to note that metric collection behavior can vary greatly depending on the hardware, runtime environment, and model architecture. The code snippets provided in this post are intended for demonstrative purposes only. Please do not interpret our mention of any tool or technique as an endorsement of its use.

    Toy ResNet Model

    In the code block below, we define a simple image classification model with a ResNet-18 backbone.

    import time
    import torch
    import torchvision

    device = "cuda"

    model = torchvision.models.resnet18().to(device)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters())

    We define a synthetic dataset that we will use to train our toy model.

    from torch.utils.data import Dataset, DataLoader

    # A dataset with random images and labels
    class FakeDataset(Dataset):
        def __len__(self):
            return 100000000

        def __getitem__(self, index):
            rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
            label = torch.tensor(data=index % 1000, dtype=torch.int64)
            return rand_image, label

    train_set = FakeDataset()

    batch_size = 128
    num_workers = 12

    train_loader = DataLoader(
        dataset=train_set,
        batch_size=batch_size,
        num_workers=num_workers,
        pin_memory=True
    )

    We define a collection of standard metrics from TorchMetrics, along with a control flag to enable or disable metric calculation.

    from torchmetrics import (
        MeanMetric,
        Accuracy,
        Precision,
        Recall,
        F1Score,
    )

    # toggle to enable/disable metric collection
    capture_metrics = False

    if capture_metrics:
        metrics = {
            "avg_loss": MeanMetric(),
            "accuracy": Accuracy(task="multiclass", num_classes=1000),
            "precision": Precision(task="multiclass", num_classes=1000),
            "recall": Recall(task="multiclass", num_classes=1000),
            "f1_score": F1Score(task="multiclass", num_classes=1000),
        }

        # Move all metrics to the device
        metrics = {name: metric.to(device) for name, metric in metrics.items()}

    Next, we define a PyTorch Profiler instance, along with a control flag that allows us to enable or disable profiling. For a detailed tutorial on using PyTorch Profiler, please refer to the first post in this series.

    from torch import profiler

    # toggle to enable/disable profiling
    enable_profiler = True

    if enable_profiler:
        prof = profiler.profile(
            schedule=profiler.schedule(wait=10, warmup=2, active=3, repeat=1),
            on_trace_ready=profiler.tensorboard_trace_handler("./logs/"),
            profile_memory=True,
            with_stack=True
        )
        prof.start()

    Finally, we define a standard training step:

    model.train()

    t0 = time.perf_counter()
    total_time = 0
    count = 0

    for idx, (data, target) in enumerate(train_loader):
        data = data.to(device, non_blocking=True)
        target = target.to(device, non_blocking=True)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        if capture_metrics:
            # update metrics
            metrics["avg_loss"].update(loss)
            for name, metric in metrics.items():
                if name != "avg_loss":
                    metric.update(output, target)

            if (idx + 1) % 100 == 0:
                # compute metrics
                metric_results = {
                    name: metric.compute().item()
                        for name, metric in metrics.items()
                }
                # print metrics
                print(f"Step {idx + 1}: {metric_results}")
                # reset metrics
                for metric in metrics.values():
                    metric.reset()

        elif (idx + 1) % 100 == 0:
            # print last loss value
            print(f"Step {idx + 1}: Loss = {loss.item():.4f}")

        batch_time = time.perf_counter() - t0
        t0 = time.perf_counter()
        if idx > 10:  # skip first steps
            total_time += batch_time
            count += 1

        if enable_profiler:
            prof.step()

        if idx > 200:
            break

    if enable_profiler:
        prof.stop()

    avg_time = total_time / count
    print(f'Average step time: {avg_time}')
    print(f'Throughput: {batch_size / avg_time:.2f} images/sec')

    Metric Collection Overhead

    To measure the impact of metric collection on training step time, we ran our training script both with and without metric calculation. The results are summarized in the following table.

    Our naïve metric collection resulted in a nearly 10% drop in runtime performance! While metric collection is essential for machine learning development, it usually involves relatively simple mathematical operations and hardly warrants such a significant overhead. What is going on?!

    Identifying Performance Issues with PyTorch Profiler

    To better understand the source of the performance degradation, we reran the training script with the PyTorch Profiler enabled. The resulting trace is shown below:

    Trace of Metric Collection Experiment (by Author)

    The trace reveals recurring "cudaStreamSynchronize" operations that coincide with noticeable drops in GPU utilization. These types of "CPU-GPU sync" events were discussed in detail in part two of our series. In a typical training step, the CPU and GPU work in parallel: the CPU manages tasks like data transfers to the GPU and kernel loading, while the GPU executes the model on the input data and updates its weights. Ideally, we would like to minimize the points of synchronization between the CPU and GPU in order to maximize performance. Here, however, we can see that the metric collection has triggered a sync event by performing a CPU-to-GPU data copy. This requires the CPU to suspend its processing until the GPU catches up, which, in turn, causes the GPU to wait for the CPU to resume loading the subsequent kernel operations. The bottom line is that these synchronization points lead to inefficient utilization of both the CPU and GPU. Our metric collection implementation adds eight such synchronization events to each training step.
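    To make the mechanism concrete, here is a minimal standalone sketch (illustrative only, not part of our training script) of the two directions of implicit transfer that can serialize the CPU and GPU:

    import torch

    device = "cuda"
    loss = torch.rand(1, device=device)

    # GPU -> CPU: reading a device value back to the host blocks the CPU
    # until the GPU has finished computing it
    loss_value = loss.item()

    # CPU -> GPU: materializing a host scalar on the device triggers a
    # memory copy, similar to what a naive metric update performs under the hood
    w = torch.as_tensor(1.0, device=device)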

    A closer examination of the trace shows that the sync events are coming from the update call of the MeanMetric TorchMetric. For the experienced profiling expert, this may be sufficient to identify the root cause, but we will go a step further and use the torch.profiler.record_function utility to identify the exact offending line of code.

    Profiling with record_function

    To pinpoint the exact source of the sync event, we extended the MeanMetric class and overrode the update method using record_function context blocks. This approach allows us to profile individual operations within the method and identify performance bottlenecks.

    class ProfileMeanMetric(MeanMetric):
        def update(self, value, weight = 1.0):
            # broadcast weight to value shape
            with profiler.record_function("process value"):
                if not isinstance(value, torch.Tensor):
                    value = torch.as_tensor(value, dtype=self.dtype,
                                            device=self.device)
            with profiler.record_function("process weight"):
                if weight is not None and not isinstance(weight, torch.Tensor):
                    weight = torch.as_tensor(weight, dtype=self.dtype,
                                             device=self.device)
            with profiler.record_function("broadcast weight"):
                weight = torch.broadcast_to(weight, value.shape)
            with profiler.record_function("cast_and_nan_check"):
                value, weight = self._cast_and_nan_check_input(value, weight)

            if value.numel() == 0:
                return

            with profiler.record_function("update value"):
                self.mean_value += (value * weight).sum()
            with profiler.record_function("update weight"):
                self.weight += weight.sum()

    We then updated our avg_loss metric to use the newly created ProfileMeanMetric and reran the training script.

    Trace of Metric Collection with record_function (by Author)

    The updated trace reveals that the sync event originates from the following line:

    weight = torch.as_tensor(weight, dtype=self.dtype, device=self.device)

    This operation converts the default scalar value weight=1.0 into a PyTorch tensor and places it on the GPU. The sync event occurs because this action triggers a CPU-to-GPU data copy, which requires the CPU to wait for the GPU to process the copied value.

    Optimization 1: Specify the Weight Value

    Now that we have found the source of the issue, we can overcome it simply by specifying a weight value in our update call. This prevents the runtime from converting the default scalar weight=1.0 into a tensor on the GPU, avoiding the sync event:

    # update metrics
    if capture_metrics:
        metrics["avg_loss"].update(loss, weight=torch.ones_like(loss))

    Rerunning the script after applying this change reveals that we have succeeded in eliminating the initial sync event… only to have uncovered a new one, this time coming from the _cast_and_nan_check_input function:

    Trace of Metric Collection following Optimization 1 (by Author)

    Profiling with record_function — Part 2

    To explore our new sync event, we extended our custom metric with additional profiling probes and reran our script.

    class ProfileMeanMetric(MeanMetric):
        def update(self, value, weight = 1.0):
            # broadcast weight to value shape
            with profiler.record_function("process value"):
                if not isinstance(value, torch.Tensor):
                    value = torch.as_tensor(value, dtype=self.dtype,
                                            device=self.device)
            with profiler.record_function("process weight"):
                if weight is not None and not isinstance(weight, torch.Tensor):
                    weight = torch.as_tensor(weight, dtype=self.dtype,
                                             device=self.device)
            with profiler.record_function("broadcast weight"):
                weight = torch.broadcast_to(weight, value.shape)
            with profiler.record_function("cast_and_nan_check"):
                value, weight = self._cast_and_nan_check_input(value, weight)

            if value.numel() == 0:
                return

            with profiler.record_function("update value"):
                self.mean_value += (value * weight).sum()
            with profiler.record_function("update weight"):
                self.weight += weight.sum()

        def _cast_and_nan_check_input(self, x, weight = None):
            """Convert input ``x`` to a tensor and check for NaNs."""
            with profiler.record_function("process x"):
                if not isinstance(x, torch.Tensor):
                    x = torch.as_tensor(x, dtype=self.dtype,
                                        device=self.device)
            with profiler.record_function("process weight"):
                if weight is not None and not isinstance(weight, torch.Tensor):
                    weight = torch.as_tensor(weight, dtype=self.dtype,
                                             device=self.device)
                nans = torch.isnan(x)
                if weight is not None:
                    nans_weight = torch.isnan(weight)
                else:
                    nans_weight = torch.zeros_like(nans).bool()
                    weight = torch.ones_like(x)

            with profiler.record_function("any nans"):
                anynans = nans.any() or nans_weight.any()

            with profiler.record_function("process nans"):
                if anynans:
                    if self.nan_strategy == "error":
                        raise RuntimeError("Encountered `nan` values in tensor")
                    if self.nan_strategy in ("ignore", "warn"):
                        if self.nan_strategy == "warn":
                            print("Encountered `nan` values in tensor."
                                  " Will be removed.")
                        x = x[~(nans | nans_weight)]
                        weight = weight[~(nans | nans_weight)]
                    else:
                        if not isinstance(self.nan_strategy, float):
                            raise ValueError(f"`nan_strategy` shall be float"
                                             f" but you pass {self.nan_strategy}")
                        x[nans | nans_weight] = self.nan_strategy
                        weight[nans | nans_weight] = self.nan_strategy

            with profiler.record_function("return value"):
                retval = x.to(self.dtype), weight.to(self.dtype)
            return retval

    The resulting trace is captured below:

    Trace of Metric Collection with record_function — part 2 (by Author)

    The trace points directly to the offending line:

    anynans = nans.any() or nans_weight.any()

    This operation checks for NaN values in the input tensors, but it introduces a costly CPU-GPU synchronization event because the operation involves copying data from the GPU to the CPU.
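    The following standalone sketch (illustrative, not taken from the library code) shows why this line blocks: using a CUDA boolean tensor in a Python `or` or `if` expression invokes Tensor.__bool__, which must copy the result back to the host:

    import torch

    nans = torch.zeros(128, dtype=torch.bool, device="cuda")
    nans_weight = torch.zeros(128, dtype=torch.bool, device="cuda")

    # each .any() launches an asynchronous GPU reduction, but Python's `or`
    # needs a host boolean, so Tensor.__bool__ forces a blocking GPU-to-CPU
    # copy (seen as cudaStreamSynchronize in the trace)
    anynans = nans.any() or nans_weight.any()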

    Upon closer inspection of the TorchMetrics BaseAggregator class, we find several options for handling NaN value updates, all of which pass through the offending line of code. However, for our use case of calculating the average loss metric, this check is unnecessary and does not justify the runtime performance penalty.
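    For reference, the NaN-handling behavior is selected through the metric's nan_strategy constructor argument. A brief sketch of the available options (per the TorchMetrics documentation):

    from torchmetrics import MeanMetric

    # all of these strategies pass through the NaN check above
    m_warn = MeanMetric(nan_strategy="warn")      # default: warn and drop NaNs
    m_error = MeanMetric(nan_strategy="error")    # raise on NaN values
    m_ignore = MeanMetric(nan_strategy="ignore")  # silently drop NaNs
    m_impute = MeanMetric(nan_strategy=0.0)       # replace NaNs with a float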

    Optimization 2: Disable NaN Value Checks

    To eliminate the overhead, we propose disabling the NaN value checks by overriding the _cast_and_nan_check_input function. Instead of a static override, we implement a dynamic solution that can be applied flexibly to any descendant of the BaseAggregator class.

    from torchmetrics.aggregation import BaseAggregator

    def suppress_nan_check(MetricClass):
        assert issubclass(MetricClass, BaseAggregator), MetricClass
        class DisableNanCheck(MetricClass):
            def _cast_and_nan_check_input(self, x, weight=None):
                if not isinstance(x, torch.Tensor):
                    x = torch.as_tensor(x, dtype=self.dtype,
                                        device=self.device)
                if weight is not None and not isinstance(weight, torch.Tensor):
                    weight = torch.as_tensor(weight, dtype=self.dtype,
                                             device=self.device)
                if weight is None:
                    weight = torch.ones_like(x)
                return x.to(self.dtype), weight.to(self.dtype)
        return DisableNanCheck

    NoNanMeanMetric = suppress_nan_check(MeanMetric)

    metrics["avg_loss"] = NoNanMeanMetric().to(device)

    Post-Optimization Results: Success

    After implementing the two optimizations, specifying the weight value and disabling the NaN checks, we find the step time performance and the GPU utilization to match those of our baseline experiment. In addition, the resulting PyTorch Profiler trace shows that all of the added "cudaStreamSynchronize" events that were associated with the metric collection have been eliminated. With a few small changes, we have reduced the cost of training by ~10% without any changes to the behavior of the metric collection.

    In the next section, we will explore an additional metric collection optimization.

    Example 2: Optimizing Metric Device Placement

    In the previous section, the metric values resided on the GPU, making it logical to store and compute the metrics on the GPU. However, in scenarios where the values we wish to aggregate reside on the CPU, it may be preferable to store the metrics on the CPU to avoid unnecessary device transfers.

    In the code block below, we modify our script to calculate the average step time using a MeanMetric on the CPU. This change has no impact on the runtime performance of our training step:

    avg_time = NoNanMeanMetric()
    t0 = time.perf_counter()

    for idx, (data, target) in enumerate(train_loader):
        # move data to device
        data = data.to(device, non_blocking=True)
        target = target.to(device, non_blocking=True)

        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        if capture_metrics:
            metrics["avg_loss"].update(loss)
            for name, metric in metrics.items():
                if name != "avg_loss":
                    metric.update(output, target)

            if (idx + 1) % 100 == 0:
                # compute metrics
                metric_results = {
                    name: metric.compute().item()
                        for name, metric in metrics.items()
                }
                # print metrics
                print(f"Step {idx + 1}: {metric_results}")
                # reset metrics
                for metric in metrics.values():
                    metric.reset()

        elif (idx + 1) % 100 == 0:
            # print last loss value
            print(f"Step {idx + 1}: Loss = {loss.item():.4f}")

        batch_time = time.perf_counter() - t0
        t0 = time.perf_counter()
        if idx > 10:  # skip first steps
            avg_time.update(batch_time)

        if enable_profiler:
            prof.step()

        if idx > 200:
            break

    if enable_profiler:
        prof.stop()

    avg_time = avg_time.compute().item()
    print(f'Average step time: {avg_time}')
    print(f'Throughput: {batch_size / avg_time:.2f} images/sec')

    The problem arises when we attempt to extend our script to support distributed training. To demonstrate the problem, we modify our model definition to use DistributedDataParallel (DDP):

    # toggle to enable/disable ddp
    use_ddp = True

    if use_ddp:
        import os
        import torch.distributed as dist
        from torch.nn.parallel import DistributedDataParallel as DDP
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("nccl", rank=0, world_size=1)
        torch.cuda.set_device(0)
        model = DDP(torchvision.models.resnet18().to(device))
    else:
        model = torchvision.models.resnet18().to(device)

    # insert training loop

    # append to the end of the script:
    if use_ddp:
        # destroy the process group
        dist.destroy_process_group()

    The DDP modification results in the following error:

    RuntimeError: No backend type associated with device type cpu

    By default, metrics in distributed training are programmed to synchronize across all devices in use. However, the synchronization backend used by DDP does not support metrics stored on the CPU.

    One way to solve this is to disable the cross-device metric synchronization:

    avg_time = NoNanMeanMetric(sync_on_compute=False)

    In our case, where we are measuring the average time, this solution is acceptable. However, in some cases, the metric synchronization is essential, and we may have no choice but to move the metric onto the GPU:

    avg_time = NoNanMeanMetric().to(device)

    Unfortunately, this gives rise to a new CPU-GPU sync event coming from the update function.

    Trace of avg_time Metric Collection (by Author)

    This sync event should hardly come as a surprise: after all, we are updating a GPU metric with a value residing on the CPU, which should necessitate a memory copy. However, in the case of a scalar metric, this data transfer can be completely avoided with a simple optimization.

    Optimization 3: Perform Metric Updates with Tensors Instead of Scalars

    The solution is straightforward: instead of updating the metric with a float value, we convert it to a Tensor before calling update.

    batch_time = torch.as_tensor(batch_time)
    avg_time.update(batch_time, torch.ones_like(batch_time))

    This minor change bypasses the problematic line of code, eliminates the sync event, and restores the step time to the baseline performance.

    At first glance, this result may seem surprising: we would expect that updating a GPU metric with a CPU tensor should still require a memory copy. However, PyTorch optimizes operations on scalar tensors by using a dedicated kernel that performs the addition without an explicit data transfer. This avoids the expensive synchronization event that would otherwise occur.
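    A minimal sketch of this behavior follows (illustrative only; the exact kernel dispatch may vary across PyTorch versions):

    import torch

    gpu_acc = torch.zeros(1, device="cuda")

    # a 0-dim CPU tensor is treated as a scalar and added by a dedicated
    # kernel, with no blocking host-to-device copy
    gpu_acc += torch.as_tensor(0.123)

    # explicitly materializing the scalar on the GPU would instead trigger
    # the memory copy we are trying to avoid:
    # gpu_acc += torch.as_tensor(0.123, device="cuda")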

    Summary

    In this post, we explored how a naïve approach to TorchMetrics can introduce CPU-GPU synchronization events and significantly degrade PyTorch training performance. Using PyTorch Profiler, we identified the lines of code responsible for these sync events and applied targeted optimizations to eliminate them:

    • Explicitly specify a weight tensor when calling the MeanMetric.update function instead of relying on the default value.
    • Disable NaN checks in the base Aggregator class or replace them with a more efficient alternative.
    • Carefully manage the device placement of each metric to minimize unnecessary transfers.
    • Disable cross-device metric synchronization when not required.
    • When the metric resides on a GPU, convert floating-point scalars to tensors before passing them to the update function to avoid implicit synchronization.

    We have created a dedicated pull request on the TorchMetrics GitHub page covering some of the optimizations discussed in this post. Please feel free to contribute your own improvements and optimizations!


