Day 26 of #100DaysOfML 🚀
Performance optimization is essential for deep learning models, especially when deploying to production. Today, let's explore how to profile PyTorch models to identify and fix performance bottlenecks.
PyTorch provides powerful built-in profiling tools that help us analyze model performance across CPU and GPU operations. Let's dive into a practical example:
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool2d(2)
        # 32x32 input -> 30x30 after the 3x3 conv -> 15x15 after pooling
        self.fc = nn.Linear(64 * 15 * 15, 10)

    def forward(self, x):
        with record_function("conv_block"):
            x = self.conv1(x)
            x = self.relu(x)
            x = self.pool(x)
        with record_function("classifier"):
            x = x.view(x.size(0), -1)
            x = self.fc(x)
        return x
# Create sample data
model = SimpleModel().cuda()
inputs = torch.randn(32, 3, 32, 32).cuda()

# Profile the model
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    with_stack=True
) as prof:
    for _ in range(5):
        model(inputs)

# Print profiling results
print(prof.key_averages().table(
    sort_by="cuda_time_total",
    row_limit=10
))

# Export trace for visualization
prof.export_chrome_trace("pytorch_trace.json")
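The exported trace can be opened in chrome://tracing (or the Perfetto UI at ui.perfetto.dev) to inspect the operation timeline visually.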
Key things to look for in the output:
- Operation-level metrics: The profiler shows timing for each operation, helping identify slow operations.
- Memory usage: Track memory allocation and release patterns.
- CPU-GPU synchronization: Identify potential bottlenecks in data transfer.
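To actually capture those memory numbers, the profiler can be re-run with memory tracking enabled. A minimal sketch, reusing the model and inputs from above:
# Re-run the profiler with memory tracking enabled
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True
) as prof:
    model(inputs)

# Sort by memory to surface allocation-heavy operations
print(prof.key_averages().table(
    sort_by="self_cuda_memory_usage",
    row_limit=10
))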
With those numbers in hand, here are three common optimization targets:
1. Data Loading
- Use DataLoader with num_workers > 0
- Enable pin_memory=True for faster CPU-to-GPU transfer (see the sketch after this list)
2. Model Architecture
- Replace expensive operations with efficient alternatives
- Use appropriate batch sizes
- Consider model compression techniques
3. GPU Utilization
- Ensure the batch size fits in GPU memory
- Use mixed precision training when possible
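For point 1, a minimal sketch (the TensorDataset here is dummy data standing in for a real dataset):
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for a real dataset
dataset = TensorDataset(
    torch.randn(1024, 3, 32, 32),
    torch.randint(0, 10, (1024,))
)

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,     # load batches in parallel worker processes
    pin_memory=True    # page-locked memory speeds up host-to-GPU copies
)

for images, labels in loader:
    # non_blocking=True overlaps the copy with compute when memory is pinned
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    # ... forward/backward pass goes here

And for point 3, here's what automatic mixed precision looks like in a training step: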
# Enable automatic mixed precision
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
with autocast():
    outputs = model(inputs)
    loss = criterion(outputs, targets)  # criterion and targets defined elsewhere
scaler.scale(loss).backward()
scaler.step(optimizer)  # optimizer defined elsewhere
scaler.update()
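The GradScaler is there because fp16 gradients can underflow to zero: scaling the loss up before backward() and unscaling before the optimizer step keeps small gradient values representable.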
Remember: Profile before optimizing! Data-driven optimization always beats guesswork.
Happy profiling! 🔍