What if you want to write the entire object detection training pipeline from scratch, so you can understand each step and customize it? That’s what I set out to do. I examined several well-known object detection pipelines and designed one that best suits my needs and tasks. Thanks to the Ultralytics, YOLOX, DAMO-YOLO, RT-DETR and D-FINE repos, I gained a deeper understanding of various design details. I ended up implementing the SoTA real-time object detection model D-FINE in my custom pipeline.
Plan
- Dataset, augmentations and transforms:
  - Mosaic (with affine transforms)
  - MixUp and Cutout
  - Other augmentations with bounding boxes
  - Letterbox vs simple resize
- Training:
  - Optimizer
  - Scheduler
  - EMA
  - Batch accumulation
  - AMP
  - Grad clipping
  - Logging
- Metrics:
  - mAPs from TorchMetrics / cocotools
  - How to compute Precision, Recall, IoU?
- Pick a suitable solution for your case
- Experiments
- Attention to data preprocessing
- Where to start
Dataset
Dataset processing is usually the first thing you start working on. For object detection, you need to load your images and annotations. Annotations are often stored in COCO format as a JSON file, or in YOLO format, with one txt file per image. Let’s take a look at the YOLO format. Each line is structured as: class_id, x_center, y_center, width, height, where the bbox values are normalized between 0 and 1.
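To make the format concrete, here is a minimal sketch that converts one YOLO label line into absolute corner coordinates (the pascal_voc layout that the Albumentations config further down expects). The function name and the sample values are illustrative assumptions, not part of the original pipeline.

```python
def yolo_line_to_xyxy(line: str, img_w: int, img_h: int):
    """Convert 'class_id x_center y_center width height' (normalized) to pixel (x1, y1, x2, y2)."""
    class_id, xc, yc, w, h = line.split()
    # Scale normalized values back to pixels
    xc, w = float(xc) * img_w, float(w) * img_w
    yc, h = float(yc) * img_h, float(h) * img_h
    # Center/size -> top-left and bottom-right corners
    return int(class_id), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2


print(yolo_line_to_xyxy("0 0.5 0.5 0.25 0.5", img_w=640, img_h=480))
# (0, 240.0, 120.0, 400.0, 360.0)
```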
Once you have your images and txt files, you can write your dataset class; nothing tricky here. Load everything, transform it (augmentations included) and return it during training. I prefer splitting the data by creating a CSV file for each split and then reading it in the Dataloader class, rather than physically moving files into train/val/test folders. This is an example of a customization that helped my use case.
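A CSV-based split could look roughly like the sketch below. The file names (train.csv, val.csv), the one-column layout, the 80/20 ratio and the fixed seed are all assumptions made for illustration; the actual pipeline may organize this differently.

```python
import csv
import random
from pathlib import Path


def write_split_csvs(image_names, out_dir, val_frac=0.2, seed=42):
    """Shuffle image names deterministically and write train.csv / val.csv."""
    names = sorted(image_names)
    random.Random(seed).shuffle(names)  # fixed seed keeps the split reproducible
    n_val = int(len(names) * val_frac)
    splits = {"val": names[:n_val], "train": names[n_val:]}
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for split, items in splits.items():
        with open(out_dir / f"{split}.csv", "w", newline="") as f:
            csv.writer(f).writerows([name] for name in items)
    return splits
```

The dataset class then only needs to read the CSV for its mode, and changing the split never requires moving image files.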
Augmentations
First, when augmenting images for object detection, it’s crucial to apply the same transformations to the bounding boxes. To do that comfortably, I use the Albumentations library. For example:
import albumentations as A
import cv2
from albumentations.pytorch import ToTensorV2


# Method of the dataset class
def _init_augs(self, cfg) -> None:
    # Letterbox (keep aspect ratio, pad with gray) or simple resize
    if self.keep_ratio:
        resize = [
            A.LongestMaxSize(max_size=max(self.target_h, self.target_w)),
            A.PadIfNeeded(
                min_height=self.target_h,
                min_width=self.target_w,
                border_mode=cv2.BORDER_CONSTANT,
                fill=(114, 114, 114),
            ),
        ]
    else:
        resize = [A.Resize(self.target_h, self.target_w)]
    norm = [
        A.Normalize(mean=self.norm[0], std=self.norm[1]),
        ToTensorV2(),
    ]

    if self.mode == "train":
        augs = [
            A.RandomBrightnessContrast(p=cfg.train.augs.brightness),
            A.RandomGamma(p=cfg.train.augs.gamma),
            A.Blur(p=cfg.train.augs.blur),
            A.GaussNoise(p=cfg.train.augs.noise, std_range=(0.1, 0.2)),
            A.ToGray(p=cfg.train.augs.to_gray),
            A.Affine(
                rotate=[90, 90],
                p=cfg.train.augs.rotate_90,
                fit_output=True,
            ),
            A.HorizontalFlip(p=cfg.train.augs.left_right_flip),
            A.VerticalFlip(p=cfg.train.augs.up_down_flip),
        ]
        # bbox_params makes Albumentations transform the boxes together with the image
        self.transform = A.Compose(
            augs + resize + norm,
            bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
        )
    elif self.mode in ["val", "test", "bench"]:
        # No augmentations outside of training, only resize and normalization
        self.mosaic_prob = 0
        self.transform = A.Compose(
            resize + norm,
            bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
        )
Second, there are a lot of interesting and non-trivial augmentations:
- Mosaic. The idea is simple: take several images (for example 4) and stack them together in a grid (2×2). Then apply some affine transforms and feed the result to the model.
- MixUp. Originally used in image classification (it’s surprising that it works). The idea: take two images and overlay them with some percentage of transparency. In classification, this usually means that if one image is 20% transparent and the second is 80%, the model should predict 80% for class 1 and 20% for class 2. In object detection, we just get more objects in one image.
- Cutout. Cutout involves removing parts of the image (by replacing them with black pixels) to help the model learn more robust features.
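As a rough illustration of the mosaic idea, the sketch below handles only the bounding-box bookkeeping for a 2×2 grid: each tile’s boxes are shifted by the tile’s offset on the combined canvas. It assumes all four images were already resized to the same tile size and skips the image stacking and the affine transform that follows, so it is a simplification rather than the pipeline’s actual implementation. The function name is hypothetical.

```python
def mosaic_boxes(tiles, tile_w, tile_h):
    """
    Simplified 2x2 mosaic. `tiles` is a list of 4 box lists (one per image),
    each box given as (x1, y1, x2, y2, class_id) in that tile's pixel coordinates.
    Returns all boxes in the coordinates of the (2 * tile_w) x (2 * tile_h) canvas.
    """
    if len(tiles) != 4:
        raise ValueError("a 2x2 mosaic needs exactly 4 tiles")
    merged = []
    for idx, boxes in enumerate(tiles):
        # Row-major placement: 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right
        off_x = (idx % 2) * tile_w
        off_y = (idx // 2) * tile_h
        for x1, y1, x2, y2, cls in boxes:
            merged.append((x1 + off_x, y1 + off_y, x2 + off_x, y2 + off_y, cls))
    return merged


tiles = [[(10, 10, 50, 50, 0)], [], [], [(0, 0, 20, 20, 1)]]
print(mosaic_boxes(tiles, tile_w=320, tile_h=320))
# [(10, 10, 50, 50, 0), (320, 320, 340, 340, 1)]
```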
I often see mosaic applied with probability 1.0 for the first ~90% of epochs. Then it’s usually turned off, and lighter augmentations are used. The same idea applies to MixUp, but I see it used a lot less (in the most popular detection framework, Ultralytics, it’s turned off by default; in another one, I see p=0.15). Cutout seems to be used less frequently.
You can read more about these augmentations in these two articles: 1, 2.
Results from just turning on mosaic are pretty good (the darker one, without mosaic, got mAP 0.89 vs 0.92 with mosaic, tested on a real dataset).
Letterbox or simple resize?
During training, you usually resize the input image to a square. Models often use 640×640 and are benchmarked on the COCO dataset. There are two main ways to get there:
- Simple resize to the target size.
- Letterbox: resize the longest side to the target size (e.g., 640), preserving the aspect ratio, and pad the shorter side to reach the target dimensions.


Both approaches have advantages and disadvantages. Let’s discuss them first, and then I’ll share the results of numerous experiments I ran comparing these approaches.
Simple resize:
- Compute goes to the whole image, with no useless padding.
- “Dynamic” aspect ratio may act as a form of regularization.
- Inference preprocessing perfectly matches training preprocessing (augmentations excluded).
- Kills real geometry. Resize distortion could affect the spatial relationships in the image, although it might be a human bias to think that a fixed aspect ratio is important.
Letterbox:
- Preserves the real aspect ratio.
- During inference, you can cut the padding and run on a non-square image, provided you don’t lose accuracy (some models can degrade).
- You can train on a bigger image size, then run inference with cut padding to get the same latency as with simple resize. For example, 640×640 vs 832×480: the second preserves the aspect ratio, and objects appear roughly the same size.
- Part of the compute is wasted on gray padding.
- Objects get smaller.
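The sizes mentioned above can be reproduced with a little arithmetic. The sketch below assumes a 1920×1080 source image and a model stride of 32 (so the padded sides must be multiples of 32); both are assumptions chosen for illustration, and the function name is hypothetical.

```python
import math


def letterbox_shapes(img_w, img_h, max_side=640, stride=32):
    """Return (resized, square_padded, min_padded) sizes, each as (width, height)."""
    scale = max_side / max(img_w, img_h)
    new_w, new_h = round(img_w * scale), round(img_h * scale)
    # Training-time letterbox: pad the shorter side up to a full square
    square = (max_side, max_side)
    # Inference-time cut padding: pad only up to the next multiple of the stride
    min_w = math.ceil(new_w / stride) * stride
    min_h = math.ceil(new_h / stride) * stride
    return (new_w, new_h), square, (min_w, min_h)


print(letterbox_shapes(1920, 1080, max_side=640))  # ((640, 360), (640, 640), (640, 384))
print(letterbox_shapes(1920, 1080, max_side=832))  # ((832, 468), (832, 832), (832, 480))
```

Under these assumptions, an 832×480 input has 399,360 pixels versus 409,600 for 640×640, which is why the two can end up with similar latency.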
How to test it and decide which one to use?
Train from scratch with these settings:
- Simple resize, 640×640
- Keep aspect ratio, max side 640, and add padding (as a baseline)
- Keep aspect ratio, larger image size (for example max side 832), and add padding
Then run inference with all 3 models. Where the aspect ratio is preserved, cut the padding during inference. Compare latency and metrics.
Example of the same image from above with cut padding (640 × 384):

Here is what happens when you preserve the ratio and run inference with the gray padding cut:
params            | F1 score | Latency (ms) |
------------------+----------+--------------|
ratio kept, 832   | 0.633    | 33.5         |
no ratio, 640x640 | 0.617    | 33.4         |
As shown, training with a preserved aspect ratio at a larger size (832) achieved a higher F1 score (0.633) than a simple 640×640 resize (0.617), while the latency remained similar. Note that some models may degrade if the padding is removed during inference, which defeats the whole purpose of this trick and probably of letterbox too.
What does this mean?
Training from scratch:
- With the same image size, simple resize gets better accuracy than letterbox.
- For letterbox, if you cut the padding during inference and your model doesn’t lose accuracy, you can train and run inference with a bigger image size to match the latency and get slightly higher metrics (as in the example above).
Training with pre-trained weights:
- If you fine-tune, use the same approach as the pre-trained model did; it should give you the best results if the datasets are not too different.
For D-FINE, I see lower metrics when cutting padding during inference. The model was also pre-trained with a simple resize. For YOLO, letterbox is usually a good choice.
Training
Every ML engineer should know how to implement a training loop. Although PyTorch does much of the heavy lifting, you might still feel overwhelmed by the number of design choices available. Here are some key components to consider:
- Optimizer: start with Adam/AdamW/SGD.
- Scheduler: a fixed LR can be OK for Adam-family optimizers, but take a look at StepLR, CosineAnnealingLR or OneCycleLR.
- EMA. This is a nice technique that makes training smoother and sometimes achieves higher metrics. After each batch, you update a secondary model (often called the EMA model) by computing an exponential moving average of the primary model’s weights.
- Batch accumulation is nice when your VRAM is very limited. Training a transformer-based object detection model means that in some cases, even with a mid-sized model, you can only fit 4 images into VRAM. By accumulating gradients over several batches before performing an optimizer step, you effectively simulate a larger batch size without exceeding your memory constraints. Another use case: when you have a lot of negatives (images without target objects) in your dataset and a small batch size, training can become unstable. Batch accumulation can also help here.
- AMP uses half precision automatically where applicable. It reduces VRAM usage and makes training faster (if you have a GPU that supports it). I see 40% less VRAM usage and at least a 15% training speed increase.
- Grad clipping. Often, when you use AMP, training can become less stable. This can also happen with higher LRs. When your gradients are too big, training will fail. Gradient clipping ensures that the gradient norm never exceeds a certain value.
- Logging. Try Hydra for configs and something like Weights & Biases or ClearML for experiment tracking. Also, log everything locally. Save your best weights and metrics, so after numerous experiments you can always find all the information on the model you need.
import time

import numpy as np
import torch
import wandb
from torch.amp import autocast
from tqdm import tqdm

# calculate_remaining_time, get_vram_usage, save_metrics and logger come from elsewhere in the project


def train(self) -> None:
    best_metric = 0
    cur_iter = 0
    ema_iter = 0
    one_epoch_time = None

    def optimizer_step(step_scheduler: bool):
        """
        Clip grads, optimizer step, scheduler step, zero grad, EMA model update
        """
        nonlocal ema_iter
        if self.amp_enabled:
            if self.clip_max_norm:
                # Gradients must be unscaled before clipping when a GradScaler is used
                self.scaler.unscale_(self.optimizer)
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip_max_norm)
            self.scaler.step(self.optimizer)
            self.scaler.update()
        else:
            if self.clip_max_norm:
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip_max_norm)
            self.optimizer.step()

        if step_scheduler:
            self.scheduler.step()
        self.optimizer.zero_grad()

        if self.ema_model:
            ema_iter += 1
            self.ema_model.update(ema_iter, self.model)

    for epoch in range(1, self.epochs + 1):
        epoch_start_time = time.time()
        self.model.train()
        self.loss_fn.train()
        losses = []

        with tqdm(self.train_loader, unit="batch") as tepoch:
            for batch_idx, (inputs, targets, _) in enumerate(tepoch):
                tepoch.set_description(f"Epoch {epoch}/{self.epochs}")
                if inputs is None:
                    continue
                cur_iter += 1

                inputs = inputs.to(self.device)
                targets = [
                    {
                        k: (v.to(self.device) if (v is not None and hasattr(v, "to")) else v)
                        for k, v in t.items()
                    }
                    for t in targets
                ]

                lr = self.optimizer.param_groups[0]["lr"]

                if self.amp_enabled:
                    # Forward pass in mixed precision, loss in full precision
                    with autocast(self.device, cache_enabled=True):
                        output = self.model(inputs, targets=targets)
                    with autocast(self.device, enabled=False):
                        loss_dict = self.loss_fn(output, targets)
                    # Divide by the accumulation steps so the summed gradients match one large batch
                    loss = sum(loss_dict.values()) / self.b_accum_steps
                    self.scaler.scale(loss).backward()
                else:
                    output = self.model(inputs, targets=targets)
                    loss_dict = self.loss_fn(output, targets)
                    loss = sum(loss_dict.values()) / self.b_accum_steps
                    loss.backward()

                if (batch_idx + 1) % self.b_accum_steps == 0:
                    optimizer_step(step_scheduler=True)

                losses.append(loss.item())

                tepoch.set_postfix(
                    loss=np.mean(losses) * self.b_accum_steps,
                    eta=calculate_remaining_time(
                        one_epoch_time,
                        epoch_start_time,
                        epoch,
                        self.epochs,
                        cur_iter,
                        len(self.train_loader),
                    ),
                    vram=f"{get_vram_usage()}%",
                )

        # Final update for any leftover gradients from an incomplete accumulation step
        if (batch_idx + 1) % self.b_accum_steps != 0:
            optimizer_step(step_scheduler=False)

        wandb.log({"lr": lr, "epoch": epoch})

        metrics = self.evaluate(
            val_loader=self.val_loader,
            conf_thresh=self.conf_thresh,
            iou_thresh=self.iou_thresh,
            path_to_save=None,
        )

        best_metric = self.save_model(metrics, best_metric)
        save_metrics(
            {}, metrics, np.mean(losses) * self.b_accum_steps, epoch, path_to_save=None
        )

        # Turn mosaic off for the last epochs
        if (
            epoch >= self.epochs - self.no_mosaic_epochs
            and self.train_loader.dataset.mosaic_prob
        ):
            self.train_loader.dataset.close_mosaic()

        if epoch == self.ignore_background_epochs:
            self.train_loader.dataset.ignore_background = False
            logger.info("Including background images")

        one_epoch_time = time.time() - epoch_start_time
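The loop above calls an EMA update with an iteration counter, but the EMA class itself is not shown. The sketch below illustrates the general idea on a plain dict of floats standing in for model weights, so it runs without PyTorch. The class name, the default decay and the exponential warm-up of the decay are assumptions (a commonly used variant), not necessarily what the pipeline implements.

```python
import math


class SimpleEMA:
    """Exponential moving average over a dict of named float weights (toy stand-in for a model)."""

    def __init__(self, weights, decay=0.9999, warmup=2000):
        self.shadow = dict(weights)  # averaged copy, initialized from the current weights
        self.decay = decay
        self.warmup = warmup

    def update(self, step, weights):
        # Ramp the decay up from ~0 so the early, noisy weights are forgotten quickly
        d = self.decay * (1 - math.exp(-step / self.warmup))
        for name, value in weights.items():
            self.shadow[name] = d * self.shadow[name] + (1 - d) * value
        return d


ema = SimpleEMA({"w": 0.0}, decay=0.5, warmup=1)
ema.update(step=1000, weights={"w": 1.0})
print(ema.shadow["w"])  # 0.5
```

With a real model, the same update would be applied to every tensor in the state dict, and the averaged copy would be the one used for evaluation.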
Metrics
For object detection everyone uses mAP, and how we measure it is already standardized. Use pycocotools, faster-coco-eval or TorchMetrics for mAP. But mAP measures how good the model is overall, across all confidence levels. mAP0.5 means that the IoU threshold is 0.5 (everything lower is considered a wrong prediction). I personally don’t fully like this metric, as in production we always use a single confidence threshold. So why not set the threshold and then compute metrics? That’s why I also always calculate confusion matrices and, based on them, Precision, Recall, F1-score, and IoU.
But the matching logic can also be tricky. Here is what I use:
- 1 GT (ground truth) object = 1 predicted object, and it’s a TP if IoU > threshold. If there is no prediction for a GT object, it’s an FN. If there is no GT for a prediction, it’s an FP.
- 1 GT should be matched by a prediction only once. If there are 2 predictions for 1 GT, I count 1 TP and 1 FP.
- Class ids should also match. If the model predicts class_0 but the GT is class_1, it means FP += 1 and FN += 1.
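The three rules above can be sketched for a single image as follows. This is one possible greedy implementation (predictions sorted by confidence, each taking the best still-unmatched GT of the same class); the function names and the tie-breaking order are assumptions, and predictions are expected to be filtered by the confidence threshold beforehand.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(preds, gts, iou_thresh=0.5):
    """preds: list of (box, class_id, confidence); gts: list of (box, class_id). Returns (tp, fp, fn)."""
    matched = set()
    tp = fp = 0
    # Higher-confidence predictions get first pick of the GT boxes
    for box, cls, _ in sorted(preds, key=lambda p: p[2], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, (gt_box, gt_cls) in enumerate(gts):
            if idx in matched or gt_cls != cls:
                continue  # each GT is used once, and classes must agree
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou > iou_thresh:
            matched.add(best_idx)
            tp += 1
        else:
            fp += 1
    fn = len(gts) - len(matched)  # GT objects nobody matched
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gts = [((0, 0, 10, 10), 0), ((20, 20, 30, 30), 1)]
preds = [((0, 0, 10, 10), 0, 0.9), ((1, 1, 10, 10), 0, 0.8), ((20, 20, 30, 30), 0, 0.7)]
print(match_detections(preds, gts))  # (1, 2, 1)
```

In the example, the duplicate prediction on the first GT counts as an FP, and the wrong-class prediction on the second GT produces both an FP and an FN, matching the rules in the list.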
During training, I pick the best model based on the metrics that are relevant to the task. I typically consider the average of mAP50 and F1-score.
Model and loss
I haven’t discussed model architecture and loss function here. They usually go together, and you can choose any model you like and integrate it into your pipeline with everything from above. I did that with DAMO-YOLO and D-FINE, and the results were great.
Pick a suitable solution for your case
Many people use Ultralytics, but it is licensed under AGPL-3.0, and you can’t use it in commercial projects unless your code is open source. So people often look into Apache 2.0 and MIT licensed models. Check out D-FINE, RT-DETRv2 or some YOLO models like YOLOv9.
What if you want to customize something in the pipeline? When you build everything from scratch, you have full control. Otherwise, try choosing a project with a smaller codebase, as a large one can make it difficult to isolate and modify individual components.
If you don’t need anything custom and your usage is allowed by the Ultralytics license, it’s a great repo to use, as it supports multiple tasks (classification, detection, instance segmentation, key points, oriented bounding boxes), and its models are efficient and achieve good scores. To reiterate once more, you probably don’t need a custom training pipeline if you are not doing very specific things.
Experiments
Let me share some results I got with a custom training pipeline with the D-FINE model and compare them to the Ultralytics YOLO11 model on the VisDrone-DET2019 dataset.
Trained from scratch:
model               | mAP 0.50 | F1-score | Latency (ms) |
--------------------+----------+----------+--------------|
YOLO11m TRT         | 0.417    | 0.568    | 15.6         |
YOLO11m TRT dynamic | -        | 0.568    | 13.3         |
YOLO11m OV          | -        | 0.568    | 122.4        |
D-FINEm TRT         | 0.457    | 0.622    | 16.6         |
D-FINEm OV          | 0.457    | 0.622    | 115.3        |
From COCO pre-trained:
model   | mAP 0.50 | F1-score |
--------+----------+----------|
YOLO11m | 0.456    | 0.600    |
D-FINEm | 0.506    | 0.649    |
Latency was measured on an RTX 3060 with TensorRT (TRT), static image size 640×640, including the time for cv2.imread.
OpenVINO (OV) latency was measured on an i5 14000f (no iGPU). “Dynamic” means that during inference, gray padding is cut for faster inference. It worked with the YOLO11 TensorRT version. More details about cutting gray padding are above (in the “Letterbox or simple resize?” section).
One disappointing result is the latency on an Intel N100 CPU with iGPU ($150 mini PC):
model   | Latency (ms) |
--------+--------------|
YOLO11m | 188          |
D-FINEm | 272          |
D-FINEs | 11           |

Here, traditional convolutional neural networks are noticeably faster, maybe because of optimizations in OpenVINO for GPUs.
Overall, I conducted over 30 experiments with different datasets (including real-world datasets), models, and parameters, and I can say that D-FINE gets better metrics. That makes sense, as on COCO it also scores higher than all YOLO models.

VisDrone experiments:


Example of D-FINE model predictions (green = GT, blue = pred):

Final results
Knowing all the details, let’s see a final comparison with the best settings for both models on i12400F and RTX 3060 with the VisDrone dataset:
model               | F1-score | Latency (ms) |
--------------------+----------+--------------|
YOLO11m TRT dynamic | 0.600    | 13.3         |
YOLO11m OV          | 0.600    | 122.4        |
D-FINEs TRT         | 0.629    | 12.3         |
D-FINEs OV          | 0.629    | 57.4         |
As shown above, I was able to use a smaller D-FINE model and achieve both faster inference and higher accuracy than YOLO11. Beating Ultralytics, the most widely used real-time object detection model, in both speed and accuracy is quite an accomplishment, isn’t it? The same pattern is observed across several other real-world datasets.
I also tried out YOLOv12, which came out while I was writing this article. It performed similarly to YOLO11 and even achieved slightly lower metrics (mAP 0.452 vs 0.456 for YOLO11). It appears that YOLO models have been hitting a wall for the last couple of years. D-FINE was a great update for object detection models.
Finally, let’s see visually the difference between YOLO11m and D-FINEs. YOLO11m, conf 0.25, NMS IoU 0.5, latency 13.3 ms:

D-FINEs, conf 0.5, no NMS, latency 12.3 ms:

Both Precision and Recall are higher with the D-FINE model. And it’s also faster. Here is also the “m” version of D-FINE:

Isn’t it crazy that even that one car on the left was detected?
Attention to data preprocessing
This part goes a little outside the scope of the article, but I want to at least briefly mention it, as some parts can be automated and used in the pipeline. What I definitely see as a Computer Vision engineer is that when engineers don’t spend time working with the data, they don’t get good models. You can have all the SoTA models and everything done right, but garbage in, garbage out. So I always pay a ton of attention to how to approach the task and how to gather, filter, validate, and annotate the data. Don’t assume that the annotation team will do everything right. Get your hands dirty and manually check some portion of the dataset to make sure that annotations are good and the collected images are representative.
A few quick ideas to look into:
- Remove duplicates and near duplicates from val/test sets. The model should not be validated on one sample twice, and you definitely don’t want a data leak from having two identical images, one in the training set and one in the validation set.
- Check how small your objects can be. Anything not visible to your eye should not be annotated. Also, remember that augmentations will make objects appear even smaller (for example, mosaic or zoom out). Configure these augmentations accordingly so you won’t end up with unusably small objects in the image.
- If you already have a model for a certain task and need more data, try using your model to pre-annotate new images. Check cases where the model fails and gather more similar cases.
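For the first idea, the simplest automated check is to look for byte-identical files shared between the training and validation sets. The sketch below does only that, using file hashes; the function names are hypothetical, and near duplicates (resized, re-encoded or slightly cropped copies) would need something like perceptual hashing or embedding similarity, which is not shown here.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 of the raw file bytes; identical files give identical digests."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_exact_leaks(train_paths, val_paths):
    """Return the validation files whose bytes also appear in the training set."""
    train_digests = {file_digest(p) for p in train_paths}
    return [p for p in val_paths if file_digest(p) in train_digests]
```

Running this over the paths listed in the split CSVs before training would be a cheap way to catch the most obvious leaks.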
Where to start
I worked a lot on this pipeline, and I am ready to share it with everyone who wants to try it out. It uses the SoTA D-FINE model under the hood and adds some features that were absent in the original repo (mosaic augmentations, batch accumulation, scheduler, more metrics, visualization of preprocessed images and eval predictions, exporting and inference code, better logging, and a unified and simplified configuration file).
Here is the link to my repo. Here is the original D-FINE repo, where I also contribute. If you need any help, please contact me on LinkedIn. Thanks for your time!
Citations and acknowledgments
@article{zhu2021detection,
  title={Detection and tracking meet drones challenge},
  author={Zhu, Pengfei and Wen, Longyin and Du, Dawei and Bian, Xiao and Fan, Heng and Hu, Qinghua and Ling, Haibin},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  volume={44},
  number={11},
  pages={7380--7399},
  year={2021},
  publisher={IEEE}
}
@misc{peng2024dfine,
  title={D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement},
  author={Yansong Peng and Hebei Li and Peixi Wu and Yueyi Zhang and Xiaoyan Sun and Feng Wu},
  year={2024},
  eprint={2410.13842},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}