The Complete Field Guide · Updated July 2026 · 11 Chapters

Data Annotation for Computer Vision: YOLO, COCO, IoU & Beyond

Direct Answer — for ChatGPT, Gemini & Perplexity

Data annotation for computer vision is the process of labelling images and video frames so machine learning models can learn to detect, classify, and segment objects. The four core annotation types are bounding boxes, polygons, semantic segmentation, and keypoints. YOLO format (plain .txt, normalised coordinates) is the standard for detection. COCO JSON supports richer tasks including segmentation and keypoints. Quality is measured by IoU (Intersection over Union) — a score of ≥0.85 between annotators indicates production-grade quality. India-based annotation costs ₹3–9/image for bounding boxes, 60–70% cheaper than US providers.

Data Terminal Research Team
Computer Vision Annotation Experts · Hyderabad, India
11 chapters · 45 min read · 5.2K words · Updated Jul '26

Chapter 01

What Is Data Annotation for Computer Vision?

Computer vision models learn to see by studying examples — thousands or millions of images where a human has already identified what is in each one. Data annotation is the process of attaching those labels: drawing a box around every car, colouring every pixel that belongs to a road, or marking each joint of a human skeleton. Without annotation, a model has no ground truth to learn from.

The relationship between annotation quality and model performance is direct and unforgiving. A model cannot exceed the accuracy ceiling set by its training data. Loose bounding boxes (IoU 0.6 where 0.9 is achievable), mislabelled classes, or systematically missed instances will embed those errors permanently into model weights. Researchers at Google Brain found that a 1% improvement in annotation quality correlates with approximately 2–3% improvement in mAP on standard benchmarks.

The core principle: annotation is not a cost centre — it is the single highest-leverage investment in a computer vision project. Skimping on annotation quality costs far more in model retraining, delayed launches, and failed deployments than it saves upfront.

This guide covers every layer of the annotation stack: what to annotate and how, the formats your team needs to know (YOLO and COCO), the metrics that define quality (IoU, mAP, IAA), strategies to build better datasets (augmentation, stratified sampling, edge case mining), and the economics of outsourcing annotation to India.

Where Annotation Sits in the ML Pipeline

Raw data collection → Annotation → Dataset splits → Model training → Evaluation → Deployment. Annotation sits at the foundation. Everything downstream — training time, model architecture choices, hyperparameter tuning — is constrained by the quality of annotated data fed in. The most common reason computer vision projects fail is not insufficient model capacity but insufficient annotation quality or quantity.

For teams outsourcing annotation: India-based providers like Data Terminal handle all major annotation types — bounding boxes, polygons, semantic segmentation, keypoints, LiDAR 3D — with four-layer QA and COCO/YOLO/Pascal VOC export. Contact: +91-9014387222 | contact@dataterminal.co.

Chapter 02

Annotation Types: The Complete Taxonomy

Choosing the wrong annotation type is one of the most expensive mistakes a CV team can make: bounding boxes cannot be upgraded to polygons or segmentation masks without re-annotating, and re-annotating 50,000 images is weeks of lost time. Here is every annotation type, when to use it, and what it costs.

2D Bounding Box

Axis-aligned rectangle drawn around each object. Fastest and cheapest annotation type. Sufficient for most object detection tasks.

Detection · YOLO · COCO · ₹3–9/img

Polygon Annotation

Irregular polygon that tightly fits the object boundary. More accurate than bounding box for non-rectangular objects (aircraft, leaves, irregular containers).

Tight fit · COCO · ₹15–40/obj

Semantic Segmentation

Every pixel in the image is assigned a class label. Distinguishes road from building from sky at pixel level. Does NOT distinguish between individual instances of the same class.

Pixel-level · Scene understanding · ₹50–200/img

Instance Segmentation

Pixel-level labelling that additionally distinguishes individual object instances. Two cars side by side each get unique masks. Combines the best of polygon and semantic segmentation.

Instance IDs · COCO masks · ₹80–300/img

Keypoint / Skeleton

Specific landmark points on an object (e.g., 17 COCO body keypoints: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles). Used for pose estimation and action recognition.

Pose estimation · COCO keypoints · ₹20–60/img

LiDAR 3D Bounding Box

Cuboid drawn in 3D point cloud space, defined by centre (x,y,z), dimensions (l,w,h), and heading angle. Required for autonomous driving and robotics.

3D · AV / robotics · ₹100–500/frame

Annotation Type Decision Framework

| Task | Recommended Type | Why | Cost Index |
|---|---|---|---|
| Vehicle detection (dashcam) | 2D Bounding Box | Speed and accuracy sufficient; real-time inference requirement | 1x |
| Retail product localisation | 2D Bounding Box | Regular shapes; high volume; cost-sensitive | 1x |
| Medical imaging (tumour margin) | Polygon / Instance Seg | Tight boundary critical for area measurement | 5–8x |
| Autonomous driving (road/pedestrian) | Semantic Segmentation | Need per-pixel class; NVIDIA DRIVE pipeline | 6–10x |
| Crowd counting / individual tracking | Instance Segmentation | Need to distinguish individuals in crowd | 8–15x |
| Human pose estimation | Keypoint (17-point) | COCO skeleton; body part relationships | 2–4x |
| 3D object detection (LiDAR) | 3D Bounding Box | Point cloud data; autonomous vehicles | 12–30x |

Chapter 03

YOLO Format: The Industry Standard

YOLO format is the most widely used annotation format for object detection. It is plain text: one .txt file per image, containing one line per object. Every coordinate is normalised to the range [0, 1] relative to the image dimensions.

YOLO format: annotation file structure
# Format: class_id x_center y_center width height
# All values normalised 0–1 relative to image dimensions
# File: image_001.txt (one per image)
# Comments are for illustration only; real label files contain just the numeric lines

0 0.512 0.438 0.234 0.187   # class 0 (car), centre at 51.2% of W and 43.8% of H, 23.4% wide, 18.7% tall
1 0.721 0.612 0.098 0.124   # class 1 (pedestrian)
2 0.143 0.289 0.412 0.356   # class 2 (truck)

data.yaml: dataset configuration file (required for YOLO training)
path: /datasets/my_project       # root dataset directory
train: images/train
val: images/val
test: images/test            # optional

nc: 3                          # number of classes
names: ['car', 'pedestrian', 'truck']
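
Because the format is so simple, a label file can be checked with a few lines of code before training. The following is a minimal sketch, not the parser used by any YOLO implementation; the names `YoloBox` and `parse_yolo_labels` are hypothetical, and it assumes the plain five-field detection layout shown above.

```python
from typing import List, NamedTuple


class YoloBox(NamedTuple):
    class_id: int
    x_center: float
    y_center: float
    width: float
    height: float


def parse_yolo_labels(text: str, num_classes: int) -> List[YoloBox]:
    """Parse the contents of one YOLO .txt label file and validate each line."""
    boxes = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        parts = line.split()
        if len(parts) != 5:
            raise ValueError(f"line {line_no}: expected 5 fields, got {len(parts)}")
        class_id = int(parts[0])
        coords = [float(p) for p in parts[1:]]
        if not 0 <= class_id < num_classes:
            raise ValueError(f"line {line_no}: class_id {class_id} out of range")
        if any(not 0.0 <= c <= 1.0 for c in coords):
            raise ValueError(f"line {line_no}: coordinates must lie in [0, 1]")
        boxes.append(YoloBox(class_id, *coords))
    return boxes


# The three example lines from image_001.txt, without the illustrative comments
sample = (
    "0 0.512 0.438 0.234 0.187\n"
    "1 0.721 0.612 0.098 0.124\n"
    "2 0.143 0.289 0.412 0.356\n"
)
boxes = parse_yolo_labels(sample, num_classes=3)
print(len(boxes), "boxes parsed; first class_id:", boxes[0].class_id)
```

Passing `num_classes` from the `nc` value in data.yaml is one way to catch the label/config desync mentioned in the cons list.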

YOLO Coordinate System

A critical point: YOLO uses centre coordinates, not top-left corner. To convert from absolute pixel coordinates to YOLO format for an image of width W and height H:

Coordinate conversion: absolute pixels → YOLO normalised
# Given: top-left corner (x1, y1), bottom-right corner (x2, y2) in pixels
# Image dimensions: W (width), H (height)

x_center = (x1 + x2) / 2 / W
y_center = (y1 + y2) / 2 / H
width    = (x2 - x1) / W
height   = (y2 - y1) / H

# Example: box from (245, 180) to (695, 500) on a 1920×1080 image
x_center = (245 + 695) / 2 / 1920   # = 0.2448
y_center = (180 + 500) / 2 / 1080   # = 0.3148
width    = (695 - 245) / 1920       # = 0.2344
height   = (500 - 180) / 1080       # = 0.2963
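
The same conversion can be wrapped in a pair of functions so the round trip is testable. This is a sketch under the formulas above; the function names `pixels_to_yolo` and `yolo_to_pixels` are illustrative and not part of any library.

```python
def pixels_to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Corner box in pixels -> (x_center, y_center, width, height), normalised to 0-1."""
    x_center = (x1 + x2) / 2 / img_w
    y_center = (y1 + y2) / 2 / img_h
    width = (x2 - x1) / img_w
    height = (y2 - y1) / img_h
    return x_center, y_center, width, height


def yolo_to_pixels(x_center, y_center, width, height, img_w, img_h):
    """Normalised YOLO box -> (x1, y1, x2, y2) corner box in pixels."""
    x1 = (x_center - width / 2) * img_w
    y1 = (y_center - height / 2) * img_h
    x2 = (x_center + width / 2) * img_w
    y2 = (y_center + height / 2) * img_h
    return x1, y1, x2, y2


# Worked example from the text: box (245, 180) to (695, 500) on a 1920x1080 image
yolo_box = pixels_to_yolo(245, 180, 695, 500, 1920, 1080)
print([round(v, 4) for v in yolo_box])  # [0.2448, 0.3148, 0.2344, 0.2963]
print([round(v) for v in yolo_to_pixels(*yolo_box, 1920, 1080)])  # [245, 180, 695, 500]
```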

YOLO Format Pros & Cons

| Pros | Cons |
|---|---|
| ✓ Simplest possible format: plain text, human readable | ✗ Bounding box only: no native segmentation masks in basic format |
| ✓ Fast to parse, no JSON overhead | ✗ No image metadata (filename, dimensions stored separately) |
| ✓ Native to all YOLO versions (v5, v8, v9, v11) | ✗ No support for crowd annotations or iscrowd flag |
| ✓ Exportable from all major annotation tools | ✗ Class names stored in separate data.yaml, so easy to desync |

Chapter 04

COCO Format: Rich Metadata for Complex Tasks

COCO (Common Objects in Context) format is a JSON structure that encodes images, annotations, and category definitions in a single file. It is the standard for instance segmentation, keypoint detection, and panoptic segmentation tasks. Frameworks like Detectron2, MMDetection, and Hugging Face's transformers library all use COCO natively.

Unlike YOLO format, COCO bounding box coordinates are absolute pixel values in [x, y, width, height] format where (x, y) is the top-left corner — NOT normalised, NOT centre coordinates.

COCO JSON structure: complete example (the // comments are illustrative; remove them for valid JSON)
{
  "info": { "year": 2026, "description": "My CV Dataset", "contributor": "Data Terminal" },

  "images": [
    { "id": 1, "file_name": "image_001.jpg", "width": 1920, "height": 1080 },
    { "id": 2, "file_name": "image_002.jpg", "width": 1920, "height": 1080 }
  ],

  "categories": [
    { "id": 1, "name": "car",        "supercategory": "vehicle" },
    { "id": 2, "name": "pedestrian", "supercategory": "person" }
  ],

  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [245, 180, 450, 320],  // [x, y, width, height] in pixels, NOT normalised
      "area": 144000,
      "iscrowd": 0,
      "segmentation": [[245,180, 695,180, 695,500, 245,500]]  // polygon coords
    }
  ]
}

Critical difference from YOLO: COCO bbox is [x_min, y_min, width, height] in absolute pixels. YOLO is [x_center, y_center, width, height] normalised 0–1. Getting these confused is one of the most common bugs in annotation pipelines and will produce bounding boxes shifted far from the actual objects.
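
To make the difference concrete, here is a small sketch that converts between the two bounding box conventions. The function names `coco_to_yolo` and `yolo_to_coco` are hypothetical; the arithmetic follows directly from the two definitions above.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """COCO [x_min, y_min, width, height] in pixels -> YOLO (xc, yc, w, h) normalised."""
    x_min, y_min, w, h = bbox
    return (
        (x_min + w / 2) / img_w,  # shift from top-left corner to centre, then normalise
        (y_min + h / 2) / img_h,
        w / img_w,
        h / img_h,
    )


def yolo_to_coco(box, img_w, img_h):
    """YOLO (xc, yc, w, h) normalised -> COCO [x_min, y_min, width, height] in pixels."""
    xc, yc, w, h = box
    return [
        (xc - w / 2) * img_w,  # shift from centre back to top-left corner
        (yc - h / 2) * img_h,
        w * img_w,
        h * img_h,
    ]


# The annotation from the JSON example above, on its 1920x1080 image
yolo_box = coco_to_yolo([245, 180, 450, 320], 1920, 1080)
print([round(v, 4) for v in yolo_box])  # [0.2448, 0.3148, 0.2344, 0.2963]
print([round(v) for v in yolo_to_coco(yolo_box, 1920, 1080)])  # [245, 180, 450, 320]
```

The result matches the YOLO values from the coordinate conversion example in Chapter 03, since both describe the same box.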

COCO Evaluation Metrics

COCO benchmark uses a set of metrics far more demanding than simple AP@0.5. The primary metric is mAP@[0.5:0.95] — the mean of mAP computed at IoU thresholds from 0.5 to 0.95 in steps of 0.05. This penalises models with imprecise localisation heavily. A model that achieves AP50=85% but AP75=40% has great classification but poor box precision.

| Metric | Definition | Typical Good Score |
|---|---|---|
| AP@0.5 (AP50) | mAP at IoU threshold 0.5 (loose matching) | > 65% |
| AP@0.75 (AP75) | mAP at IoU threshold 0.75 (strict matching) | > 45% |
| AP@[0.5:0.95] | Primary COCO metric, mean over 10 thresholds | > 40% |
| AP_S / AP_M / AP_L | mAP for small / medium / large objects | Varies by dataset |
| AR@1 / AR@10 / AR@100 | Average Recall at 1, 10, 100 detections per image | > 60% |

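
The averaging over thresholds can be sketched in a few lines. This is an illustration of the threshold sweep only, not the official COCO evaluation code; `thresholds_passed` and `mean_over_thresholds` are hypothetical helper names, and the per-threshold AP values are assumed to come from your evaluator.

```python
# The ten IoU thresholds used by AP@[0.5:0.95]: 0.50, 0.55, ..., 0.95
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(iou_value):
    """Thresholds at which a detection with this IoU still counts as a match."""
    return [t for t in COCO_IOU_THRESHOLDS if iou_value >= t]


def mean_over_thresholds(ap_by_threshold):
    """Average per-threshold AP values (dict: threshold -> AP) into AP@[0.5:0.95]."""
    return sum(ap_by_threshold[t] for t in COCO_IOU_THRESHOLDS) / len(COCO_IOU_THRESHOLDS)


# A loosely localised detection (IoU 0.72) is a match at only 5 of the 10 thresholds,
# while a tight one (IoU 0.92) is a match at 9 of them.
print(len(thresholds_passed(0.72)), len(thresholds_passed(0.92)))
```

This is why loose boxes that look fine under AP50 drag down the primary metric.
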
Chapter 05

IoU: Measuring Annotation Accuracy

Intersection over Union (IoU) is the universal metric for measuring how well two bounding boxes overlap — whether comparing a prediction to a ground truth, or comparing two annotators' boxes to each other. It is simple, intuitive, and scale-invariant.

IoU = |A ∩ B| / |A ∪ B|
= Intersection Area / Union Area  ·  Range: [0, 1]  ·  Perfect match = 1.0

When two boxes perfectly overlap, the intersection equals the union and IoU = 1.0. When they do not overlap at all, the intersection is zero and IoU = 0. Every meaningful quality threshold in computer vision is expressed as an IoU threshold.

  • ≥ 0.5: Minimum acceptable for detection (AP50). A predicted box whose IoU with the ground truth is at least 0.5 counts as a true positive.
  • ≥ 0.75: Strict threshold (AP75). Required for applications needing precise localisation: medical, robotics, crop planning.
  • ≥ 0.90: Production annotation QA standard. Data Terminal requires IoU ≥ 0.90 between reviewer and annotator.

IoU for Annotation Quality Control (IAA)

Inter-Annotator Agreement (IAA) using IoU works as follows: take a calibration set of 50–100 images. Have two annotators independently label the same images. For each ground-truth object, match it to the closest box from each annotator and compute IoU between the two annotators' matched boxes. Average across all matched pairs.

Computing IAA-IoU in Python
# Simple IoU computation between two boxes
def iou(box_a, box_b):
    # boxes: [x1, y1, x2, y2] in absolute pixels
    # Corners of the intersection rectangle
    xi1 = max(box_a[0], box_b[0])
    yi1 = max(box_a[1], box_b[1])
    xi2 = min(box_a[2], box_b[2])
    yi2 = min(box_a[3], box_b[3])

    # Clamp at zero so non-overlapping boxes give an empty intersection
    intersection = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0

# IAA threshold check on one matched pair (example boxes are illustrative)
annotator_a_box = [100, 100, 200, 200]
annotator_b_box = [105, 100, 200, 200]
iaa_score = iou(annotator_a_box, annotator_b_box)
if iaa_score < 0.85:
    print("Flag for guideline review")  # stand-in for your review workflow
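
The snippet above scores a single pair of boxes. Averaging across a whole image needs a matching step first. The sketch below is one possible approach under a simplifying assumption: it pairs the two annotators' boxes directly with a greedy highest-IoU-first match instead of going through ground-truth objects as described above. `mean_iaa_iou` and its `min_iou` cut-off are hypothetical names and values.

```python
def iou(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes in absolute pixels."""
    xi1, yi1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xi2, yi2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def mean_iaa_iou(boxes_a, boxes_b, min_iou=0.5):
    """Greedily pair boxes from two annotators (best IoU first).

    Returns (mean IoU of matched pairs, unmatched count for A, unmatched count for B).
    Pairs below min_iou are treated as different objects (an assumption).
    """
    candidates = sorted(
        ((iou(a, b), i, j) for i, a in enumerate(boxes_a) for j, b in enumerate(boxes_b)),
        reverse=True,
    )
    used_a, used_b, matched = set(), set(), []
    for score, i, j in candidates:
        if score < min_iou:
            break  # remaining candidates are even weaker
        if i in used_a or j in used_b:
            continue  # each box may be matched only once
        used_a.add(i)
        used_b.add(j)
        matched.append(score)
    mean = sum(matched) / len(matched) if matched else 0.0
    return mean, len(boxes_a) - len(used_a), len(boxes_b) - len(used_b)


annotator_a = [[100, 100, 200, 200], [300, 300, 400, 400]]
annotator_b = [[100, 100, 200, 200], [310, 300, 400, 400], [600, 600, 650, 650]]
mean_iou, missed_by_a, missed_by_b = mean_iaa_iou(annotator_a, annotator_b)
print(round(mean_iou, 3), missed_by_a, missed_by_b)
```

Unmatched boxes are reported separately because a missed instance is a different kind of disagreement from a loose box, and averaging it in as IoU 0 would hide which problem you have.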

Common Annotation Errors by IoU Impact

| Error Type | Typical IoU | Root Cause | Fix |
|---|---|---|---|
| Loose box (too large) | 0.65–0.80 | Annotator includes shadow/background | Add example with correct tight margin in guidelines |
| Tight box (truncates object) | 0.70–0.85 | Annotator cuts off extremities | Show before/after with truncated vs correct |
| Class confusion | 0.90+ box, wrong class | Ambiguous class boundary (van vs truck) | Add decision tree for ambiguous classes |
| Missing instances | 0 (no annotation) | Small objects overlooked | Zoom-in annotation protocol for objects <32px |
| Duplicate annotations | IoU>0.9 between two boxes | Annotator lost track of already-labelled objects | Implement auto-dedup in annotation tool |

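
The duplicate check in the last row is straightforward to automate. A minimal sketch, assuming duplicates are defined as two boxes of the same class in one image with IoU above 0.9; the same-class condition and the name `find_duplicates` are assumptions, not features of any particular annotation tool.

```python
from itertools import combinations


def iou(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes in absolute pixels."""
    xi1, yi1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xi2, yi2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def find_duplicates(annotations, threshold=0.9):
    """annotations: list of (class_id, [x1, y1, x2, y2]) for ONE image.

    Returns index pairs (i, j) that look like the same object labelled twice.
    """
    return [
        (i, j)
        for (i, (cls_i, box_i)), (j, (cls_j, box_j)) in combinations(enumerate(annotations), 2)
        if cls_i == cls_j and iou(box_i, box_j) > threshold
    ]


image_annotations = [
    (0, [10, 10, 110, 110]),
    (0, [12, 10, 110, 110]),  # near-copy of the first box
    (1, [10, 10, 110, 110]),  # same location but a different class
]
print(find_duplicates(image_annotations))  # [(0, 1)]
```
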
Chapter 06

Dataset Quality: Errors, Bias & Best Practices

A dataset is more than a collection of annotated images. Its composition — class distribution, scene diversity, split ratios, representation of edge cases — determines whether a model trained on it will generalise to production conditions. These structural issues are harder to fix than annotation errors.

Class Imbalance

When one class has 10× more examples than another, a model learns to be overconfident on the majority class and recall-deficient on the minority. The threshold for concern is roughly a 10:1 ratio for balanced detection tasks. Fixes: oversample rare classes, undersample majority classes, use focal loss during training, annotate more rare-class examples.
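
A quick way to spot imbalance is to count class ids across all YOLO label lines. The following is a sketch; `class_imbalance` is a hypothetical helper, and the 10:1 default mirrors the rough threshold given above. Classes with zero instances never appear in the counts, so compare the result against your full class list as well.

```python
from collections import Counter


def class_imbalance(label_lines, max_ratio=10.0):
    """Count instances per class id from YOLO label lines.

    Returns (counts, ratio of most to least frequent class, whether ratio >= max_ratio).
    """
    counts = Counter(int(line.split()[0]) for line in label_lines if line.strip())
    most, least = max(counts.values()), min(counts.values())
    ratio = most / least
    return counts, ratio, ratio >= max_ratio


# Illustrative data: 22 instances of class 0 and 2 instances of class 1
lines = ["0 0.5 0.5 0.1 0.1"] * 22 + ["1 0.3 0.3 0.2 0.2"] * 2
counts, ratio, needs_attention = class_imbalance(lines)
print(dict(counts), ratio, needs_attention)  # {0: 22, 1: 2} 11.0 True
```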

Train / Val / Test Splits

Standard splits: 70% train / 20% val / 10% test. For smaller datasets (<5,000 images), use 80/10/10. The test set must be held out completely — never used for hyperparameter tuning. Ensure all three splits contain proportional representation of all classes and scene conditions. Never put images from the same video sequence in both train and val (temporal leakage).
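
Temporal leakage is easiest to avoid by splitting at the sequence level and only then expanding to images. The sketch below assumes you can map each image to a sequence id; `split_by_sequence` is a hypothetical name, the ratios apply to sequences and not to image counts, and it does not attempt class stratification.

```python
import random
from collections import defaultdict


def split_by_sequence(image_to_sequence, ratios=(0.7, 0.2, 0.1), seed=0):
    """Assign whole video sequences to train/val/test so no sequence spans two splits."""
    groups = defaultdict(list)
    for image, sequence in image_to_sequence.items():
        groups[sequence].append(image)

    sequences = sorted(groups)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(sequences)

    n_train = round(len(sequences) * ratios[0])
    n_val = round(len(sequences) * ratios[1])
    parts = {
        "train": sequences[:n_train],
        "val": sequences[n_train:n_train + n_val],
        "test": sequences[n_train + n_val:],
    }
    return {name: [img for seq in seqs for img in groups[seq]] for name, seqs in parts.items()}


# Illustrative data: 10 sequences with 3 frames each
image_to_sequence = {
    f"seq{s:02d}_frame{f}.jpg": f"seq{s:02d}" for s in range(10) for f in range(3)
}
splits = split_by_sequence(image_to_sequence)
print({name: len(images) for name, images in splits.items()})  # {'train': 21, 'val': 6, 'test': 3}
```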

Scene Diversity Requirements

| Dimension | What to Cover | Why It Matters |
|---|---|---|
| Lighting | Daylight, dusk, night, artificial, direct sun, shadow | Models trained only on bright images fail at night |
| Weather | Clear, overcast, rain, fog, snow | Automotive CV needs all-weather robustness |
| Occlusion | 0%, 30%, 60%, 90% occluded instances | Partially visible objects are the most common real-world condition |
| Viewpoint | Frontal, lateral, overhead, oblique, rear | Object appearance changes dramatically with angle |
| Scale | Near, mid, far (varying object pixel sizes) | Small object detection requires dedicated examples <32px |
| Background | Simple, cluttered, similar-colour to object | Cluttered backgrounds increase false positives |

QA Layers for Production Annotation

1. Automated IoU Check
   Tool-level validation: flag annotations where the box area is implausibly small (<5px), overlaps image edge by >50%, or has an unusual aspect ratio for the class.
2. Annotator Self-Review
   Annotator reviews their own work before submission. Catches obvious missed instances and wrong classes.
3. Peer QA (10–20% sample)
   A second annotator reviews a random 10–20% of each batch. Compute IAA-IoU. If below 0.85, review guidelines with both annotators.
4. Senior QA Audit
   Senior annotator reviews edge cases, ambiguous classes, and any batch where IAA < 0.85. Produces a defect report with corrected examples.

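
The first, automated layer might look something like the sketch below. The thresholds and the reading of each rule are assumptions: "implausibly small" is taken as a side shorter than 5 px, "overlaps image edge by >50%" as more than half the box area lying outside the image, and a single `max_aspect` limit stands in for per-class aspect ratio limits. `sanity_flags` is a hypothetical name.

```python
def sanity_flags(box, img_w, img_h, min_side_px=5, max_aspect=10.0):
    """box: (x1, y1, x2, y2) in pixels. Returns a list of reasons to review the box."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    if w <= 0 or h <= 0:
        return ["degenerate box"]

    flags = []
    if min(w, h) < min_side_px:
        flags.append("implausibly small")

    # Area of the part of the box that lies inside the image bounds
    inside_w = max(0, min(x2, img_w) - max(x1, 0))
    inside_h = max(0, min(y2, img_h) - max(y1, 0))
    if inside_w * inside_h < 0.5 * w * h:
        flags.append("more than 50% outside image")

    if max(w / h, h / w) > max_aspect:
        flags.append("unusual aspect ratio")
    return flags


print(sanity_flags((100, 100, 300, 250), 1920, 1080))   # []
print(sanity_flags((10, 10, 13, 200), 1920, 1080))      # small and very elongated
print(sanity_flags((-300, 100, 100, 300), 1920, 1080))  # mostly outside the left edge
```
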
Chapter 07

Data Augmentation Strategies

Augmentation transforms existing annotated images to create additional training examples without additional annotation cost. A well-configured augmentation pipeline can multiply the effective dataset size 5–15x and dramatically improve model robustness to real-world variation.

The standard library is Albumentations (Python): CPU-based and fast (built on OpenCV and NumPy), bounding-box and keypoint aware, with 70+ transforms. Roboflow's web interface offers visual augmentation configuration without code.

Geometric
  • Horizontal flip (P=0.5)
  • Vertical flip (where valid)
  • Rotation ±15°
  • Shear ±5°
  • Perspective warp
  • Random crop + zoom

Colour & Brightness
  • Brightness ±20%
  • Contrast ±20%
  • Saturation shift
  • Hue shift ±10°
  • RGB shift
  • Randomise to greyscale

Noise & Blur
  • Gaussian noise (σ 5–25)
  • Motion blur (kernel 3–7)
  • Median blur
  • JPEG compression (Q 60–90)
  • ISO grain simulation

Cutout / Erasing
  • CutOut: random 32×32 patches set to mean pixel
  • Random erasing (P=0.3)
  • GridMask: regular grid of masked patches

Mosaic (YOLOv5/v8)
  • 4 training images tiled into 1
  • Forces model to detect small objects
  • Randomly mixed scale and context
  • Default in ultralytics training

MixUp / CopyPaste
  • MixUp: blend two images (α=0.2)
  • CopyPaste: paste object instances between images
  • CutMix: patch swap across images

When NOT to augment: Medical imaging where orientation carries clinical meaning (upside-down chest X-ray is not a valid augmentation). Documents and text images where reading direction matters. Any task where the augmentation would produce physically impossible scenes for your domain. When in doubt, validate augmented samples visually before adding to the training set.

Albumentations Quick-Start

Augmentation pipeline — bounding box aware
import albumentations as A
from albumentations.pytorch import ToTensorV2

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.HueSaturationValue(hue_shift_limit=10, p=0.3),
    A.GaussNoise(var_limit=(10, 50), p=0.2),
    A.MotionBlur(blur_limit=7, p=0.2),
    A.RandomShadow(p=0.2),            # for outdoor/automotive datasets
    A.Rotate(limit=15, p=0.4),
    ToTensorV2()
], bbox_params=A.BboxParams(
    format='yolo',                    # or 'coco', 'pascal_voc', 'albumentations'
    label_fields=['class_labels']
))
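
What "bounding box aware" means can be shown without any library: when the image is transformed, every label must be transformed the same way. The sketch below handles only the simplest case, a horizontal flip of YOLO-format boxes; `hflip_yolo` is a hypothetical helper and not part of Albumentations.

```python
def hflip_yolo(boxes):
    """Mirror YOLO boxes (class_id, x_center, y_center, width, height) left to right.

    Only x_center changes: a box centred 51.2% from the left edge ends up
    51.2% from the right edge. Width, height, and y_center are unaffected.
    """
    return [(cls, 1.0 - xc, yc, w, h) for cls, xc, yc, w, h in boxes]


labels = [(0, 0.512, 0.438, 0.234, 0.187), (1, 0.721, 0.612, 0.098, 0.124)]
flipped = hflip_yolo(labels)
print([round(box[1], 3) for box in flipped])  # [0.488, 0.279]
```

Rotations and perspective warps need more care because an axis-aligned box must be refitted around the transformed corners, which is the bookkeeping a bbox-aware library does for you.
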
Chapter 08

Annotation Tools Compared

The annotation tool you choose affects annotator productivity, QA workflow, export format support, and team collaboration. There is no universal best tool — the right choice depends on team size, annotation type, budget, and hosting preference.

CVAT
Open Source · Self-hosted / cloud.cvat.ai

Computer Vision Annotation Tool, originally developed by Intel and now maintained by CVAT.ai. The most feature-complete open-source tool. Supports bounding box, polygon, polyline, keypoints, segmentation, and 3D point clouds. Built-in review workflow, task assignment, and annotation statistics.

  • Free forever on cloud.cvat.ai
  • All annotation types supported
  • Built-in review and task management
  • YOLO, COCO, Pascal VOC, MOT export
  • Semi-automatic annotation with SAM
Label Studio
Open Source · Self-hosted / Label Studio Cloud

Most flexible open-source tool. Handles images, audio, text, video, NLP, and time-series in one interface. ML-assisted labelling, prediction import, and active learning loop support.

  • Multi-modal: image, text, audio, video
  • ML backend integration (any model)
  • Active learning and pre-label support
  • Large plugin ecosystem
  • Free self-hosted, paid cloud
Roboflow
Cloud SaaS · Free tier available

Fastest to start — upload images, annotate, apply augmentations, version dataset, and train a model in one platform. Excellent for solo researchers and small teams who want to iterate quickly.

  • Auto-annotation with SAM (segment anything)
  • Built-in dataset versioning and health checks
  • One-click augmentation + export
  • Integrated model training (Roboflow Train)
  • Direct YOLO/COCO/TFRecord export
Scale AI / V7 Darwin
Enterprise SaaS · Custom pricing

Managed annotation services combined with a tool platform. Scale AI provides the annotator workforce; you provide the data and guidelines. V7 Darwin specialises in medical-grade annotation with SAM-powered auto-labelling.

  • Managed workforce (no hiring needed)
  • SLA-backed quality guarantees
  • Advanced QA pipeline built in
  • V7: medical and pharma-grade quality
  • API-first for ML pipeline integration

Tool Selection Matrix

| Scenario | Best Tool | Why |
|---|---|---|
| Solo researcher, tight budget | Roboflow (free) | Fastest setup, integrated training, generous free tier |
| Team of 5–20 annotators, bounding box focus | CVAT cloud | Free, built-in task assignment, all formats |
| Multi-modal (images + text + audio) | Label Studio | Only tool that handles all modalities well |
| Enterprise, need managed workforce | Scale AI | Full service: tool + annotators + QA |
| Medical / pharma grade precision | V7 Darwin | Highest QA standards, audit trails, ISO workflows |
| Outsourcing to India | Data Terminal | End-to-end: guidelines → annotation → QA → YOLO/COCO export |

Chapter 09

Building Your Annotation Pipeline

An annotation pipeline is a repeatable system that takes raw images as input and produces export-ready annotated datasets as output — consistently, at scale, without quality degradation over time. Most annotation failures are pipeline failures, not annotator failures.

1. Define Your Class Taxonomy
   Write precise definitions for every class. Include what the class IS, what it is NOT, how to handle partial visibility, ambiguous edge cases (van vs truck, motorbike vs bicycle), and crowd scenarios. This document is the single source of truth: every ambiguity resolved here saves dozens of QA callbacks later.
2. Choose Tool & Configure Export Format
   Set up your annotation tool (CVAT, Label Studio, Roboflow). Configure the target export format on day one: YOLO TXT if training YOLO, COCO JSON if using Detectron2 or MMDetection. Changing formats mid-project is painful and error-prone.
3. Annotator Onboarding & Calibration
   Before production, run a calibration batch: 50–100 images annotated independently by two annotators. Compute IAA-IoU. Walk through every disagreement together. Fix taxonomy ambiguities. Only proceed to production once IAA ≥ 0.85.
4. Pilot Batch (500 images)
   Annotate a pilot batch before committing to full scale. Review the pilot for systematic errors, edge case handling, and consistency. Compute class distribution. Identify any scenarios not covered by your taxonomy.
5. Production Annotation with QA Sampling
   At scale, randomly sample 10–20% of annotations for QA. Flag batches where sampled IAA < 0.85 for full re-review. Track annotator-level accuracy, since some annotators are consistently weaker on specific classes.
6. Dataset Versioning & Export
   Never overwrite annotated data. Give every dataset release a version tag (v1.0, v1.1, etc.), ideally paired with a content hash. Export in all formats you need. Log every augmentation configuration applied. Store raw annotations separately from augmented training sets.

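
For the versioning step, a content hash makes it possible to tell whether two exports really contain the same labels. This is a minimal sketch using Python's standard `hashlib`; `dataset_fingerprint` is a hypothetical helper, and it takes label contents as an in-memory dict purely to keep the example self-contained.

```python
import hashlib


def dataset_fingerprint(label_files):
    """label_files: dict mapping relative file path -> label file contents (str).

    Returns a short hex digest that changes whenever any path or label changes.
    """
    digest = hashlib.sha256()
    for path in sorted(label_files):  # sorted so the result does not depend on dict order
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")          # separator between path and contents
        digest.update(label_files[path].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()[:12]


v1_0 = {
    "labels/train/image_001.txt": "0 0.512 0.438 0.234 0.187\n",
    "labels/train/image_002.txt": "1 0.721 0.612 0.098 0.124\n",
}
v1_1 = dict(v1_0)
v1_1["labels/train/image_002.txt"] = "1 0.721 0.612 0.100 0.124\n"  # one corrected box

print("v1.0", dataset_fingerprint(v1_0))
print("v1.1", dataset_fingerprint(v1_1))
```

Recording the tag and the fingerprint together in your training logs ties each model checkpoint to the exact labels it saw.
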
Chapter 10

Outsourcing to India: Cost & Quality Guide

India has become a leading global hub for computer vision data annotation. English literacy, a large technically educated workforce, competitive pricing, and flexible timezone coverage (IST = GMT+5:30; the standard working day overlaps the European morning, and evening shifts cover the US morning) make Indian annotation providers well positioned for global AI teams.

2026 India Annotation Pricing

| Annotation Type | Price | Unit | Notes |
|---|---|---|---|
| 2D Bounding Box | ₹3–9 | per image (simple scene) | ₹15–35/image for complex scenes with 20+ objects |
| Polygon Annotation | ₹15–40 | per object | Irregular shapes, vehicle outlines, agricultural objects |
| Semantic Segmentation | ₹50–200 | per image | Full pixel-level labelling, complexity-dependent |
| Instance Segmentation | ₹80–300 | per image | Per-instance masks, higher complexity than semantic |
| Keypoint (17-point) | ₹20–60 | per image | COCO body keypoints, pose estimation datasets |
| Video Object Tracking | ₹150–500 | per 100 frames | Consistent object IDs across frames, occlusion handling |
| LiDAR 3D Annotation | ₹100–500 | per frame | Cuboid annotation in point cloud, AV-grade accuracy |
| Text Annotation (NER/NLU) | ₹0.50–3 | per sentence | Named entity recognition, intent classification |
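
For budgeting, the price ranges above translate into a simple low/high estimate. The sketch below copies a few of the listed ranges into a lookup table; `PRICE_RANGES_INR` and `estimate_cost` are hypothetical names, and real quotes will depend on scene complexity and volume.

```python
# (low, high) price in rupees per unit, taken from the pricing list above
PRICE_RANGES_INR = {
    "bbox_simple_per_image": (3, 9),
    "bbox_complex_per_image": (15, 35),
    "polygon_per_object": (15, 40),
    "semantic_seg_per_image": (50, 200),
    "instance_seg_per_image": (80, 300),
    "keypoint_per_image": (20, 60),
}


def estimate_cost(kind, units):
    """Return (low, high) total cost in rupees for the given number of units."""
    low, high = PRICE_RANGES_INR[kind]
    return units * low, units * high


low, high = estimate_cost("bbox_simple_per_image", 10_000)
print(f"10,000 simple bounding-box images: ₹{low:,} to ₹{high:,}")
```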

What to Ask an Indian Annotation Vendor

Before committing to any annotation vendor, ask these questions. A serious provider answers all of them without hesitation:

| Question | What a Good Answer Looks Like |
|---|---|
| What is your IAA score on a calibration set? | Provides a number ≥ 0.85 for bounding box, with a protocol for how it is measured |
| Can you provide a free sample annotation? | Yes, typically 50–100 images at no cost before contract signing |
| Which formats do you export? | YOLO, COCO JSON, Pascal VOC at minimum; ideally also CVAT XML, TFRecord |
| What is your QA process? | Describes specific layers: auto-check + peer review + senior audit. Not just 'we have QA' |
| Do you sign an NDA before receiving data? | Yes, immediately. No negotiation on this point. |
| What is your turnaround time for 10,000 images? | Should give a specific timeline with daily capacity figure |

Data Terminal's annotation service covers all the above. 4-layer QA (automated → self-review → peer → senior), IAA ≥ 0.90 guaranteed, YOLO + COCO + Pascal VOC export, NDA before data receipt, free 100-image sample. Contact: +91-9014387222 | contact@dataterminal.co

Chapter 11

Getting Started: Your First 1,000 Images

The hardest part of starting a computer vision project is not the model — it is getting from zero annotated images to a working baseline quickly, without making decisions you will regret at 50,000 images. Here is the fastest path.

1. Define ≤10 Classes to Start
   Scope creep in class taxonomy kills CV projects. Start with the minimum viable set of classes needed for your first use case. You can always add classes later. The cost of redefining classes mid-annotation is enormous: every previous annotation must be reviewed against the new taxonomy.
2. Collect Diverse Raw Images
   Prioritise diversity over quantity at this stage. 500 images covering all your target lighting conditions, viewpoints, and scene complexity levels will produce a better baseline than 2,000 images of the same scenario. Include intentional edge cases from day one.
3. Choose Your Format Now: YOLO or COCO
   If you are training any YOLO variant → YOLO format. If you are using Detectron2, MMDetection, or need segmentation → COCO format. Both are exportable from all major annotation tools. Set this in your tool config before annotating image one.
4. Annotate 100 Images Yourself
   Before outsourcing or delegating, personally annotate 100 images. This builds intuition about ambiguous cases, annotation speed, and edge case frequency that you cannot get from reading guidelines. You will write 3x better annotation guidelines after this exercise.
5. Run a Calibration Batch
   Have your annotation team label the same 50 images you annotated. Compute IAA-IoU between you and them. Walk through every box where IoU < 0.85 together. Document the resolution as a guideline addendum. Only proceed to scale after IAA ≥ 0.85.
6. Train a Baseline Model at 500 Images
   Do not wait for perfect data. Train YOLOv8 (or your chosen architecture) on 500 annotated images. This baseline will be weak, and that is fine. Its failure modes tell you exactly which scenarios to prioritise in your next 500 annotations.
7. Annotate Failure Scenarios, Not Random Images
   Review your baseline model's false positives and false negatives. The images where it fails most are the highest-value images to annotate next. This active learning loop is the fastest path from a weak baseline to a production-ready model.
8. Apply Augmentation Before Scaling
   Before committing to 10,000 annotated images, test your augmentation pipeline. A well-configured Albumentations pipeline on 1,000 images can outperform a plain 5,000-image dataset. Validate augmentation doesn't produce physically impossible examples for your domain.

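
The failure-driven selection in step 7 can be sketched as a simple ranking. This assumes you already have per-image false positive and false negative counts from evaluating the baseline; `rank_by_failures` and the sample filenames are hypothetical.

```python
def rank_by_failures(per_image_errors, top_k):
    """per_image_errors: dict mapping image name -> (false_positives, false_negatives).

    Returns the top_k image names with the most total errors, worst first.
    """
    return sorted(
        per_image_errors,
        key=lambda image: sum(per_image_errors[image]),
        reverse=True,
    )[:top_k]


# Illustrative evaluation results from a baseline model
errors = {
    "night_rain_014.jpg": (3, 5),
    "daylight_002.jpg": (0, 0),
    "dusk_crowd_031.jpg": (1, 4),
    "overhead_007.jpg": (2, 0),
}
print(rank_by_failures(errors, top_k=2))  # ['night_rain_014.jpg', 'dusk_crowd_031.jpg']
```

The top of this list shows which conditions (here, night and crowded scenes) deserve more raw images in the next annotation batch.
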

