The computer-vision rumor mill has crowned Vision Transformers (ViTs) “the next CNN-killer.”
But are they really poised to upend image processing, or did the marketing team merely sprinkle attention on our eyeballs? I trained a tiny ViT and a plain-vanilla CNN side by side on CIFAR-10 and kept notes.
- Pre-2012 (SIFT, HOG, LBP): engineers hand-designed filters for edges & corners.
- 2012–2020 (CNNs): layers learned features automatically; ImageNet glory followed.
- 2020+ (ViTs): slice an image into patches → treat each patch like a word token → run full-blown self-attention. Suddenly every patch “talks” to every other patch.
That’s the elevator pitch. Now let’s peek under the hood.
```python
import tensorflow as tf

# Extract 4×4 patches from 32×32 images (CIFAR-10)
patches = tf.image.extract_patches(
    images, sizes=[1, 4, 4, 1],
    strides=[1, 4, 4, 1], rates=[1, 1, 1, 1],
    padding='VALID')

# Flatten each patch, then project to the model dimension
num_patches = (32 // 4) ** 2      # 8 × 8 = 64 patches per image
patch_dims = patches.shape[-1]    # 4 × 4 × 3 = 48 values per patch
patches = tf.reshape(patches, [-1, num_patches, patch_dims])
```
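To connect that snippet back to the elevator pitch, here’s a minimal sketch of the remaining two steps: project each flattened patch into a token embedding, add a learned positional embedding, and run one round of self-attention. It reuses `patches` and `num_patches` from the block above; the hyperparameters (`d_model = 64`, 4 heads) are illustrative choices of mine, not anything the experiment pins down.

```python
# A minimal patch-embedding + self-attention sketch (Keras).
# d_model and num_heads are illustrative assumptions, not canon.
d_model = 64

# Project each flattened patch (48 values) to a d_model-dim token.
tokens = tf.keras.layers.Dense(d_model)(patches)        # [batch, 64, 64]

# Add a learned positional embedding so patch order isn't lost.
positions = tf.range(num_patches)                       # [64]
pos_embed = tf.keras.layers.Embedding(num_patches, d_model)(positions)
tokens = tokens + pos_embed                             # broadcasts over batch

# One round of full self-attention: every patch attends to every patch.
attended = tf.keras.layers.MultiHeadAttention(
    num_heads=4, key_dim=d_model // 4)(tokens, tokens)  # [batch, 64, 64]
```

A real ViT stacks several such blocks, each wrapped with LayerNorm, residual connections, and an MLP, but that single `MultiHeadAttention` call is where every patch gets to look at every other patch.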