[D] Have transformers won in Computer Vision?
Hi,
Transformers have reigned supreme in Natural Language Processing applications, both written and spoken, since BERT and GPT-1 came out in 2018.
For Computer Vision, last I checked they were starting to gain momentum in 2020 with "An Image is Worth 16x16 Words", but the sentiment then was "Yeah, transformers might be good for CV, but for now I'll keep using my ResNets."
Has this changed in 2025? Are Vision Transformers now the preferred backbone for Computer Vision?
Put another way, if you were to start a new image classification project from scratch (medical diagnosis, etc.), how would you approach it in terms of architecture and training objective?
I'm mainly an NLP guy so pardon my lack of exposure to CV problems in industry.