Meta AI released a new generation of improved convolutional networks, achieving state-of-the-art performance of 87.8% top-1 accuracy on ImageNet and outperforming Swin Transformers on the COCO object detection benchmark.
The new convolutional network, named ConvNeXt, uses the ResNet model, the previous convolutional state of the art, as its starting baseline; its new design and training approach are inspired by the Swin Transformer model. It is important to mention that Swin Transformer was the previous state-of-the-art deep-learning technique for image classification, built on a newer type of architecture called vision transformers.
Figure 1 - Source: A ConvNet for the 2020s
The first step toward better model training was adopting the recipe used by data-efficient image transformers (DeiT) and Swin Transformers, improving ResNet-50 accuracy from 76.1% to 78.8%.
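For reference, that modernized recipe can be summarized as a configuration sketch; the values below follow the paper's appendix, but the dictionary itself is illustrative rather than an object from the released code:

# Illustrative summary of the modernized training recipe (values from the
# paper's appendix); this dict is a sketch, not part of the released code.
modernized_recipe = {
    "epochs": 300,                # vs. 90 in the original ResNet recipe
    "optimizer": "AdamW",         # vs. SGD with momentum
    "base_lr": 4e-3,              # cosine schedule with linear warmup
    "weight_decay": 0.05,
    "augmentation": ["Mixup", "CutMix", "RandAugment", "RandomErasing"],
    "regularization": ["StochasticDepth", "LabelSmoothing"],
}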
A second strategy to increase accuracy was replacing ResNet's stem, a 7×7 convolution with stride 2, with a "patchify" stem, a 4×4 convolution with stride 4, which mirrors the non-overlapping patch embedding of vision transformers and lifted accuracy from 79.4% to 79.5%. (An intermediate adjustment to the stage compute ratio had already raised accuracy from 78.8% to 79.4%.)
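The stem swap is easy to express in PyTorch; a minimal sketch, where the channel widths follow the paper and the variable names are ours:

import torch
import torch.nn as nn

# Classic ResNet stem: overlapping 7x7 conv, stride 2 (a 3x3 max pool follows in ResNet).
resnet_stem = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
# "Patchify" stem: non-overlapping 4x4 conv, stride 4, like a vision transformer's patch embedding.
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)
print(resnet_stem(x).shape)    # torch.Size([1, 64, 112, 112])
print(patchify_stem(x).shape)  # torch.Size([1, 96, 56, 56])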
Another important improvement was adopting the ResNeXt design of grouped convolutions, pushed to the depthwise extreme and paired with a wider network, which brought performance to 80.5%.
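In code, the shift is a single argument; a sketch in which the 96-channel width follows the paper while the 3×3 kernel size is illustrative:

import torch.nn as nn

dim = 96
# ResNeXt-style grouped convolution (ResNeXt-50 uses 32 groups).
grouped_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=32)
# Depthwise convolution: one group per channel, the extreme that ConvNeXt adopts.
depthwise_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)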
The "final" upgrade to Resnet-50 was a redesigned residual-block inspired by the Swin Transform block.
The great advantage of this new model is its scalability: accuracy keeps increasing as more data and larger model variants are used. In addition, ConvNeXt proves that convolutional networks can still be optimized to achieve better results in image classification.
Figure 2 - Source: A ConvNet for the 2020s
The PyTorch implementation of the framework was released on GitHub, including model weights pre-trained on ImageNet-1K and ImageNet-22K, among others. For training on ImageNet-1K, use the following command:
python run_with_submitit.py --nodes 4 --ngpus 8 \
--model convnext_tiny --drop_path 0.1 \
--batch_size 128 --lr 4e-3 --update_freq 1 \
--model_ema true --model_ema_eval true \
--data_path /path/to/imagenet-1k \
--job_dir /path/to/save_results
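Note that the effective batch size is 128 per GPU across 4 × 8 GPUs, i.e. 4,096, which matches the 4e-3 learning rate from the paper's recipe. For quick experimentation without retraining, the pre-trained weights can also be loaded through third-party libraries; a minimal sketch, assuming a recent release of the timm library, which includes ConvNeXt variants:

import timm
import torch

# Load a pre-trained ConvNeXt-T and run a dummy image through it.
model = timm.create_model("convnext_tiny", pretrained=True)
model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000]) -- ImageNet-1K classes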
In addition to the code release, a web demo was developed on the Hugging Face platform. The demo allows you to input any image and generates a label for it.
The release of this new framework has received a lot of attention on social media, especially from researchers behind other state-of-the-art models, such as Lucas Beyer on Twitter:
Also on Twitter, a co-author of EfficientNet argues that similar performance can be achieved with EfficientNetV2: