
ViT (Vision Transformer)

  • Only the embedding step differs from the standard Transformer; the rest is essentially the same.

Embedding

  • Image: $(H, W, C)$
  • Patches: $(N, P^2 \cdot C)$, where $N = \frac{H \cdot W}{P^2}$ and $P$ is the patch size

  • Note that, to stay consistent with BERT, the paper prepends an extra <class> token.

Example

For example, a $(224, 224)$ image with patch size $(16, 16)$ is embedded into a $(196, 768)$ sequence, where $196 = \frac{224 \times 224}{16 \times 16}$ and $768 = 16 \times 16 \times 3$; prepending the <class> token makes it $(197, 768)$, which is then passed through the model.
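A minimal sketch of this patch embedding in PyTorch (the class name and the conv-as-projection trick are illustrative, not the paper's code; hyperparameters follow the ViT-Base example above):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 196
        # A conv with kernel = stride = patch_size is equivalent to cutting
        # the image into non-overlapping patches and applying one shared
        # linear projection to each flattened patch.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, 196, 768)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1)       # (B, 197, 768)

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 197, 768])
```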

The rest of the architecture matches the standard Transformer

  • Not covered further here

Property analysis

Lack of inductive bias

  • Inductive bias: in the paper this refers to properties such as translation invariance and translation equivariance. These are inductive properties specific to CNNs and can be understood as a kind of prior knowledge that helps the model learn more effectively (a toy demonstration follows the quote below).

Inductive bias. We note that Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is used very sparingly: in the beginning of the model by cutting the image into patches and at fine-tuning time for adjusting the position embeddings for images of different resolution (as described below). Other than that, the position embeddings at initialization time carry no information about the 2D positions of the patches and all spatial relations between the patches have to be learned from scratch.
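As a toy illustration of the translation equivariance the quote mentions (not code from the paper), one can check that shifting a conv layer's input shifts its output the same way; circular padding is assumed here so the check holds exactly at the borders:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Circular padding makes the conv exactly shift-equivariant; with zero
# padding the equality would break near the image border.
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1,
                 padding_mode='circular', bias=False)
x = torch.randn(1, 1, 8, 8)

shift = lambda t: torch.roll(t, shifts=2, dims=-1)  # shift 2 pixels right
print(torch.allclose(shift(conv(x)), conv(shift(x)), atol=1e-6))  # True
```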

The special nature of the position embeddings

  • Images are 2D, so it is natural to ask whether the position embeddings also need to be 2D.
  • The paper shows experimentally that 1D position embeddings are sufficient: ViT can learn the spatial relations of image patches from them (see the sketch below).
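A minimal sketch of such a learned 1D position embedding, assuming PyTorch (names are illustrative): each of the 197 token positions gets its own learned vector, simply added to the token, with no 2D grid structure imposed:

```python
import torch
import torch.nn as nn

num_tokens, embed_dim = 197, 768   # 196 patches + 1 <class> token
# One learnable vector per token index; any 2D relations between
# patches must be learned from data, not built in.
pos_embed = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))
nn.init.trunc_normal_(pos_embed, std=0.02)

tokens = torch.randn(1, num_tokens, embed_dim)  # output of the patch embedding
tokens = tokens + pos_embed                     # broadcasts over the batch
print(tokens.shape)  # torch.Size([1, 197, 768])
```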

Dataset preferences

  • ViT underperforms CNNs on small datasets; the paper attributes this to the lack of inductive bias.
  • On large datasets, ViT outperforms CNNs.


References

@misc{dosovitskiyImageWorth16x162021,
  title = {An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author = {Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  year = {2021},
  eprint = {2010.11929},
  archiveprefix = {arXiv},
  primaryclass = {cs},
  publisher = {arXiv},
  doi = {10.48550/arXiv.2010.11929}
}
