
Paper Study (4)
ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (https://arxiv.org/abs/2010.11929)
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction wit..
CLIP: Learning Transferable Visual Models From Natural Language Supervision (https://arxiv.org/abs/2103.00020)
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data..
Diffusion Model: Denoising Diffusion Probabilistic Models (https://ar5iv.org/html/2006.11239)
We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by tr…
Abstract (Summary): diffusion probabilistic models..
StableDiffusion: High-Resolution Image Synthesis with Latent Diffusion Models (https://arxiv.org/abs/2112.10752)
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guid..