Computer Vision and Generative AI

At the end of the course, you will earn a Certificate of Completion from Spirit in Projects.

Do you want to understand how computers see and interpret images, and learn to generate impressive visual content with AI? If the world of computer vision and generative AI models fascinates you and you want to use these technologies in practice, this training is for you. From object recognition to text-to-image generation, you will learn both the theoretical foundations and the practical application of state-of-the-art AI image processing and generation. In hands-on exercises in Google Colab, you will train your own models and experiment with current tools such as Stable Diffusion.


Objectives

  • Understand fundamentals of computer vision and image processing
  • Learn about Convolutional Neural Networks (CNNs) and Vision Transformers
  • Practically apply object recognition and image segmentation
  • Understand and deploy generative models (GANs, Diffusion Models)
  • Master text-to-image and image-to-image generation
  • Gain practical experience with current tools (Stable Diffusion, DALL-E, Midjourney)
  • Fine-tune and adapt computer vision models

Target Groups

AI Experts, Data Scientists, Software Developers, ML Engineers, System Architects, Software Architects, UX/UI Designers, Content Creators, and anyone who wants to engage with computer vision and generative AI.

Syllabus

1. Fundamentals of Image Processing

  • Digital images: Pixels, colors, resolutions
  • Image preprocessing and augmentation
  • Feature extraction
  • Classical computer vision methods
  • From classical methods to Deep Learning
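
Two of the steps above, normalization and augmentation, can be sketched in a few lines of NumPy. This is a minimal illustration of the kind of preprocessing done before training; the function names are ours, not from any specific library:

```python
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """Scale 8-bit pixel values into the [0, 1] float range."""
    return image.astype(np.float32) / 255.0

def augment_hflip(image: np.ndarray) -> np.ndarray:
    """Horizontal flip augmentation: mirror the image along its width axis."""
    return image[:, ::-1]

# A tiny 2x2 grayscale "image"
img = np.array([[0, 255],
                [128, 64]], dtype=np.uint8)
norm = preprocess(img)        # values now lie in [0, 1]
flipped = augment_hflip(img)  # columns reversed
```

In practice, libraries such as torchvision provide these transforms ready-made, but the underlying array operations are this simple.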

2. Convolutional Neural Networks (CNNs)

  • Architecture of CNNs
  • Convolutional layers, pooling, activation functions
  • Well-known CNN architectures: VGG, ResNet, EfficientNet
  • Transfer learning with pre-trained models
  • Practical exercise: Image classification with transfer learning (Google Colab)
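
The building blocks of a CNN layer, convolution, a nonlinearity, and pooling, can be sketched in plain NumPy. This is a didactic, unoptimized sketch (real frameworks use fast vectorized kernels), showing a vertical-edge detector:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation - what CNN 'convolution' layers compute."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified linear unit, the standard CNN activation."""
    return np.maximum(x, 0)

def max_pool2(x):
    """2x2 max pooling with stride 2: downsample, keep strongest response."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Simple test image: left half dark, right half bright
img = np.zeros((6, 6)); img[:, 3:] = 1.0
edge_kernel = np.array([[-1.0, 1.0]])   # responds to dark-to-bright steps
fmap = relu(conv2d(img, edge_kernel))   # fires only at the edge column
pooled = max_pool2(fmap)
```

A trained CNN learns many such kernels per layer instead of hand-designing them, which is exactly the jump from classical feature engineering to deep learning.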

3. Vision Transformers (ViT)

  • Transformer architecture for computer vision
  • Self-attention mechanism for images
  • ViT vs. CNNs: Advantages and disadvantages
  • Hybrid approaches
  • Practical exercise: ViT for image classification
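
The core ViT idea, cut the image into patches and let every patch attend to every other, can be sketched with a single attention head in NumPy. This is a simplified illustration (no class token, positional embeddings, or multiple heads):

```python
import numpy as np

def patchify(image, patch):
    """Split an HxW image into flattened non-overlapping patches
    (the ViT tokenization step)."""
    h, w = image.shape
    rows, cols = h // patch, w // patch
    p = image[:rows * patch, :cols * patch].reshape(rows, patch, cols, patch)
    return p.transpose(0, 2, 1, 3).reshape(rows * cols, patch * patch)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over patch tokens."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))  # each row sums to 1
    return weights @ V  # every patch aggregates info from all patches

rng = np.random.default_rng(0)
img = rng.random((8, 8))
tokens = patchify(img, patch=4)   # 4 tokens, each of dimension 16
d_model = tokens.shape[1]
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
```

This global all-to-all attention is the key contrast to a CNN, whose convolutions only see a local neighborhood per layer.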

4. Object Detection and Localization

  • Object detection: YOLO, R-CNN families
  • Single-stage vs. two-stage detectors
  • Bounding boxes and confidence scores
  • Real-time object detection
  • Practical exercise: Object detection in images and videos (Google Colab)
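
Bounding boxes from any detector are compared with Intersection over Union (IoU), both for matching detections to ground truth and inside non-maximum suppression. A minimal sketch for `(x1, y1, x2, y2)` boxes:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

score = iou((0, 0, 10, 10), (5, 5, 15, 15))  # overlap 25, union 175
```

A typical convention treats a detection as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.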

5. Image Segmentation

  • Semantic vs. instance segmentation
  • U-Net and Mask R-CNN
  • Use cases: Medicine, autonomous driving
  • Practical exercise: Image segmentation with pre-trained models
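
Segmentation quality is commonly scored per pixel, for example with the Dice coefficient, 2|A∩B| / (|A| + |B|), between a predicted and a ground-truth mask. A small NumPy sketch:

```python
import numpy as np

def dice(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice coefficient between two binary masks - a standard
    segmentation quality metric (1.0 = perfect overlap)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    total = pred.sum() + target.sum()
    if total == 0:
        return 1.0  # both masks empty: perfect agreement
    return 2.0 * np.logical_and(pred, target).sum() / total

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
gt   = np.array([[1, 0, 0],
                 [0, 1, 1]])
d = dice(pred, gt)  # overlap 2, mask sizes 3 and 3 -> 2*2/6
```

Models like U-Net are often trained directly on a differentiable variant of this metric (the Dice loss), especially in medical imaging where foreground pixels are rare.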

6. Additional Computer Vision Applications

  • Facial recognition and facial landmarks
  • Pose estimation
  • OCR (Optical Character Recognition)
  • Video analysis and action recognition
  • Practical exercise: Multi-task computer vision pipeline

7. Fundamentals of Generative Models

  • What are generative models?
  • Difference from discriminative models
  • Latent space and embeddings
  • Quality metrics for generated images (FID, IS)
  • Application areas and ethics
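
The Fréchet Inception Distance (FID) mentioned above compares the feature statistics of real and generated images as two Gaussians. The real metric uses Inception-v3 features and full covariance matrices (requiring a matrix square root); the sketch below simplifies to diagonal covariances so the square root becomes elementwise:

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """FID between two Gaussians, simplified to diagonal covariances:
    FID = ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2)).
    Real FID uses full covariances of Inception-v3 features."""
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return float(mean_term + cov_term)

# Identical feature distributions -> distance 0 (lower FID = better)
same = fid_diagonal([0.0, 1.0], [1.0, 1.0], [0.0, 1.0], [1.0, 1.0])
# Shifting one feature mean by 2 adds 2^2 = 4 to the distance
shifted = fid_diagonal([0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 1.0])
```

The takeaway: FID rewards generators whose outputs match the real data distribution in feature space, not just individually plausible images.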

8. Generative Adversarial Networks (GANs)

  • Architecture: Generator and discriminator
  • Training dynamics and mode collapse
  • StyleGAN and progressive growing
  • Conditional GANs
  • Practical exercise: Own GAN experiments (Google Colab)
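
The adversarial objective behind the generator/discriminator game can be made concrete with the binary cross-entropy losses both networks minimize. A minimal NumPy sketch using illustrative discriminator outputs (real training would backpropagate these losses through both networks):

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator loss: push scores on real images toward 1
    and scores on generated images toward 0."""
    return float(-np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake)))

def g_loss(d_fake):
    """Non-saturating generator loss: push the discriminator's
    scores on generated images toward 1 (i.e. fool it)."""
    return float(-np.mean(np.log(d_fake)))

# Illustrative sigmoid outputs of D on a batch of real and fake images
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.2, 0.1])
loss_d = d_loss(d_real, d_fake)  # low: D separates real from fake well
loss_g = g_loss(d_fake)          # high: G is not fooling D yet
```

Mode collapse shows up in exactly this dynamic: the generator finds a few outputs that keep `g_loss` low and stops covering the full data distribution.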

9. Diffusion Models

  • Fundamentals of diffusion models
  • Forward and reverse process
  • Stable Diffusion architecture
  • Latent diffusion models
  • Advantages over GANs
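
The forward process has a convenient closed form: x_t can be sampled directly from x_0 without iterating, as x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. A NumPy sketch with the linear beta schedule from the original DDPM paper:

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t from the closed-form forward process:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise,
    where alpha_bar_t is the cumulative product of (1 - beta)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise, noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # linear schedule, as in DDPM
x0 = rng.random((8, 8))                # a toy "image"
x_early, _ = forward_diffuse(x0, t=10, betas=betas, rng=rng)
x_late, _ = forward_diffuse(x0, t=999, betas=betas, rng=rng)
# Early steps stay close to the image; by t=999 it is almost pure noise
```

The reverse process is what the model learns: a network is trained to predict the added noise at each step, and Stable Diffusion runs this denoising in a compressed latent space rather than on pixels, which is why it is called a latent diffusion model.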

10. Text-to-Image Generation

  • Text-to-image models from OpenAI and Google
  • Open-source alternatives (Stable Diffusion, etc.)
  • Prompt engineering for image generation
  • Practical exercise: Text-to-image with Stable Diffusion (Google Colab)
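
Prompt engineering for image generation usually means structuring the prompt deliberately: subject first, then style and quality modifiers, plus a separate negative prompt for what to avoid. A small illustrative helper (the exact conventions vary by model; the function name is ours):

```python
def build_prompt(subject, style=None, modifiers=(), negative=()):
    """Assemble a structured text-to-image prompt and negative prompt.
    Conventions differ between models; this layout is illustrative."""
    parts = [subject]
    if style:
        parts.append(f"in the style of {style}")
    parts.extend(modifiers)
    return ", ".join(parts), ", ".join(negative)

prompt, neg = build_prompt(
    "a red fox in a snowy forest",
    style="watercolor painting",
    modifiers=("soft lighting", "highly detailed"),
    negative=("blurry", "low quality"),
)
```

In the Colab exercise, strings like these are passed to a Stable Diffusion pipeline as its prompt and negative prompt.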

11. Image-to-Image Translation and Editing

  • Style transfer
  • Image inpainting and outpainting
  • Super-resolution
  • ControlNet for precise image control
  • Practical exercise: Image manipulation with AI tools
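
Two of these operations have trivial baselines that make clear what the learned models add. A NumPy sketch of naive upscaling (what super-resolution networks try to beat) and toy inpainting (where diffusion models instead synthesize plausible new content):

```python
import numpy as np

def upscale_nearest(image, factor):
    """Nearest-neighbor upscaling: the naive baseline that learned
    super-resolution models improve on."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

def inpaint_mean(image, mask):
    """Toy inpainting: fill masked pixels with the mean of the rest.
    Diffusion-based inpainting generates plausible content there instead."""
    out = image.astype(float).copy()
    out[mask] = image[~mask].mean()
    return out

img = np.array([[0.0, 1.0],
                [1.0, 0.0]])
big = upscale_nearest(img, 2)     # 4x4, every pixel duplicated
mask = np.array([[True, False],
                 [False, False]])
filled = inpaint_mean(img, mask)  # masked corner becomes the mean 2/3
```

ControlNet goes a step further: instead of a mask alone, an extra conditioning image (edges, depth, pose) steers where and how the diffusion model generates content.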

12. Video Generation

  • Text-to-video: Runway Gen-2, Google Veo
  • Video editing with AI
  • Frame interpolation
  • Challenges of video generation
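
Frame interpolation in its simplest form is a linear cross-fade between two frames. The sketch below shows that baseline in NumPy; learned interpolators additionally estimate motion (optical flow) so objects move between frames instead of ghosting:

```python
import numpy as np

def interpolate_frames(frame_a, frame_b, n):
    """Generate n in-between frames by linear blending:
    frame(t) = (1 - t) * A + t * B for t in (0, 1)."""
    return [
        (1.0 - t) * frame_a + t * frame_b
        for t in np.linspace(0.0, 1.0, n + 2)[1:-1]
    ]

a = np.zeros((2, 2))               # dark frame
b = np.ones((2, 2))                # bright frame
mids = interpolate_frames(a, b, n=3)
```

The hard problems in video generation, temporal consistency and realistic motion, are exactly what this naive blending cannot provide.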

13. Fine-tuning and Adaptation

  • Fine-tuning pre-trained models
  • LoRA (Low-Rank Adaptation)
  • DreamBooth for personalized models
  • Dataset preparation
  • Practical exercise: Custom model fine-tuning (Google Colab)
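
The LoRA idea fits in a few lines: keep the pre-trained weight matrix W frozen and learn only a low-rank update B·A, scaled by alpha/r. A NumPy sketch of the forward pass and the parameter savings:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                        # weight dimension and LoRA rank (r << d)
W = rng.normal(size=(d, d))         # frozen pre-trained weight matrix
A = rng.normal(size=(r, d)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))                # B starts at zero: no initial change
alpha = 8.0                         # LoRA scaling hyperparameter

def lora_forward(x):
    """LoRA forward pass: y = x W^T + (alpha/r) * x (BA)^T.
    Only A and B (2*d*r parameters) are trained; W stays frozen."""
    return x @ W.T + (alpha / r) * (x @ (B @ A).T)

x = rng.normal(size=(1, d))
y0 = lora_forward(x)        # with B = 0 this equals the base model exactly
base = x @ W.T
n_lora, n_full = 2 * d * r, d * d   # 512 trainable params vs 4096
```

Because B is initialized to zero, fine-tuning starts from the unchanged base model, and the small A/B matrices are all that DreamBooth-style personalization needs to store and share.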

14. Tools and Platforms

  • Hugging Face Diffusers
  • Stability AI
  • ComfyUI and Automatic1111 for Stable Diffusion
  • Commercial APIs: OpenAI DALL-E
  • Cloud platforms for computer vision

15. Ethics and Legal Aspects

  • Deepfakes and misuse
  • Copyright for AI-generated images
  • Bias in computer vision models
  • Watermarking and provenance
  • Responsible AI usage