
Solving Computer Vision Problems through Self-Supervision and Generative Image Synthesis

  • Date
    Tuesday, 7 May 2024, 15:00
  • Speaker
    Siva Karthik Mustikovela
  • Address

    Mathematikon
    Room B128

Computer vision models require large amounts of labeled training data, which is error-prone, time-consuming, and notoriously hard to acquire. It is especially difficult to obtain labels for fine-grained, geometry-based tasks such as object viewpoint estimation and geometry estimation. Obtaining large-scale object detection labels for changing operating domains is likewise time-consuming. Synthetic data is an alternative, but it exhibits a large domain gap to real-world images, which causes models trained on it to underperform on real images. On the other hand, it is relatively easy to mine large collections of unlabeled images of an object category from the internet. We seek to answer whether such unlabeled collections of in-the-wild images can be successfully utilized to train computer vision models purely via self-supervision. We propose methods to learn object viewpoint estimation, object detection, and controllable image generation and decomposition purely through self-supervision, using unlabeled images in an analysis-by-synthesis paradigm.

For object viewpoint estimation, we leverage a viewpoint-aware image synthesis network as a form of self-supervision for our viewpoint estimation network, coupling the two models through cycle consistency. Our method performs competitively with fully supervised methods on object categories such as faces, cars, buses, and trains. For self-supervised object detection, we leverage a generative model that provides control over the 3D location and orientation of the synthesized object, from which we also obtain its bounding box. The synthesized images and bounding boxes are then used to train the object detector. The resulting detection accuracies show that we outperform existing baselines considerably and surpass other synthetic-data-based detection methods.
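The sketch below illustrates the cycle-consistency coupling in a heavily simplified form: a toy viewpoint estimator and a toy viewpoint-conditioned synthesis network supervise each other through an image cycle and a viewpoint cycle. The three-angle viewpoint parameterization, the network shapes, and the equal loss weighting are assumptions made for brevity, not the models presented in the talk.

    import torch
    import torch.nn as nn

    # Toy viewpoint estimator: image -> 3 viewpoint angles (assumed parameterization).
    estimator = nn.Sequential(
        nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 3))

    # Toy viewpoint-aware synthesis network: viewpoint -> image of the object category.
    synthesizer = nn.Sequential(
        nn.Linear(3, 32 * 16 * 16), nn.Unflatten(1, (32, 16, 16)),
        nn.Upsample(scale_factor=4), nn.Conv2d(32, 3, 3, 1, 1), nn.Sigmoid())

    opt = torch.optim.Adam(
        list(estimator.parameters()) + list(synthesizer.parameters()), lr=1e-4)

    real = torch.rand(8, 3, 64, 64)          # stand-in for unlabeled real images
    sampled_view = torch.rand(8, 3) * 2 - 1  # randomly sampled viewpoints

    # Cycle 1: real image -> estimated viewpoint -> synthesized image (image consistency).
    loss_image = nn.functional.l1_loss(synthesizer(estimator(real)), real)

    # Cycle 2: sampled viewpoint -> synthesized image -> re-estimated viewpoint
    # (viewpoint consistency); neither cycle needs viewpoint labels.
    loss_view = nn.functional.mse_loss(estimator(synthesizer(sampled_view)), sampled_view)

    opt.zero_grad()
    (loss_image + loss_view).backward()
    opt.step()

The detection setting relies on the same kind of generative control: because the 3D placement of the synthesized object is known, its bounding box comes for free and can be paired with the synthesized image to train the detector.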

Finally, we propose a method to learn geometrically controlled image generation and decomposition using class-specific, unpaired real-world images and 3D CAD models. We jointly model the forward process of image generation and the inverse process of image decomposition, and we are able to generate highly realistic images with fine-grained control over shape, appearance, and reflections. Our results indicate that computer vision tasks can be learned through self-supervision and can achieve performance comparable to supervised methods or methods based on synthetic data.
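As a rough illustration of jointly modeling the forward and inverse processes, the toy code below splits the latent into hypothetical shape, appearance, and pose factors, reconstructs unpaired real images by decomposing and then re-generating them, and supervises only the pose slice on CAD renders, where the rendering pose is known. The factor sizes, networks, and losses are placeholders chosen for brevity, not the actual method from the talk.

    import torch
    import torch.nn as nn

    DIMS = {"shape": 16, "appearance": 16, "pose": 3}  # hypothetical factor sizes
    TOTAL = sum(DIMS.values())

    # Forward process: factors -> image.
    generator = nn.Sequential(
        nn.Linear(TOTAL, 32 * 16 * 16), nn.Unflatten(1, (32, 16, 16)),
        nn.Upsample(scale_factor=4), nn.Conv2d(32, 3, 3, 1, 1), nn.Sigmoid())

    # Inverse process: image -> factors.
    decomposer = nn.Sequential(
        nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, TOTAL))

    opt = torch.optim.Adam(
        list(generator.parameters()) + list(decomposer.parameters()), lr=1e-4)

    real = torch.rand(8, 3, 64, 64)         # stand-in for unpaired real-world images
    cad_render = torch.rand(8, 3, 64, 64)   # stand-in for images rendered from CAD models
    cad_pose = torch.rand(8, DIMS["pose"])  # stand-in for the known rendering pose

    # Real images: decompose, then re-generate (the two processes must invert each other).
    loss_real = nn.functional.l1_loss(generator(decomposer(real)), real)

    # CAD renders: the pose slice of the decomposition is directly supervised,
    # since the pose used for rendering is known.
    pose_pred = decomposer(cad_render)[:, -DIMS["pose"]:]
    loss_cad = nn.functional.mse_loss(pose_pred, cad_pose)

    opt.zero_grad()
    (loss_real + loss_cad).backward()
    opt.step()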