Visual Intelligence and Systems

Introduction:
The development of robotic agents capable of interacting with objects in unstructured environments is a fundamental goal in robotics research. However, existing approaches suffer from a lack of integration between instruction understanding, perception, and control, which hinders their generalization capabilities. This proposal presents a novel framework that tightly couples language-driven perception with advanced control strategies to enable versatile and efficient robotic manipulation. By integrating perception and control in a synergistic manner, we aim to overcome the limitations of current approaches and enhance the adaptability, efficiency, and generalization capabilities of robotic agents.

Method:
To achieve our objectives, we propose a framework that tightly integrates language-driven perception with advanced control strategies for robotic manipulation in unstructured environments. Our method is structured around three main components: the closed perception-action loop, language-driven affordance learning, and model-based control with perception integration.

The closed perception-action loop forms the foundation of our method. It facilitates continuous information exchange between the perception and control modules, influencing each other's decisions. Natural language inputs are processed by the perception module, extracting object attributes, spatial relationships, and manipulation instructions. This information is then shared with the control module to guide the robot's actions, ensuring alignment between control actions and the perceived environment and manipulation goals.

The core concept of our method is language-driven affordance learning. Leveraging natural language processing models such as GPT-3.5 or GPT-4, we process and comprehend instructions for object manipulation tasks. Parsing and extracting essential information, such as object names, spatial relations, and manipulation verbs, enables us to generate a comprehensive perceptual representation of the scene. Semantic reasoning techniques are then employed to interpret the extracted language information, capturing contextual cues crucial for manipulation.
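
As an illustration of the intended parsing step, the minimal Python sketch below turns an instruction into a structured representation. The llm_complete callable, the JSON schema, and the fake_llm stub are assumptions made for illustration, not part of the proposed system.

import json
from dataclasses import dataclass

@dataclass
class ParsedInstruction:
    target_object: str      # e.g. "red mug"
    spatial_relation: str   # e.g. "on top of the shelf"
    manipulation_verb: str  # e.g. "place"

PROMPT_TEMPLATE = (
    "Extract the target object, spatial relation, and manipulation verb from the "
    "instruction below. Answer as JSON with keys: target_object, spatial_relation, "
    "manipulation_verb.\nInstruction: {instruction}"
)

def parse_instruction(instruction, llm_complete):
    # Query the (assumed) LLM endpoint and convert its JSON answer into a dataclass.
    raw = llm_complete(PROMPT_TEMPLATE.format(instruction=instruction))
    return ParsedInstruction(**json.loads(raw))

def fake_llm(prompt):
    # Stub standing in for a real LLM call so the sketch runs end-to-end.
    return ('{"target_object": "red mug", "spatial_relation": "on top of the shelf", '
            '"manipulation_verb": "place"}')

print(parse_instruction("Place the red mug on top of the shelf", fake_llm))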

The integration of model-based control with perception further enhances our method. By utilizing the perceptual information obtained from the language-driven affordance module, the control module generates accurate and adaptive manipulation plans. The control strategies consider the affordance map and ontological perception data, enabling efficient and precise object manipulation in unstructured environments.
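
As a toy illustration of how the control module could consume the affordance map, the sketch below selects a grasp point; the per-pixel affordance map and the reachability mask are assumed inputs, not outputs of any specific model.

import numpy as np

def select_grasp_pixel(affordance_map, reachable_mask):
    # Pick the (row, col) of the highest-affordance pixel the arm can actually reach.
    scores = np.where(reachable_mask, affordance_map, -np.inf)
    return np.unravel_index(np.argmax(scores), scores.shape)

affordance = np.random.rand(96, 128)         # stand-in for a predicted affordance map
reachable = np.ones((96, 128), dtype=bool)   # stand-in for a workspace/reachability mask
print(select_grasp_pixel(affordance, reachable))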

Through the tight integration of language-driven affordance learning with advanced control strategies, our method enables robust understanding and execution of object manipulation tasks based on natural language instructions. By overcoming the disjointed treatment of perception and control, we enhance the adaptability, efficiency, and generalization capabilities of robotic agents.

Goals:

Our proposed method aims to achieve the following goals:

  1. Integration of Language-Driven Perception and Control: By tightly coupling language-driven perception with advanced control strategies, we aim to overcome the limitations of existing approaches that treat perception and control as separate modules. This integration will enhance the adaptability, efficiency, and generalization capabilities of robotic agents.
  2. Robust Perception and Understanding of Instructions: We aim to robustly perceive and understand object attributes, spatial relationships, and task instructions from natural language inputs. By employing advanced language processing and semantic reasoning techniques, we extract relevant information from instructions and generate a comprehensive perceptual representation of the scene.
  3. Efficient and Adaptive Object Manipulation: Our method focuses on designing control strategies that leverage the rich perceptual information obtained from language-driven perception. These control strategies enable efficient and adaptive object manipulation in dynamic and unstructured environments, handling varying objects, backgrounds, and task requirements.
  4. Generalization in Unstructured Environments: We aim to improve the generalization capabilities of robotic agents in unstructured environments. By integrating perception and control in a synergistic manner, our method leverages perceptual information to align control actions with the perceived environment and manipulation goals, facilitating effective generalization.

Requirements:

  • Strong programming skills (Python and ROS);
  • Basics of reinforcement learning;
  • Experience with deep learning libraries (PyTorch or TensorFlow);
  • [Optional] Experience with panoptic segmentation, object tracking, multiple-view geometry or transformers.

Professor: Fisher Yu, http://yf.io

Supervisors:
Yao Mu, https://yaomarkmu.github.io/

To apply: Send your self-introduction, project of interest, CV, and transcripts to . We will usually reply within a week if there is a project match.

References:

[1]. Shridhar, Mohit, Lucas Manuelli, and Dieter Fox. "Perceiver-actor: A multi-task transformer for robotic manipulation." Conference on Robot Learning. PMLR, 2023.

[2]. Yang, Jiange, et al. "Pave the Way to Grasp Anything: Transferring Foundation Models for Universal Pick-Place Robots." arXiv preprint arXiv:2306.05716 (2023).

[3]. Mu, Yao, et al. "EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought." arXiv preprint arXiv:2305.15021 (2023).

[4]. Nair, Suraj, et al. "R3m: A universal visual representation for robot manipulation." arXiv preprint arXiv:2203.12601 (2022).

[5]. Bahl, Shikhar, et al. "Affordances from Human Videos as a Versatile Representation for Robotics." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[6]. Brohan, Anthony, et al. "Rt-1: Robotics transformer for real-world control at scale." arXiv preprint arXiv:2212.06817 (2022).

[7]. Ahn, Michael, et al. "Do as i can, not as i say: Grounding language in robotic affordances." arXiv preprint arXiv:2204.01691 (2022).

 

Introduction
Long-horizon planning and control tasks require accurate predictions of future state transitions. However, traditional world models struggle to capture the complexity and dynamics of such tasks, leading to long prediction horizons and accumulating errors. This research proposal introduces a Chain of Thought World Model that integrates short-term prediction and key frame forecasting, aiming to improve the efficiency and accuracy of goal-conditioned reinforcement learning in long-horizon scenarios.
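
To make the idea concrete, the following minimal PyTorch sketch shows one possible two-head design with a short-term dynamics head and a key-frame head; all dimensions, the goal conditioning, and the fixed key-frame gap are assumptions for illustration rather than the final model.

import torch
import torch.nn as nn

class ChainOfThoughtWorldModel(nn.Module):
    def __init__(self, state_dim=32, action_dim=4, hidden=128, keyframe_gap=10):
        super().__init__()
        self.keyframe_gap = keyframe_gap
        # Short-term head: predicts the next state from (state, action).
        self.short_term = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))
        # Key-frame head: jumps keyframe_gap steps ahead, conditioned on a goal state.
        self.keyframe = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def step(self, state, action):
        return self.short_term(torch.cat([state, action], dim=-1))

    def jump(self, state, goal):
        return self.keyframe(torch.cat([state, goal], dim=-1))

model = ChainOfThoughtWorldModel()
s, a, g = torch.randn(1, 32), torch.randn(1, 4), torch.randn(1, 32)
print(model.step(s, a).shape, model.jump(s, g).shape)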

Method
The research will be conducted in the following steps:

  1. Design and implement the Chain of Thought World Model, incorporating both short-term prediction and key frame forecasting.
  2. Utilize benchmark environments and datasets to evaluate and compare the proposed model with traditional world models.
  3. Develop a goal-conditioned reinforcement learning framework using the Chain of Thought World Model.
  4. Conduct experiments to assess the performance of the proposed model, comparing its efficiency and accuracy with conventional world models.
  5. Analyze the results to determine the benefits and limitations of integrating short-term prediction and key frame forecasting in reinforcement learning.

 

Goals:
This research proposal aims to achieve the following objectives:

  1. Develop a Chain of Thought World Model that combines short-term prediction and key frame forecasting for long-horizon planning and control tasks.
  2. Investigate the effectiveness of integrating short-term prediction and key frame forecasting in assessing the potential impact of current actions on future outcomes.
  3. Evaluate the performance of the proposed model in comparison to traditional world models in long-horizon planning and control tasks.
  4. Assess the efficiency and accuracy of goal-conditioned reinforcement learning using the proposed approach.


Requirements

  • Strong programming skills (Python);
  • Experience with deep learning libraries (PyTorch or TensorFlow);
  • [Optional] Experience with panoptic segmentation, object tracking, multiple-view geometry or transformers.

Professor: Fisher Yu, http://yf.io

Supervisors:
Yao Mu, https://yaomarkmu.github.io/

To apply:
Send your self-introduction, project of interest, CV, and transcripts to . We will usually reply within a week if there is a project match.

References:

[1]. Wei, Jason, et al. "Chain of thought prompting elicits reasoning in large language models." arXiv preprint arXiv:2201.11903 (2022).

[2]. Yang, Mengjiao Sherry, et al. "Chain of thought imitation with procedure cloning." Advances in Neural Information Processing Systems 35 (2022): 36366-36381.

[3]. Jia, Zhiwei, et al. "Chain-of-Thought Predictive Control." arXiv preprint arXiv:2304.00776 (2023).

[4]. Chen, Lili, et al. "Decision transformer: Reinforcement learning via sequence modeling." Advances in neural information processing systems 34 (2021): 15084-15097.

[5]. Hafner, Danijar, et al. "Dream to control: Learning behaviors by latent imagination." arXiv preprint arXiv:1912.01603 (2019).

[6]. Hafner, Danijar, et al. "Mastering atari with discrete world models." arXiv preprint arXiv:2010.02193 (2020).

[7]. Hafner, Danijar, et al. "Mastering Diverse Domains through World Models." arXiv preprint arXiv:2301.04104 (2023).




 

Introduction: Online domain adaptation mechanisms allow a deployed model to handle unforeseeable domain changes, such as sudden weather events. To meet the timing constraints typical of real-world applications, it is crucial to identify adaptation policies that reduce the computational overhead as much as possible – e.g., by adapting only when a domain shift is actually occurring – while supervising the model as well as possible, for instance by exploiting additional information coming from other modalities, if available, or by learning temporally consistent prompts.
In this project, we aim to develop a novel online adaptation paradigm that relies on the guidance of a Multimodal or Large Language Model (LLM), both to orchestrate the adaptation process itself – e.g., to identify when a domain shift is occurring – and to exploit its knowledge to obtain additional supervision.
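
A possible shape of this control flow is sketched below in PyTorch. The shift score and the supervision hook are placeholders; the confidence- and entropy-based stand-ins in the demo are assumptions, whereas the project would plug in LLM/VLM-guided signals.

import torch

def online_adaptation_loop(model, frames, shift_score, supervision_loss,
                           lr=1e-4, threshold=0.5):
    # Run the model on a stream of frames and adapt it only when a shift is detected.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    predictions = []
    for frame in frames:
        logits = model(frame)
        if shift_score(frame, logits) > threshold:        # adapt only under domain shift
            loss = supervision_loss(frame, logits)        # e.g. LLM/VLM-derived pseudo-label loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        predictions.append(logits.argmax(dim=1).detach())
    return predictions

# Dummy components so the sketch runs: a 1x1-conv "segmentation head", a confidence-based
# shift score, and an entropy loss standing in for the LLM-guided supervision.
model = torch.nn.Conv2d(3, 19, 1)
frames = [torch.randn(1, 3, 64, 64) for _ in range(3)]
score = lambda x, lg: float(1 - torch.softmax(lg, 1).max(dim=1).values.mean())
entropy = lambda x, lg: -(torch.softmax(lg, 1) * torch.log_softmax(lg, 1)).sum(dim=1).mean()
print(len(online_adaptation_loop(model, frames, score, entropy)))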

Joint work with Google, the University of Bologna, and ETH/TUM

Goals:

  • Familiarize with Online Domain Adaptation literature for Semantic Segmentation
  • Review the literature combining text and vision embeddings, e.g. CLIP, DenseCLIP, as well as their use in applications, e.g. SAM, SegGPT
  • Deploy a method exploiting text prompts for conditioning/improving the adaptation process
  • Design a benchmark to evaluate the developed method and others
  • Bonus: Submit your results to top-tier deep learning conferences


Requirements:

  • Strong programming skills (Python);
  • Solid experience with deep learning libraries (e.g., PyTorch, TensorFlow, JAX);
  • Experience with computer vision applications (and, optionally, LLMs and CLIP)
     

Professor: Fisher Yu

To Apply: Send your self-introduction, CV, and college transcripts to . We will usually reply within a week if there is a project match.

References:

  1. Sun, Tao, Segu, Mattia, et al. “SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation”, CVPR 2022.
  2. Panagiotakopoulos, Theodoros, et al. "Online Domain Adaptation for Semantic Segmentation in Ever-Changing Conditions", ECCV 2022.
  3. Radford, Alec, et al. “Learning Transferable Visual Models From Natural Language Supervision”, ICML 2021
  4. Rao, Yongming, et al. “DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting”, CVPR 2022
  5. Kirillov, Alexander, et al. “Segment Anything”, https://arxiv.org/abs/2304.02643
  6. Wang, Xinlong, et al. “SegGPT: Segmenting Everything In Context”, https://arxiv.org/abs/2304.03284
  7. Ge, Chunjiang, et al. “Domain Adaptation via Prompt Learning”, https://arxiv.org/abs/2202.06687

Introduction
Deep learning applications have been thriving over the last decade in many domains, including computer vision and natural language understanding. The drivers of this vibrant development have been the availability of abundant data, algorithmic breakthroughs, and advances in hardware. The majority of models created today require a human to manually label data in a way that allows the model to learn how to make correct decisions. The collected datasets typically need to be representative of, on the one hand, all the different operating and environmental conditions and, on the other hand, all the different classes. This is not feasible in practical applications. For rare events, such as faults, the dataset cannot be representative of all the different fault types. Moreover, collecting data in all possible environments can be costly and can take a long time (particularly in the case of rare events). For datasets experiencing a domain shift, e.g. due to different environmental conditions, domain adaptation (DA) can transfer knowledge from a label-rich source domain to an unlabeled target domain. However, DA approaches still have several limitations, particularly for cases with rare events and with respect to their ability to learn online in new environments. There are two paradigms for resolving the generalization issue in DA: robustness and adaptation. In this project, we explore how to build an adaptive deep learning model that can learn to detect rare events / anomalies from images acquired in a different domain.

Method
The main focus of this project is to investigate how to combine few-shot learning and domain adaptation techniques to utilize the information from ‘normal’ images and adapt the deep learning model to detect cracks/anomalies in varying environments.
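
One possible, deliberately simple combination is sketched below: a feature prototype built from a few ‘normal’ target-domain images and a distance-based anomaly score. The encoder, shot count, and scoring rule are illustrative assumptions.

import torch

def normal_prototype(encoder, few_normal_images):
    # Average feature of a few 'normal' images from the target domain.
    with torch.no_grad():
        feats = torch.stack([encoder(img).flatten() for img in few_normal_images])
    return feats.mean(dim=0)

def anomaly_score(encoder, image, prototype):
    # Large distance to the 'normal' prototype => likely crack/anomaly.
    with torch.no_grad():
        return torch.norm(encoder(image).flatten() - prototype).item()

encoder = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.AdaptiveAvgPool2d(1))
shots = [torch.randn(1, 3, 64, 64) for _ in range(5)]      # the few target-domain shots
proto = normal_prototype(encoder, shots)
print(anomaly_score(encoder, torch.randn(1, 3, 64, 64), proto))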

Goals

  1. Familiarize with computer vision and domain adaptation literature
  2. Literature review of state-of-the-art methods for anomaly detection, few-shot learning, and domain adaptation/generalization
  3. Propose an algorithm combining DA and few-shot learning for anomaly detection.
  4. Design a set of experiments to evaluate different domain adaptation approaches
  5. Train and evaluate the approaches on various industrial datasets
  6. Real-world application with our industrial partner.


Requirements
Python, PyTorch/TensorFlow, prior experience with computer vision. Nice to have: enthusiasm for learning and experimenting.

Professor
Fisher Yu, http://yf.io and Olga Fink, https://ims.ibi.ethz.ch

To apply
Send your self-introduction, the project of interest, CV, and college transcripts to . We will usually reply within a week if there is a project match.

References
[1] Li, Rui, et al. (2020). Model Adaptation: Unsupervised Domain Adaptation Without Source Data. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2] Wang, Xin, et al. (2020). Frustratingly Simple Few-Shot Object Detection. International Conference on Machine Learning (ICML).

[3] Xiao, Jin, et al. (2020). Multi-Domain Learning for Accurate and Few-Shot Color Constancy. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Yang, Chao, et al. (2020). One-Shot Domain Adaptation for Face Generation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
 

Introduction
Professional human pilots can achieve astounding levels of performance in First-Person-View (FPV) drone racing, where pilots control the vehicle based on live-streamed images from a camera mounted on the drone's nose.
The ability for human pilots to effectively control the drone solely based on images is awe-inspiring and acts as a primary source of inspiration for many researchers.
Vision-based systems that can achieve the same level of autonomy have wide-ranging applicability to inspection, delivery, and search-and-rescue.

Method

We aim to develop vision-based end-to-end neural network policies that push quadrotors to their physical limits in autonomous drone racing. The neural network policy receives first-person-view images and outputs control commands that are executed directly on the quadrotor. The main focus of this project is to use reinforcement learning with visual observations to solve the task.
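
A rough sketch of such an image-to-command policy is given below; the 4-dimensional output (e.g., collective thrust plus body rates) and all layer sizes are assumptions for illustration.

import torch
import torch.nn as nn

class RacingPolicy(nn.Module):
    def __init__(self, action_dim=4):
        super().__init__()
        # Small convolutional backbone over the first-person-view image.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, action_dim))

    def forward(self, image):
        return torch.tanh(self.head(self.backbone(image)))   # normalized control command

policy = RacingPolicy()
print(policy(torch.randn(1, 3, 128, 128)).shape)   # -> torch.Size([1, 4])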

Goals

  1. Familiarize with reinforcement learning and drone racing literature [1, 2]
  2. Literature review of state-of-the-art methods for vision-based navigation and efficient RL
  3. Design a set of experiments to evaluate different neural network architectures (e.g., CNN, LSTM, TCN) and different state representations (e.g., raw images, compressed features).
  4. Train and evaluate the policy using our simulator [3]
  5. Real-world flight (with the help of experienced engineers and PhD students)


Requirements

C++/Python, PyTorch/TensorFlow, prior experience with reinforcement learning. Nice to have: experience with computer vision and a robotics background.

Professor
Fisher Yu, http://yf.io and Davide Scaramuzza, http://rpg.ifi.uzh.ch

To apply
Send your self-introduction, the project of interest, CV, and college transcripts to . We will usually reply within a week if there is a project match.

References
[1] Song, Y., Steinweg, M., Kaufmann, E. and Scaramuzza, D.. Autonomous Drone Racing with Deep Reinforcement Learning. IROS 2021.
[2] P. Foehn, A. Romero, D. Scaramuzza. Time-Optimal Planning for Quadrotor Waypoint Flight. Science Robotics, July 21, 2021.
[3] Song, Y., Naji, S., Kaufmann, E., Loquercio, A. and Scaramuzza, D. Flightmare: A flexible quadrotor simulator. CoRL 2020.
[4] Arulkumaran, K., Deisenroth, M.P., Brundage, M. and Bharath, A.A., 2017. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866.
 

Introduction: Robotic manipulation is a very challenging task, as it requires tight interactions between perception, planning, and control. Because these modules depend on one another, an error in one of them often means failure of the whole task. As an alternative, end-to-end reinforcement learning (RL) has emerged, in which the different levels are combined into a single policy that can be trained directly with RL. Even though these methods can be successful for many tasks, they are often slow to train and struggle in highly dynamic environments.

Method: In this thesis, we plan to investigate how to push RL for manipulation towards highly dynamic environments. In a first step, different state-of-the-art RL algorithms will be benchmarked on robotic manipulation tasks. The goal is to find a method suited for application to a real robotic manipulation task, focusing on sample efficiency and sim-to-real transfer. Building on the knowledge about the strengths and weaknesses of these algorithms, we will then push them towards dynamic tasks. This involves improving the given algorithms to better deal with the new environments, e.g., by using suitable network architectures or changing the training scheme, as well as developing suitable benchmark tasks for dynamic object manipulation.
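
A library-agnostic evaluation harness along the lines sketched below could be used for the benchmarking step; the agent/environment interfaces and the dummy components are assumptions, with concrete candidates being, e.g., SAC or HER implementations on a simulated manipulation task.

import numpy as np

def evaluate(agent_act, env_reset, env_step, episodes=10, max_steps=200):
    # Mean episodic return of a policy on a gym-style environment interface.
    returns = []
    for _ in range(episodes):
        obs, ep_ret = env_reset(), 0.0
        for _ in range(max_steps):
            obs, reward, done = env_step(agent_act(obs))
            ep_ret += reward
            if done:
                break
        returns.append(ep_ret)
    return float(np.mean(returns))

# Dummy agent/environment so the harness runs as-is.
rng = np.random.default_rng(0)
print(evaluate(lambda obs: rng.uniform(-1, 1, size=4),
               lambda: np.zeros(8),
               lambda a: (np.zeros(8), float(-np.linalg.norm(a)), rng.random() < 0.05)))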

Goals:

  1. Understand the basic issues with RL for manipulation tasks (e.g., HER, SAC)
  2. Try to push such methods to dynamic tasks
  3. Investigate network architectures and RL training schemes for dynamic tasks
  4. Design a benchmark for dynamic manipulation tasks


Requirements: Python, PyTorch/TensorFlow, ROS/Gazebo. Nice to have: experience with RL.

Professor: Fisher Yu, http://yf.io

To apply: Send your self-introduction, project of interest, CV, and college transcripts to . We will usually reply within a week if there is a project match.

Introduction: Reconstructing an object in a Structure-from-Motion (SfM)/depth fusion pipeline is typically done by registering new views/depths incrementally. However, these methods do not exploit the fact that the object of interest is roughly in the center of the view when performing the fusion. The goal of the project is to use objectness and the central location of the object as a prior when performing KinectFusion [cite]. This prior can help the method obtain a cleaner, background-free point cloud from the scanning process.

Method: Two main challenges exist for depth-based scanning. The first is handling the variability of object size and appearance automatically; this may be done by a network that predicts the saliency or objectness in the depth map and is fine-tuned online. The second is using both the RGB images and the depth maps to obtain the best possible camera pose for fusing the depth maps.
The project can be implemented on top of an open-source KinectFusion library. A strong preference is to use a C++ implementation, while some components such as the neural network can be implemented in TensorFlow/PyTorch. Numerous open-source KinectFusion libraries exist in both Python and C++.
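
Although the final implementation would likely be in C++, the Python/NumPy sketch below illustrates the intended prior: per-pixel objectness (assumed to come from a saliency network) combined with a Gaussian center prior to down-weight background depth before fusion. The threshold and the Gaussian width are illustrative assumptions.

import numpy as np

def center_weighted_depth(depth, objectness, sigma_frac=0.25):
    # Zero out depth pixels that are unlikely to belong to the centered object.
    h, w = depth.shape
    yy, xx = np.mgrid[0:h, 0:w]
    center_prior = np.exp(-(((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
                            / (2 * (sigma_frac * min(h, w)) ** 2)))
    keep = (objectness * center_prior) > 0.5          # illustrative threshold
    return np.where(keep, depth, 0.0)                 # 0 = invalid, ignored by fusion

depth = np.random.rand(120, 160).astype(np.float32)
objectness = np.random.rand(120, 160).astype(np.float32)   # stand-in network output
print(center_weighted_depth(depth, objectness).shape)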

Goals:

  1. Fuse depth maps for high fidelity 3D reconstruction.
  2. Use the intersection of fields of view to reject depth maps that do not belong to the object.

Impacts: The obtained solution will be integrated into an open-source High Fidelity Robotic Scanning project, which would have a big impact in the field of robotic scanning systems for single objects.

Professor: Fisher Yu, http://yf.io

To apply: Send your self-introduction, project of interest, CV, and college transcripts to . We will usually reply within a week if there is a project match.
 

Introduction: High-fidelity 3D reconstruction is a prerequisite for virtual visual experiences of the physical world. However, existing methods usually fail to capture the details of the real world due to its shape and material complexity. This is largely due to the inadequate amount of information acquired with existing devices and manual operations. We plan to combine perception-aware planning, model-based reinforcement learning, and 3D scene understanding to capture data in indoor environments and remedy those problems.

Method: The goal of this thesis is to develop a prediction-aware planning module that plans the robot's motion to minimize the uncertainty of the 3D reconstruction. This requires two major components. The first is a planning module that can quickly generate suitable paths under complex, uncertainty-aware reward functions. The second is a representation of the object that can be updated dynamically and accurately captures the uncertainty of the 3D reconstruction. In this thesis, we plan to investigate both classical object models based on point clouds, ray tracing, and voxel grids, as well as learning-based methods.
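
As a toy illustration of the uncertainty-aware planning idea, the sketch below picks the next viewpoint that observes the most accumulated uncertainty; representing uncertainty in a voxel grid and having precomputed visibility masks per candidate view are simplifying assumptions.

import numpy as np

def next_best_view(voxel_uncertainty, view_visibility_masks):
    # Pick the candidate viewpoint that observes the most total uncertainty.
    gains = [float(voxel_uncertainty[mask].sum()) for mask in view_visibility_masks]
    return int(np.argmax(gains)), gains

uncertainty = np.random.rand(32, 32, 32)                       # stand-in uncertainty grid
masks = [np.random.rand(32, 32, 32) > 0.7 for _ in range(5)]   # stand-in visibility masks
print(next_best_view(uncertainty, masks)[0])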

Goals:

  1. Develop a perception-aware path planner
  2. Integrate uncertainty into the path planner
  3. Develop an object representation that can estimate the uncertainty of the 3D reconstruction
  4. Develop a learning-based approach that can estimate the reconstruction uncertainty given a point cloud.

Professor: Fisher Yu, http://yf.io

To apply: Send your self-introduction, project of interest, CV, and college transcripts to . We will usually reply within a week if there is a project match.
 

Introduction: The ability to keep learning new concepts and to ask for new knowledge is the hallmark of intelligence and the key to understanding large-scale datasets. To achieve this goal, we need to study active learning, online learning, and continuous learning formulations and solutions. Numerous research works have investigated how a machine can learn when a human teacher provides continuous supervision based on machine feedback. However, most existing algorithms remain conceptual, and it is difficult to test them in a real system. This project aims to study proper system architectures and algorithm designs for human-machine collaboration systems. Please note that there are two directions in this project; you will choose one as your focus. One is computer system design: the challenge is the design of cloud systems that can support active and online learning. The other is algorithm design: the goal is to design active and online learning models that can help humans solve computer vision problems.

Method: On the algorithm track, we will study how to design interactive algorithms that help system users generate ground-truth labels with minimal effort. The target tasks include object segmentation, video segmentation, point cloud object detection, etc. On the system track, we will survey the existing algorithms and study how to unify their ideas. We will then design the system API and test it on different algorithms. The API will be implemented based on Scalabel (https://github.com/scalabel/scalabel). We will look into cloud system communication and resource allocation issues to provide a low-latency framework for human-machine collaboration.
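
One possible shape of the active-learning part of such an API is sketched below (uncertainty-based ranking of unlabeled items); the function names are illustrative and not the actual Scalabel interface.

import numpy as np

def rank_by_uncertainty(probs):
    # probs: (num_items, num_classes) softmax scores; return indices, most uncertain first.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(-entropy)

def next_batch(unlabeled_ids, probs, batch_size=8):
    # Serve the most informative items to the annotator first.
    order = rank_by_uncertainty(probs)
    return [unlabeled_ids[i] for i in order[:batch_size]]

ids = list(range(100))
probs = np.random.dirichlet(np.ones(5), size=100)   # stand-in model predictions
print(next_batch(ids, probs, batch_size=4))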

Goals:

System track

  1. Understand the machine learning foundation of active and online learning.
  2. Gain experience in using those algorithms in solving real problems.
  3. Design the system and user interfaces for the algorithms.
  4. Draw conclusions on the low-latency system design by investigating the system tradeoffs.
  5. Extend Scalabel to demonstrate the interfaces and internal designs.
  6. Obtain results publishable at top-tier computer systems conferences.

Algorithm track

  1. Provide baseline implementations for some popular algorithms.
  2. Prove the new algorithms can outperform the existing methods on the targeted problem.
  3. Obtain results publishable at top-tier computer vision conferences.

Requirements:
System track: Strong programming skills and OOP design experience. A decent background in machine learning and computer vision to understand the state-of-the-art methods.
Algorithm track: Prior experience with deep learning models for computer vision problems such as semantic segmentation and object detection. Good grasp of machine learning and computer vision knowledge.

Professor: Fisher Yu, http://yf.io

To apply: Send your self-introduction, project of interest, CV, and college transcripts to . We will usually reply within a week if there is a project match.

References:

[1] https://github.com/scalabel/scalabel/
[2] Continuous Adaptation for Interactive Object Segmentation by Learning from Corrections, ECCV 2020
[3] Ray: A Distributed Framework for Emerging AI Applications, OSDI 2018
[4] GraphX: Graph Processing in a Distributed Dataflow Framework, OSDI 2014
[5] Object Instance Annotation with Deep Extreme Level Set Evolution, CVPR 2019
 

Introduction: High-fidelity 3D reconstruction is a prerequisite for virtual visual experiences of the physical world. However, existing methods usually fail to capture the details of the real world due to its shape and material complexity. This is largely due to the inadequate amount of information acquired with existing devices and manual operations. We plan to combine reinforcement learning and 3D scene understanding to learn how to capture data in indoor environments and remedy those problems.

Method: We will study the setup where the RGBD cameras are mounted on a mobile robotic arm. The system will support 3D scene reconstruction by integrating SLAM, RGBD reconstruction, and implicit deep shape understanding. We will then investigate how to design the policy learning algorithms to direct a mobile robot to efficiently acquire data for further scene reconstruction.
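
As one illustrative choice for the policy-learning reward, the sketch below pays the robot for newly observed voxels of the scene; tracking the reconstruction as a boolean observation grid is a simplifying assumption.

import numpy as np

class CoverageReward:
    def __init__(self, grid_shape=(64, 64, 16)):
        self.observed = np.zeros(grid_shape, dtype=bool)   # which voxels have been seen

    def __call__(self, newly_visible_mask):
        # Reward = number of voxels observed for the first time by the latest scan.
        gained = np.logical_and(newly_visible_mask, ~self.observed).sum()
        self.observed |= newly_visible_mask
        return float(gained)

reward_fn = CoverageReward()
print(reward_fn(np.random.rand(64, 64, 16) > 0.9))   # stand-in for a sensor footprint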

Goals:

  1. Build a ROS2 package for SLAM
  2. Obtain large-scale 3D scene reconstruction results
  3. Design deep learning algorithms to extract 3D shape information
  4. Design reinforcement learning algorithms for indoor path planning

Requirements: Deep understanding of 3D geometry and reinforcement learning; strong programming skills. Optional: experience with real robots and a mechanical engineering background.

Professor: Fisher Yu, http://yf.io

To apply: Send your self-introduction, project of interest, CV, and college transcripts to . We will usually reply within a week if there is a project match.

References:

[1] Local Deep Implicit Functions for 3D Shape, CVPR 2020
[2] Autonomous 3-D Reconstruction, Mapping, and Exploration of Indoor Environments With a Robotic Arm, RA Letters 2019
[3] ORB-SLAM: a Versatile and Accurate Monocular SLAM System, T-RO 2015
[4] Screened Poisson Surface Reconstruction, ToG 2018
 

Introduction: Generalization of deep learning models across datasets is a persistent obstacle and an open question when we deploy state-of-the-art models in real-world applications. There are two paradigms for resolving the generalization issue: robustness and adaptation. In this project, we explore how to build adaptive deep learning models that can learn from videos in different environments.

Method: We will investigate how to use self-supervised learning and few-shot learning to utilize the information embedded in each video for adapting the deep learning models in an online fashion.
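
One concrete instantiation of such online self-supervised adaptation is sketched below: test-time entropy minimization that updates only the normalization layers, in the spirit of methods such as TENT. The loss choice and the restriction to BatchNorm parameters are assumptions, not the proposed method.

import torch

def adapt_on_frame(model, frame, lr=1e-4):
    # Update only the normalization layers with an entropy loss on the current frame.
    bn_params = [p for m in model.modules()
                 if isinstance(m, torch.nn.BatchNorm2d) for p in m.parameters()]
    optimizer = torch.optim.SGD(bn_params, lr=lr)
    logits = model(frame)
    probs = torch.softmax(logits, dim=1)
    loss = -(probs * torch.log(probs + 1e-12)).sum(dim=1).mean()   # prediction entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return logits.detach()

model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1),
                            torch.nn.BatchNorm2d(16), torch.nn.ReLU(),
                            torch.nn.Conv2d(16, 5, 1))
print(adapt_on_frame(model, torch.randn(2, 3, 32, 32)).shape)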

Goals:

  1. Investigate different few-shot learning methods for object detection.
  2. Design algorithms to use test-time information for domain adaptation.
  3. Combine feedback learning and meta learning for the task.

Requirements: Solid knowledge foundation in machine learning and computer vision. Understanding of different optimization methods, meta learning, and few-shot learning.

Professor: Fisher Yu, http://yf.io

To apply: Send your self-introduction, project of interest, CV, and college transcripts to . We will usually reply within a week if there is a project match.

References:

[1] Frustratingly Simple Few-Shot Object Detection, ICML 2020
[2] Few-shot Object Detection via Feature Reweighting, ICCV 2019
[3] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017
[4] Learning feed-forward one-shot learners, NIPS 2016
 

Introduction: Object tracking has received long-standing interest in computer vision due to its critical applications in video editing, movie special effects, autonomous driving, and robotics. Even though numerous algorithms have been proposed for this task, many open problems remain. A key problem is how to balance time efficiency and tracking accuracy to build real-time object tracking algorithms. In this project, we aim to leverage temporal and spatial information for joint object detection and tracking to improve the overall efficiency and efficacy of the resulting tracking systems.

Method: We will start with the best-performing online tracking algorithm on large-scale datasets and investigate the redundancy between the tracking and detection components. We will also look into existing object detection algorithms for efficient object localization. Then we will investigate how to design new tracking algorithms that use temporal information to improve detection accuracy.
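
The efficiency idea can be illustrated with the skeleton below: run the expensive detector only on key frames and propagate boxes with a cheap per-frame update in between. The detector/propagator interfaces and the fixed key-frame interval are assumptions.

import numpy as np

def track(frames, detector, propagate, detect_every=5):
    # Run the expensive detector sparsely and propagate boxes cheaply in between.
    boxes, outputs = [], []
    for t, frame in enumerate(frames):
        if t % detect_every == 0:
            boxes = detector(frame)                       # full detection on key frames
        else:
            boxes = [propagate(frame, b) for b in boxes]  # lightweight per-frame update
        outputs.append(boxes)
    return outputs

# Dummy detector/propagator so the sketch runs; real components would be a detection
# network and, e.g., an optical-flow or correlation-based box propagator.
frames = [np.zeros((64, 64, 3)) for _ in range(10)]
result = track(frames, lambda f: [np.array([10, 10, 30, 30])], lambda f, b: b)
print(len(result), len(result[-1]))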

Goals:

  1. Design a new algorithm to improve the running time of existing tracking algorithms without sacrificing accuracy.
  2. Design a reusable API to make the algorithm available on robotic systems.


Requirements: Solid knowledge foundation in machine learning and computer vision. Familiarity with object detection and tracking methods.

Professor: Fisher Yu, http://yf.io

To apply: Send your self-introduction, project of interest, CV, and college transcripts to . We will usually reply within a week if there is a project match.

References:

[1] Quasi-Dense Similarity Learning for Multiple Object Tracking, CVPR 2021
[2] YOLO9000: Better, Faster, Stronger, CVPR 2017
[3] Tracking without bells and whistles, ICCV 2019
 

Introduction: Object detection is one of the most fundamental problems in computer vision. It is the pillar of most computer vision problems that require instance-level recognition. At the same time, numerous new breakthroughs in machine learning can directly influence how we think about and design object detection algorithms. In this project, we aim to examine recent ideas in machine learning and understand their use in object detection.

Method: We will review recent trends in Transformers and design new model architectures that incorporate Transformers into object detection. We will focus on how to use Transformers to build simpler object detection networks. We will also study how we can use self-supervised learning to improve the performance and generalizability of object detection models.
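
A minimal detector in this spirit (learned object queries decoded against CNN feature tokens, as in DETR) is sketched below; the sizes, number of queries, and use of nn.TransformerDecoder are illustrative assumptions rather than a proposed final design.

import torch
import torch.nn as nn

class TinyTransformerDetector(nn.Module):
    def __init__(self, num_classes=20, num_queries=16, dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU())
        self.queries = nn.Parameter(torch.randn(num_queries, dim))   # learned object queries
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.cls_head = nn.Linear(dim, num_classes + 1)   # +1 for "no object"
        self.box_head = nn.Linear(dim, 4)                 # normalized (cx, cy, w, h)

    def forward(self, images):
        feats = self.backbone(images).flatten(2).transpose(1, 2)     # (B, HW, dim) tokens
        q = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        out = self.decoder(q, feats)                                 # queries attend to features
        return self.cls_head(out), self.box_head(out).sigmoid()

cls_logits, boxes = TinyTransformerDetector()(torch.randn(2, 3, 128, 128))
print(cls_logits.shape, boxes.shape)   # (2, 16, 21) and (2, 16, 4)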

Goals:

  1. Design a new object detection model based on Transformer
  2. Use self-supervised learning to learn good image representation for detection

Requirements: Solid knowledge foundation in machine learning and computer vision. Familiarity with object detection and representation learning methods.

Professor: Fisher Yu, http://yf.io

To apply: Send your self-introduction, project of interest, CV, and college transcripts to . We will usually reply within a week if there is a project match.

References:

[1] End-to-End Object Detection with Transformers, ECCV 2020
[2] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS 2015
[3] Attention Is All You Need, NIPS 2017
[4] Rethinking Pre-training and Self-training, NeurIPS 2020
 

Introduction: Humans have the amazing capability to estimate the depth and shape of observed objects with a single eye. Such an ability is critical for intelligent automation systems such as self-driving cars. Although we have made a lot of progress in imitating this capability, we are still a long way from estimating the accurate depth and shape of moving objects. What’s worse, it is very expensive to get dense depth labels for moving objects. To solve this problem, we aim to leverage the temporal and geometric information embedded in monocular videos and use 3D self-supervised learning to enhance the learning process.

Method: We will study the existing supervised and unsupervised monocular depth estimation methods. On the supervised learning side, we will use the LiDAR data collected on the autonomous driving platform as a training signal for depth estimation. We will dive into the internal mechanism of the existing models and investigate how we can estimate the object depth and shape reliably. At the same time, we will use self-supervised learning methods and 3D geometry information to remedy the sparsity of supervised learning signals.
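
The combination of sparse LiDAR supervision with a photometric self-supervised term could look like the loss sketch below; the variable names, the validity convention (zeros mark missing LiDAR returns), and the loss weighting are assumptions for illustration.

import torch

def depth_losses(pred_depth, lidar_depth, frame, warped_frame, w_photo=0.5):
    # Supervise only where LiDAR hits exist; add a photometric self-supervised term
    # computed from the neighbouring frame warped with the predicted depth/ego-motion.
    valid = lidar_depth > 0
    supervised = torch.abs(pred_depth - lidar_depth)[valid].mean()
    photometric = torch.abs(frame - warped_frame).mean()
    return supervised + w_photo * photometric

pred = torch.rand(1, 1, 48, 64)
lidar = torch.rand(1, 1, 48, 64) * (torch.rand(1, 1, 48, 64) > 0.9)   # sparse LiDAR map
img, warped = torch.rand(1, 3, 48, 64), torch.rand(1, 3, 48, 64)
print(depth_losses(pred, lidar, img, warped).item())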

Goals:

  1. Build a new state-of-the-art single-image depth estimation method for moving objects.
  2. Propose new 3D object detection and tracking algorithms to utilize the new depth information.


Requirements: Deep understanding of 3D geometry and representation learning; strong programming skills.

Professor: Fisher Yu, http://yf.io

To apply: Send your self-introduction, project of interest, CV, and college transcripts to . We will usually reply within a week if there is a project match.

References:

[1] Joint Monocular 3D Vehicle Detection and Tracking, ICCV 2019
[2] Unsupervised Learning of Depth and Ego-Motion from Video, CVPR 2017
[3] AdaBins: Depth Estimation using Adaptive Bins, CVPR 2021
[4] Categorical Depth Distribution Network for Monocular 3D Object Detection, CVPR 2021
[5] Hierarchical Discrete Distribution Decomposition for Match Density Estimation, CVPR 2019
 

Introduction: You can check our major research directions and publications at https://cv.ethz.ch/research.html and https://cv.ethz.ch/publications-and-awards.html. You can express your interest in those works and we can discuss how to investigate your own ideas.

Professor: Fisher Yu, http://yf.io

To apply: Send your self-introduction, project of interest, CV, and college transcripts to . We will usually reply within a week if there is a project match.
 
