Low-Level vs. High-Level Computer Vision
Recent advances in deep neural networks have transformed the field of computer vision by eliminating the need for hand-crafted feature extraction such as SIFT or HOG. There has been a lot of interest in how these innovations can actually transform industry. This article introduces some representative and prospective applications of deep learning-based computer vision technologies that either have affected or may affect our daily lives.
Depending on how fundamental the technology is, computer vision tasks can roughly be divided into two categories: (1) low-level computer vision and (2) high-level computer vision. In the deep learning context, researchers tweak CNN architectures for low-level vision tasks such as recognition, localization, detection, and segmentation. They sometimes develop special layers that are more computationally efficient, or even change the convolution operation itself (e.g., dilated convolution, deformable convolution, kervolution). In high-level computer vision tasks, on the other hand, the core features are usually extracted first by pre-trained low-level feature extractors and kept fixed during training. The main interest lies instead in effectively combining features from various modalities such as image, text, and audio. Image captioning, visual question answering (VQA), and scene graph generation fall under this category.
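To make the distinction concrete, here is a minimal PyTorch sketch (assuming torchvision ≥ 0.13 for the weights API): a dilated convolution stands in for the low-level architectural tweaks, and a frozen pre-trained backbone stands in for the typical starting point of a high-level pipeline.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Low-level tweak: a 3x3 convolution with dilation=2 enlarges the
# receptive field (effectively 5x5) without adding any parameters.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
x = torch.randn(1, 64, 32, 32)   # (batch, channels, height, width)
print(dilated(x).shape)          # torch.Size([1, 64, 32, 32])

# High-level pipeline: features come from a pre-trained extractor
# that stays frozen; only a task-specific head on top would train.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad = False      # features are fixed during training
```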
Between the two, I personally think that low-level vision technologies have far more potential for practical applications, because they are simpler. Simplicity means generality or, at least, generalizability. Supervised deep learning for low-level vision in academia has shown near-human-level performance on many tasks. During my Master's degree, I saw many collaborative projects between the school and companies that were primarily about object detection (although I specialized in video question answering). They came in many flavors: bird nest detection for transmission-line safety, outdoor fire detection, object detection in the wild, miniaturization of detection models, eye movement detection in video, airplane/ship detection in radar imagery, and so on. In contrast, I saw only two high-level vision projects: multi-modal video question answering and general-purpose scene graph generation.
So if you are an entrepreneur who wants to harvest the low-hanging fruit of deep learning, I think the most promising direction is low-level computer vision. Even though object detection performance has improved remarkably (e.g., on the COCO benchmark), applying those models to specific domains poses different challenges. For specific domains such as autonomous driving, bird nest detection, or eye disease detection, there is not much labeled data. The lack of data is one of the most common hurdles when applying deep learning to real-world applications. Another is the unique characteristics of the image domain. The COCO challenge deals with general images taken in various places and containing various objects. However, consider X-ray images. All X-ray images look similar, yet diagnosing a disease requires focusing on only a few pixels. Would CNNs also work well in this kind of scenario?
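As a sketch of what this domain adaptation often looks like in practice, the snippet below fine-tunes a COCO-pretrained detector from torchvision on a hypothetical bird-nest dataset; the class count and the dataset are assumptions for illustration, not a fixed recipe.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a detector pre-trained on COCO to compensate for the
# small amount of domain-specific labeled data.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box predictor head for the new domain
# (hypothetical example: background + bird nest = 2 classes).
num_classes = 2
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# From here, train on the small domain dataset; every layer is
# fine-tuned, but the pre-trained weights supply the general features.
```

The key design choice here is transfer learning: generic features learned from COCO substitute for the labels the specific domain lacks.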
Therefore, I think that for the next ten years or so, there will be a rush to apply low-level vision technologies in industry. Then again, as an academic researcher, high-level vision may hold more potential for new discovery and invention. Anyway, that's it. Way to go.