Hi! I'm Divya Ramesh.

I am a Computer Vision and Deep Learning Engineer at CloudSight Inc., a Los Angeles-based startup. I work on solving visual understanding problems using a hybrid intelligence approach. I'm particularly interested in finding answers to the following questions:

  1. How can we use artificial intelligence to combine humans and machines effectively in real time, so that together they provide visual understanding beyond the capabilities of either alone?
  2. How can we transfer knowledge from the crowds to computer vision and natural language understanding algorithms?

My interests span computer vision, machine learning, deep learning, natural language understanding, and, more recently, crowdsourcing and decision theory.

Please feel free to reach out to me if you are interested in any of the above.


R & D Software Engineer - Deep Learning & Computer Vision

CloudSight Inc

Lead the research and development efforts in hybrid intelligence for visual understanding. Also work closely with the data collection team to help generate training data for deep learning algorithms.

Aug 2015 - Present

Computer Vision Intern

CloudSight Inc

Investigated feature-based algorithms with machine learning for mobile visual search. Developed a solution that reduced image recognition time for known images to 0.7-0.8 s using OpenCV for iOS, Xcode, Objective-C, and SQLite.

May 2014 - June 2015

Undergraduate Research Assistant

M S Ramaiah Institute of Technology, Bangalore, India

Collaborated with a research team under Prof. K. Manikantan. Worked on applications of swarm intelligence to image segmentation, facial recognition, and biometrics. The work culminated in an oral conference publication.

Jan 2012 - June 2013

Education and Skills

Master of Science in Electrical Engineering

University of Southern California, Los Angeles, California
August 2013 - May 2015

Bachelor of Engineering in Electronics and Communication

M S Ramaiah Institute of Technology, Bangalore, India
August 2009 - June 2013

Programming Languages

Python, C/C++, Ruby

Libraries & Packages

TensorFlow, Caffe, OpenCV, NLTK, Gensim, MATLAB

Industry Projects

System and Methods of Confirming Image Descriptions for use in Hybrid Intelligent Systems
(US Patent Filed July 2017)

Real-time captioning of visual media has the potential for great impact in applications such as helping the visually impaired identify objects and their relationships with the surroundings. Fully automated visual captioning is still in its nascent stages and hence cannot be used as a reliable solution in such applications. However, automated captioning combined with human-in-the-loop systems can provide real-time yet reliable solutions. Efficient systems employ algorithmic selection to decide when to include human intelligence. In this study, we propose a novel method to determine the validity of a machine-generated caption for a previously unseen image. Our approach uses Latent Dirichlet Allocation-based topic models to establish the semantic relatedness between a natural language caption and the objects and attributes present in an image. We show that this method can also be used to decide subsequent actions when employed in human-in-the-loop image caption annotation systems.
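
As a rough illustration of the idea (not the patented method itself), the sketch below assumes that a topic model has already produced one topic distribution for the caption and one for the objects and attributes detected in the image. It compares the two with cosine similarity and uses a threshold to decide whether to accept the caption or ask a human. The function names, the choice of cosine similarity, and the 0.5 threshold are all assumptions made for this example.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two equal-length topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # an all-zero vector carries no topic information
    return dot / (norm_p * norm_q)


def route_caption(caption_topics, image_topics, threshold=0.5):
    """Accept the caption if it is topically close to the image content.

    The threshold is a hypothetical value; a real system would tune it
    on validation data.
    """
    score = cosine_similarity(caption_topics, image_topics)
    return "accept" if score >= threshold else "send_to_human"
```

For example, `route_caption([0.9, 0.1], [0.8, 0.2])` accepts the caption, while two distributions with no topics in common are routed to a human.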

Quality Control and Reliability of Crowd Workers in Real-time Image Captioning Tasks

Monitoring the quality and reliability of crowd workers in a non-binary task like image captioning is a challenge, as there can be multiple right answers. In this project, we implemented a scheme to compute performance metrics for each crowd worker by considering their quality, accuracy, and response times on the task of image captioning. These metrics also helped to measure the inter-rater reliability between workers.

Real-Time Landmark Recognition using Hybrid Intelligence

Led two junior engineers in developing a deep learning-based visual recognition solution for 150 world landmarks. Engineered the system to use minimal human assistance while maintaining an overall recall rate >90% and error ≤5%. This solution is currently live in a leading telecommunication provider's flagship phone.

Content Based Image Retrieval and Duplicate Image Detection

Implemented a hierarchical bag-of-words-based image retrieval solution for 4000 catalog images with < 2% error. This solution was developed for Canadian Tire with a turnaround time of 48 hours.

Academic Projects

Modeling Affective States of Students to Aid Intelligent Tutoring Systems

Intelligent Tutoring Systems (ITS) could be a potential solution to many issues related to education. However, the biggest challenge for an ITS is to be able to react to students in the same way as a human tutor does. This involves spontaneously adapting and reacting to students’ understanding levels. In our study, we explored the possibility of using audio-visual cues of students to detect their understanding levels with respect to the response of a human tutor. The analysis was done on a self-curated and annotated real-classroom interaction dataset instead of the acted datasets that are generally used in such studies. We achieved a baseline accuracy of 62% in classifying the students’ levels of understanding, thus confirming the utility of such multimodal cues in modeling responses of an ITS in a student-ITS interaction.

Depth Estimation and 3D Reconstruction

Extracted disparity and depth maps from the speckle pattern projected by the Kinect sensor. Calibrated the RGB and depth sensors to combine the color and depth data generated by PrimeSense, and created a point cloud image of the scene (3D reconstruction).
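
The step from disparity to depth to a 3D point can be sketched with the standard pinhole relations Z = f * B / d, X = (u - cx) * Z / f, and Y = (v - cy) * Z / f. This is a minimal illustration rather than the project's code: the function names are hypothetical, and the focal length, baseline, and principal point must come from an actual sensor calibration.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Depth in metres from disparity in pixels, using Z = f * B / d."""
    if disparity_px <= 0:
        return None  # no valid match at this pixel, so depth is undefined
    return focal_length_px * baseline_m / disparity_px


def back_project(u, v, depth_m, focal_length_px, cx, cy):
    """Map pixel (u, v) with known depth to a 3D point in the camera frame."""
    x = (u - cx) * depth_m / focal_length_px
    y = (v - cy) * depth_m / focal_length_px
    return (x, y, depth_m)


def point_cloud(disparity_map, focal_length_px, baseline_m, cx, cy):
    """Build a list of 3D points from a 2D disparity map (list of rows)."""
    points = []
    for v, row in enumerate(disparity_map):
        for u, d in enumerate(row):
            z = disparity_to_depth(d, focal_length_px, baseline_m)
            if z is not None:  # skip pixels without a valid disparity
                points.append(back_project(u, v, z, focal_length_px, cx, cy))
    return points
```

Pixels with zero disparity are skipped, so the point cloud contains only points for which a depth could be computed.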

Analysis of Unsupervised Feature Learning Methods for Image Classification

This project investigates the utility of an unsupervised feature learning algorithm for the task of image classification. State-of-the-art methods for image classification use a pre-learning stage with deep networks to achieve superior results. In this project, one such pre-learning phase was investigated and found to give a significant improvement over the baseline performance obtained using raw pixel data as inputs to a classifier. The main aim of this project was to find the relationship between the different hyperparameters of the feature learning model and the classification performance. This aim was largely achieved, as shown by the extensive study carried out in the project.

Gender Recognition and Multiple Face Tracking on TI DSP DM6437

Gender recognition from faces is a popular problem in computer vision, as it is useful in many applications such as biometric authentication, high-tech surveillance and security systems, criminology, inspection, and augmented reality. Our system first detects the faces appearing in front of the camera using the Viola-Jones object detection framework, and then performs gender classification using a Support Vector Machine classifier trained on Local Binary Pattern features. The system was implemented on the DaVinci DSP DM6437 from Texas Instruments. We found that we could detect up to 5 faces at a time with an accuracy of 100%. Restricting the facial region to a pre-determined bounding box produces the right gender result with an accuracy of 80% and an initialization time of 10 frames.
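
To make the Local Binary Pattern feature concrete, here is a minimal pure-Python sketch of the basic 8-neighbor LBP operator and the histogram that would serve as the feature vector for a classifier. It is an illustration only, not the DSP implementation: the neighbor ordering, the `>=` comparison, and the function names are assumptions, and face detection and SVM training are not shown.

```python
# Clockwise 8-neighborhood starting at the top-left pixel (assumed ordering).
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                    (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, row, col):
    """8-bit LBP code for an interior pixel of a grayscale image."""
    center = image[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOR_OFFSETS):
        # A neighbor at least as bright as the center sets its bit.
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist
```

The histogram of a face crop (or the concatenated histograms of several sub-regions) is the kind of vector that would be passed to the classifier.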

Interests and Other Activities


I am an intermediate classical pianist, and to me there is nothing better than ending my day with a good hour of piano practice. I love playing pieces from the Romantic Era.

I also enjoy reading a variety of books, with philosophy and science being my favorite topics. Apart from this, I enjoy hiking from time to time.


I taught Scratch programming and Hummingbird robotics to children in grades 3-7 with the Neighborhood Youth Association (NYA), Venice, California, during the 2016-2017 academic year.