Welcome to the home of Computer Vision powered by PlayFusion. Computer vision is one of the many subdomains of Artificial Intelligence, more commonly referred to as ‘AI’. This site focuses specifically on Monocular Visual Odometry on mobile phones and tablet devices - another way of saying ‘teaching low-powered devices with one eye to see’ - as opposed to relying on the high-end computing power available on personal computers or on specialist stereo or depth-sensing cameras.
This is a very exciting area of computer vision research and one PlayFusion has been pioneering for the past three years. We call it the Enhanced Reality Engine™.
Essentially, the desire is to reproduce what our eyes and brains do and to extend this capability to an automated system, while adding enhanced functionality and extracting or mixing data unavailable through direct vision.
The scope of application for computer vision is vast and varied. In the article below we look at these applications and consider some of the methods and algorithms used, with specific attention to computer vision solutions in Augmented Reality (AR) and Mixed Reality (MR), where PlayFusion is leading the way with its advanced Enhanced Reality Engine. In these applications a great deal of attention is paid to optimizing algorithms to run on low-end hardware with a single camera, with an emphasis on fast, accurate recognition, mapping and tracking.
Augmented Reality is the process by which computer-generated content is overlaid onto images or video streams.
Why do this? For many reasons: to provide an enhanced experience for users, and to give them more information than is available from simple observation.
Let’s first look at the other big players already in this field.
Open Source Computer Vision (OpenCV) is a library of programming functions mainly aimed at real-time computer vision. The library is cross-platform and free for use under the open-source BSD license.
It is programmed in C++ and this is still its primary interface. This makes it difficult for inexperienced C++ programmers to use and, although it has a wealth of features and algorithms, implementing them effectively requires a great deal of understanding.
Many people regard computer games as purely a distraction rather than a serious industry, or berate them because “kids nowadays don’t mix and climb trees anymore”. However, the games industry is responsible for major developments in computer technology that have led to advances in all kinds of other areas of our lives.
Tremendous competition in the games industry means that developers are constantly competing to deliver more realistic graphics, faster processing of data, better multiplayer experiences and so on. Because the games industry is so large, it drives hardware development to cope with the demands of producing ever more impressive content.
Consequently we now see graphics cards providing low-cost, super-fast processing of mathematical and statistical data for all kinds of applications; processors that are orders of magnitude faster than a few years ago; video and image manipulation that is years ahead of where it would otherwise be; computer vision solutions to all kinds of problems; and faster disk drives. All of this feeds into developments in medical technology, aerospace, manufacturing, retail and a host of other application areas.
And above all, people in the games industry are passionate, imaginative and motivated by a desire to create something new and exciting - we imagine something that we would like to do - and then we make it happen. We believe anything is possible.
At PlayFusion we have developed advanced computer vision capabilities that can be applied to many areas of industry. Here are a few of them:
In fact almost any market sector you can think of can employ computer vision to enhance and extend its capabilities.
At the core of the technology is our proprietary Enhanced Reality Engine. This has been developed from our award winning computer games and is set to form a platform which can be licensed by users wanting to add augmented reality experiences or other advanced computer vision technology to their products or services.
PlayFusion's Enhanced Reality Engine employs a technique known as Monocular Visual Odometry and has some very advanced features, many of which you won’t find elsewhere. The most significant of these are shown below.
So let’s look at some of these in more detail and see how they can be applied…
But first, some explanation of computer vision terminology.
This is the process in computer vision whereby the pose (position and orientation) of a camera is estimated from visual data over time. It can be performed independently of any map of the surroundings, so it can, for example, estimate pose relative to the starting position.
Simply put, this technique involves tracking a set of interest points - e.g. edge or corner pixels in a video feed - and using advanced geometric algorithms to estimate poses. A flow chart of the method would look something like this:
Detect matching points in multiple images > compute matrices relating pairs of points > perform post-processing to refine accuracy (e.g. with an Extended Kalman Filter) > output pose data
This looks simple enough; however, it relies on a number of other computer vision techniques in order to work. For example, key frames are used to fix reference points in time and space; an image analysis algorithm must be employed to detect the points; some image enhancement may be necessary to facilitate this; and some prior knowledge about the environment (e.g. detected planes and camera height) may be required, or may need to be collected, before the odometry measurements begin.
Finally the data can be used in a number of ways, for example animation of an object overlaid on a live video feed to give an Augmented Reality experience.
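To make the pose-estimation step concrete, here is a minimal numpy sketch (not PlayFusion’s actual implementation). It uses the Kabsch algorithm to recover a camera rotation from matched 3-D points; production monocular systems instead work from 2-D correspondences via essential-matrix decomposition, but the “compute matrices relating pairs of points” idea is the same. All names and data here are illustrative.

```python
import numpy as np

def estimate_rotation(points_a, points_b):
    """Kabsch algorithm: find the rotation R such that R @ a ~ b for
    centred point sets - a toy stand-in for the 'compute matrices
    relating pairs of points' step of visual odometry."""
    a = points_a - points_a.mean(axis=0)
    b = points_b - points_b.mean(axis=0)
    h = a.T @ b                              # cross-covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))   # guard against reflections
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T

# Simulate a small camera rotation about the z axis and recover it.
theta = 0.1
r_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
pts = np.random.default_rng(0).normal(size=(50, 3))
r_est = estimate_rotation(pts, pts @ r_true.T)
print(np.allclose(r_est, r_true, atol=1e-6))  # True
```

A full pipeline would feed such per-frame estimates into a filter (for example an Extended Kalman Filter) to refine accuracy, as in the flow chart above.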
Many visual odometry systems employ dual camera systems so that 3D data can be collected. This makes pose estimation a lot easier. However, as we have mentioned already, PlayFusion’s Enhanced Reality Engine is designed to be a cross platform tool that works on a wide range of low cost mobile devices. These generally have only one camera so the problem of pose estimation is much more complicated.
Essentially it does with one camera what other systems require two cameras for.
This uses additional data from sensors that are commonly present in mobile technology, in particular the inertial sensors that detect movement, to enhance precision.
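A hedged illustration of why fusing the two sensor types helps: the toy 1-D sketch below blends an integrated gyroscope rate (smooth but prone to drift) with an accelerometer-derived angle (noisy but drift-free) using a simple complementary filter. Real visual-inertial odometry systems typically use more sophisticated filters; the parameters here are illustrative only.

```python
import numpy as np

def complementary_filter(gyro_rates, accel_angles, dt=0.01, alpha=0.98):
    """Toy 1-D sensor fusion: mostly trust the integrated gyro rate for
    smoothness, and continuously correct it with a small fraction of the
    accelerometer angle so that drift cannot accumulate."""
    angle = accel_angles[0]
    out = []
    for rate, acc in zip(gyro_rates, accel_angles):
        angle = alpha * (angle + rate * dt) + (1 - alpha) * acc
        out.append(angle)
    return np.array(out)

# A constant true tilt of 0.5 rad: the gyro reports ~zero rate while the
# accelerometer reading is noisy; the fused estimate settles near 0.5.
rng = np.random.default_rng(1)
n = 2000
fused = complementary_filter(np.zeros(n), 0.5 + 0.05 * rng.normal(size=n))
print(abs(fused[-1] - 0.5) < 0.05)  # True
```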
A method where a map of an environment is built up while simultaneously plotting the pose of a camera travelling in that environment.
SLAM can essentially be broken down into a flow that looks something like this:
Landmark extraction > data association > state estimation > state update > landmark update.
As can be seen, these two computer vision techniques work together: the flow chart for visual odometry is very similar to that for SLAM and essentially forms the localization part of the procedure, while the rest of the algorithm calculates the real-world mapping. We can therefore produce a more comprehensive flow chart.
1. Scan area to establish key frames and detect landmarks (move camera around)
3. Detect matching point pairs and relative movement (move camera or object)
4. Compute relative matrices
5. Post-process data to improve accuracy
6. State estimation (pose)
7. State update (pose change)
9. Output data (animate graphics etc.)
10. Repeat from 3
Although this is a simplified workflow, it shows the basic concept of visual inertial odometry that forms the basis of most computer vision AR methods. The algorithms and physics are no secret, and there are many papers describing them in detail - the skill, however, is in implementing them efficiently. PlayFusion’s Enhanced Reality Engine uses a similar, but highly modified, workflow. As can be imagined, the amount of processing required to make this solution effective in real time is considerable. Many have managed it with high-end devices; what PlayFusion has managed is to bring this functionality to low-cost, readily available mobile technology.
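The “state estimation” and “state update” steps in the workflow above are commonly implemented with a Kalman filter. The sketch below is a deliberately minimal scalar version (the noise values are illustrative, not any production tuning): each new noisy measurement nudges the state estimate towards the truth while shrinking its uncertainty.

```python
import numpy as np

def kalman_step(x, p, z, q=1e-4, r=0.01):
    """One predict/update cycle of a scalar Kalman filter.
    x: state estimate, p: its variance, z: new noisy measurement,
    q: process noise, r: measurement noise (illustrative values)."""
    p = p + q                 # predict: uncertainty grows over time
    k = p / (p + r)           # Kalman gain: how much to trust z
    x = x + k * (z - x)       # update the state with the residual
    p = (1 - k) * p           # update: uncertainty shrinks
    return x, p

# Track a constant true state of 1.0 through 500 noisy observations.
rng = np.random.default_rng(2)
x, p = 0.0, 1.0
for z in 1.0 + 0.1 * rng.normal(size=500):
    x, p = kalman_step(x, p, z)
print(abs(x - 1.0) < 0.1)  # True
```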
This is where the optical data recorded by a device is processed and identified by software. Image recognition is an incredibly complex and very advanced area of computer vision that has been used and developed over many years, and it forms the basis of many computer vision technologies.
Essentially it tries to replicate what our eyes and brains do in a fraction of a second. Complex algorithms are used to help a computer decide what it is actually seeing. An example is facial recognition, which is used in security systems and can also be applied to augmented reality. A simple application is Facebook’s messaging system, where people can send pictures or video with various overlays on them. Although essentially a fun application, the algorithms behind it took years to develop and are now widely used in AR.
This is the process by which a moving object is tracked by the computer vision system. The target may be an object or a set of points on an object. From an animated character in a game to a moving tank on a battlefield, this is an area that, combined with image recognition, can provide the basis of very powerful augmented reality experiences in which enhanced data is presented to users in real time.
This technology enables the sharing of data across multiple platforms to create a persistent map of the environment. Data can be stored off-device, enabling much faster computer vision operations because the amount of data being managed locally is much smaller, so live AR is possible even on relatively low-powered devices.
At PlayFusion we have pioneered multi user technology in our Enhanced Reality Engine. This enables exciting experiences for people. We have unique capabilities in this area and users of our SDK will be able to create complex multi-user experiences.
Imagine looking at a desk through your phone or tablet and seeing a race track superimposed on it. One of the cars on the track is yours; the others belong to other players. The race is on. You all see the same track, from different perspectives, but everyone controls their own vehicle and everyone shares the same precise map data.
Now imagine the same desk with an AR representation of a medical scan of a patient from a CT machine. Two surgeons are able to practice an operation to remove a tumor, each seeing the other’s moves and their effects so they can work in tandem. The image generated for both users is accurate, and although they see it from different perspectives the precision is absolute.
The possibilities are endless… if you can imagine it, we can probably do it.
So what are some examples of important computer vision techniques and algorithms? In computer vision applications, such as the augmented reality featured in PlayFusion’s award-winning Lightseekers card game, three things are of paramount importance to users:
At PlayFusion we are proud of our achievements to date, especially in optimizing our code to run in real time, across platforms, on low-end hardware that other solutions can’t reach. We are continually developing the software, refining and improving it for the best possible computer vision experiences in our own products. Consequently, the platform we make available to users outside our own industry is also constantly supported and updated, driven by the intense demands and competition in the games industry to provide ever more impressive experiences.
An important part of any computer vision system is the ability to recognize artifacts or features in an image. To this end a number of methods have been developed, and the choice of technique depends very much on the application. There are four basic types of feature that these computer vision algorithms look for: edges, corners, blobs and ridges.
A brief comparison of a few of the commonly used computer vision algorithms is shown below.
Shi and Tomasi - corner
Laplacian of Gaussians - corner, blob
Difference of Gaussians - corner, blob
More details and rigorous explanations of the mathematics behind these algorithms can be found by following the links below:
Canny - named after its inventor, John F. Canny in 1986.
Sobel - named after its inventor, Irwin Sobel; sometimes called the Sobel-Feldman operator.
SUSAN - Smallest Univalue Segment Assimilating Nucleus.
Shi and Tomasi - modification of the Harris algorithm
FAST - Features from Accelerated Segment Test
MSER - Maximally Stable Extremal Regions
PCBR - Principal Curvature Based Region detector
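As a concrete illustration of one detector from the list above, here is a simplified version of the FAST segment test (a FAST-9 style check; the full detector adds machine-learned pixel ordering and non-maximum suppression, which are omitted here). The image and threshold are toy values.

```python
import numpy as np

# The 16 offsets of the radius-3 Bresenham circle used by FAST.
CIRCLE = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
          (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2), (-1, 3)]

def fast_corner(img, y, x, t=20, n=9):
    """Simplified FAST segment test: (y, x) is a corner if at least n
    contiguous pixels on the radius-3 circle are all brighter than
    centre + t or all darker than centre - t."""
    c = int(img[y, x])
    ring = [int(img[y + dy, x + dx]) for dx, dy in CIRCLE]
    for sign in (1, -1):                 # test brighter arc, then darker arc
        flags = [sign * (p - c) > t for p in ring]
        run, best = 0, 0
        for f in flags + flags:          # doubled list handles wrap-around
            run = run + 1 if f else 0
            best = max(best, run)
        if best >= n:
            return True
    return False

# A bright square on a dark background: the square's corner pixel fires,
# a flat region does not.
img = np.zeros((20, 20), dtype=np.uint8)
img[8:16, 8:16] = 200
print(fast_corner(img, 8, 8), fast_corner(img, 4, 4))  # True False
```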
Once features have been detected, a computer vision process known as “Feature Extraction” can be performed. This essentially looks at the area local to a feature, or set of features, to describe it more precisely (e.g. gradient of contrast, edge orientation). A number of methods can be used in this step. The extracted features can be used to identify artifacts in the image, and comparison of sets of extracted features can be used to determine movement.
Since an extracted feature may be used to identify sets of detected features, it effectively reduces the amount of data requiring manipulation. This speeds up subsequent processing and facilitates machine learning. Optimization of feature extraction is key to fast, efficient computer vision solutions.
Feature extraction is regarded as a method of dimensionality reduction: it takes data that requires many variables to describe and reduces it to fewer variables by defining new “principal variables”, in this case related sets of detected features.
Some of the more commonly used approaches are listed below.
This is a computer vision technique that detects and describes features in images and groups related points or areas of interest together. It is widely used in areas such as object recognition, remote mapping, gesture recognition and video tracking.
SIFT effectively extracts a number of image points or features and uses these as a set, to provide a feature description of an object. Once this feature description is saved, it can be used to identify the object quickly in new images or video streams. Importantly, the identification has to be reliable under a wide range of conditions such as noise, different light levels and also has to be scalable.
In computer vision applications requiring robust object recognition and subsequent tracking - eg. animating an augmented reality figure - algorithms like SIFT are extremely important.
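A common way to use SIFT-style descriptors for recognition is nearest-neighbour matching with Lowe’s ratio test: a match is accepted only when the best candidate is clearly closer than the runner-up, which suppresses ambiguous matches. The sketch below demonstrates the ratio test on toy 4-D descriptors (real SIFT descriptors are 128-dimensional); the data and threshold are illustrative.

```python
import numpy as np

def ratio_test_match(desc_a, desc_b, ratio=0.7):
    """Match two descriptor sets with Lowe's ratio test: keep a match
    only if the nearest neighbour is much closer than the second one."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        j, k = np.argsort(dists)[:2]     # nearest and second nearest
        if dists[j] < ratio * dists[k]:
            matches.append((i, int(j)))
    return matches

# Toy 4-D descriptors: b is a slightly noisy, reversed copy of a, so the
# correct correspondence is i -> 4 - i.
rng = np.random.default_rng(3)
a = rng.normal(size=(5, 4))
b = a[::-1] + 0.001 * rng.normal(size=(5, 4))
print(ratio_test_match(a, b))  # [(0, 4), (1, 3), (2, 2), (3, 1), (4, 0)]
```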
For a fuller explanation of the technique click this link: SIFT
SIFT is often used in conjunction with techniques such as the following.
A SIFT-like computer vision image descriptor that considers more spatial regions for its histograms, giving a more accurate representation of the underlying distributions of data.
A statistical procedure for converting a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
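As a sketch of the idea, PCA can be implemented in a few lines with an SVD. Here, 2-D points that really vary along a single diagonal direction are reduced to one principal component (toy data, illustrative only):

```python
import numpy as np

def pca(data, k=1):
    """Principal Component Analysis via SVD: project centred data onto
    its k directions of greatest variance (the principal components)."""
    centred = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:k].T            # k-dimensional representation

# 2-D points that actually vary along one diagonal, plus slight noise:
# a single component captures nearly all of the variance.
rng = np.random.default_rng(4)
t = rng.normal(size=200)
pts = np.column_stack([t, t]) + 0.01 * rng.normal(size=(200, 2))
reduced = pca(pts, k=1)
print(reduced.shape)  # (200, 1)
```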
This is another computer vision technique that helps with object recognition and with defining shapes, especially where the data may be incomplete or contain erroneous artifacts.
This is a feature detector that is essentially an evolution of SIFT. It is faster than SIFT; however, it uses integer approximations of the SIFT algorithm, so it may not be suitable for all applications. As always, the precise choice of methods will depend on the application requirements. At PlayFusion our engineers are experts in applying this technology; consequently we are able to optimize our code to run on low-end hardware while still delivering exceptional performance.
This is a computer vision feature extraction algorithm, specifically developed to address the problem that many other image recognition algorithms suffer from, namely that they are slow and require a lot of processing power to work fast enough in real time. KLT is able to extract data more quickly and use fewer points than many traditional methods.
As more research and development is done in computer vision, and the requirements to run complicated methods on low end hardware such as mobile phones, algorithms such as KLT have become extremely important.
KLT Tracker - Kanade-Lucas-Tomasi feature tracker
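The core Lucas-Kanade idea behind KLT can be shown in one dimension: a small shift between two signals is recovered from a single least-squares step on the gradient. The real tracker applies the same linearization to 2-D image patches, iterating and selecting good features to track; this is only a toy illustration with illustrative data.

```python
import numpy as np

def lk_shift(f, g):
    """One Lucas-Kanade style step: if g is f shifted by a small amount
    d (in samples), then g(x) ~ f(x) - d * f'(x), so d falls out of a
    least-squares fit against the gradient of f."""
    grad = np.gradient(f)
    return np.sum(grad * (f - g)) / np.sum(grad * grad)

# Two sine waves, the second shifted by half a sample: the estimated
# shift comes out close to 0.5.
x = np.linspace(0, 2 * np.pi, 400)
dx = x[1] - x[0]
f = np.sin(x)
g = np.sin(x - 0.5 * dx)
print(abs(lk_shift(f, g) - 0.5) < 0.05)  # True
```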
RANSAC is a method of identifying “outliers” or points that are outside a statistical level of error. It is used in computer vision mapping to ensure that erroneous points do not affect the reliability of the data. For example in mapping one image onto another - or frames in a video stream - it is used to discard any random points that may appear due to noise or other artifacts, that would otherwise affect the accuracy of the mapping.
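A minimal RANSAC sketch for fitting a 2-D line (toy parameters, illustrative only): the gross outliers get no say in the final model, because the consensus step simply outvotes any model they propose.

```python
import numpy as np

def ransac_line(pts, n_iter=200, tol=0.1, seed=0):
    """Minimal RANSAC for a line y = m*x + c: repeatedly fit a line to
    two random points and keep the model most points agree with, so
    outliers cannot contaminate the final fit."""
    rng = np.random.default_rng(seed)
    best, best_inliers = None, 0
    for _ in range(n_iter):
        (x1, y1), (x2, y2) = pts[rng.choice(len(pts), 2, replace=False)]
        if x1 == x2:
            continue                      # vertical sample, skip
        m = (y2 - y1) / (x2 - x1)
        c = y1 - m * x1
        inliers = np.sum(np.abs(pts[:, 1] - (m * pts[:, 0] + c)) < tol)
        if inliers > best_inliers:
            best, best_inliers = (m, c), inliers
    return best

# 30 points on y = 2x + 1 plus a few gross outliers.
x = np.linspace(0, 1, 30)
line = np.column_stack([x, 2 * x + 1])
outliers = np.array([[0.2, 9.0], [0.5, -4.0], [0.8, 7.0]])
m, c = ransac_line(np.vstack([line, outliers]))
print(round(m, 2), round(c, 2))  # 2.0 1.0
```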
Occlusion in computer vision is the interaction of real and virtual parts of the image. For example if an augmented character runs behind a wall, the wall hides the character from the display. Likewise if a character runs in front of a real object, the object should be hidden by the augmented entity.
Occluding real items with augmented characters is fairly straightforward; occluding augmented content with real content is much harder. It either requires a pre-mapped area to be held on the device or in the cloud, or uses complex algorithms to map occluding objects as they appear.
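At its core, per-pixel occlusion handling is a depth test: the virtual layer is drawn only where it is closer to the camera than the estimated real-scene depth. The sketch below shows just the compositing step on toy data; as noted above, obtaining a reliable real-depth map on a monocular device is the hard part.

```python
import numpy as np

def composite(real_rgb, real_depth, virt_rgb, virt_depth):
    """Per-pixel occlusion test: draw the virtual layer only where it is
    closer to the camera than the real scene, so real objects correctly
    hide augmented content behind them."""
    virt_wins = (virt_depth < real_depth)[..., None]
    return np.where(virt_wins, virt_rgb, real_rgb)

# 2x2 toy frame: the virtual object (depth 1.0) sits in front of the
# real scene (depth 2.0) except in the last pixel, where a real object
# at depth 0.5 occludes it.
real = np.zeros((2, 2, 3))
virt = np.ones((2, 2, 3))
real_d = np.full((2, 2), 2.0)
real_d[1, 1] = 0.5
virt_d = np.full((2, 2), 1.0)
out = composite(real, real_d, virt, virt_d)
print(out[0, 0, 0], out[1, 1, 0])  # 1.0 0.0
```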
One academic discussion of approaches using this technology can be found here: Real Time Occlusion Handling.
Multi-user AR has long been a much sought-after capability and has been approached in a number of ways: for example, a pre-rendered map can be used, or a cloud-based map derived from a number of users’ individual inputs and stitched together.
PlayFusion's Enhanced Reality Engine is pioneering the use of multi user AR on mobile devices.
Many Computer Vision systems rely on detecting a set of data points and defining planes from these points. This is a convenient way to map an environment.
PlayFusion's Enhanced Reality Engine rapidly detects a large number of data points and identifies multiple planes in both horizontal and vertical orientations. AR components can be spawned on any of these planes as required.
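A common way to fit such a plane to a point cloud is a least-squares fit via SVD: the plane normal is the direction of least variance in the centred points. This is a minimal numpy sketch on toy data, not PlayFusion’s method:

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through a 3-D point cloud: the normal is the
    right singular vector with the smallest singular value of the
    centred points (the direction of least variance)."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    return vt[-1], centroid              # unit normal, point on plane

# Noisy points on a horizontal surface z = 1: the recovered normal
# should point almost exactly along the z axis.
rng = np.random.default_rng(5)
pts = np.column_stack([rng.uniform(-1, 1, size=(100, 2)),
                       np.full(100, 1.0) + 0.001 * rng.normal(size=100)])
normal, _ = fit_plane(pts)
print(round(abs(normal[2]), 2))  # 1.0
```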
At PlayFusion we have worked hard to optimize our code to run on devices from low-end mobiles through to high-specification computers. We are proud of this capability, as we can perform complex enhanced reality tasks that others require high-end computing power or complex optics to achieve.
PlayFusion is working with companies that are developing the latest hardware on which to run enhanced reality solutions. The progression from optical to photonic devices is beginning to happen, meaning that head-up displays and headsets will become smaller, lighter, cheaper and more functional.
Photonics technology, using components such as diffractive optics, is bringing rapid advances to enhanced reality experiences and is set to revolutionize computer vision, making applications that use techniques such as augmented reality more accessible and immersive.