Due to recent developments in both software and hardware, augmented reality is becoming increasingly popular and is being applied in fields such as gaming, healthcare, and education. When designing AR and VR applications, one recurring problem is the design of virtual scenes. In particular, designing scenes by hand is time consuming, especially when scenes must be varied to create realistic environments. Recent literature in the computer vision community has tried to address this problem through scene synthesis, a method for automatically generating environments. Popular approaches extract scene structures from scenes in a dataset, learn patterns that exist within those structures, and propose new scenes based on this learning. Earlier papers such as GRAINS use scene trees, while more recent papers such as SceneGraphNet use scene graphs. Other approaches, such as DeepSynth and PlanIT, use scene images. All of these structures will be discussed in detail later.
However, lawsuits surrounding SUNCG, a scene database widely used in scene synthesis research, have caused public repositories to remove their associations with the dataset. As a result, future scene synthesis research can no longer reproduce prior results unless the data was already pre-processed, and, most importantly, there is no longer any access to SUNCG at all. Researchers in the scene synthesis field therefore need to rely on smaller databases containing scans of real scenes, in which bounding boxes are often erroneous. There is a need for more robust systems that don’t rely on (1) perfect labeling and (2) large amounts of data.
In this blog, I’ll go over DeepGen, a generative contextual augmentation framework with a deep network approach. Our scene synthesis system addresses the first issue by relaxing the constraints on scene relationships, which allows us to use scene graphs and message passing techniques to suggest object placements. In regard to the second issue, we augment the data to increase the training set size by treating objects in a room, rather than the rooms themselves, as individual datapoints. Our system is trained on the Matterport 3D dataset, which contains 223 bedrooms, small compared to the thousands of rooms that SUNCG provides. This choice of dataset gives us the ideal environment for our system to address the two outlined issues.
Besides relaxing scene relationship constraints, we also add a new scene relationship based on intersections, and we leverage dynamic edge convolution layers to help distinguish important co-occurrence relationships. Scene relationships will be covered in detail later.
Consider the following designer problem:
A designer wants to place a piece of furniture F with label L into a room R. The room R may already contain other objects.
Our system provides a solution to this problem via the following algorithm:
- Sample points from the room. Generally, points are sampled uniformly in space at some resolution r. For instance, with a resolution of r = 50, you sample 50×50 = 2500 points.
- For each sampled point P, our system pretends that object F is centered at P.
- Given that F is centered at P, we generate a summary feature vector and a set of scene graphs associated with F’s placement. Both the summary feature vector and scene graphs are generated by observing the spatial relationships of F to existing objects within the scene, including the walls and floor.
- Using the summary feature vector and the set of scene graphs as input to our graph neural network, we output a probability describing the likelihood that object F is centered at point P.
- Once we have all the probabilities, the sample point with the maximum probability of placement is where we actually place object F in room R.
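The steps above can be sketched in a few lines. Here, `score_fn` is a hypothetical stand-in for the graph-network scorer described later in this blog, and the toy scorer exists only to make the sketch runnable:

```python
import numpy as np

def propose_placement(room_bounds, resolution, score_fn):
    """Sample a resolution x resolution grid of candidate points and
    return the point with the highest placement probability."""
    (xmin, xmax), (ymin, ymax) = room_bounds
    xs = np.linspace(xmin, xmax, resolution)
    ys = np.linspace(ymin, ymax, resolution)
    best_point, best_prob = None, -1.0
    for x in xs:
        for y in ys:
            p = score_fn((x, y))  # likelihood that F is centered here
            if p > best_prob:
                best_point, best_prob = (x, y), p
    return best_point, best_prob

# Toy scorer that simply prefers the room's center (illustration only).
center = np.array([2.0, 2.0])
toy_score = lambda pt: float(np.exp(-np.linalg.norm(np.array(pt) - center)))
point, prob = propose_placement(((0.0, 4.0), (0.0, 4.0)), resolution=50,
                                score_fn=toy_score)
```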
The image below illustrates this algorithm.
In this section, we will go over some implementation details. Before describing the summary features and the scene relationships encoded in our scene graphs, I’ll take some time to cover popular choices for encoding scene relationships.
Scene Relationships and their Encodings
Scene relationships are not hard to understand; they just require a look around your surroundings. If luck would have it, you might be reading this blog on a laptop resting on a table. If so, you might say that the laptop and table have a scene relationship of support: the table is supporting the laptop, or equivalently, the laptop is supported by the table.
The previous example illustrates the simplicity of scene relationships: they are merely observations of the room environment. Scene synthesis literature introduces not only support relationships but also surrounding (multiple objects surrounding one object), next-to (objects next to each other with the same orientation), facing (objects next to each other with opposite orientations), and more. Declaring scene relationships between objects allows us to mathematically encode information about a scene in some structure. I’ll go over three popular choices: scene trees, scene graphs, and scene images.
In scene trees, relationship types are specified in inner nodes, and the objects participating in those relationships are children of those inner nodes. The ordering of the children can even carry meaning. For instance, for an inner node that represents a surrounding relationship, the first (left-most) child is the object that is surrounded by the other objects (the remaining children of the inner node). Finally, the root represents an abstract room node that connects all subsequent nodes, and leaf nodes represent actual objects within the scene, including walls and floor. Each node within the tree can have an associated feature vector that expresses more information than what is encoded in the structure, but usually these feature vectors live in the leaf nodes to capture data such as bounding box size and an object’s center in the room reference frame. Below is a picture of a scene tree from GRAINS.
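As a rough sketch, a scene tree can be represented as follows. The node names, feature values, and helper are illustrative, not GRAINS’ actual encoding; the convention that the first child of a surrounding node is the surrounded object follows the description above:

```python
class TreeNode:
    """One node of a scene tree: an inner relationship node or a leaf object."""
    def __init__(self, kind, children=None, feature=None):
        self.kind = kind                  # e.g. "room", "surrounding", or an object label
        self.children = children or []    # child ordering can carry meaning
        self.feature = feature            # e.g. bounding-box size for leaf objects

# "surrounding": first child is the surrounded object, the rest surround it.
tree = TreeNode("room", [
    TreeNode("surrounding", [
        TreeNode("bed",        feature=[2.0, 1.6, 0.5]),
        TreeNode("nightstand", feature=[0.5, 0.5, 0.6]),
        TreeNode("nightstand", feature=[0.5, 0.5, 0.6]),
    ]),
])

def leaf_labels(node):
    """Collect object labels from the leaf nodes, left to right."""
    if not node.children:
        return [node.kind]
    return [lab for c in node.children for lab in leaf_labels(c)]
```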
One of the glaring issues with scene trees is the distance from one node to another. To reach a node in another branch, you potentially have to traverse multiple inner nodes, and in some cases the root node. This can ultimately dilute or muddy up whatever relationships exist between those nodes. Furthermore, scene trees establish a hierarchy of relationships for each object through its membership in an inner node and the sub-memberships reached by tree traversal, when there might not be a hierarchy to begin with.
On the other hand, scene graphs do not have the problems described previously, and they are much simpler to conceptualize. In a scene graph, nodes represent the actual objects, while the relationships between nodes are established via edge connections. Furthermore, edges have types which dictate what kind of relationship exists between nodes. The figure below from the SceneGraphNet paper illustrates the simplicity of a scene graph.
Another popular scene relationship encoding scheme is the scene image. In actuality, these are not images as we usually know them, i.e. with one or three color channels. Instead, each channel of a scene image encodes information at the pixel level. For instance, suppose N channels of the scene image are dedicated to occupancy maps for each of the N furniture types present within a dataset. This encoding is also quite simple to conceptualize because we already think about scenes from a floorplan point of view. Scene synthesis literature such as DeepSynth introduces other channels to capture pixelwise height, wall occupancy, general object occupancy, and pixel focus. Below is a figure illustrating a scene image.
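A minimal sketch of this idea, assuming a hypothetical three-type dataset and a 64×64 top-down map (the exact channel layout in DeepSynth differs):

```python
import numpy as np

FURNITURE_TYPES = ["bed", "nightstand", "wardrobe"]  # hypothetical dataset labels
RES = 64  # pixels per side of the top-down map

def make_scene_image(objects, res=RES):
    """Build a multi-channel scene image: one occupancy channel per
    furniture type plus a final general-occupancy channel.

    objects: list of (label, (x0, y0, x1, y1)) footprints in pixel coords."""
    img = np.zeros((len(FURNITURE_TYPES) + 1, res, res), dtype=np.float32)
    for label, (x0, y0, x1, y1) in objects:
        c = FURNITURE_TYPES.index(label)
        img[c, y0:y1, x0:x1] = 1.0   # per-type occupancy map
        img[-1, y0:y1, x0:x1] = 1.0  # general object occupancy
    return img

img = make_scene_image([("bed", (10, 10, 40, 30)),
                        ("nightstand", (42, 10, 48, 16))])
```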
Data Input Overview
Now that we’ve gone over scene relationships and their encodings, hopefully this demystifies the rest of the blog in regard to scene relationships and graphs. In DeepGen, we encode scene relationships in two ways: (1) a summary feature vector and (2) a set of scene graphs.
Consider part of the DeepGen pipeline where we have an object F that we want to place in the room R. F has a bounding box radius vector [Rx, Ry, Rz], label L, and proposed placement P. Other objects that exist within the scene also have this information. Using this information, we can calculate the following values:
- Average distance from F to each furniture type
- Counts per furniture type detailing how many members of each type are surrounding F
- Counts per furniture type detailing how many members of each type are intersecting F
- Counts per furniture type detailing how many members of each type are supporting F
- The three closest furniture groups, numerically stored in a 3-dimensional vector [A, B, C]
- A one-hot encoded vector representing the relative position of F within R. Specifically, [1, 0] represents F being nearest one wall, [0, 1] represents F being equally nearest two walls, and [0, 0] represents an object farther than a certain distance from all walls (thus being in the middle of the room)
With the exception of the intersection observation, these features directly come from suggestions made by SceneGen.
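To make the feature list concrete, here is a hedged sketch of two of the entries above: the per-type average distances and the per-type surrounding counts. The furniture types, proximity threshold, and data layout are illustrative assumptions, not values taken from SceneGen:

```python
import numpy as np

TYPES = ["bed", "nightstand", "lamp"]  # illustrative furniture types
SURROUND_RADIUS = 1.0                  # meters; hypothetical proximity threshold

def summary_features(target_pos, scene_objects):
    """Compute per-type average distance and per-type surrounding counts.

    scene_objects: list of (label, np.array position) for objects in the room."""
    avg_dist = np.zeros(len(TYPES))
    surround = np.zeros(len(TYPES))
    for i, t in enumerate(TYPES):
        dists = [np.linalg.norm(target_pos - pos)
                 for label, pos in scene_objects if label == t]
        if dists:
            avg_dist[i] = np.mean(dists)
            surround[i] = sum(d <= SURROUND_RADIUS for d in dists)
    return np.concatenate([avg_dist, surround])

feats = summary_features(
    np.array([0.0, 0.0]),
    [("nightstand", np.array([0.5, 0.0])), ("nightstand", np.array([3.0, 0.0]))],
)
```

In the full system these two blocks would be concatenated with the remaining features (intersection counts, support counts, closest groups, and the wall-position one-hot) into a single summary vector.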
In addition to the summary feature vector, we also use a set of scene graphs, each representing a scene relationship. The following list details the criteria for an edge to exist between two nodes in each graph.
- Intersection: If another object in a scene intersects F, then a directed edge is connected from the other node to the target node associated with F.
- Surrounding: If another object is within the proximity of F, then a directed edge is connected from the other node to the target node associated with F.
- Supported By: If the top plane of another object’s bounding box is within a threshold distance (0.05 meters) of the target object’s bottom plane, then a directed edge is connected from the other node to the target node associated with F.
- Supporting: If the bottom plane of another object’s bounding box is within a threshold distance (0.05 meters) of the target object’s top plane, then a directed edge is connected from the other node to the target node associated with F.
- Relative Position: The nodes in this graph are associated with the walls, floor, and target object F. If a wall is within the proximity of an object, an edge is drawn from the node associated with that wall to the target object node. If no walls meet this criterion, an edge is drawn from the floor node to the target object node.
- Co-Occurring: Nodes associated with objects that already exist within the scene have a directed edge from themselves to the target node associated with F.
Each node in a scene graph has an associated feature vector. This feature vector includes a one-hot encoding of furniture type as well as the object’s ordering by distance to the target object: if the ordering equals 5, for instance, then the object is the 5th closest to the target object. Finally, a default node exists within each scene graph so that the graph neural network within DeepGen can recognize the absence of nodes; its feature vector consists of a zero vector for the one-hot encoding and an ordering number of -1. These scene graphs and their relationships come from suggestions made by SceneGraphNet.
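Putting the edge criteria and node features together, here is a sketch of building one of these graphs, the supported-by graph, including the distance ordering and the default node. The box representation and helper names are assumptions for illustration:

```python
import numpy as np

SUPPORT_EPS = 0.05  # meters; the threshold quoted above
TYPES = ["bed", "nightstand", "lamp"]  # illustrative furniture types

def supported_by_graph(target, others):
    """Build (edges, node_features) for the supported-by scene graph.

    target/others: dicts with 'label', 'center' (np.array), and 'top'/'bottom'
    bounding-box heights. Node 0 is the target; the last node is the default node."""
    t_onehot = [1.0 if target["label"] == t else 0.0 for t in TYPES]
    feats = [t_onehot + [0.0]]  # node 0: the target itself
    edges = []
    # Rank the other objects by distance to the target; the rank becomes
    # the ordering entry in each node's feature vector.
    order = sorted(range(len(others)),
                   key=lambda i: np.linalg.norm(others[i]["center"] - target["center"]))
    for rank, i in enumerate(order, start=1):
        o = others[i]
        onehot = [1.0 if o["label"] == t else 0.0 for t in TYPES]
        feats.append(onehot + [float(rank)])
        if abs(o["top"] - target["bottom"]) <= SUPPORT_EPS:
            edges.append((rank, 0))  # directed edge: other node -> target node
    feats.append([0.0] * len(TYPES) + [-1.0])  # default node marks absence
    return edges, feats

table = {"label": "nightstand", "center": np.array([0.0, 0.0]),
         "top": 0.60, "bottom": 0.0}
lamp = {"label": "lamp", "center": np.array([0.1, 0.0]),
        "top": 1.0, "bottom": 0.62}
edges, feats = supported_by_graph(lamp, [table])
```

Here the lamp’s bottom (0.62 m) is within 0.05 m of the nightstand’s top (0.60 m), so a supported-by edge is drawn from the nightstand node to the target lamp node.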
Now that we have gone over the inputs to our system, let’s dive into the system architecture. The first part of our system is the preprocessing phase, which turns the scene and proposed placement of object F into a summary feature vector and a set of scene graphs. The next part is the inference phase. We have a dedicated model per furniture type, each specializing in plausible placements for that type, so the inputs are passed to the appropriate model depending on the target object F’s label L. That dedicated model outputs the likelihood that object F is placed at the proposed point P.
Each dedicated model has the same architecture, which incorporates an initialization layer, specialized graph neural network layers, a concatenation step, and a feed-forward network with ReLU activations and a final Sigmoid activation for the probability output.
The initialization layer is nothing more than a 2-layer feed-forward fully connected network with ReLU activations and a linear activation at the end. It is applied to each scene graph’s feature matrix, which is simply the nodes’ feature vectors stacked into matrix form. Each feature matrix is then fed to its respective graph neural network layer. Our architecture uses graph attention layers for all scene graphs except the co-occurrence graph, which is paired with a dynamic edge convolution layer that learns to ignore insignificant co-occurrence edges. Graph neural network layers are useful for receiving and summarizing information from neighbors, which we call a message. In actuality, this message is merely a vector aggregating the vectors generated by passing neighbor feature vectors through fully connected layers.
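The back half of the pipeline can be sketched in plain numpy. The weights are random placeholders and the dimensions are made up, so this only shows the data flow: gather one message per scene graph, concatenate with the summary vector, and score with a small feed-forward network ending in a Sigmoid:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def placement_probability(messages, summary, W1, b1, W2, b2):
    """messages: one message vector per scene graph for the target node."""
    x = np.concatenate(messages + [summary])      # concatenation step
    h = relu(W1 @ x + b1)                         # hidden layer with ReLU
    return float(sigmoid(W2 @ h + b2)[0])         # probability of placement

msgs = [rng.normal(size=8) for _ in range(6)]     # one message per scene graph
summary = rng.normal(size=10)                     # summary feature vector
d_in = 6 * 8 + 10
W1, b1 = rng.normal(size=(16, d_in)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(1, 16)) * 0.1, np.zeros(1)
p = placement_probability(msgs, summary, W1, b1, W2, b2)
```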
Once messages have been passed to the target nodes associated with F in each scene graph, the messages are concatenated with the summary feature vector. This larger concatenated vector is passed through the final feed-forward fully connected network to output the probability of plausible placement. The network architecture is depicted in the image below:
This project is still ongoing, but here are some interesting comparisons. Our experiments involve removing an object from a scene in the test set and seeing where a scene synthesis system places it. We then measure the quality of a placement by how far the proposed placement is from the true placement in the original scene. This experiment mirrors how we trained our system as well.
In the figure below, we show probability heatmaps from SceneGraphNet, SceneGen, and DeepGen. SceneGraphNet does not seem to work well at all on the Matterport test set, while SceneGen and DeepGen fare better. SceneGen is more decisive in its probability heatmap, but its high-probability area is far from the true placement of object F. DeepGen is less decisive than SceneGen, but its high-probability area covers the true placement.
Our system is still a work in progress. Despite the qualitative results shown above, SceneGen still places objects closer to their true placements, on average, compared to DeepGen. We are investigating new developments in computer vision, including vision transformers, to see whether these architectures yield more decisive and plausible results. We are also looking for performance improvements since, ultimately, we would like our system to be used in a production-level context. Fast heatmap generation and placement, as well as a low memory footprint, are high priorities for a more public-facing application.
If you would like more information on DeepGen, please feel free to contact me via this site or my email.