Enhanced Data Transfer Cooperating with Artificial Triplets for Scene Graph Generation

1. Introduction

Scene graphs have emerged as a pivotal representation for detailing semantic information within a visual scene, by specifying relationships between object pairs [1], [2]. This representation enables reasoning about visual content through the encoded spatial and logical details of object instances and their relations. In modern applications, scene graphs have become foundational for high-level visual tasks like activity parsing [3], image retrieval [1], visual understanding [4], and image captioning [5]. This paper delves into the scene graph generation (SGG) task, aiming to predict objects and their relations from visual input.

SGG models encounter two primary challenges when trained on common dataset [6]: first, the distinct long-tailed distribution of relations [7], [8], and second, the ambiguity caused by semantically similar relation classes (e.g., on/on back of/mounted on) [9]-[11]. The latter exacerbates the issue, as instances within a category may be annotated under multiple confusing classes. Such complexities often bias relation predictions in general SGG models, leading to low recall rates for rare predicate classes. While some unbiased SGG methods [7], [12]-[14] have addressed this, they often sacrifices performance on frequent classes. Hence, it is essential to consider these trade-offs to ensure the model performance on the majority of data is not compromised.

Recently, the training data modification approach has shown promising results for training an unbiased SGG model [9], [10]. Two major concepts for the modification are the addition of new predicate labels and reassignment of existing ones, which can efficiently improve rare class performance. We revisit these concepts through IETrans [10], a baseline data modification method. IETrans encompasses two modules: external transfer for label addition and internal transfer for label reassignment. Notably, the external transfer, while leveraging background triplets for augmentation, doesn’t fully exploit the available data. Given the compositional nature of relational triplets, inter-triplet augmentation appears worthwhile. Additionally, predicate reassignments in the internal transfer are not uniformly reliable. A human evaluation study [10] reveals that only 76% of transferred triplets are deemed reliable. The inconsistency in the degree of semantic confusion, even among identical predicates, suggests that an “entire” transfer strategy might not be optimal. Guided by these findings, our system seeks to address these shortcomings by extending the modification concepts in two key ways: improving upon the data addition process and enhancing the reassignment efficiency.

Our method introduces two novel modules: Feature Space Triplet Augmentation (FSTA) and Soft Transfer. FSTA dynamically creates artificial triplets during training. We can construct new data by enumerating triplet combinations subject-predicate-object' and subject'-predicate-object from a sampled mini-batch. Here, $x'$ denotes data not from the original triplets. These artificial triplets serve to regularize the relation classification module in the SGG model. We undersample the frequent classes in artificial triplets to shape their predicate distribution. Further, a biased prediction-based sampler selects the class label for $x'$. This design aims to often sample combinations that are hard to be predicted correctly for a biased model. A pre-trained generator synthesizes the corresponding features based on class labels. Besides, Soft Transfer refines label reassignment by implementing an instance-wise ranking and mapping mechanism. We first compute a reliability score for each reassigned sample from biased model predictions, then select low-scoring triplets for Soft Transfer. Subsequently, a non-binary predicate label is calculated by mapping the reliability score, allowing for finer control over semantic confusion by using this label probability instead of an entire reassignment.

FSTA notably boosts performance on rare classes with increased sample quantity and diversity. Conversely, Soft Transfer alleviates performance loss in frequent classes, a typical compromise when elevating rare class performances. In essence, while FSTA contributes to mean recall (mR) gain, Soft Transfer leads to the recall (R) gain. Collectively, these modules bring reduced performance trade-off that shown in the improved overall metrics, F1@K and Avg@K. Our model-agnostic method was evaluated on the VisualGenome dataset [6], using two types of general SGG models: MOTIF [15] and RelDN [16] with IETrans. In the predcls task, our system outperforms the baseline IETrans by a 3.1% and 7.0% relative gain on the F1@100 metric for MOTIF and RelDN, respectively. Figure 1 illustrates the balanced performance of our method.

Fig. 1 Accuracy comparison between FSTA, Soft Transfer, Full, and the baseline IETrans on Motif ($1^{st}$ row) and RelDN ($2^{nd}$ row). In the scatter plots (left), a larger dot size and a darker color represent higher F1@100 and AVG@100 scores, respectively. As shown in the bar plots (right), increased scores in the overall metrics (F1@100 and AVG@100) indicate the alleviated performance trade-off in our full method, consisting of two complementary modules.

To sum up, we make the following contributions:

We propose a novel, model-agnostic method for training a R/mR balanced SGG model. It integrates two complementary modules: FSTA and Soft Transfer, which enhance the baseline IETrans.
We conduct extensive experiments and discussions on VisualGenome and demonstrate the effectiveness of our system.

Open AccessEnhanced Data Transfer Cooperating with Artificial Triplets for Scene Graph Generation

Summary :

1. Introduction

2. Related Work

2.1 Biased and Unbiased Scene Graph Generation

2.2 Compositional Learning

3. Methodology

3.1 Preliminary Introduction: IETrans

3.2 Feature Space Triplet Augmentation

3.3 Soft Transfer

3.4 Implementation Details

4. Experiments

4.1 Dataset and Evaluation Protocol

4.2 Comparing to Other Methods

5. Discussions

5.1 Ablation Study

5.2 Sensitivity Analysis

5.3 Comparison with Real Data Resampling

5.4 Parameter Choices for FSTA

5.5 A Study on MP-Sampler

5.6 Feature Visualization

6. Conclusion

Acknowledgments

References

Appendix A: Full Sgcls and Sgdet Tasks Results

Appendix B: Qualitative Results

Appendix C: Results for Ablation Study (RelDN)

Appendix D: Results for Sensitivity Analysis (RelDN)

Appendix E: Results for FSTA Parameter Choices (RelDN)

Appendix F: Randomness of FSTA

Appendix G: Object Generator

Appendix H: Hyperparameter Details

Footnotes

Authors

Keyword

Latest Issue

Contents

Copyrights notice of machine-translated contents

Cite this

Links

Call for Papers

Submit to IEICE Trans.

Transactions NEWS

Popular articles

Open Access
Enhanced Data Transfer Cooperating with Artificial Triplets for Scene Graph Generation