1. Introduction
Document classification plays an important role in information retrieval and is widely applied in fields such as document management, news categorization, and sentiment analysis. However, data sparsity is a frequent issue in document classification. To meet the challenge of data deficiency, semi-supervised learning [1] and few-shot learning [2], [3] have been studied, in which model training is supervised by a small number of labeled samples. An even more extreme scenario uses only document contents and category names for classification, without any annotated data [4], [5].
To handle this challenging task, several works leverage word-level features for classification [4], [6]. However, concrete words can hardly express accurate semantic information, which increases the risk of misclassification. Another approach categorizes documents by comparing cosine similarities between document and category embeddings, also known as nearest neighbor classification [5], [7]-[9]. However, document embeddings do not necessarily lie around the embedding of their correct category in the embedding space, so aligning document and category embeddings to bridge the semantic gap is usually necessary, which incurs extra manual and computational costs.
In this paper, we predict category embeddings for unlabeled documents and compare these predicted category embeddings with predefined category embeddings for classification, so that there is no need to bridge the semantic gap between documents and predefined category names. To this end, we propose Con2Class, in which a pretrained language model (PLM) is leveraged to produce two types of embeddings, one from predefined category names and the other from unlabeled documents. These category embeddings form a unified space, in which the category of a document can be determined simply by searching for the nearest predefined category, without any alignment. Specifically, we feed documents that contain category names into the PLM to produce contextualized embeddings of the predefined categories from the last hidden layer at the positions of the category names. Meanwhile, we use manually designed templates containing prompts to guide the PLM to predict category embeddings for all the unlabeled documents from the last hidden layer at the [MASK] token in the template. In this way, the category embeddings of both category names and unlabeled documents become contextualized token embeddings that can be mapped onto a unified category embedding space, on which classification can be conducted simply by similarity comparison. Resolving the semantic gap between the two types of embeddings through alignment, as in X-Class [5] and SimPTC [9], is therefore unnecessary.
Nevertheless, the category embeddings predicted from the [MASK] token by the original PLM are not clearly separated in the embedding space, so further refinement of the embeddings is required. To deal with this issue, we adopt contrastive learning, which is known to be effective for capturing similarity relationships [10]-[12], to separate the predicted category embeddings into clusters. However, training the PLM with only the traditional contrastive learning loss inevitably hurts the encoding ability learned in the pretraining stage, leading to the collapse of the predefined category embeddings. Therefore, we propose a new contrastive learning loss function, named MLM-maintained contrastive loss, which incorporates a masked language modeling (MLM) loss term to maintain the encoding ability of the PLM, so that the quality of the predefined category embeddings is preserved.
Another challenge arises when no annotated data are accessible: there is no validation set for checking the classification ability of the model during training. To resolve this problem, we propose the confidence factor, an indicator that estimates intermediate classification performance. We record the confidence factor every fixed number of training steps and retain the model with the highest confidence factor to predict category embeddings from unlabeled documents and category names. We name this process confidence-driven contrastive learning.
After confidence-driven contrastive learning, each document can be assigned a pseudo label by searching for the most semantically similar predefined category name. The pseudo labels are used to finetune the final classifier, but only confident pseudo labels are used, and they are progressively added to alleviate the influence of noisy pseudo labels introduced in earlier stages.
The main contributions of this paper are summarized as follows:
- We introduce a method to train a PLM to predict category embeddings from category names and unlabeled documents, based on which classification can be easily conducted by similarity comparison without the need for embedding alignment.
- Confidence-driven contrastive learning is proposed to refine the category embeddings, in which MLM-maintained contrastive loss is introduced to improve category separation while preventing the collapse of category embeddings, and a confidence factor is introduced to estimate the classification ability of the model.
- A progressive self-training strategy is introduced to alleviate the problem of noise amplification during self-training.
- Evaluation results on five representative datasets show that our proposed Con2Class exceeds known models by 5.8% and 5.6% on average in macro and micro F1-scores, respectively.
The rest of this paper is organized as follows: Section 2 covers related work. Section 3 formulates the task and introduces our proposed method Con2Class. Section 4 presents the experiments and results. Section 5 presents analysis and discussion of Con2Class. Section 6 concludes the paper and discusses future directions.
2. Related Work
Pretrained language models (PLMs) are rooted in transfer learning, which was originally proposed to solve new problems using limited labeled samples [13]. Transformer-based PLMs such as BERT [14] are pretrained with objectives such as next sentence prediction (NSP) and masked language modeling (MLM) to obtain linguistic knowledge from huge general corpora, so that they can achieve remarkable results on specific tasks by finetuning on limited task-specific data [13], [15], [16].
NSP is designed to distinguish whether two sentences are arranged in the correct order, using the output of the [CLS] token. The [CLS] token aggregates information from all the words in the sentence and is commonly used for sentence-level representation and classification.
Given a document \(\boldsymbol{d}\) in which several words are masked by [MASK] tokens, MLM aims to predict, over the whole vocabulary, the likelihood that each word \(w\) appears at the [MASK] position:
\[\begin{aligned} p(w|\boldsymbol{d})=\mathit{softmax}(W_{2}\sigma (W_{1}\boldsymbol{d} + \boldsymbol{b})), \end{aligned}\]
where \(W_{1}\), \(W_{2}\) and \(\boldsymbol{b}\) are trainable parameters and \(\sigma (\cdot)\) is the sigmoid function. By virtue of the MLM training objective, prompt learning, which leverages templates containing prompting text to probe the knowledge of PLMs, has become prevalent owing to its easy deployment and lighter requirement for annotated samples [17]. This paradigm has drawn attention in a wide range of fields such as text classification [18], [19], summarization [20] and question answering [21]. Research on prompt improvement is also widely conducted [3], [22]-[25].
Through the Transformer architecture [26], PLMs can capture high-order, long-range dependencies over texts and generate context-aware word representations, known as contextualized word embeddings, which are more informative than traditional word embeddings such as Word2Vec [27] or Glove [28]. Additionally, by leveraging techniques such as Siamese network [29], prompt learning [11] and contrastive learning [10], [12], PLMs can also generate informative sentence embeddings, which can be used for classification, natural language inference and semantic similarity tasks [10].
Contrastive learning. To better capture the semantic relationship between sentences, contrastive learning is widely adopted in unsupervised settings. ConSERT [12] proposes four augmentation strategies, namely adversarial attack, token shuffling, cutoff, and dropout, to construct positive pairs, and captures semantic relationships through average pooling over BERT’s last hidden layer. SimCSE [10] employs the [CLS] token to learn semantic relationships between documents, in which standard dropout [30] is applied twice to one document to generate two different embeddings used as a positive pair. Embeddings produced from different documents are regarded as negative pairs in its contrastive learning. RTS [31] adopts a similar contrastive learning strategy to SimCSE, but positive pairs are constructed from two segments randomly taken from one document. To avoid interference from document length, ESimCSE [32] improves SimCSE by applying word repetition to vary the length of documents while retaining their original meaning. Additionally, momentum contrast is proposed to expand the set of negative pairs for more refined contrastive learning. PromptBERT [11] utilizes text templates to learn semantic relationships through a [MASK] token. Templates are manually designed to fit the task objectives, and multiple embeddings are extracted from the same sentence to form positive pairs.
Classification using only category names. The approach of nearest neighbor search is widely adopted in category-name-only classification. One of the most representative works is Prototypical Networks [33] in the field of computer vision, in which images are classified by comparing similarities with class prototypes. In text classification, analogously, documents are classified by comparing semantic similarities with category names, so document and category representations are essential to the success of classification.
Table 1 summarizes the differences between existing category-name-only classification methods and our proposed Con2Class in terms of embedding generation methods, alignment, contrastive learning, etc. Traditional encoding methods such as Bag-of-Words (BoW) [27], Explicit Semantic Analysis (ESA) [34], Glove [28] and Latent Semantic Analysis (LSA) [35] are adopted in early works such as Dataless [7] and the method introduced in [8] to encode documents and category names. However, these encoding methods can hardly capture the contextuality of semantic representations. Recently, PLMs have become popular for text representation generation owing to their strong ability to capture contextual relationships. RTS [31] employs the [CLS] token of MPNet [36] to obtain embeddings for documents and categories, but it works only when category descriptions are available, since a single word or phrase can hardly provide enough semantic information for encoding. X-Class [5] constructs embeddings by a weighted average of contextualized token embeddings, but in this way, document embeddings are not necessarily placed around their category embeddings, so an alignment step using a Gaussian Mixture Model (GMM) [37] is necessary. SimPTC [9] fills category names into hand-crafted templates to construct sentences such that the embeddings of both documents and categories can be represented by the [CLS] token. GMM-based alignment is also conducted in SimPTC.
Apart from nearest neighbor search, there are also methods that categorize documents using word-level information, either by counting term frequencies as in ConWea [6] or by matching category-related words as in LOTClass [4]. However, concrete words are less informative than dense embedding vectors, resulting in a higher level of misclassification.
3. Task Definition of Category-Name-Only Classification and Methodology
Suppose a collection of documents \(X = \{x_{1}, x_{2}, \ldots, x_{n}\}\) is associated with a collection of category names \(C = \{c_{1}, c_{2}, \ldots, c_{m}\}\) by a mapping \(M:X\mapsto C\). The goal of category-name-only classification is to train a prediction model that predicts \(M\), but no training data that associate documents \(X\) with category names \(C\) are given.
In this paper, we propose a nearest neighbor search-based model, Con2Class, to solve this challenging task. Figure 1 depicts the conceptual architecture of Con2Class. We first finetune a pretrained language model (PLM), BERT-em, through confidence-driven contrastive learning to produce embeddings for all the predefined category names and to predict category embeddings for all the unlabeled documents in a unified category embedding space. Then, pseudo labels are assigned by searching for the closest predefined category embedding in this unified category embedding space, without the need for alignment. Finally, a progressive self-training strategy is introduced to train another PLM classifier, BERT-cl, for the final prediction. Note that Con2Class is PLM-agnostic; in this paper, we employ the representative PLM BERT [14] for fair comparison with other methods.
3.1 Category Embedding Production
As introduced in Sect. 2, nearest neighbor search-based methods usually compare similarities between documents and categories, which requires alignment to narrow the semantic gap between document embeddings and category embeddings. In this paper, instead, we produce predefined category embeddings and predicted category embeddings in a unified category embedding space, such that similarity comparison can be conducted without the need for alignment. Figure 2 illustrates how predefined category embeddings and predicted category embeddings are produced when the template “[X] is about [MASK]” is used, in which [X] refers to the raw document.
Predefined category embedding. RTS [31] regards category descriptions as documents and generates predefined category embeddings through the [CLS] token. However, it is impractical to obtain informative category embeddings from the [CLS] token under our setting, where only category names are available. SimPTC [9] fills category names into hand-crafted templates to construct anchor sentences with the aim of obtaining predefined category embeddings from the [CLS] token. However, the semantics of these anchor sentences deviate to some extent from the original meanings of the category names, which is undesirable for nearest neighbor search-based classification. Therefore, we employ a BERT model named BERT-em to produce contextualized category embeddings, leveraging the unlabeled documents that contain at least one category name. Suppose the category name \(c_{j}\) appears a total of \(q_{j}\) times in the corpus. The embedding of \(c_{j}\) is calculated by averaging the last-hidden-layer states over all occurrences of \(c_{j}\) in the corpus:
\[\begin{align} \boldsymbol{e}_{c_{j}} = \frac{1}{q_{j}} \sum\nolimits_{p=1}^{q_{j}} \boldsymbol{h}_{c_{j}}^{p}, \tag{1} \end{align}\]
where \(\boldsymbol{h}_{c_{j}}^{p}\) is the last-hidden-layer state of \(c_{j}\) at its \(p\)-th occurrence in the corpus. This can be extended to category names that are phrases by averaging the hidden states over all the words in the phrase.
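To make Eq. (1) concrete, the sketch below shows one way it could be implemented with the HuggingFace transformers library; the model name, tokenization details and per-document processing are illustrative assumptions rather than the exact implementation.

```python
# Minimal sketch of Eq. (1): average the last-hidden-layer states of a category
# name over all of its occurrences in the corpus. Model name, tokenization and
# per-document processing are illustrative assumptions.
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert_em = BertModel.from_pretrained("bert-base-uncased").eval()

def predefined_category_embedding(category_name, documents):
    cat_ids = tokenizer(category_name, add_special_tokens=False)["input_ids"]
    states = []
    with torch.no_grad():
        for doc in documents:
            enc = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512)
            ids = enc["input_ids"][0].tolist()
            # positions where the (possibly multi-token) category name occurs
            starts = [s for s in range(len(ids) - len(cat_ids) + 1)
                      if ids[s:s + len(cat_ids)] == cat_ids]
            if not starts:
                continue
            hidden = bert_em(**enc).last_hidden_state[0]   # (seq_len, 768)
            for s in starts:
                # for phrases, average over the tokens of the category name
                states.append(hidden[s:s + len(cat_ids)].mean(dim=0))
    return torch.stack(states).mean(dim=0) if states else None   # e_{c_j}
```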
Category embedding prediction. To facilitate similarity comparison with the predefined category embeddings, we predict category embeddings of unlabeled documents in the same unified category embedding space. Inspired by PromptBERT [11], we manually design a template containing a prompt, in which a [MASK] token is included, and predict the category embedding \(\boldsymbol{e}_{i}\) of document \(x_{i}\) from the last hidden layer at the [MASK] token, \(\boldsymbol{h}_{\mathrm{[MASK]}}\):
\[\begin{align} \boldsymbol{e}_{i}=\boldsymbol{h}_{\mathrm{[MASK]}}. \tag{2} \end{align}\]
In this way, both predefined category embeddings and predicted category embeddings are contextualized token embeddings generated by MLM prediction, and their proximities can be measured without the need for alignment. Note that PromptBERT does not consider using a single model such as BERT-em to generate category embeddings for both documents and category names.
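A companion sketch for Eq. (2), under the same assumptions as above: the document is wrapped into the template “[X] is about [MASK]” and the last-hidden-layer state at the [MASK] position is taken as the predicted category embedding.

```python
# Minimal sketch of Eq. (2): predict a document's category embedding from the
# [MASK] position of the template "[X] is about [MASK]". BertForMaskedLM is used
# so that the same model can later be trained with the MLM loss term of Sect. 3.2.
import torch
from transformers import BertTokenizerFast, BertForMaskedLM

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert_em = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def predicted_category_embedding(document):
    # truncate the document first so the template and its [MASK] are never cut off
    doc_ids = tokenizer(document, add_special_tokens=False, truncation=True,
                        max_length=480)["input_ids"]
    text = f"{tokenizer.decode(doc_ids)} is about {tokenizer.mask_token}"
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = bert_em(**enc, output_hidden_states=True)
    last_hidden = out.hidden_states[-1][0]                         # (seq_len, 768)
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    return last_hidden[mask_pos]                                   # e_i = h_[MASK]
```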
3.2 Confidence-Driven Contrastive Learning
Refining predicted category embeddings. The template “[X] is about [MASK]” in Fig. 2 guides the BERT-em model to predict category embeddings of unlabeled documents in the category embedding space. Nevertheless, the category embeddings predicted by the raw BERT-em model are not clearly separated by category, which makes it difficult to determine category membership by nearest neighbor search. Therefore, we finetune the BERT-em model by contrastive learning, which has been shown to effectively capture similarity relationships between documents [10]-[12], to separate the predicted category embeddings in the category embedding space.
Figure 3 shows the schema of our contrastive learning. As introduced in Sect. 2, there are many ways to construct positive and negative document pairs for contrastive learning. In our work, we follow SimCSE [10], a simple and effective method, and construct document pairs leveraging standard dropout [30]. Each document is fed into BERT-em twice, and the two outputs generated with different dropout masks constitute a positive pair. The outputs of different documents within a mini-batch, which are fed into BERT-em at the same time, constitute negative pairs. The training target is to push apart the predicted category embeddings of different documents (negative pairs) and to pull close the predicted category embeddings of the same document under different dropout masks (positive pairs).
Suppose \(x_{i}\) and \(x_{j}\) are documents in a mini-batch of size \(B\), and \(\boldsymbol{h}_{i}^{z}\) and \(\boldsymbol{h}_{j}^{z'}\) are the predicted category embeddings of \(x_{i}\) and \(x_{j}\) under different dropout masks \(z\) and \(z'\). The contrastive learning loss for one mini-batch of size \(B\) is:
\[\begin{align} \mathcal{L}_{CL} = \frac{1}{B}\sum\nolimits_{i=1}^B -\log \frac{e^{\mathit{sim}\left(\boldsymbol{h}_{i}^{z},\boldsymbol{h}_{i}^{z'}\right)/\tau}} {\sum\nolimits_{j=1}^B e^{\mathit{sim}\left(\boldsymbol{h}_{i}^{z},\boldsymbol{h}_{j}^{z'}\right)/\tau}}, \tag{3} \end{align}\]
where \(\tau\) is a temperature hyperparameter.
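A minimal sketch of Eq. (3), assuming SimCSE-style in-batch construction and cosine similarity as sim(·,·); `mask_embeddings` is a placeholder for the [MASK]-position encoder of Sect. 3.1 with dropout kept active (model in train mode).

```python
# Minimal sketch of the contrastive loss in Eq. (3). `mask_embeddings` is assumed
# to map a list of documents to their (B, 768) [MASK]-position embeddings with
# dropout active; calling it twice yields the two dropout masks z and z'.
import torch
import torch.nn.functional as F

def contrastive_loss(mask_embeddings, batch_texts, tau=1.0):
    h_z = mask_embeddings(batch_texts)     # (B, 768), dropout mask z
    h_z2 = mask_embeddings(batch_texts)    # (B, 768), dropout mask z'
    # pairwise cosine similarities: entry (i, j) = sim(h_i^z, h_j^z')
    sim = F.cosine_similarity(h_z.unsqueeze(1), h_z2.unsqueeze(0), dim=-1) / tau
    labels = torch.arange(sim.size(0))     # positives lie on the diagonal
    return F.cross_entropy(sim, labels)    # equals Eq. (3) averaged over the batch
```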
However, naively applying contrastive learning to separate the predicted category embeddings inevitably hurts the encoding ability learned from the MLM pretraining objective, which can collapse the predefined category embeddings and deteriorate nearest neighbor search.
For this reason, we improve the loss function by adding the original MLM loss of BERT to the contrastive learning loss, aiming to maintain the encoding ability of BERT-em during contrastive learning and thus prevent the collapse of the predefined category embeddings:
\[\begin{align} \mathcal{L}=(1-\alpha)\mathcal{L}_{CL}+\alpha \mathcal{L}_{MLM}. \tag{4} \end{align}\] |
We name the loss function of Eq. (4) MLM-maintained contrastive loss, where \(\alpha\) is a hyperparameter weighting the two losses.
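The combined objective of Eq. (4) could then be computed as sketched below. The MLM term reuses BERT's own masked-token objective on the same mini-batch; the 15% masking rate is an assumption carried over from BERT pretraining rather than a detail specified here.

```python
# Minimal sketch of the MLM-maintained contrastive loss in Eq. (4).
# `bert_em` is a BertForMaskedLM and `contrastive_loss` is the sketch above;
# the 15% masking rate is an assumption borrowed from BERT pretraining.
import torch
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
mlm_collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

def mlm_maintained_loss(bert_em, mask_embeddings, batch_texts, alpha=0.1, tau=1.0):
    l_cl = contrastive_loss(mask_embeddings, batch_texts, tau)
    enc = tokenizer(batch_texts, return_tensors="pt", padding=True,
                    truncation=True, max_length=512)
    masked = mlm_collator([{"input_ids": ids} for ids in enc["input_ids"]])
    l_mlm = bert_em(input_ids=masked["input_ids"],
                    attention_mask=enc["attention_mask"],
                    labels=masked["labels"]).loss
    return (1 - alpha) * l_cl + alpha * l_mlm    # Eq. (4)
```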
Prediction confidence. We train the BERT-em model to produce category embeddings from both predefined category names and unlabeled documents, on which a classifier based on nearest neighbor search is built. However, there are no annotated data that can be used as a validation set to check classification accuracy during training, so we need a substitute for estimating the classification ability of the model. Intuitively, a good classifier should be able to clearly distinguish the category that a document belongs to. We therefore derive a classification quality indicator from the distribution of confidence scores, for which we propose the confidence factor (\(CF\)) to estimate classifier performance. To calculate \(CF\), we first evaluate the semantic similarities between predicted category embeddings and predefined category embeddings. Cosine similarity is adopted to evaluate the similarity between the predicted category embedding of document \(x_{i}\) and the predefined category embedding of category \(c_{j}\):
\[\begin{align} s_{i}^{j}=\cos(\boldsymbol{e}_{i},\boldsymbol{e}_{c_{j}}). \tag{5} \end{align}\]
The predicted probability distribution \(\boldsymbol{p}_{i}\in\mathbb{R}^{m}\) of \(x_{i}\) is then calculated by z-score normalization and the softmax function applied to the similarity vector \(\boldsymbol{s}_{i}=[s_{i}^{1}, s_{i}^{2}, \ldots, s_{i}^{m}]\):
\[\begin{align} & z(\boldsymbol{s}_{i}) = \frac{\boldsymbol{s}_{i}-\mu }{\sigma}, \tag{6} \\ & \boldsymbol{p}_{i} = \mathit{softmax}(\mathit{z}(\boldsymbol{s}_{i})), \tag{7} \end{align}\]
where \(\mu\) and \(\sigma\) are the mean and standard deviation of \(\boldsymbol{s}_{i}\). Z-score normalization here sharpens the probability distribution so that the category a document belongs to stands out more clearly.
We define document-wise confidence score (\(CS\)) to evaluate the confidence of a single prediction on document \(x_{i}\):
\[\begin{align} CS_{i}=\max \boldsymbol{p}_{i}. \tag{8} \end{align}\]
The confidence factor is then calculated by averaging the confidence scores of all the documents in \(X\):
\[\begin{align} CF=\frac{1}{n}\sum\nolimits_{i=1}^n {CS}_{i}. \tag{9} \end{align}\]
Note that \(CF\) becomes constant in the case of binary classification (\(m=2\)) and loses its role, because z-score normalization always maps two distinct similarity values to \(\pm 1\) regardless of their magnitudes. We therefore skip z-score normalization and directly apply the softmax function to the similarity vector \(\boldsymbol{s}_{i}\), as Eq. (10) presents, to calculate the probability distribution for binary classification:
\[\begin{align} \boldsymbol{p}_{i}=\mathit{softmax}(\boldsymbol{s}_{i}). \tag{10} \end{align}\]
We train the BERT-em model for a total of \(\Phi\) training steps using unlabeled documents iteratively, and evaluate \(CF\) every \(\varphi\) training steps, such that there are a total of \(\lfloor \Phi/\varphi \rfloor\) checks throughout the training, where \(\lfloor \cdot \rfloor\) is the floor function. Finally, the model having the highest \(CF\) is used to produce predefined category embeddings and predicted category embeddings for label prediction. We name the above training process confidence-driven contrastive learning.
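The confidence factor of Eqs. (5)-(9), including the binary special case of Eq. (10), could be computed as in the sketch below; tensor shapes and the use of precomputed embedding matrices are illustrative assumptions.

```python
# Minimal sketch of the confidence factor (Eqs. (5)-(10)).
# doc_embs: (n, 768) predicted category embeddings of all documents;
# cat_embs: (m, 768) predefined category embeddings. Shapes are illustrative.
import torch
import torch.nn.functional as F

def confidence_factor(doc_embs, cat_embs):
    # Eq. (5): cosine similarity between every document and every category
    s = F.cosine_similarity(doc_embs.unsqueeze(1), cat_embs.unsqueeze(0), dim=-1)
    if cat_embs.size(0) == 2:
        p = F.softmax(s, dim=-1)                        # Eq. (10): skip z-score
    else:
        z = (s - s.mean(dim=-1, keepdim=True)) / s.std(dim=-1, keepdim=True)
        p = F.softmax(z, dim=-1)                        # Eqs. (6)-(7)
    cs = p.max(dim=-1).values                           # Eq. (8): per-document CS
    return cs.mean().item()                             # Eq. (9): CF

# During training, CF is evaluated every varphi steps and the checkpoint with the
# highest CF is kept for producing the final category embeddings.
```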
3.3 Progressive Self-Training by Pseudo Labels
Progressive self-training executes a cycle of classifier training and pseudo label generation to progressively improve classification performance. To train the classifier, we employ a separate BERT model, BERT-cl, on which a classifier head is added. We finetune the BERT-cl classifier following the procedure in [14]. At the initial stage of self-training, the BERT-cl classifier is finetuned under the supervision of the pseudo labels generated by confidence-driven contrastive learning. From the second self-training loop onward, the pseudo labels and confidence scores are produced from the probability distribution predicted by the BERT-cl classifier itself. Since noisy pseudo labels indicating incorrect categories may exist, we add a confidence regularizer \(\mathcal{R}(\theta)\), weighted by \(\lambda\), to the classical Kullback-Leibler divergence for finetuning the BERT-cl classifier, as in [38], to avoid overfitting to those noisy pseudo labels. Thus, the loss function for BERT-cl classifier finetuning is:
\[\begin{align} & \mathcal{L}_{FT}(\theta;\tilde{\boldsymbol{y}}) = \mathcal{D}_{KL}(\tilde{\boldsymbol{y}}|| f(x;\theta)) + \lambda \mathcal{R}(\theta),\nonumber\\ & \mathcal{R}(\theta) = \mathcal{D}_{KL}(\boldsymbol{u}|| f(x;\theta)), \tag{11} \end{align}\]
where \(\tilde{\boldsymbol{y}}\) refers to the pseudo label and \(\boldsymbol{u}\in \mathbb{R}^{m}\) is the uniform distribution with \(u_{i}=1/m\) for \(i=1,2,\ldots,m\).
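A minimal sketch of the finetuning loss in Eq. (11), assuming the classifier outputs logits and the pseudo labels are given as (one-hot or soft) distributions:

```python
# Minimal sketch of the confidence-regularized finetuning loss in Eq. (11).
# logits: (B, m) outputs of the BERT-cl classifier head; pseudo_labels: (B, m)
# pseudo-label distributions (one-hot or soft). Shapes are illustrative.
import torch
import torch.nn.functional as F

def self_training_loss(logits, pseudo_labels, lam=0.1):
    log_p = F.log_softmax(logits, dim=-1)                      # log f(x; theta)
    uniform = torch.full_like(log_p, 1.0 / logits.size(-1))    # u with u_i = 1/m
    kl_pseudo = F.kl_div(log_p, pseudo_labels, reduction="batchmean")
    kl_uniform = F.kl_div(log_p, uniform, reduction="batchmean")   # R(theta)
    return kl_pseudo + lam * kl_uniform                        # Eq. (11)
```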
Since the overall accuracy of the pseudo labels is relatively low at the initial stage, we introduce a progressive training strategy such that the top \((P + 10 \times t)\) percent of predictions with the highest confidence scores are used for BERT-cl classifier finetuning in the \(t\)-th self-training loop, where \(P\) is a predefined hyperparameter, as sketched below. In this way, only a small portion of predictions with high-confidence pseudo labels is leveraged at the beginning of self-training, and the set of pseudo-labeled documents expands as self-training proceeds. To avoid amplifying noise from wrongly labeled documents in previous loops, BERT-cl classifier finetuning always restarts from the original pretrained BERT model at each self-training loop. Self-training terminates when it reaches the maximum iteration count \(T\).
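The progressive selection itself can be sketched as follows; the per-document arrays are illustrative placeholders.

```python
# Minimal sketch of progressive pseudo-label selection: in loop t, keep the top
# (P + 10 * t) percent of documents ranked by confidence score for BERT-cl
# finetuning. `confidences` and `pseudo_labels` are illustrative numpy arrays.
import numpy as np

def select_for_loop(confidences, pseudo_labels, t, P=40):
    keep_ratio = min((P + 10 * t) / 100.0, 1.0)
    n_keep = int(len(confidences) * keep_ratio)
    order = np.argsort(confidences)[::-1]        # most confident first
    kept = order[:n_keep]
    return kept, pseudo_labels[kept]

# Each loop restarts BERT-cl from the original pretrained checkpoint, finetunes it
# on the selected subset, then regenerates confidences and pseudo labels.
```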
4. Experiments
4.1 Datasets
We conduct experiments on five representative datasets: AGNews [39], 20News [40], Yelp [39], IMDB [41] and DBpedia [39], covering the fields of news, review and ontology. The features and statistics are listed in Table 2.
4.2 Baseline Methods
We compare our proposed Con2Class with several advanced methods targeting the same task of classification using only category names as supervision.
WESTCLASS [42] assumes that words and documents share a joint semantic space. Pseudo-labeled documents are generated from the semantic space and then leveraged for classifier training in a self-training manner. ConWea [6] uses contextualized representations and seed word information to distinguish the different senses of a word and create a contextualized corpus, which is used to train the classifier and expand the seed words iteratively. LOTClass [4] leverages pretrained language models to generate a category vocabulary and then searches for category-indicative words by matching replaceable words against the category vocabulary as training supervision. X-Class [5] generates class-oriented representations for documents and aligns documents to classes through clustering. SimPTC [9] embeds category names in templates to construct anchor sentences, and classification is implemented by clustering texts in the embedding space. Following the setting of the original paper, we run SimPTC with the RoBERTa-large model on our datasets. Additionally, we present the result of supervised learning, in which a BERT classifier is finetuned under a 2-fold cross-validation setting.
4.3 Experimental Settings
Since the domains of the benchmark corpora are diverse, we manually design different templates to adapt to the different types of corpora. As shown in Table 3, the classification tasks on the five datasets are roughly divided into two types: topic classification (AGNews, 20News and DBpedia) and sentiment analysis (Yelp and IMDB). With PET [18] and PromptBERT [11] as references, one template is designed for each subtask in our experiments.
All the experiments are conducted on a single NVIDIA GeForce RTX 3090 GPU. For both BERT-em and BERT-cl, the ‘bert-base-uncased’ BERT model is adopted. The hyperparameter \(\alpha\) in the contrastive learning loss is set to 0.1 in all the experiments, because BERT-em has already been trained with the MLM objective in the pretraining stage and we only need to maintain its MLM prediction ability during confidence-driven contrastive learning. The temperature hyperparameter \(\tau\) is set to 1 in all the experiments. Due to GPU memory limitations, we set the mini-batch size to \(B=10\) in contrastive learning and update the BERT-em parameters every 10 mini-batches, which constitutes one training step. Under this setting, the effective batch size of confidence-driven contrastive learning is 100. Since the DBpedia dataset has many more categories than the other datasets, we conduct confidence-driven contrastive learning for a total of \(\Phi = 100{,}000\) training steps on DBpedia and \(\Phi = 10{,}000\) training steps on the other datasets. The confidence factor is evaluated every \(\varphi = 10{,}000\) training steps for DBpedia and every \(\varphi = 1{,}000\) training steps for the other datasets, such that the confidence factor is evaluated a total of 10 times throughout confidence-driven contrastive learning for all the datasets. The batch size of BERT-cl classifier finetuning is set to 24 and the BERT-cl parameters are updated every 5 batches. The learning rates of contrastive learning and BERT-cl classifier finetuning are 3e-5 and 1e-5, respectively. Since larger datasets may introduce more noise into the pseudo-labeled data, we set \(P=20\) for DBpedia because of its large size and \(P=40\) for the other datasets. The weighting hyperparameter \(\lambda\) is set to 0.1, and the BERT-cl classifier is finetuned for 5 epochs on DBpedia and 10 epochs on the other datasets, also in consideration of dataset sizes. Progressive self-training is conducted for \(T=3\) iterations in our experiments.
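For reference, the settings above can be summarized in a single configuration sketch (values taken from this subsection; the dictionary layout itself is only illustrative):

```python
# Summary of the hyperparameters described above; DBpedia-specific values are
# noted in comments. The dictionary itself is only an illustrative layout.
CONFIG = {
    "plm": "bert-base-uncased",     # backbone for both BERT-em and BERT-cl
    "alpha": 0.1,                   # weight of the MLM term in Eq. (4)
    "tau": 1.0,                     # temperature in Eq. (3)
    "cl_mini_batch": 10,            # updated every 10 mini-batches -> effective 100
    "cl_steps": 10_000,             # Phi; 100,000 for DBpedia
    "cf_interval": 1_000,           # varphi; 10,000 for DBpedia
    "cl_lr": 3e-5,                  # learning rate of contrastive learning
    "cls_batch": 24,                # BERT-cl batch size, updated every 5 batches
    "cls_lr": 1e-5,                 # learning rate of BERT-cl finetuning
    "P": 40,                        # initial top-percent; 20 for DBpedia
    "lambda": 0.1,                  # weight of the confidence regularizer
    "cls_epochs": 10,               # 5 for DBpedia
    "T": 3,                         # number of self-training iterations
}
```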
Since further pretraining helps improve classification performance [15], we further pretrain the BERT-cl model for 10 epochs on the DBpedia dataset and 100 epochs on the other datasets before BERT-cl classifier finetuning. Since there is a large gap in style between human reviews and the original pretraining data of BookCorpus [43] and Wikipedia, we also further pretrain the BERT-em model before confidence-driven contrastive learning on the Yelp and IMDB datasets. For all the datasets, we execute five runs and compare average scores.
4.4 Experimental Results
We compare our proposed Con2Class with recent advanced methods in terms of macro-F1 and micro-F1 scores, among which LOTClass [4], X-Class [5], and SimPTC [9] share the state-of-the-art (SOTA) records across the datasets. Apart from the full version of Con2Class, we additionally evaluate the performance of confidence-driven contrastive learning alone, in which progressive self-training is removed from Con2Class, to demonstrate the classification ability obtained by confidence-driven contrastive learning by itself.
According to the results in Table 4, Con2Class surpasses all the baselines on the five datasets. The results on the Yelp and IMDB datasets even approach those of supervised BERT finetuning, demonstrating the strong ability of Con2Class on classification using category names only.
Notably, on the Yelp dataset, the results of confidence-driven contrastive learning alone already exceed those of the baselines and are further improved by the succeeding progressive self-training. This could be attributed to the clear and distinct meanings of its category names “Good” and “Bad”.
In the ablation study, we remove the MLM loss \(\mathcal{L}_{MLM}\) from Eq. (4) for both confidence-driven contrastive learning and Con2Class. According to the results, the classification ability of confidence-driven contrastive learning degrades after \(\mathcal{L}_{MLM}\) is removed, which further affects progressive self-training, reflected in the performance drop of Con2Class. Interestingly, on the 20News dataset the result of confidence-driven contrastive learning is almost unchanged after \(\mathcal{L}_{MLM}\) is removed, but the performance of Con2Class drops remarkably. This reflects the fact that the effect of progressive self-training is not governed solely by the accuracy over all samples, but is directly related to the accuracy of the top-ranked high-confidence pseudo-labeled samples.
The time cost of confidence-driven contrastive learning is affected by preset hyperparameters such as the total number of training steps \(\Phi\) and the interval \(\varphi\) between two confidence factor checks. With \(\Phi = 10{,}000\) and \(\varphi = 1{,}000\) on the AGNews, 20News, IMDB and Yelp datasets, confidence-driven contrastive learning takes about 3 hours. With \(\Phi = 100{,}000\) and \(\varphi = 10{,}000\) on the DBpedia dataset, the time consumption increases with the number of training steps, reaching 30 hours. The time cost of progressive self-training varies with the scale of the corpus, ranging from 1 hour on the smallest dataset, 20News, to 12 hours on the largest dataset, DBpedia.
5. Discussion and Analysis
5.1 Insight of Confidence-Driven Contrastive Learning
To explore how the classification ability changes during the confidence-driven contrastive learning, we evaluate macro-F1, micro-F1 and the confidence factor after every training step on the 20News dataset.
The evaluation results for the first 100 training steps are shown in Fig. 4. The F1 scores are at a low level before any training is done. They then increase rapidly over the first few training steps and fluctuate around a relatively high level after 20 training steps. The confidence factor slightly drops at first and then increases along with the F1 scores. After 20 training steps, it fluctuates and exhibits a variation trend similar to that of the F1 scores, validating the effectiveness of using our proposed confidence factor to estimate classification ability.
There is a noticeable drop in the confidence factor at the beginning of training. It can be attributed to the fact that the initial predicted category embeddings are not discriminative enough, so that a large number of predicted category embeddings lie unexpectedly close to a wrong predefined category in the embedding space, causing low F1 scores. As training proceeds, predicted category embeddings move away from wrong predefined categories and approach the correct ones. This process causes a slight decrease in the confidence factor at the early stage, as predicted category embeddings leave the wrong categories, followed by a subsequent boost in the confidence factor as they move towards the correct categories.
5.2 Sensitivity Analysis
Hyperparameter \(\boldsymbol{\alpha}\). The loss function of confidence-driven contrastive learning involves the hyperparameter \(\alpha\), which weights the contrastive learning loss \(\mathcal{L}_{CL}\) and the MLM loss \(\mathcal{L}_{MLM}\). We vary \(\alpha\) over [0.1, 0.3, 0.5, 0.7, 0.9] to explore how it affects classification performance. Tests are conducted on all five datasets and the results are shown in Fig. 5.
According to the results, the classification performance does not show remarkable differences across values of \(\alpha\) on the Yelp, IMDB and 20News datasets. On the AGNews dataset, however, the performance changes little until \(\alpha\) reaches 0.9, where an obvious performance drop occurs. On the DBpedia dataset, F1 scores decrease steadily as \(\alpha\) rises.
Considering the ablation result that removing \(\mathcal{L}_{MLM}\) leads to a performance drop, a small \(\alpha\) is preferable for our proposed Con2Class.
Manually designed templates. As introduced in Sect. 4.3, we design two templates for the two subtasks, respectively, to predict category embeddings. In this part, we investigate how templates affect the classification results.
Table 5 lists all the templates used in this investigation, which are designed with reference to PET [18] and PromptBERT [11]. T1 refers to the template adopted in our main experiments, designed with the style of the corresponding subtask in mind. T2-T4 are three additional templates designed in the same way. Additionally, to explore the case where the template is poorly matched to the corpus, we swap the T1 templates of the two subtasks, denoted T5.
Figure 6 depicts the results of using different templates in our model. T1-T4, which we carefully designed with respect to the type of the corpus, show no remarkable differences in classification performance. On the other hand, an obvious performance drop is observed on the first four datasets when the template is swapped with that of the other subtask (T5). This result is as we expected, because MLM prediction relies heavily on context, and with an irrelevant template, BERT-em can hardly produce embeddings that correctly represent the categories of documents.
The DBpedia dataset is not much influenced by template T5. This can be attributed to the fact that each document in DBpedia is a paragraph describing an ontology, so the fifth template “[X] All in all, it was [MASK].” can also work on this type of content.
5.3 Embedding Space Visualization
The performance of nearest neighbor classification is greatly influenced by the distribution of documents and categories in the embedding space. Intuitively, it is desirable that documents cluster compactly by category and are clearly separated from other categories.
In this part, we use t-SNE [44] to visualize the embedding space for five different methods: average pooling, X-Class, SimPTC, Con2Class and Con2Class without the MLM loss. Figure 7 shows the results on the 20News and DBpedia datasets. For each dataset, we randomly sample 10,000 documents for visualization, and different categories are distinguished by different colors. Each dot refers to one document and each diamond refers to one category. For the average pooling method, documents are represented by the average of the BERT output layer and each category is represented by its contextualized embedding encoded by BERT. X-Class produces document and category embeddings by a weighted average of contextualized embeddings, and SimPTC obtains document and category embeddings from the last hidden layer of BERT's [CLS] token. For Con2Class and Con2Class without the MLM loss, each document is represented by its predicted category embedding and each category is represented by its contextualized category embedding.
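The visualization follows standard t-SNE practice; a minimal sketch with scikit-learn and matplotlib (sampling and plotting details are illustrative assumptions) is given below.

```python
# Minimal sketch of the t-SNE visualization: project document and predefined
# category embeddings into 2D and color by category. Library choices, sampling
# and plotting details are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize(doc_embs, doc_labels, cat_embs, sample=10_000, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(doc_embs), size=min(sample, len(doc_embs)), replace=False)
    joint = np.vstack([doc_embs[idx], cat_embs])
    proj = TSNE(n_components=2, random_state=seed).fit_transform(joint)
    docs2d, cats2d = proj[:len(idx)], proj[len(idx):]
    plt.scatter(docs2d[:, 0], docs2d[:, 1], c=doc_labels[idx], s=2, cmap="tab20")
    plt.scatter(cats2d[:, 0], cats2d[:, 1], c=np.arange(len(cat_embs)),
                marker="D", s=80, cmap="tab20", edgecolors="black")
    plt.show()
```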
According to the visualization results, the boundaries of the document representations produced by average pooling are fuzzy, and documents of different categories are not clearly separated. Moreover, a considerable portion of the category representations appears at the boundary between two categories, which is unfavorable for category assignment. The following three methods (X-Class, SimPTC and Con2Class) show much clearer boundaries between different categories, and most documents gather around the correct categories. Benefiting from contrastive learning, Con2Class is able to produce even more compact clusters, which improves the category separation of documents, so that wrongly assigned pseudo labels for ambiguous documents are less involved in the subsequent BERT-cl finetuning. Without the MLM loss, however, the boundaries between different categories of documents become blurred and a larger proportion of categories appear at boundaries or wrongly gather with irrelevant documents, which shows that the MLM loss helps separate document categories and prevents the collapse of category embeddings.
6. Conclusion
To tackle the task of document classification using category names only, a new method named Con2Class is proposed in this paper. For nearest neighbor search, a pretrained language model is finetuned through confidence-driven contrastive learning to produce predefined category embeddings and to predict category embeddings of documents in a unified embedding space, with the help of templates containing prompts. MLM-maintained contrastive loss is introduced in confidence-driven contrastive learning to refine the predicted category embeddings and prevent the collapse of the predefined category embeddings during contrastive learning. The confidence factor is newly proposed to estimate the classification ability of the model, such that the model producing the most certain predictions, which we regard as having the best classification ability, is selected to produce category embeddings for similarity evaluation and comparison. Pseudo labels obtained by searching for the semantically closest predefined category name are used for classifier finetuning, on which progressive self-training is executed: high-confidence predictions are progressively added for classifier finetuning, reducing the interference of noisy pseudo labels.
The performance evaluations show that Con2Class surpasses the SOTA results on all five representative datasets. Intermediate results confirm the feasibility of using the newly proposed confidence factor to estimate the classification ability of the model. The sensitivity analysis reveals the impact of the hyperparameter \(\alpha\) and the templates on the results, demonstrating that a small \(\alpha\) is preferable for Con2Class and that an appropriate template is necessary for its success. Visualization of the embedding space shows that Con2Class improves the category separation of unlabeled documents over existing methods. Adding the MLM loss to contrastive learning is also shown to effectively prevent the collapse of the predefined category embeddings.
Nevertheless, there is still ample room to improve the results. For example, expanding the category names and designing multiple templates for ensembling, which have proved effective in other works [4], [7]-[9], are expected to boost classification performance. Checking the confidence factor more frequently is also likely to increase the chance of finding superior models, yielding better results. Besides, replacing the concrete templates with more flexible continuous templates may contribute to a significant improvement on classification using category names only, which is the direction we plan to pursue in future work.
Acknowledgements
This work is in part supported by JSPS KAKENHI Number JP22K12044.
References
[1] D. Lee, “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” ICML 2013 Workshop: Challenges in Representation Learning, Atlanta, USA, June 2013.
[2] T.B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D.M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” 34th Conference on Neural Information Processing Systems (NeurIPS), Vancouver, Canada, Dec. 2020.
[3] T. Gao, A. Fisch, and D. Chen, “Making pre-trained language models better few-shot learners,” Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), online, vol.1, pp.3816-3830, Aug. 2021. DOI: 10.18653/v1/2021.acl-long.295
[4] Y. Meng, Y. Zhang, J. Huang, C. Xiong, H. Ji, C. Zhang, and J. Han, “Text classification using label names only: A language model self-training approach,” Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), online, pp.9006-9017, Nov. 2020. DOI: 10.18653/v1/2020.emnlp-main.724
[5] Z. Wang, D. Mekala, and J. Shang, “X-class: Text classification with extremely weak supervision,” Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), online, pp.3043-3053, June 2021. DOI: 10.18653/v1/2021.naacl-main.242
[6] D. Mekala and J. Shang, “Contextualized weak supervision for text classification,” Proc. 58th Annual Meeting of the Association for Computational Linguistics (ACL), online, pp.323-333, July 2020. DOI: 10.18653/v1/2020.acl-main.30
[7] M. Chang, L. Ratinov, D. Roth, and V. Srikumar, “Importance of semantic representation: dataless classification,” Proc. Twenty-Third AAAI Conference on Artificial Intelligence (AAAI), Chicago, USA, pp.830-835, July 2008.
[8] Z. Haj-Yahia, A. Sieg, and L.A. Deleris, “Towards unsupervised text classification leveraging experts and word embeddings,” Proc. 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, pp.371-379, July 2019. DOI: 10.18653/v1/P19-1036
[9] Y. Fei, Z. Meng, P. Nie, R. Wattenhofer, and M. Sachan, “Beyond prompting: Making pre-trained language models better zero-shot learners by clustering representations,” Proc. 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), Abu Dhabi, United Arab Emirates, pp.8560-8579, Dec. 2022. DOI: 10.18653/v1/2022.emnlp-main.587
[10] T. Gao, X. Yao, and D. Chen, “SimCSE: Simple contrastive learning of sentence embeddings,” Proc. 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online and Punta Cana, Dominican Republic, pp.6894-6910, Nov. 2021. DOI: 10.18653/v1/2021.emnlp-main.552
[11] T. Jiang, J. Jiao, S. Huang, Z. Zhang, D. Wang, F. Zhuang, F. Wei, H. Huang, D. Deng, and Q. Zhang, “PromptBERT: Improving BERT sentence embeddings with prompts,” Proc. 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), Abu Dhabi, United Arab Emirates, pp.8826-8837, Dec. 2022. DOI: 10.18653/v1/2022.emnlp-main.603
[12] Y. Yan, R. Li, S. Wang, F. Zhang, W. Wu, and W. Xu, “ConSERT: A contrastive framework for self-supervised sentence representation transfer,” Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), online, vol.1, pp.5065-5075, Aug. 2021. DOI: 10.18653/v1/2021.acl-long.393
[13] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, A. Zhang, L. Zhang, W. Han, M. Huang, Q. Jin, Y. Lan, Y. Liu, Z. Liu, Z. Lu, X. Qiu, R. Song, J. Tang, J.-R. Wen, J. Yuan, W.X. Zhao, and J. Zhu, “Pre-trained models: Past, present and future,” AI Open, vol.2, pp.225-250, 2021. DOI: 10.1016/j.aiopen.2021.08.002
[14] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), Minneapolis, USA, pp.4171-4186, June 2019. DOI: 10.18653/v1/N19-1423
[15] C. Sun, X. Qiu, Y. Xu, and X. Huang, “How to fine-tune BERT for text classification?” Proc. 18th China National Conference on Chinese Computational Linguistics (CCL), Kunming, China, pp.194-206, Oct. 2019. DOI: 10.1007/978-3-030-32381-3_16
[16] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T.L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A.M. Rush, “Huggingface’s transformers: state-of-the-art natural language processing,” arXiv preprint arXiv:1910.03771, July 2020.
[17] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Computing Surveys, vol.55, no.9, pp.1-35, 2023. DOI: 10.1145/3560815
[18] T. Schick and H. Schütze, “Exploiting cloze-questions for few-shot text classification and natural language inference,” Proc. 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), online, pp.255-269, April 2021. DOI: 10.18653/v1/2021.eacl-main.20
[19] R. Zhong, K. Lee, Z. Zhang, and D. Klein, “Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections,” Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, pp.2856-2878, Nov. 2021. DOI: 10.18653/v1/2021.findings-emnlp.244
[20] X.L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), online, pp.4582-4597, Aug. 2021. DOI: 10.18653/v1/2021.acl-long.353
[21] P. Lewis, L. Denoyer, and S. Riedel, “Unsupervised question answering by cloze translation,” Proc. 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, pp.4896-4910, July 2019. DOI: 10.18653/v1/P19-1484
[22] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang, “GPT understands, too,” AI Open, in press. DOI: 10.1016/j.aiopen.2023.08.012
[23] X. Liu, K. Ji, Y. Fu, W. Tam, Z. Du, Z. Yang, and J. Tang, “P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks,” Proc. 60th Annual Meeting of the Association for Computational Linguistics (ACL), Dublin, Ireland, vol.2, pp.61-68, May 2022. DOI: 10.18653/v1/2022.acl-short.8
[24] T. Shin, Y. Razeghi, R.L. Logan IV, E. Wallace, and S. Singh, “AutoPrompt: Eliciting knowledge from language models with automatically generated prompts,” Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), online, pp.4222-4235, Nov. 2020. DOI: 10.18653/v1/2020.emnlp-main.346
[25] Z. Zhong, D. Friedman, and D. Chen, “Factual probing is [MASK]: Learning vs. learning to recall,” Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), online, pp.5017-5033, June 2021. DOI: 10.18653/v1/2021.naacl-main.398
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, USA, Dec. 2017.
[27] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” International Conference on Learning Representations 2013 (ICLR Workshop Poster), Atlanta, USA, June 2013.
[28] J. Pennington, R. Socher, and C.D. Manning, “Glove: Global vectors for word representation,” Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp.1532-1543, Oct. 2014.
[29] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp.3982-3992, Nov. 2019. DOI: 10.18653/v1/D19-1410
[30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research (JMLR), vol.15, no.1, pp.1929-1958, 2014.
[31] T. Zhang, Z. Xu, T. Medini, and A. Shrivastava, “Structural contrastive representation learning for zero-shot multi-label text classification,” Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, pp.4937-4947, Dec. 2022. DOI: 10.18653/v1/2022.findings-emnlp.362
[32] X. Wu, C. Gao, L. Zang, J. Han, Z. Wang, and S. Hu, “ESimCSE: Enhanced sample building method for contrastive learning of unsupervised sentence embedding,” Proc. 29th International Conference on Computational Linguistics (COLING), Gyeongju, Republic of Korea, pp.3898-3907, Oct. 2022.
[33] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, USA, Dec. 2017.
[34] E. Gabrilovich and S. Markovitch, “Computing semantic relatedness using Wikipedia-based explicit semantic analysis,” Proc. Twentieth International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, Jan. 2007.
[35] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society for Information Science, vol.41, no.6, pp.391-407, 1990.
[36] K. Song, X. Tan, T. Qin, J. Lu, and T. Liu, “MPNet: Masked and permuted pre-training for language understanding,” 34th Conference on Neural Information Processing Systems (NeurIPS), Vancouver, Canada, Dec. 2020.
[37] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, Wiley, 1973.
[38] Y. Yu, S. Zuo, H. Jiang, W. Ren, T. Zhao, and C. Zhang, “Fine-tuning pre-trained language model with weak supervision: A contrastive-regularized self-training approach,” Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), online, pp.1063-1077, June 2021. DOI: 10.18653/v1/2021.naacl-main.84
[39] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” 29th Conference on Neural Information Processing Systems (NIPS), Montreal, Canada, Dec. 2015.
[40] K. Lang, “Newsweeder: Learning to filter Netnews,” Proc. Twelfth International Conference on Machine Learning (ICML), Tahoe City, USA, July 1995. DOI: 10.1016/B978-1-55860-377-6.50048-7
[41] A.L. Maas, R.E. Daly, P.T. Phan, D. Huang, A.Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis,” Proc. 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), Portland, USA, pp.142-150, June 2011.
[42] Y. Meng, J. Shen, C. Zhang, and J. Han, “Weakly-supervised neural text classification,” Proc. 27th ACM International Conference on Information and Knowledge Management (CIKM), Torino, Italy, pp.983-992, Oct. 2018. DOI: 10.1145/3269206.3271737
[43] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, “Aligning books and movies: Towards story-like visual explanations by watching movies and reading books,” Proc. IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, pp.19-27, Dec. 2015.
[44] L. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol.9, no.11, pp.2579-2605, 2008.