BERT and XLNet are two of the most powerful pretrained deep neural networks (DNNs) for Natural Language Processing (NLP). BERT is one of the most widely used models to fine-tune for business problems like Sentiment Analysis, Question Answering, Named Entity Recognition (NER), and Document Ranking. However, XLNet actually outperforms BERT on at least 20 different NLP tasks, as shown in the original XLNet paper, and this has been confirmed in a follow-up study by the XLNet team (in 2019).
BERT and XLNet have similar architectures, but they are (pre)trained very differently. BERT is called an Autoencoding (AE) model, whereas XLNet is referred to as Auto-Regressive (AR). More specifically, while BERT is trained to predict masked words, XLNet uses a more sophisticated approach, Permutation Language Modelling (sometimes called Deshuffling).
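The difference between the two objectives can be sketched in a few lines of plain Python. This is only a toy illustration, not real training code: the sentence, the 15% mask rate, and the variable names are all illustrative.

```python
import random

# Toy illustration of the two pretraining objectives (a sketch, not the
# real training code; the sentence and 15% mask rate are illustrative).
tokens = ["the", "cat", "sat", "on", "the", "mat"]
random.seed(0)

# BERT (autoencoding): mask ~15% of positions; each masked token is
# predicted from all the remaining, unmasked positions at once.
n_masked = max(1, round(0.15 * len(tokens)))
masked = sorted(random.sample(range(len(tokens)), k=n_masked))
bert_contexts = {i: [j for j in range(len(tokens)) if j not in masked]
                 for i in masked}

# XLNet (autoregressive): sample a factorization order (a permutation);
# each token is predicted only from the tokens that precede it IN THAT
# ORDER, so over many sampled orders every token sees every context.
order = list(range(len(tokens)))
random.shuffle(order)
xlnet_contexts = {pos: order[:step] for step, pos in enumerate(order)}

print("masked positions:", masked)
print("factorization order:", order)
```

Note how the BERT contexts are bidirectional but fixed, while the XLNet contexts change with every sampled permutation, which is what lets an autoregressive model capture bidirectional dependencies.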
Here, we compare two equivalent (albeit smaller) BERT and XLNet (base-cased) models from Hugging Face. We see that the BERT model has many layer alphas greater than 6, which usually indicates that these layers are not well trained. In contrast, the XLNet layer alphas all lie in a very clean range, between roughly 2.5 and 4.5, which indicates that all of the XLNet layers are very well trained.
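To make the alpha metric concrete: weightwatcher fits a power law p(λ) ~ λ^(−α) to the eigenvalue spectrum (the ESD) of each layer's correlation matrix X = WᵀW. The sketch below estimates alpha on a synthetic heavy-tailed spectrum using a simple Hill-style maximum-likelihood estimator; the real tool selects the fitting range (xmin) and fits more carefully, so treat this only as a minimal illustration of the idea.

```python
import numpy as np

# Sketch of the layer-alpha metric: fit a power law p(lambda) ~ lambda^(-alpha)
# to a (here, synthetic) eigenvalue spectrum. The real weightwatcher fit
# chooses xmin and the fitting procedure more carefully.
rng = np.random.default_rng(42)

# Synthetic "eigenvalues" drawn from a Pareto law with exponent alpha = 3.5
# (numpy's pareto(a) is Lomax; adding 1 gives classical Pareto with xmin = 1
# and power-law exponent a + 1).
true_alpha = 3.5
eigs = 1.0 + rng.pareto(true_alpha - 1.0, size=50_000)

# Hill estimator: alpha_hat = 1 + n / sum(log(lambda_i / xmin)), xmin = 1
alpha_hat = 1.0 + len(eigs) / np.log(eigs).sum()
print(f"estimated alpha = {alpha_hat:.2f}")  # close to 3.5
```

An alpha recovered this way in the 2–6 range is what weightwatcher reads as a well-trained layer; values above 6 (as in the BERT layers here) signal a spectrum without the expected heavy tail.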
Using weightwatcher, we can see immediately that XLNet should outperform BERT, even without access to the training or test data. And this is exactly what the academic literature reports.
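The comparison itself reduces to a simple decision rule over the per-layer alphas. The alpha values below are hypothetical, chosen only to mirror the pattern described above (several BERT layers above 6, all XLNet layers between 2.5 and 4.5); with the real tool you would obtain them from `weightwatcher.WeightWatcher(model=model).analyze()`, which returns one alpha per layer.

```python
# Hypothetical per-layer alphas, illustrative only (not measured values);
# they mirror the pattern in the text: several BERT layers exceed 6,
# while all XLNet layers sit in the 2.5-4.5 band.
bert_alphas  = [3.1, 4.0, 6.8, 7.2, 5.5, 6.3, 3.9, 8.1, 4.4, 6.6, 5.0, 7.5]
xlnet_alphas = [2.6, 3.0, 3.4, 2.9, 4.1, 3.8, 4.4, 2.8, 3.3, 4.0, 3.6, 4.3]

def frac_well_trained(alphas, lo=2.0, hi=6.0):
    """Fraction of layers whose alpha falls in the 'well trained' range [2, 6]."""
    return sum(lo <= a <= hi for a in alphas) / len(alphas)

print(frac_well_trained(bert_alphas))   # 0.5  -- half the BERT layers exceed 6
print(frac_well_trained(xlnet_alphas))  # 1.0  -- every XLNet layer in range
```

The model with more layers inside the well-trained band is the one weightwatcher predicts will generalize better, with no training or test data required.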