DoLa: a decoding strategy to make LLMs less prone to hallucinate

Viraj Kadam
3 min read · Dec 8, 2024


LLM hallucinations are still not well studied or understood. The DoLa paper from MIT and Microsoft, titled “DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models”, tries to address this issue with contrastive decoding.

1. Background and Introduction

1.1) What is LLM hallucination?

Hallucinations in LLMs refer to generated content that is not grounded in the training data or in facts, caused by various factors such as imperfect learning and decoding. The exact reasons why LMs hallucinate are not fully understood.

1.2) Some insights into the inner workings of LLMs

  • LLMs are trained to predict the next token given the preceding sequence, maximizing the probability of the data observed in training. This is known as maximum likelihood estimation (MLE).
  • From a model interpretability perspective, transformer models have been loosely shown to encode lower-level information (e.g., part-of-speech tags) in the earlier layers, and more “semantic” information in the later layers.
  • Dai et al. (2022) postulated that knowledge neurons are distributed in the top layers of the transformer.
  • Factual knowledge evolves across layers. When answering factual questions, the model was observed to change its prediction more in the later layers, whereas for easy tokens such as “is”, “was”, “the”, etc., the output distribution barely changes from the initial layers to the final ones. This suggests that for more factual and complex generations, the model keeps refining its prediction in the later layers (see the short sketch after this list).
  • By emphasizing the knowledge of higher layers and downplaying that of lower layers, we can potentially make LMs more factual and thus reduce hallucinations.
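
As a quick illustration of how predictions evolve across layers, the sketch below applies the language-model head to every intermediate hidden state (“early exit”) and prints each layer’s top next-token guess. This is a minimal sketch, not the paper’s code; the GPT-2 checkpoint, prompt, and the transformer.ln_f / lm_head attribute names are assumptions that hold for GPT-2-style models in transformers.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative choice; the attribute names below are GPT-2 specific
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states holds the embedding output plus every transformer layer's output
for layer_idx, hidden in enumerate(out.hidden_states):
    h = model.transformer.ln_f(hidden[:, -1, :])        # final norm, then LM head ("early exit")
    probs = torch.softmax(model.lm_head(h), dim=-1)
    top_prob, top_id = probs.max(dim=-1)
    print(layer_idx, repr(tokenizer.decode(top_id)), round(top_prob.item(), 3))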

1.3) Why do LLMs hallucinate?

  • LLMs are trained with the objective of maximizing the likelihood of the training data. This objective heavily penalizes the model when it fails to predict the distribution it has seen in the training data.
  • On the other hand, the training objective does not penalize the model for assigning probability to tokens it has not seen in training. Hence it can give non-zero probabilities to output sequences that never appeared in the training data (see the toy illustration after this list).
  • Hallucinations occur when the model generates outputs that are not grounded in the training data but are semantically and syntactically acceptable. The next-word-prediction objective rewards plausible output sequences rather than factual ones.
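
A toy illustration of the last two points: a softmax over logits never produces an exact zero, so every token, including ones that would form sequences never seen in training, keeps some probability mass. The numbers below are made up.

import torch

logits = torch.tensor([4.0, 2.0, -3.0, -8.0])   # hypothetical scores for four tokens
probs = torch.softmax(logits, dim=-1)
print(probs)        # every entry is strictly greater than zero
print(probs.min())  # even the least plausible token keeps non-zero probability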

2. How DoLa works

DoLa is applied as a contrastive decoding strategy, where the next-word probability is obtained from the difference in log probabilities between a higher (mature) layer and a lower (premature) layer.

2.1) Selecting a lower (premature) layer :

To magnify the effect of contrastive decoding, we want to select the layer whose output distribution differs the most from that of the final (mature) layer. The authors use the Jensen-Shannon divergence (JSD) as the measure of difference between the distributions.
The motivation for selecting the layer with maximum divergence is to ensure that the model changes its output significantly after that layer, and thus has a higher chance of incorporating factual knowledge that does not exist in the earlier layers.
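
A minimal sketch of this dynamic selection, assuming we already have one next-token distribution per layer (e.g. from the early-exit loop in section 1.2); the function and variable names are illustrative, not the paper’s or the library’s:

import torch

def jsd(p, q, eps=1e-10):
    # Jensen-Shannon divergence between two probability vectors
    m = 0.5 * (p + q)
    kl_pm = torch.sum(p * (torch.log(p + eps) - torch.log(m + eps)))
    kl_qm = torch.sum(q * (torch.log(q + eps) - torch.log(m + eps)))
    return 0.5 * (kl_pm + kl_qm)

def select_premature_layer(layer_probs, candidate_layers):
    # pick the candidate layer whose distribution diverges most from the final (mature) layer
    mature = layer_probs[-1]
    divergences = {j: jsd(layer_probs[j], mature) for j in candidate_layers}
    return max(divergences, key=divergences.get)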

2.2) Contrasting the predictions :

The log probabilities of the premature layer’s outputs are subtracted from those of the mature layer, and the resulting distribution is used for next-word prediction.

If the predicted probability of a token is very small in the mature layer, it is unlikely to be a reasonable prediction (the knowledge for it has not evolved across the layers), so such tokens are removed from consideration (their probability is effectively set to zero) before the contrast is applied.
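
Putting the two steps together, a rough sketch of the contrast plus the cut-off described above (the paper, following contrastive decoding, calls this an adaptive plausibility constraint; the alpha value and names here are illustrative):

import torch

def dola_next_token_probs(mature_probs, premature_probs, alpha=0.1, eps=1e-10):
    # keep only tokens the mature layer itself considers plausible
    plausible = mature_probs >= alpha * mature_probs.max()
    # contrast: log p_mature - log p_premature
    scores = torch.log(mature_probs + eps) - torch.log(premature_probs + eps)
    scores[~plausible] = float("-inf")      # discard tokens below the cut-off
    return torch.softmax(scores, dim=-1)    # contrasted next-token distribution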

3. Using DoLa in Hugging Face’s transformers library

To use DoLa decoding in transformers, we can simply pass the following arguments to the generate() function.
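
The snippets below assume a model, a tokenizer, and tokenized inputs already exist, and that a recent version of transformers (one that supports the dola_layers argument) is installed. A minimal setup might look like this; the checkpoint and prompt are only examples:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"   # example checkpoint; any causal LM supported by generate() works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

text = "What is the highest peak in the world?"
inputs = tokenizer(text, return_tensors="pt").to(model.device)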

dola_layers : the candidate layers for premature layer selection. The dynamically selected layer is contrasted with the final layer. It can be set in three ways:

low: dynamically selects a layer from the lower part of the model to contrast with the final layer.

dola_low_output = model.generate(**inputs, do_sample=False, max_new_tokens=50, dola_layers='low')
tokenizer.batch_decode(dola_low_output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)

high: dynamically selects a layer from the higher part of the model to contrast with the final layer.

dola_high_output = model.generate(**inputs, do_sample=False, max_new_tokens=50, dola_layers='high')
tokenizer.batch_decode(dola_high_output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)

list of integers : layer indices to contrast with the final layer, specified manually. For example, setting dola_layers=[28,30] will contrast the final layer with the 28th and 30th layers.

dola_custom_output = model.generate(**inputs, do_sample=False, max_new_tokens=50, dola_layers=[28,30], repetition_penalty=1.2)
tokenizer.batch_decode(dola_custom_output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)

The paper suggests contrasting 'high' layers to improve short-answer tasks like TruthfulQA, and contrasting 'low' layers to improve long-answer reasoning tasks such as GSM8K, StrategyQA, FACTOR, and VicunaQA.

repetition_penalty : setting repetition_penalty=1.2 is suggested to reduce repetition in DoLa decoding.

References

  • Chuang, Y.-S., Xie, Y., Luo, H., Kim, Y., Glass, J., & He, P. (2023). DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. arXiv:2309.03883.
  • Dai, D., Dong, L., Hao, Y., Sui, Z., Chang, B., & Wei, F. (2022). Knowledge Neurons in Pretrained Transformers. ACL 2022.
