Ethical treatment of language models against harmful inference-time interventions

Article
Publication Date:
2026
Abstract:
Open-weights large language models and low-cost steering methods are rapidly democratising the crafting of custom artificial-intelligence-based assistants. This benefit comes with the side effect of expanding the potential risks associated with harmful, toxic, or otherwise undesired uses of neural language models. Language model immunisation is a relatively new research area that seeks to mitigate these risks: immunised models are pre-trained models whose weights are hard to fine-tune toward harmful or dual tasks. While existing work on immunisation focuses on resistance against full-parameter or parameter-efficient fine-tuning, this paper proposes a candidate strategy to neutralise models against low-cost attacks based on inference-time interventions (ITI). The proposed approach, called Ethical Treatment (E.T.), consists of training layer-wise low-rank adaptors to locally neutralise attacks at the decoder-block level of Transformer-based models. Pilot experiments on Llama-3-8B-Instruct demonstrate E.T.'s effectiveness in reducing ITI-attack success rates while preserving utility on general-purpose tasks. Evaluation across the TinyBenchmarks suite shows that E.T. maintains strong performance on commonsense reasoning and world knowledge, with primary degradation limited to mathematical reasoning. While not solving the broader immunisation challenge, these results position E.T. as a promising step toward structurally robust open-weight models.
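The mechanism the abstract describes, an inference-time intervention that steers hidden activations, countered by a low-rank adaptor acting locally at the decoder-block level, can be illustrated with a toy numerical sketch. Everything below (the dimensions, the random vectors, the closed-form rank-1 "adapter") is a hypothetical simplification for intuition only, not the paper's actual training procedure: a real E.T. adaptor would be learned from data, and real ITI attacks operate on Transformer activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size

h = rng.normal(size=d)             # clean decoder-block activation
steer = 0.5 * rng.normal(size=d)   # attacker's steering vector, added at inference
h_attacked = h + steer             # intervened activation

# Hypothetical rank-1 correction B @ A that projects out the steering
# direction. A trained adaptor would be learned; this closed form just
# illustrates a low-rank, layer-local fix.
u = steer / np.linalg.norm(steer)  # unit steering direction
A = u.reshape(1, d)                # down-projection (rank r = 1)
B = -u.reshape(d, 1)               # up-projection, negated

h_repaired = h_attacked + (B @ (A @ h_attacked)).ravel()

# h_repaired has no component left along the steering direction:
# algebraically, h_repaired == h - u * (u @ h), so the attack is removed
# at the cost of the clean component along that same direction.
```

The design point this toy makes is the one in the abstract: because the correction is low-rank and applied per decoder block, it can target the attack subspace while leaving the rest of the activation (and hence general-purpose utility) largely untouched.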
CRIS Type:
Journal Article
Keywords:
Large language models; Language model immunisation against inference-time interventions; Ethical Artificial Intelligence
Author list:
Cevallos-Moreno, J. F.; Rizzardi, A.; Sicari, S.; Coen Porisini, A.
University Authors:
CEVALLOS MORENO JESUS FERNANDO
COEN PORISINI ALBERTO
RIZZARDI ALESSANDRA
SICARI SABRINA SOPHY
Link to full record:
https://irinsubria.uninsubria.it/handle/11383/2209152
Link to Full Text:
https://irinsubria.uninsubria.it//retrieve/handle/11383/2209152/478853/1-s2.0-S0952197626005981-main.pdf
Published in:
ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE
Journal
Project:
SERENA-IIoT: SEcure and REliable Networked Architecture for Industrial Internet of Things digital transformation
General Information

URL

https://www.sciencedirect.com/science/article/pii/S0952197626005981