Ethical treatment of language models against harmful inference-time interventions

Article
Publication Date:
2026
Abstract:
Open-weights large language models and low-cost steering methods are rapidly democratising the crafting of custom artificial-intelligence-based assistants. This benefit comes with the side effect of expanding the potential risks associated with harmful, toxic, or otherwise undesired uses of neural language models. Language model immunisation is a relatively new research area that seeks to mitigate these risks: immunised models are pre-trained models whose weights are hard to fine-tune toward harmful or dual tasks. While existing work on immunisation focuses on resistance against full-parameter or parameter-efficient fine-tuning, this paper proposes a candidate strategy to neutralise models against low-cost attacks based on inference-time interventions (ITI). The proposed approach, called Ethical Treatment (E.T.), consists of training layer-wise low-rank adaptors to locally neutralise attacks at the decoder-block level of Transformer-based models. Pilot experiments on Llama-3-8B-Instruct demonstrate E.T.'s effectiveness in reducing ITI-attack success rates while preserving utility on general-purpose tasks. Evaluation across the TinyBenchmarks suite shows that E.T. maintains strong performance on commonsense reasoning and world knowledge, with primary degradation limited to mathematical reasoning. While not solving the broader immunisation challenge, these results position E.T. as a promising step toward structurally robust open-weight models.
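The mechanism the abstract describes, an inference-time intervention that steers hidden activations, countered by a low-rank adaptor acting locally at the decoder-block level, can be illustrated with a toy numerical sketch. Everything below (the dimensions, the random vectors, the closed-form rank-1 "adapter") is a hypothetical simplification for intuition only, not the paper's actual training procedure: a real E.T. adaptor would be learned from data, and real ITI attacks operate on Transformer activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size

h = rng.normal(size=d)             # clean decoder-block activation
steer = 0.5 * rng.normal(size=d)   # attacker's steering vector, added at inference
h_attacked = h + steer             # intervened activation

# Hypothetical rank-1 correction B @ A that projects out the steering
# direction. A trained adaptor would be learned; this closed form just
# illustrates a low-rank, layer-local fix.
u = steer / np.linalg.norm(steer)  # unit steering direction
A = u.reshape(1, d)                # down-projection (rank r = 1)
B = -u.reshape(d, 1)               # up-projection, negated

h_repaired = h_attacked + (B @ (A @ h_attacked)).ravel()

# h_repaired has no component left along the steering direction:
# algebraically, h_repaired == h - u * (u @ h), so the attack is removed
# at the cost of the clean component along that same direction.
```

The design point this toy makes is the one in the abstract: because the correction is low-rank and applied per decoder block, it can target the attack subspace while leaving the rest of the activation (and hence general-purpose utility) largely untouched.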
CRIS Type:
Journal Article
Keywords:
Large language models; Language model immunisation against inference-time interventions; Ethical Artificial Intelligence
Author list:
Cevallos-Moreno, J. F.; Rizzardi, A.; Sicari, S.; Coen Porisini, A.
University Authors:
CEVALLOS MORENO JESUS FERNANDO
COEN PORISINI ALBERTO
RIZZARDI ALESSANDRA
SICARI SABRINA SOPHY
Link to full record:
https://irinsubria.uninsubria.it/handle/11383/2209152
Link to Full Text:
https://irinsubria.uninsubria.it//retrieve/handle/11383/2209152/478853/1-s2.0-S0952197626005981-main.pdf
Published in:
ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE
Journal
Project:
SERENA-IIoT: SEcure and REliable Networked Architecture for Industrial Internet of Things digital transformation
General Information

URL

https://www.sciencedirect.com/science/article/pii/S0952197626005981