Knowledge Acquisition through Continued Pretraining is Difficult: A Case Study on r/AskHistorians

Abstract

Powerful LLMs such as ChatGPT master a wide array of tasks, but have notable limitations in domain-specific areas, especially when prompted to recite facts. This is of particular importance for knowledge workers who are increasingly adopting LLM-based tools. While there are various techniques that can help ingest knowledge into LLMs, such as instruction tuning and alignment, most have disadvantages. We examine the impact of prominent training techniques on LLMs’ knowledge accuracy using a knowledge-dense dataset that we curate from r/AskHistorians, a rich source of historical knowledge. We evaluate the impact of different model sizes from 1.3B to 7B parameters and other factors such as LoRA adapters, quantization, overfitting, and the inclusion of Reddit data in pretraining. In addition, we measure linguistic metrics and human and LLM-based preferences. Our results suggest that pretraining and model size have a much stronger effect on knowledge accuracy than continued pretraining – except in cases of overfitting to the tested knowledge. Fine-tuning on our Reddit dataset introduces less complex, but slightly more toxic language. Our study explores the challenges of injecting domain-specific datasets into LLMs and has implications for practitioners, e.g., when LLMs are to be fine-tuned with company-specific datasets.
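
As a rough illustration of the kind of setup examined in the paper, the sketch below shows continued pretraining of a causal LM with a LoRA adapter on a 4-bit-quantized base model using Hugging Face transformers and peft. The model name, dataset file, and hyperparameters are illustrative placeholders, not the configuration used in the study.

```python
# Minimal sketch: continued pretraining with a LoRA adapter on a quantized base model.
# All names (model, data file, hyperparameters) are placeholders for illustration.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder ~7B base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load the base model in 4-bit to reduce memory during continued pretraining.
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             quantization_config=bnb_config,
                                             device_map="auto")

# Attach low-rank adapters to the attention projections; base weights stay frozen.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Plain causal-LM objective on a domain corpus (here: a hypothetical JSONL file
# with one document per line in a "text" field).
dataset = load_dataset("json", data_files="askhistorians.jsonl", split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cpt-lora",
                           per_device_train_batch_size=4,
                           num_train_epochs=1,
                           learning_rate=2e-4,
                           bf16=True,
                           logging_steps=50),
    train_dataset=dataset,
    # mlm=False gives the standard next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```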

Publication
KnowLLM@ACL 2024
Konstantin Dobler
PhD Student & Research Associate

I’m a PhD student at Hasso Plattner Institute researching transfer learning of large language models.