Knowledge Acquisition through Continued Pretraining is Difficult: A Case Study on r/AskHistorians

Abstract

Powerful LLMs such as ChatGPT master a wide array of tasks, but have notable limitations in domain-specific areas, especially when prompted to recite facts. This is of particular importance for knowledge workers who are increasingly adopting LLM-based tools. While there are various techniques that can help ingest knowledge into LLMs, such as instruction tuning and alignment, most have disadvantages. We examine the impact of prominent training techniques on LLMs’ knowledge accuracy using a knowledge-dense dataset that we curate from r/AskHistorians, a rich source of historical knowledge. We evaluate the impact of different model sizes from 1.3B to 7B parameters and other factors such as LoRA adapters, quantization, overfitting, and the inclusion of Reddit data in pretraining. In addition, we measure linguistic metrics and human and LLM-based preferences. Our results suggest that pretraining and model size have a much stronger effect on knowledge accuracy than continued pretraining – except in cases of overfitting to the tested knowledge. Fine-tuning on our Reddit dataset introduces less complex, but slightly more toxic language. Our study explores the challenges of injecting domain-specific datasets into LLMs and has implications for practitioners, e.g., when LLMs are to be fine-tuned with company-specific datasets.
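
As a rough illustration of the kind of setup examined in the paper, the sketch below shows continued pretraining of a causal LM with a LoRA adapter on a 4-bit-quantized base model using Hugging Face transformers and peft. The model name, dataset file, and hyperparameters are illustrative placeholders, not the configuration used in the study.

```python
# Minimal sketch: continued pretraining with a LoRA adapter on a quantized base model.
# All names (model, data file, hyperparameters) are placeholders for illustration.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder ~7B base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load the base model in 4-bit to reduce memory during continued pretraining.
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             quantization_config=bnb_config,
                                             device_map="auto")

# Attach low-rank adapters to the attention projections; base weights stay frozen.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Plain causal-LM objective on a domain corpus (here: a hypothetical JSONL file
# with one document per line in a "text" field).
dataset = load_dataset("json", data_files="askhistorians.jsonl", split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cpt-lora",
                           per_device_train_batch_size=4,
                           num_train_epochs=1,
                           learning_rate=2e-4,
                           bf16=True,
                           logging_steps=50),
    train_dataset=dataset,
    # mlm=False gives the standard next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```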

Publication
KnowLLM@ACL 2024
Konstantin Dobler
PhD Student & Research Associate

I’m a PhD student at Hasso Plattner Institute researching transfer learning of large language models.