5  How Generative AI Works

Warning: Draft — Not Yet Reviewed

The content in this chapter is under review: Claude Code was used to convert the text from PowerPoint slides to this webpage, so content may be incomplete, inaccurate, or require significant editing before use.

To use generative AI tools critically and responsibly, researchers need a working understanding of how these systems are built and where they fundamentally differ from human reasoning. This chapter provides a conceptual (non-mathematical) overview of large language models, their training process, and the technical reasons behind their limitations and failures.

Note: Learning Outcomes

By the end of this chapter you will be able to:

  • Describe at a conceptual level how large language models are trained and how they generate text
  • Explain what tokenisation is and why it matters for how AI processes research text
  • Identify the main categories of limitation and failure modes in current generative AI systems
  • Explain why LLMs produce plausible-sounding but sometimes incorrect outputs
  • Connect technical limitations to practical implications for research use

5.1 What Is a Large Language Model?

  • From rule-based systems to neural networks to transformers
  • Training on large corpora: what data, and whose data?
  • The prediction task: next-token prediction as the core mechanism
  • Parameters, scale, and emergent capabilities
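The next-token prediction task above can be sketched with a toy bigram model — a deliberate simplification, since real LLMs use neural networks over subword tokens rather than word counts, but the training objective (predict what comes next) is the same:

```python
from collections import Counter, defaultdict

# Toy illustration of next-token prediction: count, for each word in a tiny
# corpus, which word follows it, then predict the most frequent successor.
corpus = "the cat sat on the mat . the cat ran on the grass .".split()

follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def predict_next(word):
    """Return the most frequent successor of `word` seen in the corpus."""
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" — it follows "the" twice; "mat" and "grass" once each
```

Scaling this idea up — longer contexts instead of a single preceding word, learned representations instead of raw counts — is, conceptually, what the training process in this section describes.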

5.2 Tokenisation

  • What tokens are and how text is split
  • Why tokenisation affects how AI handles numbers, names, non-English text, and specialised vocabulary
  • Implications for prompting in technical and scientific domains
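A minimal sketch of subword tokenisation, using greedy longest-match over a small hypothetical vocabulary. Real tokenisers (e.g. byte-pair encoding) learn their vocabularies from data, but the effect is the same: the model receives token IDs, never individual letters.

```python
def tokenise(text, vocab):
    """Greedily match the longest vocabulary entry at each position;
    fall back to single characters for anything not in the vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(len(text) - i, 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens

vocab = {"straw", "berry", "token", "ised"}  # hypothetical vocabulary
print(tokenise("strawberry", vocab))  # ['straw', 'berry']
print(tokenise("tokenised", vocab))   # ['token', 'ised']
print(tokenise("qat", vocab))         # ['q', 'a', 't'] — rare words fragment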

5.3 How Text Generation Works

  • Probability distributions over tokens
  • Temperature, sampling, and why outputs vary
  • Context windows: what the model “sees” at any moment
  • The absence of memory between sessions (in most systems)
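The role of temperature can be illustrated with a plain softmax over hypothetical token scores — a sketch of the general mechanism, not any particular model's implementation:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into a probability distribution over candidate tokens.
    Low temperature sharpens the distribution (near-deterministic output);
    high temperature flattens it (more varied output)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
print(softmax_with_temperature(logits, 1.0))   # moderately peaked
print(softmax_with_temperature(logits, 0.1))   # ~[1, 0, 0]: near-greedy
print(softmax_with_temperature(logits, 10.0))  # near-uniform: outputs vary a lot
```

Sampling from these distributions (rather than always taking the top token) is why the same prompt can yield different outputs on different runs.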

5.4 Limitations and Failure Modes

  • Hallucination: generating confident but false information
  • Knowledge cutoffs and static training data
  • Sensitivity to prompt wording — and what this reveals
  • Reasoning limitations: pattern matching vs. logical inference
  • Bias amplification from training data
  • Inability to verify or update its own outputs

5.5 Why AI Fails in Research Contexts

  • Fabricated citations and plausible-but-wrong claims
  • Discipline-specific knowledge gaps
  • Inconsistency across conversations
  • Over-confidence in uncertain domains

Tip: Discussion Activity

  1. After learning how LLMs generate text, has your intuition about AI “understanding” changed? What does “understanding” even mean in this context?
  2. Which of the failure modes described in this chapter concerns you most for your own research area?
  3. If an AI produces a confident, well-written, but factually wrong paragraph, what does this reveal about the limits of using fluency as a quality signal?
  4. How might the knowledge cutoff of an AI system create specific risks in fast-moving research fields?

5.6 Practical Exercises

5.6.1 Exercise 1 — Exposing tokenisation limits

Tool: arena.ai (free, battle mode)

In battle mode, submit: “Count the number of times the letter ‘r’ appears in the word ‘strawberry’.” Observe whether either model makes an error. This is a known failure case caused by how text is tokenised — the model never “sees” individual letters. Vote for the more accurate response, then reveal the models. Discuss: if an AI cannot reliably count letters in a single word, what does this imply for tasks like character-level text analysis in research?
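If you want the ground truth before voting, a one-line character-level check gives it — exactly the view of the text that a tokenised model never receives:

```python
# Character-level ground truth for the exercise above.
word = "strawberry"
print(word.count("r"))              # 3
print([i for i, c in enumerate(word) if c == "r"])  # positions of each 'r'
```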

5.6.2 Exercise 2 — Finding the knowledge cutoff

Tool: duck.ai (free, private)

First, ask the model: “What is your knowledge cutoff date?” Then ask it about a real event you know occurred after that date (e.g., a recent publication, policy change, or conference in your field). Observe how the model responds: does it acknowledge uncertainty, confabulate, or refuse? Document the response. What does this mean for researchers who use AI for literature review or horizon scanning?

5.6.3 Exercise 3 — Non-determinism and consistency

Tool: arena.ai (free, battle mode)

Submit a simple but open-ended research question from your field twice in two separate sessions. Compare the two responses. Do they give different conclusions, different sources, or different structures? Note any inconsistencies. Discuss: if the same prompt can produce different outputs, what are the implications for reproducibility in research that relies on AI?
