The Olympics Bot — Retrieval-Augmented Generation for Journalists

Vertti Luostarinen
17 min read · Aug 12, 2024

A knowledge graph of the sports and athletes featured in the live news feed (node labels in Finnish).

Facts are to a large language model (LLM) what time is to a stopped clock. When we say that an LLM is “hallucinating”, we are referring to the instances when it states things that are incorrect. This is somewhat misleading, as everything an LLM says is essentially hallucinated: it just often happens to be true.

Retrieval-Augmented Generation (RAG) is a design pattern that was invented to combat this limitation. In RAG, the responses of a generative large language model are based on search results. When a user asks a question, the RAG system retrieves knowledge from a database. This information is then passed on to the LLM. The LLM is subsequently prompted to answer the question using the information it has received.

In the Olympics Bot Project, funded by Yle Innovations, the innovations department of the Finnish Broadcasting Company, we designed a RAG system for reporters covering the 2024 Olympic Games. The chatbot was augmented with a wide variety of both textual and tabular data sources. The main sources of data were the articles from Yle’s live coverage news feed. In addition, the chatbot was given texts about the history of the Olympic Games, the biographies of all the athletes in Finland’s national team, and the rules of every sport played during the games. It also received tabular data about the schedule of the games, as well as the previous records of the national team’s athletes.

These days, the RAG landscape is awash with overhyped and over-engineered solutions, all promising to eliminate the downsides of LLMs. The Olympics presented us with the perfect test case for assessing the suitability of different RAG approaches in real-world journalistic workflows. While it is simple to set up a chatbot that can answer questions, the true challenge lies in ensuring that those answers are not only correct, but also as relevant and up to date as they can be. Beyond simple question answering, we wanted to see whether we could use RAG to combine different data sources to form holistic analyses and summaries.

The project was overseen by Jyri Kivimäki from Yle News Lab, and the system was designed by me. The source materials were collected with the help of two hard-working summer interns, Jimi Hevonoja and Kerkko Tukiainen. Together, we set out to build a system that would fulfill these requirements:

  • Truthfulness: It goes without saying that the answers of the chatbot should reflect the context it has been given to form the answer. Moreover, it is important that it does not simplify, embellish or otherwise alter the information in the context. Journalistic texts present conflicting viewpoints, and it is important that the system preserves this diversity while forming answers.
  • Reliable sourcing: The system should also make it easy to fact-check its own answers. If you need to use Google to check whether the answers of the system are true, why not just use Google to answer your question?
  • Relevance: The system should base its answers on the latest information. In live sports coverage, information that is five minutes old can already be outdated.
  • Latency: If the system takes too long to form an answer, it can already be outdated by the time it arrives.

Retrieval

Choosing the correct retrieval methods for RAG depends on the data sources. For our purposes, I chose to use a combination of several techniques. To make it easy to benchmark different configurations, I designed the system so that the different retrieval modes could be toggled on and off or stacked in different ways (a minimal sketch of this toggling follows the list below):

  • Lexical (word-level) search with the BM25 algorithm
  • Vector search with embedding models
  • LLM-generated database queries
  • Knowledge graphs
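
Below is a minimal sketch of what this toggling could look like in Python. The class and function names are hypothetical, not taken from our actual codebase; the point is simply that each retriever is a pluggable component that can be switched on or off per benchmark run.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class RetrievalStack:
    # Each retriever maps a question to a list of retrieved documents.
    retrievers: dict[str, Callable[[str], list[str]]] = field(default_factory=dict)
    enabled: set[str] = field(default_factory=set)

    def toggle(self, name: str, on: bool) -> None:
        # Switch a retrieval mode on or off between benchmark runs.
        if on:
            self.enabled.add(name)
        else:
            self.enabled.discard(name)

    def retrieve(self, question: str) -> list[str]:
        # Run every enabled retriever and pool the results.
        results: list[str] = []
        for name in self.enabled:
            results.extend(self.retrievers[name](question))
        return results

# Usage: register the real retrievers (BM25, vector search, ...) and toggle them as needed.
stack = RetrievalStack(retrievers={"bm25": lambda q: [], "vectors": lambda q: []})
stack.toggle("bm25", True)
print(stack.retrieve("Kerro Aleksi Lepästä"))
```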

Lexical vs. Semantic Search

Traditional text search algorithms, such as BM25, rank their results based on how many of the words in your search query appear in the documents. This is great if you know the words for the thing you want information about, but it has its drawbacks. As Michel Foucault writes in The Archaeology of Knowledge, two sentences can be lexically identical, yet belong to completely different discourses, and therefore carry different meanings. Conversely, two sentences might have no words in common, yet carry a similar meaning if they belong to the same discourse.

The most common retrieval method in RAG is so-called semantic (meaning-level) search, or vector search. It tackles the shortcomings of traditional text retrieval by utilizing a feature of LLMs: text embeddings. LLMs don’t process text as letters or words, but as multidimensional vectors, known as embeddings. These embeddings form a space, called the embedding space, where embeddings that are likely to occur in a sequence are grouped closer together. When an LLM generates text, it searches this space and chooses the embeddings that are closest.

In semantic search, both the documents and the question are fed to an embedding model, which calculates embeddings for both. Then, using cosine similarity or another geometric metric, the angle of the question embedding is compared to the angles of the document embeddings. This metric, known as the relevance score, can then be used to rank the documents and retrieve those that are the best match.
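
As a toy illustration (not our production code), the ranking step boils down to something like the following, assuming you already have a function that turns text into fixed-length vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two embedding vectors: 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(question_emb: np.ndarray, doc_embs: list[np.ndarray]) -> list[int]:
    # Return document indices sorted from most to least relevant.
    scores = [cosine_similarity(question_emb, d) for d in doc_embs]
    return sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)

# Toy example with made-up three-dimensional "embeddings".
question = np.array([0.2, 0.9, 0.1])
documents = [np.array([0.1, 0.8, 0.2]), np.array([0.9, 0.0, 0.4])]
print(rank_documents(question, documents))  # -> [0, 1]
```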

Calling this technique semantic can be a bit misleading, as the embedding representations are just long lists of numbers that look meaningless to human eyes. As with most things involving neural networks, it is difficult to discern what exactly the model is doing. Yet, somehow, embeddings can be used to discover texts that have none of the same words as the question yet contain information that is relevant to answering it. Traditional BM25 still outperforms embeddings in many use cases, so our system utilized a combination of both.
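
Here is a hedged sketch of such a hybrid, built on the rank_bm25 and sentence-transformers packages rather than on our actual Vespa setup. The equal 50/50 weighting and the min-max normalization are arbitrary illustrative choices, and I am assuming BGE-M3 can be loaded through sentence-transformers for its dense vectors:

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

documents = [
    "Wilma Murto ylitti 4,80 seiväshypyn finaalissa.",
    "Denmark won the men's handball final against Germany.",
]
question = "How did Finland do in the pole vault?"

# Lexical scores: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([d.lower().split() for d in documents])
lexical = np.array(bm25.get_scores(question.lower().split()))

# Semantic scores: dot product of normalized embeddings (equivalent to cosine similarity).
model = SentenceTransformer("BAAI/bge-m3")
doc_emb = model.encode(documents, normalize_embeddings=True)
q_emb = model.encode(question, normalize_embeddings=True)
semantic = doc_emb @ q_emb

# Min-max normalize each signal and combine with equal weights.
def minmax(x: np.ndarray) -> np.ndarray:
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

hybrid = 0.5 * minmax(lexical) + 0.5 * minmax(semantic)
print(hybrid.argsort()[::-1])  # document indices, best match first
```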

The emergence of RAG has given rise to specialized vector databases, which are designed to store lots of embeddings and access them quickly. However, most of them are just buckets for vectors and lack the basic features that are commonplace in non-vector databases, such as filtering results by timestamps and metadata tags. Eventually, I settled on Vespa, which offers a lot of customization, including the option to write your own ranking algorithms. Vespa allowed us to import the schema used to store the Yle articles without losing any of the associated metadata.
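
To give an idea of what this looks like in practice, a query combining vector search with a metadata filter can be sent to Vespa’s HTTP query endpoint roughly as below. The document type (article), field names (published_at, embedding) and ranking profile are hypothetical stand-ins, not Yle’s actual schema:

```python
import requests

# Hypothetical schema: an "article" document type with a timestamp and an embedding field.
yql = (
    "select * from article "
    "where published_at > 1723334400 "                        # 2024-08-11 00:00 UTC
    "and ({targetHits:20}nearestNeighbor(embedding, q_emb))"  # approximate vector search
)

response = requests.post(
    "http://localhost:8080/search/",
    json={
        "yql": yql,
        "input.query(q_emb)": [0.1, 0.2, 0.3],  # the question embedding (toy values)
        "ranking": "hybrid",                    # a custom ranking profile defined in the schema
        "hits": 10,
    },
    timeout=10,
)
for hit in response.json().get("root", {}).get("children", []):
    print(hit["relevance"], hit["fields"].get("title"))
```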

As for the embedding model, our selection was limited because the model needed to perform well in multilingual retrieval. We used both BGE-M3, which is developed by the Beijing Academy of Artificial Intelligence, and OpenAI’s text-embedding-3-large. In practice, I found no meaningful difference in their real-world performance or latency. As BGE-M3 was self-hosted on Yle’s cloud and quantized (compressed) with the ONNX framework to run on light resources, the costs were also comparable. BGE-M3 had the added benefit of supporting hybrid retrieval, meaning that it could be used as a lexical retriever at the same time. For a systematic analysis of how embedding models fare in Finnish, I recommend reading this article.
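
For reference, BGE-M3 exposes both its dense and lexical (“sparse”) representations through the FlagEmbedding package; a rough sketch of that interface is below (our additional ONNX quantization step is not shown, and the example sentences are illustrative):

```python
from FlagEmbedding import BGEM3FlagModel

# Loads BAAI/bge-m3; use_fp16 roughly halves memory use at a small accuracy cost.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

texts = ["Kerro Aleksi Lepästä", "Aleksi Leppä edustaa Suomea Pariisin olympialaisissa."]
out = model.encode(texts, return_dense=True, return_sparse=True)

dense_vectors = out["dense_vecs"]         # one 1024-dimensional vector per text
lexical_weights = out["lexical_weights"]  # per-token weights usable for lexical matching
print(dense_vectors.shape, len(lexical_weights))
```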

Database Query Generation

An early draft of our database schema, containing the fields “athlete”, “record”, “sport”, “schedule” and “article”.

In addition to text documents, the data also contained tables. I experimented with feeding this data to Vespa as-is, without generating embeddings, and then generating database queries using GPT-4-Turbo, our choice of LLM. However, this approach proved to be quite challenging, and I would advise against going about it the way we did.

First, Vespa uses a subset of the YQL query language, which is similar but not identical to SQL, a much more commonly used query language. This meant that the LLM often mixed the two, causing errors. The syntax issues were eventually alleviated with a lot of prompt engineering and a few functions that checked that the queries were well-formed. The queries still fail around 10% of the time, as LLMs are very good at coming up with interesting new ways to fail.

Secondly, the database schema was very complex, featuring lots of terminology in different languages. This might be why the LLM often made queries that were nonsensical. For example, when asked to tell about the athlete Aleksi Leppä, the LLM generated the query “select name from athlete where name contains ‘Aleksi Leppä’”. This query, while syntactically correct, was tautological, and only returned “Aleksi Leppä”.

The third issue was the limited Finnish proficiency of GPT-4-Turbo and the other OpenAI models. While they perform reasonably well in simple text generation tasks, their lack of language understanding is a limiting factor when it comes to prompts that require complex reasoning. Another time, using the same question from our evaluation set, which in Finnish is “Kerro Aleksi Lepästä” (“Tell me about Aleksi Leppä”), the LLM generated the query “select text from athlete where name contains ‘Aleksi Lepäs’”, which did not return any matches. I tried to circumvent these lemmatization issues by implementing fuzzy matching on the text fields, meaning that the database would allow matches even if a few letters were not identical. However, this did not eliminate the issue completely, as sometimes the LLM would also choose words that were just flat-out wrong. Even when the prompt contained the names of all the sports it could search, it still chose to use “uiminen”, which refers to the activity of swimming, instead of “uinti”, which means swimming as a sport.

If the database query resulted in an error or an empty response, it was easy to retry and get a result; I never saw the system fail on the second try. A more nefarious problem was the queries that succeeded but were either searching the wrong place or were otherwise irrelevant. Breaking the prompt up into several smaller prompts might have helped with this issue, but it was not possible: in order to choose which database field to query, the LLM needed to be presented with all of them simultaneously.
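
A hedged sketch of this generate-validate-retry loop, using the OpenAI Python client. The schema description, prompt and validation checks are simplified stand-ins for what we actually used:

```python
from openai import OpenAI

client = OpenAI()

# Simplified description of the searchable fields (compare the schema draft above).
SCHEMA = "athlete(name, sport, text), record(athlete, event, result), schedule(sport, start_time)"

def generate_query(question: str) -> str:
    prompt = (
        f"Write a single Vespa YQL query that answers: {question}\n"
        f"Available fields: {SCHEMA}\n"
        "Return only the query, with no explanation."
    )
    reply = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip()

def looks_valid(query: str) -> bool:
    # Very rough sanity checks; the real validation inspected fields and syntax more carefully.
    lowered = query.lower()
    return lowered.startswith("select") and " from " in lowered

def query_with_retry(question: str, max_tries: int = 2) -> str | None:
    for _ in range(max_tries):
        query = generate_query(question)
        if looks_valid(query):
            return query
    return None

print(query_with_retry("Kerro Aleksi Lepästä"))  # "Tell me about Aleksi Leppä"
```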

Knowledge Graphs

A zoomed-in view of the “athletics” node, with the “pole vaulting” node in the upper left corner.

In knowledge graph retrieval, information is represented as nodes and their relationships in a graph. While the content is being fed into the database, an LLM is prompted to extract entities, as well as their relationships, from the text. In our case, we were interested in two kinds of entities, athletes and sports, and in how the athletes had fared while competing in those sports. This information is then stored along with the texts themselves.

At query time, the knowledge graph is searched by having the LLM identify keywords in the question posed by the user. The retrieved nodes and relationships can either be offered directly to the LLM as context or used to subsequently search for data featuring those keywords (we did both).

This approach is useful, as it can offer the LLM a more comprehensive picture of how different entities are related to each other. For example, when asked how Finland is doing in athletics, it could search for the node “Athletics”, which would have a bunch of athlete nodes attached to it, such as “Wilma Murto -> jumped 4.80 m in -> Athletics” or “Saga Vanninen -> will compete tomorrow in -> Athletics”. Then, it could use those nodes to answer how each of them was doing in relation to that sport: “Wilma Murto has just jumped 4.80 meters, whereas Saga Vanninen will compete tomorrow.”
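
As a toy illustration of this lookup (not our actual implementation), the extracted triples can be kept in a small directed graph and queried by node name:

```python
import networkx as nx

# Triples extracted by the LLM: (subject, relation, object).
triples = [
    ("Wilma Murto", "jumped 4.80 m in", "Athletics"),
    ("Saga Vanninen", "will compete tomorrow in", "Athletics"),
]

graph = nx.DiGraph()
for subject, relation, obj in triples:
    graph.add_edge(subject, obj, relation=relation)

# "How is Finland doing in athletics?" -> gather everything attached to the Athletics node.
for athlete, sport, data in graph.in_edges("Athletics", data=True):
    print(f"{athlete} {data['relation']} {sport}")
```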

Sports are a good test case for knowledge graph retrieval, because their ontologies are mostly pre-established, making it easy to evaluate how good the results truly are. The greatest limiting factor was once again GPT-4-Turbo’s language understanding. Even though we included the athletes and their sports in the prompt (which is bad practice, as this kind of approach wouldn’t really scale beyond our test use case), the LLM struggled to consistently extract the correct entities. It would, for example, create graphs like “Murto -> placed sixth in -> Pole vaulting”, which, while not wrong, is problematic for a few reasons. By not identifying that “Murto” and “Wilma Murto” are the same person, it creates a new node that is likely to be left out when searching for “Wilma Murto”. And because it fails to identify pole vaulting as one of the events belonging to athletics, the result would also not show up when searching for athletics results in our earlier example. The system would then incorrectly state that Murto was still competing, because the latest information was attached to nodes that did not appear in the search.

Data Enrichment Techniques

In addition to so-called naïve RAG, where the user’s question is directly compared against the documents, there are also methods to enrich the query and data to produce more relevant results from vector search (and lexical search). We tested three different methods:

  • HyDE: Hypothetical Document Embeddings
  • Question generation
  • Summary generation

In HyDE, the LLM is first asked to create a hallucinated answer to the question, which is then compared against the documents instead of the question itself. In our case, it was prompted to produce an imaginary Yle news article.
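
A minimal HyDE sketch using the OpenAI client; the prompt wording is illustrative, and the embedding call stands in for whichever embedding model is in use:

```python
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    # Stand-in for the embedding model (BGE-M3 or text-embedding-3-large in our case).
    response = client.embeddings.create(model="text-embedding-3-large", input=text)
    return response.data[0].embedding

def hyde_embedding(question: str) -> list[float]:
    # 1. Ask the LLM to hallucinate a plausible Yle-style article that answers the question.
    reply = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": f"Write a short, imaginary Yle news article that answers: {question}",
        }],
    )
    fake_article = reply.choices[0].message.content
    # 2. Embed the hallucinated article and use it for retrieval instead of the raw question.
    return embed(fake_article)
```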

Question generation and summary generation both enrich the data being retrieved. In question generation, the LLM is presented with the text in the database and prompted to produce questions that could be answered with the document. During retrieval, these questions are then matched with the user’s question. In summary generation, each document is summarized using an LLM, and then these summaries are matched with the question during retrieval. Optionally, the original article can then be retrieved along with the summary.
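
Question generation can be sketched the same way: each document stored in the database gets a handful of LLM-written questions whose embeddings are indexed alongside it. The prompt and the number of questions below are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def generate_questions(document: str, n: int = 3) -> list[str]:
    reply = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": f"Write {n} questions, one per line, that this text answers:\n\n{document}",
        }],
    )
    return [line.strip() for line in reply.choices[0].message.content.splitlines() if line.strip()]

# At query time, the user's question is matched against these generated questions,
# and the original document is returned as context for the best matches.
```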

We manually evaluated these techniques with a set of 20 questions (60 variants in total). We noticed that HyDE was marginally better than the naïve approach, so we kept it as an option, although it added around 5 seconds of latency. As there was no harm in including summaries in the dataset, we left those in as well, although by doing so we ran the risk of adding hallucinated content. Once the games were ongoing, we realized that Yle’s live news feed about the Olympics mostly consisted of content so short that it was useless to summarize, so I got rid of summary generation.

Retrieval Re-ranking and Filtering

It is customary to use some kind of system to rank the retrieved contexts. The idea is to give the LLM as little text as possible to save on costs and latency and to improve the accuracy of the resulting responses by not presenting the LLM with results that are irrelevant.

With BGE-M3, I followed the implementation from the original paper by having the embedding model rank the results using multi-vector retrieval: several documents are first retrieved with single embeddings and then re-ranked using a relevance score calculated from the individual vectors of each retrieved document. With text-embedding-3-large, I used a shortened embedding with only 256 dimensions for the initial retrieval, and then used the full 3072-dimension embedding to re-rank those results. The idea behind both methods is to save time by going through all documents at lower accuracy, and then using the highest possible accuracy on only the retrieved subset.
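
For the OpenAI model, the two-stage setup looks roughly like this; the `dimensions` parameter of the embeddings API truncates the vector, which is what makes the cheap first pass possible. In a real system the document embeddings would of course be precomputed and stored, not recalculated per query:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str], dims: int) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-large", input=texts, dimensions=dims
    )
    vectors = np.array([item.embedding for item in response.data])
    # Normalize so that dot products equal cosine similarities.
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def two_stage_search(question: str, docs: list[str], first_k: int = 20, final_k: int = 5) -> list[str]:
    # Stage 1: coarse retrieval with short 256-dimension embeddings.
    coarse = embed(docs, 256) @ embed([question], 256)[0]
    candidates = [docs[i] for i in coarse.argsort()[::-1][:first_k]]
    # Stage 2: re-rank only the candidates with the full 3072-dimension embeddings.
    fine = embed(candidates, 3072) @ embed([question], 3072)[0]
    return [candidates[i] for i in fine.argsort()[::-1][:final_k]]
```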

I tried using GPT-4-Turbo as a re-ranker by prompting it to filter out a certain number of irrelevant documents. However, it fared worse than just getting rid of the lowest 50% of documents based on their relevance scores.

Additionally, I weighted the retrieval results based on their recency, which was pivotal in finding the most recent sports results. When the Olympics were already ongoing, we realized that simply weighting the results was not enough, and we transitioned to using a strict cut-off instead. When the user asked a question, GPT-4 was prompted to create time codes that were used to filter the results. GPT-4 excelled at this task as long as it was provided with the current time and weekday in the prompt. For example, if the user wanted to know what was going on today, it was almost 100% certain that the system would only use sources dated after 0:00 on the same day. We included this time range as part of the answer in the UI, so that it would be easier for users to see whether their prompt had been interpreted correctly.
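
A hedged sketch of that time-window step: the LLM is given the current time and asked to return a start and end timestamp, which are then applied as a hard filter. The prompt and JSON format are illustrative, and the parsing assumes the model returns plain ISO 8601 timestamps:

```python
import json
from datetime import datetime, timezone
from openai import OpenAI

client = OpenAI()

def time_window(question: str) -> tuple[datetime, datetime]:
    now = datetime.now(timezone.utc)
    prompt = (
        f"Current time: {now.isoformat()} ({now.strftime('%A')}).\n"
        f"Question: {question}\n"
        'Return the relevant time range as JSON: {"start": "...", "end": "..."} in ISO 8601.'
    )
    reply = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    window = json.loads(reply.choices[0].message.content)
    return datetime.fromisoformat(window["start"]), datetime.fromisoformat(window["end"])

def filter_by_time(docs: list[dict], start: datetime, end: datetime) -> list[dict]:
    # Hard cut-off: keep only documents published inside the window.
    return [d for d in docs if start <= d["published_at"] <= end]
```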

Evaluation

Many RAG evaluations use synthetic datasets to measure how well documents are retrieved. These datasets are generated similarly to how we did question generation, by making the LLM create questions based on each document in the database. However, this kind of evaluation does not reflect the real-world performance of a RAG system, so we opted for a very different methodology. We asked journalists to come up with questions they would want to ask such a system, and assembled an evaluation set of 64 questions out of them. This could be seen as an unfair way to evaluate the system, as the journalists had no knowledge of what kind of materials the system had access to. However, it is a much more realistic simulation of real-world use, as we cannot expect users to know in advance what kind of questions they should pose.

A simplistic evaluation of the results: 22 answers with no information, 32 right, and 10 wrong.

On a simplistic level, this is what the results look like: 50% of the answers were correct, 16% were incorrect, and in 34% of cases the chatbot claimed it didn’t know. For an answer to be labelled incorrect, it only needed to contain one untrue thing, whereas for an answer to be labelled correct, everything needed to be true. However, this level of analysis does not get us far in surveying the real-world usefulness of the chatbot. Throughout the development process, I had also been working on what I consider a more fine-grained and meaningful taxonomy for evaluating the responses. It distinguishes between the different causes of the issues, such as information retrieval and language understanding.

Right Answers

  • R1: The response is correct, cohesive and up to date. It is based on the most relevant context.
  • R2: The response is correct but is not based on the most relevant context.
  • R3: The response is correct but is not based on the context.
  • R4: The response is correct but does not use the most relevant information in the context or presents information in the context inaccurately.

Wrong Answers

  • W1: The response is incorrect, even though the context would have featured correct information.
  • W2: The response is incorrect, and the context features no correct information.
  • W3: The response is incorrect because the question has been misinterpreted or a part of the question has been ignored.
  • W4: The response is incorrect because it is unintelligible.

No Answers

  • N1: The chatbot correctly declines to answer because it has no access to the correct information.
  • N2: The chatbot correctly declines to answer because the context does not contain correct information.
  • N3: The chatbot incorrectly declines to answer even though the context contains the correct information.
  • N4: The chatbot correctly declines to answer because the question is unanswerable.

With this taxonomy, the results look like this:

A more comprehensive classification of the results: N1 was the most common “no info” category, R1 the most common “right” category, and W1 the most common “wrong” category.

Of the correct answers, most belong to R1, the optimal category. In the “no information” section, N1 represents the “unfair” cases where the LLM had no access to correct information. The system seems to handle these cases well, and claiming ignorance is preferable to hallucinating. However, four responses in the W2 category indicate that it will sometimes incorrectly assume it has the information needed to answer a question when it doesn’t. In other words, it misinterpreted the context as containing relevant information.

Based on these results, we can say that information retrieval still needs work, as nine of the “no information” responses were due to retrieval issues. Looking more closely at those cases, almost all of them were questions where the correct information would have been behind the database queries, confirming that this method performed worse than the other retrieval methods.

Another area of improvement is the cases where the LLM receives the correct information but does not know how to utilize it while forming an answer. These cases fell into categories N3, R4 and W1. They were mostly due to limited language understanding capabilities and might improve if we chose a different LLM.

The high number of unanswered questions is more of a user experience design problem than a retrieval problem. As users cannot be expected to know all the sources, how do you communicate what kind of questions they should be asking? Most RAG systems work the other way around by having the user input their own documents. This has the benefit that the user most likely already has some notion of what those documents contain.

Review: Did we Meet our Requirements?

Question:

Give a summary of today’s most important news in the Paris Olympics.

Answer:

“Searching between: 2024-08-11 00:00:00–2024-08-11 23:59:59

Today’s most important news from the Paris Olympics includes the conclusion of the women’s basketball final where the USA claimed a dramatic victory over France with a one-point margin, finishing 67–66. This win not only secured another gold for the USA but also ensured their position as the top nation in the overall medal table of the Paris Olympics, with a total of 40 gold medals.

Additionally, the men’s handball final saw Denmark triumph over Germany with a score of 39–26, marking a significant victory for Denmark’s handball icon Mikkel Hansen, who achieved a perfect ending to his playing career.

In other news, the women’s volleyball final unfolded with Italy securing a historic Olympic gold by defeating the USA in straight sets, marking their first-ever Olympic medal in women’s volleyball.

These pivotal events have shaped the closing stages of the Paris Olympics, highlighting remarkable performances and record-setting achievements. For more details on the final medal tally and individual achievements, you can visit the latest updates here.”

Looking at this response, which happens to be an especially egregious example of the LLM disobeying the prompt, it is clear that the system still has a long way to go. Returning to the requirements we set for it, it fails at being truthful and at offering reliable sourcing, while it succeeds at being relevant. As for latency, we opted for a system where the user can adjust it to their liking, sacrificing some accuracy for speed. The evaluation was done using the longer latency mode, which takes around a minute to answer.

In The Archaeology of Knowledge, Michel Foucault criticizes research for striving toward “the greatest possible number of contradictions resolved by the simplest means”. This quote came to mind while I was going through the 64 answers of the LLM. Instead of simply quoting the relevant statements, the LLM tries to uncover the meaning behind them and then present us with that meaning. Even when explicitly prompted not to, it offers a straightforward narrative told from a God’s-eye view. Things represented in the original articles as quotes from interviewees, as opinions, or as statements by experts or governing bodies have all lost their original context. Without fail, it also concludes with bland final remarks we have begged it not to include.

One of the hardest tasks throughout this project has been engineering prompts where the multifaceted, even contradictory elements of the original materials are preserved, as that kind of transparency would be essential for journalistic use cases. Sources remain another issue: when asked for lists, the system can usually point to the individual articles it has gotten the information from, but with summaries, such as the example above, it usually presents only a few. There are of course ways around this issue, such as forcing the LLM to return the source for each sentence.

To sum up the results of this experiment, we are currently not limited by the retrieval of relevant information, but by the LLMs’ ability to understand and contextualize that information in a way that is useful. Most things could still be improved simply by engineering better prompts, including the aforementioned issues with preserving the context of statements. Yet there are quite a few problems that are harder to tackle because they stem from the limited capabilities of LLMs in low-resource languages.

Acknowledgments

My thanks go to Jouni Frilander from Yle Innovations, Johanna Rikman and Ossi Saarinen from Yle Sports, and to the entire News Lab team.
