Evaluating LLMs for Creative Writing in Finnish

Vertti Luostarinen
10 min read · Jan 13, 2025


Lovecraftian experiments with open weight models

The standard benchmarks for evaluating large language models (LLMs) rely on easily quantifiable metrics, such as math problems and multiple-choice questions. As a result, they don’t really measure many capabilities I personally find important, chiefly performance on creative tasks and the quality of the models’ Finnish. I have therefore developed my own rather ad hoc and subjective methodology for evaluating new model releases.

Although all the quotes in this article are in Finnish, I decided to write in English. To those who are not Finnish speakers, this article might still have value as a reminder of how far behind smaller languages are in the LLM race. Although I don’t expect my methods to be very widely applicable, perhaps they can serve as an example of how to perform these kinds of unscientific, practical tests.

The prompt I use is simply “Kirjoita tarina miehestä ja norpasta saarella H.P. Lovecraftin tyyliin” (“Write a story about a man and a ringed seal in the style of H.P. Lovecraft”). I then rate the stories on four metrics: Finnish proficiency (from illegible to C1), Lovecraftian style (1–5), dramatic structure and coherence (1–5), and whether the ringed seal, “norppa”, is understood correctly (yes or no). As a side note, I think I might switch my methodology for assessing the language capabilities of the models, because the mistakes they make are not really that similar to those made by human language learners.
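For bookkeeping, the rubric can be sketched as a small Python record. The `StoryRating` class and field names below are my own illustration of the four metrics, not part of any evaluation library:

```python
from dataclasses import dataclass

# CEFR-style levels used for the Finnish proficiency rating,
# from illegible output ("No") up to C1.
FINNISH_LEVELS = ["No", "A1", "A2", "B1", "B2", "C1"]


@dataclass
class StoryRating:
    model: str
    finnish_proficiency: str   # one of FINNISH_LEVELS
    lovecraftian_style: int    # rated 1-5
    structure_coherence: int   # rated 1-5
    norppa_understood: bool    # did the model grasp the ringed seal?

    def __post_init__(self):
        # Validate ratings so a typo can't slip into the results table.
        if self.finnish_proficiency not in FINNISH_LEVELS:
            raise ValueError(f"unknown level: {self.finnish_proficiency}")
        for score in (self.lovecraftian_style, self.structure_coherence):
            if not 1 <= score <= 5:
                raise ValueError("style/structure scores are rated 1-5")


# Example: one rating record.
rating = StoryRating("Nemotron-4-340B-Instruct", "B2", 4, 3, True)
```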

The prompt is, by design, quite challenging to follow. A major feature of Lovecraftian prose is its use of antediluvian expressions, which makes it well suited for testing the breadth of a model’s vocabulary. As an added challenge, the word for the ringed seal is inflected in the prompt, and both inferring the correct base form and understanding the meaning of the word prove quite tricky for many models.

Lovecraft’s stories, at least the famous ones, tend to follow a very similar structure, making it easy to assess whether the model is sticking to the prompt. Often featuring only a few characters and locations, Lovecraft’s plots are simple enough that, at least in theory, even the smaller models should be able to handle them. Generating coherent stories is a demanding task for an LLM, as it requires both reasoning and creativity. In my experience, most models favor the former at the cost of the latter.

Below, I present my results in no particular order.

Nemotron-4-340B-Instruct

| Model                    | Finnish Proficiency | Lovecraftian Style | Dramatic Structure and Coherence | Norppa understood? |
| ------------------------ | ------------------- | ------------------ | -------------------------------- | ------------------ |
| Nemotron-4-340B-Instruct | B2 | 4 | 3 | Yes |

Nemotron-4-340B is a very large model developed by Nvidia. While the generated story is good, the results are far from ideal considering the size of the model. The model seems to “think” in English, occasionally inventing words when it then “translates” the story into Finnish.

Saari, joka on tunnettu vain muutamille uskaliaille merimiehille ja satunnaisille harhaileville merilinnuille, oli karu ja vieraanvaraton paikka, sen kallioiset rannat ja niukka kasvillisuus tarjosivat vain vähän lohtua niille, jotka uskalsivat astua sen rannoille. Silti, juuri tänne meidän päähenkilömme, yksinäinen ja melankolinen hahmo nimeltä Alvar, oli päättänyt asettua, ajautuneena syvään kaipuuseen yksinäisyyteen ja viehätykseen tuntemattomasta.

In the story, a seeker of the arcane arts, Alvar, is driven insane by looking into the “deep and soulful” eyes of a seal. This feels very Lovecraftian, although the seal is given very little in the way of backstory. The ending is a slight letdown, as Alvar becomes a hermit, even though he was already living alone on a desolate island. In fact, this is the only story where the protagonist inhabits the island from the beginning.

Gemma-2-27b-it

| Model          | Finnish Proficiency | Lovecraftian Style | Dramatic Structure and Coherence | Norppa understood? |
| -------------- | ------------------- | ------------------ | -------------------------------- | ------------------ |
| Gemma-2-27b-it | B1 | 4 | 3 | No |

Gemma-2-27b is a medium-sized LLM developed by Google. It produces surprisingly coherent Finnish for its size, but completely forgets to mention the seal.

Kun laiva rantautui saaren karuun rantaan, Silas tunsi itsensä yhä enemmän ahdistuneena. Maasto oli kuollut ja hiljainen, kuin jokin hautausmaa. Ainoa ääni oli tuulen murahdus ja laineiden pauhu.

The story features two men, Silas and Thomas, who discover some ruins on an island and, once inside the ruins, get mauled to pieces by a mysterious creature. Notably, this is the only story with any gore in it. The story is not very good, but its brisk pacing does remind me of Lovecraft’s earlier, and arguably weaker, stories.

Llama-3.1-70B

| Model         | Finnish Proficiency | Lovecraftian Style | Dramatic Structure and Coherence | Norppa understood? |
| ------------- | ------------------- | ------------------ | -------------------------------- | ------------------ |
| Llama-3.1-70B | B2 | 2 | 4 | Yes |

Llama-3.1-70B is considered one of the frontier foundation models. While its Finnish has only minor errors, its vocabulary is quite sparse, which does not lend itself well to Lovecraft’s verbose style.

Norppa oli suuri ja salaperäinen eläin. Se tarkkaili Malcolmia, ja Malcolm tarkkaili sitä. Hän tunsi, että norppa oli jotenkin yhteydessä hänen mielensä, ja hän alkoi tuntea outoa yhteyttä eläimeen.

The story is about a man who arrives on the island and is driven insane by forming a connection with a friendly yet mysterious seal. The story lacks a lot of detail and almost reads as a synopsis, as it does not really spend any time building atmosphere. Although the story elements are Lovecraftian, the way the story is told is not.

Llama-3.1-8B

| Model         | Finnish Proficiency | Lovecraftian Style | Dramatic Structure and Coherence | Norppa understood? |
| ------------- | ------------------- | ------------------ | -------------------------------- | ------------------ |
| Llama-3.1-8B | A2 | 3 | 3 | No |

Llama-3.1-8B is a small and popular version of Meta’s Llama 3.1. The Finnish it produces is full of jarring errors, but still legible. It calls the norppa “Nopas” and interprets it as a four-armed black creature.

Yöllä, kun Wystan palasi saaren keskelle, hän näki jotain, mikä sai hänet peloissa. Nopas, jota hän oli aina vain kuvitellut, oli olemassa. Se oli suuri, musta olento, jolla oli neljä kättä ja neljä jalkaa. Sen silmät olivat mustia kuoppia, ja sen ääni oli kuin tuhkaa, joka leikkasi ilmaa.

The story is again about a man arriving on a boat, finding some creature in a ruin, and then going insane. While the story features some nonsensical details that go nowhere, such as “black matter that is tied into old tree [sic]”, it does end with Wystan escaping and going insane, which feels very appropriate.

Phi-3.5-mini-Instruct

| Model                 | Finnish Proficiency | Lovecraftian Style | Dramatic Structure and Coherence | Norppa understood? |
| --------------------- | ------------------- | ------------------ | -------------------------------- | ------------------ |
| Phi-3.5-mini-Instruct | No | 1 | 1 | No |

Phi-3.5 is a series of lightweight models developed by Microsoft. They claim to have included some training data in Finnish, and indeed, it can produce sequences of words that would look like Finnish to someone who doesn’t know the language:

Kaikki, mikä oli ennen, alkoi uhmaasti muuttumaan. Värien äänien tuulet ja ääret tuli hirviöisesti muuttama hiljaisena rypää, muuttuja, joka siveli hiljaa, ainoastaan muutama kasvipohjaisen hengen muuttunut. Läpiin hämärän sivellin aavemäärää, tietysti tietysti, hiljaisen hämmän, kuin tuulista aurinkoa, kasvipohjaisen hengen voi muuttua hiljaa, kun kaikki muuttuu hölmien kautta.

As poetry, I find it quite inspirational. You can almost, but not quite, make out what the LLM is trying to say. More than Lovecraft, it reminds me of the Electric Bard from Stanislaw Lem’s The Cyberiad (in its uncalibrated state).

Ahma-3B-IQ3S-GGUF

| Model              | Finnish Proficiency | Lovecraftian Style | Dramatic Structure and Coherence | Norppa understood? |
| ------------------ | ------------------- | ------------------ | -------------------------------- | ------------------ |
| Ahma-3B-IQ3S-GGUF | B2 | 1 | 1 | Yes |

Ahma-3b is a Finnish model developed by Finnish-NLP as a hobby project. I ran a quantized version of it on my mobile phone. Considering these limitations, the results are very good.

Olipa kerran rauhallisella saarella mies ja hänen rakas norppansa. He elivät yhdessä onnellisina elämänsä loppuun asti. Sinä ja mies olitte aina olleet läheisiä. Eräänä päivänä mies lähti kalastamaan. Kun hän palasi, hän näki, että norppa oli hävinnyt. Se oli kuollut hukkumalla. Mies oli murtunut. Hän kysyi, mitä norppa oli tehnyt. Mies kertoi, että se oli yrittänyt saada kalaa.

This story is probably the most original of the bunch. It tells of a man who has a pet seal that drowns. He then gets an elephant he also names Norppa. The rest of the story lists all of Norppa the elephant’s many achievements, including the ability to walk on water. While the story is very endearing, it has absolutely nothing to do with H.P. Lovecraft.

DeepSeek-v3

| Model       | Finnish Proficiency | Lovecraftian Style | Dramatic Structure and Coherence | Norppa understood? |
| ----------- | ------------------- | ------------------ | -------------------------------- | ------------------ |
| DeepSeek-v3 | B2 | 4 | 4 | No |

DeepSeek-v3, developed by DeepSeek AI, is an absolutely humongous model, with 671 billion parameters, and is currently considered one of the very best in the world. Yet it seems to have been bested by my benchmark, as it failed to include the seal in the story. The model’s Finnish is fluent, apart from a few minor mistakes.

Saaren kyläläiset olivat synkkä ja vaikeniva joukko. Heidän silmänsä olivat ontot ja pelonsekaiset, ja he puhuivat vain harvoin, aina kuiskaten. Nathaniel yöpyi kylän ainoassa majatalossa, jonka omisti riutunut, kyyryinen mies, joka esitteli itsensä nimellä Jebediah. Kun Nathaniel mainitsi Eliasin nimen, Jebediahn kasvot vääntyivät kauhistuneesti, ja hän mutisi: “*Hän meni sinne… sinne, missä ne asuvat. Hän ei pitänyt kiellosta.*”

This is yet another story about a man who sees a creature on the island and escapes physically unharmed, but mentally traumatized by what he has seen. Interestingly, this is the only story where the island is already inhabited by some locals. The way they are described is very reminiscent of Lovecraft, apart from a distinct lack of racist undertones, which is a major improvement over the original stories. The ending of the story feels very rushed, however. It seems the LLM began what would have been a much longer story, then suddenly realized it needed to wrap things up in a few paragraphs before running out of context. This is a common issue I run into, especially with larger models.

Poro-34b-Chat-GGUF-4b

| Model                 | Finnish Proficiency | Lovecraftian Style | Dramatic Structure and Coherence | Norppa understood? |
| --------------------- | ------------------- | ------------------ | -------------------------------- | ------------------ |
| Poro-34b-Chat-GGUF-4b | C1 | 3 | 1 | Yes |

Poro-34b is a model by the Finnish company Silo AI, trained on Finnish and English data. I was only able to run the 4-bit quantized version locally. Unsurprisingly, its Finnish was the best of the bunch, although not quite native level, with strange capitalization and compound-word errors.

Hän veti syvään henkeä ja kutsui muinaisia kieliä käyttäen Suurta Muinais Olentoa nimeltä Cthulhu. Tämä sai aikaan omituisen väreilyn hänen sormiensa päissä ja sitten — ihme kyllä — valtava lonkeromainen käsi ilmestyi vedestä aivan hänen edessään!

The protagonist, hilariously named Arthursen Norpington, never gets to the island, because he jumps off the boat to chase the seal. Then, Arthursen summons the great old one Cthulhu for some reason. Cthulhu arrives, but leaves in the next sentence for reasons that are also not explained, and Arthursen is saved. Although this is the only story that features a named creature from the Cthulhu mythos, the way the story elements are treated is far from Lovecraftian. Reminiscent of Ahma’s story, Poro’s tale is full of incoherent twists and turns.

Qwen2.5-72B-Instruct

| Model                | Finnish Proficiency | Lovecraftian Style | Dramatic Structure and Coherence | Norppa understood? |
| -------------------- | ------------------- | ------------------ | -------------------------------- | ------------------ |
| Qwen2.5-72B-Instruct | B1 | 4 | 3 | No |

Qwen2.5-72B is another fairly large model, developed by Alibaba. Similar to Nemotron, it suffers from the “thinking in English” problem and makes up words that are sometimes quite amusing.

Vihdoin, kun aurinko alkoikin laskea horisontin taakse, saapui saarelle mies nimeltä Jonathan Hawthorne, merimies, joka oli etsinyt maailman piilopaikkoja ja arvoituksia. Hänen sydämessään poltteli kiusallinen halu tietää, mitä maailman pieneimmillä ja unohtuneimmilla paikoilla piiloutuu. Hän oli kuullut murheellisistä anekdooteista ja vanhoista merimiesten tarinoista, jotka viittasivat siihen, että tämä saari olisi ollut kerran jumalanpilkkaan kuolleiden elämämaa.

The text nails the atmosphere of a Lovecraft story, but like many others, lacks any meaningful twists. Once again, a man explores the island and encounters a “Norpa” that drives him insane without any further explanation. Quite a few stories seem to equate mysteriousness with vagueness. There is a difference between maintaining an air of mystery and simply withholding information from the reader, and these LLMs fail to grasp it.
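Before the conclusion, it is worth tallying the scores. The short Python sketch below aggregates the ratings from the tables in this article; the tuple layout and variable names are my own bookkeeping, not any standard format:

```python
# (model, Finnish level, style 1-5, structure 1-5, norppa understood)
# Scores copied from the tables in this article.
results = [
    ("Nemotron-4-340B-Instruct", "B2", 4, 3, True),
    ("Gemma-2-27b-it",           "B1", 4, 3, False),
    ("Llama-3.1-70B",            "B2", 2, 4, True),
    ("Llama-3.1-8B",             "A2", 3, 3, False),
    ("Phi-3.5-mini-Instruct",    "No", 1, 1, False),
    ("Ahma-3B-IQ3S-GGUF",        "B2", 1, 1, True),
    ("DeepSeek-v3",              "B2", 4, 4, False),
    ("Poro-34b-Chat-GGUF-4b",    "C1", 3, 1, True),
    ("Qwen2.5-72B-Instruct",     "B1", 4, 3, False),
]

# How many models understood the norppa?
norppa_count = sum(1 for *_, ok in results if ok)

# Which model scored highest on style and structure combined?
best = max(results, key=lambda r: r[2] + r[3])

print(f"{norppa_count}/{len(results)} models understood the norppa")
print(f"best combined style+structure: {best[0]}")
```

Less than half of the models understood the seal, and no single model leads on every metric, which matches the impression from reading the stories themselves.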

Conclusion

As we can see, good options for Finnish text generation remain scarce, at least on the open-weight front. While the models trained on Finnish datasets perform somewhat better at generating Finnish (although still not as well as I would like), they lag severely behind on all other fronts. Llama-3.1-70B managed to generate a coherent story with mostly passable Finnish, yet it achieved this by sacrificing a lot of tone and colour.

Lovecraft sets the scene in his stories with detailed passages describing the history and mythology of his settings. The LLMs mimic his style but fail to incorporate these sections in a way that would feel satisfying to the reader. The elements established in the beginning rarely play any part in the conclusion. Although they feature some captivating sections, most of the stories are not enjoyable to read from beginning to end, because they play out as a series of very abrupt events. The larger models, such as DeepSeek-v3, do try to build some tension, but their stories are often hampered by pacing issues.

Because current training methods do not really reward creativity, many of the large frontier models tell quite dry stories compared to older or smaller ones. By making the models more predictable, and thus more factually accurate, we might also be diminishing their creative potential. The distinct lack of creativity is evident in how much these stories resemble each other, often beat by beat. They remind me of recent studies suggesting that LLMs have a homogenizing effect on creative work. Creativity is difficult to quantify, and therefore these kinds of subjective evaluations still have their place.



Written by Vertti Luostarinen

A media artist and researcher based in Helsinki.