Ethical Diffusion: Generating Images Without Theft

Vertti Luostarinen
14 min read · Sep 5, 2023

Preface

My approach towards the supposedly very “complex” issues associated with image generation models such as Stable Diffusion is very straightforward: I think they are theft. I define theft as taking something that you don’t own without permission, and I think image generators fall into that category. Let me illustrate this with a simple analogy:

Imagine that you steal 6 billion cars. Then you put them all into a huge furnace and melt them. In your factory, you use that mixture of metals to make new cars. It is no longer possible to tell which cars were originally used to create any of the new ones. But does that mean that you have not stolen six billion cars in the first place? Of course not. Could you have made your new cars without stealing those cars first? No, you could not have, because you needed the raw materials from them.

What if you stole the cars for a research project in a university? Is it still stealing? Yes, it is. What if you created a novel way to steal the cars and stole them before legislation caught up and forbade it? You would not have committed anything illegal, but that does not change the fact that you would still have stolen them. Would the situation change if those cars and the furnace you put them in did not physically exist but were bits instead? Well, money also does not exist in the physical world, and the European Central Bank can make more of it any time it pleases, but we still tend to agree that money can be stolen. Just because something can be copied does not mean that it lacks value, even if infinite copying often diminishes that value.

Note that my argument here is more ontological than ethical. I do not make any claims about whether it is good or not to steal in general, or in this instance. I just wanted to elaborate on why I disagree with the idea that AI is somehow beyond traditional notions of copyright and ownership.

Image generators, such as Stable Diffusion, have been trained on billions of images scraped from the web. Without all the stolen data, they would not have the properties they now have and would not be as popular. But how far exactly could you get without theft? In this project, I wanted to see if I could generate my own images in a way that avoided the ethical issues associated with this technology. This meant that I needed to:

1) Source the data in a way that was legal, ethical, and respectful.

2) Build a diffusion model from scratch.

3) Train the model on my own hardware.

4) Publish the results in a way that is transparent about the process.

The results look nothing like the images from Stable Diffusion, but personally I consider that a positive. By presenting an alternative, I also hope to show how much of the value these systems create is based on scraped data, and how little would remain without it.

How do Diffusion Models Work?

64x64 pixel image of greyscale Gaussian Noise.

Diffusion is the process by which matter moves from areas of higher concentration towards areas of lower concentration. In the natural sciences, diffusion models are needed to calculate, for example, the dispersion of water vapour in the atmosphere. In 2020, scientists at UC Berkeley published a paper called Denoising Diffusion Probabilistic Models (DDPM), in which they applied the principles of thermodynamics to image generation. I have no idea how they came up with the idea that diffusion could also be applied to pixels instead of molecules.

With the right algorithms, the diffusion process can be reversed, at least to a certain extent. By looking at how water vapour molecules have diffused into a kitchen, we can estimate their origin and eventually guess where the tea kettle is. In the DDPM paper, the same process is applied to pictures. Gaussian noise is iteratively added to a dataset of images, and these gradually noisier images are used to train a neural network. This process can then be reversed: we present the neural network with a picture of pure Gaussian noise and ask it to denoise it step by step. And lo and behold, we have a system that can generate unique images just by feeding it images of noise.
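
To make the noising half of that concrete, here is a minimal PyTorch sketch of the forward process; the schedule values and names below are illustrative assumptions on my part rather than an excerpt from my actual code.

```python
# A minimal sketch of the DDPM forward ("noising") process in PyTorch.
import torch

T = 1000                                  # number of noising steps
betas = torch.linspace(1e-4, 0.02, T)     # linear noise schedule, as in the DDPM paper
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)  # cumulative product, one value per step

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Jump straight to noise level t, thanks to the closed-form Gaussian formula."""
    eps = torch.randn_like(x0)            # Gaussian noise with the same shape as the image
    return alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps

# Example: a batch of one grayscale 64x64 "image", noised to step 500.
x0 = torch.rand(1, 1, 64, 64) * 2 - 1     # values scaled to [-1, 1]
x_noisy = q_sample(x0, t=500)
```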

I could not build on top of Stable Diffusion or any other publicly available diffusion model because they are all pretrained. Instead, I needed to train mine from scratch to make sure I was not using anyone else’s data without their permission. Eventually, I ended up using the popular PyTorch library in Python for my own implementation.

Failed Experiments with Noise

The Gaussian curve.

As I delved deep into the math behind diffusion models, I had one question: why Gaussian noise instead of any random noise? Gaussian noise is a type of noise that follows the normal distribution, which is often represented as the Gaussian curve.

My first encounter with the Gaussian curve was in high school matriculation exams. In the Finnish high school system, it is used to determine the grades, so that only the top 5% get the highest grade and the bottom 5% get the lowest, while most fall somewhere in between. This is done so that if an exam is particularly hard or easy, the results are still comparable from year to year. The Gaussian curve salvaged my math exam; with the points I got, I would not have passed any other year, but my grade got bumped up a notch because I was not the only one struggling with that exam.

As I mentioned before, diffusion models add noise in the training phase, then remove it in the inference phase, when you generate images. Because the normal distribution follows a known pattern (values are more likely to land in the middle of the curve than at the edges), the noising process can be reversed with a statistical algorithm. If I had taken math more seriously in high school, I might have understood this basic concept out of the gate. Instead, I started messing with the noising and de-noising algorithms, trying to make them work with different types of noise.
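
One convenient property (my own small illustration, not part of the project code) is that repeatedly adding independent Gaussian noise still gives Gaussian noise with a predictable variance, which is what lets many small noising steps collapse into a single closed-form step and, crucially, be reversed:

```python
# Adding many small Gaussian nudges is equivalent to one larger Gaussian jump,
# because the variances of independent Gaussians simply add up.
import torch

steps = 10
x = torch.zeros(100_000)
for _ in range(steps):
    x = x + 0.1 * torch.randn_like(x)        # ten small Gaussian steps

direct = (steps * 0.1 ** 2) ** 0.5 * torch.randn(100_000)  # one equivalent jump

print(x.std().item(), direct.std().item())   # both close to 0.316
```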

I even went as far as making a quantum computing integration so I could feed my images quantum noise, probably because I thought that saying I had built a “quantum diffusion model” would make me sound smart. For this purpose, I used Qiskit, an open-source Python framework for quantum computing that can also be used to simulate quantum operations on normal hardware.

At some point I realized that instead of simulating a quantum computer, I could just use the torch.poisson function in PyTorch, because quantum noise follows a Poisson distribution. This resulted in a bunch of errors that went beyond my math abilities. The most likely explanation for them was that even though I was applying a different type of noise, I was still denoising the images the Gaussian way.
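
A rough reconstruction of the mismatch (an assumption on my part, not my exact code) looks like this: the Poisson samples have the right shape but the wrong statistics for the Gaussian denoising math.

```python
# Swapping the noise source from Gaussian to Poisson while the rest of the
# pipeline still assumes zero-mean, symmetric Gaussian noise.
import torch

x0 = torch.rand(1, 1, 64, 64)                             # image scaled to [0, 1]

gaussian_noise = torch.randn_like(x0)                     # zero-mean, symmetric
poisson_noise = torch.poisson(torch.full_like(x0, 3.0))   # non-negative counts, mean ~3

print(gaussian_noise.mean().item(), poisson_noise.mean().item())
# The reverse-process formulas assume the first kind of noise; feeding in the
# second breaks the assumptions they are derived from.
```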

This experience at least taught me why diffusion models work with a Gaussian noise model instead of any other one, and in the process I learned a lot more about PyTorch (and some Qiskit). I have since discovered that there in fact are quantum denoising models, but they work very differently from my attempt.

The Data

One example of Oki Räisänen’s works.

To train a neural network to generate images with this system, I needed to obtain at least a couple of thousand similar images. Furthermore, I needed to be sure that each one of them was free to use. Thankfully, the Päivälehti Archive kindly gave me the opportunity to use the body of work of Oki Räisänen, who worked as the cartoonist for the Helsingin Sanomat newspaper from 1932 to 1950. Räisänen passed away over 70 years ago, so his drawings are now free to use for non-commercial purposes.

I think there would still be ethical considerations in copying the style of a deceased artist, even though his images are legally free to use. But as we are about to see, my images were not going to be very representative of Räisänen’s work, because of technical limitations and my own artistic ambitions. As I have stated elsewhere, my interest in AI is not in making things humans can already do but in making things we cannot.

The Implementation

The same image after processing. You see my issue here?

Stable Diffusion and others use OpenAI’s CLIP to connect the image generation system with text. That is why they can be prompted with text to generate different images. I investigated it, but it was not feasible for me to create such a system myself.

My original idea to remedy this problem was to make the model conditional. This meant that I would have different text categories I could choose to generate from the dataset. In my mind, a system where you can type one word out of ten options is still technically a text-to-image generator, just not a very versatile one. But when I got hold of the data, I realized that the metadata was not stored along with the images in the archive, and what existed was incomplete. I would have had to annotate the images myself, which would have taken several days. I had to give up on this feature, so I lumped all the images together and made an unconditional model instead.
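
For the record, the conditioning I had in mind would have looked roughly like this hypothetical sketch (it never made it into the final model): each of the ten categories gets a learned embedding that is handed to the network alongside the noisy image.

```python
# A hypothetical class-conditioning sketch: one learned vector per category,
# which the denoising network would receive in addition to the noisy image.
import torch
import torch.nn as nn

num_classes = 10                               # e.g. ten hand-picked subject categories
embed_dim = 128
class_embedding = nn.Embedding(num_classes, embed_dim)

label = torch.tensor([3])                      # the user "types" category number 3
cond = class_embedding(label)                  # shape (1, 128), to be fed to the network
print(cond.shape)
```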

The neural network implementation I ended up using is a very simple U-Net architecture. U-Nets are named after their shape and are often used in image processing tasks. Inside a U-Net, the feature maps are first downsampled and then upsampled back again, so their resolution follows a U shape. The downsampling is done to keep the amount of data needed for the calculations in check.
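
A stripped-down sketch of the kind of network I mean is below; the layer sizes and names are illustrative rather than taken from my implementation, and the timestep embedding a real diffusion U-Net also needs is left out for brevity.

```python
# A toy U-Net: downsample, pass through a bottleneck, upsample, with skip
# connections bridging matching resolutions. Timestep conditioning is omitted.
import torch
import torch.nn as nn

def block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.GroupNorm(8, out_ch), nn.SiLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.GroupNorm(8, out_ch), nn.SiLU(),
    )

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.down1 = block(1, 64)            # 64x64 features
        self.down2 = block(64, 128)          # 32x32 after pooling
        self.bottleneck = block(128, 256)    # 16x16 after pooling
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = block(256, 128)          # 128 from upsample + 128 from skip
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = block(128, 64)
        self.out = nn.Conv2d(64, 1, 1)       # predict per-pixel noise, one channel

    def forward(self, x):
        d1 = self.down1(x)                                     # 64 ch, 64x64
        d2 = self.down2(self.pool(d1))                         # 128 ch, 32x32
        b = self.bottleneck(self.pool(d2))                     # 256 ch, 16x16
        u2 = self.dec2(torch.cat([self.up2(b), d2], dim=1))    # back to 32x32
        u1 = self.dec1(torch.cat([self.up1(u2), d1], dim=1))   # back to 64x64
        return self.out(u1)

model = TinyUNet()
print(model(torch.randn(1, 1, 64, 64)).shape)  # torch.Size([1, 1, 64, 64])
```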

Even with this downsampling, the highest resolution I could generate was small: 64x64 pixels. When I began to downscale the drawings, I realized that my choice of data was not ideal. Räisänen used very thin lines and a lot of white space, so when detail was lost, the images turned into indistinguishable grey mush. At this stage, my plan was to upscale the images after the diffusion model had generated them, so I was not that concerned.

What really helped me, though, was that my data were grayscale rather than RGB. Instead of calculating red, green, and blue channels for each pixel, I needed to calculate only one value, which really sped up the training process. For a 64x64 image, this meant that instead of calculating 12,288 dimensions, the system only needed to calculate 4,096.
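
The same arithmetic, spelled out with tensor shapes (a toy check rather than project code):

```python
# One grayscale channel versus three RGB channels for a 64x64 image.
import torch

rgb = torch.zeros(3, 64, 64)       # three values per pixel
gray = torch.zeros(1, 64, 64)      # one value per pixel

print(rgb.numel())                 # 12288
print(gray.numel())                # 4096
```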

The training took only around nine hours on my RTX 4080 GPU with a batch size of five. This meant that at each of the hundred iterations, the neural network processed five images at a time before moving on to the next batch. The code used 13 GB of the 16 GB of VRAM available on my GPU. I probably could have pushed it further by increasing the batch size, but I did not want to risk crashing the system during the night while the training ran.
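
Put together with the earlier snippets, a bare-bones version of such a training loop might look like this; the dataset here is a random stand-in, and the optimizer settings are illustrative guesses rather than my actual configuration.

```python
# A minimal training loop: noise a batch, ask the network to predict the noise,
# and minimize the squared error. TinyUNet is the toy network sketched above.
import torch
from torch.utils.data import DataLoader, TensorDataset

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

images = torch.rand(1000, 1, 64, 64) * 2 - 1           # stand-in for the scanned drawings
loader = DataLoader(TensorDataset(images), batch_size=5, shuffle=True)

model = TinyUNet()                                     # from the U-Net sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

for epoch in range(100):
    for (x0,) in loader:
        t = torch.randint(0, T, (1,)).item()           # random noise level for this batch
        eps = torch.randn_like(x0)
        x_t = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps
        loss = ((model(x_t) - eps) ** 2).mean()        # predict the added noise
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```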

I could have used a service that lets you run your code in the cloud, which would have made it possible to increase the resolution to at least 512x512 pixels. But running machine learning algorithms takes a lot of computing power, which in turn consumes a lot of electricity. If my code had been running in a data center somewhere, there would have been no way for me to monitor its energy consumption. By running the code locally, I could rest assured that my images were generated with 100% wind power.

Upscaling the Images

Stable Diffusion and others struggle with the same problem I had: diffusion models take a lot of computing power even for small images. That is why they run the diffusion model in conjunction with a VAE, a variational autoencoder. In the encoding phase, the VAE radically reduces the dimensionality of the data so that it can be fed to the diffusion model. Then, in the decoding phase, the VAE reintroduces these dimensions. The diffusion model operating in this compressed space is also conditioned with text information, so it can direct the generation process towards the text prompt.
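
To give a feel for the encode/decode idea, here is a deliberately tiny VAE sketch of my own; the real autoencoder in Stable Diffusion is a much larger convolutional network, so treat this only as a schematic.

```python
# A toy VAE: compress a flattened 64x64 image into a 32-dimensional latent,
# sample from that latent, and decode back to pixel space.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, image_dim=64 * 64, latent_dim=32):
        super().__init__()
        self.to_mu = nn.Linear(image_dim, latent_dim)      # encoder: latent mean
        self.to_logvar = nn.Linear(image_dim, latent_dim)  # encoder: latent log-variance
        self.decode = nn.Linear(latent_dim, image_dim)     # decoder: back to pixels

    def forward(self, x):
        mu, logvar = self.to_mu(x), self.to_logvar(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.decode(z), mu, logvar

vae = TinyVAE()
flat_image = torch.rand(1, 64 * 64)
reconstruction, mu, logvar = vae(flat_image)
print(mu.shape)   # torch.Size([1, 32]); the diffusion model would operate in this space
```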

The problem was that VAEs are trained on images as well. For them to be able to reduce the dimensions of an image, they must have information about which features to prioritize. That is why I would have had to train my own VAE too. Training one was feasible in principle, but I would have needed to feed it 512x512 images to produce outputs at that resolution, and this was not possible with my hardware for the same reasons it was not for the diffusion model. The other issue with this approach was that, as I mentioned before, my idea for the final works was not to mimic Räisänen’s style. I needed something more creative.

I also explored the option of upscaling the images with ESRGAN. I am familiar with it because I used it for the animation frames of my short film, The Crowing. However, the publicly available pretrained ESRGAN models clearly state that they are only meant for research purposes, so I would have needed to train my own once again, and that seemed like a big undertaking.

Circling Back to Diffusion

At one point I did some point cloud experiments with the images in openFrameworks. I ultimately didn’t pursue this direction, because it felt like I was just trying to come up with something that looks cool without putting any artistic intent behind it.

I felt a bit stuck trying to figure out what to do with these small, grayscale, smudgy images. The pictures were not recognizably Oki’s anymore; all they had left from the originals was the rough ratio of lines to white space and a similar style of composition. A common critique of data art is that the artwork often has no visible connection to the source of the data, and I feel that this was true for my project as well. However, my starting point was never to explore Oki’s art, but rather the concept of diffusion.

Entropy is the lack of order or predictability in a system. The second law of thermodynamics dictates that as time goes on, entropy increases. If you put a tea kettle on a hot stove, the water molecules inside it begin to move more and more erratically: they now have more entropy. This increased motion, in turn, is what we measure as a rise in the water’s temperature. (As a side note, this is why in machine learning we control the variation in the generation process with the “temperature” parameter.) As the water evaporates and the steam dissipates from the tea kettle, it distributes evenly across the space. What was once neatly inside the kettle is now all over your kitchen. The water has adhered to the laws of physics and gone from a state of lower entropy to a state of higher entropy.

The inventor of the statistical formulation of the second law of thermodynamics, Ludwig Boltzmann, was deeply troubled by this particular law of physics, because unlike the others, it is time-asymmetrical. As time goes on, entropy only increases. In the real world, there is no putting the molecules back into the tea kettle. And even if there were, there would be no way to transfer the heat back from the molecules to the stove, which in turn is incapable of sending its energy back to the electricity company. Entropy only goes one way, and the only way to achieve order in one place is by taking it from somewhere else.

Boltzmann eventually arrived at a statistical explanation for this unsettling asymmetry: increasing entropy was merely a matter of probability, as it is overwhelmingly more likely for randomness to occur than the opposite. To put it more plainly, there is usually only one way for things to go right, and a million ways for them to go wrong. His answer, however, has not satisfied everyone, and the debate about “the arrow of time”, and whether it even exists, continues to this day. The philosopher Reza Negarestani has claimed that time-asymmetry is a cognitive bias that humans possess but that artificial intelligence can get rid of. Indeed, one could argue that diffusion models are time-symmetrical.

Diffusion models are an imagined, statistical attempt at reversing entropy. But as they are statistical models of thermodynamic phenomena, they can never reproduce the exact starting point, only a statistical interpretation of one possible past that could have led to the current situation. Viewed from this perspective, the images created by diffusion models have traveled to us from another past: my pictures are from a non-existent history, from an alternative Oki Räisänen.

At the time I was reading about these phenomena, I was attending the Generative Media Coding Workshop as part of my MA studies at Aalto University, where I was learning about the openFrameworks toolkit for C++. I assembled the images in openFrameworks into larger grids to see what kind of patterns they would form. Then I started treating these individual images like water molecules, animating them so that their gradual movement would follow the Gaussian curve. I added the element of time by making them change iterations: as time goes on, the images move backwards in the generative process and become less like the ones Oki made. (It is this way round because the diffusion model starts from pure noise and reverses the diffusion process; now we are once again reversing it. It is confusing, I know.)
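
The finished piece is written in C++ with openFrameworks, but the core idea can be paraphrased in a few lines of Python; the grid size and step size below are illustrative, not the values from the actual work.

```python
# Tile the generated 64x64 images into a grid and let each tile drift with
# small Gaussian steps, like molecules diffusing across the screen.
import numpy as np

tiles = np.random.rand(16, 64, 64)                         # stand-ins for 16 generated images
positions = np.stack(np.meshgrid(np.arange(4), np.arange(4)), -1).reshape(-1, 2) * 70.0

def step(positions, sigma=1.5):
    """One animation frame: every tile takes a small Gaussian step."""
    return positions + np.random.normal(0.0, sigma, size=positions.shape)

for frame in range(3):
    positions = step(positions)

print(positions[:2])                                       # drifting tile coordinates
```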

A frame from the beginning of the finished work.

In the film Blade Runner, artificial humans, called replicants, can only live for a few years but possess fictitious memories of an entire lifetime. The philosopher Bertrand Russell argued that this is not unlike how we perceive the past, as we never enter the past itself, merely access the memories we have of it in the present moment. The troubling conclusion Russell arrived at is that the existence of the past cannot be proven. While viewing my images from an alternative Oki, I experienced something similar. To quote Russell:

Hence the occurrences which are CALLED knowledge of the past are logically independent of the past; they are wholly analysable into present contents, which might, theoretically, be just what they are even if no past had existed.

Travelling to non-existent pasts might sound cool, but the fact of the matter is that those pasts mean very little to us, because we have not lived through them. Even if my diffusion model were ethical, in the end it could only demonstrate how the process creates results that have a severed relationship with the data they are based on and lack any meaningful connection to the real world. This is why I decided to title my work “Rikotaan menneisyys”, “Let’s break the past”.

Conclusion

Another frame from the work, this time from the end, where we approach the initial state of almost complete noise.

All that being said, I do not want to end on such a grim note about diffusion models. The technology itself holds much promise. The issue is that one of the reasons it is such a good value proposition for users is that the true creators of that value, the image makers, have not been compensated. The same would be true of any industry: if we stopped paying, for example, cab drivers or doctors for their services, consumers would get a better deal than before. But for how long would we then have cab drivers or doctors?

If Microsoft or any other big company had wanted to, they could simply have asked for permission before using the images their models were trained on. This would have meant being late to the market, as Adobe arguably was with its image generator, which was trained only on licensed and public-domain data. But it would also have been in their own interest to consider what will happen if no one has an incentive to create new images. In the future, visual records of history might all be from pasts we have personally not lived through.

Special thanks to Janne Ridanpää and the Päivälehti Archive for their cooperation with the data. I give my thanks also to Matti Niinimäki from Aalto University, who supervised this study project. The diffusion model implementation was done following tutorials by Dominic Rampas and Thomas Capelle, which you can find here and here.
