Transcript
Stefan Dirnstorfer: My name is Stefan Dirnstorfer. I'll be talking about image processing in the context of testing and automation. My talk is mainly based on these articles, in case you want to read more. Professionally, I serve as CTO and cofounder of the company behind testup.io, an image-based test automation service.
Application Testing
Imagine you want to test an application on a phone. You want to initiate some actions automatically and check whether the resulting states are correct. Image-based algorithms haven't been powerful for all that long, so traditionally, you would rely on some internal state, like a component tree or a DOM. Developers like this kind of representation because it matches their internal model. That's how they organize the modules, how they place folders, and how they design object hierarchies. However, there might be multiple reasons why you don't have access to these internals. The development could be outsourced, you could be integrating heavily with third-party components, or, to say it more bluntly, you might be vibe coding.
If you're vibe coding, then you're not testing, and you could say, if I'm vibe coding, then I might as well do vibe testing. When your vibe test fails, you will see some gibberish like this, which you cannot interpret. Of course, you can then do vibe fixing, but you neither understand the test nor the code, and it will not be very good. Wouldn't it be better if you got a graphical representation of the intended, accepted state and the state of the current version? The problem is that even for humans, it takes a while to spot the difference. Once you've discovered it, it pops out at you as a major flaw. The ideal world would be an AI that helps you flag that change and attribute it to an intended, an acceptable, or a faulty state. That is the content of my talk. We will see where AI stands today, how fast it is moving, where it will be tomorrow, and what it can and cannot reach.
After all, we need to consider that with regard to vision tasks, even we humans cannot solve every problem. When we look into the world, we are permanently susceptible to various optical illusions. We have these illusions not because we are on LSD somehow; we have them because evolution had to make compromises in order to give us some visual capabilities at the expense of others. We are not here to solve all types of visual tasks. We are here to solve the problem of some quirky UIs on frontends that either run on devices we do not fully understand, or belong to systems that are intentionally obfuscated, or are just by nature so obscure that you don't want to know anything about their internals.
Visual UI Agents (Test Script)
Let's get more practical with this example. A test script, here, is three steps. I want to open an app identified by a picture, then I want to find an item identified by the name Munich, and then I want to check whether the rendered graphics match my expectation. To run this test, I chose Claude Sonnet 4.5 with the computer use tool. That's not what you get when you open the desktop app or the chatbot; it's a specialized API. The latest version just came out, so I updated my presentation to reflect the latest behavior. The first beta came out with Sonnet 3.5 well over a year ago, so this is very new. I will point out when changes happened in the past year. As always, we start with our initial prompt. We send over a prompt together with the picture to the server. Sure enough, it waits for some time and comes back with a response.
As usual, it has an explanation of what it is thinking, but it additionally has a technical component where it says how it wants the computer to be used. In this case, it wants to get a screenshot, a request that is easy to fulfill. I send it over, and again, the AI takes some time thinking and comes up with a new response. This time, it confirms that it found the requested icon, identified by the 280 road sign logo. In addition, it has the computer use instruction showing us where to click. In case you are wondering why this is a click and not a tap: the AI is not using an iPhone. It is using a simulated phone, which is accessed through a web interface that I manipulated to create some interesting scenarios.
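To make this loop concrete, here is a minimal Python sketch of the exchange, assuming the Anthropic SDK. The model ID, tool version string, and beta flag are placeholders to be checked against the current documentation, and execute_on_device stands in for our own bridge to the simulated phone.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Tool version and beta flag are assumptions; check the current docs.
tools = [{"type": "computer_20250124", "name": "computer",
          "display_width_px": 390, "display_height_px": 844}]

# screenshot_b64: base64-encoded PNG of the current screen (prepared earlier).
messages = [{"role": "user", "content": [
    {"type": "text", "text": "Open the app identified by this icon."},
    {"type": "image", "source": {"type": "base64", "media_type": "image/png",
                                 "data": screenshot_b64}},
]}]

while True:
    response = client.beta.messages.create(
        model="claude-sonnet-4-5",          # placeholder model ID
        max_tokens=1024,
        tools=tools,
        messages=messages,
        betas=["computer-use-2025-01-24"],
    )
    tool_uses = [b for b in response.content if b.type == "tool_use"]
    if not tool_uses:
        break                               # only an explanation came back
    messages.append({"role": "assistant", "content": response.content})
    # execute_on_device is our hypothetical bridge: it performs the click,
    # wait, or screenshot on the simulated phone and returns a tool_result.
    results = [execute_on_device(t) for t in tool_uses]
    messages.append({"role": "user", "content": results})
```

The loop alternates between the model's requested actions and our fulfillment of them, exactly as described above: screenshot in, explanation plus tool call out.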
Now that we have understood how the interaction between our software and the AI works, we can have a look at how this behavior works out over time. We click as instructed and wait for the screen. I send the same screen over again to the AI, and the AI says, "Nothing has changed. Let's wait". In the technical body, it even says it wants to wait for two seconds. I find it quite remarkable that the AI understands that nothing happened and for how long it makes sense to wait. This has surely happened to you: very often you tap on the phone and nothing happens. Humans solve this problem quite naturally, sometimes without even noticing. For traditional automation services, this is a major hurdle and often breaks down the entire process. After some time, I send over the same screen again, and the AI realizes that nothing happened even after the wait time.
The old version would repeat in a loop, but that would be boring for you, and the new version doesn't do this anymore. The new version finds a way to modify its approach, maybe clicking a slightly different place on the screen. It makes the modified click, the Maps application opens, Claude Sonnet confirms this, and we have solved our first step. Well done. Let's move on to the next item. We want to find Munich. Here we have to solve a number of intermediate screen problems that we need to click through. Traditional automation would be completely confused if something like this happens, but the AI sees it and says, ok, just continue. The AI figures out that we don't need to share our location. Very clever. Very privacy-oriented.
Finally, the AI has to disable notifications. Even AI knows that they are annoying. After that, a search interaction item appears; the AI clicks on it, types the search term, and then correctly identifies the search result, without being confused by all the advertising items displayed next to it. I find this quite clever, and certainly the second task has also been done by the AI.
Now the third step. We need to find out whether the correct map is shown on the screen. I ask Claude Sonnet, is this the screen that we want? Yes, it is. Perfect, isn't it? This is what we wanted. We wanted a simple instruction, run the test, and do the checks. That's fine. That's what we wanted to do. Or, let's have a second, closer look. Let's first look at the icon. The icon that we requested had a 280 road symbol in it, and the icon that it found didn't. You might say, that's pretty clever; it realized that this is essentially the same app. But if you remember, in its thinking, it confirmed to us explicitly that it had found the 280 road sign. Was it hallucinating, or what's going on here? Maybe let's give it the benefit of the doubt. Let's review a second thing. Let's review the map.
The map that I expected to be shown had all the roads connected. The map that the manipulated phone did show had all the roads cut. This is a major flaw. If you rely on that map for navigation, you might conclude that Extinction Rebellion has taken over the city center. How can this super-intelligent Clanker miss this apocalyptic scenario? There is something wrong. For testing purposes, we have a problem, because it's not very reliable. Let's first review what we have done so far. We have looked at visual UI agents, or visual computer use agents. They are very strong, because they can operate all types of devices. They are very resilient against modified behaviors. They are easy to instruct in human language. The problem is sensitivity. That's what we need to explore now. Of course, as always, the AI models are a bit more resource-hungry than traditional methods.
Computer Use Models (DetACT Model)
This is a slide that I took from a computer use model survey, version 8, released in May this year. It doesn't have the latest Sonnet model yet, but we can see that there's an abundance of models lining up, starting at around 2023. That's when they became usable. On the top right, you see the operator models from OpenAI and Gemini. They are similar, but more focused on web use, which has the advantage that you additionally have access to the web source code and the DOM. There is one model I want to discuss with you in more detail, because it's an open-source model, and we can learn from it how these things work internally. That is the DetACT model, and the group behind DetACT also released a training set, the OmniACT set. This will give you some insight into how these systems are trained. Look at this example: here, the AI needs to understand what a stock price change is. It needs to figure out where it has to swipe in order to get the result.
If you read about success rates of 40%, we are really dealing with requests on this level, which take very deep insight, or, as here, where you need to hover and scroll until you get to the right value. In terms of interaction planning, they are already very good, and since we testers can describe the steps in more detail, they will do just fine. How do they work? The DetACT team shared these slides, and it works like this.
The screen comes in and is then analyzed by different specialized models: one, for example, doing optical character recognition, extracting all the texts; one trained to extract all kinds of known logos; and another model extracting certain unnameable things identified by colors or basic shapes. All these findings, together with their coordinates, are fed into a traditional language model, which could be ChatGPT. Here we can already see that as long as the visual display can be described easily in human language, the model will be able to infer the right results. If something gets a bit more confusing and murky, the textual representation is not enough to solve the task.
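As a rough sketch of that architecture, with all detector functions and the LLM call as hypothetical stand-ins rather than the actual DetACT code:

```python
# Sketch of a DetACT-style pipeline: specialized detectors turn the screen
# into text, and a plain language model reasons over that text. The helpers
# run_ocr, detect_icons, detect_shapes, and call_llm are hypothetical.

def describe_screen(screenshot):
    findings = []
    for text, (x, y) in run_ocr(screenshot):             # e.g. an OCR engine
        findings.append(f'text "{text}" at ({x}, {y})')
    for logo, (x, y) in detect_icons(screenshot):        # trained logo detector
        findings.append(f'icon "{logo}" at ({x}, {y})')
    for shape, color, (x, y) in detect_shapes(screenshot):  # colors and shapes
        findings.append(f"{color} {shape} at ({x}, {y})")
    return "\n".join(findings)

def next_action(screenshot, task):
    scene = describe_screen(screenshot)
    prompt = (f"Task: {task}\nScreen contents:\n{scene}\n"
              "Reply with one action, e.g. CLICK x y or TYPE <text>.")
    return call_llm(prompt)   # any chat model, e.g. the ChatGPT API
```

The key point is visible in the code: everything the language model can reason about has to survive the trip through that textual scene description.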
Let's look at a few examples of where this UI model breaks. This is back to Claude Sonnet. I gave it two screens and asked it, where is the difference? In this case, it correctly identifies these as two different versions of the same Maps application. If you ask it further, it even knows in what year the style change happened. Apple did a good job infiltrating the publicly available training data with its marketing speak. Now look at this example. Here, the AI does not find a difference. That is because this movement wasn't described in text. Of course, the numbers must have been different, otherwise it wouldn't have been able to click on it. Since bounding boxes are not really unique, it must have some tolerance before it rejects this as different. Here, at least, it did not verbalize it. It did not explain that this is a different screen. What about this one? As soon as two icons form a new relationship, here an overlap, this overlap is verbalized again and can be found in the text.
Going back to the map, you can now easily understand why it's not feasible to describe in human language what the road connections are. We would need a huge amount of context and a huge amount of text, spoiling all the reasonable and realistic scenarios that you want to run. Obviously, the generative AI does not do it here, even though this doesn't look like a very complicated task. It's now time to do some real image processing coding. Now is the time where we can toss aside the generative AI and prove that all the years we have spent honing our coding skills were not in vain. Or we use GPT-5. GPT-5, almighty intelligence overlord, please compare these two images. Let's see how it does.
For the first half minute, it creates all kinds of computer code analyzing these images. It implemented three different algorithms and flagged all the differences. There's one thing I didn't tell you yet. This map, apart from having the road cut, was also moved by one pixel. If you are in frontend development, you will know that layout changes at that level happen all the time: a new browser version comes out, a new operating system, new screen resolutions. These kinds of changes happen inadvertently from version to version. GPT-4 would have stopped at this point and said, these pictures are equal. Why would it conclude that the pictures are equal? Obviously, its training data contained many similarly confusing situations where the equality label was set to true. Of course, you could prompt it to see the cut.
Then the problem is, it's just not sure. Whatever you want it to say, it says. It doesn't know. But we are now at GPT-5, and GPT-5 continues. How lovely. It does a Fast Fourier Transform and computes a phase correlation. It figures out that the images were moved by one pixel. It only needs to undo this movement and can then compare the pixels. See how it does? Sure enough, the algorithm comes out with an exact map of where pixels are missing. Now it only needs to derive what that is, and it concludes: no street or routing changes. It was not able to name and correctly identify the change. Still, progress is remarkable, and I find it likely that GPT-6 will be able to name it. Keep in mind that we just had a uniform translation of one pixel. With a phase correlation, you wouldn't even be able to detect the slightest amount of scaling, or a non-uniform transformation where roads and text move differently. Now we can really toss aside the generative AI and do the work ourselves. We are halfway in. Don't worry: if you have this problem in practice, you will know by the end how to solve it.
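What GPT-5 improvised can be written down directly. Here is a minimal sketch of the same idea with OpenCV: estimate the translation by phase correlation, undo it, and only then diff the pixels. File names and the threshold are placeholders, not the code GPT-5 actually produced.

```python
import cv2
import numpy as np

# Phase correlation needs single-channel float images.
base = cv2.imread("map_expected.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
test = cv2.imread("map_actual.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# FFT-based estimate of the global translation between the two images.
(dx, dy), confidence = cv2.phaseCorrelate(base, test)
print(f"shift ({dx:.2f}, {dy:.2f}) px, confidence {confidence:.2f}")

# Undo the detected shift with an affine warp (handles sub-pixel shifts too;
# the sign convention is worth double-checking against the OpenCV docs).
m = np.float32([[1, 0, -dx], [0, 1, -dy]])
aligned = cv2.warpAffine(test, m, (test.shape[1], test.shape[0]))

# With the images aligned, a plain pixel diff exposes only the real change.
changed = cv2.absdiff(base, aligned) > 25
print(f"{int(changed.sum())} pixels differ after registration")
```

Note that this sketch inherits exactly the limitation just mentioned: phase correlation only models a uniform translation, nothing more.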
Image Comparison - Spot the Difference
Let's first do some research: what is out there? If you are in frontend testing, you're probably using Playwright or Cypress. These two platforms are the market leaders, and they have the capability to compare images. Internally, they both rely on the very same library. It's called Pixelmatch. Pixelmatch relies on an algorithm published in 2010, which might not seem that old. No, it is old. It is very old. It does what it promises: it compares pixels. Let's review the problem with the pixel movement. As long as the images are perfectly aligned, you can compare them pixel by pixel and see where they differ. As soon as the images are not perfectly aligned, differences pop up everywhere. These alignment changes, the position changes, happen all the time. You hardly have control over them, because they happen from version to version.
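The principle behind such a pixel comparison fits in a few lines. This is not Pixelmatch's actual code (the real library additionally models anti-aliasing), just a sketch of the principle that shows why a one-pixel shift lights up everywhere:

```python
import numpy as np
from PIL import Image

# Compare two same-sized images pixel by pixel against a color tolerance.
def pixel_diff(path_a, path_b, tolerance=10):
    a = np.asarray(Image.open(path_a).convert("RGB"), dtype=np.int16)
    b = np.asarray(Image.open(path_b).convert("RGB"), dtype=np.int16)
    assert a.shape == b.shape, "pixel comparison needs identical dimensions"
    # A pixel counts as changed if any channel differs more than the tolerance.
    changed = (np.abs(a - b) > tolerance).any(axis=2)
    return changed.sum(), changed        # count plus a boolean change mask

count, mask = pixel_diff("baseline.png", "current.png")  # placeholder names
print(f"{count} pixels differ")          # a 1-px shift flags almost everything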
Let's look at this more extreme example. Here I have two versions of a map, and it changed in two ways. First, it was rotated. Second, roads were cut. If you're not able to separate those two effects, say one of them is intentional and the other one is merely acceptable, then you will be blinded. You will not see the one effect, because the other destroys your entire view. The challenge that we have for test automation is to separate these two effects from each other. I tested a number of libraries, and of course, this was not solved. There is one library that stood out as a bit more capable than the others, and that is Applitools. Applitools was able to detect displacements of roughly two pixels; say, one to three pixels, depending on context. The first question that I have is, how does Applitools do it? How can we scale that method to more complicated scenarios? Here you can see how it flags the changes. I was able to replicate this performance with a simple convolutional neural network. This is the setup.
Basically, I use a small sliding window of size 9 by 9. In the training data, you see four examples here, some patches are labeled as equal and some as different. This window slides over the image and picks out changes, with a tolerance of a few pixels of displacement. I was using about 30,000 weights and 8 layers. This is a very simple model. By the standards of 2015, when Applitools started, this was intelligence. This was artificial intelligence back then. Now we would call it just machine learning, or statistics. This one works. That's what they are doing. This is also what I described in the InfoQ article, in case you want to know the details of this procedure.
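As an illustrative sketch of such a patch comparator in PyTorch: the two images are stacked into six channels, and a small fully convolutional net scores each 9-by-9 window as equal or changed. The layer sizes here are made up for illustration and don't reproduce the exact model from the article.

```python
import torch
import torch.nn as nn

class PatchComparator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 16, kernel_size=3), nn.ReLU(),   # 9x9 -> 7x7
            nn.Conv2d(16, 24, kernel_size=3), nn.ReLU(),  # 7x7 -> 5x5
            nn.Conv2d(24, 32, kernel_size=3), nn.ReLU(),  # 5x5 -> 3x3
            nn.Conv2d(32, 1, kernel_size=3),              # 3x3 -> 1x1 score
            nn.Sigmoid(),                                 # 1 = real change
        )

    def forward(self, pair):     # pair: (N, 6, H, W), two RGB images stacked
        return self.net(pair)    # fully convolutional: run on a whole screen
                                 # and it scores every 9x9 window at once

model = PatchComparator()
patch = torch.rand(1, 6, 9, 9)   # one training patch, labeled equal/changed
print(model(patch).shape)        # torch.Size([1, 1, 1, 1])
```

Training it on patch pairs labeled equal or changed, including pairs that differ only by a small shift labeled as equal, is what buys the few pixels of tolerance.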
Let's focus on the actual challenge at hand. We have one image that is the correct version, and one that might be an incorrect version. We want to spot the differences. This is actually a game that we played as children all the time. We were quite efficient at it, even if one of the pictures was somehow distorted. That's because we are used to seeing things from an angle anyway. We are not disturbed by a slight distortion like this. It doesn't hamper our performance.
For AI, this task, which is exactly the testing task, is very hard. It's not only hard because there are so many details. It's also hard because some of those things are really hard to describe. Look at these clocks, for example. The clock AI benchmark measures human performance at reading analog clocks at 90%, whereas the best AI models haven't even reached 40% yet. Reading a clock is already very difficult, and that's not to mention all the other details that you find here. Let's give it to ChatGPT anyway. Dear ChatGPT, please compare these two images. They are from Wikipedia's Spot the Difference page, so GPT definitely had this picture in its training set. Sure enough, it found the clocks, but incorrectly labeled their times as 3:00 and 4:00. It found other things, like strawberries: it found five versus four strawberries. Maybe I missed them. Let's ask GPT, please show me the strawberries. Here we go. It made a nice picture. Let's be fair, this is better than what I would have drawn off the top of my head. But look at the clocks. They are not equal, they're not circled, and they are definitely not showing 3:00 and 4:00. And why does the kid suddenly have two heads? Or is it two kids now? This is weird. What do we do? Let's first review the difference between humans and computers to understand what's going on.
For this, let's have a look at the vision pipeline. First, we start with the raw illumination signal. Then we extract some features. Finally, we have some idea, some conceptualization, of what we are seeing. For computers, these layers are: at the starting level, the bitmap, the pixel data; on the next layer, the convolutional neural networks, with more and more abstract features as the layers get higher; and finally the embedding, which we get with the latest attention models. That's how the computer processes our images.
How does a human do it? At first sight, the human works very similarly. There is raw data captured by the retina. Then there are feature detection mechanisms in the thalamus, or to be more precise, the lateral geniculate nucleus. That's where features are detected before being forwarded to the visual cortex and our frontal cortex. So where is the difference between the computer and the human? The interesting, or surprising, finding is that there are 10 times more neurons leading from the frontal cortex into the visual system than the other way around. That's as if you connected a camera to a computer and there were 10 times more data flowing from your computer into the camera than coming out of it. That is because the vision system does not inform us of what it sees. It's rather the other way around: our visual cortex makes hypotheses about what could be there, and the vision system's task is more to confirm whether that is there or not.
One interesting study that I found very telling of this effect is one where they gave hallucinogenic drugs to mice. You would expect them to have hallucinations, with the visual system going all wild and making things up. The finding was that the neurons actually turned dark. That is because the hypotheses were just randomly confirmed, yes, that monster is there, without even checking. The visual system checks hypotheses; it does not inform us of what is there. You can observe this yourself if you play such a spot-the-difference game. You can see that your eyes move left and right. You consciously find corresponding spots. You formulate hypotheses, even with your verbal system. You think, is that mountain really shifted from left to right? Then you confirm this hypothesis. This is a multi-step, Chain-of-Thought reasoning that we humans naturally do. That's not what AI is doing with its simple forward-oriented pass.
How do we go on and find the differences in these images with the computer? First, we need to squeeze the images such that they match up, that they line up with each other. This process is called registration. Registration is a term first coined by cartographers: when they had maps from different regions, they had to line up the landmarks and find how the maps fit together. Nowadays, image registration is used in multiple disciplines. For example, self-driving cars need to know how things flow from one image to the next, which is also known as an optical flow task. Stereo vision has this problem, and medicine has this problem. If you, for example, want to observe the development of a potential tumor, you have a sequence of images in which the organ moves over time in the body.
At the same time, the metastasis develops. You need to undo the larger transformation to find the changes in the embedded content. How can we solve image registration? Let's go back in time, to 2003. That's the algorithm that you find in OpenCV, a widespread computer vision library. It uses the Farneback algorithm, and you see that it's doing something, but it's not there yet. Fast forward to 2008: now the flow looks a bit cleaner, but still has a lot of mistakes. Then come back to now, 2020 more or less. Now this problem is essentially solved. The question is no longer whether we can solve registration, but rather, where exactly do we define the boundary between a registration and an actual change.
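That 2003-era baseline is easy to try yourself, since it ships with OpenCV. A sketch with placeholder file names:

```python
import cv2
import numpy as np

# Dense registration with OpenCV's classic Farneback algorithm: it estimates,
# per pixel, where that pixel moved between the two images.
prev = cv2.imread("version_a.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("version_b.png", cv2.IMREAD_GRAYSCALE)

flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    0.5,   # pyr_scale: image pyramid downscaling per level
    3,     # levels
    15,    # winsize
    3,     # iterations
    5,     # poly_n: neighborhood size for the polynomial expansion
    1.2,   # poly_sigma
    0)     # flags

# flow[y, x] = (dx, dy): warp the current image back onto the baseline grid.
h, w = prev.shape
grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
map_x = (grid_x + flow[..., 0]).astype(np.float32)
map_y = (grid_y + flow[..., 1]).astype(np.float32)
aligned = cv2.remap(curr, map_x, map_y, cv2.INTER_LINEAR)
```

The modern learned optical flow models follow the same contract, a per-pixel displacement field, just computed far more accurately.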
On Hugging Face, for example, you find a huge number of models specialized in astronomy, in navigation, in medicine, so you have to choose your algorithm. Once you have chosen one, you can align the two images and then use the algorithm that we discussed before to flag all the changes. What is missing, of course, is to name them. We can form clusters, give them to the AI one by one, and ask it what is there. That's the part where we still have a few years to go before we get reasonable textual extraction of what the change is.
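Continuing the registration sketch from before (it reuses the prev, curr, and aligned arrays): once the current image is warped onto the baseline, the remaining differences can be flagged and clustered into regions, and each region handed to a generative model for naming. The ask_llm call is a hypothetical placeholder for that last, still-open step.

```python
import cv2

# Flag the residual differences after registration, then cluster them into
# connected regions so each one can be cropped out and described.
diff = (cv2.absdiff(prev, aligned) > 25).astype("uint8")
n, labels, stats, centroids = cv2.connectedComponentsWithStats(diff)
for x, y, w_box, h_box, area in stats[1:]:   # row 0 is the background
    if area > 20:                            # skip single-pixel noise
        region = curr[y:y + h_box, x:x + w_box]
        # ask_llm("Describe what changed in this region", region)
```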
Summary
Let me summarize what we have seen so far. We have discussed that for testing applications on a graphical device, there is a traditional way of doing it: explaining the application in terms of its internal data structures, component trees and DOMs, which probably most of you who are testing frontends are aware of. We have seen that in recent years, more and more powerful algorithmic methods have become available to solve this task graphically. Especially with the UI agents becoming usable since last year, more or less, we can now operate these devices without relying on any internal structure. Then we have seen that for testing purposes, for quality control, these generative AI models are not very powerful and are easily confused, especially when the shown items have rich graphical content but are not easily described in human language. Here the systems have major weaknesses, and we must rely on more explicit algorithms.
One of the algorithms that you will always come across when you do this is image registration. I don't know why none of the software packages, Cypress or Playwright, have this as part of their tooling. Image registration is the first thing you need to do to compare two images. Once you have the images registered, you can use the traditional algorithms to compare them. This is also the topic of the publications I mentioned earlier. Don't forget to check out our image-based test automation service, which you can find online.
Questions and Answers
Participant 1: Is zooming something that is solved, or is it a part of image registration? If there's one image which is zoomed in and the other one is not, are we able to compare them at this moment?
Stefan Dirnstorfer: Yes, zooming is solved. If the zoom is within a range of maybe 50% or so, then zooming is actually nothing other than moving: all the things move outward at different distances. If you have very large zoom levels, where you go to a factor of 2, 3, 4, 5, then you have other algorithms, where you use a SIFT identifier, the scale-invariant feature transform. With SIFT you can identify screen positions by a marker, an embedding that describes what the feature is, and that identifier is independent of zoom. This you can then find again in the new image. That's especially important for navigation, because objects tend to come closer and move further away. That's a very important part of registration.
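To illustrate, a minimal OpenCV sketch of scale-invariant matching with SIFT, with placeholder file names: keypoint descriptors survive the zoom, so the same feature can be found again in the rescaled image.

```python
import cv2

img_a = cv2.imread("screen_zoomed_out.png", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("screen_zoomed_in.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_a, desc_a = sift.detectAndCompute(img_a, None)
kp_b, desc_b = sift.detectAndCompute(img_b, None)

# Match descriptors and keep only clearly-best matches (Lowe's ratio test).
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(desc_a, desc_b, k=2)
        if m.distance < 0.75 * n.distance]
print(f"{len(good)} features found again despite the zoom")
```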
Participant 2: I have a question about Segment Anything Model from Facebook, for example. Does it help with quality control? Because it can segment blocks, for example, on a page. You can technically probably calculate if they are misaligned, for example, or things like this.
Stefan Dirnstorfer: Segment Anything was part of the slides that we saw on the DetACT model. They rely on Segment Anything information to detect where certain objects are. Segment Anything is also often part of the registration task, because when there are large areas with no feature that you can use to correspond to the other image, the whole area will be considered part of one segment. Segment Anything is an algorithm that is part of this all the time. The problem is that the segments are not unique. Sometimes you find them one pixel to the right, one pixel to the left. They're not extremely precise and reproducible. If you have a segment model and you can name the object that the segment represents, then you are in the GenAI area: you make a segment, say it's roughly here, and you can name it, then that's fine. But when you rely on an image being perfectly aligned, the segment information is not precise enough.
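As a sketch of that idea, assuming Meta's segment-anything package and a downloaded SAM checkpoint: generate the segments and compare their bounding boxes across versions, keeping in mind the pixel-level jitter just mentioned.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Checkpoint file name is the published ViT-B weights; adjust to your setup.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("page.png"), cv2.COLOR_BGR2RGB)
masks = generator.generate(image)            # one dict per detected segment
for m in sorted(masks, key=lambda m: -m["area"])[:10]:
    x, y, w, h = m["bbox"]                   # compare these across versions
    print(f"segment at ({x}, {y}), size {w}x{h}, area {m['area']}")
```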
Participant 3: When you traditionally test software, it runs as unit tests or integration tests as part of a build pipeline. What you showed us today were two different things. The first one was that the LLM could better handle unexpected things, like when you open up the map and then there's this prompt and this prompt and this prompt. A traditional program would have broken, but the LLM went through it because it has that world knowledge. Can you take at least that part, the LLM getting through the unexpected things, and package it up as a unit test, so that maybe the image comparison is still as good or as bad as traditionally, but at least the LLM lets you get through the unexpected parts of the test process? Can you package this?
Stefan Dirnstorfer: That is a very good use case, and you can definitely do this. The thing is, when you compare the images, you have a very high chance that the images actually are equal pixel by pixel. You will probably get a 90% success rate with that. Then, sooner or later, a minor style change somewhere breaks all your images, and you have to reset the baseline. You can do this as a human and check image by image, but it really takes a while to verify that none of the details changed along the way. It's possible, but what do you do with the images? It's very painful to go through them. Everyone I've spoken to who used one of those image-based comparisons, as they are part of Playwright, for example, has given up and said it's not useful. It's something they might use for one or two images, but really not more. If that's what you want to do, just have a long chain and then check one image that, in the worst case, a human could review, then that's perfectly feasible.
Participant 4: We know the challenges, what is intended change and what is not intended change. As a user of an AI model, how much control can I have in specifying what differences are feasible for me or are intended, versus flag these ones specifically?
Stefan Dirnstorfer: As soon as you can verbalize it and make text out of it, the LLM will be extremely clever at matching it against the specification or against the change request. As soon as you have derived textual information, the LLM can take over. If you do not have textual information, if it's just a pattern, for example, then you have to send it to a human. The human will definitely benefit from knowing whether it is just a shift or a rescaling, or whether, apart from the rescaling, something else changed along the way. Your review process is much faster if you are properly informed about what type of change it is. Even if you cannot verbalize it yet, this type of separation, telling the human it's actually just a shift, or just a changed color, or both, helps the human review the changes more quickly.
Participant 5: The presentation talked a lot about the quality. What I wanted to understand is your point of view on the scaling part. We have a bunch of test suites where we want to compare our candidate with the expected or the differences. The existing tools that you talked about might be running faster, so you'd be able to go through a larger test suite compared to an LLM, which will go through an inference step, which is itself a costly operation. I wanted to know your point of view on, do you foresee that as a challenge, something that needs to be resolved, or that's not a big deal right now?
Stefan Dirnstorfer: The question is whether the LLM needs to be sped up?
Participant 5: For example, we want to run the LLM-based comparison on a large suite of test images, essentially. At that time, the cost of the LLM is also a thing, where it will require resources to actually process the images and give you the outputs. With existing tools like Applitools or some of the other things, I would expect that the comparison will be much faster, because they would have well-defined algorithms which run quickly, in order to determine the differences. An LLM inference is actually costly, it takes a lot of time and resources.
Stefan Dirnstorfer: Yes, but LLM inference for image comparison has a lot of benefits, because it is, for example, able to do optical character recognition, so it knows whether a text is the same text that merely wrapped at a different point. A typical example of where image comparison algorithms break is: we have text, and all of a sudden a character becomes one pixel wider and needs to wrap onto the next line. The visual experience is totally different, but the text is the same. This cannot be done by Applitools, but the LLM can figure it out. Another case is when changes are actually equivalent but visually different. For example, a clock gets a new style, and an LLM could find out that this is the same type of thing and inform you about it. Obviously, plain image comparison is faster; with or without a registration step, you have a faster way to process these images. Applitools would actually, I think, benefit from a more advanced image registration. I don't think they have updated the algorithm since they started in 2015.
Participant 6: How good are these LLM-based automation tests with dynamic data, where we want to compare the structure of things but not the exact content? For example, a stock price graph might be different at different points in time. We want to test in Canary whether any changes are affecting prod queries or not.
Stefan Dirnstorfer: You have to think of it this way. When the LLM sees the screen with a stock price on it, it first puts all the information that it sees on the screen into some text model. I think that gives you a good intuition for how well that can work. A chart is a fractal, and you would need an infinite amount of information to transfer the entire chart to the text model. When you have this limited context, you can imagine for yourself what types of questions it could answer. What it cannot do: if, for example, you ask, did the stock price have a tiny dip between this point and that point, then it would ideally realize that this is not part of the context because it's too detailed, go back to the image, zoom into the relevant region, and check again. This going back and forth is what humans do with their visual pipeline: always forming hypotheses, going back, checking again, thinking, forming hypotheses. This multi-step process is not implemented. Maybe it will come, or I guess it will come, but not so fast. Judging by the progress we have seen so far, I guess I will still be giving similar talks in 2028.
Participant 6: Because the prompts to the LLMs can also become really complex, like, compare this part of things and not that part. Some things can be dynamic. Even if you are saying the stock prices are different at different times, there might be a scenario where there is a little bit of a gap between the lines, and it might say that it's due to the content and not the actual structure of the page. It can become really complex for the LLM.
Stefan Dirnstorfer: If it's textualized, if the intermediate step has the textual information about whether or not there is a gap, then you can instruct the text model. That's really powerful. I think that supersedes what humans are capable of. Once you have it in a text model, you just ask it, is there a gap? Then it only needs to go through the text and find the word gap. If the text says there is a gap between A and B, and you ask, is there a gap between A and B, then it answers correctly. But if there are many gaps and many things on the screen, the gap information is probably not part of the transferred context, and when you ask, the model will have forgotten about it, start hallucinating, and tell you whatever you want to hear.
Participant 7: How far are we from automating QA?
Stefan Dirnstorfer: You mean, when is the apocalypse, when we humans are totally replaced by AI? I don't think this will happen entirely. In QA, a lot of things are very much tied to us humans, for example, aesthetics or usability. Does it match our intuition? Is this a good place to put it? Is that where we expect it to be? How could an AI infer whether it works for us humans? You could also imagine that in the future it's not us using that software; it's computer use robots using the software for us, and then the software doesn't need to be usable for us, it needs to be usable for other robots. If that circle is closed and robots use their own software, then probably, I think, the robots can do their own QA. Whether AI will replace us is a difficult question. I have no idea; nobody knows. For these visual tasks, we have quite a while, many years to go, where the computer cannot match us in visual recognition of patterns, alignments, and detailed information.
According to Wikipedia, there are ten differences. Let's count them. Who has one? The bananas: three bananas versus two bananas, we have one. The clock, two. What else? The sock on the child's leg, yes, the foot. The candy in the cat's hand. The pin in the grandmother's hair. The dolls, yes. The picture on the wall, the mountains are shifted. The knobs, the handles on the cupboard. The cookie with this cat. The tongue of the child. That's ten things. And if you go to the ear of the cat, there's a smiley face. Then it's eleven. Yes, Wikipedia can't count.