As the head of AI at Comtrade System Integration, I've been closely observing the rapid advancements in generative AI models. These models have become impressively smart, and I use them extensively in both my work and personal life. However, I've noticed a significant limitation: many applications based on large language models or multimodal models seem to be stuck in a traditional chatbot paradigm.
Limited Capabilities
Interacting with these chatbots often feels like working with a very smart but peculiar colleague. They can provide advice on how to do things, but they never actually help out by doing part of the task. They never want to jump on a call with us to really talk, but instead want to exchange endless messages. They also can’t be bothered to check what’s on our screen, and they don’t seem to know us or have any awareness of what’s important to us. Just imagine for a moment how frustrating it would be to work day-to-day with a colleague who behaved this way. And yet, for most users, this is currently the reality of our interactions with AI systems.
Let’s explore how our interaction with AI might soon change and in what directions these changes might occur.
The Problem with Current Chatbots
Currently, chatbots can tell us how to do something, but we’re usually the ones who have to do it. They only work while we’re interacting with them. Imagine if this were true for humans – if you were the CEO of a large company, but all your employees only worked when you were directly looking at them or interacting with them. Not much work would get done, and many resources would be wasted.
The Future of AI Agents
The future direction is clear: we’re moving towards a point where we’ll be able to delegate tasks to AI agents who will execute them on their own and report back when the task is finished. This is exactly how effective delegation works among humans.
An interesting corollary to this is that AI agents don’t need our attention to work. This means there can be many of them – ten, a hundred, or maybe even a thousand. This shift could fundamentally change the impact AI might have in an organization.
Defining AI Agents
We can define an AI agent as an autonomous system that is able to perceive its environment and take actions to achieve specific goals. But let’s illustrate this with a specific example.
Current vs. Future Approach: A Coding Example
In a conventional coding assistant system, you might ask how something would be implemented. The assistant could provide advice and maybe generate some code. But then you’re the one who actually has to take this code, enter it into an IDE, find a way to run the code, possibly get back an error message, then ask the assistant again what to do. It requires constant attention and involvement.
An agentic framework, on the other hand, might involve describing to the agent what result we’re looking for – what kind of function or program we want to develop. The agent would then, on its own, install the necessary libraries, write the code, execute it, read the error messages, respond to them appropriately, and so on. It would only get back to us when it believes its task is done according to some criteria, or if it gets truly stuck.
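To make this concrete, here is a minimal sketch of such a loop in Python. It is not tied to any particular framework: the `llm` parameter stands in for whatever chat-completion call you use, and the file name, iteration limit, and stopping criteria are illustrative assumptions.

```python
import subprocess

MAX_ITERATIONS = 5

def agentic_coding_loop(goal: str, llm) -> str:
    """Sketch of an agent that writes, runs, and repairs code until the goal is met.

    `llm` is assumed to be a callable that takes a prompt string and returns the
    model's reply as a string (e.g. a thin wrapper around any chat-completion API).
    """
    code = llm(f"Write a standalone Python script that accomplishes: {goal}. Reply with code only.")
    for _ in range(MAX_ITERATIONS):
        with open("task.py", "w") as f:
            f.write(code)
        # Execute the generated script and capture whatever it prints or raises.
        result = subprocess.run(["python", "task.py"], capture_output=True, text=True)
        if result.returncode == 0:
            return f"Task finished. Output:\n{result.stdout}"
        # Feed the error back to the model and ask for a corrected version.
        code = llm(
            f"The script below failed with this error:\n{result.stderr}\n\n"
            f"Script:\n{code}\n\nReturn a corrected version. Reply with code only."
        )
    return "Agent is stuck after repeated failures; reporting back to the user."
```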
Tools for Working with Agents
There are now multiple frameworks supporting work with agents. For those more used to working in the Microsoft ecosystem, two options are particularly interesting: Semantic Kernel, Microsoft’s SDK for orchestrating model calls, plugins, and planners, and AutoGen, Microsoft’s framework for building multi-agent conversations, as the sketch below illustrates.
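As one illustration, here is roughly what the coding scenario above looks like in AutoGen’s two-agent pattern: an assistant agent proposes code, and a user-proxy agent executes it locally and feeds the results (including errors) back until the task is done. The class names and parameters follow AutoGen 0.2 and may differ in other versions; the model name and task are placeholders.

```python
import os
from autogen import AssistantAgent, UserProxyAgent

# Model configuration; the model name here is just an example.
llm_config = {"config_list": [{"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]}]}

# The assistant proposes code; the user proxy runs it locally and reports results back.
assistant = AssistantAgent(name="assistant", llm_config=llm_config)
user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",  # fully autonomous: no human confirmation between steps
    code_execution_config={"work_dir": "workspace", "use_docker": False},
)

# The two agents iterate (write code, execute, read errors, fix) until the task is done.
user_proxy.initiate_chat(
    assistant,
    message="Write and run a Python script that downloads the latest EUR/USD rate and prints it.",
)
```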
Why Now is the Right Time for Agents
The concept of AI agents has been discussed for a while in relation to generative AI models. However, recent developments have made this a really great time to actually start working with them. A significant change occurred on September 12th when OpenAI released their next generation of models, referred to as the o1 models.
These new models have a particular property that sets them apart: they’re notably better at reasoning than previous generations. The underlying principle of how these models operate has shifted. Previously, models spent a considerable amount of compute during their training, but relatively little of it in inference mode when actually being used. With the o1 models, this has changed. They now spend more time processing during the inference stage, considering various pathways to answer questions or complete tasks before providing a final answer.
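This shift is visible directly in the API: part of the compute now shows up as reasoning tokens spent before the visible answer. A minimal sketch using the OpenAI Python SDK follows; the model name and usage fields reflect the o1 preview at launch and may evolve.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "Plan the steps to migrate a monolith to microservices."}],
)

print(response.choices[0].message.content)
# Tokens spent "thinking" before the final answer, billed as part of completion tokens.
print("Reasoning tokens:", response.usage.completion_tokens_details.reasoning_tokens)
```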
One of the main limitations of current chatbot systems is the limited context they work with. This often leads to disappointing outputs, even when users expect comprehensive answers.
The Problem with Manual Context Input
Manually providing context to chatbots is tiresome. As a result, people often provide very little or no context, yet still expect the AI to give good outputs. This mismatch between input and expected output is a significant issue.
Limitations of Current RAG Applications
While we have Retrieval Augmented Generation (RAG) applications trying to provide a shortcut in context provision, they too have limitations. In these applications, the context is usually limited to a certain number of pre-selected files. Even though the files might be updated occasionally, the context remains largely static.
The Need for Automatic Context Gathering
To address these issues, we need AI systems that can gather context automatically and directly. This context should come from two main sources: what happens on our screens as we work, and what we hear and say in our conversations.
This approach mirrors how we, especially those in knowledge work, naturally gather the context we need to work effectively.
Emerging Solutions for Context Gathering
Some promising developments are emerging in this area:
Microsoft’s Recall service, set to launch on Copilot+ PCs in October, takes a straightforward approach to context gathering. It automatically captures screenshots of your screen every five seconds, creating a visual record of your work. What sets Recall apart is its ability to enable natural language search through this captured content. This means you can easily retrieve relevant information from your past work sessions using simple queries, effectively expanding the context available to AI systems.
ScreenPipe offers a more comprehensive approach to context gathering. Beyond capturing screenshots from multiple screens, it also records all input and output audio passing through your computer. This dual approach of visual and audio capture provides a richer context for AI systems to work with. ScreenPipe stands out for its focus on privacy and customization. It supports local processing of data, meaning you don’t need to share your information with external APIs or cloud services. While this requires a GPU or other means of running language models locally, it offers greater control over your data. As an open-source tool, ScreenPipe also allows for customization and community-driven improvements, making it a flexible solution for those with more specific needs or privacy concerns.
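As a sketch of how such captured context could feed an AI system, the snippet below queries a local search endpoint (shown here as `http://localhost:3030/search`) and turns the hits into prompt context. The port, path, and parameter names are assumptions that may not match your ScreenPipe version, so treat this as an outline rather than a reference.

```python
import requests

def gather_recent_context(query: str, limit: int = 10) -> str:
    """Pull recent on-screen and audio snippets matching a query from the local capture store.

    The endpoint and parameter names are illustrative; check the ScreenPipe docs
    for the actual local API exposed by your version.
    """
    resp = requests.get(
        "http://localhost:3030/search",
        params={"q": query, "limit": limit},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get("data", [])
    # Flatten whatever text fields come back into a single context string.
    return "\n".join(str(item) for item in items)

context = gather_recent_context("quarterly budget")
prompt = f"Using the context below from my recent work, summarize the open questions.\n\n{context}"
```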
Future Possibilities
While current solutions focus on screen capture and audio recording, there are discussions about more advanced context-gathering methods. Some experiments with wearable devices for context gathering are happening, particularly in certain tech hubs.
However, it’s worth questioning whether new wearable technology is necessary for this purpose. Most people already carry smartphones, which are capable of recording contextual information. The key will be developing software that can effectively utilize the sensors and capabilities of existing devices to gather rich, relevant context.
By improving how AI systems gather and understand context, we can significantly enhance their ability to provide relevant and helpful responses, moving beyond the limitations of current chatbot interactions.
Current Approach: Ad Hoc Searches
Currently, when a user queries a chatbot, it performs an ad hoc search. This search might happen in different ways – it might look into various files that it has access to, or it might perform a rapid external search with the help of a search engine. Based on what it’s able to quickly gather, it tries to answer the query.
Limitations of the Current Approach
This ad hoc approach has quite a few limitations. There’s always a question of what kind of data we’re actually getting back. Is it still valid? To what extent is it really relevant to the query? While it might produce acceptable results overall, it often does not.
The Concept of Living Knowledge Bases
A possible direction for improvement is to create living knowledge bases that are continuously updated and checked for inconsistencies based on incoming data. Imagine this as a kind of Wikipedia page for a specific topic you’re interested in. Any new information the system gains access to would be checked for relevance to this particular topic. If relevant, it would see whether it needs to create an update to that Wikipedia page.
RAG vs. GAR
This introduces an interesting duality to what we’ve come to know as RAG (Retrieval Augmented Generation). We could introduce a concept that I’m calling GAR – Generation Augmented Refinement. Here, updates to the knowledge base are actually generated by AI based on incoming information. The principle is rather different, but in practice, you could combine RAG and GAR to create a system that gives better answers to your questions.
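A minimal sketch of the GAR idea might look like the following: each piece of incoming information is first checked for relevance to a topic page and, if relevant, the page is regenerated to incorporate it. The two-step prompt structure and the `llm` helper are assumptions about how such a refinement loop could be organized, not a fixed recipe.

```python
def refine_knowledge_base(page: str, topic: str, incoming_text: str, llm) -> str:
    """Generation Augmented Refinement sketch: update a living 'wiki page' from new input.

    `llm` is assumed to be a callable taking a prompt string and returning the reply.
    """
    # Step 1: is the new information relevant to this topic at all?
    verdict = llm(
        f"Topic: {topic}\n\nNew information:\n{incoming_text}\n\n"
        "Answer YES or NO: is this relevant to the topic?"
    )
    if not verdict.strip().upper().startswith("YES"):
        return page  # nothing to do; the page stays as it is

    # Step 2: rewrite the page so it stays consistent and incorporates the new facts.
    return llm(
        f"Here is the current page on '{topic}':\n{page}\n\n"
        f"Incorporate this new information, resolving any inconsistencies:\n{incoming_text}\n\n"
        "Return the full updated page."
    )
```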
The Current Passive State of Chatbots
Currently, chatbots aren’t proactive at all. To illustrate how reactive they are, there was a recent example where, apparently due to a glitch, ChatGPT was messaging some users first, without getting an actual query from them. This almost went viral, with people sharing screenshots, very surprised and even shocked at what had happened. The reaction itself shows how unusual it currently is for a chatbot to reach out to us directly.
The Limitation of “Known Unknowns”
The problem with this passive approach is that chatbots can only answer what we could call “known unknowns”. If you know you don’t know something, you’re the one who has to initiate the interaction with the chatbot and ask a question based on what you know you don’t know.
The Future: Proactive AI Systems
In the future, AI systems will become much more proactive, reaching out to us as needed. The major shift here is that such systems will be able to address our “unknown unknowns” – things we don’t even know we don’t know, but that are helpful for us.
The Concept of an Extended Nervous System
What these systems might actually lead to is what I sometimes think of as a kind of extended nervous system. Let’s put together some of the things I’ve mentioned so far. Imagine AI agents that are constantly reading the context we are receiving in order to determine what is relevant for us and what is not. And this happens automatically in the background.
Based on this stream of relevant information, the AI agents could maintain a highly relevant living knowledge base for us, updating it continuously whenever new information enters our context. They could also have the discretion to reach out to us as needed. Something might come onto an agent’s radar without us even being aware of it; recognizing that it is significant for us, the agent would update the relevant part of the knowledge base and then contact us to let us know that this has happened.
Managing Information Overload
This proactive approach introduces a new challenge: information overload from numerous AI agents. An intriguing solution is a system of virtual credits that agents must spend to gain our attention. Imagine an agent patiently accumulating credits for truly important information, treating our attention as a valuable, limited resource. This approach ensures that only the most significant updates break through, preventing overwhelming barrages of information.
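One way to picture this mechanism is sketched below; the credit cost, importance scores, and digest behaviour are purely illustrative assumptions.

```python
from dataclasses import dataclass, field

ATTENTION_COST = 10  # credits an agent must spend to interrupt the user

@dataclass
class Agent:
    name: str
    credits: int = 0
    pending: list[str] = field(default_factory=list)

    def observe(self, message: str, importance: int) -> str | None:
        """Accumulate credits proportional to importance; interrupt only when affordable."""
        self.credits += importance
        self.pending.append(message)
        if self.credits >= ATTENTION_COST:
            self.credits -= ATTENTION_COST
            digest = "; ".join(self.pending)
            self.pending.clear()
            return f"[{self.name}] {digest}"  # this is what actually reaches the user
        return None  # not important enough (yet) to claim the user's attention

agent = Agent("news-watcher")
for update, importance in [("minor price change", 2), ("competitor launched rival product", 9)]:
    alert = agent.observe(update, importance)
    if alert:
        print(alert)
```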
Current Limitations
Current chatbots are useful for many things, including helping us prepare when we need to make a strategic decision – when we have enough time to gather all the information, do the research, sleep on it overnight, and so on. This is the realm where chatbots are very useful today.
However, when you move to a more tactical setting – for example, negotiations, sales, interviews, or standing in front of a large audience explaining something – current chatbots are of little use.
The Need for Speed
The future direction is clear: AI systems will become fast enough to provide us with real-time decision assistance. This will hold a lot of value for many people. The key question here is speed, because this is the main limitation preventing us from seeing more of these systems already today.
Advancements in AI Hardware
NVIDIA still holds the market-leading position when it comes to systems capable of doing AI inference. However, now there is competition. Three challengers providing hardware primarily for inference are Groq, Cerebras, and SambaNova.
In a recent benchmark using the Llama 3.1 70B Instruct model, Cerebras achieved over 500 tokens per second, followed by SambaNova with slightly over 400, and Groq with 250. The next provider, likely using GPUs, was Together AI with 77 tokens per second. This large speed advantage will help make real-time intelligence possible.
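To put those numbers in perspective, here is a quick back-of-the-envelope calculation of how long a short, spoken-length reply would take to generate at each reported speed; the 150-token reply length is an arbitrary assumption.

```python
# Reported output speeds for Llama 3.1 70B Instruct, in tokens per second.
speeds = {"Cerebras": 500, "SambaNova": 400, "Groq": 250, "Together AI (GPU)": 77}

REPLY_TOKENS = 150  # roughly a short spoken paragraph; purely illustrative

for provider, tps in speeds.items():
    print(f"{provider}: {REPLY_TOKENS / tps:.2f} s for a {REPLY_TOKENS}-token reply")
```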
Delivering Real-Time Intelligence
The question is also how this real-time intelligence should reach us. One option is to use screens – you might get this information on your computer screen or your phone. However, this might only cover a small part of all these tactical interactions in which humans partake. There’s also the option to have headphones where the model might whisper some advice to you, but that probably isn’t optimal because it’s somewhat distracting.
Personally, I believe that this real-time intelligence creates a great use case for smart glasses. We’re seeing some interesting products here created by Meta. I currently prefer a product called Frame from the startup Brilliant Labs, which tries to build more tinkering-friendly software and hardware. It has microphones, a camera, and, most interestingly, a simple AR layer over the glasses that can, for example, display text being generated by a language model.
Future Possibilities
Looking further into the future, devices such as Neuralink might start playing a role in real-time intelligence. Even though today it might sound a bit far-fetched, I believe it is actually much closer than most people think.
Current Limitations
Current chatbots might be able to use certain tools like a search engine, or access a file or a database. However, these chatbots almost never connect us to other humans.
The Future of AI-Human Connections
Future AI systems will have the ability to contact humans or other AI systems as needed. This could be either to obtain information that is currently not available to the AI system by other means, or in some cases to delegate a task. This possibly also changes the dynamic where it’s always humans delegating tasks to an AI system – it might actually happen the other way around.
AI-to-Human Communication
For AI-to-human communication to be as smooth as possible, a key channel is voice. In this domain, a new milestone has also just been reached: OpenAI has made the new Advanced Voice Mode available in the ChatGPT application, and the corresponding model is accessible through their new Realtime API.
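A minimal, text-only sketch of talking to the Realtime API over a WebSocket is shown below. The URL, model name, beta header, and event shapes reflect the preview announcement and may change, so treat the details as assumptions rather than a reference implementation.

```python
import asyncio
import json
import os

import websockets  # pip install websockets

async def main() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: newer versions of the websockets library call this argument `additional_headers`.
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Ask the model for a response; audio input/output uses the same event mechanism.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["text"], "instructions": "Greet the user warmly."},
        }))
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "response.text.delta":
                print(event.get("delta", ""), end="", flush=True)
            elif event.get("type") == "response.done":
                break

asyncio.run(main())
```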
Emotional Intelligence in AI
Another aspect that is also well supported now with advanced voice AI is the focus on emotional intelligence. If this interaction between AI and humans is to become more common, AI systems will need to respond to certain emotional states that humans are in and also to communicate with a voice that contains some semblance of emotions. Otherwise, humans will find it robotic.
Current State: AI in the Digital World
Currently, chatbots, like most other AI systems, reside almost entirely in the digital world. Of course, there are exceptions, but for the most part, that is where they live.
The Future: Embodied AI Presence
I believe that AI systems will progressively increase their embodied presence in the real world, let’s call it the non-digital world. And this will happen in many different ways.
The Role of Humanoid Robotics
One of the most noticeable ways this will happen is through humanoid robotics. Whenever I mention humanoid robotics, most people still consider it something very far away, something we really shouldn’t be bothered with right now. I think it is actually one of the most misunderstood technologies today, and also a highly undervalued one.
If you pay a bit more attention to this space, you notice that a lot of companies are now competing in this area, such as 1X, Figure, Tesla, Sanctuary AI, Agility Robotics, Unitree Robotics and many more. Many of them have reached a level of maturity where at least limited commercial releases of their products are realistic in 2025.
As we move beyond the chatbot paradigm, we’re entering an era where AI becomes more autonomous, context-aware, proactive, and deeply integrated into both our digital and physical worlds. These advancements promise to reshape how we interact with AI, potentially leading to more natural, efficient, and powerful collaborations between humans and artificial intelligence.