When ChatGPT launched in November 2022, I spent a significant amount of time trying to understand whether it could genuinely help us do better work at our company: where the efficiencies actually were, and whether it represented a real benefit.
Our First Approach and Early Experiments
I put together a team and called it the "AI Innovation Team", made up not of decision makers but of the enthusiasts and experimenters. We explored where AI might create real value, discussed upcoming changes, reviewed the big new demos, and so on. I built prototypes to see how the promises I read online compared to reality, and to see how our products could evolve.
To my never-ending amazement, my own productivity with AI kept climbing day after day. I became a strong believer in its value, but at first it was going to be difficult to justify the cost of a license for everyone. The CEO was afraid it could become yet another tool we provide that people barely touch, and he had a point. So I set out to identify which roles and profiles would get the biggest lift from licenses.
The “Safe” Enterprise LLM Idea
Justifiably, we were very uncomfortable putting our precious IP or client data into third-party LLMs, so I also investigated the obvious alternative of deploying an internal, agency-wide LLM tool. Secure. Hosted by us. Available to everyone without a per-person license. Authenticated with company SSO. Connected via APIs. No data leakage. No black boxes. On paper, it was the right enterprise approach.
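To make the idea concrete, here is a minimal sketch of the kind of gateway we pictured. The framework choice, the SSO check and the model call are all illustrative placeholders, not our actual design:

```python
# Minimal sketch of a shared internal LLM gateway: one endpoint,
# SSO-authenticated, proxying an internally hosted model so no data
# leaves the company network. Everything here is a placeholder.
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

def require_sso(authorization: str = Header(...)) -> str:
    # Placeholder: a real deployment would validate an SSO-issued token.
    if not authorization.startswith("Bearer "):
        raise HTTPException(status_code=401, detail="SSO token required")
    return authorization.removeprefix("Bearer ")

@app.post("/chat")
def chat(payload: dict, user: str = Depends(require_sso)):
    # Placeholder: call the internally hosted model here instead of
    # echoing; the point is that prompts never reach a third party.
    return {"user": user, "reply": f"(internal model response to: {payload['text']})"}
```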
Soon, I realized that this path came with two structural problems.
First, the models were evolving too fast for us to keep up and maintain an edge. Keeping models updated, re-testing behavior after each release, adjusting system prompts, validating regressions and continuously integrating new capabilities wasn’t a side project. It was a major commitment that would have required at least one dedicated full-time role.
Second (and more importantly), the intelligence gap was undeniable. The same prompts run through raw API access produced materially worse results than those run inside ChatGPT itself, and the difference wasn’t subtle. I could see two possible explanations: the ChatGPT system prompt (raw API calls don't include one), or a smarter model behind ChatGPT that wasn't available through the API. Given that the API was the bigger money maker at that point, the most likely explanation seemed to be the system prompt. Intelligence, it turned out, is not just the model; it’s also the instruction layer that shapes how the model responds. The raw API is like a genius with no mentor or guidance.
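Here is a minimal sketch of that difference using the OpenAI Python SDK. The model name and the system prompt text are my own stand-ins; OpenAI's actual instruction layer is not public:

```python
# Minimal sketch: the same user prompt, with and without an instruction
# layer. Model name and system prompt are illustrative stand-ins.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "Summarize the risks of self-hosting an internal LLM."

# Raw API call: no system prompt, so the model answers unguided.
raw = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
)

# Same call with a system prompt standing in for the guidance a product
# like ChatGPT layers on top of the base model.
guided = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are a careful, thorough assistant. Reason step "
                       "by step, state your assumptions, and structure your "
                       "answer for a business reader.",
        },
        {"role": "user", "content": question},
    ],
)

print(raw.choices[0].message.content)
print(guided.choices[0].message.content)
```

Even a few lines of instruction like this visibly change the output, which is why the gap between ChatGPT and the raw API felt so stark.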
At that point, it became clear that continuing down the “build and host everything ourselves” path would be expensive and ultimately quite inferior for the people doing the work. Looking back, I’m so glad we walked away when we did. Many organizations went the other way for perceived "strength and safety", and I think they paid for it in lost momentum: if people’s first exposure to AI is a watered-down experience, it reinforces the conclusion that AI is a time-waster, and they stop investing the effort to discover where it’s actually useful.
Choosing One Core LLM Platform
By the end of 2023, I was convinced. The question wasn’t whether an LLM belonged in daily work, but which platform we should choose. I also concluded that our biggest gains would come from getting one core LLM deeply adopted, rather than from getting good at the AI features inside individual apps or at a patchwork of specialized AI tools.
Which One? ChatGPT, Gemini, Claude or Microsoft Copilot
At the time, Gemini didn’t match the reasoning quality we were getting from ChatGPT. It was often well-sourced, but on the same prompts it repeatedly produced less intelligent results. More importantly, it wasn’t set up for collaboration: there was no way to share chats or Gems (Gemini's equivalent of Custom GPTs) across the company, which ruled out one of the biggest organizational multipliers of AI, letting people benefit from each other's skill sets.
I considered Anthropic's Claude, but at the time it didn't have a clear path to long-term platform dominance, and we didn't want to tie our business to the less future-proof option.
Microsoft Copilot, on paper, looked like the clear answer. We were already a Microsoft 365 organization. Teams, Outlook, SharePoint, everything lived there. Copilot promised deep integration and enterprise security, it was part of every tool we used daily, and it was powered under the hood by the same OpenAI models I loved. But I would never commit the sin of deciding on claims alone, so I designed a test.
The Surprising Failure of the Most Promising Path
I selected ten power users across different roles: people already fluent in ChatGPT, curious, skeptical, with pain points that MS Copilot promised to solve, and committed to finding the truth rather than defending a preference. For three months, they would run a variety of tasks side by side. Same work. Same scope. Same evaluation criteria. Everything documented.
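For the curious, here is a hypothetical sketch of how a side-by-side log like that can be structured. The field names, scoring scale and sample rows are illustrative, not our actual template:

```python
# Hypothetical structure for a side-by-side evaluation log; all names
# and values here are illustrative examples, not the real records.
import csv
from dataclasses import asdict, dataclass

@dataclass
class TrialRecord:
    tester: str      # which power user ran the task
    task: str        # identical task given to both tools
    tool: str        # "ChatGPT" or "Copilot"
    quality: int     # 1-5 rating against the shared criteria
    completed: bool  # did the tool finish the task at all?
    notes: str       # free-form observations

records = [
    TrialRecord("user01", "Summarize client brief", "ChatGPT", 4, True, "solid first draft"),
    TrialRecord("user01", "Summarize client brief", "Copilot", 2, True, "missed key asks"),
]

# Write the log to CSV so results can be compared across testers.
with open("eval_log.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(asdict(records[0]).keys()))
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
```

Keeping the same tasks and the same criteria for both tools is what made the eventual conclusion unarguable.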
By week six, we retired the test. The conclusion was unanimous: Copilot failed.
Not marginally. Dramatically! The “bonuses” it promised (system integration, document awareness, enterprise context) did not even work most of the time. And for anything ChatGPT could do, Copilot did it worse. Often much worse.
Scaling ChatGPT Adoption
I presented this outcome to our CEO, backed by unarguably clear data, and we walked away from the tool that looked safest and most standardized for the one that actually worked. We rolled out ChatGPT licenses across the company. I recorded a core training session, accompanied by an assessment that was required to get a license, so everyone had a foundational understanding of how to use it well. We monitored usage and provided support for those who struggled to find the value.
I also started weekly 15-minute training sessions and shared use cases. I designed the program more like a "promotional campaign around amazing things you can do with ChatGPT" than an educational program. Productivity gains soon followed.
The Shift in 2025: Google Catches Up
Then 2025 happened.
Quietly, steadily, Google started to catch up and then pull ahead. Gemini didn’t sprint. It walked. Carefully. Deliberately. We had already seen Gemini Deep Research clearly outperforming ChatGPT Deep Research, and Google's NotebookLM showed a level of source fidelity that ChatGPT couldn't match.
But then, with the release of Gemini 3 and Nano Banana Pro, and the added ability to share chats and Gems, the remaining gaps were closed. Now Google was ahead of OpenAI, and it seemed to have integrated Gemini into Workspace the right way (what MS Copilot tried to achieve but couldn't).
Leadership, Trust and Long-Term Direction
The contrast in leadership styles has also been hard to ignore.
Demis Hassabis of Google DeepMind has moved with a kind of restraint that inspires confidence. No theatrics. No stunts. More substance. Tangible patience. More respect for the downstream consequences of powerful systems and an intention to do good.
At the same time, decisions coming out of OpenAI this year have made me increasingly uncomfortable tying a business’s long-term foundation to that direction. GPT-5’s uneven rollout didn’t bother me as much as it bothered others, but Sam Altman's handling of the events did.
The Sora 2 rollout was a big one for me. It launched with huge copyright violations, followed by a casual “oops, ok, forgive the millions of sloppy AI videos already produced, we'll look into fixing the copyright issues going forward” after the model had already been trained on that material. Releasing a product focused on yet another endless scroll of low-value AI content felt like an immature business direction and a bad use of AI.
For a consumer product, this might be forgivable. For a system businesses are expected to trust deeply, it raises flags.
Then there was a customer support experience we had with OpenAI that reinforced a broader pattern of dismissiveness and lack of accountability, and exposed a support model not built for businesses that rely on the product. A security incident on their side led to our entire company’s ChatGPT access being shut off by mistake for nearly a month. We couldn’t reach a human, and we ended up having to repurchase ~$3k in licenses just to get back online, with still no resolution a year later.
Where We Are Now
So, where to from here? The next step is taking my personal evaluation of Gemini against ChatGPT and expanding it to a larger group of users to test my hypothesis.
What I take from all of this is less philosophical and more operational: AI rewards improvisation and adaptability. And it punishes rigid assumptions.
The Core Lesson for Leaders
We all walked into this with beliefs that sounded reasonable. “Microsoft should have a reliable enterprise answer.” “The safe self-hosted tool has to beat the commercial model.” “If it’s the safe option, it’ll also be the smart option.” In our case, those beliefs didn’t hold up.
And the only reason we caught it is that we treated them as hypotheses to test, not truths to defend.
The takeaway I’d offer other leaders is this: AI is very new, very fast, very uncharted. It is a frontier, and what lies ahead is not always clear. Nobody has the right answer; not vendors, not analysts, not the loudest person in the room. That means the mindset has to change. Treat your assumptions as temporary, expect today’s “obvious choice” to be wrong sometimes, and stay willing to adapt. Pick a direction, but hold it lightly: keep an open mind, and let data-driven tests and real outcomes drive your decisions. And be ready to change course when the evidence changes.
PS: A Note Regarding Microsoft Copilot
I have a lot of respect for the more responsible stance Mustafa Suleyman has taken around AI and I genuinely want Microsoft to succeed here.
In 2024, our development team ran into the same kind of gap with the GitHub Copilot AI coding assistant that the whole company experienced with Microsoft Copilot. GitHub Copilot was widely positioned as the new industry standard for developers (and it, too, was built on OpenAI models). Many tech teams rolled it out as the “official” choice. We did our own side-by-side evaluation anyway, and the outcome was similar: for the work our developers needed to do, they consistently preferred ChatGPT, even though it was far less convenient and not integrated.
We kept watching closely. In 2025, when GPT-5 and Claude Sonnet 4 became available inside GitHub Copilot, the tool improved dramatically and started working the way we had hoped it would in 2024, so we started using it.
That shift reinforced a broader point: these tools can change overnight when the underlying models change. I assume Microsoft Copilot may have improved as well, but once a tool loses trust, it takes more than incremental updates to win back the time and momentum people lost while trying to make it work.
— Yas Dalkilic
Head of AI, RAB2B