Feedback loops, and Google’s home-field advantage with LLMs
When I was at Pinterest leading product for the Discovery team, we started to leverage deep learning for our recommendations. Each deep learning upgrade enabled a step-function change in the results (as measured by user engagement), but the results got dramatically better once we took that engagement data and fed it back into the model to tune it further. We went from a world of discontinuous change with each step-function upgrade of the model to a world with a compounding feedback loop.
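To make the loop concrete, here's a toy simulation (entirely illustrative, and nothing like Pinterest's actual stack): a model serves its best guesses, users click on what genuinely appeals to them, and each cycle of folding those clicks back into the model compounds on the last.

```python
import random

# Toy compounding feedback loop. The "model" is just a score per item;
# "training" nudges scores toward whatever users actually engaged with.
random.seed(0)
ITEMS = list(range(20))
true_appeal = {i: random.random() for i in ITEMS}  # hidden user preference
scores = {i: 0.5 for i in ITEMS}                   # model's initial guess

def serve(k=5):
    """Rank by current model scores and return the top-k items."""
    return sorted(ITEMS, key=scores.get, reverse=True)[:k]

def log_engagement(shown):
    """Implicit feedback: users click in proportion to true appeal."""
    return [i for i in shown if random.random() < true_appeal[i]]

for cycle in range(10):
    shown = serve()
    clicks = log_engagement(shown)
    for i in shown:                                # fold engagement back in
        target = 1.0 if i in clicks else 0.0
        scores[i] += 0.3 * (target - scores[i])
    avg = sum(true_appeal[i] for i in shown) / len(shown)
    print(f"cycle {cycle}: avg appeal of served items = {avg:.2f}")
```

Run it and the average appeal of what gets served trends upward cycle over cycle: no single retrain is a breakthrough, but the loop compounds.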
This feedback loop that enables continuous improvement has been Google’s brilliance and core advantage. Type any query into Google, and the nature of a search interface is that Google provides a set of possible answers. The benefit of owning the search interface (and often the browser) is that Google can then analyze the user engagement clickstream data (conversion rates, bounce rates, time away or on site, etc.) to assess, and therefore tune, the quality of the results.
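Here's a back-of-the-envelope sketch of how implicit signals like these could be turned into a per-result quality score. The event schema and scoring formula are my own invention, not Google's:

```python
from dataclasses import dataclass

# Hypothetical clickstream event; field names are illustrative.
@dataclass
class ClickEvent:
    result_url: str
    clicked: bool
    dwell_seconds: float  # time on the destination page before returning

def result_quality(events, min_dwell=30.0):
    """Score one search result from implicit signals: click-through rate,
    discounted by bounce rate (quick returns to the results page)."""
    shown = len(events)
    clicks = [e for e in events if e.clicked]
    if not shown or not clicks:
        return 0.0
    ctr = len(clicks) / shown
    bounce_rate = sum(e.dwell_seconds < min_dwell for e in clicks) / len(clicks)
    return ctr * (1.0 - bounce_rate)  # clicked often and not bounced = good

events = [ClickEvent("example.com/a", True, 120.0),
          ClickEvent("example.com/a", True, 5.0),
          ClickEvent("example.com/a", False, 0.0)]
print(result_quality(events))  # 2/3 clicked, half of clicks bounced -> 0.33
```

The point isn't the formula; it's that none of these signals require the user to do anything beyond searching as they normally would.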
ChatGPT doesn’t have this same benefit. Type a query into ChatGPT and you get a single, definitive result. You can give it a thumbs up or down, you can ask it to “regenerate” the response, and you can type feedback into a text box, but all of these require explicit action from the user. Some might argue that’s high-signal, but to me it only scratches the surface: there isn’t a native, implicit data feedback loop that helps ChatGPT’s results get better over time by leveraging user engagement.
If ChatGPT’s results were perfect, this wouldn’t be an issue. But we know that while they feel like magic, ChatGPT and LLMs generally still suffer from hallucinations and inaccuracies. For many types of queries (queries that produce listicles being my personal favorite), a ChatGPT result could be thought of as no different from any other link in the search results: human in its fallibility. That said, you’d have to bet that an AI will eventually get to a place where its understanding and summary of the web’s corpus is better than any individual human’s answer. The question to me is how much LLMs still need humans to help them get there.
Reinforcement Learning from Human Feedback (RLHF) is a well-reported, critical part of OpenAI’s training process for ChatGPT. But that process tunes for whether an answer seems sensible, not for the precision or accuracy of the answer. I’d imagine the diversity of queries ChatGPT’s interface invites, and the nuance required for truthfulness and precision, are daunting if you have to rely entirely on human labelers. What if there were a native feedback loop?
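For reference, here's the standard pairwise loss used to train a reward model from labeler comparisons, per OpenAI's published InstructGPT work (function names are mine). Note what it optimizes: which answer a labeler preferred, not whether the answer is true.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss for a reward model: push the reward of the response
    a human labeler preferred above the reward of the one they rejected."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))

# The reward model never sees "is this answer correct?", only "which of
# these two answers did a labeler prefer?" That is why RLHF tunes more
# directly for plausibility than for factual accuracy.
print(preference_loss(reward_chosen=2.1, reward_rejected=0.4))  # ~0.17
```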
This is where I get curious about Bing’s integration of OpenAI’s API into its search results. I can only speculate on how specifically Bing (or OpenAI, depending on the data-sharing agreement between the two) will leverage the clickstream data generated by having “chat mode” results in search, and on the quality of that data. It’s not as native an integration as the ranking order of ten blue links, so how exactly it gets used is up for debate. Still, relative to what OpenAI is getting from ChatGPT, it will be a torrent. One big question to me is how much OpenAI benefits from that data, or whether Microsoft effectively ends up with a parallel model that diverges from OpenAI’s.
Which leads me back to Google. Damn, Google is way better positioned here. At a minimum, they have way more queries, and the technology is all under the same corporate umbrella, which facilitates the data sharing necessary to create the best feedback loop. It’s like having home-field advantage while OpenAI plays an away game in a foreign land. It will be exciting to see what happens when Google does show it can dance. Their home-field advantage means they get to choose the music.
Taking a step back, it’s remarkable what OpenAI has achieved thus far without this user engagement data. Imagine what could happen once it has it.