It's 2 am here in India. OpenAI has just announced its latest and greatest model - GPT 4. Greg Brockman, CTO of OpenAI, went live to showcase its capabilities and limitations and honestly, it has left me awestruck.
It's been almost 3 years since GPT 3 was launched, 3 months since chat-gpt (aka GPT 3.5) was launched and now GPT 4 is out. The rate of innovation is hard to keep up and the hype is real! GPT4 and LLMs in general are going to change how we build solutions in the next decade. As someone who has been building and productionizing ML models for the last 5 years, I wanted to cut through the hype and understand what it is and what is possible with it. Together, we are going to look at its capabilities and the problem spaces this technology would unlock at the current rate of progress.
One thing OpenAI got major backlash for was not actually sharing anything specific in the paper. In their own words.
Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.
so much for “open” AI.
The things we know for sure are that GPT-4 is much more creative, and reliable and is able to handle much more nuanced instructions than GPT-3.5. It performs better and increases the SOTA of a number of ML metrics. This time, interestingly, GTP-4 sat for a lot of tests we humans take and gave some pretty impressive results. This sheds some light on the practical capabilities of the model.
As you can see it shows human-level performance on the majority of these exams - heck it is even better than me right now! These improvements appear to stem primarily from the pre-training process and was not significantly affected by the fine-tuning process it goes through (RLHF). GPT-4 is also multilingual, showing better performances in the same ML benchmarks in other languages. It performs well even in low-resource languages like Latvian(Latvia) and Welsh(Wales)
Another major capability is that GPT-4 now accepts visual inputs, hence giving it the capability to handle any visual and language tasks. It showed some pretty interesting results from the demos.
It also supports a wider context window, which means more working memory for the model to work with to accomplish tasks. The maximum context window is 36k tokens (~25k words).
⏰ Note: explodinggradients.com is going to release another blog going deep into the research innovations that facilitated these new capabilities, so follow the blog to get notified when that is out.
New Use cases
So the model is impressive and an improvement over the previous models. The new capabilities allow for some exciting use cases and products. OpenAI worked with a few organisations to give us a few ideas. So let's look at a few examples to get a sense of what is practically possible and where it adds the most value.
Duolingo build Duolingo Max, which is an assistant, that offers 2 new features
Explain my Answer - if your finding a lesson rather hard and getting the answers wrong you can use Explain-my-Answer to help you. You can chat with the assistant to figure out how and why you are wrong and it will nudge you to the answer.
Roleplay - this is a killer! it allows you to put what you learned into practice in similar situations that you might find when travelling. Gpt4 would be prompted with the situation and specific outcomes like talking to a barista in a cafe in Paris or asking for directions on a backcountry road. It also has mechanisms to evaluate the answers and give you feedback. Closer to real-world practice for you in real-world scenarios.
This platform gives students wider access to tools and tutorials to learn their coursework and for teachers to plan coursework. They created Khanamigo, an assistant to act as a guide for students and teachers
Right now students can use it to get help with math, writing stories, learning programming etc. It will nudge them to get to the answer themselves, without answering the questions directly. GPT 4 can act as historical figures like Cleopatra and Jane Austen. This gives students an opportunity to interact and apply what they learned about these figures to the test. It can also aid in debate topics helping students form arguments on both sides.
For teachers it offers an extra hand with time-consuming administrative work, creating lesson plans and hooks. In the future, it can also parse through the various interactions Kanamigo has with the students and identify who needs extra help so that teachers give better help to them.
Be My Eyes
Be My Eyes is a free mobile app with one main goal: to make the world more accessible for blind and low-vision people. The app connects blind and low-vision individuals with sighted volunteers and companies from all over the world through a live video call.
With GPT-4 they launched Virtual Volunteer, utilizing the new visual capabilities of the model. Users can send images via the app and ask questions about them. For example, given a picture of the inside of the refrigerator, it can identify what is in it and suggest recipes based on what is available. If the virtual assistant is not able to answer the question then it relays the request to an actual volunteer to help with the task.
The payment company stripe used GPT-4 to solve problems that fall under 3 main buckets. They are also working on a very promising long-term use case.
Providing support for questions about documentation - GPT-powered Strip Docs, directly ask the stipe docs questions and it will parse through the documentation to generate an answer, summaries or extract specific information from relevant docs directly. This lets developers spend less time reading and more time building. Fill out this form to get access
better understanding of the businesses of their customers - in order to address customer issues, agents need to quickly figure out what the customer's business is and how they use stripe. This can be a manual process of clicking through the customer's websites but GPT-4 makes this faster. It performs better than humans at cranking out these summaries and gives the agents faster turnarounds times.
Fraud detection on community platforms - helps moderators of Stripe's online forums like Discord analyze the syntax of posts in Discord and with GPT-4 flag accounts and then take a closer look. GPT-4 can help scan inbound communications, identifying coordinated activity from malicious actors.
Business Coach - this is an idea they are working on. Given that stripe has a detailed analysis of your company's revenue it can use that information to give you tailored advice.
Their wealth management team, which helps people, institutions and governments raise, manage and distribute their capital, uses GPT-4 to help retrieve and derive insights from the hundreds of thousands of pages of knowledge and insights spanning investment strategies, market research and commentary, and analyst insights. This gives Morgan Stanley’s wealth advisers the superpower to scan through this large corpus and extract specific insights to help their clients. They have 2 unique advantages though. 1st is their enormous corpus data and 2nd their huge teams of financial advisors who can leverage this tool and provide faster feedback.
GPT 4 and Large Language Models in general are a major shift in how we build products and solutions. There are 3 key factors at play
Easily Programmable - since it has a high-level idea of concepts, anyone can easily program it to carry out complex tasks. Take for example the Roleplay use case of Duolingo. You can build that today for free in chatgpt with the correct prompting. With advances in prompt engineering, the things you can do becomes wider and the truth is we have only started to figure out the extent of what these models can do. The Be My Eye’s Virtual AI is also a very good example. Historically building such a virtual assistant would be super hard, I know because I tried building something 3 years ago called Aeye. It would need multiple components and then string them together but now all of that is available with just an API call. It is also more general which means it would work on a wider range of tasks.
Intelligent Information Retrieval - like with stripe’s GPT-powered documentation or Morgan Stanley’s assistant, the ability to intelligently retrieve information is crucial. A system that can understand your request, intelligent enough to comprehend the vast amount of knowledge you have as unstructured data and give us the answers with sources while at the same time being scaleable is precious.
Scalable - this is something that is inherent to computers but now you have an easily programmable machine that is say 50% as good as humans (but likely more, GPT-4 reached the top 90% in some of the tests it took). Since it runs on computers, it can be scaled in orders of magnitude that humans simply can’t handle. Stripe’s customer understanding pipeline can parse through entire websites and provide intelligent summaries in a fraction of a second, which in turn frees up agents' workload. The same with Morgan Stanley, the breadth of institutional knowledge financial advisors can leverage now is greater than any single person can study and keep in their own head.
Before rushing out to build our next idea on top of GPT-4 let's consider some of its limitations. The model has similar limitations as its previous version ie its tendency to make up facts of its own, different biases that are part of the data on which it is trained, overconfidence in its wrong predictions and a few more. OpenAI has managed to bring significant improvements in these aspects but the issues still have to be taken into account when building on top of it.
For someone planning to build a product on top of it, there are some other things we have to keep in mind. First and foremost is the lack of moat. Everybody has access to OpenAIs APIs including your competition so there is no differentiation. The difference in performance would boil down to how effective our prompt engineering which can easily be copied. This means the core value of our product cannot be on GPT-4. So however impressive GPT-4 is, it really doesn’t give us an advantage in the market. The solution is to build/have moats in other areas. All the use cases above can easily be replicated. But it would be hard to compete with the companies because they have other moats
Something else to consider is the lack of customization. OpenAI gives GPT-4 as an API endpoint with a few parameters we can tweak with. Nothing more. Internally they are running a very general system that is aimed at catering to a wider range of use cases and customers. This means if we have a special use case, which we would like to fine-tune, changing the safety and moderation strategy to be tuned to our specific customer etc is at the moment impossible. The best solution is to prompt the model with all our customizations. Given the larger context size we can give more detailed instructions but in the end, that is going to be the bottleneck.
Also, these systems are not deterministic. Let's take the example from LangChain, a library that can be used to combine LLMs with other sources of computation or knowledge. This helps us build complex apps on top of LLMs. Using this tool we can chain together actions so that the LLMs can perform complex tasks. Let us consider an example of a prompt Who is Olivia Wilde's boyfriend? What is his current age raised to the 0.23 power? The LLM does the following to figure that out.
Q: Who is Olivia Wilde's boyfriend? What is his current age raised to the 0.23 power? ------Models Thought Pattern------ Thought: I need to find out who Olivia Wilde's boyfriend is and then calculate his age raised to the 0.23 power. Action: Google Search Action Input: "Olivia Wilde boyfriend" Observation: Jason Sudeikis Thought: I need to find out Jason Sudeikis' age Action: Google Search Action Input: "Jason Sudeikis age" Observation: 47 years Thought: I need to calculate 47 raised to the 0.23 power Action: Calculator Action Input: 47^0.23 Observation: Answer: 2.4242784855673896 Thought: I now know the final answer ------Models Thought Pattern------ A: Jason Sudeikis, Olivia Wilde's boyfriend, is 47 years old and his age raised to the 0.23 power is 2.4242784855673896.
here the model is trying to figure out the exact steps it needs to solve the question and perform those actions. Since the path is not hand-coded, if the model makes a mistake or goes down a different path in any of the steps, it could lead to very different answers.
In the coming months, it is going to be interesting to see the field progress, see the use cases and products that come out and startups that get built. Personally, I’m going to try and build a few apps leveraging it to get a feel of its capabilities. Prompt engineering is also a space that is going to grow rapidly, it is after all the way to program these models to do our tasks.
⭐ Note: Prompt Engineering is what I’m going to cover in my next blog so drop a comment if that is something you're interested in 😃