Gen AI 101 - Part 1
Ever since ChatGPT first made its debut, I’ve been reading, listening and learning about Generative AI, a.k.a. Gen AI. It’s only in the last couple of months or so that I have made some strides in understanding the mechanisms behind these models and their applications.
In this series of blog posts, you’ll find me trying to answer a number of questions about Gen AI (primarily Large Language Models, a.k.a. LLMs), and I hope it helps you get a basic understanding of this technology and maybe serves as a jumping-off point for further research. Let’s get right into it.
1. What is Generative AI?
Generative Artificial Intelligence is a set of deep learning-based technologies that are capable of generating outputs like text, audio, video, speech, images and/or a combination of all of them.
The base models that power these technologies are also known as foundation models, and Users often interact with them using “prompts”, which are text-based instructions passed as inputs to the models. ChatGPT by OpenAI is easily the most famous example of such a technology.
While prompting is arguably one of the most popular ways of interacting with Gen AI models, specifically LLMs, newer models are emerging that do not even require prompts from Users. Many such models have multi-modal input and output capabilities and can interact with each other to process information in complex ways, almost autonomously, in what are called “agentic workflows”. The race is on to integrate Gen AI into as many products and experiences as possible, and with this, “prompting” is very likely to be only one among several seamless ways of accessing Gen AI capabilities.
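To give a concrete feel for what prompting looks like programmatically (rather than through a chat window), here’s a minimal sketch using the OpenAI Python SDK. The model name and the prompt are just placeholders, and you’d need your own API key set in the environment.

```python
# pip install openai
from openai import OpenAI

# The client reads the OPENAI_API_KEY environment variable by default.
client = OpenAI()

# A "prompt" is simply text passed as input to the model.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any chat-capable model works
    messages=[
        {"role": "user", "content": "Explain what Generative AI is in two sentences."}
    ],
)

print(response.choices[0].message.content)
```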
Further Reading: Generative artificial intelligence [Wikipedia]
2. Where is Gen AI being used today?
Here’s a list of some cool use-cases and products that I have come across so far:
Virtual Assistants: They have come a long way since the early days of Siri, Alexa and Cortana. They’re much more effective at taking Users’ queries/inputs in a variety of modalities like text, audio and image/video, and at responding with detailed, seemingly well-reasoned, human-like responses.
| Product/Tool | Notes |
|---|---|
| ChatGPT | • Arguably the first general-purpose virtual assistant. • Multimodal inputs (text, images, video, audio, etc.) with text and image outputs. |
| Perplexity | • A young startup that has quickly gained popularity as an alternative to Google’s Search product. Perplexity is adding a lot of new features like shopping, content curation, etc. |
| Gemini | • Gemini is Google’s response to ChatGPT and its flagship virtual assistant, capable of multimodal input and output. • One of its main benefits is that it integrates with all of Google’s ecosystem, giving the User a whole new way to interact with their Google account. |
| Meta AI | • Similar to Google’s Gemini, Meta AI is a multimodal virtual assistant that is integrated throughout Meta’s family of apps and is powered by their Llama family of models. |
Coding Assistants: While many of the virtual assistants mentioned above can also generate code, these assistants are powered by models that have been specifically trained on vast swathes of codebases and fine-tuned to produce production-grade code, with the promise of greatly improving developer velocity and helping inexperienced coders write functional code.
| Product/Tool | Notes |
|---|---|
| GitHub Copilot | • Prompt-generated code, code completion/suggestions, codebase search, workflow setup and much more. |
| Amazon Q Developer | • Very similar to GitHub Copilot, with features like code completion/suggestions and codebase search, but it also has some differentiated features like help with migrations and legacy code upgrades. |
| v0.dev | • Use prompts to create fully functioning, usable UI components for web apps. |
Consumer Electronics: It was only a matter of time before Gen AI made its way into consumer electronics devices. But integrating Gen AI into physical devices hasn’t come without its challenges, with many of these devices facing a lot of criticism online for functionality that doesn’t fully align with what was advertised. Here are three AI virtual assistants housed within consumer hardware devices, each with a different form factor and a different way of “wearing” them.
| Product/Tool | Notes |
|---|---|
| Ray-Ban Meta Smart Glasses | • A pair of glasses with built-in cameras, speakers and a mic, capable of live streaming, recording audio/video, audio playback, phone calls and more, powered by the Meta AI virtual assistant through an audio-command/prompt interface. |
| Humane AI Pin | • One of the first hardware accessories designed to be an AI-first assistant, and one that created a lot of hype and intrigue before its launch. It is a device you “wear” on your clothes, with a camera, mic, speaker and a mini projector, that sees and hears the world around you and can be summoned with touch/voice commands. |
| Rabbit R1 | • Similar in functionality to the two devices above but arguably geared more towards replacing the smartphone, and the only one of the three with a screen. It is a handheld device designed to see, hear and respond to the world around you. |
Design: With multi-modal models becoming more powerful and pervasive, we are very likely to see a fundamental shift in the way art, design and creative content are created, both as an industry and in terms of artistic expression.
| Product/Tool | Notes |
|---|---|
| Figma AI | • Figma at its core is a user interface design tool, and Figma AI is a set of features that enhances the User’s experience inside of Figma, like better search, prompt-based text and visual artifact generation, etc. |
| Uizard | • A UI design tool that can create full UI mockups from text prompts, screenshots of other UIs and even photos of sketches. |
| Adobe Firefly | • A tool that can generate images, complete images, create custom creative fonts, videos and more, all from prompts. |
3. What are foundation models?
Foundation models are machine learning models, typically based on some type of neural network architecture, that have been trained on extremely vast amounts of unlabelled data using a set of techniques called self-supervised learning. Because they learn patterns from such vast amounts of data, foundation models can be used for a variety of general-purpose use cases like the ones mentioned above (see the sketch after the list below). Some examples of popular foundation models are:
- BERT: a model for natural language processing (NLP) use cases that was primarily trained on English language data. It was introduced by researchers at Google in 2018.
- GPT Models: a family of models released by OpenAI starting with GPT-1 in 2018, with various modalities like text input and text output (GPT-1, 2, 3 and 3.5) as well as text and image input with text output (GPT-4, GPT-4o).
- Stable Diffusion: a text-to-image generation model capable of generating images based on input prompts that was released by Stability AI in 2022.
- LLaMA: a family of language models i.e., for text-to-text use cases released by Meta AI starting in 2023.
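To make the idea of a reusable foundation model a little more concrete, here’s a minimal sketch using the Hugging Face transformers library (just one convenient way to access such models) that loads a pretrained BERT checkpoint and asks it to fill in a masked word, which is the same self-supervised objective it was pretrained on. The example sentence is an arbitrary placeholder.

```python
# pip install transformers torch
from transformers import pipeline

# Load a pretrained BERT checkpoint together with its masked-language-modeling head.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT was pretrained by predicting randomly masked tokens in unlabelled text
# (self-supervised learning), so it can fill in the blank without any task-specific labels.
predictions = fill_mask("Generative AI models are trained on [MASK] amounts of data.")

for p in predictions[:3]:
    print(f"{p['token_str']:>10}  (score: {p['score']:.3f})")
```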
Because of the vast amounts of data that foundation models have been trained on, they can also serve as the starting point for specialized models built for specific downstream use cases, which further improves performance and accuracy on those tasks.
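As a rough illustration of what deriving a specialized downstream model can look like, here’s a hedged sketch that fine-tunes the same pretrained BERT checkpoint on a small labelled sentiment dataset using Hugging Face’s Trainer. The dataset, hyperparameters and output directory are placeholders for illustration, not a recipe.

```python
# pip install transformers datasets torch
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from the pretrained foundation model and add a fresh classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A small labelled dataset for the downstream task (sentiment classification here).
train_data = load_dataset("imdb", split="train[:2000]")
train_data = train_data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

# A brief fine-tuning run; all hyperparameters here are illustrative.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-sentiment-demo",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=train_data,
)
trainer.train()
```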
Further Reading: What are foundation models? NVIDIA, AWS.
That wraps it up for this post. In the next post I’ll continue exploring foundation models and look at how they are trained and fine-tuned.