Sidecar Blog

From Hackathons to AI Mastery: Understanding Vectors and Embeddings [Sidecar Sync Episode 35]

Written by Mallory Mejias | Jun 20, 2024 4:59:43 PM

Timestamps:

0:00 Exploring Vectors in AI Technology
3:54 Understanding Vectors in AI Technology
18:05 Assigning Numeric Values to Semantics
26:48 Discovering Bias and Vector Databases
38:33 Implementing Vector Databases in AI


Summary:

In this episode of the Sidecar Sync, hosts Amith and Mallory delve into the fascinating world of vectors and embeddings in AI technology. They break down complex concepts into understandable terms, exploring how vectors represent data in multi-dimensional space and the transformative potential of AI, comparing it to the early days of the internet. Amith shares insights from a recent AI Hackathon, and they discuss practical applications for associations, such as professional networking and content personalization. Tune in to learn how these technologies can revolutionize data handling and provide personalized experiences for members.


Let us know what you think about the podcast! Drop your questions or comments in the Sidecar community.

This episode is brought to you by Sidecar's AI Learning Hub. The AI Learning Hub blends self-paced learning with live expert interaction. It's designed for the busy association or nonprofit professional.

Follow Sidecar on LinkedIn


More about Your Host:

Mallory Mejias is the Manager at Sidecar, and she's passionate about creating opportunities for association professionals to learn, grow, and better serve their members using artificial intelligence. She enjoys blending creativity and innovation to produce fresh, meaningful content for the association space. Follow Mallory on LinkedIn.

Read the Transcript

Disclaimer: This transcript was generated by artificial intelligence using Descript. It may contain errors or inaccuracies.

Amith Nagarajan: Alright, you ready?

A warm welcome back to everybody to the Sidecar Sync. We have another exciting episode here for everyone. Today we are going to dive into a very technical-sounding subject, but I will promise you this: if you stick with us for this episode, you will definitely learn something new. And it is actually not really a technical episode.

We'll talk about why that is in a few moments. Um, my name is Amith Nagarajan and I'm one of your hosts.

Mallory Mejias: And my name is Mallory Mejias. I'm one of your co-hosts, and I also run Sidecar.

Amith Nagarajan: And before we jump into the wild world of vectors, which is our topic for today, we're going to take a moment to hear from our sponsor.

Mallory Mejias: Amith, how are you doing this week?

Amith Nagarajan: I'm doing great. How are you today?

Mallory Mejias: I'm doing pretty fine myself. I know you had a hackathon at your house last week. Do you want to talk a little bit about what you all did and how fun and exciting that was?

Amith Nagarajan: Yeah, for those who tuned in to last week's [00:01:00] episode, uh, you might have noticed a little bit of audio quality problems on my end.

Uh, I do have a pretty good internet connection up at my house in the mountains. Um, but, uh, I, I hosted seven developers, uh, for about a week up in the mountains of Utah. And we had an awesome time. We were working on AI software development. So, um, for those of you that don't know, I've been a software developer myself for probably over 35 years at this point.

I can't even remember when I started, but it's been my whole life basically. Still love it very, very much. And, uh, every once in a while I'll get together with a group of developers from across our family of companies and we'll build stuff. And that's what we did last week up in Utah. And of course, when you get seven developers in a house cranking on code, they're all very much fueled by caffeine, but also by internet bandwidth.

And so when Mallory and I were recording the episode last week, I had a couple of minor issues along the way, uh, with audio quality and connections, but, uh, we had a great time. We worked on multi-agent systems frameworks, which is a lot of words that basically mean way smarter AI, getting all of our systems and technology [00:02:00] ready for GPT-5 and GPT-6, which of course are not with us yet. But we always like to be a model or two ahead in terms of what we're planning for, essentially plumbing our technology systems so that when those models, not those models specifically, by the way, but those classes of models, become available, new features will automatically emerge in our software.

So with Betty and with Skip and with Rex, those are the three main AIs we were spending time on last week. We have some really, really cool stuff that got built and planned out. Plus, we're in the mountains of Utah, so it's hard not to have a good time up there.

Mallory Mejias: Sounds like it was a blast. Do you think that it's because of the hackathon, in part, that you were inspired to talk about vectors on today's episode?

Amith Nagarajan: You know, maybe so. I spend quite a bit of time talking to people about vectors because it's, uh, it's a topic that seems really, like, out there and mathy and technical and all this, but it's such a critical concept to understand. Um, you know, with the CEO mastermind group that I run with our good friend [00:03:00] and colleague, Mary Byers, um, we meet once a month with about 30 leaders from across the association market, um, to talk about, you know, strategy plus AI.

So that's our CEO mastermind group. And in that group a couple months ago, I led a presentation all about vectors, and I did that then because I felt like even though that group is decidedly non technical for the most part, um, they really need to understand this technology. I think everyone needs to at least have a basic grasp of this technology and how they can take advantage of it.

Um, it's kind of like this, you know, if you were walking around in the 90s or early 2000s and you didn't understand what the internet was, And it had this transformative potential to really change the way the world worked. Uh, AI is like that, right? AI is even bigger than the internet. And vectors and understanding vectors is a very important part of understanding AI.

So, um, I was excited to present that a couple months ago to the CEO group. And, uh, I think our listeners are going to get a lot out of this episode if they just stick with us. So there's going to be a little [00:04:00] bit of, of math that we're going to get started with. But, uh, I promise it won't be too painful and it's going to be very, very interesting.

Mallory Mejias: Yep, I will say the topic, don't click away just yet. The topic sounds a little bit technical, a little bit intermediate. I think even if you do have an intermediate understanding of vectors, you'll walk away from this episode learning something new. But if you're a total beginner, stay with us. We definitely made this episode easy to understand.

So, first and foremost, we've got to talk about vectors: at their core, what are they? They are numerical arrays that represent data in a multi-dimensional space. To visualize this, imagine mapping the characteristics of an object, like size, shape, and color, onto a graph. Each characteristic corresponds to a dimension in the vector.

For example, a two-dimensional vector can represent an object's length and width, while a multi-dimensional vector can capture a more complex set of attributes. In data science and AI, vectors transform diverse data types, like text, images, audio, and video, [00:05:00] into numerical formats that machines can process and analyze.
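To make the idea concrete, here is a minimal Python sketch (an illustration added for this post; the objects and attribute values are invented): each object becomes an array of numbers, one number per attribute, and simple math on those arrays lets a machine compare the objects.

```python
import numpy as np

# Each attribute gets a fixed position (dimension) in the vector.
# Here: [length_cm, width_cm, roundness 0-1, redness 0-1]
apple = np.array([8.0, 7.5, 0.9, 0.8])
banana = np.array([18.0, 3.5, 0.1, 0.0])

# A two-dimensional vector is just the first two attributes:
apple_2d = apple[:2]  # length and width only

# Distance in attribute space is one simple way to compare objects:
distance = np.linalg.norm(apple - banana)
print(apple_2d, round(float(distance), 2))  # [8.  7.5] 10.83
```

Real AI embeddings work the same way in spirit, just with thousands of dimensions instead of four.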

So it sounds easy enough, I guess, but Amith, in your words,

Amith Nagarajan: I think you did a great job explaining it. I mean, you know, from my viewpoint, it's, it is just a bunch of numbers. Each vector is a bunch of numbers. But the concept essentially is: how do we create a common format? You know, a common way of representing things.

A thing being a piece of content. It might be an article from your website. That thing might be a profile of a member. Or it might be a product you offer, or it might be a session at an upcoming conference. These items, or objects, or things, are all, uh, complex. They all have many characteristics. And essentially, these characteristics, um, are these attributes, or what we call dimensions in vector space.

So it's a way of creating a common, uh, mathematical representation of literally anything that can [00:06:00] then be compared against the other vectors, the other representations that are out there. So the reason it's so important is AI models don't actually understand the words we speak and the images we see.

They have to translate it into some kind of representation the computer can understand. And vectors are the basic concept that all these AI systems work on top of.

Mallory Mejias: Mm-hmm. And so vectors are a set of numbers that can represent concepts, text, like a singular word, a singular image, or is it pieces of an image?

How does that work?

Amith Nagarajan: Well, a vector essentially. So the, the concept is that yes, vectors, they have these attributes. So you said like shape, color, size, taste, texture, smell. These are attributes we can put words to. So if I wanted to categorize something like an animal, and say, okay, the animal, is it furry or not?

Is it big or small? Is it aggressive or not? Is it a carnivore or not? You know, those are attributes [00:07:00] that I can probably assign words to in human language, in whatever human language I use. However, there's many other qualities or characteristics that could apply to anything. You know, probably very soon in the near future during this podcast, my dog's going to start barking.

So, you know, that's an attribute of his personality that I can probably put in words if I wanted to, but it's going to be hard to, you know, to figure out what that is. Like, is the attribute noisy, not noisy, barker, non-barker, you know, those things. So, um, what ends up happening is, um, there's a lot more information in our brains that we cannot represent in words, in human language.

But vectors are capable of actually encapsulating all this other information. It's almost like saying, hey, that intuition we have, that, that soft stuff that we can't put into words, that's what vectors are able to encapsulate beyond, like, known attributes.

Mallory Mejias: That makes sense. So it's kind of on one hand, we're talking about attributes that we can put words to because that's how we as humans communicate with one another.

That's how we interpret the world, [00:08:00] is through language. But right before we started recording, Amith, you said, in fact, language is kind of limiting. It's the only thing we have to express ourselves, but in fact, there's a lot more on earth and in the universe that we can't necessarily put words to, but that still exists and still has relationships to other objects.

Amith Nagarajan: Right. Um, and it's kind of like if you take the, the classical example is, is if you take the video version of this podcast and then translate it to audio only, and then translate the audio only down to just the words in the transcript, you lose information with each of those translations, right? Video to audio, you lose something you can't see us, and there might be expressions in our faces or something in our backgrounds.

That would be information, right? And then you go to audio, it's less information. You go to text only, it's less information. Uh, but there's also other, uh, shifts in modality that are interesting to think about, too. So, think about, like, um, the same text. You could speak it. You could sing it. You could turn it into more of a poetic [00:09:00] interpretation of literally the exact same pieces of text, right, where you can capture that if you go upscale it essentially into, uh, audio or certainly video, right?

So, um, that information loss, like we can express ideas in more than just words because we can speak, we can see, we can act out. Those are all part of the human experience. Um, and then our memories take into account, of course, elements of what we recall, uh, in terms of words and images and audio, but really we ultimately mash it up somehow into our memories in a way that then ultimately forms like what we probably refer to as intuition.

Uh, over time. So really what I'm referring to is, you know, with AI and with vector space, we don't exactly know how all of it works, which is part of what's interesting, but vectors are able to have literally thousands and thousands of attributes, which the AI systems use and we are incapable of labeling.

Like we, we can't go to these vectors and say, you know, at position [00:10:00] 500, that's what this particular dimension means, but we know it has some meaning.

Mallory Mejias: Is there potential for an infinite number of attributes in a vector?

Amith Nagarajan: Yes.

Mallory Mejias: Wow.

Amith Nagarajan: The models currently produce many thousands of attributes, but there could be more, and the question is, is there value to that, right, beyond thousands?

If you go to hundreds of thousands or millions of attributes, does that produce any scale advantages? I suspect that there will be, uh, because there's more and more and more subtleties to everything in the world. Um, so I suspect that as we get more computational power, and these embeddings models, which we'll talk about in a little bit, become better and better and better, and can produce even bigger vectors, there will be upside to it.

The question is, is there a point of diminishing marginal returns if you go from 5,000 numbers to 500,000 numbers to 50 million numbers in a vector? You know, theoretically, it can be whatever you want. If you have enough compute, enough storage, and so on, uh, will there be value from that? So, I think there probably will be.

[00:11:00] Um, you know, in, in model sizes, we see these so-called emergent properties when you scale from GPT-3 class models to GPT-4 class models. New capabilities of the model, quote unquote, emerge, where the designers of the model didn't necessarily know the models would be capable of these new, these new capabilities. And that's a little bit of a different thing, but there's kind of a parallel in terms of model size, with parameter counts, that we talk about a lot. Um, and vector sizes, it's, it's again a different concept, but the scaling of vectors over time, I think, potentially could bear more fruit.

Mallory Mejias: Okay, this, this last question just about vectors is more for my beginners. You might think it's a silly question, Amith, but I don't think so.

Where do vectors live? And in that sense, are they in the cloud? Where, where are these numerical representations living?

Amith Nagarajan: I think it's an excellent question. So, um, when you create a vector, which we'll talk about in a little bit, you run this specialized AI model called an embeddings model. And the embeddings model basically, uh, creates a vector from, from a given [00:12:00] piece of content, text, audio, et cetera.

Once you have that vector, you can, you can store it anywhere. It's just a set of numbers. So you can put it in a file, you could put it in a specialized vector database. So there's a lot of ways of doing this. Uh, the vector databases that are out there, you can use a tool like Pinecone, which is the one that we like a lot.

It's a very popular, uh, very, uh, cost-effective database. It's not like a traditional database, though. It's, it's specifically built to store these vectors. So, uh, in a traditional database where you have tables and columns and things like that, you could stuff a vector into one of those, into one of those tables in a field or in multiple fields.

But, um, it's not going to be very efficient, because a relational database or even other types of databases are not designed for the kind of math that you want to do at scale. So if I have a vector, let's say I have a vector for every piece of content that my association has ever created. I have another vector, a vector for each member representing everything I know about the member.

And I want to be able to compare [00:13:00] all of my member vectors against all of my content vectors, and there's a lot of reasons I might want to do that. So let's say I have, you know, let's just use a big number. Let's say I have a million members and I have a million pieces of content. So you're doing this pairwise comparison of every piece of content relative to every member. Um, if you have it in an optimized environment, a.k.a. a vector database, um, that is a nearly instant comparison.

It's measured in microseconds, and microseconds are millionths of a second, um, compared to, you know, traditional computing you measure in milliseconds, which are thousandths of a second. Um, and so these, these algorithms are so highly optimized for vectors in these specialized vector databases that you definitely want to use a vector database as the store.

And that's what we do across all of our products that I was talking about earlier. Um, you definitely need to take advantage of these vector databases. Part of what's happened is vector databases, the concept used to be kind of esoteric, harder to deal with, expensive, hard to deploy. [00:14:00] Now it's become really easy to do.

Mallory Mejias: Mm. So in your example with the one million members and one million pieces of content, if we looked at one member, that member would have a set of attributes, which would be represented by a vector, essentially by a list of numbers. And then you have a piece of content with the same various attributes represented by numbers, and then there would be a comparison between those two, like essentially how close in proximity those two are to each other, and that is how proximity works. That's how personalization would work.

Amith Nagarajan: That's how personalization would work. That's how duplicate detection might work. That's how you might be able to do professional networking recommendations by comparing people vectors against other people vectors. And the list goes on and on, especially as you have new types of entities coming into this space.

So I have two categories of vectors in this example. One is a vector for every piece of content. Another one's a vector for every member. Well, what about a vector for every session at my upcoming annual conference? What about vectors for every product in my [00:15:00] e-commerce system? Uh, what about vectors for every vendor that wants to sell products to our members, so I can do better matching of, uh, vendors at a trade show to the people that are attending the trade show, and on and on and on, right?

So if you vectorize everything then you can compare everything at scale in a really interesting way.
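As a sketch of what that everything-against-everything comparison looks like, here's a toy Python example (the vectors are random stand-ins; a production system would use a vector database rather than raw NumPy, but the core operation, a normalized dot product, is the same):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 4 member vectors and 5 content vectors, 8 dimensions each.
# Real embeddings have thousands of dimensions and far more rows.
members = rng.normal(size=(4, 8))
content = rng.normal(size=(5, 8))

# Normalize each row so a dot product equals cosine similarity.
members /= np.linalg.norm(members, axis=1, keepdims=True)
content /= np.linalg.norm(content, axis=1, keepdims=True)

# One matrix multiply compares every member against every piece of content.
scores = members @ content.T  # shape: (4 members, 5 articles)

# For each member, the index of their best-matching article:
best = scores.argmax(axis=1)
print(scores.shape, best)
```

With a million members and a million articles, this becomes the enormous pairwise workload that specialized vector databases are built to handle.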

Mallory Mejias: Okay, quite exciting. I want to move along to embeddings. So vectors can encapsulate various types of data, ranging from text and images to audio and video. AI models create embeddings, which are long vectors that encapsulate the meaning of the content. Embeddings are powerful because they capture the nuanced relationships and contexts within the data, allowing for sophisticated comparisons and analyses.

So can you, I feel like we just kind of touched on this, but just set the tone for how vectors and embeddings are related to one another.

Amith Nagarajan: Yeah, I mean, for purposes of this episode, they're really the same thing. So, a vector is a general-purpose mathematical [00:16:00] construct, where you can use vectors to represent anything.

An embeddings model is a special type of AI model that's not gotten a lot of press, but it's actually the workhorse underneath a lot of the language models people get excited about. So language models utilize embeddings to do comparisons at scale. And so the embeddings model is kind of like this separate model that, uh, basically converts a piece of content, whatever that is, into a vector.

So when people talk about, hey, I got the embeddings for this object, or I have the embeddings for this document. Uh, it's basically the exact same thing as a vector. Uh, vector is just a more generalized term, whereas the embedding is, is this process of running it through this specialized AI model.
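Here is a toy Python sketch of the interface an embeddings model presents: content goes in, a fixed-length vector comes out. The "model" below just hashes words into buckets, which is nothing like a real learned embeddings model, but the shape of the operation is the same.

```python
import zlib

import numpy as np

def toy_embed(text: str, dims: int = 16) -> np.ndarray:
    # Toy stand-in for an embeddings model: any text maps to a
    # fixed-length, unit-length vector. Real models learn semantic
    # dimensions; this only counts hashed words.
    vec = np.zeros(dims)
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dims] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

doc = toy_embed("vectors represent meaning as numbers")
query = toy_embed("numbers that represent meaning")

print(doc.shape)  # (16,) - same length no matter the input text
print(float(doc @ query) > 0)  # shared words yield positive similarity
```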

Mallory Mejias: Mm hmm.

Okay, so today's topic was mostly informed by a new chapter from Ascend, 2nd edition. Um, and it was a great chapter. I've read it a few times now. There's an example you give within that chapter, or several examples, where you talk [00:17:00] about, you know, one word plus another word minus this word will give you this.

I'm hoping you can share one or maybe a few of those examples to help contextualize what it means to assign a numerical meaning or number to semantics, if that makes sense.

Amith Nagarajan: Sure. Yeah. And I think that the semantic meaning of a piece of content, you know, you break it down, and so you try to break it down into attributes.

It only can go so far, again, using language, but we'll, we'll do our best and we'll also try to keep it really simple. So if we, you know, work in the world of, of animals for a minute and say, hey, we want to classify different kinds of animals. And we say something like, um, we take, um, a base concept like a kitten, and then we add an attribute called adultness.

And we say, this kitten has a high adultness. And maybe let's just say it's either on or off. So we took kitten and you add adult. What do you get? You know, what's the, what's the outcome of that equation?

Mallory Mejias: Kitten plus adult equals cat.

Amith Nagarajan: Exactly. And so cat actually, theoretically, could, I [00:18:00] guess, refer to kittens and adults, but generally speaking in language, people use cat to refer to an adult cat.

Similarly, puppy plus adult equals dog. Um, so that's a very simple example where you've kind of captured some semantics where you combine words and that's, you know, those are attributes essentially. Uh, and then you can keep going. You can say, okay, well, those are two dimensions. It's like, you know, um, what is the species and what is the maturity level or the adultness, right?

And of course, these variables don't necessarily have, um, single values. They're not like 0 or 1. They can have any value between 0 and 1. So, you know, you might say, oh, okay, well, um, I've got a, uh, a cat, and that cat is 9 months old. Is that a kitten still, or is it an adult? I think one year is when people say cats are adults.

I think that that's definitely true for dogs. Um, and so how adult is my cat at, at nine months? Right. So there's a little bit difference there between that and like a newborn kitten. So there are, there are subtleties to this. It's not just on or off [00:19:00] values. Uh, but then you might add another, a third attribute, like let's say, okay, but what about, you know, how domesticated?

Is this particular animal, so we take kitten, um, add adult, but what if you added wild to that? So you get kitten plus adult plus wild equals?

Mallory Mejias: Lion.

Amith Nagarajan: Could be lion, could be bobcat, could be tiger, could be something scary, right?

Mallory Mejias: Yeah, it's like I'm taking a quiz.

Amith Nagarajan: Yeah. Scariness, right? Scariness attribute. Um, and you know, it's funny because, I forget who I was talking to about this, but the other day I was just talking to somebody about cats versus dogs, and um, I was like, you know, my cat is six pounds, eight pounds, whatever it is.

And it's my daughter's cat. I'm a dog person, but my daughter really, really wanted a cat. So we succumbed to that pressure a few years ago, and the cat's great for the most part, but just watching the cat move is kind of interesting. It's, it's so efficient and [00:20:00] almost elegant in its ferocity, even though it's, like, over the course of millions of years, evolved into this relatively harmless, you know, house cat. But, um, you know, that cat being even two or three times the size, if it was 20 or 30 pounds, you know, it's a lot scarier even at seven pounds or whatever than most of the dogs I've ever interacted with, because it's just built to kill. Um, it's just an unbelievably efficient killing machine. It's like, you know, one of nature's finest examples of evolution as a carnivore. And I look at that thing and I'm like, yeah, uh, I really don't trust that cat not to eat me if I was passed out.

Mallory Mejias: Good thing it's low on the wild, the wild scale, at least.

Amith Nagarajan: Yeah, exactly. And, and at least on the size scale it's fairly small.

But if you added even like a couple of decimal points there, I'd be, I'd be worried. But that's the basic idea. You try to break down concepts into their attributes, and you do that at scale. Like, we run out of words. After 10, 20, 30 different words, we're probably kind of out. We're like, I don't know. How do you compare my cat to the next door neighbor's cat, even if [00:21:00] they're both the same breed, the same age, the same size?

They do similar things, but there's definitely differences. What are the words beyond color, beyond shape, beyond, you know, eye color, all these other attributes we can put, we run out of words after a while. And that's the power of these vectors is that in high dimensional vector space, meaning thousands of numbers, we're representing concepts we cannot put words to.

And that's really where I don't know if this is truly what a neuroscientist or a philosopher would say, but I think this is where we're kind of starting to border into what we normally classify as human intuition, where it's the things that we just know, we know that we know something, right? We have a feeling.

That this makes sense. We have a feeling that these two people should connect. We have a feeling that this upcoming session of this event is going to be a great fit for Mallory, but I don't really know why. I mean, I kind of do, there's certain things I might say, yeah, I really should tell Mallory about this upcoming session at the conference because I think she'll really enjoy it and get a lot out of it.

And I can tell you two, three, [00:22:00] four, five reasons why, but there's something else, right? There's just that feeling you have. Well, where is that feeling coming from? Well, the AI's equivalent of that is high dimensional vector space. And that's where some of these unbelievably amazing recommendations are coming from.

And recommendation systems, whether it's for commerce, for content or professional networking on a platform like a LinkedIn. Um, that's what these types of systems are using.
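The kitten-plus-adult arithmetic from earlier can be sketched with tiny hand-made vectors (the three dimensions and their values here are invented for illustration; a real embeddings model learns thousands of dimensions that nobody hand-labels):

```python
import numpy as np

# Hand-made 3-dimensional "embeddings": [cat-ness, adultness, wildness]
kitten = np.array([1.0, 0.0, 0.0])
cat    = np.array([1.0, 1.0, 0.0])
lion   = np.array([1.0, 1.0, 1.0])

adult = np.array([0.0, 1.0, 0.0])  # the "adultness" direction
wild  = np.array([0.0, 0.0, 1.0])  # the "wildness" direction

vocab = {"kitten": kitten, "cat": cat, "lion": lion}

def nearest(v):
    # The known word whose vector sits closest to v in this space.
    return min(vocab, key=lambda w: float(np.linalg.norm(vocab[w] - v)))

print(nearest(kitten + adult))         # cat
print(nearest(kitten + adult + wild))  # lion
```

And the values needn't be 0 or 1; a nine-month-old cat might sit at 0.75 on the adultness axis.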

Mallory Mejias: So for the sake of the example with the kitten plus adult equals cat example, you mentioned we're talking about these attributes as zeros and ones, on and off, yes and no's.

But in an actual vector, I'm assuming each of these attributes is a spectrum, because they're represented by numbers. So we don't have to say a kitten is a kitten based on this nine-month example or twelve months or whatever, but that each attribute has a spectrum, so it's not so black and white, is what I'm saying?

Amith Nagarajan: Yeah, it's not zeros and ones. Those are binaries, zeros and ones. These vectors are far richer than that. They're, they're very precise numbers, [00:23:00] basically, and they're represented between zero and one, but the number of digits past the decimal is incredibly high.

So these are all very precise numbers that can each represent quite a bit of meaning. So how adult is my cat at nine months? You know, is it 0.75 on that scale, versus, like, a five-week-old cat at, you know, 0.01 or something like that.

Mallory Mejias: And how does an AI assign meaning? Meaning might be the wrong word, because that's how we're understanding it, but how does it assign a number to an attribute, like, from the beginning?

Amith Nagarajan: Yeah. So, I mean, the way these, uh, embeddings models are trained is based on a large corpus of content, just like all other AI training is. So it's based on this idea of saying, hey, if we feed a lot of information to a neural network, the neural network will kind of tease out these types of attributes and knowledge and learn from it.

And that type of model is specifically a model that's designed to come up with vector embeddings that match content elements. Um, it's based on model [00:24:00] training, which is very similar to how an LLM might be trained, or a specialized machine learning model. It all works on some of the same types of principles of how you train a neural network.

And these embeddings models are designed to be very fast, very efficient. They're much, much smaller than large language models, but they form a critical component of the concepts of what makes it work.

Mallory Mejias: And then one of my last questions for this section: is this an area where, uh, bias becomes apparent and harmful? Because I'm thinking assigning a number, a numerical value, to cat-ness or adult-ness is simple enough, but I could see that being quite harmful when looking at things like gender and race and orientation.

So is this the part of AI where we're seeing bias appear?

Amith Nagarajan: There's absolutely potential. It's based on the quality of the training data. And that's an area of deep conversation that is happening in the AI field and should be. So if I have a corpus of content that I use to train a model and that corpus of content, um, says that, [00:25:00] for example, uh, certain types of people have a higher level of credibility, right?

Professional-ness might be an attribute. And let's say there's biases in the training data based on race, gender, age, whatever that may be. Um, the model is going to have those perspectives when it produces these, these embeddings. Now, um, ultimately, those embeddings don't, like, tell you something about, like, the object in question on their own. But if the embeddings in turn are used by other models, it can compound potential biases.

So it's definitely a factor. Biases are issues in all machine learning models of all kinds. Um, data bias is one of the fundamental things that data scientists have to look for. Um, but here's the thing is that all data is biased. All of us are biased. The question is, how much transparency can we add to our systems and processes to look for bias?

And one of the greatest opportunities is that since there's such a rich ecosystem of AI products and models to choose from, we could potentially [00:26:00] have models check each other's work, in a sense. So you could say, hey, I'm going to take an embeddings model, and not just the embeddings model, but, like, the full stack of these models from an OpenAI, let's say, and I'm going to also use Claude from Anthropic, or Gemini from Google, uh, or Llama 3 from Meta.

Right. And I'm going to have these models kind of form an adversarial relationship in a technical sense, where they're, they're checking each other and testing each other and looking for biases at scale. So on the one hand, AI brings out the worst of our species, where we have these deep biases. All of us do.

Um, but it also provides us a tool to discover these biases better than ever before. So, uh, but the embeddings model, you've hit on something really important, Mallory, is that if the embeddings model's training set has biases, the embeddings will reflect those biases. By themselves, that doesn't really mean anything, but as other models utilize these embeddings or vectors, uh, to make decisions or make recommendations.

Um, that can definitely be a factor. So [00:27:00] those biases might, you know, affect how I recommend content or how I personalize professional networking suggestions at my conference. So definitely something to be thoughtful about.

Mallory Mejias: Okay. So we touched a little bit on vector databases, which house, of course, vectors.

Can you help contextualize for listeners how vector databases can allow associations specifically to combine and manage all their types of data: text, image, audio, and video?

Amith Nagarajan: So the vector database serves a hyper specialized role. So it doesn't replace your relational database, which is probably what you have underneath your AMS or membership system. It doesn't replace a SaaS application like a Salesforce or a HubSpot. Um, it sits side by side with these kinds of systems. So it's an additional data store. Um, so it doesn't replace an existing database.

Vector databases do not do a good job of saying, hey, I want to search for a member by phone number, or I want to look up a product by name. They're terrible at that. Theoretically, you could [00:28:00] stuff that type of information into a vector database, because for every vector you put in, you can tag it with attributes called metadata, uh, and then you can search on the metadata, but it's not designed for that.

It's designed for vector matching, and then once you get the vectors back, you can then look at these tags to find out other information, and that's how you connect the vector database back to your other databases. But vector databases do not replace traditional SQL or relational databases. They don't replace content stores like websites or SharePoint or things like that.

They're designed to accompany those systems.
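
To make that concrete, the "tag each vector with metadata, filter on the tags, then rank by similarity" pattern Amith describes can be sketched in a few lines of Python. This is a toy in-memory store, not any particular vector database product, and the tiny three-dimensional vectors stand in for the hundreds of dimensions a real embeddings model would produce:

```python
import math

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class ToyVectorStore:
    """A minimal in-memory stand-in for a vector database."""

    def __init__(self):
        self.items = []  # list of (vector, metadata) pairs

    def add(self, vector, metadata):
        self.items.append((vector, metadata))

    def query(self, vector, top_k=3, where=None):
        # Apply the optional metadata filter first, then rank by similarity.
        candidates = [
            (vec, meta) for vec, meta in self.items
            if where is None or all(meta.get(k) == v for k, v in where.items())
        ]
        ranked = sorted(candidates, key=lambda it: cosine(vector, it[0]), reverse=True)
        return [meta for _, meta in ranked[:top_k]]

store = ToyVectorStore()
store.add([0.9, 0.1, 0.0], {"id": "article-1", "type": "article", "source": "cms"})
store.add([0.8, 0.2, 0.1], {"id": "article-2", "type": "article", "source": "cms"})
store.add([0.0, 0.1, 0.9], {"id": "member-42", "type": "member", "source": "ams"})

# Find the article nearest a query vector; the metadata tags are what
# connect the hit back to the record in the CMS or AMS.
hits = store.query([1.0, 0.0, 0.0], top_k=1, where={"type": "article"})
```

The metadata dictionary plays exactly the role described above: the store itself only ranks numbers, and the tags are the bridge back to the systems that hold the actual content.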

Mallory Mejias: And they are better at housing unstructured data in comparison to other databases that an association might have.

Amith Nagarajan: Well, vector databases actually can be kind of the intersection of all types of content in a sense. So let me give you an example. Let's say I have an AMS.

And in my AMS I've got member data, I've got committee data, I've got information about transactions, financial transactions. That's highly [00:29:00] structured data, right? Um, and then I have, let's just say, a content management system, and I have 100,000 historical articles from my association there. And that's very much unstructured data, right?

It's just text, images, maybe audio, but it's essentially unstructured data, uh, on my website in my CMS. So my CMS has all my unstructured content in this example, and my AMS has my structured content. Now, vector databases don't really care. They're just storing numbers. So, if I can convert my unstructured content into vectors, which I can, right? I can run each of those 100,000 pieces of content systematically through an embeddings model, which will produce a set of numbers, the vector, and then I can stuff that vector into the vector database, right? So I can take my unstructured data and convert it to vectors and put it in a vector database. Um, I still need my unstructured content, because the vector database doesn't actually have the content in it, it just has the vector [00:30:00] representation of that content.

Then, on the structured content side, the structured data side, my AMS, um, I can convert that structured data into a format where I can then stuff it into the embeddings model, too. So I can say, hey, here's Mallory. Here's everything I know about Mallory. Here's her professional background. Here is her email address.

Here's where she works. Uh, here are all the people that she knows. Whatever I've got, right? Here are all the articles that Mallory has published or clicked thumbs up on in our online community, or whatever posts Mallory's had. I can take the aggregate of what I know about Mallory, and I can feed that to an embeddings model, and I can get a vector back that represents Mallory.

And that's taking structured data, right, from the AMS or the online community, and converting it also into a vector. So then I have a vector for Mallory and I have a vector for a hundred thousand pieces of content. I can compare them in the vector database. But to come back to your question: is the vector unstructured or structured [00:31:00] content?

Vectors can represent everything. That's what's so powerful: they're the intersection of the world of unstructured content and the world of structured content. They're kind of like an intermediary between your structured and unstructured content. And they can also help you connect different sources of information, of both structured and unstructured content.

So let's say, for example, I've got a website. Maybe I have five websites. Maybe I also have not only an AMS, but a marketing automation system and an event system. I can get vectors out of all of those systems, or, I should probably say, not get vectors out of them, but I can get the data out of those systems, convert it to vectors, put it all into a single vector database. And then now I have this connective tissue, if you will, that links these things back.

Does that kind of make sense at the high level?
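
The round trip Amith just walked through, flattening a structured record to text, embedding it, and comparing it against embedded articles, can be sketched as follows. The `toy_embed` function here is a deliberately crude stand-in for a real embeddings model (it only counts words from a tiny fixed vocabulary, where a real model captures meaning), but the plumbing around it is the same shape:

```python
import math

# Tiny fixed vocabulary: a crude stand-in for a real embeddings model,
# which would map arbitrary text into hundreds of learned dimensions.
VOCAB = ["association", "marketing", "content", "strategy",
         "audit", "finance", "membership", "events"]

def toy_embed(text):
    """Turn text into a unit-length vector of vocabulary word counts."""
    words = text.lower().split()
    vec = [float(words.count(term)) for term in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    # Both inputs are already unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# A structured member record (think AMS) flattened into text, then embedded.
member = {"name": "Mallory", "interests": "association marketing content strategy"}
member_vec = toy_embed(" ".join(str(v) for v in member.values()))

# Unstructured articles (think CMS) embedded into the same vector space.
articles = {
    "a1": "content strategy tips for association marketing teams",
    "a2": "annual audit checklist for treasurers",
}
scores = {aid: cosine(member_vec, toy_embed(body)) for aid, body in articles.items()}
best_match = max(scores, key=scores.get)
```

The key point is that both the database record and the article end up as vectors in one space, so "which content is closest to this member" becomes a simple numeric comparison.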

Mallory Mejias: It does. I think this is finally starting to click even for me. So you can take your unstructured data and your structured data, run those through an embeddings model, and then get vector [00:32:00] representations for basically every piece of data that you have.

I think where we really need to spend the end of this podcast is talking about the why. So like, let's say an association did exactly that. They have all these vector representations of every piece of data they own. Then what?

Amith Nagarajan: Well, then, then it comes to, so this is the enabling technology. So vectors as part of AI, it's a general purpose tech.

It's an enabling technology. It's not an application. So to your point, it's like, so what? Who cares? Why should I pay attention? So if you've been with us this far and you kind of understand the concept, the next thing is exactly what Mallory said: how can you apply it? So, you know, the idea of a recommendations engine or personalization engine, this concept has been really exciting yet elusive for associations for a long time.

Uh, you know, going back to when Netflix first went from the mail order business, where they had those red [00:33:00] envelopes showing up with DVDs at your house, to online streaming. You know, they actually had a form of personalization, a very rudimentary machine learning approach, even back in the days of the mail order business, where they'd recommend on their website which DVDs you should put in your queue.

Uh, and that was based on crowdsourced data. And then the online version was their first version of streaming. But back then, only companies of the capital and tech caliber of a Netflix or an Amazon could do this kind of stuff. And when someone like Netflix or Amazon did it, and they invested tens or even hundreds of millions of dollars and years of time with world class data scientists to just do basic stuff, they ended up with a very narrow model, meaning they could just recommend DVDs to people.

That's the comparison they could make, right? Highly specialized, ridiculously expensive, and basically inaccessible technology. So the concept was even starting to formulate back then, and well before that too, but the practicality of it was very limited, right, back then, ten plus years ago. [00:34:00] The thing that's happened is we've had these six month doublings of AI continually going, along with Moore's Law, compounding and compounding the power, lowering the cost, and improving the capabilities.

So now we have general purpose capabilities in these embeddings models and vectors. So we can now say, look, we can do recommendations from anything to anything. We can compare and contrast any entity or any object, whether it's content or database information, against each other.

So that's where the applications come into play. Remember that it might be technical sounding, but there are ways of doing this, even for a decidedly non-technical association. You might need a little bit of help, but it's not millions of dollars. It's probably not even hundreds of thousands of dollars to do this now.

It's accessible to a lot of associations today, and it'll just be getting easier and cheaper. So coming back to your question, um, some applications that I think are really exciting. My favorite one is probably professional networking. Um, so how can [00:35:00] we do a better job of creating, fostering, nurturing meaningful connections within our community, right?

How can we enrich the lives of our members by connecting them with each other in a deeply meaningful way? And, you know, we've talked about this in the past, this idea of, like, look in your life: have you ever been connected to someone that's made a big difference, professionally or personally or both?

And most people have stories to tell where they say, yeah, you know, a mentor of mine connected me with so and so, or I was connected personally with a friend or a future spouse, or whatever that situation may have been. Um, can we do that at scale as an association? Associations have always been in the business of professional networking, but it's always been a combination of just kind of brute force and luck, you know, where you just put a lot of people in the room that have similar interests and you hopefully get a good mix out of it.

And, you know, the extroverted types oftentimes do better in those settings, the introverted types less so. But, uh, how can we replicate that, that [00:36:00] intuition, uh, at scale? I think professional networking is, you know, one of the highest priorities I see for leveraging this technology. So to me, that's application number one.

Obviously, content personalization is another one where you can feed better content to people with lower friction. And there's many others.

Mallory Mejias: I'm wondering, is there any sort of reinforcement feedback loop within your vector database? Because I'm assuming the vector database is only as effective as the embeddings model.

So what happens in the event that two vectors are close in proximity to one another? Maybe it's a piece of content and a member, and so you recommend that piece of content to that member, but maybe they don't love it. Maybe it actually wasn't a good fit. Is there any way to feed that back in?

Amith Nagarajan: So on top of the AI's, uh, kind of intuition of similarity comparison, you have to layer in some additional intelligence into your systems.

What you're referring to is, like, this feedback loop of reinforcement learning or other techniques where you can say, hey, let's take a thumbs up, thumbs down type of feedback, whether it's, you know, actually a thumbs [00:37:00] up, thumbs down in an app or something else, and let's store that data. And then that way, when we train the next iteration of our models on top of the embeddings, we can do a better job of feeding content, because the decision to recommend or not recommend is based on the embeddings and the proximity in vector space we've been talking about this whole episode.

But there's more to it than that. If you just simply say, Hey, I'm going to take the closest vector to Mallory to recommend to her. Um, but I don't recognize that it's already someone she works with and talks to all the time. You know, you and I do a podcast together every week. We talk several additional times per week typically.

So why would, why would that recommendation make any sense? We already know each other, obviously, so it would be kind of a useless recommendation. So you have to layer in other, what we'd call deterministic, right, concepts on top of the AI. So the AI might not know that you and I work together, but our database system should know that, or might know that.

Or maybe if the database systems didn't know that, but you gave me a thumbs down, um, then we know that, and we can [00:38:00] infer from that something else: oh, actually, they're too close in vector space. They're too similar in some ways, right? So, no, there's definitely more to do than just simply taking the closest possible vectors and just throwing them out there and seeing what happens. Um, there's more work involved in that, and that's obviously beyond the scope of what we can talk about here. But, uh, implementing these ideas requires quite a bit of planning. Um, the reason I think this is such an exciting thing to think about is not because it's a press a button and done kind of application. There's more implementation work.

There's definitely planning. There's some dollars involved in getting it right as well. But, um, it is very approachable compared to what it was even five years ago. Five years ago, this would not have been a conversation really for any association. This was still in the realm of like the Fortune 100. And now it's available, and there's still work involved.
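
The layering Amith describes, vector similarity first and deterministic business rules on top, might look something like this sketch. The names, the two-dimensional vectors, and the rule set are purely illustrative:

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def recommend_connections(member_id, vectors, already_connected, thumbs_down, top_k=2):
    """Rank other members by vector similarity, then apply deterministic
    rules: skip people the member already knows and anyone the member
    previously gave a thumbs-down."""
    me = vectors[member_id]
    candidates = [
        mid for mid in vectors
        if mid != member_id
        and mid not in already_connected
        and mid not in thumbs_down
    ]
    candidates.sort(key=lambda mid: cosine(me, vectors[mid]), reverse=True)
    return candidates[:top_k]

vectors = {
    "mallory": [0.9, 0.1],
    "amith":   [0.9, 0.1],   # nearly identical vector, but they already know each other
    "jordan":  [0.7, 0.3],
    "sam":     [0.1, 0.9],
}
recs = recommend_connections("mallory", vectors,
                             already_connected={"amith"}, thumbs_down=set())
```

Even though "amith" is the closest vector to "mallory", the deterministic layer filters him out, which is exactly the "useless recommendation" problem described above.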

Mallory Mejias: Mm hmm. So we talked about some use cases, the really exciting ones, like professional networking and content personalization. I do know as well, in the chapter that you wrote in Ascend, we talked about [00:39:00] duplicate detection and eliminating inefficiencies in member records, for example. Can you talk a little bit about that?

Maybe it's a less glamorous use case, but it might be exciting for some listeners.

Amith Nagarajan: I think it's a fantastic use. So if I'm able to create vectors for, let's say, all of my member records, I should be able to use those to find similar member records. And then from there, I might be able to identify on a nearly automated basis the duplicates.

So, um, that has been a pain in the side of every association since the beginning of time. And not just associations, but every company on Earth has duplicate data problems. So, data quality problems in general, but duplicate data being kind of one of the core issues people struggle with mightily. And, um, the problem is, how do you identify these things without a person looking at these two records side by side, or these five records side by side?

So, you know, if I have, um, two people that are kind of similar, um, but they're a little bit different, the traditional database programs we've had don't pick up on that. So they might look for absolute matches on first and last [00:40:00] name and title and employer and phone number and email. When those kinds of things match, we say these are absolute duplicates. And they might have some so-called fuzzy logic to say, okay, well, let's look for the first five characters of the first name matching and an absolute match on the last name.

People come up with all these ideas for, like, how to do potential duplicate logic. That's been around since the 60s, you know, in terms of duplicate data management. Um, but it's still really weak, you know, it gets thrown off super easily. And so what ends up happening is associations and other orgs kind of give up.

Um, they, if they notice a duplicate, they might, if their system allows, try to merge it. And by the way, a lot of database systems are not good at handling that scenario where they're able to merge duplicate data together. That's another problem. But just identifying the dupes is something that vectors can be really good at because, again, since they encapsulate potentially thousands of attributes of each of the records, they're able to look beyond just the data that, you know, a traditional database might look at.

Um, so an example would be, like, if I would go to your typical member services [00:41:00] associate in an association and say, hey, take a look at these two records for Mallory. Is this the same Mallory? Are they different Mallorys? Um, a typical, you know, kind of even entry level membership associate type person would be able to figure that out with a reasonably high degree of accuracy. Because, you know, they'd probably look at it and say, well, um, maybe they're the same person, but this one lives in Atlanta and this other one is in New Orleans, so maybe not. But then if they just look a little bit further and say, oh wait, hold on, both of them, um, say that they're the manager of Sidecar, maybe they are the same person. And Sidecar isn't, you know, Coca-Cola. It's not General Motors. It's a smaller company. So therefore, probably it's the same person.

Uh, so it's, like, partly a little bit of knowledge encapsulated there, but partly also it's intuition, and it's knowing to look at the employment history or looking at the educational background.

So it's unlikely that the two Mallorys have the same degree from the same university but are different people, right? And we have a lot of this data in association databases, or we can [00:42:00] buy it from third parties, right? We can enrich our data, which makes it much more likely we'll find these dupes. Uh, but it's a fantastic use case.

Again, it's going to require some thought and some work to integrate this technology into your system. You can utilize something like a common data platform, um, you know, like the MemberJunction open source CDP we've talked about a lot, um, or you can try to build this on top of an AMS if you want. I wouldn't probably recommend the second one, because AMSs move kind of slowly in terms of their evolution, and they're a little bit harder to work with usually. But doing this with your data once you have it in a CDP becomes much easier.

Mallory Mejias: Well, you pretty much, as you were talking, I started thinking of the CDP, and I was like, well, wait, we can deduplicate records out of the Common Data Platform as well.

And then I was realizing I don't quite have the grasp on: do these sit side by side, the CDP and a vector database? Do you take the data in the CDP and convert it into vectors? Is that more of the process? Yes.

Amith Nagarajan: Well, that's certainly, that's what MemberJunction does specifically. Not all CDPs do the same [00:43:00] things.

But, um, in the case of MemberJunction, it's built as an AI native CDP from day one. So, um, MemberJunction specifically has a facility within it where you literally just click a few buttons in the admin console and you can turn vectorization on or off for any part of the database. So you can say, hey, I have all my member data flowing in from my AMS directly into the MemberJunction CDP, and I can set up that particular area of the CDP to be what we call auto vectorized.

And with auto vectorization, the software is already there to automatically take the structured data, convert it to what we call a synthetic document, because it needs to be converted into a format that an embeddings model can understand, feed it to an embeddings model, get the vector back, and then put that vector into a vector database. And that's all done automatically through the open source, freely available MemberJunction software.

We're pretty excited about that part because, as we like to say when we talk about MJ, you [00:44:00] can auto vectorize the world, right? Um, and you can easily vectorize all sorts of other content, and then you can bring them together through what we just talked about: the vector database essentially being the connective tissue between the unstructured and the structured world.
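
The "synthetic document" step can be pictured as nothing more than flattening a database record into labeled text before it is handed to an embeddings model. This is an illustrative sketch of the general idea, not MemberJunction's actual implementation:

```python
def to_synthetic_document(record):
    """Flatten a structured database record into readable text that an
    embeddings model can consume. Field names are kept so the model
    sees the context of each value."""
    lines = []
    for field, value in record.items():
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        lines.append(f"{field.replace('_', ' ').title()}: {value}")
    return "\n".join(lines)

member = {
    "name": "Mallory Mejias",
    "employer": "Sidecar",
    "city": "Atlanta",
    "interests": ["AI education", "content strategy"],
}
doc = to_synthetic_document(member)
# `doc` is the text that would be fed to an embeddings model; the
# resulting vector then goes into the vector database, tagged with
# the member's ID as metadata.
```

From there the flow is exactly as described: document in, vector out, vector stored.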

Mallory Mejias: That makes perfect sense. So, Amith, for people who are still with us, who are not scared anymore by the idea of vectors, vector databases, and embeddings, what advice do you have for them in implementing, um, something like this?

Amith Nagarajan: Zero people have left this podcast midstream, Mallory, I'm sure. In fact, they've been so excited by what they've been hearing us talk about that they've brought other people who are nearby to listen in with them as the podcast has gone along.

Um, I think what people need to do as a next step is just dig slightly, uh, slightly deeper than this. If you're interested in it, uh, you can only get so much from one source and from one format. So maybe do a YouTube video search on vector databases and we'll post a couple links to some really good tutorials.

Uh, maybe, you know, read the upcoming edition of Ascend. We plan to have it out, uh, before August 1st. It has a whole chapter on it. We'll also be posting more articles to the Sidecar website, but learn a little bit more. Um, reach out on the Sidecar community, which is just community.sidecarglobal.com. Have conversations.

Uh, we have a lot of our AI experts from across the Blue Cypress family, as well as many community members who are deep in this stuff, involved, and can answer questions there. But do something with it. Just play with it. Um, learn more, and then think about one very small use case where you can run an experiment, right?

That's what I always try to reduce these things down to: the concept is powerful, but it's only powerful if you do something with it to really learn it. Um, you know, I understood kind of the math and the theory behind this for quite a while, but it wasn't until really probably the last 18 months, when I actually dug into it and worked with examples, that I really understood the potential for vectors in this space, and why I've been out there really advocating for everyone to learn a little bit about it.

So you have to get in there and [00:46:00] experiment a bit. Um, so, uh, that's what I'd recommend is, is get started with a little bit more education and then pick one simple, easy use case, uh, and go play with it.

Mallory Mejias: I think that's great advice. I hope you all have enjoyed this episode. I certainly did. I had the opportunity to attend the CEO mastermind, uh, session that Amith mentioned earlier in this episode.

And I've heard about this stuff a few times, but I will say this episode was a really big click moment for me. So thank you, Amith. I hope it was the same for all you listeners, and we will see you next week.