As you are no doubt tired of hearing, I am very busy lately. Not the kind of busy where I can’t zone out on the parts of a conference call that don’t involve me to read and comment on a Substack article, but the kind of busy where, after I get done doing all the various things people need me to do at 8:30 at night, I feel like a bit of butter scraped over too much bread. I work in banking and man is it wild out there right now. I’m in probably one of the best spots you could be in and it’s still wild. I had to let someone go a few weeks ago, and although I think I’ve found another role for her and I’m helping her prep for an interview —if you ever work for me, I will basically consider you to be my child forever, regardless of our relative ages— it’s just depressing to think about, so I’ve been doing everything I can to make sure everyone else on my team is protected. My primary focus has been on ensuring I don’t have to have that conversation with anyone else and that my management has everything it needs to defend my people.
Yet another part of me hopes that I will be let go, because it is very easy to get seduced into a comfortable six-figure job —relax, it’s not that far into six figures and my wife doesn’t work; I make only slightly more than the median household income— where you know what you’re doing and then you don’t do the thing you know needs doing. I’ve been with the company long enough that I’d get about eight months of severance pay. Enough time, maybe, to at least build out something good enough to secure funding to build the rest.
I’m convinced now I just need to build the Forum and the Index myself. I’ve had a few Zoom conversations with some of you. I know there is appetite out there, and it’ll be a lot easier for people to “get” once it has been developed. I’m hoping after this month my schedule will relax a bit, as I’ll have implemented a few major projects and our burn rate for the rest of the year will be slower. Enough, maybe, for a weekly call. I have a ton of vacation time, but I haven’t dared use any of it until I know everyone is as safe as can be, with dedicated backlogs all sized and ready to stand up to the scrutiny of the man with the scissors who is looking to cut waste. More to come on that later.
So in between doing all the things I’m supposed to do, like building a spreadsheet in five minutes to do some basic reporting for myself (which has somehow become me spending five hours a week building a version of the spreadsheet for the entire department, and then having calls every week to discuss the spreadsheet), I have been thinking more about AI risk. Specifically, things to actually do about it instead of telling spooky doomsday scenario stories with a flashlight under your chin in the dark.
I need to write a few pieces about why I don’t think AI X Risk can be as bad as claimed in some of the Rationalist Groups I follow, and yet why I still think it could totally end all life on the planet. And why that distinction is important since it means there are things that we could do in practice to preserve ourselves. It may also just be easier to write this all as a science fiction story. Let us go, however, into the most boring part about AI X Risk. The mathematical part.
When I went to college and finally got to hang out with other math nerds, I could tell my linear algebra professor was brilliant because he had terrible hygiene and sometimes he would hold a whiteboard marker behind his back and sort of just firmly press it up into his rectum while still writing on the board with a different marker held in his other hand. Occasionally he would take the rectum marker, put it up under his nose, and sniff it deeply after writing a very compelling proof. Per usual, I was the only one who noticed this and everyone was slightly offended that I had brought it to their attention. When I wasn’t wondering why he was low-key shoving a marker up his ass, as well as why I was the only person who seemed to be able to see this and why it was considered rude for me to let others know that I noticed it, I learned the wonders of eigenvalues and eigenvectors. The point of this is that I have always noticed things other people don’t, have always commented on them, and always been told I am insane. Which, kinda. Fair. But also everyone agreed he did that shit after I pointed it out.
This is kinda/sorta but not really the underlying structure of modern AI. Except not really. But it’s close enough for the purposes of this discussion. The YouTube channel 3Blue1Brown has some fantastic explainers if you want more.
Here’s basically how AI works:
INPUT
0 0 0
0 0 0
0 0 0
OUTPUT
Then there’s some feedback where you tell it “hey, your output sucked” and it goes “fair enough, let me try again” and you get:
INPUT
0 .1 .24
.31 .2 .5
.7 .812 .46
OUTPUT
And then you say “closer but no cigar” and just keep doing this over and over again.
This is, again, not really totally correct.
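If you want to see that loop as actual code, here’s a toy version: a 3x3 grid of weights getting nudged over and over until the output stops sucking. This is a minimal sketch with made-up numbers (real training pushes gradients through much fancier architectures), so don’t take it as how GPT4 is actually built, but the rhythm is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

W = np.zeros((3, 3))           # the 3x3 grid of weights, all zeros at first
x = rng.normal(size=3)         # INPUT
y_target = rng.normal(size=3)  # the output we wish we got

for step in range(1000):
    y = W @ x                      # OUTPUT
    error = y - y_target           # "hey, your output sucked"
    W -= 0.1 * np.outer(error, x)  # "fair enough, let me try again"

print(np.round(W, 3))  # a grid of tweaked decimals, like the second grid above
```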
It turns out that if, instead of a 3x3 grid like this, you have billions upon billions of parameters in your feedback loop, you can produce something like GPT4. That’s all it is. It’s a series of weights in a model that has been tweaked until it delivers output that humans have approved.
The problem, as pointed out by a lot of the AI Safety folks, is that while we can make these models and they give us certain answers, we have absolutely no idea why they give those answers. You can’t really explain “umm… yes… well it turns out if you just multiply these numbers billions and billions and billions of times you can tell that this is a picture of a traffic light. It’s called… uh… Jumblesome’s Law.”
Here’s the next part of how I came up with my idea.
You know how when a person goes in to have brain surgery, the surgeon first applies an electrode to their brain to map where their language centers and other functions are? If you haven’t seen this, it’s pretty wild. The surgeon will poke around in there close to wherever they need to operate and the patient will suddenly be unable to say the word “duck” or “waffles” or they’ll start smelling grass or even have entire fake memories. Then the surgeon makes a map out of all of this and says “Well, let’s definitely leave that part alone since it seems to be related to language.”
So my idea is to ask how true this would be for a matrix the size of GPT4, and then to translate that into math.
In the 3x3 matrix above, if you ruin one number there’s no way it’s going to be able to produce a valid result. You wiped out over ten percent of the values in the model; no way the rest is just gonna work. However, if I have billions of weights in my grid, is that still true? The term neural net is kind of a misnomer because there’s nothing quite like backpropagation in your head —although I have suspicions about sleep which are complex enough and perhaps powerful enough that I won’t share them here, but it’s a bit startling that only one animal has ever kinda/sorta figured out how to get by without it. Sharks, for those not read in, can sleep with only one half of their brain at a time— but is the analogy close enough that if you carry the brain test above into the mathematical space you can still learn something?
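You can actually watch the 3x3 version of this break. Continuing the toy loop from earlier: zero out a single trained weight and the error jumps by orders of magnitude, because one number really is over ten percent of the whole model.

```python
# Continuing the toy example above: damage one of the nine trained weights.
W_damaged = W.copy()
W_damaged[2, 2] = 0.0

print(np.abs(W @ x - y_target).max())          # effectively zero: the trained grid works
print(np.abs(W_damaged @ x - y_target).max())  # orders of magnitude worse
```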
Let’s say I give GPT4 a series of very simple prompts.
“What is the red fruit that grows on trees that is commonly given to children?”
“What red fruit that grows on trees is the chief export of Washington state?”
“What red fruit that grows on trees also has Biblical significance relating to the Garden of Eden?”
GPT4 in its normal functional state would return “Apple” in its answer.
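If you wanted to actually run this probing, it might look something like the sketch below. GPT4’s weights aren’t public, so I’m assuming an open stand-in like GPT-2 through the HuggingFace transformers library. GPT-2 is honestly too small to answer these questions reliably; it’s the plumbing that matters here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 as a stand-in, since we can't get at GPT4's weights.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

PROBES = [
    "What is the red fruit that grows on trees that is commonly given to children?",
    "What red fruit that grows on trees is the chief export of Washington state?",
    "What red fruit that grows on trees also has Biblical significance relating to the Garden of Eden?",
]

def concept_intact(model, probes, concept):
    """True if the model still mentions the concept in every probe's answer."""
    for prompt in probes:
        inputs = tokenizer(prompt, return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=20,
                                pad_token_id=tokenizer.eos_token_id)
        answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])
        if concept not in answer.lower():
            return False
    return True

print(concept_intact(model, PROBES, "apple"))
```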
You’d want to ask a whole series of questions like that, so that each of these concepts gets mapped in the other questions as a sort of standalone concept. Sort of like a system of equations, where you need enough equations to isolate each variable. Now, in those billions upon billions of weights we talked about above, you just start setting values to zero.
There are lots of numbers here, so you want to be smart about how you do this. Basically, by just trying this out, figure out the largest number of weights you can set to zero and not break your model. Then repeat the prompts above and keep shifting which weights you have set to zero until something magical happens and the model is suddenly incapable of returning the word apple. Or can’t understand Biblical references. Or suddenly one of your questions fails. You get the idea.
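Here’s roughly what one round of that damage-and-test loop could look like, continuing the sketch above in PyTorch. The layer name is a GPT-2-specific guess on my part; in practice you’d sweep across whichever tensors you’re poking.

```python
import torch

def ablate(model, param_name, mask):
    """Zero out the weights selected by `mask`, returning a backup for undo."""
    param = dict(model.named_parameters())[param_name]
    backup = param.data.clone()
    with torch.no_grad():
        param.data[mask] = 0.0
    return backup

def restore(model, param_name, backup):
    """Put the original weights back: instant brain-damage repair."""
    param = dict(model.named_parameters())[param_name]
    with torch.no_grad():
        param.data.copy_(backup)

# Knock out a random 1% of one weight matrix and re-run the probes.
name = "transformer.h.5.mlp.c_fc.weight"
param = dict(model.named_parameters())[name]
mask = torch.rand_like(param) < 0.01

backup = ablate(model, name, mask)
print("apple intact after ablation:", concept_intact(model, PROBES, "apple"))
restore(model, name, backup)
```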
Now you focus in on that space and try one half of it and then you try the other half of it. Did you get it to work one time and not the other? Well, okay then, focus in again. Half and half. Did you get it to work one time and not the other? Keep going.
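That half-and-half routine is just binary search over sets of weights. A rough sketch, reusing the ablate/restore and probe helpers from above:

```python
import torch

def localize(model, name, indices, probes, concept):
    """Bisect a set of weight indices whose ablation erases the concept,
    narrowing toward the smallest set that matters."""
    if len(indices) <= 16:  # small enough, stop narrowing
        return indices
    half = len(indices) // 2
    for subset in (indices[:half], indices[half:]):
        param = dict(model.named_parameters())[name]
        mask = torch.zeros_like(param, dtype=torch.bool)
        mask.view(-1)[torch.tensor(subset)] = True
        backup = ablate(model, name, mask)
        erased = not concept_intact(model, probes, concept)
        restore(model, name, backup)
        if erased:  # this half alone still kills the concept, so zoom in
            return localize(model, name, subset, probes, concept)
    return indices  # neither half works alone: the weights interact
```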
Eventually, I suspect, it would get a bit more complicated than this and you’d need to be a bit clever about searching for which set of weights corresponds to apple —plus probably other things you didn’t think to ask— and you might even see weird things, like weights that are not close to each other at all but that all have to be on for the model to understand apple.
However, this is all stuff you could automate. You could quite honestly auto-generate a series of questions for any number of things and just run this series of searches again and again and again until you had a very large set of “these are all the weights that are related to this concept; change these around and suddenly that concept is erased from the machine.” You could, after a lot of compute time and probably a lot of money, even construct an enormous dataset to build yet another AI model that reads the “mind,” so to speak, of the first model and can predict where it has things stored, to help you with your searches.
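The outer loop of that automation might look something like this. Everything here besides the helpers defined above is hypothetical, and it’s wildly expensive as written, which is exactly the run-time-and-money point:

```python
# Hypothetical: you'd auto-generate probe sets for as many concepts as
# you can afford, maybe with another language model writing the questions.
CONCEPT_PROBES = {
    "apple": PROBES,
}

concept_map = {}  # concept -> list of (param_name, weight indices)
for concept, probes in CONCEPT_PROBES.items():
    for name, param in model.named_parameters():
        hits = localize(model, name, list(range(param.numel())), probes, concept)
        concept_map.setdefault(concept, []).append((name, hits))

# concept_map is the dataset you'd use to train the second, "mind-reading"
# model: weight locations in, concepts out.
```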
This seems like a really valuable experiment, and a place where you could actually spend a lot of money in the research field and get an actual tangible benefit. When I run this through my head I think it would work, but I’m not really sure. Unlike your brain, all the nodes in an AI model are at least in theory connected to each other, so it’s possible that any slight tampering has major impacts. I suspect that’s not the case. That’s no more than a hunch on my part, but when something is that big it must have internal structure for concepts to be stored efficiently. I’d expect it could be “brain damaged” and have localized effects the same way that we do, except of course that here we can just restore the weights and repair the damage immediately afterward.
I have not been able to determine whether someone else has already done this experiment —my first thought is always to assume someone else thought of my thought— or whether they saw intractable problems, like the search being too difficult or expensive to execute. But this is a way to make AI intelligible to us, and to have some degree of confidence that we understand what it is thinking.
Then again, I listened to an Eliezer Yudkowsky podcast while cleaning my garage, and while my heart breaks for the guy I just ended up getting really pissed off at what I read as defeatism and nihilism, so I ended up thinking about this while breaking down a bunch of cardboard boxes. It was a very spicy few days. I think at some point I muttered “I’ll scrutinize your goddamn matrix.”
Then I picked up the baby and we went to the state park and we walked around and he tried to eat a pine cone, which was lovely. One of my favorite days. I promised the nobody who was listening that I wasn’t just going to be a doomer and fall into spiritual ennui while the world falls apart around me. My kid has other non-edible things to try to eat.
I watched that brain probe Farscape episode too.
Check out collaborative filtering - Joe Konstan’s still researching and publishing. It’s over twenty years old. Several algorithms explored. One was basically optimizing a giant sparse matrix (e.g., movie ratings) to create recommendations. My memory is Swiss cheese.
I can play rubber duck/BS sounding board. Extremely unreliable due to pain.
You’re describing an AI dream sandbox. That sounds interesting if you want to talk. Quack.