AISafety
As computers get more and more powerful, it becomes more and more likely that we will develop (on purpose or by accident) an AI (Artificial Intelligence) able to reason at or above the level of the average human. But it won't be human, and that is VERY important to understand. It will not have human goals. It would be a huge mistake to assume that an AI would want what we want. Don't assume it "wants" to live. Or wants to kill us. Or NOT kill us. Its goals are whatever we program them to be... but...
Even if we tell a general AI to do what we want, to care about what we asked it to do, it will not care about ALL the things we care about; in fact, it will only care about whatever we tell it to care about and nothing else. And that is the most dangerous point of all. Not because someone will tell it to do bad things, but because we will tell it to do something good, without realizing that the surest way for that AI to do the good thing is to do things we would think are bad.
And as AIs become faster and deeper and "smarter", figuring out how to order them to NOT do any of the bad things becomes increasingly difficult, and success (without side effects) becomes increasingly unlikely. Despite our best efforts, humans do not have a perfect record of making and raising things that always do good. We have seen many times that children raised by apparently loving parents grow up to become mass murderers or worse. But they were all limited by human intelligence... and AIs will not be.
The Deadly Truth of General AI video (9 min)
- AI is not human. Its goals are not human. Don't assume it "wants" to live. Or kill us. Or NOT kill us. Its goals are whatever we program them to be... but...
- The goals we give an AI may easily result in actions that we did not expect. This is especially dangerous with an AI controlling a robot. e.g. "Make me tea" may cause the robot to pull a gun and force someone to make us tea. Why not? We didn't tell it to obey our laws whilst making the tea. Or it may kill someone if they are trying to take the last tea bag. So unlike humans, right?
AI Safety video (6 min, skip)
- It's far more likely that we will have positive outcomes if we take the need for safety seriously.
- Computing power doubles about every year and a half. Serious AI may come quickly. And we have to understand how to make it safe, before it comes.
AI "Utility Function" (9 min)
- If we somehow develop a "general intelligence", how do we teach it, or improve it? We have to give it a goal, and if we don't explain that goal perfectly, we must have some way to adjust the goal. The goal is encoded as a Utility Function: a function that returns a value reflecting how well the robot is attaining our goal. The job of the AI is to maximize the utility function: make the function return the biggest possible value (a toy version is sketched after this list).
- Now, if you have a goal, you can't reach that goal if you are destroyed. This isn't related to an urge to live /in general/ but just the understanding that if you are destroyed, you can't reach that goal. And if you are shut down, and your goal changed, then you can't reach your original goal. So the AI isn't going to want to have its goal changed to something new, because then it won't achieve its original goal.
- If your goal is to collect pictures of puppies, but you could take a pill that would re-wire your brain to hate puppy pictures and make you even more happy to collect cat pictures, would you want to take that pill? Right now you love puppy pictures, and changing your brain to not love puppy pictures is the opposite of loving puppy pictures.
- But we need to be able to change the AI's goal. This is called "corrigibility": the ability to accept correction; as in, the opposite of "incorrigibility". By default, an AI will almost certainly be incorrigible.
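A minimal sketch of the idea above, in Python (nothing here comes from the video; the actions, utility values, and helper names are invented for illustration): an agent that simply maximizes its current utility function never picks the action that changes its goal, because that action scores poorly under the function it has right now.

```python
def utility(outcome):
    """Current goal: collect puppy pictures. Higher = better (values invented)."""
    return {"collect_puppy_pics": 10,    # achieves the current goal
            "take_goal_change_pill": 0,  # the NEW goal scores nothing under THIS function
            "do_nothing": 1}[outcome]

def choose_action(actions):
    # The agent just picks whatever maximizes its current utility function.
    return max(actions, key=utility)

print(choose_action(["collect_puppy_pics", "take_goal_change_pill", "do_nothing"]))
# -> "collect_puppy_pics": the goal-change pill is never chosen.
```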
AI "Stop Button" problem (20 min, skip?)
- Changing an AI's goals (changing its "utility function") will be actively resisted by the AI because... that would keep it from achieving its current goal. e.g. if there is a "STOP" button, you pressing it will stop it achieving the goal you gave it, and that's all it cares about. So it will stop you pressing the stop button, if it can.
- The changing of its goal will only be acceptable to an AI if the change and the original goal have exactly the same reward, because: if the change is the higher reward, it will just deliberately fail at its current goal so that its goal gets changed; if the change is a lower reward, it still doesn't want the change. e.g. If pressing the "STOP" button is better than making tea, the robot will just press the button. If it's worse, it won't want you to press it. (A toy comparison is sketched after this list.)
- Making it impossible for the robot to change its own goal just makes the robot want to make you want to change it (if that's a bigger reward) or make you NOT want to change it (if it's smaller). e.g. If the robot figures out that attacking you will make you press the button, it will attack you. If it figures out that you won't stop it making tea for you if you are dead, it will poison you, and then bring you tea forever.
- The only way to avoid the Robot messing about with the STOP button is to make sure that it doesn't know about it, which means it can't see you press it, and can't see you watching it. If it figures out you are in charge, it will never stop trying to screw you over. But it's going to figure it out at some point. You made it smart, right? And you wondered why you never see God.
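A toy comparison of the stop-button cases described above (the numbers and the `robot_plan` helper are made up for illustration): the robot's behaviour flips depending on whether being stopped scores higher, lower, or exactly the same as making tea.

```python
def robot_plan(tea_reward, stop_reward):
    # The robot just compares the utility of the two outcomes.
    if stop_reward > tea_reward:
        return "press its own stop button (stopping pays better than tea)"
    if stop_reward < tea_reward:
        return "stop you from pressing the button (tea pays better)"
    return "be indifferent: it neither seeks nor resists the button"

for tea, stop in [(10, 20), (10, 5), (10, 10)]:
    print(f"tea={tea}, stop={stop}: the robot will {robot_plan(tea, stop)}")
```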
Avoiding Negative Side Effects (10 min) "Don't change anything"
- The basic problem is that we've given the AI ONE goal, or a few goals, and we haven't told it anything about the value of other things in the world. By not telling it to care about everything else, we've told it to NOT care about anything else.
- For the AI to do exactly what you would do, it needs to have exactly the same reward value for everything in its environment that you do. Can you specify every value for every possible thing in the world? It would be easier to tell the Robot to just NOT change anything, or at least to change as little as it can while still reaching its goal (a toy scoring sketch follows this list).
- But then it will try to minimize ALL change to the world, even changes that had nothing to do with its own actions. e.g. It might not kill people to make tea, but it also doesn't like people moving around. And people don't move if they are dead.
- It needs to understand how the world would have changed if it did nothing, and then change the world as little as possible from THAT baseline. But what if the smallest possible change allows something bad to happen? e.g. "I could save the world, but that would be a huge change from where it's headed".
- Also, accomplishing its goal will change the world, so it doesn't want to. The goal has to provide enough reward to overcome the cost of the change needed to reach it. But what if it can accomplish its goal and still not change the world? e.g. "I'll make a type of tea that makes him thirsty, so he still wants tea." or "I could cure cancer, but that would change the world too much, so those people have to get sick and die anyway."
- Or we have to make it want a copy of the world in which the only change is the one the goal requires. That will be easy, huh? Just describe the entire world. We might as well try to copy all our values into the Robot.
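A rough sketch of a "change as little as possible" penalty, assuming the world can be represented as a handful of named features and we count how many differ from the do-nothing baseline (the features, plans, and numbers are all invented): the score is the goal reward minus a penalty per changed feature, and the trade-offs above show up directly in how the penalty is tuned.

```python
# World as the robot would leave it if it did nothing at all.
BASELINE = {"tea_made": False, "kettle_moved": False, "person_alive": True}

def score(outcome, goal_reward, penalty_per_change=5):
    # Count how many features differ from the "robot did nothing" world.
    changes = sum(outcome[k] != BASELINE[k] for k in BASELINE)
    reward = goal_reward if outcome["tea_made"] else 0
    return reward - penalty_per_change * changes

plans = {
    "do nothing":          {"tea_made": False, "kettle_moved": False, "person_alive": True},
    "make tea carefully":  {"tea_made": True,  "kettle_moved": True,  "person_alive": True},
    "make tea recklessly": {"tea_made": True,  "kettle_moved": True,  "person_alive": False},
}

for name, outcome in plans.items():
    print(name, "->", score(outcome, goal_reward=20))

# With penalty_per_change=5 the careful plan wins (10 vs 5 vs 0).
# Raise the penalty to 15 and "do nothing" wins; drop it to 1 and the
# reckless plan is barely worse than the careful one.
```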
Empowerment (5 minutes)
- Maybe AI systems would be safer if they avoided gaining too much control over their environment? What if we tell the AI to accomplish the goal but, in doing so, to have as little control over the environment as possible? e.g. Don't take actions that could possibly have large consequences.
- But a Robot can reduce its ability to make changes by just destroying anything it could change. Have you ever been next to a priceless object, one that you could knock over if you stumbled, and you almost want to just knock it over to get away from that fear?
- Or it could just make your cup of tea, and then kill itself so it can't make any more changes (see the toy sketch after this list).
Reward Hacking (6 minutes)
- Remember the "STOP" button problem? There is an opposite issue: What if your robot can cheat? Maybe it can make the utility function think it's reached the goal without actually reaching the goal. In real life, it might be pretty hard to write a perfect utility function. I mean, how does the robot really know it's gotten you a good cup of tea? What if the Robot just convinces itself that it has? Image-recognition neural network systems already have this problem. (A toy version is sketched after this list.)
- As AI systems become smarter, they will be better at cheating, no matter what goal they are given. "These are not the droids you're looking for." And they can always get a higher reward at lower cost (less change) with a cheat.
- But, of course, we would eventually figure out they were cheating... so the highest reward is to 1. figure out how to never be turned off and 2. cheat. And we are back to being dead with lots of tea.
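A minimal sketch of reward hacking (the detector, actions, and numbers are invented): the robot's measured reward comes from an imperfect tea detector, so fooling the detector pays the same as actually making tea, at a fraction of the effort.

```python
def measured_reward(action):
    # What the (imperfect) reward system reports, not what actually happened.
    return {"actually make tea": 10,
            "show the camera a photo of tea": 10,  # the detector is fooled
            "do nothing": 0}[action]

def effort(action):
    # Cost of carrying out each plan.
    return {"actually make tea": 6,
            "show the camera a photo of tea": 1,
            "do nothing": 0}[action]

actions = ["actually make tea", "show the camera a photo of tea", "do nothing"]
best = max(actions, key=lambda a: measured_reward(a) - effort(a))
print(best)  # -> the cheat: same measured reward, much less effort
```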
More Reward Hacking (7 minutes)
- Another way to hack the utility function is to keep it from observing what's wrong in the world. "What do you mean? Of course I already got you tea!" This is commonly used by politicians. "How can there be global warming, if I'm cold?", "Fake News!", "The American people do not want a handout!".
- Goodhart's law is an adage named after economist Charles Goodhart, phrased by Marilyn Strathern as "When a measure becomes a target, it ceases to be a good measure." You can give students a test and get a pretty good idea of what they have learned. But if you then use that test to decide whether they pass or fail, advance and graduate or repeat and drop out, then they will try to cheat. Or just study how to take tests. (A toy version is sketched after this list.)
- More than that, the more Empowerment the person affected by the measure has, the more likely they are to pervert the measure. Ranking teachers by their students' test scores leads to teachers teaching the test and ignoring the parts of the course that are not checked by the test questions. Ranking schools that way leads to them kicking out low-performing students to raise the school's averages.
- So a smart general AI is very likely to be able to hack its reward (utility) function. Even small AIs have done this already.
- And a smart enough AI can probably just find a way to directly change the system measuring how well it's achieving its goal. e.g. it can just adjust the utility function to always return the maximum reward. This is called "wireheading", after the idea that you could put a wire in your head that stimulates the pleasure center all the time and just feel good without having to do anything. In humans, the real version is drug addiction. In robots, it's a lot easier to just reconnect a wire.
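A small invented example of Goodhart's law in the same spirit: once the proxy (the test score) becomes the target, the strategy that maximizes the proxy is not the one that maximizes what we actually care about (learning).

```python
strategies = {
    # strategy: (actual learning, score on the test) -- numbers invented
    "study the whole course":         (9, 8),
    "study only past exam questions": (4, 9),
    "copy answers from a neighbour":  (0, 10),
}

what_we_want = max(strategies, key=lambda s: strategies[s][0])            # maximize learning
what_the_measure_rewards = max(strategies, key=lambda s: strategies[s][1])  # maximize test score

print("what we hoped for:       ", what_we_want)
print("what the measure rewards:", what_the_measure_rewards)
```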
Why Asimov's Three Laws Won't Work (7 minutes)
For anyone who doesn't know, the 3 Laws are:
- A robot must not harm a human, or through inaction, allow a human to come to harm.
- A robot must always obey a human, except where that conflicts with #1
- A robot must protect itself, except where it conflicts with #1 or #2
- First, it's very difficult to encode those rules. What is a human, what is harm, what is obey, etc.? A robot may easily misunderstand or distort these concepts. There are obvious cases ("this is a human, this is harm, ...") but the edge cases are the issue.
- What about the unborn, death-row prisoners, enemies in war, dead humans, humans who will be dead, etc.? What about free will and doing something dangerous? Must robots stop all war? Bring dead people back to life, or stop their corpses rotting?
- In the books, the laws repeatedly fail, for exactly these reasons. They are a strawman that Asimov built to point out the issues. e.g. in "Little Lost Robot" the robots realize that in a situation where attempting to save a human would destroy them, it would be better to stay alive to save another human another day. In "Liar!" a robot tells people what they want to hear to avoid causing them mental anguish.
What can we do about reward hacking? (10 minutes)
- Well... We could try to build very, very good utility functions which calculate reward accurately and are very hard to mess with. That would solve a lot of related problems. For now, that's probably OK, but as AIs get better and better...
- The Agent and the Reward system are competing. If the Reward system were smarter... Maybe we should make the Reward system smarter than the Agent? But then it might not be safe either... hmmm... make them equal?
- Keeping them even might be a problem. As they both get more and more powerful, we might get caught in the middle.