[TOC]
- Search algorithms
- Planning models, languages, and computational approaches.
- Generating heuristics, delete relaxation
- Classical planning width
- MDPs and value/policy iteration
- Monte-carlo tree search
- Reinforcement learning: Q-learning, SARSA, n-step learning, reward shaping, function approximation
- Game theory: normal form games and extended form games
- SAT
- CSP - Constraint Satisfaction Problems
- Classification
- Search
Blind (uninformed) search: only uses the basic ingredients of general search algorithms.
- BFS
- DFS
- Uniform Cost
Heuristic (informed) search: additionally uses heuristic functions which estimate the distance (or remaining cost) to the goal.
- A*
Systematic search: considers a large number of search nodes simultaneously.
Local search: works with one (or a few) candidate solutions (search nodes) at a time.
Completeness: is the strategy guaranteed to find a solution when there is one?
Optimality: are the returned solutions guaranteed to be optimal?
| | Completeness? | Optimality? | Time complexity | Space complexity |
|---|---|---|---|---|
| BFS | Yes | Yes (if costs are uniform) | $O(b^d)$, where $b$ is the branching factor and $d$ is the goal depth | Same as time complexity |
| DFS | No (may loop) | No (returns the first solution found) | Worst: $O(b^m)$, where $m$ is the maximum depth reached; best: $O(b \cdot d)$ if a solution path is found directly | $O(b \cdot m)$ |
| Iterative Deepening Search | Yes | Yes | $O(b^d)$ | $O(b \cdot d)$ |
| | Completeness? | Optimality? | Time complexity | Space complexity |
|---|---|---|---|---|
| Greedy Best-First Search | Yes | Not really | | |
| A* | Yes | Yes | | |
| Hill-Climbing | No | No | | |
Let $\Pi$ be a planning task with state space $S$ and goal states $S_G$. Let $h$ be a heuristic function for $\Pi$, and let $h^*$ denote the perfect heuristic (the cost of an optimal plan from each state). Then $h$ is called:

- safe: if $h^*(s) = \infty$ for all $s \in S$ with $h(s) = \infty$;
- goal-aware: if $h(s) = 0$ for all goal states $s \in S_G$;
- admissible: if $h(s) \le h^*(s)$ for all $s \in S$;
- consistent: if $h(s) \le h(s') + c(a)$ for all transitions $s \xrightarrow{a} s'$.
```mermaid
graph LR
A[consistent and goal-aware] --> B(admissible)
C[admissible] --> D(goal-aware)
C[admissible] --> F(safe)
```
- Uses a priority queue
- Ordered by $h(state(\delta))$ (greedy best-first search) or by $f := g(s) + h(s)$ (A*)
- Generated nodes
- Expanded nodes
- Re-expanded nodes (re-opened nodes)
- If $h$ is admissible and consistent, then A* never re-opens nodes
- A* with duplicate detection and re-opening
- **Weight $W$ in $f(s) := g(s) + W \cdot h(s)$**
  - For $W = 0$, weighted A* behaves like uniform-cost search
  - For $W = 1$, weighted A* behaves like A*
  - For $W = \infty$, weighted A* behaves like greedy best-first search
- Advantage (bounded suboptimality): if $h$ is admissible, then weighted A* with $W > 1$ returns solutions that are at most $W$ times as costly as optimal ones (a code sketch of weighted A* follows below).
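To make the priority-queue mechanics and the role of the weight $W$ concrete, here is a minimal Python sketch of weighted A* (the names `weighted_astar`, `successors`, and `h` are illustrative assumptions, not from the course material). With $W = 0$ it behaves like uniform-cost search, with $W = 1$ like A*, and with a very large $W$ like greedy best-first search.

```python
import heapq
from itertools import count

def weighted_astar(start, goal, successors, h, W=1.0):
    """Best-first search ordered by f(n) = g(n) + W * h(n).

    successors(state) yields (action, next_state, cost); h(state) estimates remaining cost.
    """
    tie = count()                                  # tie-breaker so states are never compared
    frontier = [(W * h(start), next(tie), 0.0, start, [])]
    best_g = {}                                    # duplicate detection (with re-opening)
    while frontier:
        f, _, g, state, plan = heapq.heappop(frontier)
        if state in best_g and best_g[state] <= g:
            continue                               # a cheaper path already expanded this state
        best_g[state] = g                          # this node counts as "expanded"
        if state == goal:
            return plan, g
        for action, nxt, cost in successors(state):
            g2 = g + cost                          # child node is "generated"
            if nxt not in best_g or g2 < best_g[nxt]:
                heapq.heappush(frontier, (g2 + W * h(nxt), next(tie), g2, nxt, plan + [action]))
    return None, float("inf")
```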
- Only makes sense when $h(s) > 0$
- Can easily get stuck in local minima
- Variations: different tie-breaking strategies, restarts, ...
- On undirected graphs, it can be complete
- but it is not optimal
| | Advantages | Disadvantages |
|---|---|---|
| Programming Based | Domain knowledge is easy to express | Cannot deal with situations not anticipated by the programmer |
| Learning Based | Unsupervised, supervised, or evolutionary; does not require much knowledge in principle | In practice, it is hard to know which features to learn, and learning is slow |
| Model Based | General Problem Solving (GPS); powerful (generality); quick (rapid prototyping); flexible & clear | |
Rule based
Train
- PDDL consists of two parts: a domain file and a problem file
- problem file: gives the objects, the initial state, and the goal state.
- domain file: gives the predicates and the operators; each benchmark domain has one domain file.
- Satisficing Planning: find any plan
- Optimal Planning: find an optimal (cheapest) plan
Use (satisficing or optimal) planners to solve these problems.
- PlanEx (satisficing planning): given a planning task P, decide whether or not there exists a plan for P.
- PlanLen (optimal planning): given a planning task P and an integer B, decide whether or not there exists a plan for P of length at most B.
- Both of them are PSPACE-complete.
- PSPACE: problems that can be solved by a Turing machine using polynomial space (Polynomial SPACE).
- NP: problems that are not necessarily fast to solve, but for which any candidate answer can be verified quickly.
Ignore some factors so that the problem can be approximated by another problem that is easier to solve. In short: simplify the problem.
Definition. A relaxation $r: P \to P'$ is:

- native: if $P' \subseteq P$ and $h'^* = h^*$;
- efficiently constructible: if there exists a polynomial-time algorithm that, given $\Pi \in P$, computes $r(\Pi)$ (i.e., $r$ is computable in polynomial time);
- efficiently computable: if there exists a polynomial-time algorithm that, given $\Pi' \in P'$, computes $h'^*(\Pi')$.
| Definition | Proposition |
|---|---|
| Inclusion relation (subset) | If a subset satisfies the goal, then so does any superset; if a sequence of actions is valid (applicable) in a subset, then it is also valid in any superset |
- SIW uses IW both for decomposing a problem into subproblems and for solving the subproblems
- It is a blind search procedure with no heuristic of any sort; IW does not even know the next goal $G_i$ "to achieve" (a minimal sketch of IW(1) follows below)
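As a rough illustration of the novelty-based pruning IW relies on, here is a minimal sketch of IW(1), assuming states are represented as frozensets of atoms and that `successors` and `is_goal` helpers exist (both are hypothetical, for illustration only):

```python
from collections import deque

def iw1(s0, successors, is_goal):
    """IW(1): breadth-first search that prunes states making no new atom true.

    States are frozensets of atoms; successors(s) yields (action, next_state).
    """
    seen_atoms = set(s0)                    # atoms made true by some generated state so far
    queue = deque([(s0, [])])
    while queue:
        state, plan = queue.popleft()
        if is_goal(state):
            return plan
        for action, nxt in successors(state):
            new_atoms = set(nxt) - seen_atoms
            if not new_atoms:
                continue                    # novelty > 1: prune this state
            seen_atoms |= new_atoms         # novelty 1: keep it and remember its atoms
            queue.append((nxt, plan + [action]))
    return None                             # no solution found within width 1
```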
- Non-linear dynamics
- Perturbation in flight controls
- Partial observability
- Uncertainty about opponent strategy
- Reinforcement Learning and Deep Learning trained to learn a controller
- Search algorithm as a lookahead for action selection
| | MDPs | Classical |
|---|---|---|
| Transition function | No longer deterministic | Deterministic |
| Goals | No (but rewards) | Yes |
| Action costs | No (but negative rewards) | Yes |
| Discount factor ($\gamma$) | Yes (for future, uncertain rewards) | No |
Expected discounted reward from state $s$ under policy $\pi$:

$$V^\pi(s) = E_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \;\middle|\; s_0 = s\right]$$

Compute MDPs (Bellman equations):

$$V^*(s) = \max_{a \in A(s)} \sum_{s'} P_a(s' \mid s)\,\big[r(s, a, s') + \gamma\, V^*(s')\big]$$

Or recursively, via Q-values:

$$Q^*(a, s) = \sum_{s'} P_a(s' \mid s)\,\big[r(s, a, s') + \gamma\, V^*(s')\big], \qquad V^*(s) = \max_{a \in A(s)} Q^*(a, s)$$
Value Iteration: finds the optimal value function $V^*$ by solving the Bellman equations iteratively, using the following algorithm (a code sketch follows the example):

- Set $V_0$ to an arbitrary value function, e.g., $V_0(s) = 0$ for all $s$.
- Set $V_{i+1}$ to the result of Bellman's right-hand side, using $V_i$ in place of $V$.
- Stop when $R = \max_s |V_{i+1}(s) - V_i(s)| \le \epsilon$, where $\epsilon$ is a predefined threshold.
- Example (assuming no action costs and $\gamma = 0.9$):
  - Initialise all $V_0 = 0$.
  - Calculate $V_1$ by selecting the $\max$ expected reward over next states $s'$; e.g., for the cell to the left of the `+1` cell, $max\_reward = \gamma \cdot p(right) \cdot 1 = 0.9 \cdot 0.8 \cdot 1 = 0.72$.
  - Loop until $\max_s |V_{i+1}(s) - V_i(s)| \le \epsilon$.
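A minimal Python sketch of value iteration under assumed representations (a transition table `P[s][a]` of `(probability, next_state)` pairs, a reward function `r(s, a, s2)`, and an `actions(s)` list are illustrative helpers, not prescribed by the notes):

```python
def value_iteration(states, actions, P, r, gamma=0.9, eps=1e-4):
    """Repeatedly apply Bellman's right-hand side until the largest change <= eps.

    P[s][a]      -> list of (probability, next_state) pairs
    r(s, a, s2)  -> immediate reward
    actions(s)   -> list of applicable actions in s
    """
    V = {s: 0.0 for s in states}                            # V_0 = 0 everywhere
    while True:
        V_new = {}
        for s in states:
            acts = actions(s)
            if not acts:                                    # terminal state: value 0
                V_new[s] = 0.0
                continue
            V_new[s] = max(
                sum(p * (r(s, a, s2) + gamma * V[s2]) for p, s2 in P[s][a])
                for a in acts
            )
        residual = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if residual <= eps:                                 # stop when R <= epsilon
            return V
```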
Policy extraction: simply select $\pi(s) := \arg\max_{a \in A(s)} Q(a, s)$.

Policy Iteration: creates an improved policy in each iteration.

Formula (policy evaluation and the corresponding Q-values):

$$V^\pi(s) = \sum_{s'} P_{\pi(s)}(s' \mid s)\,\big[r(s, \pi(s), s') + \gamma\, V^\pi(s')\big], \qquad Q^\pi(a, s) = \sum_{s'} P_a(s' \mid s)\,\big[r(s, a, s') + \gamma\, V^\pi(s')\big]$$

1. Start with an arbitrary policy $\pi$.
2. Compute $V^\pi(s)$ for all $s$ (policy evaluation).
3. Improve $\pi$ by setting $\pi(s) := \arg\max_{a \in A(s)} Q^\pi(a, s)$ (improvement).
4. If $\pi$ changed in step 3, go back to step 2; otherwise finish (a code sketch follows below).
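A minimal sketch of policy iteration under the same assumed MDP representation as the value-iteration sketch above; policy evaluation is done iteratively here rather than by solving the linear equations exactly (a deliberate simplification):

```python
def policy_iteration(states, actions, P, r, gamma=0.9, eval_eps=1e-4):
    """Alternate policy evaluation and policy improvement until the policy is stable."""
    pi = {s: actions(s)[0] for s in states if actions(s)}   # arbitrary initial policy

    def q_value(s, a, V):
        return sum(p * (r(s, a, s2) + gamma * V[s2]) for p, s2 in P[s][a])

    while True:
        # Step 2: policy evaluation, compute V^pi iteratively
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                if s not in pi:
                    continue                                # terminal state stays at 0
                v = q_value(s, pi[s], V)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta <= eval_eps:
                break
        # Step 3: policy improvement, pi(s) := argmax_a Q^pi(a, s)
        changed = False
        for s in states:
            if s not in pi:
                continue
            best = max(actions(s), key=lambda a: q_value(s, a, V))
            if best != pi[s]:
                pi[s] = best
                changed = True
        if not changed:                                     # Step 4: policy stable, finish
            return pi, V
```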
POMDPs relax the assumption of full observability. A POMDP is defined such that:

- Each state of the POMDP is a belief state: a probability distribution over the set $S$ that defines the probability of being in each state $s \in S$.
- Solutions are policies that map belief states into actions.
- Optimal policies minimise the expected cost.
- MCTS is an online planning method: it makes decisions at runtime and needs computation at every single step (in contrast to offline methods such as value/policy iteration).
| # | Step | Description |
|---|---|---|
| 1 | Select | Given a tree policy, select a single node in the tree to assess. |
| 2 | Expand | Expand this node by applying one available action from the node. |
| 3 | Simulate | From the expanded node, perform a complete random simulation to a leaf node. This assumes that the search tree is finite (but versions for infinitely large trees exist). |
| 4 | Backpropagate | The value of the node is backpropagated to the root node, updating the value of each ancestor node on the way. |
An N-armed bandit is defined by a set of random variables $X_{i,n}$, where $1 \le i \le N$ indexes the arm of the bandit and $n$ indexes the successive plays of arm $i$.

With this definition, we can view MCTS action selection as a bandit problem:

- Actions $a$ applicable in $s$ are the "arms of the bandit"
- $Q(a, s)$ corresponds to the random variables $X_{i,n}$
- Select each arm with the same (uniform) probability.
- The Q-value for an action $a$ in a given state $s$ can then be approximated as:

$$Q(s, a) \approx \frac{1}{N(s, a)} \sum_{t=1}^{N(s)} I_t(s, a)\, r_t$$

- $N(s, a)$ is the number of times $a$ is executed in $s$; $N(s)$ is the number of times $s$ is visited; $r_t$ is the reward of the $t^{th}$ simulation from $s$; $I_t(s, a)$ is 1 if $a$ was selected on the $t^{th}$ simulation from $s$, and 0 otherwise.
- All actions are selected with the same probability, which wastes time and is inefficient.
- A better approach is to focus on the most promising parts of the tree, given the rewards we have received so far.
If I play arm b, my regret is the best possible expected reward minus the expected reward of playing b. If I play arm a (the best arm), my regret is 0. Regret is thus the expected loss due to not doing the best action, i.e., the value lost whenever the chosen option is not the optimal one.
How to select an action according to the current state:
| exploitation | exploration |
|---|---|
| $Q(s, a)$ | $+\; 2 C_p \sqrt{\frac{2 \ln N(s)}{N(s, a)}}$ |
| weight from learned knowledge | weight from visit counts (the fewer the visits, the bigger the term) |
- $Q(a, s)$ is the estimated Q-value.
- $N(s)$ is the number of times $s$ has been visited.
- $N(s, a)$ is the number of times $a$ has been executed in $s$.
- $C_p > 0$: exploration constant (a bigger $C_p$ encourages more exploration).
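A minimal sketch of the UCB1 selection rule using exactly these quantities; the dictionaries `Q` and `N_sa` and the default $C_p = 1/\sqrt{2}$ are illustrative assumptions:

```python
import math

def ucb1_select(actions, Q, N_s, N_sa, Cp=1.0 / math.sqrt(2)):
    """Pick the action maximising Q(s,a) + 2*Cp*sqrt(2*ln(N(s)) / N(s,a)).

    Q[a]    -> current average reward estimate for action a in this state
    N_s     -> number of times this state has been visited
    N_sa[a] -> number of times a has been executed in this state
    """
    def ucb(a):
        if N_sa[a] == 0:
            return float("inf")            # always try untried actions first
        return Q[a] + 2 * Cp * math.sqrt(2 * math.log(N_s) / N_sa[a])
    return max(actions, key=ucb)
```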
| | Value/Policy Iteration | MCTS |
|---|---|---|
| Cost | High | Low |
| Robustness | Higher (works from any state) | Lower (works only from the initial state or states reachable from the initial state) |
| | Computed once, usable forever (offline) | Every (previously unseen) step requires re-computation (online) |
- Maintain a Q-function that records $Q(s, a)$ for every state-action pair.
- Repeat for each step until convergence, time runs out, or the result is good enough:
  - Choose an action using a multi-armed bandit algorithm
  - Apply that action and receive the reward
  - Update $Q(s, a)$ based on that reward
Update rule (used to update the Q-value):

$$Q(s, a) \leftarrow Q(s, a) + \alpha\,\big[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big]$$

Q-table: used to maintain the Q-values of the Q-function.

Policy extraction (same as the one we used in value iteration): $\pi(s) := \arg\max_{a \in A(s)} Q(s, a)$

On-Policy (SARSA): instead of estimating $Q(s', a')$ for the best estimated future action during the update (as Q-learning does), on-policy learning uses the actual next action $a'$ to update: $Q(s, a) \leftarrow Q(s, a) + \alpha\,\big[r + \gamma\, Q(s', a') - Q(s, a)\big]$
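To highlight the off-policy/on-policy difference, here is a minimal sketch of the two tabular updates, assuming `Q` is a `defaultdict(float)` keyed by `(state, action)` pairs and `actions(s)` returns a list (illustrative representations, not prescribed by the notes):

```python
def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best estimated action in the next state."""
    best_next = max(Q[(s2, a2)] for a2 in actions(s2)) if actions(s2) else 0.0
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action a2 that will actually be taken next."""
    next_q = Q[(s2, a2)] if a2 is not None else 0.0
    Q[(s, a)] += alpha * (r + gamma * next_q - Q[(s, a)])
```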
- Takes a long time to propagate (i.e., to pass learned knowledge back to early states)
- The Q-table is sometimes too big ($|A| \times |S|$)
- Hard to converge
- Rewards can be sparse, which means that a better state cannot be sensed within a short time.
- n-step temporal difference learning: look $n$ steps ahead for the reward before updating the Q-value
- Approximate methods: use function approximation to eliminate the need for a large Q-table, and to provide reasonable estimates of $Q(s, a)$ even if $(s, a)$ has never been applied before
- Reward shaping and value-function initialisation: modify/augment the reward function so that better states are easier to sense.
- $\lambda$: the number of steps that we want to look ahead
  - $TD(0)$ → standard reinforcement learning
  - $TD(1)$ → looks one step ahead
  - $TD(2)$ → looks two steps ahead
- Back-propagation of the reward over $n$ steps (a sketch of the n-step return follows below)
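A minimal sketch of the n-step return and the corresponding update, assuming the agent has stored the $n$ rewards observed after $(s, a)$ and a bootstrapped tail value; the function names and the dictionary representation of `Q` are illustrative:

```python
def n_step_return(rewards, gamma, tail_value):
    """G = r_1 + gamma*r_2 + ... + gamma^(n-1)*r_n + gamma^n * Q(s_n, a_n).

    rewards    -> the n rewards observed after (s, a)
    tail_value -> bootstrapped estimate Q(s_n, a_n) (use 0.0 if the episode ended)
    """
    G = 0.0
    for i, r in enumerate(rewards):
        G += (gamma ** i) * r
    return G + (gamma ** len(rewards)) * tail_value

def n_step_update(Q, s, a, G, alpha=0.1):
    """Move Q(s, a) toward the n-step return G."""
    Q[(s, a)] += alpha * (G - Q[(s, a)])
```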
- Used to reduce the large-state-space problem by:
  1. Only considering the most important features (or a linear combination of features)
  2. Using the features selected in step 1 in the Q-function
  3. Using the value from step 2 to update the feature weights (instead of a separate Q-table entry per state)
For a linear Q-function, $Q(s, a) = \sum_i w_i\, f_i(s, a)$, the weight updates are:

- For Q-learning: $w_i \leftarrow w_i + \alpha\,\big[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big]\, f_i(s, a)$
- For SARSA: $w_i \leftarrow w_i + \alpha\,\big[r + \gamma\, Q(s', a') - Q(s, a)\big]\, f_i(s, a)$

Once we receive a reward, we update the weight vector $w$, which is shared across all states (so the same weights can be used for every state).
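A minimal sketch of linear Q-function approximation with the Q-learning weight update above; `features(s, a)` returning a list of feature values and `actions(s)` returning a list are assumed helpers for illustration:

```python
def linear_q(weights, features, s, a):
    """Q(s, a) approximated as a weighted sum of features: sum_i w_i * f_i(s, a)."""
    return sum(w * f for w, f in zip(weights, features(s, a)))

def linear_q_learning_update(weights, features, s, a, r, s2, actions,
                             alpha=0.01, gamma=0.9):
    """Q-learning update of the shared weight vector (reused for every state)."""
    best_next = max((linear_q(weights, features, s2, a2) for a2 in actions(s2)),
                    default=0.0)
    delta = r + gamma * best_next - linear_q(weights, features, s, a)
    f = features(s, a)
    for i in range(len(weights)):
        weights[i] += alpha * delta * f[i]       # w_i <- w_i + alpha * delta * f_i(s, a)
```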
- Used for sparse rewards by applying domain knowledge (pointing out what a good state looks like)
non-zero-sum game
dominant strategy: the choice that is always best for oneself (for all actions the other player may choose)
Information about a state cannot be fully seen/sensed by the players.
- Do research before choosing (so that we choose based on knowledge gained from research, e.g., buying something based on whether the product is good)