Lec04: 20min-30min #24

Merged
merged 5 commits into from
Apr 5, 2020
3 changes: 2 additions & 1 deletion .gitignore
@@ -1 +1,2 @@
*.DS_Store
# mac
Owner

For ignore rules, a suffix wildcard is usually all you need. Wouldn't *.DS_Store already cover your case?

**/.DS_Store
143 changes: 143 additions & 0 deletions lec04/Lec4-3.en.txt
@@ -0,0 +1,143 @@
You know they had replication
but it wasn't replicating every single bit of memory
between the primaries and the backups
It was replicating a much more application-level table of chunks
It had this abstraction of chunks and chunk identifiers
And that's what it was replicating
It wasn't replicating sort of everything else
It wasn't going to the expense of
replicating every single other thing on those machines
They're doing okay as long as
they had the same sort of application-visible set of chunks
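The idea above can be sketched in a few lines. This is a minimal illustration with invented names, not GFS's actual data structures or protocol: application-level replication ships small, application-visible updates, here a table of chunk handles, instead of every byte of the primary's memory.

```python
# Toy sketch of application-level replication (invented names, not GFS).
# Only small logical updates to the chunk table cross the network.

class ChunkTableReplica:
    def __init__(self):
        self.chunks = {}  # chunk handle -> list of chunkserver addresses

    def apply(self, op, handle, servers=None):
        # Primary and backup apply the same logical operations,
        # so their application-visible state stays identical.
        if op == "add":
            self.chunks[handle] = list(servers)
        elif op == "remove":
            self.chunks.pop(handle, None)

primary, backup = ChunkTableReplica(), ChunkTableReplica()
for update in [("add", "chunk-1", ["s1", "s2"]),
               ("add", "chunk-2", ["s2", "s3"])]:
    primary.apply(*update)
    backup.apply(*update)  # ship the logical update, not raw memory

assert primary.chunks == backup.chunks
```

Nothing about interrupt timing or register state is replicated here, which is exactly why this style can be so much cheaper than machine-level replication.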
So most replication schemes out there go the GFS route
In fact almost everything except pretty much this paper
and a handful of similar systems
almost everything uses, at some level, application-level replication
Because it can be much more efficient
Because we don't have to go to the trouble of making sure
that interrupts occur at exactly the same point
in the execution of the primary and backup
GFS does not sweat that at all
But this paper has to
Because it replicates at such a low level
So most people build efficient systems with application-specific replication
The consequence of that though is that
the replication has to be built into the application
If you're getting a feed of application-level operations
for example, you really need to have the application participate in that
because some generic replication thing like today's paper
really can't understand the semantics of what needs to be replicated
So most schemes are application-specific
like GFS and every other paper we're going to read on this topic
Today's paper is unique in that it replicates at the level of the machine
and therefore does not care what software you run on it
It replicates the low-level memory and machine registers
You can run any software you like on it
as long as it runs on the kind of microprocessor that's being represented
With this replication scheme, the software can be anything
And the downside is that it's not necessarily that efficient
The upside is that you can take any existing piece of software
Maybe you don't even have source code for it or understand how it works
And, within some limits, you can just run it under VMware's replication scheme
And it'll just work, which is a magic fault-tolerance wand for arbitrary software
All right, now let me talk about how this is done, VMware FT
First of all VMware is a virtual machine company
A lot of their business is selling virtual machine technology
And what virtual machines refer to is the idea of
you buy a single computer
And instead of booting an operating system like Linux on the hardware
you boot what we'll call a virtual machine monitor, or hypervisor, on the hardware
And the hypervisor's job is actually to
simulate multiple virtual computers on this piece of hardware
So the virtual machine monitor may boot up, you know, one instance of Linux
maybe multiple instances of Linux, maybe a Windows
The virtual machine monitor on this one computer
can run a bunch of different operating systems
Each of these is itself some operating system kernel and then applications
So this is the technology they're starting with
And the reason for this is that it just turns out
there are many, many reasons why it's very convenient to interpose this level of indirection
between the hardware and the operating systems
It means that we can buy one computer
and run lots of different operating systems on it
If we run lots and lots of little services
instead of having to have lots and lots of computers, one per service
you can just buy one computer and run each service in the operating system
that it needs, using these virtual machines
So this was their starting point
They already had this stuff and a lot of sophisticated things built around it
at the start of designing VMware FT
So this is just virtual machines
What the paper's doing is gonna require two physical machines
Because there's no point in running the primary and backup software
in different virtual machines on the same physical machine
Because we're trying to guard against hardware failures
So you have two machines running their virtual machine monitors
And the primary is going to run on one, the backup is on the other
So on one of these machines we have a guest
It might be running a lot of virtual machines
We only care about one of them
It's gonna be running some guest operating system and some sort of server application
Maybe a database server, MapReduce master, or something
So I'll call this the primary
And there'll be a second machine that runs the same virtual machine monitor
and an identical virtual machine holding the backup
So whatever the operating system is, we have exactly the same thing on both
And the virtual machine monitor is giving these guest operating systems, the primary and backup, each a range of memory
and these memory images will be identical
or the goal is to make them identical in the primary and the backup
We have two physical machines
Each one of them running a virtual machine guest
with its own copy of the service we care about
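The reason the two replicas need separate physical machines can be shown with a toy placement check (hypothetical host names, purely illustrative): a hardware fault takes down every VM on that host, so co-locating primary and backup buys no fault tolerance.

```python
# Why primary and backup must sit on different physical machines:
# a hardware fault kills every VM hosted on the failed machine.

def survivors(placement, failed_host):
    """Return the VMs still alive after failed_host dies.

    placement maps VM name -> physical host name.
    """
    return [vm for vm, host in placement.items() if host != failed_host]

same_host = {"primary": "host-a", "backup": "host-a"}
two_hosts = {"primary": "host-a", "backup": "host-b"}

assert survivors(same_host, "host-a") == []          # both replicas lost
assert survivors(two_hosts, "host-a") == ["backup"]  # backup can take over
```

With two hosts, whichever machine fails, one replica of the service survives to keep serving clients.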
We're assuming that there's a network connecting these two machines
And in addition on this Local Area Network there's some set of clients
Really, they don't have to be clients
They're just maybe other computers that our replicated service needs to talk with
Some of them are clients sending requests
It turns out in this paper the replicated service actually doesn't use a local disk
and instead assumes that there's some sort of disk server that it talks to
Although it's a little bit hard to realize this from the paper
the scheme actually does not really treat the disk server specially
It's just another external source of packets
and a place that the replicated state machine may send packets to
Not very much different from clients
So the basic scheme is that we assume that
these two replicas, the two virtual machines, primary and backup, are exact replicas
Some client, you know, a database client or whoever it is
Some client of our replicated service sends a request to the primary
And that really takes the form of a network packet
that's what we're talking about
That generates an interrupt and this interrupt actually goes to
the virtual machine monitor at least in the first instance
The virtual machine monitor sees here's the input for this replicated service
And so the virtual machine monitor does two things
One is it simulates a network packet arrival interrupt
into the primary guest operating system
to deliver it to the primary copy of the application
And in addition the virtual machine monitor knows that
this is an input to a replicated virtual machine
And so it sends back out on the network a copy of that packet
to the backup virtual machine monitor
The backup virtual machine monitor also gets it and knows
it is a packet for this particular replicated state machine
And it also fakes a network packet arrival interrupt
at the backup and delivers the packet
So now both the primary and the backup have a copy
They look at this packet, the same input, and modulo a lot of details
are gonna process it in the same way and stay synchronized
Of course the service is probably going to reply to the client
On the primary the service will generate a reply packet
and send it on the NIC that the virtual machine monitor is emulating
And then the virtual machine monitor will see that output packet on the primary
It'll actually send the reply back out on the network to the client
Because the backup is running exactly the same sequence of instructions
it also generates a reply packet back to the client
and sends that reply packet on its emulated NIC
It's the virtual machine monitor that's emulating that network interface card
And the virtual machine monitor says, I know this is the backup
only the primary is allowed to generate output
And the virtual machine monitor drops the reply packet
So both of them see inputs and only the primary generates outputs
As far as terminology goes, the paper calls this stream of input events
and other events we'll talk about, the logging channel
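The flow just described can be sketched as a toy simulation (invented class names, not VMware's implementation): the primary's hypervisor forwards each input over the logging channel so both replicas see it, both compute the same reply, and only the primary's reply escapes to the client.

```python
# Toy sketch of the FT rules from the lecture (not VMware's actual code):
# inputs go to both replicas via the logging channel; only the primary's
# output reaches the client, the backup's identical output is dropped.

class GuestVM:
    """Deterministic service: both replicas compute the same reply."""
    def handle_packet(self, packet):
        return f"reply-to-{packet}"

class Hypervisor:
    def __init__(self, role, guest, logging_channel):
        self.role = role             # "primary" or "backup"
        self.guest = guest
        self.log = logging_channel   # list standing in for the network

    def on_packet(self, packet):
        if self.role == "primary":
            self.log.append(packet)  # forward the input to the backup
        # Fake a packet-arrival interrupt into the guest.
        reply = self.guest.handle_packet(packet)
        # Output rule: only the primary's packets leave the machine.
        return reply if self.role == "primary" else None

channel = []
primary = Hypervisor("primary", GuestVM(), channel)
backup = Hypervisor("backup", GuestVM(), channel)

out_primary = primary.on_packet("req-1")        # client request arrives
out_backup = backup.on_packet(channel.pop(0))   # backup replays the input

assert out_primary == "reply-to-req-1"  # goes back to the client
assert out_backup is None               # dropped at the backup
```

The real system has far more to it, replaying interrupts at identical instruction boundaries, for instance, but the fan-in of inputs and the drop-at-backup output rule are the core of the scheme described here.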
143 changes: 143 additions & 0 deletions lec04/Lec4-3.zh.txt
@@ -0,0 +1,143 @@
他们的确是有复制的
但是并没有在主和副本服务之间复制每一个bit的内存
但是并没有在主和副本服务之间复制每一个bit的内存
Owner

There's no timeline (timestamps) here

而是复制偏应用程序级别的内存块表
我对块和块标识符进行了这种抽象
这就是需要复制的东西
并不需要复制任何其他的东西
也没有在该机器上复制任何其他的东西的代价
也没有在该机器上复制任何其他的东西的代价
这样做是可以的
只要主服务和副本服务具有相同应用程序可见性的内存块集
因此大多数复制方案都采用与GFS相似的方案
实际上,几乎是除了这篇论文以及一些类似的系统之外的所有方案
实际上,几乎是除了这篇论文以及一些类似的系统之外的所有方案
他们几乎都使用了应用级别的复制
因为这样可以更有效率
因为我们不必费力的去确保
在主和副本服务运行时
中断发生在完全相同的时间点
GFS完全不用担心这点
但是本文必须要确保这点
因为它在很低的级别进行复制
因此大多数人使用特定于应用程序的复制来构建高效的系统
这样做的后果是
必须将复制内置到应用程序中
例如,如果你需要获取应用程序级别操作的提要
你就需要让应用程序参与其中
因为有些通用的复制方案,例如今天的论文
并不能理解哪些东西需要被复制的语义
因此大多数方案是针对于特定应用的
例如GFS以及我们将要在这个主题下阅读的所有其他论文
今天的论文的不同之处在于它是在机器级别进行复制的
因此它不关心在其上运行了什么软件
它复制低级别的内存以及寄存器
你可以在其上运行任何软件
只要它可以在这种所表示的微处理器上运行
这种复制方案可以适应任何软件
缺点是效率不一定高
优点是你可以使用任何现有的软件
甚至你没有源代码或者不知道它是如何工作的
在一定的限制下,你就可以在VMware的复制方案下运行它
它可以正常工作,对于任意软件都可以进行容错
现在我们来讨论VMware FT
首先,VMware是一家虚拟机公司
他们的很多业务都在销售虚拟机技术
虚拟机指的是
你买一台电脑
在硬件上不是启动像Linux这样的操作系统
而是启动虚拟机监视器
它的工作实际上是
在此硬件上模拟多台虚拟的电脑
因此虚拟机监视器可能会启动一个Linux实例
多个Linux实例,或者一个Windows实例
这台计算机上的虚拟机监视器可以运行许多不同的操作系统
这台计算机上的虚拟机监视器可以运行许多不同的操作系统
它们每个包含某种操作系统内核以及应用程序
所以这是他们开始使用的技术
原因是事实证明
在硬件和操作系统之间插入这种间接层非常方便的原因有很多
在硬件和操作系统之间插入这种间接层非常方便的原因有很多
这意味着我们可以购买一台计算机
并在其上运行许多不同的操作系统
如果我们运行大量的小型服务
而不是使用大量的每台运行一个服务的计算机
你可以只购买一台计算机,在基于虚拟机上的操作系统中运行每个服务
你可以只购买一台计算机,在基于虚拟机上的操作系统中运行每个服务
这就是他们的出发点
在最开始设计VMware FT时,他们已经构建了这项功能和许多其他复杂的东西
在最开始设计VMware FT时,他们已经构建了这项功能和许多其他复杂的东西
所以这就是虚拟机
论文要做的是要搭建一台机器
或者说他们需要两台物理机
因为在同一台物理计算机上的不同虚拟机中运行主软件和副本软件毫无意义
因为在同一台物理计算机上的不同虚拟机中运行主软件和副本软件毫无意义
因为我们正在努力应对硬件故障
因此,你有两台计算机分别运行其虚拟机监视器
而主虚拟机将在一台计算机上运行,而副本虚拟机将在另一台上运行
在其中一台计算机上有一个guest操作系统...
它可能正在运行许多虚拟机
我们只在乎其中的一个
它会运行某个guest操作系统和某种服务器应用程序
也许是数据库服务、MapReduce master或其他东西
我们称这个为主虚拟机
这里有第二台计算机运行相同的虚拟机监视器
也有运行副本服务的相同的虚拟机
因此,无论是何种操作系统,我们都具有完全相同的东西
虚拟机为这些guest操作系统、主和副本服务器提供一定范围的内存空间
并且这两个内存镜像是完全相同的
或其目标是使它们在主和副本虚拟机中完全相同
我们有两台物理计算机
每台都在运行guest虚拟机
该虚拟机上带有我们关心的服务的副本
我们假设有一个网络连接了这两台机器
此外,在此局域网上还有一些客户端
事实上,它们不一定是客户端
它们可能是带有复制的服务需要与之通信的其他计算机
其中一些是来发送请求的客户端
这篇论文中的带有复制的服务实际上并不使用本地磁盘
而是假设与某种磁盘服务器进行通信
尽管从本篇论文中很难意识到这一点
该方案实际上并没有特殊对待这种服务器
它只是数据包的另一个外部来源
只是复制状态机可能会将数据包发送到的地方
这与其他客户端没有太大不同
因此,基本方案是,我们假设
这两个副本、两个虚拟机、或者说主和副本虚拟机,都是精确的副本
某个客户端,例如数据库客户端
复制服务器的某个客户端向主虚拟机发送请求
而这实际上是以网络数据包的形式发送的,就是我们刚刚讨论的
而这实际上是以网络数据包的形式发送的,就是我们刚刚讨论的
它生成一个中断
该中断进入第一个实例的虚拟机监视器
虚拟机监视器发现复制服务的输入到来了
因此,虚拟机监视器会做两件事
第一件事,它模拟网络数据包到达中断,传递给主guest操作系统
第一件事,它模拟网络数据包到达中断,传递给主guest操作系统
以此将其传递给应用程序的主副本
第二件事,虚拟机监视器知道这是复制虚拟机的输入
第二件事,虚拟机监视器知道这是复制虚拟机的输入
因此,它通过网络将数据包副本发送给副本虚拟机监视器
因此,它通过网络将数据包副本发送给副本虚拟机监视器
所以它也得到了数据包,副本虚拟机监视器
知道它是此复制状态机的数据包
它在副本虚拟机中也会构造网络数据包到达中断,并传送数据包
它在副本虚拟机中也会构造网络数据包到达中断,并传送数据包
所以现在主和副本虚拟机都有了数据包的一份副本
它们看到的这个数据包、这个相同的输入
通过考虑大量的细节,会以相同的方式处理并保持同步
当然,服务可能会回复客户
在主虚拟机上,服务将生成一个回复数据包
将其发送到虚拟机监视器所模拟的NIC上
然后,虚拟机监视器将在主计算机上看到该输出数据包
它们会将回复通过网络发送回客户端
由于副本在运行完全相同的指令序列
它也会生成一个回复数据包返回给客户端
在其模拟的NIC上发送该回复数据包
虚拟机监视器模拟了该网卡
虚拟机监视器知道这是副本虚拟机
而它只允许主虚拟机生成输出
因此,虚拟机监视器会丢弃回复数据包
所以他们两个都看到了输入,而只有主虚拟机产生了输出
就术语而言,这篇论文将这种输入事件流
以及之后要讨论的其他事件流,称为日志记录通道