Chapter 2

A First Look at redis

This chapter gives a quick tour of redis.

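2.1 Where redis sits in a caching system

In most systems the data is handed to a database for storage, but the create/read/update/delete performance of a traditional database is limited, and the database itself is fairly complex. By the 80/20 rule, eighty percent of requests hit twenty percent of the data. So could we put an intermediate layer in physical memory that caches the frequently used data and relieves the read/write bottleneck of the traditional database? With the hot data held in memory, read and write performance is very good.

This idea is common throughout computing. Anyone who has studied computer systems has seen the memory hierarchy figure: the higher a storage device sits in the hierarchy, the faster it is. redis and memcache belong to NoSQL, that is, "not only SQL", which shows that they exist to make up for the shortcomings of traditional databases.

Key-value in-memory stores such as redis and memcache are well suited to read-heavy, write-light workloads. redis in addition builds on several data structures, which makes a caching system more fun to work with.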

2.2 Starting from the main function

»»»»>

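2.3 How redis works

When I first came to redis, what I most wanted to know was this: when a 'set name Jhon' command reaches the redis server, how does the server come to return 'OK'? What does the command-processing flow look like, and what are the details? Reading other people's code is tedious, but reading it with curiosity is exciting, so I turned to the main function in the redis source.

At startup redis runs some initialization logic, such as reading the configuration file, initializing the data store, and initializing the networking module. Once all initialization is done, it starts waiting for requests.

When a request arrives, the redis process is woken up; the mechanism is an I/O multiplexing system call such as epoll, select, or kqueue. redis then reads the data sent by the client, parses the command, looks the command up, and executes it.

To execute 'set name Jhon', redis finds the slot for key='name' in a hash table initialized in advance and stores the value there.

Finally, redis prepares the reply and sends it back to the client, which then receives 'OK'.

2.3.1 The process in detail

Read the code with one question in mind: how is a command processed? At first a pile of variables and functions awaits the reader, but it is enough to hold on to the trunk. Below is the trunk of redis.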

    int main(int argc, char **argv) {
        .....
        // Initialize the server configuration: mainly fills in the
        // fields of struct redisServer
        initServerConfig();
        .....
        // Initialize the server
        initServer();
        .....
        // Enter the event loop
        aeMain(server.el);
    }

Let's look at what each of them mainly does.

initServerConfig

initServerConfig mainly fills in struct redisServer, the structure that holds all of the redis configuration.

initServer

    void initServer() {
        // Create the event loop structure
        server.el = aeCreateEventLoop(server.maxclients+REDIS_EVENTLOOP_FDSET_INCR);
        // Allocate space for the databases
        server.db = zmalloc(sizeof(redisDb)*server.dbnum);

        /* Open the TCP listening socket for the user commands. */
        // listenToPort() is where listen() gets called
        if (server.port != 0 &&
            listenToPort(server.port,server.ipfd,&server.ipfd_count) == REDIS_ERR)
            exit(1);
        .....
        // Initialize the redis databases
        /* Create the Redis databases, and initialize other internal state. */
        for (j = 0; j < server.dbnum; j++) { // initialize each database
            // Hash table that stores the key-value pairs
            server.db[j].dict = dictCreate(&dbDictType,NULL);
            // Hash table that stores the expire time of each key
            server.db[j].expires = dictCreate(&keyptrDictType,NULL);
            server.db[j].blocking_keys = dictCreate(&keylistDictType,NULL);
            server.db[j].ready_keys = dictCreate(&setDictType,NULL);
            server.db[j].watched_keys = dictCreate(&keylistDictType,NULL);
            server.db[j].id = j;
            server.db[j].avg_ttl = 0;
        }
        .....
        // Create the event handlers that accept TCP or Unix domain
        // socket connections
        // TCP
        /* Create an event handler for accepting new connections in TCP and Unix
         * domain sockets. */
        for (j = 0; j < server.ipfd_count; j++) {
            // acceptTcpHandler() handles incoming TCP connections
            if (aeCreateFileEvent(server.el, server.ipfd[j], AE_READABLE,
                acceptTcpHandler,NULL) == AE_ERR)
            {
                redisPanic(
                    "Unrecoverable error creating server.ipfd file event.");
            }
        }
        .....
    }

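Here redis creates the event center, which is its networking module. If you have done network programming on Linux, you will know that select/epoll/kqueue must be involved.

Next comes the initialization of the data store; the key-value pairs we set through redis are kept there. There is no hurry to dig into how the pairs are stored. Let's keep reading, because the goal for now is to get the overall flow clear.

In the last part of the code, redis registers the callback acceptTcpHandler on the listening fd. In other words, this function is called whenever a new client connects; details follow below.

aeMain

Next, redis starts waiting for requests to arrive.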

    void aeMain(aeEventLoop *eventLoop) {
        eventLoop->stop = 0;
        while (!eventLoop->stop) {
            // The loop may go to sleep while waiting for events. Before
            // sleeping, run the hook installed with aeSetBeforeSleepProc().
            if (eventLoop->beforesleep != NULL)
                eventLoop->beforesleep(eventLoop);
            // AE_ALL_EVENTS means: process every kind of event
            aeProcessEvents(eventLoop, AE_ALL_EVENTS);
        }
    }

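The two functions above only do initialization. At this point redis formally enters the state of waiting for requests. The implementation relies on I/O multiplexing system calls such as select/epoll/kqueue, which are the basics of network programming. Let's keep following the call chain: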

    int aeProcessEvents(aeEventLoop *eventLoop, int flags)
    {
        .....
        // Block in the I/O multiplexing call and wait for events
        numevents = aeApiPoll(eventLoop, tvp);
        // Handle the events that fired
        for (j = 0; j < numevents; j++) {
            // Look up the entry stored in the I/O event table
            aeFileEvent *fe = &eventLoop->events[eventLoop->fired[j].fd];
            int mask = eventLoop->fired[j].mask;
            int fd = eventLoop->fired[j].fd;
            int rfired = 0;

            /* note the fe->mask & mask & ... code: maybe an already processed
             * event removed an element that fired and we still didn't
             * processed, so we check if the event is still valid. */
            // Read event
            if (fe->mask & mask & AE_READABLE) {
                rfired = 1;
                fe->rfileProc(eventLoop,fd,fe->clientData,mask);
            }
            // Write event
            if (fe->mask & mask & AE_WRITABLE) {
                if (!rfired || fe->wfileProc != fe->rfileProc)
                    fe->wfileProc(eventLoop,fd,fe->clientData,mask);
            }
            processed++;
        }
        // Handle timer events
        /* Check time events */
        if (flags & AE_TIME_EVENTS)
            processed += processTimeEvents(eventLoop);
        return processed; /* return the number of processed file/time events */
    }

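As you can see, aeApiPoll is where the I/O multiplexing call is made. When a request arrives, the process wakes up to handle it.

2.3.2 How a new connection is handled

As covered in the discussion of initServer, redis registers the callback acceptTcpHandler, which is invoked when a new connection arrives. For the listening socket, the function pointer rfileProc above actually points to acceptTcpHandler. Below is the core code of acceptTcpHandler: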

    // Handler that accepts TCP connections
    void acceptTcpHandler(aeEventLoop *el, int fd, void *privdata, int mask) {
        int cport, cfd;
        char cip[REDIS_IP_STR_LEN];
        REDIS_NOTUSED(el);
        REDIS_NOTUSED(mask);
        REDIS_NOTUSED(privdata);

        // Accept the client connection
        cfd = anetTcpAccept(server.neterr, fd, cip, sizeof(cip), &cport);
        // Error
        if (cfd == AE_ERR) {
            redisLog(REDIS_WARNING,"Accepting client connection: %s", server.neterr);
            return;
        }
        // Log it
        redisLog(REDIS_VERBOSE,"Accepted %s:%d", cip, cport);
        // This is where it gets interesting
        acceptCommonHandler(cfd,0);
    }

anetTcpAccept accepts a connection and returns its descriptor cfd. The interesting part is acceptCommonHandler, whose core call is createClient. For every client connection redis keeps a corresponding struct redisClient. Below is the core code of createClient:

    redisClient *createClient(int fd) {
        redisClient *c = zmalloc(sizeof(redisClient));

        /* passing -1 as fd it is possible to create a non connected client.
         * This is useful since all the Redis commands needs to be executed
         * in the context of a client. When commands are executed in other
         * contexts (for instance a Lua script) we need a non connected client. */
        if (fd != -1) {
            anetNonBlock(NULL,fd);
            anetEnableTcpNoDelay(NULL,fd);
            if (server.tcpkeepalive)
                anetKeepAlive(NULL,fd,server.tcpkeepalive);
            // Register a read event on the accepted socket.
            // readQueryFromClient() is the function that handles client requests
            if (aeCreateFileEvent(server.el,fd,AE_READABLE,
                readQueryFromClient, c) == AE_ERR)
            {
                close(fd);
                zfree(c);
                return NULL;
            }
        }
        .....
        return c;
    }

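As you can see, createClient registers the callback readQueryFromClient in the event center for the socket connected to the client. This means that when the client sends request data, readQueryFromClient is called. And so we have found where the processing of 'set name Jhon' begins.

2.3.3 How a request is handled

readQueryFromClient reads the data sent by the client. It then calls processInputBuffer to parse and execute the command; the execution itself is done by processCommand. Below is the core code of processCommand: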

    int processCommand(redisClient *c) {
        .....
        // Look up the command; redisClient.cmd is assigned here
        /* Now lookup the command and check ASAP about trivial error conditions
         * such as wrong arity, bad command name and so forth. */
        c->cmd = c->lastcmd = lookupCommand(c->argv[0]->ptr);
        // Command not found
        if (!c->cmd) {
            flagTransaction(c);
            addReplyErrorFormat(c,"unknown command '%s'",
                (char*)c->argv[0]->ptr);
            return REDIS_OK;
        // Wrong number of arguments
        } else if ((c->cmd->arity > 0 && c->cmd->arity != c->argc) ||
                   (c->argc < -c->cmd->arity)) {
            flagTransaction(c);
            addReplyErrorFormat(c,"wrong number of arguments for '%s' command",
                c->cmd->name);
            return REDIS_OK;
        }
        .....
        // The case where the command is added to the MULTI queue
        /* Exec the command */
        if (c->flags & REDIS_MULTI &&
            c->cmd->proc != execCommand && c->cmd->proc != discardCommand &&
            c->cmd->proc != multiCommand && c->cmd->proc != watchCommand)
        {
            // Queue the command
            queueMultiCommand(c);
            addReply(c,shared.queued);
        } else {
            // Actually execute the command.
            // Note: in multi-command mode the command is not executed
            // directly; it is queued instead (the branch above).
            call(c,REDIS_CALL_FULL);
            if (listLength(server.ready_keys))
                handleClientsBlockedOnLists();
        }
        return REDIS_OK;
    }

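As shown above, redis first uses the command name given by the client to look up the matching c->cmd, a struct redisCommand, in the command table.

    c->cmd = c->lastcmd = lookupCommand(c->argv[0]->ptr);

During initialization redis prepares a large array that holds all the commands, that is, it initializes many struct redisCommand entries; each struct redisCommand contains a pointer to the callback function for that command. Once the command structure is found, redis executes the command; the core call is call().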
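The lookup-then-dispatch step can be sketched in a few lines of C. This is a minimal sketch under stated assumptions, not the redis implementation: the `client` and `command` types, the two-entry table, and the names `lookup` and `dispatch` are hypothetical stand-ins, and a linear scan is used purely for clarity. The arity rule follows the check in processCommand: a positive arity must match argc exactly, and a negative arity means "at least -arity arguments".

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical, simplified stand-ins for redisClient and redisCommand. */
typedef struct client {
    int argc;
    const char **argv;
    const char *reply;
} client;

typedef void commandProc(client *c);

typedef struct command {
    const char *name;
    commandProc *proc;   /* callback that implements the command */
    int arity;           /* > 0: exact argc; < 0: at least -arity */
} command;

static void setProc(client *c)  { c->reply = "OK"; }
static void pingProc(client *c) { c->reply = "PONG"; }

/* Illustrative table; the real one holds every command. */
static command commandTable[] = {
    {"set",  setProc,  -3},
    {"ping", pingProc,  1},
};

/* Case-insensitive linear scan over the table. */
command *lookup(const char *name) {
    size_t n = sizeof(commandTable) / sizeof(commandTable[0]);
    for (size_t i = 0; i < n; i++)
        if (strcasecmp(name, commandTable[i].name) == 0)
            return &commandTable[i];
    return NULL;
}

/* Returns 0 on success, -1 for an unknown command, -2 for wrong arity. */
int dispatch(client *c) {
    command *cmd = lookup(c->argv[0]);
    if (cmd == NULL) return -1;
    if ((cmd->arity > 0 && cmd->arity != c->argc) || c->argc < -cmd->arity)
        return -2;
    cmd->proc(c);   /* same idea as c->cmd->proc(c) in call() */
    return 0;
}
```

With a client whose argv is {"SET", "name", "Jhon"}, dispatch finds the "set" entry, passes the arity check, and runs setProc, which leaves "OK" in the reply field.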

2.3.4 Executing the command

call() does many things, but here we care about only one: call() invokes the command's callback function.

    // call() is the core function for command execution; this is where
    // a command actually runs
    /* Call() is the core of Redis execution of a command */
    void call(redisClient *c, int flags) {
        .....
        // Run the handler that belongs to the command
        c->cmd->proc(c);
        .....
    }

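For the 'set name Jhon' command, the callback is setCommand(). setCommand checks the arguments of the set command, because set also offers features such as giving the key-value pair an expire time. Here we look only at the simplest case.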

    void setCommand(redisClient *c) {
        .....
        setGenericCommand(c,flags,c->argv[1],c->argv[2],expire,unit,NULL,NULL);
    }

    void setGenericCommand(redisClient *c, int flags, robj *key, robj *val,
            robj *expire, int unit, robj *ok_reply, robj *abort_reply) {
        .....
        setKey(c->db,key,val);
        .....
        addReply(c, ok_reply ? ok_reply : shared.ok);
    }

    void setKey(redisDb *db, robj *key, robj *val) {
        if (lookupKeyWrite(db,key) == NULL) {
            dbAdd(db,key,val);
        } else {
            dbOverwrite(db,key,val);
        }
        .....
    }

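setKey() first checks whether the key exists in the database. If it does, the value is overwritten; if not, the key is added. Here we follow the case where the key does not exist: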

    void dbAdd(redisDb *db, robj *key, robj *val) {
        sds copy = sdsdup(key->ptr);
        int retval = dictAdd(db->dict, copy, val);
        redisAssertWithInfo(NULL,key,retval == REDIS_OK);
    }

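dictAdd() stores the key in the dictionary, which is in fact a hash table.

2.3.5 Where the client gets its reply

Finally, back in setGenericCommand(), addReply() is called. addReply() registers a writable event on the socket connected to the client and appends 'ok' to the client's reply buffer. The next time through the event loop, if the socket is writable, the corresponding callback is invoked and the data in the reply buffer is sent to the client.

This completes the execution of 'set name Jhon'.

In tracing this flow I left out many details and looked only at the simplest, most uniform scenario, without reading the rest of the code. This is key to understanding quickly how a system works. When facing the code of other systems, you can likewise read with three simple questions in mind: what is it, where does it come from, and where does it go.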

2.4 The redis event-driven model

2.4.1 Overview

»»»»>

2.4.2 Other models

»»»»>

Chapter 3

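The redis event-driven model in detail

3.1 Overview

redis has a small event-driven core of its own. Like the event loop of the libevent networking library, it is built on I/O multiplexing.

With I/O multiplexing, the program watches the file I/O events it cares about, such as read and write events. It also maintains an event table keyed by file descriptor whose values are preset callback functions; in practice this is just an array or a linked list. When an event fires, for example when some file descriptor becomes readable, the system returns the descriptor, and the program uses it to find the entry in the event table and invoke the callback. Timer events can be implemented the same way, because the I/O multiplexing functions provided by the system accept a timeout value.

The paragraph above packs in a lot and may call for some background in Linux system and network programming, but you will find that most event-driven programs are implemented this way.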
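The "event table keyed by file descriptor" can be sketched as a plain array of callback slots. This is a minimal sketch, not redis code: the names `fileEvent`, `registerEvent`, `fireEvent`, and `countCall`, and the fixed `MAX_FD` limit, are assumptions made for illustration. The multiplexing call itself is left out; `fireEvent` stands for what the loop does once the system has reported that a descriptor is ready.

```c
#include <stddef.h>

#define MAX_FD      64   /* hypothetical table size */
#define EV_NONE     0
#define EV_READABLE 1
#define EV_WRITABLE 2

typedef void fileProc(int fd, void *clientData);

typedef struct fileEvent {
    int mask;                 /* which events are registered */
    fileProc *rproc, *wproc;  /* read and write callbacks */
    void *clientData;
} fileEvent;

/* Indexed directly by fd; static storage starts out as EV_NONE. */
static fileEvent events[MAX_FD];

/* Registers proc for the events in mask. Returns 0, or -1 if fd is out of range. */
int registerEvent(int fd, int mask, fileProc *proc, void *clientData) {
    if (fd < 0 || fd >= MAX_FD) return -1;
    fileEvent *fe = &events[fd];
    fe->mask |= mask;
    if (mask & EV_READABLE) fe->rproc = proc;
    if (mask & EV_WRITABLE) fe->wproc = proc;
    fe->clientData = clientData;
    return 0;
}

/* Runs the callbacks for the events in `fired` that are also registered.
 * Returns the number of callbacks invoked. */
int fireEvent(int fd, int fired) {
    if (fd < 0 || fd >= MAX_FD) return 0;
    fileEvent *fe = &events[fd];
    int calls = 0;
    if (fe->mask & fired & EV_READABLE) { fe->rproc(fd, fe->clientData); calls++; }
    if (fe->mask & fired & EV_WRITABLE) { fe->wproc(fd, fe->clientData); calls++; }
    return calls;
}

/* Example callback: counts how many times it was invoked. */
void countCall(int fd, void *clientData) {
    (void)fd;
    (*(int *)clientData)++;
}
```

Indexing by descriptor value makes the lookup O(1), at the cost of sizing the array for the largest descriptor the program expects to see.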

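3.1.1 Event-driven data structures

The redis event-driven core has four main data structures: the event loop structure, the file event structure, the time event structure, and the fired event structure.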

    // File event structure
    /* File event structure */
    typedef struct aeFileEvent {
        int mask; /* one of AE_(READABLE|WRITABLE) */
        // Callback function pointers
        aeFileProc *rfileProc;
        aeFileProc *wfileProc;
        // clientData usually points to a redisClient
        void *clientData;
    } aeFileEvent;

    // Time event structure
    /* Time event structure */
    typedef struct aeTimeEvent {
        long long id; /* time event identifier. */
        long when_sec; /* seconds */
        long when_ms; /* milliseconds */
        // Timer callback function pointer
        aeTimeProc *timeProc;
        // Cleanup function, called when the timer event is deleted
        aeEventFinalizerProc *finalizerProc;
        // clientData usually points to a redisClient
        void *clientData;
        // The timer event table is kept as a linked list
        struct aeTimeEvent *next;
    } aeTimeEvent;

    // Fired event
    /* A fired event */
    typedef struct aeFiredEvent {
        int fd;
        int mask;
    } aeFiredEvent;

    // Event loop structure
    /* State of an event based program */
    typedef struct aeEventLoop {
        int maxfd;   /* highest file descriptor currently registered */
        int setsize; /* max number of file descriptors tracked */
        // Largest timer event id so far, plus 1
        long long timeEventNextId;
        // Used to correct for changes in the system time
        time_t lastTime;     /* Used to detect system clock skew */
        // I/O event table
        aeFileEvent *events; /* Registered events */
        // Events that have fired
        aeFiredEvent *fired; /* Fired events */
        // Timer event table
        aeTimeEvent *timeEventHead;
        // Flag that ends the event loop
        int stop;
        // Data specific to the I/O multiplexing backend in use; see
        // each implementation for details
        void *apidata; /* This is used for polling API specific data */
        // Hook to run before each new iteration of the loop
        aeBeforeSleepProc *beforesleep;
    } aeEventLoop;

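The data structures above give a good hint: the event loop structure maintains the I/O event table, the timer event table, and the fired event table.

3.1.2 The event loop center

The redis main function calls initServer(), which initializes the event loop center (EventLoop). Most of that work is done in aeCreateEventLoop().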

    aeEventLoop *aeCreateEventLoop(int setsize) {
        aeEventLoop *eventLoop;
        int i;

        // Allocate the event loop itself
        if ((eventLoop = zmalloc(sizeof(*eventLoop))) == NULL) goto err;
        // Allocate the file event table
        eventLoop->events = zmalloc(sizeof(aeFileEvent)*setsize);
        // Allocate the fired event table
        eventLoop->fired = zmalloc(sizeof(aeFiredEvent)*setsize);
        if (eventLoop->events == NULL || eventLoop->fired == NULL) goto err;
        eventLoop->setsize = setsize;
        eventLoop->lastTime = time(NULL);
        // Head of the time event list
        eventLoop->timeEventHead = NULL;
        // Covered later
        eventLoop->timeEventNextId = 0;
        eventLoop->stop = 0;
        eventLoop->maxfd = -1;
        // Hook to run before the loop goes to sleep; it is set in the
        // redis main() function
        eventLoop->beforesleep = NULL;
        // aeApiCreate() differs for each I/O multiplexing backend, because
        // each backend is initialized differently; see the source code
        if (aeApiCreate(eventLoop) == -1) goto err;
        /* Events with mask == AE_NONE are not set. So let's initialize the
         * vector with it. */
        // Initialize every event mask to the "no event" state
        for (i = 0; i < setsize; i++)
            eventLoop->events[i].mask = AE_NONE;
        return eventLoop;

    err:
        if (eventLoop) {
            zfree(eventLoop->events);
            zfree(eventLoop->fired);
            zfree(eventLoop);
        }
        return NULL;
    }

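The initialization above only produces an empty event center. To drive the event loop, the work below is still needed.

3.2 How the redis event-driven model works

3.2.1 Event registration in detail

Registration of file I/O events is done mainly in aeCreateFileEvent(). aeCreateFileEvent() uses the numeric value of the file descriptor to pick a slot in the I/O event table of the event loop structure, asks the system's I/O multiplexing facility to watch the I/O events of interest, and sets the callback function.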

    int aeCreateFileEvent(aeEventLoop *eventLoop, int fd, int mask,
            aeFileProc *proc, void *clientData)
    {
        if (fd >= eventLoop->setsize) {
            errno = ERANGE;
            return AE_ERR;
        }
        // Pick the slot for this fd in the I/O event table
        aeFileEvent *fe = &eventLoop->events[fd];

        // aeApiAddEvent() is called only from this function; it differs
        // for each I/O multiplexing implementation
        if (aeApiAddEvent(eventLoop, fd, mask) == -1)
            return AE_ERR;
        fe->mask |= mask;
        // Set the callback functions
        if (mask & AE_READABLE) fe->rfileProc = proc;
        if (mask & AE_WRITABLE) fe->wfileProc = proc;
        fe->clientData = clientData;
        if (fd > eventLoop->maxfd)
            eventLoop->maxfd = fd;
        return AE_OK;
    }

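For each I/O multiplexing facility, such as epoll, select, and kqueue, redis has its own implementation, but the interface is the same, for example aeApiAddEvent().

3.2.2 Preparing to listen

initServer() calls aeCreateEventLoop() to initialize the event center, and it also prepares the listening sockets.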

    /* Open the TCP listening socket for the user commands. */
    // listenToPort() is where listen() gets called
    if (server.port != 0 &&
        listenToPort(server.port,server.ipfd,&server.ipfd_count) == REDIS_ERR)
        exit(1);

    // Unix domain socket
    /* Open the listening Unix domain socket. */
    if (server.unixsocket != NULL) {
        unlink(server.unixsocket); /* don't care if this fails */
        server.sofd = anetUnixServer(server.neterr,server.unixsocket,
            server.unixsocketperm);
        if (server.sofd == ANET_ERR) {
            redisLog(REDIS_WARNING, "Opening socket: %s", server.neterr);
            exit(1);
        }
    }

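As shown above, redis offers two ways of working: TCP and Unix domain sockets. Taking TCP as the example, listenToPort() creates and binds the socket and starts listening.

3.2.3 Registering events for the listening sockets

Some preparation is still needed before entering the event loop. Right after that, initServer() registers a read event for every listening socket; the handler is acceptTcpHandler() or acceptUnixHandler().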

    // Create the event handlers that accept TCP or Unix domain
    // socket connections
    // TCP
    /* Create an event handler for accepting new connections in TCP and Unix
     * domain sockets. */
    for (j = 0; j < server.ipfd_count; j++) {
        // acceptTcpHandler() handles incoming TCP connections
        if (aeCreateFileEvent(server.el, server.ipfd[j], AE_READABLE,
            acceptTcpHandler,NULL) == AE_ERR)
        {
            redisPanic(
                "Unrecoverable error creating server.ipfd file event.");
        }
    }
    // Unix domain socket
    if (server.sofd > 0 && aeCreateFileEvent(server.el,server.sofd,AE_READABLE,
        acceptUnixHandler,NULL) == AE_ERR)
        redisPanic("Unrecoverable error creating server.sofd file event.");

Let's see what acceptTcpHandler() does:

    // Handler that accepts TCP connections
    void acceptTcpHandler(aeEventLoop *el, int fd, void *privdata, int mask) {
        int cport, cfd;
        char cip[REDIS_IP_STR_LEN];
        REDIS_NOTUSED(el);
        REDIS_NOTUSED(mask);
        REDIS_NOTUSED(privdata);

        // Accept the client connection
        cfd = anetTcpAccept(server.neterr, fd, cip, sizeof(cip), &cport);
        // Error
        if (cfd == AE_ERR) {
            redisLog(REDIS_WARNING,"Accepting client connection: %s", server.neterr);
            return;
        }
        // Log it
        redisLog(REDIS_VERBOSE,"Accepted %s:%d", cip, cport);
        // This is where it gets interesting
        acceptCommonHandler(cfd,0);
    }

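After the accepted socket is connected to the client, acceptCommonHandler() is called. Its main work is:

1. Create and store the connection information between the server and the client; this information is kept in a struct redisClient.
2. Register a read event on the socket connected to the client. The callback is readQueryFromClient(), which reads data from the socket, performs the requested operation, and replies to the client.

3.2.4 The event loop

With the preparation done, redis can enter the event loop. Control leaves initServer() and returns to main(), which calls aeMain(). The event loop itself happens inside aeProcessEvents():

1. Compute the shortest time to wait from the timer event table.
2. Call the redis api aeApiPoll() to poll for events; if no event occurs the process goes to sleep. This is in effect a call to an I/O multiplexing function such as select() or epoll_wait().
3. When an event occurs the process is woken up and handles the fired I/O events and timer events.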
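Step 1, computing the shortest wait from the timer table, can be sketched as a scan over a linked list of deadlines followed by a conversion to the struct timeval that select() expects. This is a minimal sketch under stated assumptions: the `timer` type and the names `msUntilNearest` and `toTimeval` are hypothetical, deadlines are kept as a single absolute millisecond value rather than the when_sec/when_ms pair shown earlier, and the list is assumed to be unsorted.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical timer node: an absolute deadline in milliseconds. */
typedef struct timer {
    long long when_ms;
    struct timer *next;
} timer;

/* Returns how many milliseconds remain until the earliest deadline,
 * 0 if that deadline has already passed, or -1 if there are no timers
 * (meaning the caller may block indefinitely). */
long long msUntilNearest(const timer *head, long long now_ms) {
    const timer *nearest = NULL;
    for (const timer *t = head; t != NULL; t = t->next)
        if (nearest == NULL || t->when_ms < nearest->when_ms)
            nearest = t;
    if (nearest == NULL) return -1;
    long long wait = nearest->when_ms - now_ms;
    return wait > 0 ? wait : 0;
}

/* Converts a wait in milliseconds to the timeout argument of select().
 * Returns tv filled in, or NULL for "no timeout" when wait_ms < 0. */
struct timeval *toTimeval(long long wait_ms, struct timeval *tv) {
    if (wait_ms < 0) return NULL;
    tv->tv_sec = wait_ms / 1000;
    tv->tv_usec = (wait_ms % 1000) * 1000;
    return tv;
}
```

Passing the resulting pointer as the last argument of select() makes the process wake up either when a descriptor is ready or when the nearest timer is due, whichever comes first.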

    void aeMain(aeEventLoop *eventLoop) {
        eventLoop->stop = 0;
        while (!eventLoop->stop) {
            // The loop may go to sleep while waiting for events. Before
            // sleeping, run the hook installed with aeSetBeforeSleepProc().
            if (eventLoop->beforesleep != NULL)
                eventLoop->beforesleep(eventLoop);
            // AE_ALL_EVENTS means: process every kind of event
            aeProcessEvents(eventLoop, AE_ALL_EVENTS);
        }
    }

»» TODO: decide whether to add more code here

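3.2.5 Event firing

This section uses the select version of the redis api as the example. aeApiPoll() calls select() to poll for events. Its tvp parameter is the minimum wait time, which is computed in advance. aeApiPoll() mainly does the following:

1. Copy the read and write fdsets. A call to select() overwrites the fdsets passed in, so there are actually two copies: one kept as the master and one used for the call. Before every call to select() the working copy is refreshed from the master.
2. Call select().
3. After waking up, check each file descriptor in the fdsets and record the readable or writable ones in the fired table.

What follows is running the corresponding callbacks. The code was shown in an earlier section: I/O events are handled first, then timer events.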

    static int aeApiPoll(aeEventLoop *eventLoop, struct timeval *tvp) {
        aeApiState *state = eventLoop->apidata;
        int retval, j, numevents = 0;

        /* An interesting detail. In the aeApiState structure:
           typedef struct aeApiState {
               fd_set rfds, wfds;
               fd_set _rfds, _wfds;
           } aeApiState;
           select() is called with _rfds and _wfds, while the full set of
           watched descriptors lives in rfds and wfds. Before each call to
           select(), rfds and wfds are copied into _rfds and _wfds. */
        memcpy(&state->_rfds,&state->rfds,sizeof(fd_set));
        memcpy(&state->_wfds,&state->wfds,sizeof(fd_set));

        retval = select(eventLoop->maxfd+1,
                    &state->_rfds,&state->_wfds,NULL,tvp);
        if (retval > 0) {
            // Scan every descriptor
            for (j = 0; j <= eventLoop->maxfd; j++) {
                int mask = 0;
                aeFileEvent *fe = &eventLoop->events[j];

                if (fe->mask == AE_NONE) continue;
                if (fe->mask & AE_READABLE && FD_ISSET(j,&state->_rfds))
                    mask |= AE_READABLE;
                if (fe->mask & AE_WRITABLE && FD_ISSET(j,&state->_wfds))
                    mask |= AE_WRITABLE;
                // Add to the fired event table
                eventLoop->fired[numevents].fd = j;
                eventLoop->fired[numevents].mask = mask;
                numevents++;
            }
        }
        return numevents;
    }
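The reason for keeping two fd_set copies can be seen in a small stand-alone sketch. select() overwrites the sets it is given with the descriptors that are ready, so a set that is passed in directly loses its registrations after the first call. The sketch below is not the redis code: the `pollState` type and the `pollInit`, `pollAdd`, and `pollReadable` names are assumptions for illustration, and only the read set is handled.

```c
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical, read-only analogue of aeApiState. */
typedef struct pollState {
    fd_set rfds;    /* master set: every fd we want to watch */
    fd_set _rfds;   /* working copy handed to select() */
    int maxfd;
} pollState;

void pollInit(pollState *s) {
    FD_ZERO(&s->rfds);
    FD_ZERO(&s->_rfds);
    s->maxfd = -1;
}

void pollAdd(pollState *s, int fd) {
    FD_SET(fd, &s->rfds);
    if (fd > s->maxfd) s->maxfd = fd;
}

/* Waits up to timeout_ms for activity. Returns 1 if fd is readable, else 0. */
int pollReadable(pollState *s, int fd, int timeout_ms) {
    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    /* Refresh the working copy; select() will overwrite it. */
    memcpy(&s->_rfds, &s->rfds, sizeof(fd_set));
    int n = select(s->maxfd + 1, &s->_rfds, NULL, NULL, &tv);
    return n > 0 && FD_ISSET(fd, &s->_rfds);
}
```

After a call that times out, the working copy is empty, yet the next call still watches the descriptor because the master set was never touched.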

3.3 Event-driven models of redis and memcache compared

»»»»>

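3.4 Summary

The redis event-driven model can be summarized as follows:

1. Initialize the event loop structure.
2. Register read events for the listening sockets.
3. Register timer events.
4. Enter the event loop.
5. If a listening socket becomes readable, accept the client connection and register a read event for the new socket.
6. If a socket connected to a client becomes readable, perform the requested operation.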

Part II

redis basic data structures

Chapter 4

The redis data structure redisObject

redis is a key-value store. The key is generally a string, while the value is a redis object. A redis object can be bound to data of various types, such as string, list, and set.

    typedef struct redisObject {
        // The bit fields below add up to exactly 32 bits
        // Type of the object: string/list/set/hash table
        unsigned type:4;
        // Two unused bits
        unsigned notused:2;     /* Not used */
        // Encoding. To save space, redis offers several ways to store
        // a value. For example, "123456789" is stored as the integer
        // 123456789
        unsigned encoding:4;
        // Used when memory is tight and data has to be evicted
        unsigned lru:22;        /* lru time (relative to server.lruclock) */
        // Reference count
        int refcount;
        // Pointer to the data
        void *ptr;
    } robj;

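redis defines struct redisObject, a simple and well-designed data structure, because it keeps the attributes of the data separate from the data itself. The attributes are the data type, the storage encoding, the eviction clock, and the reference count. Each is covered in turn below.

Data type: marks what type of data the redis object is bound to. The possible values are: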

    /* Object types */
    #define REDIS_STRING 0
    #define REDIS_LIST 1
    #define REDIS_SET 2
    #define REDIS_ZSET 3
    #define REDIS_HASH 4

Storage encoding: one piece of data can be stored in several ways. For example, data of type REDIS_SET may be encoded as REDIS_ENCODING_HT or as REDIS_ENCODING_INTSET.

    /* Objects encoding. Some kind of objects like Strings and Hashes can be
     * internally represented in multiple ways. The 'encoding' field of the object
     * is set to one of this fields for this object. */
    #define REDIS_ENCODING_RAW 0        /* Raw representation */
    #define REDIS_ENCODING_INT 1        /* Encoded as integer */
    #define REDIS_ENCODING_HT 2         /* Encoded as hash table */
    #define REDIS_ENCODING_ZIPMAP 3     /* Encoded as zipmap */
    #define REDIS_ENCODING_LINKEDLIST 4 /* Encoded as regular linked list */
    #define REDIS_ENCODING_ZIPLIST 5    /* Encoded as ziplist */
    #define REDIS_ENCODING_INTSET 6     /* Encoded as intset */
    #define REDIS_ENCODING_SKIPLIST 7   /* Encoded as skiplist */

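Eviction clock: redis tracks the memory used by the data set in "real time". When the limit is exceeded, it evicts stale data.

Reference count: a redis object may be referenced by several pointers. Whenever a reference is added or dropped, the corresponding function must be called; programmers have to follow this rule.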

    // Add a reference to a redis object
    void incrRefCount(robj *o) {
        o->refcount++;
    }

    // Drop a reference to a redis object. When the count reaches
    // zero the object is destroyed
    void decrRefCount(robj *o) {
        if (o->refcount <= 0) redisPanic("decrRefCount against refcount <= 0");
        // If this is the last reference, release the resources
        if (o->refcount == 1) {
            // Each data type is destroyed in its own way
            switch(o->type) {
            case REDIS_STRING: freeStringObject(o); break;
            case REDIS_LIST: freeListObject(o); break;
            case REDIS_SET: freeSetObject(o); break;
            case REDIS_ZSET: freeZsetObject(o); break;
            case REDIS_HASH: freeHashObject(o); break;
            default: redisPanic("Unknown object type"); break;
            }
            zfree(o);
        } else {
            o->refcount--;
        }
    }

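Because redis works as a single process with a single thread, the operations that add or drop a reference need not be atomic, which is not possible in memcache.

struct redisObject leaves its last field, a pointer, for the actual data.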
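The reference-counting discipline can be tried out on a stripped-down object. This is a minimal sketch, not the redis code: the `obj` type holds only a string payload, the names `objCreate`, `objIncr`, `objDecr`, and `objLive` are hypothetical, and the `liveObjects` counter exists only so the effect of the last decrement can be observed. As in the text, nothing here is atomic, which is acceptable only when a single thread touches the object.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical object: a reference count plus an owned string payload. */
typedef struct obj {
    int refcount;
    char *ptr;
} obj;

static int liveObjects = 0;   /* demonstration only: objects not yet freed */

/* Creates an object with one reference; returns NULL on allocation failure. */
obj *objCreate(const char *s) {
    obj *o = malloc(sizeof(*o));
    if (o == NULL) return NULL;
    o->ptr = malloc(strlen(s) + 1);
    if (o->ptr == NULL) { free(o); return NULL; }
    strcpy(o->ptr, s);
    o->refcount = 1;
    liveObjects++;
    return o;
}

void objIncr(obj *o) { o->refcount++; }

/* Drops one reference; the last drop frees the payload and the object. */
void objDecr(obj *o) {
    if (o->refcount == 1) {
        free(o->ptr);
        free(o);
        liveObjects--;
    } else {
        o->refcount--;
    }
}

int objLive(void) { return liveObjects; }
```

Every holder of the pointer calls objIncr when it takes the object and objDecr when it is done; whoever happens to drop the last reference triggers the release.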

Chapter 5

The redis data structure adlist

It is simply a linked list.

Chapter 6

The redis data structure sds

sds is known as a "hacking string". The hack is that sds stores the length of the string and the remaining free space. sds is implemented in sds.c.

The sds header:

    struct sdshdr {
        int len;
        int free;
        char buf[];
    };

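If a pointer, char *buf, were used instead, allocation would take two steps: one for the structure and one for the buffer that buf points to. Freeing would also take two calls: one for the buffer and one for the structure. With a zero-length character array, both allocation and release drop to a single call, which simplifies memory management.

In addition, the zero-length array char buf[] itself takes up no memory: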

    // With char buf[]
    struct sdshdr s;
    printf("%zu", sizeof(s));
    // 8
    // With char *buf
    struct sdshdr s;
    printf("%zu", sizeof(s));
    // 12 (on a 32-bit build; 16 where pointers are 8 bytes)

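redis does a lot of string manipulation, for example in the APPEND command. With sds, getting the length of a string and getting its remaining space are both O(1), whereas an ordinary C string needs O(N) to find its length.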

    // Returns sdshdr.len
    static inline size_t sdslen(const sds s) {
        struct sdshdr *sh = (void*)(s-(sizeof(struct sdshdr)));
        return sh->len;
    }

    // Returns sdshdr.free
    static inline size_t sdsavail(const sds s) {
        struct sdshdr *sh = (void*)(s-(sizeof(struct sdshdr)));
        return sh->free;
    }

sds.c also implements the string functions that operate on sds, such as allocation, appending, and freeing.
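The single-allocation trick and the O(1) length lookup can be reproduced with a header that ends in a flexible array member. This is a minimal sketch in the spirit of sds, not the code in sds.c: the `hstr` type and the `hstrNew`, `hstrLen`, and `hstrFree` names are hypothetical, and no spare capacity is reserved.

```c
#include <stdlib.h>
#include <string.h>

/* The user-visible handle is a plain char pointer into the allocation. */
typedef char *hstr;

struct hstrhdr {
    int len;      /* bytes in use, not counting the terminator */
    int free;     /* spare bytes after the terminator */
    char buf[];   /* flexible array member: adds nothing to sizeof */
};

/* One malloc covers both the header and the characters. */
hstr hstrNew(const char *init) {
    size_t len = strlen(init);
    struct hstrhdr *sh = malloc(sizeof(struct hstrhdr) + len + 1);
    if (sh == NULL) return NULL;
    sh->len = (int)len;
    sh->free = 0;
    memcpy(sh->buf, init, len + 1);   /* copies the terminator as well */
    return sh->buf;
}

/* O(1): step back from the characters to the header and read len. */
size_t hstrLen(const hstr s) {
    const struct hstrhdr *sh = (const void *)(s - sizeof(struct hstrhdr));
    return (size_t)sh->len;
}

/* One free releases the header and the characters together. */
void hstrFree(hstr s) {
    if (s != NULL) free(s - sizeof(struct hstrhdr));
}
```

Because the handle points at the characters, it can be passed to any function that expects a C string, while the length stays one pointer subtraction away.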

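Chapter 7

The redis data structure dict

7.1 Where redis stores its key-value pairs

redis has several databases, and each database uses a hash table to store its key-value pairs. By default every client uses the first database; a client that needs another one can switch with the select command. redis initializes all the databases when it initializes the server: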

    void initServer() {
        .....
        // Allocate space for the databases
        server.db = zmalloc(sizeof(redisDb)*server.dbnum);
        .....
        // Initialize the redis databases
        /* Create the Redis databases, and initialize other internal state. */
        for (j = 0; j < server.dbnum; j++) { // initialize each database
            // Hash table that stores the key-value pairs
            server.db[j].dict = dictCreate(&dbDictType,NULL);
            // Hash table that stores the expire time of each key
            server.db[j].expires = dictCreate(&keyptrDictType,NULL);
            .....
        }
        .....
    }

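7.2 The hash table dict

Each database is a hash table, and the data really is stored in it, with collisions resolved by separate chaining. struct dictht is one hash table. The redis hash table structure struct dict, however, contains two of them, referred to below as the first and the second hash table. redis provides two so that the table can be expanded without interrupting service, which is an interesting part of the design.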

    // Think of it as a linked list node (hint: separate chaining)
    typedef struct dictEntry {
        void *key;
        union {
            void *val;
            uint64_t u64;
            int64_t s64;
        } v;
        struct dictEntry *next;
    } dictEntry;

    // To store many kinds of data, different data inevitably needs
    // different hash functions, key comparison functions, and destructors.
    typedef struct dictType {
        // Hash function
        unsigned int (*hashFunction)(const void *key);
        void *(*keyDup)(void *privdata, const void *key);
        void *(*valDup)(void *privdata, const void *obj);
        // Comparison function
        int (*keyCompare)(void *privdata, const void *key1, const void *key2);
        // Key and value destructors
        void (*keyDestructor)(void *privdata, void *key);
        void (*valDestructor)(void *privdata, void *obj);
    } dictType;

    // A plain hash table
    /* This is our hash table structure. Every dictionary has two of this as we
     * implement incremental rehashing, for the old to the new table. */
    typedef struct dictht {
        // Array of buckets
        dictEntry **table;
        // Size of the hash table
        unsigned long size;
        // Size mask of the hash table
        unsigned long sizemask;
        // Number of entries in the hash table
        unsigned long used;
    } dictht;

    // The hash table (dictionary) structure. All redis key-value pairs
    // are stored here. It contains two hash tables.
    typedef struct dict {
        // Type of the hash table: hash function, comparison function,
        // and the functions that free keys and values
        dictType *type;
        // Extra data
        void *privdata;
        // The two hash tables
        dictht ht[2];
        // Rehash index: the bucket index in the hash array
        int rehashidx; /* rehashing not in progress if rehashidx == -1 */
        // Number of iterators bound to the hash table
        int iterators; /* number of iterators currently running */
    } dict;

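7.3 Expanding the hash table

redis gives every database two hash tables so that the table can be expanded without interrupting service. The usual way to expand a hash table is to allocate space for a new table, recompute the hash of every entry in the old table, and move the entries over. If the old table holds a lot of data, all that computation takes a long time.

redis's approach is a little clever: allocate new space for the second hash table, sized at twice the number of key-value pairs in the original table (yes, really), then move the data from the first table to the second step by step. When the move is complete, the second table is assigned to the first and the second is reset to empty. During this process the data is spread across both tables, so every CRUD operation has to consider both.

Moving the data from the first hash table to the second is called rehashing.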
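The two-table scheme can be sketched with a toy dictionary over unsigned integer keys. This is a minimal sketch under several assumptions, not the redis dict: all type and function names are hypothetical, the bucket is chosen with a plain modulo instead of a hash function and size mask, the new table is sized at exactly twice the entry count, duplicate keys are not checked, and nothing is ever freed except the old bucket array.

```c
#include <stdlib.h>

typedef struct entry { unsigned key; int val; struct entry *next; } entry;
typedef struct table { entry **slots; unsigned long size, used; } table;
typedef struct dict  { table ht[2]; long rehashidx; /* -1: idle */ } dict;

static int tableInit(table *t, unsigned long size) {
    t->slots = calloc(size, sizeof(entry *));
    t->size = size;
    t->used = 0;
    return t->slots ? 0 : -1;
}

static void tableReset(table *t) { t->slots = NULL; t->size = t->used = 0; }

int dictInit(dict *d, unsigned long size) {
    tableReset(&d->ht[1]);
    d->rehashidx = -1;
    return tableInit(&d->ht[0], size);
}

/* Allocates the second table and switches the dict into rehashing mode. */
int dictStartRehash(dict *d) {
    if (d->rehashidx != -1) return -1;
    unsigned long size = d->ht[0].used ? d->ht[0].used * 2 : 4;
    if (tableInit(&d->ht[1], size) != 0) return -1;
    d->rehashidx = 0;
    return 0;
}

/* Moves one bucket from ht[0] to ht[1]. Returns 1 if work remains, else 0. */
int dictRehashStep(dict *d) {
    if (d->rehashidx == -1) return 0;
    if (d->ht[0].used == 0) {            /* everything moved: swap tables */
        free(d->ht[0].slots);
        d->ht[0] = d->ht[1];
        tableReset(&d->ht[1]);
        d->rehashidx = -1;
        return 0;
    }
    while (d->ht[0].slots[d->rehashidx] == NULL) d->rehashidx++;
    entry *e = d->ht[0].slots[d->rehashidx];
    while (e) {
        entry *next = e->next;
        unsigned long h = e->key % d->ht[1].size;
        e->next = d->ht[1].slots[h];     /* insert at the head of the chain */
        d->ht[1].slots[h] = e;
        d->ht[0].used--;
        d->ht[1].used++;
        e = next;
    }
    d->ht[0].slots[d->rehashidx++] = NULL;
    return 1;
}

/* New entries go to ht[1] while rehashing, so ht[0] only ever shrinks. */
int dictAdd(dict *d, unsigned key, int val) {
    table *t = (d->rehashidx != -1) ? &d->ht[1] : &d->ht[0];
    entry *e = malloc(sizeof(*e));
    if (e == NULL) return -1;
    unsigned long h = key % t->size;
    e->key = key; e->val = val; e->next = t->slots[h];
    t->slots[h] = e;
    t->used++;
    return 0;
}

/* Looks in ht[0], and also in ht[1] when a rehash is in progress. */
entry *dictFind(dict *d, unsigned key) {
    for (int i = 0; i < 2; i++) {
        table *t = &d->ht[i];
        if (t->size == 0) continue;
        for (entry *e = t->slots[key % t->size]; e; e = e->next)
            if (e->key == key) return e;
        if (d->rehashidx == -1) break;
    }
    return NULL;
}
```

A caller that invokes dictRehashStep once per operation spreads the cost of the move over many requests, and dictFind stays correct throughout because it consults both tables while the move is under way.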

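7.4 Rehashing

Each CRUD operation performs one step of rehashing, and the server's periodic routine serverCron() performs rehashing for a bounded amount of time. Why rehash during CRUD operations when the periodic routine already does it, or the other way round? One possible reason is that redis hedges its bets: when the server is idle, the periodic routine completes the rehash; when the server is overloaded, more of the rehashing falls on the CRUD path.

Below is the rehash function. Its main job is to pick the singly linked list at one position of the hash table, recompute the hash values, and move the entries to the second hash table.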

    int dictRehash(dict *d, int n) {
        // Rehashing is not in progress: return at once
        if (!dictIsRehashing(d)) return 0;

        while(n--) {
            dictEntry *de, *nextde;

            // The first hash table is empty, so rehashing is complete:
            // assign the second hash table to the first and finish
            /* Check if we already rehashed the whole table... */
            if (d->ht[0].used == 0) {
                zfree(d->ht[0].table);
                d->ht[0] = d->ht[1];
                _dictReset(&d->ht[1]);
                d->rehashidx = -1;
                return 0;
            }

            /* Note that rehashidx can't overflow as we are sure there are more
             * elements because ht[0].used != 0 */
            assert(d->ht[0].size > (unsigned)d->rehashidx);
            // Find the next non-empty position in the hash table
            while(d->ht[0].table[d->rehashidx] == NULL) d->rehashidx++;
            de = d->ht[0].table[d->rehashidx];
            // Move everything at this position to the second hash table
            /* Move all the keys in this bucket from the old to the new hash HT */
            while(de) {
                unsigned int h;

                nextde = de->next;
                /* Get the index in the new hash table */
                // Compute the hash value
                h = dictHashKey(d, de->key) & d->ht[1].sizemask;
                // Insert at the head of the chain
                de->next = d->ht[1].table[h];
                d->ht[1].table[h] = de;
                // Update the entry counts of both tables
                d->ht[0].used--;
                d->ht[1].used++;
                de = nextde;
            }
            // Clear the bucket
            d->ht[0].table[d->rehashidx] = NULL;
            // Advance to the next position in the hash table
            d->rehashidx++;
        }
        return 1;
    }

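7.5 Inefficient add and replace

When redis adds or replaces a key, it first has to check whether the key already exists in the database, which is a lookup. If a single redis command causes too many lookups, efficiency suffers. Perhaps as a trade-off that favors read performance over write performance, adding and replacing data in redis is not very efficient, and replacing is painfully slow.

In the call chain of the redis SET command, adding a key-value pair causes 2 lookups, and replacing one causes up to 4. In the dict implementation both dictFind() and _dictIndex() perform a lookup; see the source for details. So, judging from the source, writing to redis frequently is not a wise choice.

7.6 Iterating over the hash table

Both RDB and AOF persistence need to iterate over the hash table. Traversal itself is not hard, but because each database has two hash tables, the iterator must take care to cover both: when the first table has been fully traversed and rehashing turns out to be still in progress, the iterator must go on to the second table.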

    // Entry point for fetching the next item from an iterator
    dictEntry *dictNext(dictIterator *iter)
    {
        while (1) {
            if (iter->entry == NULL) {
                dictht *ht = &iter->d->ht[iter->table];
                // A fresh iterator
                if (iter->index == -1 && iter->table == 0) {
                    if (iter->safe)
                        iter->d->iterators++;
                    else
                        iter->fingerprint = dictFingerprint(iter->d);
                }
                iter->index++;
                // The index has run past the size of the hash table
                if (iter->index >= (signed) ht->size) {
                    // If a rehash is in progress, redis tries to continue
                    // on the second hash table; otherwise we are done
                    if (dictIsRehashing(iter->d) && iter->table == 0) {
                        // Rehashing means data is being moved from the
                        // first hash table to the second, so switch to
                        // the second hash table
                        iter->table++;
                        iter->index = 0;
                        ht = &iter->d->ht[1];
                    } else {
                        // Otherwise the iteration is complete
                        break;
                    }
                }
                // Fetch the head of the bucket
                iter->entry = ht->table[iter->index];
            } else {
                // Fetch the next entry
                iter->entry = iter->nextEntry;
            }
            // The iterator saves the next entry, because the user may
            // delete the entry returned by this function; without the
            // saved pointer the iterator would lose its place
            if (iter->entry) {
                /* We need to save the 'next' here, the iterator user
                 * may delete the entry we are returning. */
                iter->nextEntry = iter->entry->next;
                return iter->entry;
            }
        }
        return NULL;
    }

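Chapter 8

The redis data structure ziplist

8.1 Overview

In redis a list can be stored in two ways: as a doubly linked list (LinkedList) or as a compressed doubly linked list (ziplist). The doubly linked list is the ordinary textbook structure, implemented in adlist.h and adlist.c. The ziplist represents a doubly linked list in one contiguous block of memory and saves the space of the prev and next pointers (8 bytes). For small lists the saving is very noticeable. The ziplist is implemented in ziplist.h and ziplist.c.

This chapter concentrates on the ziplist; for the ordinary doubly linked list, see other material.

8.2 How the ziplist is implemented

A ziplist saves the space of the prev and next pointers, 8 bytes in total, which packs the data more tightly in memory. As long as the boundary of each entry is described clearly, the position of the next entry is easy to find; as long as the size of the previous entry is recorded, the previous entry can be located. This is exactly what redis does.

The layout of a ziplist can be written as:

    <zlbytes><zltail><zllen><entry>...<entry><zlend>

zlbytes is the space occupied by the ziplist. zltail is the offset of the last entry, which makes reverse traversal easy, as a doubly linked list should. zllen is the number of entries. zlend is simply 255 and takes 1 byte.

Now for the structure of an entry in detail.

An entry follows the typical type-length-value (TLV) format, as follows: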

    |<prelen><<encoding+lensize><len>><data>|
    |---1----------------2--------------3---|

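Field 1) is the size of the previous entry. Since the type of the previous entry need not be described, this field is fairly simple.

Field 2) is the type and data size of this entry. To save space, redis predefines strings and integers of several lengths.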

    // String encodings of different lengths
    #define ZIP_STR_06B (0 << 6)
    #define ZIP_STR_14B (1 << 6)
    #define ZIP_STR_32B (2 << 6)
    // Integer encodings of different lengths
    #define ZIP_INT_16B (0xc0 | 0<<4)
    #define ZIP_INT_32B (0xc0 | 1<<4)
    #define ZIP_INT_64B (0xc0 | 2<<4)
    #define ZIP_INT_24B (0xc0 | 3<<4)
    #define ZIP_INT_8B 0xfe

Field 3) is the actual data.
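The constants above divide the encoding byte into two families by its top two bits: string encodings use 00, 01, or 10, and integer encodings use 11. A short sketch makes the split concrete. The macro values are copied from the listing above and the ZIP_STR_MASK value 0xc0 follows from them; the function names `zipIsStr`, `zipIntSize`, and `zipStr06Len` are hypothetical and are not taken from ziplist.c.

```c
/* Values as listed above. */
#define ZIP_STR_MASK 0xc0
#define ZIP_STR_06B (0 << 6)
#define ZIP_STR_14B (1 << 6)
#define ZIP_STR_32B (2 << 6)
#define ZIP_INT_16B (0xc0 | 0<<4)
#define ZIP_INT_32B (0xc0 | 1<<4)
#define ZIP_INT_64B (0xc0 | 2<<4)
#define ZIP_INT_24B (0xc0 | 3<<4)
#define ZIP_INT_8B 0xfe

/* 1 if the encoding byte describes a string, 0 if it describes an integer:
 * only integer encodings have both of the top two bits set. */
int zipIsStr(unsigned char enc) {
    return (enc & ZIP_STR_MASK) < ZIP_STR_MASK;
}

/* Payload size in bytes for the integer encodings listed above,
 * or 0 for any other byte. */
unsigned zipIntSize(unsigned char enc) {
    switch (enc) {
    case ZIP_INT_8B:  return 1;
    case ZIP_INT_16B: return 2;
    case ZIP_INT_24B: return 3;
    case ZIP_INT_32B: return 4;
    case ZIP_INT_64B: return 8;
    default:          return 0;
    }
}

/* For a ZIP_STR_06B entry the low 6 bits of the same byte hold the length. */
unsigned zipStr06Len(unsigned char enc) {
    return enc & 0x3f;
}
```

Packing the type and a short length into one byte is what lets a small string entry cost a single byte of header beyond the previous-entry size.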

The lookup function ziplistFind() is a good way to get familiar with the data format of a ziplist entry:

    // Find an entry in a ziplist
    /* Find pointer to the entry equal to the specified entry. Skip 'skip' entries
     * between every comparison. Returns NULL when the field could not be found. */
    unsigned char *ziplistFind(unsigned char *p, unsigned char *vstr,
            unsigned int vlen, unsigned int skip) {
        int skipcnt = 0;
        unsigned char vencoding = 0;
        long long vll = 0;

        while (p[0] != ZIP_END) {
            unsigned int prevlensize, encoding, lensize, len;
            unsigned char *q;

            ZIP_DECODE_PREVLENSIZE(p, prevlensize);
            // Skip the previous-entry size and decode this entry's size:
            // len is the size of data
            // lensize is the number of bytes taken up by len
            ZIP_DECODE_LENGTH(p + prevlensize, encoding, lensize, len);
            // q points to data
            q = p + prevlensize + lensize;

            if (skipcnt == 0) {
                /* Compare current entry with specified entry */
                if (ZIP_IS_STR(encoding)) {
                    // String comparison
                    if (len == vlen && memcmp(q, vstr, vlen) == 0) {
                        return p;
                    }
                } else {
                    // Integer comparison
                    /* Find out if the searched field can be encoded. Note that
                     * we do it only the first time, once done vencoding is set
                     * to non-zero and vll is set to the integer value. */
                    if (vencoding == 0) {
                        // Try to parse vstr as an integer
                        if (!zipTryEncoding(vstr, vlen, &vll, &vencoding)) {
                            /* If the entry can't be encoded we set it to
                             * UCHAR_MAX so that we don't retry again the next
                             * time. */
                            // Cannot be encoded as an integer, so integer
                            // entries are skipped for this lookup
                            vencoding = UCHAR_MAX;
                        }
                        /* Must be non-zero by now */
                        assert(vencoding);
                    }

                    /* Compare current entry with specified entry, do it only
                     * if vencoding != UCHAR_MAX because if there is no encoding
                     * possible for the field it can't be a valid integer. */
                    if (vencoding != UCHAR_MAX) {
                        // Load the integer
                        long long ll = zipLoadInteger(q, encoding);
                        if (ll == vll) {
                            return p;
                        }
                    }
                }

                /* Reset skip count */
                skipcnt = skip;
            } else {
                /* Skip entry */
                skipcnt--;
            }

            // Move to the next entry of the ziplist
            /* Move to next entry */
            p = q + len;
        }

        // Not found
        return NULL;
    }


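Chapter 9

redis data structure: skiplist

9.1 Overview

A skip list (skiplist) is a special kind of linked list. Compared with an ordinary linked list it offers much faster lookups, with efficiency comparable to a binary search tree.

The figure below shows a skip list and a search through it:

In the figure we are looking for 68. Thanks to the skip list structure, the search shown makes only a small number of comparisons: horizontal arrows involve no comparison, vertical arrows do. The skip list keeps, at regular intervals, extra references to nodes of the ordered list, so a lookup behaves much like a binary search. The idea of binary search is to compare against the midpoint and discard the other half, which saves half of the remaining search time at every step.

The drawback is the extra space. Trading space for time is an age-old compromise.

A short aside: the skip list was proposed by William Pugh in 1990, while the red-black tree was invented by Rudolf Bayer as early as 1972. Red-black trees are slightly ahead of skip lists in space and time efficiency, but skip lists are comparatively simple to implement, which has won programmers over. Both redis and leveldb use skip lists.

This chapter uses the redis source code to look at how a skip list is implemented.

9.2 The skip list data structure

From the figure above, the properties of a skip list can be summarized as:

1. It consists of many levels.

2. Every level is an ordered linked list.

3. The lowest level (Level 1) contains all elements.

4. If an element appears in the list at Level i, it also appears in every list below Level i.

5. Every node holds two pointers: one to the next element in the same list and one to the element on the level below.

The skip list data structures are defined in redis as: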

    // skip list node
    /* ZSETs use a specialized version of Skiplists */
    typedef struct zskiplistNode {
        // node data
        robj *obj;
        // score (think of a game score); nodes are ordered by score
        double score;
        // backward pointer
        struct zskiplistNode *backward;
        // per-level array of forward pointers
        struct zskiplistLevel {
            struct zskiplistNode *forward;
            // number of steps needed to reach the next node on this level
            unsigned int span;
        } level[];
    } zskiplistNode;

    typedef struct zskiplist {
        // head and tail pointers of the skip list
        struct zskiplistNode *header, *tail;
        // length of the skip list
        unsigned long length;
        // height of the skip list
        int level;
    } zskiplist;

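Note that in the figure above every value seems to be stored several times, but it is in fact stored only once. In struct zskiplistNode the data and the pointers are stored separately, and struct zskiplistLevel is the structure that describes one level of a node.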
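The lookup that the structure enables can be sketched as follows. This is a simplified illustration, not the redis code: the type `node`, the fixed `MAXLEVEL`, and the helpers `sl_find` and `sl_demo` are all hypothetical, nodes carry only a score, and the demo list is wired by hand instead of being built with random levels.

```c
#include <assert.h>
#include <stddef.h>

#define MAXLEVEL 4   /* assumed small fixed height for the sketch */

typedef struct node {
    double score;
    struct node *forward[MAXLEVEL];  /* forward[i]: next node on level i */
} node;

/* Start on the highest level, move right while the next score is smaller,
 * then drop one level. Returns the matching node or NULL. */
static node *sl_find(node *header, int level, double score) {
    node *x = header;
    for (int i = level - 1; i >= 0; i--) {
        while (x->forward[i] && x->forward[i]->score < score)
            x = x->forward[i];
    }
    x = x->forward[0];
    return (x && x->score == score) ? x : NULL;
}

static node g_header, g_n[4];

/* Hand-built two-level list: level 0 holds 10,20,30,40; level 1 holds 20,40. */
static node *sl_demo(void) {
    static const double scores[4] = {10, 20, 30, 40};
    for (int j = 0; j < MAXLEVEL; j++) g_header.forward[j] = NULL;
    for (int i = 0; i < 4; i++) {
        g_n[i].score = scores[i];
        for (int j = 0; j < MAXLEVEL; j++) g_n[i].forward[j] = NULL;
    }
    g_header.forward[0] = &g_n[0];
    for (int i = 0; i < 3; i++) g_n[i].forward[0] = &g_n[i + 1];
    g_header.forward[1] = &g_n[1];
    g_n[1].forward[1] = &g_n[3];
    return &g_header;
}
```

Searching for 30 in the demo list first jumps from the header to 20 on level 1, then drops to level 0 and lands on 30, skipping 10 entirely.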

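9.3 Skip list insertion

The insertion algorithm works as follows. On every level, find and save the predecessor of the position where the new element will go; in redis the position is decided by comparing score and then member (see the redis ZADD command if this is unfamiliar). Then insert the new element at that position and adjust the pointers; redis additionally adjusts span.

What is span?

span is the distance, counted in nodes, between two neighboring nodes on a level. For example, on level 1 the span of node -1 is 1; on a higher level the span of -1 is 2.

Because the number of levels of a newly inserted node is random, there are two cases: (1) it is less than or equal to the current number of levels; (2) it is greater than the current number of levels. The two have to be handled differently.

1) Less than or equal to the current number of levels

The concrete implementation of skip list insertion in redis: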

    zskiplistNode *zslInsert(zskiplist *zsl, double score, robj *obj) {
        zskiplistNode *update[ZSKIPLIST_MAXLEVEL], *x;
        unsigned int rank[ZSKIPLIST_MAXLEVEL];
        int i, level;

        redisAssert(!isnan(score));
        x = zsl->header;
        // walk every level of the skiplist, find where the new element
        // goes, and remember the predecessor in update
        for (i = zsl->level-1; i >= 0; i--) {
            /* store rank that is crossed to reach the insert position */
            rank[i] = i == (zsl->level-1) ? 0 : rank[i+1];
            // linked-list search on this level
            while (x->level[i].forward &&
                (x->level[i].forward->score < score ||
                    (x->level[i].forward->score == score &&
                    compareStringObjects(x->level[i].forward->obj,obj) < 0))) {
                rank[i] += x->level[i].span;
                x = x->level[i].forward;
            }
            // update[i] records the predecessor of the new element
            update[i] = x;
        }

        // pick a random level
        /* we assume the key is not already inside, since we allow duplicated
         * scores, and the re-insertion of score and redis object should never
         * happen since the caller of zslInsert() should test in the hash table
         * if the element is already inside or not. */
        level = zslRandomLevel();
        // the random level exceeds zsl->level: grow the skiplist
        if (level > zsl->level) {
            for (i = zsl->level; i < level; i++) {
                rank[i] = 0;
                update[i] = zsl->header;
                update[i]->level[i].span = zsl->length;
            }
            zsl->level = level;
        }

        // insert
        x = zslCreateNode(level,score,obj);
        for (i = 0; i < level; i++) {
            // link the new node right after update[i]
            x->level[i].forward = update[i]->level[i].forward;
            update[i]->level[i].forward = x;

            /* update span covered by update[i] as x is inserted here */
            x->level[i].span = update[i]->level[i].span - (rank[0] - rank[i]);
            update[i]->level[i].span = (rank[0] - rank[i]) + 1;
        }

        // the higher levels have not had their span adjusted yet
        /* increment span for untouched levels */
        for (i = level; i < zsl->level; i++) {
            update[i]->level[i].span++;
        }

        // set the backward pointer of the new node
        x->backward = (update[0] == zsl->header) ? NULL : update[0];
        if (x->level[0].forward)
            x->level[0].forward->backward = x;
        else
            zsl->tail = x;

        // adjust the length of the skiplist
        zsl->length++;
        return x;
    }
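The rank[] bookkeeping above is easier to follow with a small example of what span is for: summing the spans crossed during a search yields the 1-based position of a node. The sketch below is an assumption-laden illustration (hypothetical `snode`, `sl_rank` and `span_demo`, scores only, a hand-wired list), not the redis rank function.

```c
#include <assert.h>
#include <stddef.h>

#define MAXLEVEL 2   /* assumed fixed height for the sketch */

typedef struct snode {
    double score;
    struct {
        struct snode *forward;
        unsigned int span;   /* nodes crossed when following forward */
    } level[MAXLEVEL];
} snode;

/* Returns the 1-based rank of score, or 0 when it is not present. */
static unsigned long sl_rank(snode *header, int levels, double score) {
    snode *x = header;
    unsigned long rank = 0;
    for (int i = levels - 1; i >= 0; i--) {
        while (x->level[i].forward && x->level[i].forward->score <= score) {
            rank += x->level[i].span;
            x = x->level[i].forward;
        }
        if (x != header && x->score == score) return rank;
    }
    return 0;
}

static snode s_header, s_n[4];

/* Level 0: 10,20,30,40 with span 1 each; level 1: 20,40 with span 2 each. */
static snode *span_demo(void) {
    static const double scores[4] = {10, 20, 30, 40};
    for (int i = 0; i < 4; i++) {
        s_n[i].score = scores[i];
        s_n[i].level[0].forward = (i < 3) ? &s_n[i + 1] : NULL;
        s_n[i].level[0].span = (i < 3) ? 1 : 0;
        s_n[i].level[1].forward = NULL;
        s_n[i].level[1].span = 0;
    }
    s_header.level[0].forward = &s_n[0];
    s_header.level[0].span = 1;
    s_header.level[1].forward = &s_n[1];
    s_header.level[1].span = 2;
    s_n[1].level[1].forward = &s_n[3];
    s_n[1].level[1].span = 2;
    return &s_header;
}
```

Looking up 30 crosses a span of 2 on level 1 (header to 20) and a span of 1 on level 0 (20 to 30), giving rank 3.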

9.4 Skip list deletion

Deletion follows steps similar to insertion: find and save the predecessor of the node to delete on every level, then adjust the pointers; redis additionally adjusts span.

The concrete implementation of skip list deletion in redis:

    // x is the node to delete
    // update holds the predecessor of x on every level
    /* Internal function used by zslDelete, zslDeleteByScore and zslDeleteByRank */
    void zslDeleteNode(zskiplist *zsl, zskiplistNode *x, zskiplistNode **update) {
        int i;

        // adjust span and forward pointers
        for (i = 0; i < zsl->level; i++) {
            if (update[i]->level[i].forward == x) {
                update[i]->level[i].span += x->level[i].span - 1;
                update[i]->level[i].forward = x->level[i].forward;
            } else {
                // x is not linked on this level, so only span changes
                update[i]->level[i].span -= 1;
            }
        }

        // adjust the backward pointer
        if (x->level[0].forward) {
            x->level[0].forward->backward = x->backward;
        } else {
            zsl->tail = x->backward;
        }

        // removing a node may lower the height; adjust level
        while (zsl->level > 1 && zsl->header->level[zsl->level-1].forward == NULL)
            zsl->level--;

        // adjust the length of the skip list
        zsl->length--;
    }

9.5 The skip list in redis

redis combines a skip list (skiplist) and a hash table (dict) into a new data structure, zset. The dict is there so that redis can quickly tell whether a given member exists in the skip list.

    typedef struct zset {
        dict *dict;
        zskiplist *zsl;
    } zset;

9.6 Where redis uses the skiplist

The ZXX commands operate on sorted sets, for example:

ZADD

ZCARD

ZCOUNT

ZINCRBY

ZINTERSTORE

ZLEXCOUNT

ZRANGE

ZRANGEBYLEX

ZRANGEBYSCORE

ZRANK

ZREM

ZREMRANGEBYLEX

ZREMRANGEBYRANK

ZREMRANGEBYSCORE

ZREVRANGE

ZREVRANGEBYSCORE

ZREVRANK

ZSCAN

ZSCORE

ZUNIONSTORE

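Chapter 10

redis data structure: intset

intset and dict are both underlying data structures of the sadd command. When every added value is an integer, the former is used; otherwise the latter. In particular, once an added value is a string that cannot be represented as an integer, redis converts the structure to a dict, moving all data from the intset into the dict.

This chapter covers intset; for dict, see the earlier article "An in-depth look at the redis data structure dict".

10.1 The intset struct

An intset is essentially a sorted array of integers without duplicates, and it supports integers of different widths.

    typedef struct intset {
        // type of every integer
        uint32_t encoding;
        // length of the intset
        uint32_t length;
        // integer array
        int8_t contents[];
    } intset;

encoding takes one of the following three values, for 16-bit, 32-bit and 64-bit integers:

    /* Note that these encodings are ordered, so:
     * INTSET_ENC_INT16 < INTSET_ENC_INT32 < INTSET_ENC_INT64. */
    #define INTSET_ENC_INT16 (sizeof(int16_t))
    #define INTSET_ENC_INT32 (sizeof(int32_t))
    #define INTSET_ENC_INT64 (sizeof(int64_t))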
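Because the three encodings are ordered, choosing one for a value comes down to two range checks. The sketch below shows the idea; the function name `value_encoding` is hypothetical and stands in for whatever helper the real code uses.

```c
#include <assert.h>
#include <stdint.h>

#define INTSET_ENC_INT16 (sizeof(int16_t))
#define INTSET_ENC_INT32 (sizeof(int32_t))
#define INTSET_ENC_INT64 (sizeof(int64_t))

/* Hypothetical helper: smallest encoding able to hold v. */
static uint8_t value_encoding(int64_t v) {
    if (v < INT32_MIN || v > INT32_MAX)
        return INTSET_ENC_INT64;
    if (v < INT16_MIN || v > INT16_MAX)
        return INTSET_ENC_INT32;
    return INTSET_ENC_INT16;
}
```

A value such as 40000 no longer fits in 16 bits, so it selects the 32-bit encoding; an intset that currently uses 16-bit entries would have to be upgraded before storing it.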

10.2 intset search

An intset is a sorted integer array, so it can be searched with binary search.

    static uint8_t intsetSearch(intset *is, int64_t value, uint32_t *pos) {
        int min = 0, max = intrev32ifbe(is->length)-1, mid = -1;
        int64_t cur = -1;

        /* The value can never be found when the set is empty */
        // the set is empty
        if (intrev32ifbe(is->length) == 0) {
            if (pos) *pos = 0;
            return 0;
        } else {
            /* Check for the case where we know we cannot find the value,
             * but do know the insert position. */
            // value is larger than the largest element
            if (value > _intsetGet(is,intrev32ifbe(is->length)-1)) {
                if (pos) *pos = intrev32ifbe(is->length);
                return 0;
            // value is smaller than the smallest element
            } else if (value < _intsetGet(is,0)) {
                if (pos) *pos = 0;
                return 0;
            }
        }

        // binary search
        while (max >= min) {
            mid = (min+max)/2;
            cur = _intsetGet(is,mid);
            if (value > cur) {
                min = mid+1;
            } else if (value < cur) {
                max = mid-1;
            } else {
                break;
            }
        }

        if (value == cur) {
            if (pos) *pos = mid;
            return 1;
        } else {
            if (pos) *pos = min;
            return 0;
        }
    }

10.3 intset insertion

The most interesting part of the intset implementation is the insertion algorithm.

    /* Insert an integer in the intset */
    intset *intsetAdd(intset *is, int64_t value, uint8_t *success) {
        uint8_t valenc = _intsetValueEncoding(value);
        uint32_t pos;
        if (success) *success = 1;

        /* Upgrade encoding if necessary. If we need to upgrade, we know that
         * this value should be either appended (if > 0) or prepended (if < 0),
         * because it lies outside the range of existing values. */
        if (valenc > intrev32ifbe(is->encoding)) {
            // the new integer needs more bytes than the integers already in
            // the set, i.e. the storage type differs, so upgrade the type
            /* This always succeeds, so we don't need to curry *success. */
            return intsetUpgradeAndAdd(is,value);
        // normal case: allocate memory and insert
        } else {
            // an intset does not allow duplicates
            /* Abort if the value is already present in the set.
             * This call will populate "pos" with the right position to insert
             * the value when it cannot be found. */
            if (intsetSearch(is,value,&pos)) {
                if (success) *success = 0;
                return is;
            }

            // realloc
            is = intsetResize(is,intrev32ifbe(is->length)+1);
            // shift memory to make room for the new value;
            // intsetMoveTail() does the move
            if (pos < intrev32ifbe(is->length)) intsetMoveTail(is,pos,pos+1);
        }

        // store the new value in the freed slot
        _intsetSet(is,pos,value);
        // update the intset size
        is->length = intrev32ifbe(intrev32ifbe(is->length)+1);
        return is;
    }

    // upgrade the integer type, e.g. short -> int; called when the value
    // being inserted needs more bytes than the existing values
    /* Upgrades the intset to a larger encoding and inserts the given integer. */
    static intset *intsetUpgradeAndAdd(intset *is, int64_t value) {
        uint8_t curenc = intrev32ifbe(is->encoding);
        uint8_t newenc = _intsetValueEncoding(value);
        int length = intrev32ifbe(is->length);
        // value < 0: insert at the head; value > 0: insert at the tail
        int prepend = value < 0 ? 1 : 0;

        // realloc
        /* First set new encoding and resize */
        is->encoding = intrev32ifbe(newenc);
        is = intsetResize(is,intrev32ifbe(is->length)+1);

        // work backwards so no value is overwritten, as in an ordinary
        // insertion sort step
        /* Upgrade back-to-front so we don't overwrite values.
         * Note that the "prepend" variable is used to make sure we have an empty
         * space at either the beginning or the end of the intset. */
        while (length--)
            _intsetSet(is,length+prepend,_intsetGetEncoded(is,length,curenc));

        // value < 0 goes to the head of the set, otherwise to the tail.
        // This function only runs when the storage type is upgraded, which
        // means value is either the largest or the smallest in the set.
        /* Set the value at the beginning or the end. */
        if (prepend)
            _intsetSet(is,0,value);
        else
            _intsetSet(is,intrev32ifbe(is->length),value);

        // update the set size
        is->length = intrev32ifbe(intrev32ifbe(is->length)+1);
        return is;
    }

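Part III

redis internals

Chapter 11

redis data eviction

11.1 Overview

redis lets the user cap memory usage with server.maxmemory, which is useful when memory is limited. For example, on an 8G machine running 4 redis instances, giving each instance 1.5G reduces memory pressure and makes the service more stable.

When the in-memory data set grows past that size, redis applies a data eviction policy. redis offers 6 policies:

1. volatile-lru: from the keys with an expire time set (server.db[i].expires), evict the least recently used

2. volatile-ttl: from the keys with an expire time set (server.db[i].expires), evict the ones about to expire

3. volatile-random: from the keys with an expire time set (server.db[i].expires), evict at random

4. allkeys-lru: from the whole data set (server.db[i].dict), evict the least recently used

5. allkeys-random: from the whole data set (server.db[i].dict), evict at random

6. noeviction: never evict data

Once redis decides to evict a key-value pair, it deletes the data and publishes the change locally (AOF persistence) and to the slaves (master-slave links).

11.2 LRU eviction

The server configuration holds an lru counter, server.lruclock, which is updated periodically by the redis timer routine serverCron(). The value of server.lruclock is computed from server.unixtime.

In addition, struct redisObject shows that every redis object carries its own lru field. As you would expect, redisObject.lru is updated every time the data is accessed.

LRU eviction works like this: pick a few key-value pairs at random from the data set and evict the one that has been idle the longest, i.e. the one with the oldest lru value. So redis does not guarantee that it evicts the least recently used (LRU) pair of the whole data set, only the least recently used of the few pairs it sampled.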

    // redisServer holds the lru counter
    struct redisServer {
        ...
        unsigned lruclock:22;  /* Clock incrementing every minute, for LRU */
        ...
    };

    // every redis object holds an lru value
    #define REDIS_LRU_CLOCK_MAX ((1<<21)-1) /* Max value of obj->lru */
    #define REDIS_LRU_CLOCK_RESOLUTION 10 /* LRU clock resolution in seconds */
    typedef struct redisObject {
        // the bit fields add up to exactly 32 bits
        // object type: string / list / set / hash
        unsigned type:4;
        // two unused bits
        unsigned notused:2;  /* Not used */
        // encoding: to save space redis can store a value in several ways,
        // e.g. "123456789" is stored as the integer 123456789
        unsigned encoding:4;
        unsigned lru:22;  /* lru time (relative to server.lruclock) */
        // reference count
        int refcount;
        // data pointer
        void *ptr;
    } robj;

    // the redis periodic routine; compare with linux cron
    int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) {
        .....
        /* We have just 22 bits per object for LRU information.
         * So we use an (eventually wrapping) LRU clock with 10 seconds resolution.
         * 2^22 bits with 10 seconds resolution is more or less 1.5 years.
         *
         * Note that even if this will wrap after 1.5 years it's not a problem,
         * everything will still work but just some object will appear younger
         * to Redis. But for this to happen a given object should never be touched
         * for 1.5 years.
         *
         * Note that you can change the resolution altering the
         * REDIS_LRU_CLOCK_RESOLUTION define. */
        updateLRUClock();
        .....
    }

    // update the server's lru counter
    void updateLRUClock(void) {
        server.lruclock = (server.unixtime/REDIS_LRU_CLOCK_RESOLUTION) &
                                                    REDIS_LRU_CLOCK_MAX;
    }
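Since the clock wraps, turning an object's lru value into an idle time needs a little care. The following is a minimal sketch under the constants shown above; the names `lru_clock_from` and `idle_seconds` are hypothetical, and the wrap handling is an assumption about how such a helper would reasonably behave.

```c
#include <assert.h>

#define LRU_CLOCK_MAX ((1<<21)-1)   /* same value as REDIS_LRU_CLOCK_MAX above */
#define LRU_CLOCK_RESOLUTION 10     /* seconds per clock tick */

/* Derive the clock value from a unix timestamp, as updateLRUClock() does. */
static unsigned lru_clock_from(long unixtime) {
    return (unsigned)((unixtime / LRU_CLOCK_RESOLUTION) & LRU_CLOCK_MAX);
}

/* Hypothetical helper: approximate idle time in seconds. When the clock is
 * behind the object's lru value, the clock is assumed to have wrapped once. */
static unsigned long idle_seconds(unsigned clock, unsigned obj_lru) {
    if (clock >= obj_lru)
        return (unsigned long)(clock - obj_lru) * LRU_CLOCK_RESOLUTION;
    return ((unsigned long)(LRU_CLOCK_MAX - obj_lru) + clock) *
           LRU_CLOCK_RESOLUTION;
}
```

An object stamped at tick 10 and inspected at tick 15 has been idle for about 50 seconds. The result is only as precise as the 10-second resolution.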

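11.3 TTL eviction

The redis data set keeps a table of key expire times, redisDb.expires. TTL eviction resembles LRU eviction: pick a few key-value pairs at random from the expires table and evict the one that expires soonest, i.e. the one with the smallest expire timestamp. Again, redis does not guarantee that it evicts the soonest-to-expire pair of the whole expires table, only the soonest-to-expire of the few pairs it sampled.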
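The "best of a few samples" idea can be sketched in a few lines. In this illustration the sampled indices are passed in by the caller so the behavior is deterministic; in redis they would come from random picks in the expires table. The function name `pick_ttl_victim` and the array-based layout are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Among the sampled keys, return the index of the one whose expire
 * timestamp is smallest (it expires soonest). Returns (size_t)-1 when
 * there are no samples. */
static size_t pick_ttl_victim(const long *expire_at,
                              const size_t *sampled, size_t nsamples) {
    size_t best = (size_t)-1;
    for (size_t k = 0; k < nsamples; k++) {
        size_t idx = sampled[k];
        if (best == (size_t)-1 || expire_at[idx] < expire_at[best])
            best = idx;
    }
    return best;
}
```

If four keys expire at 500, 100, 300 and 50 and the sample happens to contain keys 0, 2 and 1, the victim is key 1 even though key 3 expires sooner overall. That is exactly the approximation described above.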

11.4 Summary

Every time redis executes a command for a client, it checks whether memory usage exceeds the limit. If it does, redis evicts data.

    // execute a command
    int processCommand(redisClient *c) {
        .....
        // memory over the limit
        /* Handle the maxmemory directive.
         *
         * First we try to free some memory if possible (if there are volatile
         * keys in the dataset). If there are not the only thing we can do
         * is returning an error. */
        if (server.maxmemory) {
            int retval = freeMemoryIfNeeded();
            if ((c->cmd->flags & REDIS_CMD_DENYOOM) && retval == REDIS_ERR) {
                flagTransaction(c);
                addReply(c, shared.oomerr);
                return REDIS_OK;
            }
        }
        .....
    }

    // free some memory if needed
    int freeMemoryIfNeeded(void) {
        size_t mem_used, mem_tofree, mem_freed;
        int slaves = listLength(server.slaves);

        // slave output buffers and the AOF buffers do not count toward
        // redis memory usage
        /* Remove the size of slaves output buffers and AOF buffer from the
         * count of used memory. */
        mem_used = zmalloc_used_memory();
        // size of the slave output buffers
        if (slaves) {
            listIter li;
            listNode *ln;

            listRewind(server.slaves,&li);
            while ((ln = listNext(&li))) {
                redisClient *slave = listNodeValue(ln);
                unsigned long obuf_bytes = getClientOutputBufferMemoryUsage(slave);
                if (obuf_bytes > mem_used)
                    mem_used = 0;
                else
                    mem_used -= obuf_bytes;
            }
        }
        // server.aof_buf && server.aof_rewrite_buf_blocks
        if (server.aof_state != REDIS_AOF_OFF) {
            mem_used -= sdslen(server.aof_buf);
            mem_used -= aofRewriteBufferSize();
        }

        // is memory over the configured limit?
        /* Check if we are over the memory limit. */
        if (mem_used <= server.maxmemory) return REDIS_OK;

        // the over-limit policy is configurable
        if (server.maxmemory_policy == REDIS_MAXMEMORY_NO_EVICTION)
            return REDIS_ERR; /* We need to free memory, but policy forbids. */

        /* Compute how much memory we need to free. */
        mem_tofree = mem_used - server.maxmemory;
        mem_freed = 0;
        while (mem_freed < mem_tofree) {
            int j, k, keys_freed = 0;

            // walk every database
            for (j = 0; j < server.dbnum; j++) {
                long bestval = 0; /* just to prevent warning */
                sds bestkey = NULL;
                struct dictEntry *de;
                redisDb *db = server.db+j;
                dict *dict;

                // the policy decides which dictionary is sampled
                if (server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_LRU ||
                    server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_RANDOM)
                {
                    dict = server.db[j].dict;
                } else {
                    dict = server.db[j].expires;
                }
                // empty dictionary, move on to the next database
                if (dictSize(dict) == 0) continue;

                // random policies: pick at random
                /* volatile-random and allkeys-random policy */
                if (server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_RANDOM ||
                    server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_RANDOM)
                {
                    de = dictGetRandomKey(dict);
                    bestkey = dictGetKey(de);
                }

                // LRU policies: pick the least recently used data
                /* volatile-lru and allkeys-lru policy */
                else if (server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_LRU ||
                    server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_LRU)
                {
                    // server.maxmemory_samples is the number of random picks;
                    // sample that many pairs and evict the least recently used
                    for (k = 0; k < server.maxmemory_samples; k++) {
                        sds thiskey;
                        long thisval;
                        robj *o;

                        // pick a random key-value pair
                        de = dictGetRandomKey(dict);
                        // get the key
                        thiskey = dictGetKey(de);
                        /* When policy is volatile-lru we need an additional lookup
                         * to locate the real key, as dict is set to db->expires. */
                        if (server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_LRU)
                            de = dictFind(db->dict, thiskey);
                        o = dictGetVal(de);
                        // compute how long the object has been idle
                        thisval = estimateObjectIdleTime(o);

                        // remember this key if it has been idle longer
                        /* Higher idle time is better candidate for deletion */
                        if (bestkey == NULL || thisval > bestval) {
                            bestkey = thiskey;
                            bestval = thisval;
                        }
                    }
                }

                // TTL policy: pick the data about to expire
                /* volatile-ttl */
                else if (server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_TTL) {
                    // server.maxmemory_samples is the number of random picks;
                    // sample that many pairs and evict the soonest to expire
                    for (k = 0; k < server.maxmemory_samples; k++) {
                        sds thiskey;
                        long thisval;

                        de = dictGetRandomKey(dict);
                        thiskey = dictGetKey(de);
                        thisval = (long) dictGetVal(de);

                        /* Expire sooner (minor expire unix timestamp) is better
                         * candidate for deletion */
                        if (bestkey == NULL || thisval < bestval) {
                            bestkey = thiskey;
                            bestval = thisval;
                        }
                    }
                }

                // delete the selected key-value pair
                /* Finally remove the selected key. */
                if (bestkey) {
                    long long delta;

                    robj *keyobj = createStringObject(bestkey,sdslen(bestkey));
                    // publish the change, mainly to the AOF and the slaves
                    propagateExpire(db,keyobj);
                    // Note: propagateExpire() may itself allocate memory. It runs
                    // before the measurement starts because redis counts only
                    // the memory released by dbDelete(). If the allocation made
                    // by propagateExpire() were counted as well, and it exceeded
                    // what dbDelete() releases, the loop might never exit.
                    // Code that counted both would look like this:
                    // delta = (long long) zmalloc_used_memory();
                    // propagateExpire(db,keyobj);
                    // dbDelete(db,keyobj);
                    // delta -= (long long) zmalloc_used_memory();
                    // mem_freed += delta;
                    /* We compute the amount of memory freed by dbDelete() alone.
                     * It is possible that actually the memory needed to propagate
                     * the DEL in AOF and replication link is greater than the one
                     * we are freeing removing the key, but we can't account for
                     * that otherwise we would never exit the loop.
                     *
                     * AOF and Output buffer memory will be freed eventually so
                     * we only care about memory used by the key space. */
                    // count only the memory released by dbDelete()
                    delta = (long long) zmalloc_used_memory();
                    dbDelete(db,keyobj);
                    delta -= (long long) zmalloc_used_memory();
                    mem_freed += delta;
                    server.stat_evictedkeys++;
                    // notify all subscribed clients of the deletion
                    notifyKeyspaceEvent(REDIS_NOTIFY_EVICTED, "evicted",
                        keyobj, db->id);
                    decrRefCount(keyobj);
                    keys_freed++;

                    // push the data in the slave output buffers to the slaves promptly
                    /* When the memory to free starts to be big enough, we may
                     * start spending so much time here that is impossible to
                     * deliver data to the slaves fast enough, so we force the
                     * transmission here inside the loop. */
                    if (slaves) flushSlavesOutputBuffers();
                }
            }
            // nothing could be freed and memory is still over the limit: fail
            if (!keys_freed) return REDIS_ERR; /* nothing to free... */
        }
        return REDIS_OK;
    }

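Chapter 12

RDB persistence

12.1 Introduction to redis persistence: RDB and AOF

redis offers two persistence methods, RDB and AOF. The two can be combined, and both can be turned off.

RDB periodically backs up the in-memory data set. When the server starts, it can restore the data set from the RDB file.

AOF records every write operation of the server. When the server restarts, it replays all the write operations and thereby recovers the data. When the log of write operations grows too large (larger than the data set itself), redis rewrites it.

This chapter is about RDB persistence: how RDB stores data and how it operates. redis implements RDB mainly in the two files rdb.h and rdb.c.

12.2 The rio data structure

The I/O operations for persistence are implemented in rio.h and rio.c, and the core data structure is struct rio. Nearly every RDB function takes a rio parameter. struct rio works for files as well as for in-memory buffers, as its definition shows.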

    struct _rio {
        // function pointers: read, write, and file position
        /* Backend functions.
         * Since this functions do not tolerate short writes or reads the return
         * value is simplified to: zero on error, non zero on complete success. */
        size_t (*read)(struct _rio *, void *buf, size_t len);
        size_t (*write)(struct _rio *, const void *buf, size_t len);
        off_t (*tell)(struct _rio *);

        // checksum function
        /* The update_cksum method if not NULL is used to compute the checksum of
         * all the data that was read or written so far. The method should be
         * designed so that can be called with the current checksum, and the buf
         * and len fields pointing to the new block of data to add to the checksum
         * computation. */
        void (*update_cksum)(struct _rio *, const void *buf, size_t len);

        // checksum
        /* The current checksum */
        uint64_t cksum;

        // number of bytes read or written so far
        /* number of bytes read or written */
        size_t processed_bytes;

        // largest number of bytes handled in one call
        /* maximum single read or write chunk size */
        size_t max_processing_chunk;

        // either a string in memory or a file
        /* Backend-specific vars. */
        union {
            struct {
                sds ptr;
                // offset
                off_t pos;
            } buffer;
            struct {
                FILE *fp;
                off_t buffered; /* Bytes written since last fsync. */
                off_t autosync; /* fsync after 'autosync' bytes written. */
            } file;
        } io;
    };

    typedef struct _rio rio;

redis defines two instances of struct rio, rioFileIO and rioBufferIO. The former is used for file I/O, the latter for in-memory buffers:

    // for in-memory buffers
    static const rio rioBufferIO = {
        rioBufferRead,
        rioBufferWrite,
        rioBufferTell,
        NULL,           /* update_checksum */
        0,              /* current checksum */
        0,              /* bytes read or written */
        0,              /* read/write chunk size */
        { { NULL, 0 } } /* union for io-specific vars */
    };

    // for file I/O
    static const rio rioFileIO = {
        rioFileRead,
        rioFileWrite,
        rioFileTell,
        NULL,           /* update_checksum */
        0,              /* current checksum */
        0,              /* bytes read or written */
        0,              /* read/write chunk size */
        { { NULL, 0 } } /* union for io-specific vars */
    };
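The pattern behind struct rio, a struct of function pointers with backend-specific state, can be reduced to a small sketch. Everything here is hypothetical (`mini_io`, `mem_write`, `mini_io_init_mem`, `mini_io_write`, and the fixed 64-byte buffer); it only illustrates how a caller writes through one entry point without knowing which backend is plugged in.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct mini_io mini_io;
struct mini_io {
    /* Backend write: non-zero on complete success, zero on error. */
    size_t (*write)(mini_io *io, const void *buf, size_t len);
    size_t processed_bytes;          /* bytes written so far */
    struct {
        char data[64];               /* assumed fixed-size memory backend */
        size_t pos;
    } buffer;
};

/* Memory backend: append to the internal buffer, fail when it is full. */
static size_t mem_write(mini_io *io, const void *buf, size_t len) {
    if (io->buffer.pos + len > sizeof(io->buffer.data)) return 0;
    memcpy(io->buffer.data + io->buffer.pos, buf, len);
    io->buffer.pos += len;
    return 1;
}

static void mini_io_init_mem(mini_io *io) {
    io->write = mem_write;
    io->processed_bytes = 0;
    io->buffer.pos = 0;
}

/* Generic entry point: dispatches to the backend and keeps the counter. */
static size_t mini_io_write(mini_io *io, const void *buf, size_t len) {
    if (io->write(io, buf, len) == 0) return 0;
    io->processed_bytes += len;
    return 1;
}
```

A file backend would supply a different `write` function and its own state, while callers of `mini_io_write` stay unchanged. That is the property that lets the RDB functions take a single rio parameter.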

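12.3 How RDB persistence operates

redis can perform RDB persistence in two ways: in the current process, or in the background (BGSAVE). The BGSAVE strategy forks a child process that dumps the whole in-memory data set to disk. Two example scenarios:

1. During initialization the redis server sets up a timer event that triggers persistence at intervals; the timer handler forks a child process to carry out the persistence.

2. The redis server provides the save command, with which a client can ask the server process to suspend service and perform persistence.

The discussion here focuses on the file-writing side of RDB persistence; reading is the reverse of writing. The child process is created in rdbSaveBackground(), while the actual RDB persistence happens in rdbSave(). To persist directly, call rdbSave().

The following code walks through how RDB operates: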

    // main backup routine
    /* Save the DB on disk. Return REDIS_ERR on error, REDIS_OK on success */
    int rdbSave(char *filename) {
        dictIterator *di = NULL;
        dictEntry *de;
        char tmpfile[256];
        char magic[10];
        int j;
        long long now = mstime();
        FILE *fp;
        rio rdb;
        uint64_t cksum;

        // open the file for writing
        snprintf(tmpfile,256,"temp-%d.rdb", (int) getpid());
        fp = fopen(tmpfile,"w");
        if (!fp) {
            redisLog(REDIS_WARNING, "Failed opening .rdb for saving: %s",
                strerror(errno));
            return REDIS_ERR;
        }

        // initialize the rdb struct, which holds the read/write functions,
        // the count of bytes processed, and so on
        rioInitWithFile(&rdb,fp);
        if (server.rdb_checksum) // checksum
            rdb.update_cksum = rioGenericUpdateChecksum;
        // write the version number first
        snprintf(magic,sizeof(magic),"REDIS%04d",REDIS_RDB_VERSION);
        if (rdbWriteRaw(&rdb,magic,9) == -1) goto werr;

        for (j = 0; j < server.dbnum; j++) {
            // the data held by the server
            redisDb *db = server.db+j;
            // dictionary
            dict *d = db->dict;
            if (dictSize(d) == 0) continue;
            // dictionary iterator
            di = dictGetSafeIterator(d);
            if (!di) {
                fclose(fp);
                return REDIS_ERR;
            }

            // write the RDB opcode
            /* Write the SELECT DB opcode */
            if (rdbSaveType(&rdb,REDIS_RDB_OPCODE_SELECTDB) == -1) goto werr;
            // write the database number
            if (rdbSaveLen(&rdb,j) == -1) goto werr;

            // write every entry of the database
            /* Iterate this DB writing every entry */
            while ((de = dictNext(di)) != NULL) {
                sds keystr = dictGetKey(de);
                robj key, *o = dictGetVal(de);
                long long expire;

                // wrap keystr in a robj
                initStaticStringObject(key,keystr);
                // get the expire time
                expire = getExpire(db,&key);
                // write to disk
                if (rdbSaveKeyValuePair(&rdb,&key,o,expire,now) == -1) goto werr;
            }
            dictReleaseIterator(di);
        }
        di = NULL; /* So that we don't release it again on error. */

        // RDB end marker
        /* EOF opcode */
        if (rdbSaveType(&rdb,REDIS_RDB_OPCODE_EOF) == -1) goto werr;

        // checksum
        /* CRC64 checksum. It will be zero if checksum computation is disabled, the
         * loading code skips the check in this case. */
        cksum = rdb.cksum;
        memrev64ifbe(&cksum);
        rioWrite(&rdb,&cksum,8);

        // sync to disk
        /* Make sure data will not remain on the OS's output buffers */
        fflush(fp);
        fsync(fileno(fp));
        fclose(fp);

        // rename the temp file to the requested file name
        /* Use RENAME to make sure the DB file is changed atomically only
         * if the generate DB file is ok. */
        if (rename(tmpfile,filename) == -1) {
            redisLog(REDIS_WARNING,"Error moving temp DB file on the final destination: %s", strerror(errno));
            unlink(tmpfile);
            return REDIS_ERR;
        }
        redisLog(REDIS_NOTICE,"DB saved on disk");
        server.dirty = 0;
        // record the time of the successful save
        server.lastsave = time(NULL);
        // record that the last save succeeded
        server.lastbgsave_status = REDIS_OK;
        return REDIS_OK;

    werr:
        // clean up: close the file, remove the temp file
        fclose(fp);
        unlink(tmpfile);
        redisLog(REDIS_WARNING,"Write error saving DB on disk: %s", strerror(errno));
        if (di) dictReleaseIterator(di);
        return REDIS_ERR;
    }
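The write-to-a-temp-file-then-rename step at the end of rdbSave() is worth isolating, because it is what keeps a half-written file from ever replacing a good one. Below is a minimal POSIX sketch of that idea; the function name `save_atomically` and the temp-file naming scheme are assumptions for the example, not part of redis.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write data to a temp file, flush it to disk, then rename it over path.
 * Returns 0 on success, -1 on error. A reader of path sees either the old
 * file or the complete new one. */
static int save_atomically(const char *path, const void *data, size_t len) {
    char tmp[256];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp-%d", path, (int)getpid());
    if (n < 0 || (size_t)n >= sizeof(tmp)) return -1;

    FILE *fp = fopen(tmp, "w");
    if (!fp) return -1;

    if (fwrite(data, 1, len, fp) != len) goto werr;
    if (fflush(fp) != 0) goto werr;          /* stdio buffer -> kernel */
    if (fsync(fileno(fp)) != 0) goto werr;   /* kernel -> disk */
    if (fclose(fp) != 0) { unlink(tmp); return -1; }

    if (rename(tmp, path) != 0) { unlink(tmp); return -1; }
    return 0;

werr:
    fclose(fp);
    unlink(tmp);
    return -1;
}
```

Calling `save_atomically("dump.rdb", buf, len)` on a POSIX system replaces dump.rdb in one step; if any write fails, the temp file is removed and the previous dump.rdb is left untouched.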

    // bgsaveCommand(), serverCron(), syncCommand() and
    // updateSlavesWaitingBgsave() call rdbSaveBackground()
    int rdbSaveBackground(char *filename) {
        pid_t childpid;
        long long start;

        // a background save is already running, refuse to start another
        if (server.rdb_child_pid != -1) return REDIS_ERR;

        server.dirty_before_bgsave = server.dirty;
        // record the time of this persistence attempt
        server.lastbgsave_try = time(NULL);

        start = ustime();
        if ((childpid = fork()) == 0) {
            int retval;

            // stop listening
            /* Child */
            closeListeningSockets(0);
            redisSetProcTitle("redis-rdb-bgsave");
            // run the main backup routine
            retval = rdbSave(filename);
            // dirty pages: in effect the memory consumed by the child process
            if (retval == REDIS_OK) {
                // get the size of the dirty pages
                size_t private_dirty = zmalloc_get_private_dirty();

                // log the dirty pages
                if (private_dirty) {
                    redisLog(REDIS_NOTICE,
                        "RDB: %zu MB of memory used by copy-on-write",
                        private_dirty/(1024*1024));
                }
            }
            // exit the child process
            exitFromChild((retval == REDIS_OK) ? 0 : 1);
        } else {
            /* Parent */
            // compute the time spent in fork
            server.stat_fork_time = ustime()-start;
            // fork failed
            if (childpid == -1) {
                // record that the save failed
                server.lastbgsave_status = REDIS_ERR;
                redisLog(REDIS_WARNING,"Can't save in background: fork: %s",
                    strerror(errno));
                return REDIS_ERR;
            }
            redisLog(REDIS_NOTICE,"Background saving started by pid %d",childpid);
            // record when the save started
            server.rdb_save_time_start = time(NULL);
            // child process ID
            server.rdb_child_pid = childpid;
            updateDictResizePolicy();
            return REDIS_OK;
        }
        return REDIS_OK; /* unreached */
    }

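With the BGSAVE strategy and a large in-memory data set, fork() takes a long time because it has to create a copy of the virtual memory tables for the child. If the number of client requests is very high at that moment, many copy-on-write operations follow. During RDB persistence every item written leads to write() system calls, so CPU resources are tight. Therefore, when several redis instances run on one physical machine, avoid having them persist at the same time.

How can we tell how much memory BGSAVE used? Before it exits, the child process reads the size of its own private dirty pages, Private_Dirty, so that the user can see how much space the redis persistence process occupied. After the parent forks the child, the two have separate virtual address spaces but share the same physical pages until either of them modifies memory. Private_Dirty can therefore be taken as an approximation of the space used by the child, i.e. by the persistence process.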

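12.4 How RDB data is organized

Following rdbSave() above, an RDB file is laid out as: magic string and version : SELECTDB opcode : database number 1 : data 1 : SELECTDB opcode : database number 2 : data 2 : ...... : EOF opcode : checksum.

Each piece of data is organized as expire time : data type : key : value, with values stored in TLV (type, length, value) form.

Here are two examples of string storage; other types organize their data in much the same way:

As you can see, the result of RDB persistence is a very compact file in which nearly every bit carries useful information. If you are interested in the details of how redis organizes RDB data, see the implementation in rdb.h and rdb.c.

rdbSaveKeyValuePair() is called for every key-value pair: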

    int rdbSaveKeyValuePair(rio *rdb, robj *key, robj *val,
                            long long expiretime, long long now)
    {
        // expire time
        /* Save the expire time */
        if (expiretime != -1) {
            /* If this key is already expired skip it */
            if (expiretime < now) return 0;
            if (rdbSaveType(rdb,REDIS_RDB_OPCODE_EXPIRETIME_MS) == -1) return -1;
            if (rdbSaveMillisecondTime(rdb,expiretime) == -1) return -1;
        }

        /* Save type, key, value */
        // data type
        if (rdbSaveObjectType(rdb,val) == -1) return -1;
        // key
        if (rdbSaveStringObject(rdb,key) == -1) return -1;
        // value
        if (rdbSaveObject(rdb,val) == -1) return -1;
        return 1;
    }

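Chapter 13

AOF persistence

13.1 The rio data structure

This chapter is about AOF persistence: how AOF organizes its data and how it operates. redis implements AOF mainly in aof.c.

13.2 How AOF data is organized

Suppose redis holds the key-value pair "name:Jhon" in memory. After AOF persistence, the AOF file contains the following:

    *2        # 2 arguments
    $6        # the first argument has length 6
    SELECT    # first argument
    $1        # the second argument has length 1
    8         # second argument
    *3        # 3 arguments
    $3        # the first argument has length 3
    SET       # first argument
    $4        # the second argument has length 4
    name      # second argument
    $4        # the third argument has length 4
    Jhon      # third argument

Replaying the content above yields familiar redis commands: SELECT 8; SET name Jhon. As you would expect, redis walks every key-value pair of the in-memory data set and writes them to disk one after another; when redis starts, it reads the AOF file and restores the data.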
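To make the format concrete, here is a small sketch that encodes a command into this layout: a `*<argc>` line followed by a `$<len>` line and the argument itself, each terminated by CRLF. The function `aof_encode` is hypothetical and, as a simplification, treats arguments as NUL-terminated C strings, so it is not binary-safe the way a real encoder would need to be.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Encode argv[0..argc-1] as "*<argc>\r\n" then "$<len>\r\n<arg>\r\n" per
 * argument. Returns the number of bytes written (excluding the final NUL),
 * or -1 if the output buffer is too small. */
static int aof_encode(char *out, size_t cap, int argc, const char **argv) {
    int n = snprintf(out, cap, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= cap) return -1;
    size_t used = (size_t)n;

    for (int i = 0; i < argc; i++) {
        n = snprintf(out + used, cap - used, "$%zu\r\n%s\r\n",
                     strlen(argv[i]), argv[i]);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Encoding the three arguments SET, name and Jhon produces the second block of the listing above, with the comments stripped and each line ending in CRLF.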

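13.3 How AOF persistence operates

Unlike RDB persistence, redis AOF has two modes: running in the background, and backing up while serving.

1) AOF in the background resembles RDB: redis forks a child process, the main process keeps serving, and the child performs the AOF persistence, dumping the data to disk. Unlike RDB, while the child is persisting, the main process records every data change made in the meantime (it is still serving) and stores them in server.aof_rewrite_buf_blocks. When the child finishes, redis appends this update buffer to the AOF file, something RDB persistence does not have.

A word about this update buffer. When a data change happens on the redis server, for example set name Jhon, redis not only modifies the in-memory data set but also records the update, in the data format described above.

The update buffer can live in server.aof_buf. Think of it as a small temporary transfer station: all accumulated updates go here first, and at certain moments they are written to the file or added to the linked list under server.aof_rewrite_buf_blocks (detailed below). Data is added to server.aof_buf in propagate(), which is called wherever data is updated in order to accumulate changes. The update buffer can also live in server.aof_rewrite_buf_blocks, a linked list whose elements are of type struct aofrwblock. Think of it as a warehouse: while an AOF child process is running in the background, the accumulated updates (those in server.aof_buf) are added to the list, and when the AOF child finishes, the whole list is written to the file. The two are related.

Here is the main code for the background mode: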

    // start a background child process that performs AOF persistence;
    // bgrewriteaofCommand(), startAppendOnly() and serverCron() call this
    /* This is how rewriting of the append only file in background works:
     *
     * 1) The user calls BGREWRITEAOF
     * 2) Redis calls this function, that forks():
     *    2a) the child rewrite the append only file in a temp file.
     *    2b) the parent accumulates differences in server.aof_rewrite_buf.
     * 3) When the child finished '2a' exists.
     * 4) The parent will trap the exit code, if it's OK, will append the
     *    data accumulated into server.aof_rewrite_buf into the temp file, and
     *    finally will rename(2) the temp file in the actual file name.
     *    The the new file is reopened as the new append only file. Profit!
     */
    int rewriteAppendOnlyFileBackground(void) {
        pid_t childpid;
        long long start;

        // a backup child process is already running
        if (server.aof_child_pid != -1) return REDIS_ERR;
        start = ustime();
        if ((childpid = fork()) == 0) {
            char tmpfile[256];

            // child process
            /* Child */
            // stop listening
            closeListeningSockets(0);
            // set the process title
            redisSetProcTitle("redis-aof-rewrite");
            // temp file name
            snprintf(tmpfile,256,"temp-rewriteaof-bg-%d.aof", (int) getpid());
            // dirty pages: in effect the memory consumed by the child process
            if (rewriteAppendOnlyFile(tmpfile) == REDIS_OK) {
                // get the size of the dirty pages
                size_t private_dirty = zmalloc_get_private_dirty();

                // log the dirty pages
                if (private_dirty) {
                    redisLog(REDIS_NOTICE,
                        "AOF rewrite: %zu MB of memory used by copy-on-write",
                        private_dirty/(1024*1024));
                }
                exitFromChild(0);
            } else {
                exitFromChild(1);
            }
        } else {
            /* Parent */
            server.stat_fork_time = ustime()-start;
            if (childpid == -1) {
                redisLog(REDIS_WARNING,
                    "Can't rewrite append only file in background: fork: %s",
                    strerror(errno));
                return REDIS_ERR;
            }
            redisLog(REDIS_NOTICE,
                "Background append only file rewriting started by pid %d",childpid);
            // the AOF rewrite has started, clear the scheduled flag
            server.aof_rewrite_scheduled = 0;
            // start time of the most recent AOF rewrite
            server.aof_rewrite_time_start = time(NULL);
            // child process ID
            server.aof_child_pid = childpid;
            updateDictResizePolicy();
            // the update buffer will be written to the file, so force a
            // SELECT command to be emitted to avoid merging into the wrong database
            /* We set appendseldb to -1 in order to force the next call to the
             * feedAppendOnlyFile() to issue a SELECT command, so the differences
             * accumulated by the parent into server.aof_rewrite_buf will start
             * with a SELECT statement and it will be safe to merge. */
            server.aof_selected_db = -1;
            replicationScriptCacheFlush();
            return REDIS_OK;
        }
        return REDIS_OK; /* unreached */
    }

// Main AOF persistence function. It is called only from
// rewriteAppendOnlyFileBackground().
/* Write a sequence of commands able to fully rebuild the dataset into
 * "filename". Used both by REWRITEAOF and BGREWRITEAOF.
 *
 * In order to minimize the number of commands needed in the rewritten
 * log Redis uses variadic commands when possible, such as RPUSH, SADD
 * and ZADD. However at max REDIS_AOF_REWRITE_ITEMS_PER_CMD items per time
 * are inserted using a single command. */
int rewriteAppendOnlyFile(char *filename) {
    dictIterator *di = NULL;
    dictEntry *de;
    rio aof;
    FILE *fp;
    char tmpfile[256];
    int j;
    long long now = mstime();

    /* Note that we have to use a different temp name here compared to the
     * one used by rewriteAppendOnlyFileBackground() function. */
    snprintf(tmpfile,256,"temp-rewriteaof-%d.aof", (int) getpid());
    // Open the file
    fp = fopen(tmpfile,"w");
    if (!fp) {
        redisLog(REDIS_WARNING, "Opening the temp file for AOF rewrite in "
            "rewriteAppendOnlyFile(): %s", strerror(errno));
        return REDIS_ERR;
    }

    // Initialize the rio structure
    rioInitWithFile(&aof,fp);
    // Apply the incremental fsync option if it is configured
    if (server.aof_rewrite_incremental_fsync)
        rioSetAutoSync(&aof,REDIS_AOF_AUTOSYNC_BYTES);
    // Back up every database
    for (j = 0; j < server.dbnum; j++) {
        char selectcmd[] = "*2\r\n$6\r\nSELECT\r\n";
        redisDb *db = server.db+j;
        dict *d = db->dict;
        if (dictSize(d) == 0) continue;
        // Get an iterator over the database
        di = dictGetSafeIterator(d);
        if (!di) {
            fclose(fp);
            return REDIS_ERR;
        }

        // Write the SELECT command
        /* SELECT the new DB */
        if (rioWrite(&aof,selectcmd,sizeof(selectcmd)-1) == 0) goto werr;
        // Write the database number
        if (rioWriteBulkLongLong(&aof,j) == 0) goto werr;

        // Write every entry of the database
        /* Iterate this DB writing every entry */
        while((de = dictNext(di)) != NULL) {
            sds keystr;
            robj key, *o;
            long long expiretime;

            keystr = dictGetKey(de);
            o = dictGetVal(de);
            // Wrap keystr in a robj
            initStaticStringObject(key,keystr);

            // Get the expire time
            expiretime = getExpire(db,&key);

            // Skip the key if it has already expired
            /* If this key is already expired skip it */
            if (expiretime != -1 && expiretime < now) continue;

            // Write the command that recreates the key and its value
            /* Save the key and associated value */
            if (o->type == REDIS_STRING) {
                /* Emit a SET command */
                char cmd[]="*3\r\n$3\r\nSET\r\n";
                if (rioWrite(&aof,cmd,sizeof(cmd)-1) == 0) goto werr;
                /* Key and value */
                if (rioWriteBulkObject(&aof,&key) == 0) goto werr;
                if (rioWriteBulkObject(&aof,o) == 0) goto werr;
            } else if (o->type == REDIS_LIST) {
                if (rewriteListObject(&aof,&key,o) == 0) goto werr;
            } else if (o->type == REDIS_SET) {
                if (rewriteSetObject(&aof,&key,o) == 0) goto werr;
            } else if (o->type == REDIS_ZSET) {
                if (rewriteSortedSetObject(&aof,&key,o) == 0) goto werr;
            } else if (o->type == REDIS_HASH) {
                if (rewriteHashObject(&aof,&key,o) == 0) goto werr;
            } else {
                redisPanic("Unknown object type");
            }
            // Write the expire time
            /* Save the expire time */
            if (expiretime != -1) {
                char cmd[]="*3\r\n$9\r\nPEXPIREAT\r\n";
                if (rioWrite(&aof,cmd,sizeof(cmd)-1) == 0) goto werr;
                if (rioWriteBulkObject(&aof,&key) == 0) goto werr;
                if (rioWriteBulkLongLong(&aof,expiretime) == 0) goto werr;
            }
        }
        // Release the iterator
        dictReleaseIterator(di);
    }

    // Write to disk
    /* Make sure data will not remain on the OS's output buffers */
    fflush(fp);
    aof_fsync(fileno(fp));
    fclose(fp);

    // Rename the file
    /* Use RENAME to make sure the DB file is changed atomically only
     * if the generate DB file is ok. */
    if (rename(tmpfile,filename) == -1) {
        redisLog(REDIS_WARNING,"Error moving temp append only file on the "
            "final destination: %s", strerror(errno));
        unlink(tmpfile);
        return REDIS_ERR;
    }
    redisLog(REDIS_NOTICE,"SYNC append only file rewrite performed");
    return REDIS_OK;

werr:
    // Clean up
    fclose(fp);
    unlink(tmpfile);
    redisLog(REDIS_WARNING,"Write error writing append only file on disk: %s",
        strerror(errno));
    if (di) dictReleaseIterator(di);
    return REDIS_ERR;
}

// After the background child finishes, redis appends the update cache in
// server.aof_rewrite_buf_blocks to the AOF file.
// This function runs once the AOF persistence has ended. The main job of
// backgroundRewriteDoneHandler() is to write server.aof_rewrite_buf_blocks,
// the AOF cache, to the file.
/* A background append only file rewriting (BGREWRITEAOF) terminated its work.
 * Handle this. */
void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
    .....
    // Write the AOF cache in server.aof_rewrite_buf_blocks to disk
    if (aofRewriteBufferWrite(newfd) == -1) {
        redisLog(REDIS_WARNING,
            "Error trying to flush the parent diff to the rewritten AOF: %s",
            strerror(errno));
        close(newfd);
        goto cleanup;
    }
    .....
}

// Write the accumulated update cache server.aof_rewrite_buf_blocks to disk
/* Write the buffer (possibly composed of multiple blocks) into the specified
 * fd. If a short write or any other error happens -1 is returned,
 * otherwise the number of bytes written is returned. */
ssize_t aofRewriteBufferWrite(int fd) {
    listNode *ln;
    listIter li;
    ssize_t count = 0;

    listRewind(server.aof_rewrite_buf_blocks,&li);
    while((ln = listNext(&li))) {
        aofrwblock *block = listNodeValue(ln);
        ssize_t nwritten;

        if (block->used) {
            nwritten = write(fd,block->buf,block->used);
            if (nwritten != block->used) {
                if (nwritten == 0) errno = EIO;
                return -1;
            }
            count += nwritten;
        }
    }
    return count;
}

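2) Backup while serving: the redis server stores every data change in server.aof_buf and, at specific moments, writes the update cache to the configured file (server.aof_filename). There are three such moments:

1. before entering the event loop
2. in serverCron(), the redis server's periodic routine
3. in stopAppendOnly(), which turns the AOF policy off

The point is simply that redis does not want a sudden crash to lose too much data. Under the default everysec policy, fsync runs about once a second in a background thread, and flushAppendOnlyFile() postpones writing the accumulated updates for at most two seconds while an earlier fsync is still in progress.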
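The two-second rule can be sketched as a small decision function. This is an illustration of the postponement logic under the everysec policy, not the redis implementation: `flush_state` and `should_postpone_flush` are hypothetical names, and the single field stands in for server.aof_flush_postponed_start.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical state, mirroring server.aof_flush_postponed_start. */
typedef struct {
    time_t postponed_start; /* 0 means no flush is currently postponed */
} flush_state;

/* Return 1 if the write should be postponed, 0 if it must happen now.
 * Assumes the everysec fsync policy. */
int should_postpone_flush(flush_state *st, int sync_in_progress, int force,
                          time_t now) {
    assert(st != NULL);
    if (force || !sync_in_progress) {
        st->postponed_start = 0;    /* writing now: clear the marker */
        return 0;
    }
    if (st->postponed_start == 0) {
        st->postponed_start = now;  /* first postponement: remember when */
        return 1;
    }
    if (now - st->postponed_start < 2)
        return 1;                   /* waited less than two seconds */
    st->postponed_start = 0;        /* waited too long: write anyway */
    return 0;
}
```

With a background fsync still running, the first two calls one second apart postpone the write, and the call two seconds after the first postponement goes ahead with it.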

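Why did redis drop the option of running AOF persistence directly in the serving process? Probably because producing an AOF file takes longer than producing an RDB file: done in the current process, it would occupy the serving (main) process for a long time and the service would be unavailable for longer. This is only a guess.

The main code for backup while serving follows: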

// Sync to disk: write all the accumulated updates in server.aof_buf to disk
/* Write the append only file buffer on disk.
 *
 * Since we are required to write the AOF before replying to the client,
 * and the only way the client socket can get a write is entering the
 * event loop, we accumulate all the AOF writes in a memory
 * buffer and write it on disk using this function just before entering
 * the event loop again.
 *
 * About the 'force' argument:
 *
 * When the fsync policy is set to 'everysec' we may delay the flush if there
 * is still an fsync() going on in the background thread, since for instance
 * on Linux write(2) will be blocked by the background fsync anyway.
 * When this happens we remember that there is some aof buffer to be
 * flushed ASAP, and will try to do that in the serverCron() function.
 *
 * However if force is set to 1 we'll write regardless of the background
 * fsync. */
void flushAppendOnlyFile(int force) {
    ssize_t nwritten;
    int sync_in_progress = 0;

    // No data, nothing to sync to disk
    if (sdslen(server.aof_buf) == 0) return;

    // Check whether a background fsync() job is still pending
    if (server.aof_fsync == AOF_FSYNC_EVERYSEC)
        sync_in_progress = bioPendingJobsOfType(REDIS_BIO_AOF_FSYNC) != 0;

    // Without the force option the write may not happen right away
    if (server.aof_fsync == AOF_FSYNC_EVERYSEC && !force) {
        // Postpone the AOF write
        /* With this append fsync policy we do background fsyncing.
         * If the fsync is still in progress we can try to delay
         * the write for a couple of seconds. */
        if (sync_in_progress) {
            if (server.aof_flush_postponed_start == 0) {
                // Record when the flush was first postponed
                /* No previous write postponing, remember that we are
                 * postponing the flush and return. */
                server.aof_flush_postponed_start = server.unixtime;
                return;
            } else if (server.unixtime - server.aof_flush_postponed_start < 2) {
                // Less than 2s have passed, return right away
                /* We were already waiting for fsync to finish, but for less
                 * than two seconds this is still ok. Postpone again. */
                return;
            }
            // Otherwise force the write to disk
            /* Otherwise fall through, and go write since we can't wait
             * over two seconds. */
            server.aof_delayed_fsync++;
            redisLog(REDIS_NOTICE,"Asynchronous AOF fsync is taking too long "
                "(disk is busy?). Writing the AOF buffer without waiting for "
                "fsync to complete, this may slow down Redis.");
        }
    }
    // Clear the postponed-flush marker
    /* If you are following this code path, then we are going to write so
     * set reset the postponed flush sentinel to zero. */
    server.aof_flush_postponed_start = 0;

    /* We want to perform a single write. This should be guaranteed atomic
     * at least if the filesystem we are writing is a real physical one.
     * While this will save us against the server being killed I don't think
     * there is much to do about the whole server stopping for power problems
     * or alike */
    // The AOF file is already open. Write everything cached in
    // server.aof_buf to the file.
    nwritten = write(server.aof_fd,server.aof_buf,sdslen(server.aof_buf));
    if (nwritten != (signed)sdslen(server.aof_buf)) {
        /* Ooops, we are in troubles. The best thing to do for now is
         * aborting instead of giving the illusion that everything is
         * working as expected. */
        if (nwritten == -1) {
            redisLog(REDIS_WARNING,"Exiting on error writing to the "
                "append-only file: %s",strerror(errno));
        } else {
            redisLog(REDIS_WARNING,"Exiting on short write while writing to "
                                   "the append-only file: %s (nwritten=%ld, "
                                   "expected=%ld)",
                                   strerror(errno),
                                   (long)nwritten,
                                   (long)sdslen(server.aof_buf));

            if (ftruncate(server.aof_fd, server.aof_current_size) == -1) {
                redisLog(REDIS_WARNING, "Could not remove short write "
                         "from the append-only file.  Redis may refuse "
                         "to load the AOF the next time it starts.  "
                         "ftruncate: %s", strerror(errno));
            }
        }
        exit(1);
    }
    // Update the size of the AOF file
    server.aof_current_size += nwritten;

    // When server.aof_buf is small enough, reuse its space to avoid frequent
    // memory allocation. When it occupies a lot of space, free it instead.
    // redis is clearly sensitive to memory use.
    /* Re-use AOF buffer when it is small enough. The maximum comes from the
     * arena size of 4k minus some overhead (but is otherwise arbitrary). */
    if ((sdslen(server.aof_buf)+sdsavail(server.aof_buf)) < 4000) {
        sdsclear(server.aof_buf);
    } else {
        sdsfree(server.aof_buf);
        server.aof_buf = sdsempty();
    }

    /* Don't fsync if no-appendfsync-on-rewrite is set to yes and there are
     * children doing I/O in the background. */
    if (server.aof_no_fsync_on_rewrite &&
        (server.aof_child_pid != -1 || server.rdb_child_pid != -1))
            return;

    // fsync, write to disk
    /* Perform the fsync if needed. */
    if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
        /* aof_fsync is defined as fdatasync() for Linux in order to avoid
         * flushing metadata. */
        aof_fsync(server.aof_fd); /* Let's try to get this data on the disk */
        server.aof_last_fsync = server.unixtime;
    } else if ((server.aof_fsync == AOF_FSYNC_EVERYSEC &&
                server.unixtime > server.aof_last_fsync)) {
        if (!sync_in_progress) aof_background_fsync(server.aof_fd);
        server.aof_last_fsync = server.unixtime;
    }
}

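13.4 The Update Cache in Detail

The "update cache" mentioned twice above is the set of data changes that redis has accumulated.

The update cache can be stored in server.aof_buf and in the server.aof_rewrite_buf_blocks linked list. They relate as follows: every data change record is written to server.aof_buf, and if a background child is persisting at the same time, the record is also written to server.aof_rewrite_buf_blocks. server.aof_buf is written to the target file at specific moments; server.aof_rewrite_buf_blocks is appended to the file once the background persistence has finished.

The redis source implements this as: propagate() -> feedAppendOnlyFile() -> aofRewriteBufferAppend()

Note: feedAppendOnlyFile() appends the update to server.aof_buf. It then checks whether an AOF child exists and, if so, calls aofRewriteBufferAppend() to append the same record to the server.aof_rewrite_buf_blocks list.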
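The routing of one record into the two buffers can be sketched as follows. This is a simplified model, not redis code: `bytebuf`, `aof_model` and `feed_record` are hypothetical, and a plain growable byte buffer stands in for both sds and the aofrwblock list.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer standing in for sds / the block list. */
typedef struct {
    char *data;
    size_t len;
} bytebuf;

/* Append len bytes to b; returns 0 on success, -1 if out of memory. */
int bytebuf_append(bytebuf *b, const char *s, size_t len) {
    char *p;
    assert(b != NULL && s != NULL);
    p = realloc(b->data, b->len + len + 1);
    if (p == NULL) return -1;
    memcpy(p + b->len, s, len);
    b->len += len;
    p[b->len] = '\0';
    b->data = p;
    return 0;
}

/* Hypothetical stand-in for the relevant redisServer fields. */
typedef struct {
    int aof_on;          /* like server.aof_state == REDIS_AOF_ON */
    int child_running;   /* like server.aof_child_pid != -1 */
    bytebuf aof_buf;     /* like server.aof_buf */
    bytebuf rewrite_buf; /* like server.aof_rewrite_buf_blocks */
} aof_model;

/* Route one encoded record the way feedAppendOnlyFile() is described above:
 * always to aof_buf when AOF is on, and also to the rewrite buffer while a
 * background child is running. */
int feed_record(aof_model *m, const char *rec, size_t len) {
    if (m->aof_on && bytebuf_append(&m->aof_buf, rec, len) == -1)
        return -1;
    if (m->child_running &&
        bytebuf_append(&m->rewrite_buf, rec, len) == -1)
        return -1;
    return 0;
}
```

A record fed before the child starts lands only in aof_buf; records fed while the child runs land in both, which is why the rewrite buffer holds exactly the changes the child's snapshot is missing.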

// Publish a data update to the AOF and to the slaves
/* Propagate the specified command (in the context of the specified database
 * id) to AOF and Slaves.
 *
 * flags are an xor between:
 * + REDIS_PROPAGATE_NONE (no propagation of command at all)
 * + REDIS_PROPAGATE_AOF (propagate into the AOF file if is enabled)
 * + REDIS_PROPAGATE_REPL (propagate into the replication link)
 */
void propagate(struct redisCommand *cmd, int dbid, robj **argv, int argc,
               int flags)
{
    // The AOF policy must be on and the AOF propagation flag set; the
    // update is then published to the local file
    if (server.aof_state != REDIS_AOF_OFF && flags & REDIS_PROPAGATE_AOF)
        feedAppendOnlyFile(cmd,dbid,argv,argc);
    // The slave propagation flag is set; publish the update to the slaves
    if (flags & REDIS_PROPAGATE_REPL)
        replicationFeedSlaves(server.slaves,dbid,argv,argc);
}

// Record a data update in the AOF cache
void feedAppendOnlyFile(struct redisCommand *cmd, int dictid, robj **argv,
                        int argc) {
    sds buf = sdsempty();
    robj *tmpargv[3];

    /* The DB this command was targeting is not the same as the last command
     * we appended. To issue a SELECT command is needed. */
    if (dictid != server.aof_selected_db) {
        char seldb[64];

        snprintf(seldb,sizeof(seldb),"%d",dictid);
        buf = sdscatprintf(buf,"*2\r\n$6\r\nSELECT\r\n$%lu\r\n%s\r\n",
            (unsigned long)strlen(seldb),seldb);
        server.aof_selected_db = dictid;
    }

    if (cmd->proc == expireCommand || cmd->proc == pexpireCommand ||
        cmd->proc == expireatCommand) {
        /* Translate EXPIRE/PEXPIRE/EXPIREAT into PEXPIREAT */
        buf = catAppendOnlyExpireAtCommand(buf,cmd,argv[1],argv[2]);
    } else if (cmd->proc == setexCommand || cmd->proc == psetexCommand) {
        /* Translate SETEX/PSETEX to SET and PEXPIREAT */
        tmpargv[0] = createStringObject("SET",3);
        tmpargv[1] = argv[1];
        tmpargv[2] = argv[3];
        buf = catAppendOnlyGenericCommand(buf,3,tmpargv);
        decrRefCount(tmpargv[0]);
        buf = catAppendOnlyExpireAtCommand(buf,cmd,argv[1],argv[2]);
    } else {
        /* All the other commands don't need translation or need the
         * same translation already operated in the command vector
         * for the replication itself. */
        buf = catAppendOnlyGenericCommand(buf,argc,argv);
    }

    // Append the generated AOF record to server.aof_buf. Its content is
    // written to disk before the next entry into the event loop.
    /* Append to the AOF buffer. This will be flushed on disk just before
     * of re-entering the event loop, so before the client will get a
     * positive reply about the operation performed. */
    if (server.aof_state == REDIS_AOF_ON)
        server.aof_buf = sdscatlen(server.aof_buf,buf,sdslen(buf));

    // If an AOF child is already running, redis accumulates the differences
    // between the data the child is backing up and the in-memory dataset.
    // aofRewriteBufferAppend() appends buf to the
    // server.aof_rewrite_buf_blocks list.
    /* If a background append only file rewriting is in progress we want to
     * accumulate the differences between the child DB and the current one
     * in a buffer, so that when the child process will do its work we
     * can append the differences to the new append only file. */
    if (server.aof_child_pid != -1)
        aofRewriteBufferAppend((unsigned char*)buf,sdslen(buf));

    sdsfree(buf);
}

// Write a data update record into server.aof_rewrite_buf_blocks. This
// function is called only by feedAppendOnlyFile().
/* Append data to the AOF rewrite buffer, allocating new blocks if needed. */
void aofRewriteBufferAppend(unsigned char *s, unsigned long len) {
    // Insert at the tail
    listNode *ln = listLast(server.aof_rewrite_buf_blocks);
    aofrwblock *block = ln ? ln->value : NULL;

    while(len) {
        /* If we already got at least an allocated block, try appending
         * at least some piece into it. */
        if (block) {
            unsigned long thislen = (block->free < len) ? block->free : len;
            if (thislen) {  /* The current block is not already full. */
                memcpy(block->buf+block->used, s, thislen);
                block->used += thislen;
                block->free -= thislen;
                s += thislen;
                len -= thislen;
            }
        }

        if (len) { /* First block to allocate, or need another block. */
            int numblocks;

            // Create a new node and insert it at the tail
            block = zmalloc(sizeof(*block));
            block->free = AOF_RW_BUF_BLOCK_SIZE;
            block->used = 0;
            // Insert at the tail
            listAddNodeTail(server.aof_rewrite_buf_blocks,block);

            /* Log every time we cross more 10 or 100 blocks, respectively
             * as a notice or warning. */
            numblocks = listLength(server.aof_rewrite_buf_blocks);
            if (((numblocks+1) % 10) == 0) {
                int level = ((numblocks+1) % 100) == 0 ? REDIS_WARNING :
                                                         REDIS_NOTICE;
                redisLog(level,"Background AOF buffer size: %lu MB",
                    aofRewriteBufferSize()/(1024*1024));
            }
        }
    }
}

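A diagram to rest your eyes, showing how AOF persistence works:

The two ways of getting data onto disk are the two main threads of redis AOF persistence: background execution and backup while serving. Grasp these two and you understand redis AOF.

One question arises here. Both threads write files: background execution writes an AOF file and backup while serving writes one too, so which of them is authoritative?

Background persistence first writes its data to "temp-rewriteaof-bg-%d.aof", where "%d" is the id of the AOF child process. After the child finishes, "temp-rewriteaof-bg-%d.aof" is opened in append mode and the update cache in server.aof_rewrite_buf_blocks is written to it. Finally the file is renamed to server.aof_filename, so the previous file named server.aof_filename is replaced; in other words, the file written while serving is discarded. From then on, the data backed up while serving keeps being written to the server.aof_filename file.

So two files are indeed produced, but in the end both become the server.aof_filename file.

There is one more question: with background persistence available, why back up while serving at all? Backup while serving keeps the window of possible data loss small, but over time it accumulates redundant data and even stale data, which background persistence eliminates. Together they are redis's double insurance.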
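The "write a temp file, then rename it over the real one" step can be sketched in isolation. This is an illustration of the pattern under POSIX, not redis code: `replace_file_atomically` and the ".tmp-<pid>" suffix are hypothetical choices.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `data` by writing a temp file
 * first and renaming it over the target. Returns 0 on success, -1 on error. */
int replace_file_atomically(const char *path, const char *data, size_t len) {
    char tmp[512];
    FILE *fp;
    int n;

    assert(path != NULL && data != NULL);
    n = snprintf(tmp, sizeof(tmp), "%s.tmp-%d", path, (int)getpid());
    if (n < 0 || (size_t)n >= sizeof(tmp)) return -1;

    fp = fopen(tmp, "w");
    if (fp == NULL) return -1;

    /* Push the data through stdio and the OS buffers before renaming. */
    if (fwrite(data, 1, len, fp) != len || fflush(fp) == EOF ||
        fsync(fileno(fp)) == -1) {
        fclose(fp);
        unlink(tmp);
        return -1;
    }
    if (fclose(fp) == EOF) {
        unlink(tmp);
        return -1;
    }
    /* rename(2) swaps the new file in; readers see the old or the new one. */
    if (rename(tmp, path) == -1) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

Because the target name only ever points at a complete file, a crash in the middle of the rewrite leaves the previous AOF untouched.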

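13.5 AOF Recovery

The design of AOF data recovery is really neat: it simulates serving a client. redis first creates a fake client and reads the AOF file to recover each redis command and its arguments. It then executes the function for that command exactly as if it were serving a real client, and the data is restored. Most of this is implemented in loadAppendOnlyFile().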
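Recovering one command and its arguments from the AOF format can be sketched with a small in-memory parser. It is a simplified illustration, not the redis loader: `parsed_cmd`, `MAX_ARGS` and `parse_aof_command` are hypothetical, and the real code reads from a FILE and builds robj arguments instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ARGS 8 /* arbitrary cap for the example */

/* Hypothetical parsed command: pointers into the input plus lengths. */
typedef struct {
    int argc;
    const char *argv[MAX_ARGS];
    size_t arglen[MAX_ARGS];
} parsed_cmd;

/* Read a decimal number terminated by CRLF starting at *p and advance *p
 * past the CRLF. Returns -1 on malformed input. */
static long read_number_line(const char **p, const char *end) {
    long n = 0;
    int digits = 0;
    while (*p < end && **p >= '0' && **p <= '9' && digits < 9) {
        n = n * 10 + (**p - '0');
        (*p)++;
        digits++;
    }
    if (digits == 0 || end - *p < 2 || (*p)[0] != '\r' || (*p)[1] != '\n')
        return -1;
    *p += 2;
    return n;
}

/* Parse one "*argc CRLF ($len CRLF data CRLF)..." record from buf.
 * Returns the number of bytes consumed, or -1 on a format error. */
long parse_aof_command(const char *buf, size_t len, parsed_cmd *out) {
    const char *p = buf, *end = buf + len;
    long argc, j;

    assert(buf != NULL && out != NULL);
    if (p >= end || *p++ != '*') return -1;       /* must start with '*' */
    argc = read_number_line(&p, end);
    if (argc < 1 || argc > MAX_ARGS) return -1;   /* bad argument count */
    for (j = 0; j < argc; j++) {
        long alen;
        if (p >= end || *p++ != '$') return -1;   /* each arg starts with '$' */
        alen = read_number_line(&p, end);
        if (alen < 0 || end - p < alen + 2) return -1;
        out->argv[j] = p;
        out->arglen[j] = (size_t)alen;
        p += alen;
        if (p[0] != '\r' || p[1] != '\n') return -1; /* trailing CRLF */
        p += 2;
    }
    out->argc = (int)argc;
    return (long)(p - buf);
}
```

In the real loader the parsed arguments are handed to the fake client and the command's function is run; here the caller would simply inspect argv[0] to decide what to do.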

// Load the AOF file and restore the data
/* Replay the append log file. On success REDIS_OK is returned. On non fatal
 * error (the append only file is zero-length) REDIS_ERR is returned. On
 * fatal error an error message is logged and the program exits. */
int loadAppendOnlyFile(char *filename) {
    struct redisClient *fakeClient;
    FILE *fp = fopen(filename,"r");
    struct redis_stat sb;
    int old_aof_state = server.aof_state;
    long loops = 0;

    // The file must not be empty
    if (fp && redis_fstat(fileno(fp),&sb) != -1 && sb.st_size == 0) {
        server.aof_current_size = 0;
        fclose(fp);
        return REDIS_ERR;
    }

    if (fp == NULL) {
        redisLog(REDIS_WARNING,"Fatal error: can't open the append log file "
            "for reading: %s",strerror(errno));
        exit(1);
    }

    // The AOF is being loaded, so temporarily disable all AOF operations
    // to avoid confusion
    /* Temporarily disable AOF, to prevent EXEC from feeding a MULTI
     * to the same file we're about to read. */
    server.aof_state = REDIS_AOF_OFF;

    // Create a fake client, i.e. a redisClient
    fakeClient = createFakeClient();
    startLoading(fp);

    while(1) {
        int argc, j;
        unsigned long len;
        robj **argv;
        char buf[128];
        sds argsds;
        struct redisCommand *cmd;

        // Every 1000 iterations the server also serves clients while it
        // restores data; aeProcessEvents() processes the pending events
        if (!(loops++ % 1000)) {
            /* Serve the clients from time to time */
            loadingProgress(ftello(fp));
            aeProcessEvents(server.el, AE_FILE_EVENTS|AE_DONT_WAIT);
        }

        // The AOF file may have reached its end
        if (fgets(buf,sizeof(buf),fp) == NULL) {
            if (feof(fp))
                break;
            else
                goto readerr;
        }
        // A record must start with "*"; otherwise the format is wrong
        if (buf[0] != '*') goto fmterr;
        // Number of arguments
        argc = atoi(buf+1);
        // Invalid number of arguments
        if (argc < 1) goto fmterr;

        // Allocate space for the arguments
        argv = zmalloc(sizeof(robj*)*argc);
        // Read the arguments one by one
        for (j = 0; j < argc; j++) {
            if (fgets(buf,sizeof(buf),fp) == NULL) goto readerr;
            if (buf[0] != '$') goto fmterr;
            len = strtol(buf+1,NULL,10);
            argsds = sdsnewlen(NULL,len);
            if (len && fread(argsds,len,1,fp) == 0) goto fmterr;
            argv[j] = createObject(REDIS_STRING,argsds);
            if (fread(buf,2,1,fp) == 0) goto fmterr; /* discard CRLF */
        }

        // Look up the command
        /* Command lookup */
        cmd = lookupCommand(argv[0]->ptr);
        if (!cmd) {
            redisLog(REDIS_WARNING,"Unknown command '%s' reading the append "
                "only file", (char*)argv[0]->ptr);
            exit(1);
        }
        // Execute the command, simulating a client request, so that the
        // data is written
        /* Run the command in the context of a fake client */
        fakeClient->argc = argc;
        fakeClient->argv = argv;
        cmd->proc(fakeClient);

        /* The fake client should not have a reply */
        redisAssert(fakeClient->bufpos == 0 &&
                    listLength(fakeClient->reply) == 0);
        /* The fake client should never get blocked */
        redisAssert((fakeClient->flags & REDIS_BLOCKED) == 0);

        // Free the fake client's arguments
        /* Clean up. Command code may have changed argv/argc so we use the
         * argv/argc of the client instead of the local variables. */
        for (j = 0; j < fakeClient->argc; j++)
            decrRefCount(fakeClient->argv[j]);
        zfree(fakeClient->argv);
    }

    /* This point can only be reached when EOF is reached without errors.
     * If the client is in the middle of a MULTI/EXEC, log error and quit. */
    if (fakeClient->flags & REDIS_MULTI) goto readerr;

    // Clean up
    fclose(fp);
    freeFakeClient(fakeClient);
    // Restore the previous AOF state
    server.aof_state = old_aof_state;
    stopLoading();
    // Record the file size as of the latest AOF operation
    aofUpdateCurrentSize();
    server.aof_rewrite_base_size = server.aof_current_size;
    return REDIS_OK;

readerr:
    // Error: clean up
    if (feof(fp)) {
        redisLog(REDIS_WARNING,"Unexpected end of file reading the append "
            "only file");
    } else {
        redisLog(REDIS_WARNING,"Unrecoverable error reading the append only "
            "file: %s", strerror(errno));
    }
    exit(1);
fmterr:
    redisLog(REDIS_WARNING,"Bad file format reading the append only file: "
        "make a backup of your AOF file, then use "
        "./redis-check-aof --fix <filename>");
    exit(1);
}

13.6 When to Use AOF

If you care about your data and every second counts, use AOF persistence. AOF files are also easy to inspect and analyze.

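Chapter 14

Master-Slave Replication

redis supports a master-slave mode: a redis server can be set up as the master (or slave) of another redis server, and a slave gets its data from its master. In particular, a slave can itself be the master of another redis server, so the master-slave topology looks like a directed acyclic graph (DAG). A cluster of redis servers is formed this way, and every node, master or slave, is a redis server that can serve requests.

Once configured, the master can handle both reads and writes while the slaves handle reads only. redis provides this setup to support weak consistency, that is, eventual consistency. In practice, whether to choose strong or weak consistency should depend on the business requirements: a service like Weibo can use the weak consistency model without trouble, while one like Alipay would choose the strong consistency model.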


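14.1 The Backlog

binlog is a log type in mysql. It records every update, or potential update, made to the database since the last backup, and so describes how the data changed. Because binlog records only data updates, it is well suited to real-time backup and master-slave replication. For master-slave replication redis likewise uses a log similar to binlog.

The chapter on the redis AOF persistence policy introduced the update cache. For example, when a client sends the command set name Jhon, the data update is recorded as *3\r\n$3\r\nSET\r\n$4\r\nname\r\n$4\r\nJhon\r\n and stored in the update cache.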
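The record format can be made concrete with a small encoder. This is a sketch for illustration only: `encode_command` is a hypothetical helper that assumes NUL-terminated arguments, whereas redis builds the record with its own sds routines and handles binary-safe values.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical encoder: write argv[0..argc-1] into out in the
 * "*argc CRLF ($len CRLF data CRLF)..." format.
 * Returns the encoded length, or -1 if out is too small. */
int encode_command(char *out, size_t outlen, int argc, const char **argv) {
    size_t used;
    int j, n;

    assert(out != NULL && argv != NULL && argc > 0);
    n = snprintf(out, outlen, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= outlen) return -1;
    used = (size_t)n;

    for (j = 0; j < argc; j++) {
        /* Each argument: its byte length, then the bytes themselves. */
        n = snprintf(out + used, outlen - used, "$%zu\r\n%s\r\n",
                     strlen(argv[j]), argv[j]);
        if (n < 0 || (size_t)n >= outlen - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Encoding the three arguments SET, name and Jhon yields exactly the record quoted above: a count line, then a length line and a data line per argument.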

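A master-slave link has an update cache as well. Only the purpose differs: the former is written to local disk, the latter is sent to the slaves. Here we call it the backlog.

This update cache is stored in server.repl_backlog, which redis treats as a circular buffer. Doing so saves space and avoids reallocating memory.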
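How a circular buffer of this kind behaves can be sketched as follows. The field names follow the repl_backlog_* members listed in this section, but `backlog`, `backlog_init`, `backlog_feed` and the tiny 8-byte size are assumptions made for the example, not the redis implementation.

```c
#include <assert.h>
#include <string.h>

#define BACKLOG_SIZE 8 /* tiny on purpose; an assumption for the example */

typedef struct {
    char buf[BACKLOG_SIZE];
    long long size;          /* like repl_backlog_size */
    long long idx;           /* like repl_backlog_idx: next write position */
    long long histlen;       /* like repl_backlog_histlen: valid bytes held */
    long long off;           /* like repl_backlog_off: offset of first byte */
    long long master_offset; /* like master_repl_offset */
} backlog;

void backlog_init(backlog *b) {
    memset(b, 0, sizeof(*b));
    b->size = BACKLOG_SIZE;
    b->off = 1;
}

/* Append len bytes, wrapping around and overwriting the oldest data. */
void backlog_feed(backlog *b, const char *p, size_t len) {
    assert(b->size > 0 && b->idx < b->size);
    b->master_offset += (long long)len;
    while (len) {
        size_t thislen = (size_t)(b->size - b->idx);
        if (thislen > len) thislen = len;
        memcpy(b->buf + b->idx, p, thislen);
        b->idx += (long long)thislen;
        if (b->idx == b->size) b->idx = 0; /* wrap to the start */
        len -= thislen;
        p += thislen;
        b->histlen += (long long)thislen;
    }
    /* The buffer can never hold more history than its size. */
    if (b->histlen > b->size) b->histlen = b->size;
    /* Global offset of the oldest byte still present in the buffer. */
    b->off = b->master_offset - b->histlen + 1;
}
```

Feeding 6 bytes and then 4 more into the 8-byte buffer wraps the write position to 2 and moves the first available offset from 1 to 3: the two oldest bytes are gone, and no memory was reallocated.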

struct redisServer {
    .....
    /* Replication (master) */
    // The database most recently selected (accessed)
    int slaveseldb;                 /* Last SELECTed DB in replication output */
    // Global data synchronization offset
    long long master_repl_offset;   /* Global replication offset */
    // Heartbeat period of the master-slave link
    int repl_ping_slave_period;     /* Master pings the slave every N seconds */
    // Pointer to the backlog
    char *repl_backlog;             /* Replication backlog for partial syncs */
    // Size of the backlog
    long long repl_backlog_size;    /* Backlog circular buffer size */
    // Amount of data actually written into the backlog
    long long repl_backlog_histlen; /* Backlog actual data length */
    // Position where the next write into the backlog starts
    long long repl_backlog_idx;     /* Backlog circular buffer current offset */
    // Start of the backlog data, expressed as a global offset
    long long repl_backlog_off;     /* Replication offset of first byte in the
                                       backlog buffer. */
    // How long the backlog stays valid
    time_t repl_backlog_time_limit; /* Time without slaves after the backlog
                                       gets released. */
    .....
};

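When are the data change records written into the backlog? When a redis command is executed and it modifies (writes) data, the change record is propagated. The redis source implements this as: call() -> propagate() -> replicationFeedSlaves()

Note: a command is actually executed in call(). If call() finds that data was modified (dirty), it calls propagate(), and replicationFeedSlaves() writes the change record into the backlog and to every connected slave.

A question may arise here: why add the data to the backlog and also distribute it to all the slaves? Why not simply distribute it to the slaves?

Because some slaves get disconnected from the master for one reason or another. Note that a slave keeps the master's state information from before the disconnect. These disconnected slaves therefore miss the updates made in the meantime. To let a disconnected slave obtain those updates after it reconnects, redis also adds the update data to the backlog. As the implementation of replicationFeedSlaves() shows, an online slave receives the update record at once, while a slave that was temporarily disconnected has to recover the records from the backlog. If the disconnect lasted long enough for the circular backlog to overwrite the records the slave needs, the master refuses the slave's partial sync request and the slave can only do a full sync.

The annotated source follows: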

// call() is the core function for executing a command; this is where the
// command is actually run
/* Call() is the core of Redis execution of a command */
void call(redisClient *c, int flags) {
    .....
    /* Call the command. */
    c->flags &= ~(REDIS_FORCE_AOF|REDIS_FORCE_REPL);
    redisOpArrayInit(&server.also_propagate);
    // Dirty counter: tells whether data was modified
    dirty = server.dirty;
    // Run the function that implements the command
    c->cmd->proc(c);
    dirty = server.dirty-dirty;
    duration = ustime()-start;
    .....
    // Propagate the data change requested by the client to the AOF and
    // to the slaves
    /* Propagate the command into the AOF and replication link */
    if (flags & REDIS_CALL_PROPAGATE) {
        int flags = REDIS_PROPAGATE_NONE;

        // Forced replication
        if (c->flags & REDIS_FORCE_REPL) flags |= REDIS_PROPAGATE_REPL;
        // Forced AOF persistence
        if (c->flags & REDIS_FORCE_AOF) flags |= REDIS_PROPAGATE_AOF;
        // Data was modified
        if (dirty)
            flags |= (REDIS_PROPAGATE_REPL | REDIS_PROPAGATE_AOF);
        // Propagate the change record
        if (flags != REDIS_PROPAGATE_NONE)
            propagate(c->cmd,c->db->id,c->argv,c->argc,flags);
    }
    .....
}

// Publish a data update to the AOF and to the slaves
/* Propagate the specified command (in the context of the specified database
 * id) to AOF and Slaves.
 *
 * flags are an xor between:
 * + REDIS_PROPAGATE_NONE (no propagation of command at all)
 * + REDIS_PROPAGATE_AOF (propagate into the AOF file if is enabled)
 * + REDIS_PROPAGATE_REPL (propagate into the replication link)
 */
void propagate(struct redisCommand *cmd, int dbid, robj **argv, int argc,
               int flags)
{
    // The AOF policy must be on and the AOF propagation flag set; the
    // update is then published to the local file
    if (server.aof_state != REDIS_AOF_OFF && flags & REDIS_PROPAGATE_AOF)
        feedAppendOnlyFile(cmd,dbid,argv,argc);
    // The slave propagation flag is set; publish the update to the slaves
    if (flags & REDIS_PROPAGATE_REPL)
        replicationFeedSlaves(server.slaves,dbid,argv,argc);
}

/ 向积压空间和从机发送数据

void replicationFeedSlaves ( l i s t * slaves , int dictid , robj **

argv , int argc ) {

listNode * ln ;

l i s t I t e r l i ;

int j , len ;

char l l s t r [REDIS_LONGSTR_SIZE ] ;

/ 没有积压数据且没有从机，直接退出

* I f there aren ’ t slaves , and there i s no backlog b u f f e r

to populate ,

we can return ASAP. */

i f ( server . repl_backlog == NULL && listLength ( s l a v e s ) ==

) return ;

* We can ’ t have s l a v e s attached and no backlog . */

redisAssert ( ! ( listLength ( s l a v e s ) != 0 && server .

repl_backlog == NULL) ) ;

130

CHAPTER 14. 主从复制

* Send SELECT command to every s l a v e i f needed . */

i f ( server . slaveseldb != d i c t i d ) {

robj * selectcmd ;

/ 小于等于 10 的可以用共享对象

* For a few DBs we have pre−computed SELECT command.

i f ( d i c t i d >= 0 && d i c t i d < REDIS_SHARED_SELECT_CMDS)

{

selectcmd = shared . s e l e c t [ d i c t i d ] ;

}

else {

/ 不能使用共享对象，生成 SELECT 命令对应的 redis 对

象

int dictid_len ;

dictid_len = l l 2 s t r i n g ( l l s t r , sizeof ( l l s t r ) , d i c t i d

)

;

selectcmd = createObject (REDIS_STRING,

s d s c a t p r i n t f ( sdsempty () ,

”

*2\ r \n$6\ r \nSELECT\ r \n$%d\ r \n%s \ r \n” ,

dictid_len , l l s t r ) ) ;

}

/ 这里可能会有疑问：为什么把数据添加入积压空间，又把

数据分发给所有的从机？

/ 为什么不仅仅将数据分发给所有从机呢？

/ 因为有一些从机会因特殊情况（？？？）与主机断开连

接，注意从机断开前有暂存

/ 主机的状态信息，因此这些断开的从机就没有及时收到更

新的数据。 redis 为了让

/ 断开的从机在下次连接后能够获取更新数据，将更新数据

加入了积压空间。

/ 将 SELECT 命令对应的 redis 对象数据添加到积压空间

* Add the SELECT command into the backlog . */

i f ( server . repl_backlog )

feedReplicationBacklogWithObject ( selectcmd ) ;

/ 将数据分发所有的从机

        /* Send it to slaves. */
        listRewind(slaves,&li);
        while((ln = listNext(&li))) {
            redisClient *slave = ln->value;
            addReply(slave,selectcmd);
        }

        // Release the SELECT object unless it is one of the shared ones.
        if (dictid < 0 || dictid >= REDIS_SHARED_SELECT_CMDS)
            decrRefCount(selectcmd);
    }

    // Remember the database most recently selected on the slaves.
    server.slaveseldb = dictid;

    // Append the command to the replication backlog.
    /* Write the command to the replication backlog if any. */
    if (server.repl_backlog) {
        char aux[REDIS_LONGSTR_SIZE+3];

        // Number of arguments.
        /* Add the multi bulk reply length. */
        aux[0] = '*';
        len = ll2string(aux+1,sizeof(aux)-1,argc);
        aux[len+1] = '\r';
        aux[len+2] = '\n';
        feedReplicationBacklog(aux,len+3);

        // Write the arguments one by one.
        for (j = 0; j < argc; j++) {
            long objlen = stringObjectLen(argv[j]);

            /* We need to feed the buffer with the object as a bulk reply
             * not just as a plain string, so create the $..CRLF payload len
             * and add the final CRLF */
            aux[0] = '$';
            len = ll2string(aux+1,sizeof(aux)-1,objlen);
            aux[len+1] = '\r';
            aux[len+2] = '\n';

            // Each argument (SET, NAME, Jhon, ...) is framed as:
            //   $<length>\r\n<argument>\r\n

            // Length of the argument.
            feedReplicationBacklog(aux,len+3);
            // The argument itself.
            feedReplicationBacklogWithObject(argv[j]);
            // Trailing CRLF.
            feedReplicationBacklog(aux+len+1,2);
        }
    }

    // Send the command to every slave right away.
    /* Write the command to every slave. */
    listRewind(slaves,&li);
    while((ln = listNext(&li))) {
        redisClient *slave = ln->value;

        // A slave that asked for a full sync and is still waiting for
        // BGSAVE to start receives nothing yet.
        /* Don't feed slaves that are still waiting for BGSAVE to start */
        if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_START) continue;

        /* Feed slaves that are waiting for the initial SYNC (so these
         * commands are queued in the output buffer until the initial SYNC
         * completes), or are already in sync with the master. */

        // Send the argument count to the slave.
        /* Add the multi bulk length. */
        addReplyMultiBulkLen(slave,argc);

        // Send the command itself.
        /* Finally any additional argument that was not stored inside the
         * static buffer if any (from j to argc). */
        for (j = 0; j < argc; j++)
            addReplyBulk(slave,argv[j]);
    }
}
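The framing that replicationFeedSlaves() writes into the backlog can be reproduced in a few lines. The following is only a sketch under stated assumptions: `encode_multibulk` is a hypothetical helper, not a Redis function, and it uses `strlen()` for the argument length, so unlike the real code it is not binary safe.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Redis): write argv[0..argc-1] into buf
 * using the multi-bulk framing shown above:
 *   *<argc>\r\n  followed by  $<len>\r\n<arg>\r\n  for every argument.
 * Returns the number of bytes written, or -1 if buf is too small. */
int encode_multibulk(char *buf, size_t cap, int argc, const char **argv) {
    size_t off;
    int n, j;

    assert(buf != NULL && argc >= 0);

    /* Argument count, like the '*' header fed to the backlog. */
    n = snprintf(buf, cap, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= cap) return -1;
    off = (size_t)n;

    /* One "$<len>\r\n<payload>\r\n" frame per argument. */
    for (j = 0; j < argc; j++) {
        n = snprintf(buf + off, cap - off, "$%zu\r\n%s\r\n",
                     strlen(argv[j]), argv[j]);
        if (n < 0 || (size_t)n >= cap - off) return -1;
        off += (size_t)n;
    }
    return (int)off;
}
```

For the command `SET NAME Jhon` this produces one `*3` header followed by three `$` frames, which is the byte stream a slave later replays.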

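14.2 Overview of the master-slave synchronization mechanism

Redis master-slave synchronization comes in two forms (or two phases): full synchronization and partial synchronization.

When a master and a slave have just connected, a full sync is performed; once it finishes, the slave is kept up to date by partial sync. A slave can also request a full sync at any time if it needs one. The redis policy is to always attempt a partial sync first. If that fails, the master tells the slave to do a full sync and starts BGSAVE; when BGSAVE finishes, the RDB file is transferred. If it succeeds, the slave is allowed to do a partial sync and the master sends it the data held in the replication backlog.

To set up a slave, the command SLAVEOF hostname port is sent to the instance that should become the slave (it is issued by a client, for example an administrator, not pushed by the master). After receiving it, the slave connects to the master automatically and registers the corresponding read/write event handler (syncWithMaster()).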

// SLAVEOF handler: change (or drop) the master.
void slaveofCommand(redisClient *c) {
    if (!strcasecmp(c->argv[1]->ptr,"no") &&
        !strcasecmp(c->argv[2]->ptr,"one")) {
        // SLAVEOF NO ONE: disconnect from the master.
        if (server.masterhost) {
            replicationUnsetMaster();
            redisLog(REDIS_NOTICE,"MASTER MODE enabled (user request)");
        }
    } else {
        long port;

        if ((getLongFromObjectOrReply(c, c->argv[2], &port, NULL) != REDIS_OK))
            return;

        // We may already be connected to the requested master.
        /* Check if we are already attached to the specified slave */
        if (server.masterhost && !strcasecmp(server.masterhost,c->argv[1]->ptr)
            && server.masterport == port) {
            redisLog(REDIS_NOTICE,"SLAVE OF would result into synchronization with the master we are already connected with. No operation performed.");
            addReplySds(c,sdsnew("+OK Already connected to specified master\r\n"));
            return;
        }

        // Drop the connection to the previous master and switch to the new
        // one. replicationSetMaster() does not actually connect; it only
        // updates the master-related fields of struct server. The real
        // connection is made in replicationCron().
        /* There was no previous master or the user specified a different one,
         * we can continue. */
        replicationSetMaster(c->argv[1]->ptr, port);
        redisLog(REDIS_NOTICE,"SLAVE OF %s:%d enabled (user request)",
            server.masterhost, server.masterport);
    }
    addReply(c,shared.ok);
}

// Set a new master.
/* Set replication to the specified master address and port. */
void replicationSetMaster(char *ip, int port) {
    sdsfree(server.masterhost);
    server.masterhost = sdsdup(ip);
    server.masterport = port;

    // Drop the connection to the previous master.
    if (server.master) freeClient(server.master);
    disconnectSlaves(); /* Force our slaves to resync with us as well. */

    // Discard the cached master.
    replicationDiscardCachedMaster(); /* Don't try a PSYNC. */

    // Release the replication backlog.
    freeReplicationBacklog(); /* Don't allow our chained slaves to PSYNC. */

    // cancelReplicationHandshake() aborts any transfer in progress and the
    // pending connection to the master.
    cancelReplicationHandshake();
    server.repl_state = REDIS_REPL_CONNECT;
    server.master_repl_offset = 0;
}

// Periodic routine that manages the master-slave connection; it runs once
// per second and is called from serverCron().
/* --------------------------- REPLICATION CRON ---------------------------- */

/* Replication cron function, called 1 time per second. */
void replicationCron(void) {
    .....
    // If required (state REDIS_REPL_CONNECT), try to connect to the master.
    // This is where the connection is really established.
    /* Check if we should connect to a MASTER */
    if (server.repl_state == REDIS_REPL_CONNECT) {
        redisLog(REDIS_NOTICE,"Connecting to MASTER %s:%d",
            server.masterhost, server.masterport);
        if (connectWithMaster() == REDIS_OK) {
            redisLog(REDIS_NOTICE,"MASTER <-> SLAVE sync started");
        }
    }
    .....
}

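14.3 Full synchronization

Once connected, the slave automatically issues PSYNC to the master. As noted above, redis always attempts a partial sync first and falls back to a full sync only if that fails; a master and slave that have only just connected always need a full sync.

After connecting to the master, the slave sends PSYNC together with a master_runid and an offset, and the master checks whether both are valid. master_runid acts as the master's identity token and identifies the master the slave was connected to last time; offset is the global offset into the replication backlog data.

If the check fails, a full sync is performed: the master replies +FULLRESYNC master_runid offset (the slave records master_runid and offset and prepares to receive the RDB file), then starts BGSAVE to produce the RDB file. When BGSAVE finishes, the file is sent to the slave, which completes the full sync.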
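The slave-side handling of the `+FULLRESYNC master_runid offset` reply comes down to a small amount of string parsing. The sketch below mirrors the checks that slaveTryPartialResynchronization() performs later in this chapter. `parse_fullresync` and `RUN_ID_SIZE` are illustrative names, with `RUN_ID_SIZE` assumed to match the 40-character REDIS_RUN_ID_SIZE.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define RUN_ID_SIZE 40 /* assumed to mirror REDIS_RUN_ID_SIZE */

/* Hypothetical helper: parse "+FULLRESYNC <runid> <offset>".
 * On success copies the run id into runid_out (RUN_ID_SIZE+1 bytes),
 * stores the offset in *offset_out and returns 0.
 * Returns -1 if the reply is malformed. */
int parse_fullresync(const char *reply, char *runid_out,
                     long long *offset_out) {
    const char *runid, *offset;

    assert(reply != NULL && runid_out != NULL && offset_out != NULL);
    if (strncmp(reply, "+FULLRESYNC", 11) != 0) return -1;

    /* The run id starts after the first space. */
    runid = strchr(reply, ' ');
    if (runid == NULL) return -1;
    runid++;

    /* The offset starts after the second space. */
    offset = strchr(runid, ' ');
    if (offset == NULL) return -1;
    offset++;

    /* The run id must have exactly the expected length. */
    if ((offset - runid - 1) != RUN_ID_SIZE) return -1;

    memcpy(runid_out, runid, RUN_ID_SIZE);
    runid_out[RUN_ID_SIZE] = '\0';
    *offset_out = strtoll(offset, NULL, 10);
    return 0;
}
```

Rejecting a run id of the wrong length matters because the slave stores it and presents it again in the next PSYNC; a truncated id would never match the master's.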

// Registered as the event callback when connectWithMaster() connects to the
// master.
void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {
    char tmpfile[256], *err;
    int dfd, maxtries = 5;
    int sockerr = 0, psync_result;
    socklen_t errlen = sizeof(sockerr);
    .....
    // Ask the master for a partial sync. The master either accepts or
    // refuses; if it refuses it replies +FULLRESYNC master_runid offset
    // and the slave then prepares for a full sync.
    psync_result = slaveTryPartialResynchronization(fd);
    if (psync_result == PSYNC_CONTINUE) {
        redisLog(REDIS_NOTICE, "MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.");
        return;
    }

    // Perform a full sync.
    /* Fall back to SYNC if needed. Otherwise psync_result == PSYNC_FULLRESYNC
     * and the server.repl_master_runid and repl_master_initial_offset are
     * already populated. */
    // The master does not understand PSYNC: retry with the older SYNC.
    if (psync_result == PSYNC_NOT_SUPPORTED) {
        redisLog(REDIS_NOTICE,"Retrying with SYNC...");
        if (syncWrite(fd,"SYNC\r\n",6,server.repl_syncio_timeout*1000) == -1) {
            redisLog(REDIS_WARNING,"I/O error writing to MASTER: %s",
                strerror(errno));
            goto error;
        }
    }

    // Why five attempts? (O_EXCL makes open() fail if the file already
    // exists, so the loop waits one second and tries again.)
    /* Prepare a suitable temp file for bulk transfer */
    while(maxtries--) {
        snprintf(tmpfile,256,
            "temp-%d.%ld.rdb",(int)server.unixtime,(long int)getpid());
        dfd = open(tmpfile,O_CREAT|O_WRONLY|O_EXCL,0644);
        if (dfd != -1) break;
        sleep(1);
    }
    if (dfd == -1) {
        redisLog(REDIS_WARNING,"Opening the temp file needed for MASTER <-> SLAVE synchronization: %s",strerror(errno));
        goto error;
    }

    // Register a read event with readSyncBulkPayload() as the callback,
    // ready to receive the RDB file.
    /* Setup the non blocking download of the bulk file. */
    if (aeCreateFileEvent(server.el,fd, AE_READABLE,readSyncBulkPayload,NULL)
            == AE_ERR)
    {
        redisLog(REDIS_WARNING,
            "Can't create readable event for SYNC: %s (fd=%d)",
            strerror(errno),fd);
        goto error;
    }

    // Set up the state for the RDB transfer.
    // Replication state.
    server.repl_state = REDIS_REPL_TRANSFER;
    // Size of the RDB file (not known yet).
    server.repl_transfer_size = -1;
    // Bytes received so far.
    server.repl_transfer_read = 0;
    // Offset of the last fsync, used to flush to disk periodically.
    server.repl_transfer_last_fsync_off = 0;
    // File descriptor of the local temporary RDB file.
    server.repl_transfer_fd = dfd;
    // Time of the last I/O during the transfer.
    server.repl_transfer_lastio = server.unixtime;
    // Name of the temporary file.
    server.repl_transfer_tmpfile = zstrdup(tmpfile);
    return;

error:
    close(fd);
    server.repl_transfer_s = -1;
    server.repl_state = REDIS_REPL_CONNECT;
    return;
}

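The data requested by a full sync consists of the RDB data file plus the data in the replication backlog. For the RDB file, see "An in-depth look at the redis RDB persistence strategy". If no background BGSAVE process is running, one is started; otherwise every slave requesting a full sync is marked as waiting for a BGSAVE to finish. As soon as BGSAVE ends, the master sends the RDB file to all waiting slaves.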
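The choice of state for a slave that has just requested a full sync can be condensed into one small function. This is a sketch of the decision only, following the syncCommand() listing in this section: the enum names and `full_sync_state` are illustrative, and the numeric values are not the REDIS_REPL_* values Redis uses.

```c
#include <assert.h>

/* Illustrative state names; the numeric values are not the ones Redis uses. */
enum slave_state { WAIT_BGSAVE_START, WAIT_BGSAVE_END, SYNC_REFUSED };

/* Decide the state of a slave that has just asked for a full sync.
 *   bgsave_running   : a BGSAVE child already exists
 *   peer_waiting_end : another slave is already in WAIT_BGSAVE_END, so its
 *                      buffered updates can be copied to the new slave
 *   bgsave_start_ok  : whether starting a new BGSAVE succeeds (only
 *                      consulted when none is running) */
enum slave_state full_sync_state(int bgsave_running, int peer_waiting_end,
                                 int bgsave_start_ok) {
    assert(bgsave_running == 0 || bgsave_running == 1);

    if (bgsave_running) {
        /* Reuse the running BGSAVE only if another slave has been
         * accumulating updates since the fork; otherwise wait for the
         * next BGSAVE. */
        return peer_waiting_end ? WAIT_BGSAVE_END : WAIT_BGSAVE_START;
    }
    /* No BGSAVE in progress: start one, or refuse the sync on failure. */
    return bgsave_start_ok ? WAIT_BGSAVE_END : SYNC_REFUSED;
}
```

The asymmetry in the first branch exists because a running BGSAVE is only useful to a new slave if the updates made since the fork were recorded somewhere it can copy them from.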

// Master-side handler for SYNC and PSYNC. It tries a partial sync first and
// falls back to a full sync.
/* SYNC and PSYNC command implementation. */
void syncCommand(redisClient *c) {
    .....
    // The master tries a partial sync; if that fails it sends
    // +FULLRESYNC master_runid offset to the slave and then starts BGSAVE.
    // Full synchronization:
    /* Full resynchronization. */
    server.stat_sync_full++;

    /* Here we need to check if there is a background saving operation
     * in progress, or if it is required to start one */
    if (server.rdb_child_pid != -1) {
        // A BGSAVE child process exists.
        // 1. If any of the slaves already attached to this master is in
        //    state REDIS_REPL_WAIT_BGSAVE_END, set slave c to
        //    REDIS_REPL_WAIT_BGSAVE_END as well;
        // 2. otherwise set it to REDIS_REPL_WAIT_BGSAVE_START.
        /* Ok a background save is in progress. Let's check if it is a good
         * one for replication, i.e. if there is another slave that is
         * registering differences since the server forked to save */
        redisClient *slave;
        listNode *ln;
        listIter li;

        // Check whether some slave has already requested a full sync.
        listRewind(server.slaves,&li);
        while((ln = listNext(&li))) {
            slave = ln->value;
            if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_END) break;
        }
        if (ln) {
            // A slave in state REDIS_REPL_WAIT_BGSAVE_END exists, so put
            // slave c in the same state. It will receive the RDB file when
            // BGSAVE ends, and the updates buffered for the other slave are
            // copied to c.
            /* Perfect, the server is already registering differences for
             * another slave. Set the right state, and copy the buffer. */
            // Copy the pending output buffer of the other slave to c.
            copyClientOutputBuffer(c,slave);
            // Mark c as "waiting for BGSAVE to end".
            c->replstate = REDIS_REPL_WAIT_BGSAVE_END;
            redisLog(REDIS_NOTICE,"Waiting for end of BGSAVE for SYNC");
        } else {
            // No slave is in state REDIS_REPL_WAIT_BGSAVE_END, so mark c as
            // REDIS_REPL_WAIT_BGSAVE_START, i.e. waiting for a new BGSAVE
            // to begin.
            /* No way, we need to wait for the next BGSAVE in order to
             * register differences */
            c->replstate = REDIS_REPL_WAIT_BGSAVE_START;
            redisLog(REDIS_NOTICE,"Waiting for next BGSAVE for SYNC");
        }
    } else {
        // No BGSAVE child exists: start a new one.
        /* Ok we don't have a BGSAVE in progress, let's start one */
        redisLog(REDIS_NOTICE,"Starting BGSAVE for SYNC");
        if (rdbSaveBackground(server.rdb_filename) != REDIS_OK) {
            redisLog(REDIS_NOTICE,"Replication failed, can't BGSAVE");
            addReplyError(c,"Unable to perform background save");
            return;
        }
        // Mark c as REDIS_REPL_WAIT_BGSAVE_END so that it receives the RDB
        // file once BGSAVE has finished.
        c->replstate = REDIS_REPL_WAIT_BGSAVE_END;
        // Flush the replication script cache.
        /* Flush the script cache for the new slave. */
        replicationScriptCacheFlush();
    }

    if (server.repl_disable_tcp_nodelay)
        anetDisableTcpNoDelay(NULL, c->fd); /* Non critical if it fails. */
    c->repldbfd = -1;
    c->flags |= REDIS_SLAVE;
    server.slaveseldb = -1; /* Force to re-emit the SELECT command. */
    listAddNodeTail(server.slaves,c);
    if (listLength(server.slaves) == 1 && server.repl_backlog == NULL)
        createReplicationBacklog();
    return;
}

// Called when BGSAVE has finished.
/* A background saving child (BGSAVE) terminated its work. Handle this. */
void backgroundSaveDoneHandler(int exitcode, int bysignal) {
    // Other work.
    .....
    // Slaves may be waiting for the BGSAVE process to terminate.
    /* Possibly there are slaves waiting for a BGSAVE in order to be served
     * (the first stage of SYNC is a bulk transfer of dump.rdb) */
    updateSlavesWaitingBgsave(exitcode == 0 ? REDIS_OK : REDIS_ERR);
}

// Called when RDB persistence finishes (from backgroundSaveDoneHandler()).
// The RDB file is ready: send it to every waiting slave.
/* This function is called at the end of every background saving.
 * The argument bgsaveerr is REDIS_OK if the background saving succeeded
 * otherwise REDIS_ERR is passed to the function.
 *
 * The goal of this function is to handle slaves waiting for a successful
 * background saving in order to perform non-blocking synchronization. */
void updateSlavesWaitingBgsave(int bgsaveerr) {
    listNode *ln;
    int startbgsave = 0;
    listIter li;

    listRewind(server.slaves,&li);
    while((ln = listNext(&li))) {
        redisClient *slave = ln->value;

        // Waiting for BGSAVE to start: switch to waiting for the end of the
        // next BGSAVE.
        if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_START) {
            startbgsave = 1;
            slave->replstate = REDIS_REPL_WAIT_BGSAVE_END;
        // Waiting for BGSAVE to end: prepare to send the RDB file.
        } else if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_END) {
            struct redis_stat buf;

            // If RDB persistence failed, bgsaveerr is REDIS_ERR.
            if (bgsaveerr != REDIS_OK) {
                freeClient(slave);
                redisLog(REDIS_WARNING,"SYNC failed. BGSAVE child returned an error");
                continue;
            }
            // Open the RDB file.
            if ((slave->repldbfd = open(server.rdb_filename,O_RDONLY)) == -1 ||
                redis_fstat(slave->repldbfd,&buf) == -1) {
                freeClient(slave);
                redisLog(REDIS_WARNING,"SYNC failed. Can't open/stat DB after BGSAVE: %s", strerror(errno));
                continue;
            }
            slave->repldboff = 0;
            slave->repldbsize = buf.st_size;
            slave->replstate = REDIS_REPL_SEND_BULK;
            // Remove any write event registered earlier.
            aeDeleteFileEvent(server.el,slave->fd,AE_WRITABLE);
            // Register a new write event; sendBulkToSlave() transfers the
            // RDB file.
            if (aeCreateFileEvent(server.el, slave->fd, AE_WRITABLE, sendBulkToSlave, slave) == AE_ERR) {
                freeClient(slave);
                continue;
            }
        }
    }

    // startbgsave is set when at least one slave was waiting for a new
    // BGSAVE to start, so launch another BGSAVE now.
    if (startbgsave) {
        /* Since we are starting a new background save for one or more slaves,
         * we flush the Replication Script Cache to use EVAL to propagate every
         * new EVALSHA for the first time, since all the new slaves don't know
         * about previous scripts. */
        replicationScriptCacheFlush();
        if (rdbSaveBackground(server.rdb_filename) != REDIS_OK) {
            // BGSAVE may fail to fork. Every slave waiting for BGSAVE is
            // then disconnected. This is redis protecting itself: a failed
            // fork most likely means memory is tight.
            listIter li;

            listRewind(server.slaves,&li);
            redisLog(REDIS_WARNING,"SYNC failed. BGSAVE failed");
            while((ln = listNext(&li))) {
                redisClient *slave = ln->value;

                if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_START)
                    freeClient(slave);
            }
        }
    }
}

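14.4 Partial synchronization

As mentioned above, redis always attempts a partial sync first. A partial sync sends the slave the data cached in the replication backlog, that is, the record of recent updates.

After connecting to the master, the slave sends PSYNC together with a master_runid and an offset, and the master checks whether both are valid. If the check passes, a partial sync is performed: the master replies +CONTINUE (on receiving it the slave registers the event handler that will read the backlog data) and then sends the backlog data.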
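The master's validity check on master_runid and offset can be written as a single predicate. The sketch below follows the conditions in masterTryPartialResynchronization() shown later in this section. `backlog_view` and `can_partial_resync` are hypothetical names, and the run id is compared with `strcmp()` here although the Redis listing uses `strcasecmp()`.

```c
#include <assert.h>
#include <string.h>

/* Minimal view of the replication backlog (field names are illustrative). */
struct backlog_view {
    int exists;        /* non-zero once a backlog has been created */
    long long off;     /* replication offset of the first byte stored */
    long long histlen; /* number of valid bytes currently stored */
};

/* Return 1 if a PSYNC request can be served from the backlog, 0 if the
 * master has to fall back to a full resynchronization. */
int can_partial_resync(const char *my_runid, const char *asked_runid,
                       const struct backlog_view *b, long long psync_offset) {
    assert(my_runid != NULL && asked_runid != NULL && b != NULL);

    /* The slave must have been replicating from this very instance.
     * A run id of "?" never matches and therefore forces a full sync. */
    if (strcmp(my_runid, asked_runid) != 0) return 0;

    /* Without a backlog there is nothing to replay. */
    if (!b->exists) return 0;

    /* Offset too old: the data has already been overwritten. */
    if (psync_offset < b->off) return 0;

    /* Offset beyond what the backlog holds. */
    if (psync_offset > b->off + b->histlen) return 0;

    return 1;
}
```

An offset equal to `off + histlen` is accepted: the slave is fully caught up and simply has nothing to replay.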

// Registered as the event callback when connectWithMaster() connects to the
// master.
void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {
    char tmpfile[256], *err;
    int dfd, maxtries = 5;
    int sockerr = 0, psync_result;
    socklen_t errlen = sizeof(sockerr);
    .....
    // Try a partial sync. If the master allows it, it replies +CONTINUE and
    // the slave registers the corresponding event handler.
    /* Try a partial resynchonization. If we don't have a cached master
     * slaveTryPartialResynchronization() will at least try to use PSYNC
     * to start a full resynchronization so that we get the master run id
     * and the global offset, to try a partial resync at the next
     * reconnection attempt. */
    // The function returns one of three results:
    //   PSYNC_CONTINUE: a partial sync will be performed;
    //     slaveTryPartialResynchronization() has already installed the
    //     callback readQueryFromClient().
    //   PSYNC_FULLRESYNC: full sync, the RDB file will be downloaded.
    //   PSYNC_NOT_SUPPORTED: the master does not support PSYNC or gave an
    //     unexpected reply.
    psync_result = slaveTryPartialResynchronization(fd);
    if (psync_result == PSYNC_CONTINUE) {
        redisLog(REDIS_NOTICE, "MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.");
        return;
    }

    // Perform a full sync.
    .....
}

// The function returns one of three results:
//   PSYNC_CONTINUE: a partial sync will be performed; the callback is
//     already installed.
//   PSYNC_FULLRESYNC: full sync, the RDB file will be downloaded.
//   PSYNC_NOT_SUPPORTED: the master does not support PSYNC or gave an
//     unexpected reply.
#define PSYNC_CONTINUE 0
#define PSYNC_FULLRESYNC 1
#define PSYNC_NOT_SUPPORTED 2
int slaveTryPartialResynchronization(int fd) {
    char *psync_runid;
    char psync_offset[32];
    sds reply;

    /* Initially set repl_master_initial_offset to -1 to mark the current
     * master run_id and offset as not valid. Later if we'll be able to do
     * a FULL resync using the PSYNC command we'll set the offset at the
     * right value, so that this information will be propagated to the
     * client structure representing the master into server.master. */
    server.repl_master_initial_offset = -1;

    if (server.cached_master) {
        // The state of the previous master connection is cached, so a
        // partial sync can be attempted to reduce the data transferred.
        psync_runid = server.cached_master->replrunid;
        snprintf(psync_offset,sizeof(psync_offset),"%lld",
            server.cached_master->reploff+1);
        redisLog(REDIS_NOTICE,"Trying a partial resynchronization (request %s:%s).", psync_runid, psync_offset);
    } else {
        // No cached master: a full sync is needed.
        // "PSYNC ? -1" still returns the master's master_runid.
        redisLog(REDIS_NOTICE,"Partial resynchronization not possible (no cached master)");
        psync_runid = "?";
        memcpy(psync_offset,"-1",3);
    }

    // Send the command to the master and read the reply.
    /* Issue the PSYNC command */
    reply = sendSynchronousCommand(fd,"PSYNC",psync_runid,psync_offset,NULL);

    // Full sync.
    if (!strncmp(reply,"+FULLRESYNC",11)) {
        char *runid = NULL, *offset = NULL;

        /* FULL RESYNC, parse the reply in order to extract the run id
         * and the replication offset. */
        runid = strchr(reply,' ');
        if (runid) {
            runid++;
            offset = strchr(runid,' ');
            if (offset) offset++;
        }
        if (!runid || !offset || (offset-runid-1) != REDIS_RUN_ID_SIZE) {
            redisLog(REDIS_WARNING,
                "Master replied with wrong +FULLRESYNC syntax.");
            /* This is an unexpected condition, actually the +FULLRESYNC
             * reply means that the master supports PSYNC, but the reply
             * format seems wrong. To stay safe we blank the master
             * runid to make sure next PSYNCs will fail. */
            memset(server.repl_master_runid,0,REDIS_RUN_ID_SIZE+1);
        } else {
            // Copy the run id.
            memcpy(server.repl_master_runid, runid, offset-runid-1);
            server.repl_master_runid[REDIS_RUN_ID_SIZE] = '\0';
            server.repl_master_initial_offset = strtoll(offset,NULL,10);
            redisLog(REDIS_NOTICE,"Full resync from master: %s:%lld",
                server.repl_master_runid,
                server.repl_master_initial_offset);
        }
        /* We are going to full resync, discard the cached master structure. */
        replicationDiscardCachedMaster();
        sdsfree(reply);
        return PSYNC_FULLRESYNC;
    }

    // Partial sync.
    if (!strncmp(reply,"+CONTINUE",9)) {
        /* Partial resync was accepted, set the replication state accordingly */
        redisLog(REDIS_NOTICE,
            "Successful partial resynchronization with master.");
        sdsfree(reply);
        // Promote the cached master to the current master, ready for the
        // partial sync (PSYNC).
        replicationResurrectCachedMaster(fd);
        return PSYNC_CONTINUE;
    }

    /* If we reach this point we received either an error since the master
     * does not understand PSYNC, or an unexpected reply from the master.
     * Reply with PSYNC_NOT_SUPPORTED in both cases. */

    // Check whether the master sent an error message.
    if (strncmp(reply,"-ERR",4)) {
        /* If it's not an error, log the unexpected event. */
        redisLog(REDIS_WARNING,
            "Unexpected reply to PSYNC from master: %s", reply);
    } else {
        redisLog(REDIS_NOTICE,
            "Master does not support PSYNC or is in "
            "error state (reply: %s)", reply);
    }
    sdsfree(reply);
    replicationDiscardCachedMaster();
    return PSYNC_NOT_SUPPORTED;
}

// Master-side handler for SYNC and PSYNC. It tries a partial sync first and
// falls back to a full sync.
/* SYNC and PSYNC command implementation. */
void syncCommand(redisClient *c) {
    .....
    // The master tries a partial sync. If it is allowed, the master replies
    // +CONTINUE and then sends the backlog.
    /* Try a partial resynchronization if this is a PSYNC command.
     * If it fails, we continue with usual full resynchronization, however
     * when this happens masterTryPartialResynchronization() already
     * replied with:
     *
     * +FULLRESYNC <runid> <offset>
     *
     * So the slave knows the new runid and offset to try a PSYNC later
     * if the connection with the master is lost. */
    if (!strcasecmp(c->argv[0]->ptr,"psync")) {
        // Partial sync.
        if (masterTryPartialResynchronization(c) == REDIS_OK) {
            server.stat_sync_partial_ok++;
            return; /* No full resync needed, return. */
        } else {
            // The partial sync failed and a full sync follows; argv[1] is
            // the runid supplied by the client.
            char *master_runid = c->argv[1]->ptr;

            /* Increment stats for failed PSYNCs, but only if the
             * runid is not "?", as this is used by slaves to force a full
             * resync on purpose when they are not able to partially
             * resync. */
            if (master_runid[0] != '?') server.stat_sync_partial_err++;
        }
    } else {
        /* If a slave uses SYNC, we are dealing with an old implementation
         * of the replication protocol (like redis-cli --slave). Flag the
         * client so that we don't expect to receive REPLCONF ACK
         * feedbacks. */
        c->flags |= REDIS_PRE_PSYNC_SLAVE;
    }

    // Full synchronization:
    .....
}

// The master checks whether a partial sync is possible.
/* This function handles the PSYNC command from the point of view of a
 * master receiving a request for partial resynchronization.
 *
 * On success return REDIS_OK, otherwise REDIS_ERR is returned and we proceed
 * with the usual full resync. */
int masterTryPartialResynchronization(redisClient *c) {
    long long psync_offset, psync_len;
    char *master_runid = c->argv[1]->ptr;
    char buf[128];
    int buflen;

    /* Is the runid of this master the same advertised by the wannabe slave
     * via PSYNC? If runid changed this master is a different instance and
     * there is no way to continue. */
    if (strcasecmp(master_runid, server.runid)) {
        // When a slave loses its master because of some fault, it caches
        // the master's state for a later partial sync.
        // 1) master_runid is the runid of the cached master, as supplied by
        //    the slave;
        // 2) server.runid is the runid of this instance (the master).
        // A mismatch means this instance is not the master the slave has
        // cached, so a partial sync is impossible and only a full sync can
        // be done.
        // "?" means the slave is asking for a full sync. The slave sends it
        // when it has no cached master (see
        // slaveTryPartialResynchronization()).
        /* Run id "?" is used by slaves that want to force a full resync. */
        if (master_runid[0] != '?') {
            redisLog(REDIS_NOTICE,"Partial resynchronization not accepted: "
                "Runid mismatch (Client asked for '%s', I'm '%s')",
                master_runid, server.runid);
        } else {
            redisLog(REDIS_NOTICE,"Full resync requested by slave.");
        }
        goto need_full_resync;
    }

    // Parse the integer argument: the offset requested by the slave.
    /* We still have the data our slave is asking for? */
    if (getLongLongFromObjectOrReply(c,c->argv[2],&psync_offset,NULL) !=
       REDIS_OK) goto need_full_resync;

    // Cases in which the partial sync fails:
    if (!server.repl_backlog ||
        /* no backlog exists */
        psync_offset < server.repl_backlog_off ||
        /* psync_offset is too small: the slave has missed too many updates,
         * so to be safe a full sync is performed */
        psync_offset > (server.repl_backlog_off + server.repl_backlog_histlen))
        /* psync_offset is out of range */
    {
        // The conditions for a partial sync are not met; do a full sync.
        redisLog(REDIS_NOTICE,
            "Unable to partial resync with the slave for lack of backlog (Slave request was: %lld).", psync_offset);
        if (psync_offset > server.master_repl_offset) {
            redisLog(REDIS_WARNING,
                "Warning: slave tried to PSYNC with an offset that is greater than the master replication offset.");
        }
        goto need_full_resync;
    }

    // Perform the partial sync:
    // 1) mark the client as a slave;
    // 2) tell the slave to get ready to receive data (it prepares itself on
    //    receiving +CONTINUE);
    // 3) start sending the data.
    /* If we reached this point, we are able to perform a partial resync:
     * 1) Set client state to make it a slave.
     * 2) Inform the client we can continue with +CONTINUE
     * 3) Send the backlog data (from the offset to the end) to the slave. */
    // Mark the connected client as a slave.
    c->flags |= REDIS_SLAVE;
    // The slave is in partial-sync (online) mode:
    // #define REDIS_REPL_ONLINE 9 /* RDB file transmitted, sending just
    //                              * updates. */
    c->replstate = REDIS_REPL_ONLINE;
    // Refresh the ack time.
    c->repl_ack_time = server.unixtime;
    // Add the client to the list of slaves.
    listAddNodeTail(server.slaves,c);
    // Tell the slave that a partial sync is possible; on receipt it
    // prepares itself (registers its callback).
    /* We can't use the connection buffers since they are used to accumulate
     * new commands at this stage. But we are sure the socket send buffer is
     * empty so this write will never fail actually. */
    buflen = snprintf(buf,sizeof(buf),"+CONTINUE\r\n");
    if (write(c->fd,buf,buflen) != buflen) {
        freeClientAsync(c);
        return REDIS_OK;
    }
    // Write the backlog data, which holds the buffered updates, to the
    // slave.
    psync_len = addReplyReplicationBacklog(c,psync_offset);
    redisLog(REDIS_NOTICE,
        "Partial resynchronization request accepted. Sending %lld bytes of backlog starting from offset %lld.", psync_len, psync_offset);
    /* Note that we don't need to set the selected DB at server.slaveseldb
     * to -1 to force the master to emit SELECT, since the slave already
     * has this state from the previous connection with the master. */

    refreshGoodSlavesCount();
    return REDIS_OK; /* The caller can return, no full resync needed. */

need_full_resync:
    .....
    // Send +FULLRESYNC runid repl_offset to the slave.
}

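Caching the master

A slave may lose its connection to the master for various reasons, such as network latency (PING timeout, ACK timeout and so on). When that happens, the slave tries to save the state of its connection to the master, for example the global replication backlog offset, so that the next synchronization can be a partial one, and it then tries to reconnect to the master. Note that if the disconnection lasts long enough, the partial sync is bound to fail, because the offset the slave asks for will no longer be covered by the master's backlog.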

void freeClient(redisClient *c) {
    listNode *ln;

    /* If this is marked as current client unset it */
    if (server.current_client == c) server.current_client = NULL;

    // If this instance is a slave and the client being freed is its master,
    // the master's state may need to be saved for a later PSYNC.
    /* If it is our master that's being disconnected we should make sure
     * to cache the state to try a partial resynchronization later.
     *
     * Note that before doing this we make sure that the client is not in
     * some unexpected state, by checking its flags. */
    if (server.master && c->flags & REDIS_MASTER) {
        redisLog(REDIS_WARNING,"Connection with master lost.");
        if (!(c->flags & (REDIS_CLOSE_AFTER_REPLY|
                          REDIS_CLOSE_ASAP|
                          REDIS_BLOCKED|
                          REDIS_UNBLOCKED)))
        {
            replicationCacheMaster(c);
            return;
        }
    }
    .....
}

// To make partial sync possible, the slave saves the master's state before
// dropping the connection. The state is kept in server.cached_master.
// Called from freeClient() to save the state of the master connection for a
// later PSYNC.
void replicationCacheMaster(redisClient *c) {
    listNode *ln;

    redisAssert(server.master != NULL && server.cached_master == NULL);
    redisLog(REDIS_NOTICE,"Caching the disconnected master state.");

    // Remove the master from the list of clients.
    /* Remove from the list of clients, we don't want this client to be
     * listed by CLIENT LIST or processed in any way by batch operations. */
    ln = listSearchKey(server.clients,c);
    redisAssert(ln != NULL);
    listDelNode(server.clients,ln);

    // Save the master's state.
    /* Save the master. Server.master will be set to null later by
     * replicationHandleMasterDisconnection(). */
    server.cached_master = server.master;

    // Unregister the events and close the connection.
    /* Remove the event handlers and close the socket. We'll later reuse
     * the socket of the new connection with the master during PSYNC. */
    aeDeleteFileEvent(server.el,c->fd,AE_READABLE);
    aeDeleteFileEvent(server.el,c->fd,AE_WRITABLE);
    close(c->fd);

    /* Set fd to -1 so that we can safely call freeClient(c) later. */
    c->fd = -1;

    // Update the replication state and set server.master = NULL.
    /* Caching the master happens instead of the actual freeClient() call,
     * so make sure to adjust the replication state. This function will
     * also set server.master to NULL. */
    replicationHandleMasterDisconnection();
}
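What the cached master buys the slave on reconnection is the pair of arguments it puts into its next PSYNC. The sketch below shows that choice, following slaveTryPartialResynchronization() earlier in this section. `cached_master_info` and `build_psync` are hypothetical names, and the single-line text format is only for illustration; it is not claimed to be the exact bytes Redis puts on the wire.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The two pieces of cached state that matter for PSYNC (names are
 * illustrative, loosely modelled on replrunid and reploff). */
struct cached_master_info {
    const char *runid; /* run id of the master we were connected to */
    long long reploff; /* last replication offset we processed */
};

/* Build the PSYNC request line into buf. With a cached master the slave
 * asks to continue from the next byte (reploff + 1); without one it sends
 * "?" and -1, which forces a full resync but still lets it learn the
 * master's run id. Returns the length written, or -1 if buf is too small. */
int build_psync(char *buf, size_t cap, const struct cached_master_info *cm) {
    int n;

    assert(buf != NULL);
    if (cm != NULL) {
        assert(strlen(cm->runid) > 0);
        n = snprintf(buf, cap, "PSYNC %s %lld", cm->runid, cm->reploff + 1);
    } else {
        n = snprintf(buf, cap, "PSYNC ? -1");
    }
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

The `+ 1` reflects that reploff is the last offset already applied, so the first byte the slave still needs is the one after it.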

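14.5 Summary

Put simply, master-slave synchronization is the upload and download of an RDB file; whenever the master makes a small number of data changes, it propagates the change records to every slave. This chapter described the internal protocol and mechanics of redis replication in detail. The next few chapters on redis mainly cover its internal data structures.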

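Chapter 15

The redis transaction mechanism

15.1 A brief overview of redis transactions

MULTI, EXEC, DISCARD and WATCH are the four basic commands of redis transactions:

MULTI tells the redis server to open a transaction. Note that it only opens the transaction; nothing is executed yet.
EXEC tells redis to start executing the transaction.
DISCARD tells redis to cancel the transaction.
WATCH monitors a key-value pair. If a watched key is modified before the transaction executes, the transaction is cancelled.

Before describing redis transactions, let us first look at how the redis command queue is implemented internally.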

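15.2 The redis command queue

redis lets a client execute several commands without interruption: after sending MULTI, the user enters a number of commands, and sending EXEC then runs all of them back to back. Because redis works as a single process with a single thread, the execution of these commands cannot be interrupted by other clients.

MULTI
INCR foo
QUEUED
INCR bar
QUEUED
EXEC
1) (integer) 1
2) (integer) 1

The internal implementation is not difficult: when the redis server receives MULTI from a client, it keeps a command queue structure for that client, and it only starts executing the commands in the queue once EXEC arrives.

The data structures of the command queue are shown below: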

// Command structure, used only by the command queue.
/* Client MULTI/EXEC state */
typedef struct multiCmd {
    // Command arguments.
    robj **argv;
    // Number of arguments.
    int argc;
    // Command structure holding everything related to the command, such as
    // the function that executes it. For details see the global
    // redisCommandTable in redis.c.
    struct redisCommand *cmd;
} multiCmd;

// Command queue structure.
typedef struct multiState {
    // The queued commands.
    multiCmd *commands;     /* Array of MULTI commands */
    // Number of commands.
    int count;              /* Total number of MULTI commands */
    // The two fields below are not used yet; they relate to replication.
    int minreplicas;        /* MINREPLICAS for synchronous replication */
    time_t minreplicas_timeout; /* MINREPLICAS timeout as unixtime. */
} multiState;
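To make the shape of the queue concrete, here is a minimal sketch of appending commands to such an array. It is a simplified model, not the Redis implementation: `queued_cmd`, `cmd_queue`, `queue_push` and `queue_free` are hypothetical names, arguments are plain C strings rather than robj pointers, and the array simply grows by one slot per command.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for multiCmd and multiState (names are illustrative). */
typedef struct queued_cmd {
    int argc;
    const char **argv; /* borrowed pointers; the queue does not own them */
} queued_cmd;

typedef struct cmd_queue {
    queued_cmd *commands; /* array of queued commands */
    int count;            /* number of queued commands */
} cmd_queue;

/* Append one command to the queue. Returns 0 on success, -1 if memory
 * could not be allocated (the queue is left unchanged in that case). */
int queue_push(cmd_queue *q, int argc, const char **argv) {
    queued_cmd *grown;

    assert(q != NULL && argc > 0);
    grown = realloc(q->commands, sizeof(*grown) * (size_t)(q->count + 1));
    if (grown == NULL) return -1;

    q->commands = grown;
    q->commands[q->count].argc = argc;
    q->commands[q->count].argv = argv;
    q->count++;
    return 0;
}

/* Release the array and reset the queue to the empty state. */
void queue_free(cmd_queue *q) {
    assert(q != NULL);
    free(q->commands);
    q->commands = NULL;
    q->count = 0;
}
```

Queuing `INCR foo` and then `INCR bar` leaves `count` at 2 with the commands stored in arrival order, which is the order EXEC later replays them in.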

Using the client session shown above, let us look at how the state of the redis server changes:

MULTI
INCR foo
QUEUED
INCR bar
QUEUED
EXEC
1) (integer) 1
2) (integer) 1

[Figure: server-side state changes for the session above]

A fragment of processCommand() shows how a command is put on the queue:

// Execute a command.
int processCommand(redisClient *c) {
    .....
    // The case in which the command is queued.
    /* Exec the command */
    if (c->flags & REDIS_MULTI &&
        c->cmd->proc != execCommand && c->cmd->proc != discardCommand &&
        c->cmd->proc != multiCommand && c->cmd->proc != watchCommand)
    {
        // Queue the command.
        queueMultiCommand(c);
        addReply(c,shared.queued);
    } else {
        // Really execute the command. Note that when multi-command mode is
        // on, the command is not executed directly but queued (the branch
        // above).
        call(c,REDIS_CALL_FULL);
        if (listLength(server.ready_keys))
            handleClientsBlockedOnLists();
    }
    return REDIS_OK;
}

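15.3 Watching keys

Transaction execution and cancellation are covered a little later.

According to the official redis documentation, the WATCH command exists to give redis check-and-set (CAS) behaviour. CAS means that before a client modifies a value it checks whether the value has been changed in the meantime; the modification succeeds only if it has not.

An example without CAS:

    client A                          client B
    get score (score=10)              get score (score=10)
    temp=score+1 (temp=11)            temp=score+1 (temp=11)
    set score temp (score=11)         set score temp (score=11)
    final: score=11

An example with CAS:

    client A                                  client B
    get score (score=10)                      get score (score=10)
                                              temp=score+1 (temp=11)
                                              set score temp (score=11)
    temp=score+1 (temp=11)
    (the server marks score as modified)
    set score temp (score=11) (failed!!!)
    final: score=11
    get score (score=11)
    temp=score+1 (temp=12)
    set score temp (score=12)
    final: score=12

In the second example, client A's first attempt to modify score fails because client B has modified it. After failing, client A tries again and this time succeeds. The CAS behaviour of redis transactions is built on key watching.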

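Both the redis database structure redisDb and the client structure redisClient store data related to key watching: redisDb has a watched_keys dictionary that maps each watched key to the list of clients watching it, and redisClient has a watched_keys list of the (key, db) pairs that the client is watching.

The process of watching a key: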

// Handler for the WATCH command.
void watchCommand(redisClient *c) {
    int j;

    // WATCH cannot be called between MULTI and EXEC.
    if (c->flags & REDIS_MULTI) {
        addReplyError(c,"WATCH inside MULTI is not allowed");
        return;
    }
    // Watch every key given.
    for (j = 1; j < c->argc; j++)
        watchForKey(c,c->argv[j]);
    addReply(c,shared.ok);
}

// Watch a single key.
/* Watch for the specified key */
void watchForKey(redisClient *c, robj *key) {
    list *clients = NULL;
    listIter li;
    listNode *ln;
    watchedKey *wk;

    // Is this key already being watched by the client?
    /* Check if we are already watching for this key */
    listRewind(c->watched_keys,&li);
    while((ln = listNext(&li))) {
        wk = listNodeValue(ln);
        if (wk->db == c->db && equalStringObjects(key,wk->key))
            return; /* Key already watched */
    }

    // Fetch the list of clients watching this key.
    /* This key is not already watched in this DB. Let's add it */
    clients = dictFetchValue(c->db->watched_keys,key);
    if (!clients) {
        // No list exists yet, so create one.
        clients = listCreate();
        dictAdd(c->db->watched_keys,key,clients);
        incrRefCount(key);
    }
    // Append the client to the tail of the list.
    listAddNodeTail(clients,c);

    // Append the watched key to the tail of redisClient.watched_keys.
    /* Add the new key to the list of keys watched by this client */
    wk = zmalloc(sizeof(*wk));
    wk->key = key;
    wk->db = c->db;
    incrRefCount(key);
    listAddNodeTail(c->watched_keys,wk);
}

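When a watched key is modified, every client watching it is flagged REDIS_DIRTY_CAS, which records that the key-value pair has been changed.

touchWatchedKey() is the function that marks a key as modified. It is normally called through the wrapper signalModifiedKey(). Its implementation follows.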

// Flag the clients watching the key as REDIS_DIRTY_CAS, meaning that the
// data they watch has been modified.
/* "Touch" a key, so that if this key is being WATCHed by some client the
 * next EXEC will fail. */
void touchWatchedKey(redisDb *db, robj *key) {
    list *clients;
    listIter li;
    listNode *ln;

    // Fetch all clients watching key.
    if (dictSize(db->watched_keys) == 0) return;
    clients = dictFetchValue(db->watched_keys, key);
    if (!clients) return;

    // Flag every client watching key as REDIS_DIRTY_CAS.
    /* Mark all the clients watching this key as REDIS_DIRTY_CAS */
    listRewind(clients,&li);
    while((ln = listNext(&li))) {
        redisClient *c = listNodeValue(ln);

        // REDIS_DIRTY_CAS is set when watched data changes.
        c->flags |= REDIS_DIRTY_CAS;
    }
}

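15.4 Executing and discarding redis transactions

When the user issues EXEC, all commands submitted after MULTI are executed. Looking at the implementation, if data watched by the client has been modified, the client is flagged REDIS_DIRTY_CAS and EXEC calls discardTransaction() to cancel the transaction. In addition, a user submits several commands after opening a transaction; if an error occurs while a command is being queued (the command does not exist, its arguments are wrong, the memory limit is exceeded, and so on), the client is flagged REDIS_DIRTY_EXEC, which also cancels the transaction.

To summarize:

REDIS_DIRTY_CAS: set when a watched key is modified; EXEC is aborted.
REDIS_DIRTY_EXEC: set when queueing a command fails; EXEC fails with an error.

Here is how a transaction is executed: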

// Execute all commands queued in the transaction
void execCommand(redisClient *c) {
    int j;
    robj **orig_argv;
    int orig_argc;
    struct redisCommand *orig_cmd;
    int must_propagate = 0; /* Need to propagate MULTI/EXEC to AOF / slaves? */

    // The client must be in MULTI state
    if (!(c->flags & REDIS_MULTI)) {
        addReplyError(c,"EXEC without MULTI");
        return;
    }

    // The transaction is aborted when:
    // 1. a watched key was modified
    // 2. an error occurred while a command was being queued
    /* Check if we need to abort the EXEC because:
     * 1) Some WATCHed key was touched.
     * 2) There was a previous error while queueing commands.
     * A failed EXEC in the first case returns a multi bulk nil object
     * (technically it is not an error but a special behavior), while
     * in the second an EXECABORT error is returned. */
    if (c->flags & (REDIS_DIRTY_CAS|REDIS_DIRTY_EXEC)) {
        addReply(c, c->flags & REDIS_DIRTY_EXEC ? shared.execaborterr :
                                                  shared.nullmultibulk);
        discardTransaction(c);
        goto handle_monitor;
    }

    // Execute every command in the queue
    /* Exec all the queued commands */
    unwatchAllKeys(c); /* Unwatch ASAP otherwise we'll waste CPU cycles */
    // Save the current command (normally EXEC); it is restored once all
    // queued commands have run.
    orig_argv = c->argv;
    orig_argc = c->argc;
    orig_cmd = c->cmd;
    addReplyMultiBulkLen(c,c->mstate.count);
    for (j = 0; j < c->mstate.count; j++) {
        // Make the queued command the client's current command
        c->argc = c->mstate.commands[j].argc;
        c->argv = c->mstate.commands[j].argv;
        c->cmd = c->mstate.commands[j].cmd;

        // On the first write command, propagate MULTI to the AOF and slaves
        /* Propagate a MULTI request once we encounter the first write op.
         * This way we'll deliver the MULTI/..../EXEC block as a whole and
         * both the AOF and the replication link will have the same consistency
         * and atomicity guarantees. */
        if (!must_propagate && !(c->cmd->flags & REDIS_CMD_READONLY)) {
            execCommandPropagateMulti(c);
            must_propagate = 1;
        }

        // Run the command through call()
        call(c,REDIS_CALL_FULL);

        // Write argc/argv/cmd back to the queue: a command may have rewritten
        // them, and the queue is what gets freed later
        /* Commands may alter argc/argv, restore mstate. */
        c->mstate.commands[j].argc = c->argc;
        c->mstate.commands[j].argv = c->argv;
        c->mstate.commands[j].cmd = c->cmd;
    }
    // Restore the original command (normally EXEC)
    c->argv = orig_argv;
    c->argc = orig_argc;
    c->cmd = orig_cmd;
    // The transaction is done: clean up its state, i.e. the command queue
    // and the client flags
    discardTransaction(c);
    /* Make sure the EXEC command will be propagated as well if MULTI
     * was already propagated. */
    if (must_propagate) server.dirty++;
    .....
}

As described above, a modified watched key or an error while queueing a command both cause the transaction to be discarded:

// Discard the transaction
void discardTransaction(redisClient *c) {
    // Free the command queue
    freeClientMultiState(c);
    // Re-initialize the command queue
    initClientMultiState(c);
    // Clear the transaction flags
    c->flags &= ~(REDIS_MULTI|REDIS_DIRTY_CAS|REDIS_DIRTY_EXEC);
    unwatchAllKeys(c);
}

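15.5 More on redis transactions

You may have noticed the word "transaction". Database courses introduce the ACID properties of a transaction: atomicity, consistency, isolation and durability. Let us see whether redis transactions provide ACID.

Atomicity: all operations in a transaction either complete or none of them do; the transaction never stops halfway. redis transactions are not atomic in this sense. The clearest sign is that redis has no rollback: if one queued command fails during EXEC, the other commands still take effect.

Consistency: the integrity of the database is not violated before the transaction starts or after it ends. redis transactions can guarantee this.

Isolation: how two or more transactions behave relative to each other when they concurrently access (query or modify) the same data. redis has no concurrent-transaction problem, because it works as a single process with a single thread.

Durability: once a transaction completes, its changes are stored persistently and completely. redis offers two persistence mechanisms, RDB and AOF. RDB only snapshots the data set currently in memory; when a transaction finishes, its data is still in memory and has not yet been written to disk, so RDB cannot guarantee durability for redis transactions. As for AOF, I discussed in 《深入剖析 redis AOF 持久化策略》 that it works in two ways: a background rewrite and appending while serving. The background rewrite resembles RDB and only saves the data set currently in memory. When appending while serving, redis only syncs the file to disk periodically (about once per second with the default appendfsync everysec policy), so durability is incomplete there as well.

Of course, we could modify the source ourselves to make redis transactions durable; that is not hard.

One more highlight is check-and-set (CAS). A modifying operation keeps checking whether the value X has been changed, and only sets the new value once it finds that no other operation has modified X. redis implements CAS with the WATCH/MULTI commands.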

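In practice, when several threads try to modify a global variable we usually use a lock: the resource is locked from the moment the variable is read, which blocks modification by other threads, and the lock is released only after the modification is done. This is pessimistic locking. Its counterpart is optimistic locking, which assumes that other users rarely try to modify the object you are modifying; nothing is locked while reading and modifying, and the check happens only when the change is committed. In general, different clients read and modify different key-value pairs, so a single check is usually enough before the set, and repeated checks are rare.

What redis implements with WATCH/MULTI is exactly an optimistic lock. See this example:

http://stackoverflow.com/questions/10750626/transactions-and-watch-statement-in-redis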
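The check-and-set retry loop described above can be sketched as a toy in-memory model. This is only an illustration under stated assumptions: MiniStore, execute() and incr_with_retry() are hypothetical names invented here, and nothing in it talks to a real redis server. The dirty set plays the role of the REDIS_DIRTY_CAS flag.

```python
class MiniStore:
    """Toy in-memory model of WATCH/MULTI/EXEC; not the redis implementation."""

    def __init__(self):
        self.data = {}
        self.watchers = {}   # key -> set of client ids watching it
        self.dirty = set()   # clients flagged like REDIS_DIRTY_CAS

    def watch(self, client, key):
        self.watchers.setdefault(key, set()).add(client)

    def set(self, key, value):
        self.data[key] = value
        # Analogue of touchWatchedKey(): flag every watcher of this key.
        self.dirty |= self.watchers.get(key, set())

    def execute(self, client, queued):
        """Apply queued (key, value) writes unless a watched key changed."""
        aborted = client in self.dirty
        # Analogue of unwatchAllKeys() plus clearing the client's flag.
        for clients in self.watchers.values():
            clients.discard(client)
        self.dirty.discard(client)
        if aborted:
            return None
        for key, value in queued:
            self.set(key, value)
        return ["OK"] * len(queued)


def incr_with_retry(store, client, key, interfere=None):
    """Check-and-set loop: retry until the transaction goes through."""
    attempts = 0
    while True:
        attempts += 1
        store.watch(client, key)
        current = store.data.get(key, 0)
        if interfere and attempts == 1:
            interfere()  # another client writes between WATCH and EXEC
        if store.execute(client, [(key, current + 1)]) is not None:
            return attempts
```

Without interference the loop finishes on the first attempt; when another writer touches the key between the watch and the execute step, the first attempt is discarded and the second one works from the fresh value.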

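Chapter 16

redis and Lua scripts

This chapter explains how redis and Lua work together, and how redis manages Lua scripts.

16.1 lua

Lua is embeddable, lightweight and efficient, and it adds flexibility to statically compiled languages. With Lua, a program can be changed or extended with fewer recompilations, which is especially common in game development.

Here is an example of calling a Lua script from C: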

// These are the three headers Lua needs.
// Of course, you also have to link against the right lib.
extern "C"
{
#include "lua.h"
#include "lauxlib.h"
#include "lualib.h"
}

int main(int argc, char *argv[])
{
    lua_State *L = lua_open();
    // Note: with Lua 5.1 or later, replace the next two lines with
    // luaL_openlibs(L);
    luaopen_base(L);
    luaopen_io(L);
    // Note: with Lua 5.1 or later, use luaL_dostring(L, buf);
    lua_dofile(L, "script.lua");
    lua_close(L);
    return 0;
}

The line lua_dofile(L, "script.lua"); opens up a lot of possibilities: developers can implement program logic in the script.lua file without recompiling main.cpp. In the example above, C runs a Lua script. The reverse also works: C functions can be registered with the Lua interpreter and then called from Lua scripts.

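16.2 Why redis added Lua support

As noted above, Lua gives statically compiled programs more flexibility. Before Lua scripting, redis had no server-side computation: it was mainly used for storage and caching, and all computation happened on the client. That has two drawbacks. First, it can break data consistency: imagine two clients that each GET a value, modify it differently, and then write their results back one after the other; the final value on the redis server is certainly not what one of the clients expected. Second, it wastes network bandwidth on transferring data.

Lua scripting solves this problem thoroughly. With Lua support, the client can define the computation to run on a key on the server.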

TODO

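16.3 Initializing the Lua environment

The Lua environment is set up in scriptingInit(), called during redis server initialization. It does the following:

1. Loads commonly used Lua libraries so that scripts can call them.
2. Creates the SHA1 -> lua_script hash table, so redis keeps the Lua scripts that clients have executed. SHA1 is a secure hash algorithm that produces a fixed-length digest; you can think of the digest as a key. Keeping executed scripts is very useful when a script has to run often: the client only needs to supply the SHA1 digest to run the corresponding script. This is in fact how the EVALSHA command works.
3. Registers some redis functions, such as the command handlers and the logging function. Registered functions can be called from Lua scripts.
4. Replaces some functions of the libraries already loaded.
5. Creates a fake client. As with AOF and RDB data recovery, the purpose is to reuse the command handlers.

Points 3 and 5 deserve a closer look.

When redis initializes the Lua environment it registers two command handlers, luaRedisCallCommand() and luaRedisPCallCommand(). Once registered, they can be called from Lua scripts, which is how a script executes redis commands.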

void scriptingInit(void) {
    .....
    // Register redis data and functions with the Lua interpreter
    /* Register the redis commands table and fields */
    lua_newtable(lua);

    // Register redis.call, a command handler
    /* redis.call */
    lua_pushstring(lua,"call");
    lua_pushcfunction(lua,luaRedisCallCommand);
    lua_settable(lua,-3);

    // Register redis.pcall, a command handler
    /* redis.pcall */
    lua_pushstring(lua,"pcall");
    lua_pushcfunction(lua,luaRedisPCallCommand);
    lua_settable(lua,-3);
    .....
}

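Taking luaRedisCallCommand() as the example, when it is called back it:

1. Validates the arguments and extracts them through the Lua API
2. Fills the arguments into the fake client server.lua_client
3. Looks up the command
4. Performs the dirty-command check (covered in detail below)
5. Executes the command
6. Processes the command's result

The benefit of the fake client shows once again. While a Lua script is running, the redis server serves only the fake client.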

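16.4 How redis executes a Lua script

Once again we start from a client sending a Lua-related command. Suppose the user sends the following EVAL command:

eval "redis.call('set', KEYS[1], ARGV[1])" 1 views 18000

The first argument is the script text and the second (1) is the number of keys. The key names follow (views), and everything after them is passed as extra arguments (18000). Inside the script they are visible as KEYS and ARGV. The command sets views to 18000. When the redis server receives it, it calls the command handler evalCommand():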

void evalCommand(redisClient *c) {
    evalGenericCommand(c,0);
}

void evalGenericCommand(redisClient *c, int evalsha) {
    lua_State *lua = server.lua;
    char funcname[43];
    long long numkeys;
    int delhook = 0, err;

    // Reset the PRNG seed so that scripts using random numbers behave
    // the same way on every run
    redisSrand48(0);
    // Reset the dirty-command flags
    server.lua_random_dirty = 0;
    server.lua_write_dirty = 0;

    // Validate the arguments
    if (getLongLongFromObjectOrReply(c,c->argv[2],&numkeys,NULL) != REDIS_OK)
        return;
    if (numkeys > (c->argc - 3)) {
        addReplyError(c,"Number of keys can't be greater than number of args");
        return;
    }

    // The function name starts with f_
    funcname[0] = 'f';
    funcname[1] = '_';
    // EVAL: no SHA1 was given, so compute the SHA1 of the script
    if (!evalsha) {
        // The digest is the key used in the SHA1 -> lua_script hash table
        // c->argv[1]->ptr is the script supplied by the user
        // sha1hex() stores the digest in funcname
        sha1hex(funcname+2,c->argv[1]->ptr,sdslen(c->argv[1]->ptr));
    } else {
        // EVALSHA: the user supplied the SHA1
        int j;
        char *sha = c->argv[1]->ptr;

        for (j = 0; j < 40; j++)
            funcname[j+2] = tolower(sha[j]);
        funcname[42] = '\0';
    }

    // Push the error handler onto the stack
    // lua_getglobal() reads the named global and pushes it
    lua_getglobal(lua, "__redis__err__handler");

    /* Try to lookup the Lua function */
    // Check whether the function is already defined in Lua; this tries to
    // push funcname onto the stack
    lua_getglobal(lua, funcname);
    if (lua_isnil(lua,-1)) { // funcname does not exist in Lua
        // Pop the nil
        lua_pop(lua,1); /* remove the nil from the stack */
        // funcname is not defined. EVALSHA cannot create it, so it fails;
        // EVAL creates it below
        if (evalsha) {
            lua_pop(lua,1); /* remove the error handler from the stack. */
            addReply(c, shared.noscripterr);
            return;
        }
        // Create the Lua function funcname
        // c->argv[1] holds the script supplied by the user
        if (luaCreateFunction(c,lua,funcname,c->argv[1]) == REDIS_ERR) {
            lua_pop(lua,1);
            return;
        }
        // The global funcname now exists in Lua; read and push it,
        // ready to be called
        lua_getglobal(lua,funcname);
        redisAssert(!lua_isnil(lua,-1));
    }

    // Set the script arguments: the keys and the remaining values
    luaSetGlobalArray(lua,"KEYS",c->argv+3,numkeys);
    luaSetGlobalArray(lua,"ARGV",c->argv+3+numkeys,c->argc-3-numkeys);

    // Make the Lua fake client use the same database as the caller
    /* Select the right DB in the context of the Lua client */
    selectDb(server.lua_client,c->db->id);

    // Install the timeout hook so that a script that runs for too long
    // can be stopped
    server.lua_caller = c;
    server.lua_time_start = ustime()/1000;
    server.lua_kill = 0;
    if (server.lua_time_limit > 0 && server.masterhost == NULL) {
        // luaMaskCountHook() is called every 100000 Lua VM instructions
        lua_sethook(lua,luaMaskCountHook,LUA_MASKCOUNT,100000);
        delhook = 1;
    }

    // The function is now known to be registered, so call the script
    err = lua_pcall(lua,0,1,-2);

    // Remove the timeout hook
    if (delhook) lua_sethook(lua,luaMaskCountHook,0,0); /* Disable hook */
    // If the script timed out (it may then have been ended by SCRIPT KILL),
    // start listening to the calling client again
    if (server.lua_timedout) {
        server.lua_timedout = 0;
        aeCreateFileEvent(server.el,c->fd,AE_READABLE,
                          readQueryFromClient,c);
    }
    // Clear lua_caller
    server.lua_caller = NULL;
    // The script may have switched database on the fake client; copy the
    // fake client's database back to the caller
    selectDb(c,server.lua_client->db->id); /* set DB ID from Lua client */
    // Garbage collection
    lua_gc(lua,LUA_GCSTEP,1);

    if (err) {
        // The script failed: report the error to the client
        addReplyErrorFormat(c,"Error running script (call to %s): %s\n",
            funcname, lua_tostring(lua,-1));
        lua_pop(lua,2); /* Consume the Lua reply and remove error handler. */
    } else {
        // Success
        /* On success convert the Lua return value into Redis protocol, and
         * send it to the client. */
        luaReplyToRedisReply(c,lua); /* Convert and consume the reply. */
        lua_pop(lua,1); /* Remove the error handler. */
    }

    // Propagate the script to the slaves and write it to the AOF
    ..... // TODO: to be expanded; the question has been figured out
}
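The argument handling at the top of evalGenericCommand() can be sketched in a few lines. This is a minimal model, assuming the argument vector is a plain list of strings; split_eval_args() is a hypothetical helper invented for illustration, and only the f_ + SHA1 naming and the numkeys check are taken from the listing above.

```python
import hashlib


def split_eval_args(argv):
    """Split ['EVAL', script, numkeys, ...] into (funcname, keys, args)."""
    script, numkeys = argv[1], int(argv[2])
    if numkeys > len(argv) - 3:
        raise ValueError("Number of keys can't be greater than number of args")
    # The Lua function is named f_ followed by the 40 hex digits of the SHA1.
    funcname = "f_" + hashlib.sha1(script.encode()).hexdigest()
    keys = argv[3:3 + numkeys]   # becomes the KEYS table
    args = argv[3 + numkeys:]    # becomes the ARGV table
    return funcname, keys, args
```

The function name is 42 characters long, which is why the C code declares funcname[43]: two characters for f_, forty for the digest, and one for the terminating NUL.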


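16.5 Dirty commands

Before explaining dirty commands, one point needs to be stated.

Lua scripts executed by the redis server are, like ordinary commands, written to the AOF file and propagated over the replication link. Taking replication as the example, there are two ways to propagate the data changes made by a script to the slaves. One: treat them like ordinary commands and propagate every write operation. Two: send the Lua script itself to the slaves. Both work, and the changes are propagated either way. But with the first approach each command is converted to the redis protocol format, which uses more bandwidth than the text of the script. The first approach also uses more CPU: a slave that receives commands in protocol format has to parse them back into commands before executing them, which costs more than simply running the script. The author clearly put a lot of thought into this ;). It is admirable for open-source software to be this careful; redis really is small but complete.

The conclusion is to send the Lua script itself to the slaves. That raises a new problem, though. Take this Lua script: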

-- lua script
local some_key
some_key = redis.call('RANDOMKEY') -- <--- TODO nil
redis.call('set', some_key, '123')

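The script picks a random key on the redis server and sets its value to 123. The handler of the RANDOMKEY command calls random(), and here is the problem: when the script is propagated to different slaves, random() returns different results on each of them, so the master and the slaves end up with different data.

To deal with this, the redis server structure holds two variables:

    // A write happened in the Lua script
    int lua_write_dirty;  /* True if a write command was called during the
                             execution of the current script. */
    // A non-deterministic operation happened in the Lua script, such as
    // the RANDOMKEY command
    int lua_random_dirty; /* True if a random command was called during the
                             execution of the current script. */

Both variables are reset to zero before a script runs. While the script runs, before each command is executed, redis checks whether a non-deterministic command has already been called before this write. If so, the write is refused. Otherwise the two variables are updated: lua_write_dirty = 1 if the command is a write, lua_random_dirty = 1 if it is non-deterministic. The flowchart below summarizes this; it is also worth looking back at the previous listing:

[Figure: flowchart of the dirty-command check]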
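The dirty-command check can be modeled in a few lines. This is a sketch under stated assumptions: COMMAND_FLAGS is a hypothetical three-entry table standing in for the flags in the real redis command table, and ScriptState is an invented name; only the two flags and the order of the checks follow the description above.

```python
WRITE, RANDOM = "write", "random"

# Hypothetical, tiny command table used only for this illustration.
COMMAND_FLAGS = {
    "get": set(),
    "set": {WRITE},
    "randomkey": {RANDOM},
}


class ScriptState:
    """Tracks the two flags for one script run; both start cleared."""

    def __init__(self):
        self.lua_random_dirty = False
        self.lua_write_dirty = False

    def call(self, name):
        flags = COMMAND_FLAGS[name.lower()]
        # A write after a non-deterministic command is refused, because
        # replaying the script on a slave could write different data.
        if WRITE in flags and self.lua_random_dirty:
            raise RuntimeError(
                "Write commands not allowed after non deterministic commands")
        if RANDOM in flags:
            self.lua_random_dirty = True
        if WRITE in flags:
            self.lua_write_dirty = True
        return "OK"
```

Applied to the RANDOMKEY script shown earlier, the first call sets lua_random_dirty and the following set is rejected, which is exactly the case the two flags exist to catch.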

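16.6 Propagating Lua scripts

As described above, the data changes made by a Lua script have to be propagated, and redis does so by sending the script itself to the slaves and writing it to the AOF file.

Concretely, redis rewrites the arguments of the client that ran the script into "EVAL" plus the text of the script, and leaves the sending to slaves and the writing to the AOF file to the replication and AOF persistence mechanisms. Here is an excerpt:

TODO: when are the AOF and replication flags set?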

void evalGenericCommand(redisClient *c, int evalsha) {
    .....
    if (evalsha) {
        if (!replicationScriptCacheExists(c->argv[1]->ptr)) {
            /* This script is not in our script cache, replicate it as
             * EVAL, then add it into the script cache, as from now on
             * slaves and AOF know about it. */
            // Fetch the script from server.lua_scripts
            // c->argv[1]->ptr is the SHA1
            robj *script = dictFetchValue(server.lua_scripts,c->argv[1]->ptr);

            // Add it to the script cache dedicated to replication
            replicationScriptCacheAdd(c->argv[1]->ptr);
            redisAssertWithInfo(c,NULL,script != NULL);
            // Rewrite the command:
            // argument 0 becomes EVAL
            // argument 1 becomes the script text
            // so that AOF persistence and replication propagate the script
            rewriteClientCommandArgument(c,0,
                resetRefCount(createStringObject("EVAL",4)));
            rewriteClientCommandArgument(c,1,script);
        }
    }
    .....
}

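16.7 Summary

The redis server is single-process and single-threaded, so developers writing Lua scripts should pay close attention to time complexity and not let a script drag down the performance of the whole redis server.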

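Chapter 17

The redis monitoring mechanism (sentinel)

17.1 The service framework of a redis sentinel

A sentinel is also a redis server, but its role differs from the redis servers we usually talk about: it monitors ordinary redis servers and improves the robustness and reliability of a server cluster. Sentinels and ordinary redis servers share the same server framework, including the network layer, the underlying data structures and the publish/subscribe mechanism.

Starting from the main function, let us see how a sentinel server is born and where it parts ways with an ordinary redis server: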

int main(int argc, char **argv) {
    struct timeval tv;

    // Random seeds, used by rand() and similar functions
    srand(time(NULL)^getpid());
    gettimeofday(&tv,NULL);
    dictSetHashFunctionSeed(tv.tv_sec^tv.tv_usec^getpid());

    // Decide from the command line whether to start in sentinel mode
    server.sentinel_mode = checkForSentinelMode(argc,argv);

    // Initialize the server configuration, mainly by filling in the
    // fields of the redisServer structure
    initServerConfig();

    // Configure the server as a sentinel, unlike an ordinary redis server
    /* We need to init sentinel right now as parsing the configuration file
     * in sentinel mode will have the effect of populating the sentinel
     * data structures with master nodes to monitor. */
    if (server.sentinel_mode) {
        // initSentinelConfig() only sets the sentinel server's port
        initSentinelConfig();
        initSentinel();
    }
    .....
    // Ordinary redis server mode
    if (!server.sentinel_mode) {
        .....
    // Sentinel server mode
    } else {
        // Check that sentinel mode is configured correctly
        sentinelIsRunning();
    }
    .....
    // Enter the event loop
    aeMain(server.el);
    // Tear down the event loop
    aeDeleteEventLoop(server.el);
    return 0;
}

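Above, the command-line arguments decide whether the redis server runs in sentinel mode, and the result is stored in redisServer.sentinel_mode. The main function also calls one key function, initSentinel(), which performs the initialization specific to a sentinel server, including filling in the sentinel's own command table and the struct sentinel structure.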

// Initialization specific to a sentinel server
/* Perform the Sentinel mode initialization. */
void initSentinel(void) {
    int j;

    // In sentinel mode, empty the command table. A sentinel has its own
    // command table, unlike an ordinary redis server
    /* Remove usual Redis commands from the command table, then just add
     * the SENTINEL command. */
    dictEmpty(server.commands,NULL);
    // Fill server.commands with the commands in sentinelcmds
    for (j = 0; j < sizeof(sentinelcmds)/sizeof(sentinelcmds[0]); j++) {
        int retval;
        struct redisCommand *cmd = sentinelcmds+j;

        retval = dictAdd(server.commands, sdsnew(cmd->name), cmd);
        redisAssert(retval == DICT_OK);
    }

    /* Initialize various data structures. */
    // sentinel.current_epoch acts as a version number
    sentinel.current_epoch = 0;
    // Hash table of the redis servers monitored by this sentinel
    sentinel.masters = dictCreate(&instancesDictType,NULL);
    // sentinel.tilt handles the case where the system clock misbehaves
    sentinel.tilt = 0;
    // Time at which TILT mode started
    sentinel.tilt_start_time = 0;
    // sentinel.previous_time is the last time the sentinel ran its timer
    sentinel.previous_time = mstime();
    // Number of scripts the sentinel is currently running
    sentinel.running_scripts = 0;
    // Script queue
    sentinel.scripts_queue = listCreate();
}

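Looking at the global variable struct redisCommand sentinelcmds, we find only seven commands in it. Is that all a sentinel offers? To let the sentinel manage ordinary redis servers automatically, it also has a timer routine. In serverCron() we can see the sentinel's timer routine being called, and that routine contains the sentinel's main work.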

int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) {
    .....
    run_with_period(100) {
        if (server.sentinel_mode) sentinelTimer();
    }
    .....
}

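17.2 The timer routine

For how the timer routine gets called, see the earlier chapter on the redis event loop.

The timer routine plays a central role in a sentinel server. Its main jobs are monitoring ordinary redis servers (both masters and slaves), performing failover, and running script commands.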

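// The sentinel timer routine
void sentinelTimer(void) {
    // Check whether sentinel TILT mode has to be entered
    sentinelCheckTiltCondition();
    // Run the scheduled tasks for every server instance in the hash table;
    // this function is important
    sentinelHandleDictOfRedisInstances(sentinel.masters);
    // Run pending scripts, as long as the number of running scripts is
    // below the limit
    sentinelRunPendingScripts();
    // Reap script processes that have finished; scripts that succeeded
    // are removed from the script queue
    sentinelCollectTerminatedScripts();
    // Stop script processes that have run past their timeout
    sentinelKillTimedoutScripts();
    // To keep several sentinels from starting an election at the same
    // time, deliberately desynchronize their timers by changing the timer
    // frequency: the default REDIS_DEFAULT_HZ plus a random value
    server.hz = REDIS_DEFAULT_HZ + rand() % REDIS_DEFAULT_HZ;
}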

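17.3 How sentinels and redis servers connect

Every sentinel has a struct sentinel, which maintains connections to several masters. The information about each master's connection is stored in a struct sentinelRedisInstance. From these two structures we can quickly sketch what a sentinel server knows about the machines it maintains: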

typedef struct sentinelRedisInstance {
    .....
    /* Master specific. */
    // Other sentinels monitoring this master
    dict *sentinels;    /* Other sentinels monitoring the same master. */
    // Slaves of this master
    dict *slaves;       /* Slaves for this master instance. */
    .....
    // For a slave, master points to its master
    struct sentinelRedisInstance *master; /* Master instance if it's slave. */
    .....
} sentinelRedisInstance;

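So a sentinel server connects to (monitors) several masters, several slaves and several other sentinels. With this outline in mind, the rest is easier to follow.

To monitor a redis server, a sentinel must connect to it. A configuration file has to be given when the sentinel starts; it is read during initialization to obtain the IP address, port and other details of the redis servers to monitor.

redis-server /path/to/sentinel.conf --sentinel

or

redis-sentinel /path/to/sentinel.conf

To monitor a redis server, put this in the configuration file:

sentinel monitor <master-name> <ip> <redis-port> <quorum>

Here master-name is the name of the master, ip and redis-port are its IP address and port, and quorum is the parameter the sentinel uses to decide whether a redis server is down, which is covered later. The obvious question is where this configuration line gets processed. Searching the source for the keyword "sentinel monitor" quickly leads to sentinelHandleConfiguration(), which parses and handles the configuration file: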

// Parse and handle the sentinel configuration file
char *sentinelHandleConfiguration(char **argv, int argc) {
    sentinelRedisInstance *ri;

    if (!strcasecmp(argv[0],"monitor") && argc == 5) {
        /* monitor <name> <host> <port> <quorum> */
        int quorum = atoi(argv[4]);

        // quorum must be at least 1
        if (quorum <= 0) return "Quorum must be 1 or greater.";
        if (createSentinelRedisInstance(argv[1],SRI_MASTER,argv[2],
                                        atoi(argv[3]),quorum,NULL) == NULL)
        {
            switch(errno) {
            case EBUSY: return "Duplicated master name.";
            case ENOENT: return "Can't resolve master instance hostname.";
            case EINVAL: return "Invalid port number";
            }
    .....

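The main call here is createSentinelRedisInstance(), whose job is to initialize a sentinelRedisInstance structure. The sentinel does not connect to the given redis server right away. Instead it sets SRI_DISCONNECTED in sentinelRedisInstance.flags and leaves the connecting to the timer routine. From this we can guess that the timer routine contains a function that checks sentinelRedisInstance.flags and opens a connection when it finds the instance disconnected. This is the same strategy we saw earlier when a client connects to a redis server, and it is a habit of redis. Since a sentinel has to stay connected to its redis servers, it must check the connection state periodically anyway. Connecting directly at configuration time would also work, but this approach keeps the code cleaner.

The call chain of the timer routine does show the sentinel actively connecting to redis servers: sentinelTimer()->sentinelHandleRedisInstance()->sentinelReconnectInstance().

sentinelReconnectInstance() connects to redis servers flagged SRI_DISCONNECTED. It opens two connections to each redis server: 1. a normal connection (sentinelRedisInstance.cc, the commands connection); 2. a connection dedicated to publish/subscribe (sentinelRedisInstance.pc, the publish connection). Why two? Because for any one client connection, a redis server handles either ordinary commands or publish/subscribe commands, a detail covered in the earlier publish/subscribe chapter.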

void sentinelReconnectInstance(sentinelRedisInstance *ri) {
    if (!(ri->flags & SRI_DISCONNECTED)) return;

    /* Commands connection. */
    if (ri->cc == NULL) {
        ri->cc = redisAsyncConnect(ri->addr->ip,ri->addr->port);
        // Connection error
        if (ri->cc->err) {
            // Error handling
        } else {
            // Attach the connection to the server's event loop
            .....
        }
    }
    // The sentinel subscribes to the hello channel of every master and
    // slave. Each sentinel periodically publishes information about itself
    // and the servers it monitors to the hello channel of the masters and
    // slaves, so this sentinel can discover the others and spread its own
    // observations to them. This is what redis calls auto discovery.
    /* Pub / Sub */
    if ((ri->flags & (SRI_MASTER|SRI_SLAVE)) && ri->pc == NULL) {
        ri->pc = redisAsyncConnect(ri->addr->ip,ri->addr->port);
        // Connection error
        if (ri->pc->err) {
            // Error handling
        } else {
            // Attach the connection to the server's event loop
            .....
            // Subscribe to the __sentinel__:hello channel on ri
            /* Now we subscribe to the Sentinels "Hello" channel. */
            retval = redisAsyncCommand(ri->pc,
                sentinelReceiveHelloMessages, NULL, "SUBSCRIBE %s",
                SENTINEL_HELLO_CHANNEL);
            .....
        }
    }
    .....
}

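In its timer routine the sentinel tries to reconnect to every master. That leaves a question: slaves were mentioned earlier, so when does the sentinel connect to slaves and to other sentinels?

17.4 The HELLO command

From the source of sentinelReconnectInstance() we know that the sentinel manages two connections per redis server: a normal command connection and a dedicated publish/subscribe connection. When it sets up the publish/subscribe connection it does two things: it sends a SUBSCRIBE SENTINEL_HELLO_CHANNEL command to the redis server, and it registers the callback sentinelReceiveHelloMessages(). With a little thought, the data flow can be drawn roughly as follows:

[Figure: data flow of HELLO messages between sentinels and a master]

According to the source, the data that sentinel A publishes to the HELLO channel of master 1 is: sentinel A's IP address, port, runid and current configuration epoch, plus master 1's name, IP, port and current configuration epoch. As the figure shows, every other sentinel monitoring the same redis server receives a copy of the HELLO data. This is ordinary publish/subscribe, covered in detail in an earlier chapter.

In the timer call chain sentinelTimer()->sentinelHandleRedisInstance()->sentinelPingInstance(), the sentinel publishes data to the hello channel of the redis server, which may answer the question raised above. The function that publishes to the hello channel is easy to find in sentinel.c: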

int sentinelSendHello(sentinelRedisInstance *ri) {
    // ri can be a master or a slave.
    // The master or slave is only used as a relay: on receiving the publish
    // command it forwards the data to the sentinels subscribed to the hello
    // channel. One may ask: why not send it to the sentinels directly?
    char ip[REDIS_IP_STR_LEN];
    char payload[REDIS_IP_STR_LEN+1024];
    int retval;
    sentinelRedisInstance *master = (ri->flags & SRI_MASTER) ? ri : ri->master;
    sentinelAddr *master_addr = sentinelGetCurrentMasterAddress(master);

    /* Try to obtain our own IP address. */
    if (anetSockName(ri->cc->c.fd,ip,sizeof(ip),NULL) == -1) return REDIS_ERR;
    if (ri->flags & SRI_DISCONNECTED) return REDIS_ERR;

    // Format the data to send:
    // sentinel IP address, port, runid, current epoch,
    // master name, IP address, port, current configuration epoch
    /* Format and send the Hello message. */
    snprintf(payload,sizeof(payload),
        "%s,%d,%s,%llu," /* Info about this sentinel. */
        "%s,%s,%d,%llu", /* Info about current master. */
        ip, server.port, server.runid,
        (unsigned long long) sentinel.current_epoch,
        /* --- */
        master->name,master_addr->ip,master_addr->port,
        (unsigned long long) master->config_epoch);
    retval = redisAsyncCommand(ri->cc,
        sentinelPublishReplyCallback, NULL, "PUBLISH %s %s",
        SENTINEL_HELLO_CHANNEL,payload);
    if (retval != REDIS_OK) return REDIS_ERR;
    ri->pending_commands++;
    return REDIS_OK;
}

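[There is no need to know exactly what redisAsyncConnect() and redisAsyncCommand() do here; their names say enough. This is not about being superficial, it simply speeds up understanding how the sentinel module works.]

When the redis server receives the data from a sentinel, it publishes it to every sentinel subscribed to the hello channel, and so the callback registered earlier, sentinelReceiveHelloMessages(), is invoked. The callback does two things:

1. Discovers other sentinels monitoring the same redis server
2. Updates the configuration epoch: when another sentinel reports a higher configuration epoch, the master's configuration (IP address and port) is updated

To sum up how this works: a sentinel publishes to the hello channel its own IP address, port, runid and current epoch, together with the IP address, port and current configuration epoch of the master it monitors. [TODO: explain what the runid and the configuration epoch are.] Much is still unexplained, but we can already see that when a new sentinel joins a redis deployment, it discovers the other sentinels through the hello channel and is discovered by them in turn. This is part of what redis calls auto discovery.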
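The hello payload built by sentinelSendHello() is a comma-separated string of eight fields, so the receiving side can be sketched as a small parser. This is an illustration only: the Hello tuple and the three helper functions are hypothetical names invented here, and the "newer configuration epoch wins" rule is a simplified reading of the second point above, not the actual body of sentinelReceiveHelloMessages().

```python
from collections import namedtuple

# Field order follows the snprintf() format string in sentinelSendHello().
Hello = namedtuple(
    "Hello",
    "ip port runid current_epoch "
    "master_name master_ip master_port master_config_epoch")


def format_hello(hello):
    """Build the comma-separated payload a sentinel would PUBLISH."""
    return ",".join(str(field) for field in hello)


def parse_hello(payload):
    """Parse a payload received on the hello channel."""
    parts = payload.split(",")
    if len(parts) != 8:
        raise ValueError("expected 8 comma-separated fields")
    ip, port, runid, epoch, mname, mip, mport, mepoch = parts
    return Hello(ip, int(port), runid, int(epoch),
                 mname, mip, int(mport), int(mepoch))


def should_adopt_master_address(local_config_epoch, hello):
    """A peer advertising a newer configuration epoch wins."""
    return hello.master_config_epoch > local_config_epoch
```

A sentinel that sees an unknown runid in a parsed message has found a new peer, and one that sees a higher master_config_epoch than its own would switch to the advertised master address.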

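Chapter 18

redis cluster

18.1 Prelude

redis cluster is not discussed in detail in this chapter for now; that will wait until a stable release is out. The idea of clustering was discussed long before redis 3.0, but it only appeared in the source with 3.0, which is still a development version. It is not recommended for production use, as it is likely to have many problems.

A redis cluster has to address these questions:

1. How do nodes synchronize data, and how is data consistency achieved? With one master and one backup, the master-slave replication built into redis can synchronize the data. But as nodes are added and several masters exist, synchronization gets harder.
2. How is load balanced? Under heavy traffic, how are requests spread as evenly as possible over the server nodes? A poor load-balancing algorithm can cause an avalanche.
3. How does it scale smoothly? When traffic grows, can a new redis node be made available through simple configuration?
4. How available is it? When some nodes fail, can the cluster recover its ability to work quickly?
5. ......

A robust backend system needs a great deal of thought. redis does not have clustering yet, but as an in-memory database it is far faster than a traditional database, and that is where things are heading.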

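18.2 Consistent hashing, revisited

18.2.1 Background

When traffic is heavy, performance concerns (slow lookups and too many requests) mean that not all data is kept on one redis server, or even on one physical machine. The keys have to be spread evenly over several redis servers, which can be done with:

target = hash(key) % N, where target is the target node, key is the key, and N is the number of redis nodes

Hashing and taking the remainder sends different keys to different servers.

But consider these scenarios:

1. Traffic suddenly grows and the existing servers are not enough. After adding a server node, the same calculation, hash(key) % (N+1), is used for sharding and dispatch, but existing keys are now sent to different servers than before. A large amount of data becomes unreachable and has to be written (set) to the redis servers again.
2. One of the servers goes down. Unless it is repaired promptly, the many requests dispatched to it all fail.

Both are real problems.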
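How much data is remapped by the first scenario can be measured with a short experiment. This is a sketch under stated assumptions: md5 stands in for hash(), and the key names and node counts (4 growing to 5) are arbitrary choices made for illustration.

```python
import hashlib


def node_for(key, n):
    """target = hash(key) % N, with md5 as the hash function."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n


def moved_fraction(keys, old_n, new_n):
    """Fraction of keys whose target node changes when N changes."""
    moved = sum(1 for k in keys if node_for(k, old_n) != node_for(k, new_n))
    return moved / len(keys)


keys = ["key:%d" % i for i in range(10000)]
fraction = moved_fraction(keys, 4, 5)
```

A key keeps its node only when hash(key) % 4 equals hash(key) % 5, which holds for 4 out of every 20 consecutive hash values, so with a uniform hash roughly 80% of the keys are expected to move when one node is added to four.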

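18.2.2 The consistent hashing algorithm

Take the points 0 to 2^32 - 1 on a ring. Each point corresponds to a cache slot, and the storage position of each key-value pair is also mapped onto the ring by hashing. In reality there cannot be that many nodes, so if the position a key hashes to has no node, the key is stored on the next node found clockwise.

Consider adding a server node. Data clockwise of the new node is still stored on the nodes further clockwise, but data counter-clockwise of it is now stored on the new node itself. Only part of the data is invalidated and mapped to the new cache slot.

Consider removing a node. Data clockwise of the missing node is still stored on the next node clockwise, call it beta, and the data counter-clockwise of the missing node is now stored on beta as well. Again, only part of the data is invalidated and remapped to a new server node.

This case is more troublesome. In the figure, once node gamma fails a large amount of data maps to node alpha. The worry is that alpha cannot take the load, then beta cannot either, and we get a domino effect ;). This touches on data balance and load balancing. Data balance means spreading the data as evenly as possible over all nodes so that storage is balanced.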

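18.2.3 Virtual nodes

Mapping several virtual nodes to one real node balances the storage. The previous mapping was:

key -> node

With a layer of virtual nodes in between there is one more level of mapping:

key -> <virtual node> -> node

The benefit of the extra layer is that a change in one layer goes unnoticed by the other.

For example, when physical storage nodes have to be added, it is enough to add virtual nodes as needed and update the mapping table.

When a node unfortunately fails, only the mapping between virtual nodes and real nodes needs adjusting.

When the data distribution is found to be unbalanced, the mapping between virtual and real nodes can be adjusted to reach a new balance.

All of this is completely transparent to the user, and load balancing is achieved as well.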
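A ring with virtual nodes can be sketched in a few dozen lines. This is a minimal illustration, not how twemproxy or redis implements it: HashRing and h32() are hypothetical names, md5 is an arbitrary choice of hash, and 100 virtual nodes per real node is an assumed default.

```python
import bisect
import hashlib


def h32(text):
    """Map text to a point on the 0 .. 2**32 - 1 ring (md5-based)."""
    return int(hashlib.md5(text.encode()).hexdigest(), 16) % (2 ** 32)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self.points = []   # sorted ring positions of all virtual nodes
        self.owner = {}    # ring position -> real node
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place `vnodes` virtual nodes for one real node on the ring."""
        for i in range(self.vnodes):
            p = h32("%s#%d" % (node, i))
            if p not in self.owner:  # skip the (unlikely) position collision
                self.owner[p] = node
                bisect.insort(self.points, p)

    def remove(self, node):
        """Drop every virtual node that belongs to a real node."""
        self.points = [p for p in self.points if self.owner[p] != node]
        self.owner = {p: n for p, n in self.owner.items() if n != node}

    def lookup(self, key):
        """Walk clockwise from the key's position to the first virtual node."""
        i = bisect.bisect_left(self.points, h32(key))
        return self.owner[self.points[i % len(self.points)]]
```

Adding a node only moves keys onto the new node, and removing it again restores the previous assignment exactly; keys that belonged to the other nodes never shuffle among themselves. lookup() assumes the ring holds at least one node.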

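18.3 What consistent hashing solves

With consistent hashing, data is hashed and stored on different nodes, so as long as no machine is added or removed the consistency question does not arise. When nodes are added or removed, only part of the data is mapped to a new node. This means data can be inconsistent at that moment, but only a small part of the data changes, so consistent hashing keeps the consistency problem as small as possible.

Consistent hashing presupposes a good hash function: if hash() produces results in the range [0, N-1], then for a key, hash(key) should land on each point of [0, N-1] with probability 1/N. Popular hash functions today include md5 and SHA.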

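18.4 How is it implemented?

Consistent hashing can be implemented in the client or in middleware (such as a proxy). In a client implementation, the client initializes a mapping table of the available redis nodes when it starts: hash(key) => <redis node>. The drawback is that with several clients, all of them have to fetch the new mapping table at the same time whenever it changes.

The other approach is middleware (a proxy): a proxy sits between the clients and the redis nodes, hashes each key and dispatches the request to the corresponding node, so consistent hashing lives inside the middleware. This is what twemproxy does.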

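18.5 twemproxy, a redis cluster management solution

twemproxy is a lightweight backend proxy open-sourced by twitter. It speaks the redis and memcache protocols and can be used to manage redis/memcache clusters.

twemproxy implements consistent hashing internally. To a client, twemproxy is the entry point to the cache database, and the client does not need to know how the backend is deployed. twemproxy checks whether its connection to each node is healthy and ejects nodes that misbehave; after a while it tries to connect to the ejected nodes again.

Typically a pool of redis nodes is managed by several twemproxy instances, a few handling writes and most handling reads. twemproxy can track the state of all redis nodes in the pool in real time, but its support for failure recovery still needs work. One solution is to use redis sentinel for automatic master-slave switching: when a master goes down, sentinel automatically promotes a slave to master, and twemproxy can periodically pull information from redis sentinel to replace the failed node.

More details of twemproxy are not discussed here.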

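18.6 Clustering in official redis

The latest redis releases have started to support clustering too, so external help is no longer needed.

The basic idea is that each redis instance in the cluster stores only a certain subset of the key-value pairs, where the subset is determined by a default or custom hash function. When a redis instance receives a request, it first checks whether the key is its own responsibility. If so, it carries on; otherwise it issues a redirect similar to an http 3XX, telling the client to ask another redis instance in the cluster.

Every redis instance follows a protocol to maintain the availability and stability of the cluster. If you are interested, see the official site for the details of the redis cluster implementation.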

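Part IV

redis in practice

Chapter 19

Score leaderboard

19.1 Requirements

A score leaderboard is a classic redis application.

If all data lives in a database, every page view requires sorting all of it. For a site with heavy daily traffic, the server cannot cope and the user experience suffers. redis provides the sorted set data structure, implemented underneath as a skip list, so inserts and deletes are efficient. It suits scenarios that need real-time ordering; the score leaderboard of a game is one example.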

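19.2 A short introduction to the ZSET commands

redis offers a series of commands for sorted sets, and building a leaderboard requires knowing how to use them.

1. ZADD: add new elements. Usage:
   ZADD key score member [score member ...]
   key is the name of the sorted set, member is the element data, and score is the element's score. Internally, elements are ordered by score, and by member when scores are equal.
2. ZRANGE: return the elements in a given rank range, ordered from lowest to highest score. Usage:
   ZRANGE key start stop [WITHSCORES]
   start is the first rank and stop the last. ZRANGE is not hard to implement: a binary-search-like descent of the skip list is enough. TODO
3. ZREVRANGE: return the elements in a given rank range, ordered from highest to lowest score. Usage is the same as above.
4. ZRANK: return the rank of an element. Usage:
   ZRANK key member
   The principle is similar, a binary-search-like descent. TODO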

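19.3 Implementation

Take a forum as the example. The front page has to show the hottest few posts, and that list changes often. When a post is viewed, its view count is written to the database and also to redis, that is, its score is updated.

A leaderboard written in python: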

»»»»»»»»
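The original listing is not reproduced above, so here is a minimal sketch of what such a leaderboard could look like. It is an assumption-laden illustration, not the author's code: Leaderboard, InMemoryZSets and the key name hot_posts are invented here. InMemoryZSets is a stand-in so the sketch runs without a server; its method names mirror the redis-py client (zincrby, zrevrange, zrevrank), and it only supports a non-negative end index or -1.

```python
class InMemoryZSets:
    """Stand-in for a redis client so the sketch runs without a server."""

    def __init__(self):
        self.zsets = {}

    def zincrby(self, name, amount, value):
        zset = self.zsets.setdefault(name, {})
        zset[value] = zset.get(value, 0.0) + amount
        return zset[value]

    def _ranked(self, name):
        # Highest score first; equal scores ordered by member, descending.
        items = self.zsets.get(name, {}).items()
        return sorted(items, key=lambda kv: (kv[1], kv[0]), reverse=True)

    def zrevrange(self, name, start, end, withscores=False):
        stop = None if end == -1 else end + 1
        chunk = self._ranked(name)[start:stop]
        return chunk if withscores else [member for member, _ in chunk]

    def zrevrank(self, name, value):
        for rank, (member, _) in enumerate(self._ranked(name)):
            if member == value:
                return rank
        return None


class Leaderboard:
    def __init__(self, client, name="hot_posts"):
        self.client, self.name = client, name

    def visit(self, post_id):
        """Record one view of a post: add 1 to its score."""
        return self.client.zincrby(self.name, 1, post_id)

    def top(self, n):
        """Return the n hottest posts as (post_id, score) pairs."""
        if n <= 0:
            return []
        return self.client.zrevrange(self.name, 0, n - 1, withscores=True)

    def rank(self, post_id):
        """Zero-based rank of a post, hottest first; None if unknown."""
        return self.client.zrevrank(self.name, post_id)
```

Against a real server the same Leaderboard class could be handed a redis-py client instead of InMemoryZSets; in that case members come back as bytes unless the client is created with decode_responses=True.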

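19.4 Performance

When the forum front page is requested, the hottest posts can be fetched straight from redis. Fetching the top M posts with ZREVRANGE costs O(log N + M) and looking up the rank of one post with ZRANK costs O(log N), where N is the number of elements in the skip list and M is the number of elements returned.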

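Chapter 20

Distributed locks

In *nix system programming, when several processes or threads share a resource we normally use the locks the system provides: a mutex between the threads of one process, a semaphore between processes, and so on. The shared resource in these cases is strictly local. If the shared resource lives on the network, a local "lock" no longer helps. Mutually exclusive access to a network resource needs a lock server on the network that grants and reclaims locks. redis can play the role of that lock server. To begin with, redis works as a single process with a single thread, so all requests for a lock are queued and handled one by one, which keeps access to the lock synchronized.

So redis can manage the locks that provide mutual exclusion on a network resource.

We can set a key-value pair on the redis server to represent a mutex. To acquire the lock, the requester must set (SET) the key-value pair; to release it, the holder must delete (DEL) it. Acquiring the lock could look like this pseudocode: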

lock = redis.get("mutex_lock");
if (lock)
    error("apply the lock error.");
else
    -- the lock is free and can be acquired
    redis.set("mutex_lock", "locking");
    do_something();

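This way of acquiring a lock takes several round trips between the client and the redis server. By the time the client decides it may take the lock, another client may already have acquired it, so two clients end up holding the lock at once and mutual exclusion is easily broken. The redis documentation describes some approaches, and articles found online discuss others; many of them mention this problem. What these approaches have in common is that the lock acquisition is split between client and server, which makes consistency problems likely. The best solution is therefore to put the whole "acquire/release lock" logic on the server, and redis Lua scripts are up to the task.

Here is a Lua script that acquires a mutex: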

-- apply for lock
local key = KEYS[1]
local res = redis.call('get', key)
-- the lock is held: the request fails
if res == '0' then
    return -1
-- the lock can be acquired
else
    local setres = redis.call('set', key, 0)
    if setres['ok'] == 'OK' then
        return 0
    end
end
return -1

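When the key does not exist, the get command returns (nil).

To try it out, load the Lua script with: redis-cli script load "$(cat mutex_lock.lua)"

Releasing the lock can be implemented in a Lua script as well.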

-- release lock
local key = KEYS[1]
local setres = redis.call('set', key, 1)
if setres['ok'] == 'OK' then
    return 0
end
return -1

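The Lua scripts above handle basic lock management and keep the lock logic on the server side, which shows how Lua can extend what a redis server does. But this lock management scheme has problems.

20.1 The deadlock problem

The first is a deadlock caused by a client crash. With the approach above, if a client acquires the lock and then cannot release it because it crashed or for some other reason, no other client can acquire the lock, and we have a deadlock.

In general a lock is acquired so that several parties can access (modify) some data exclusively, and access time should be kept short; if locks are held for long, overall system performance inevitably drops. We can set a sufficiently long timeout, and if a holder has not released the lock when the timeout expires, the lock is released automatically.

redis provides TTLs: a key-value pair is evicted automatically after its timeout, and each redis data set keeps a hash table dedicated to key expiry. That leads to the following Lua code: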

-- apply for lock
local key = KEYS[1]
local timeout = KEYS[2]
local res = redis.call('get', key)
-- the lock is held: the request fails
if res == '0' then
    return -1
-- the lock can be acquired
else
    local setres = redis.call('set', key, 0)
    local exp_res = redis.call('pexpire', key, timeout)
    if exp_res == 1 then
        return 0
    end
end
return -1

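This solves the deadlock that arises when a lock holder crashes and the lock cannot be released.

The second is a deadlock caused by a crash of the redis server. When the redis server that manages the locks goes down, clients can neither acquire nor release locks, and a deadlock forms. One solution is to set up a backup redis server to switch to when the master goes down, but that requires the master and the backup to be in sync with no delay.

If synchronization lags, two clients can still end up holding the lock at the same time.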
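The effect of the timeout can be tried out with a small in-memory model of the acquire and release scripts. This is a sketch, not a redis client: LockServer is a hypothetical name, the clock is injectable so expiry can be simulated, and, like the scripts above, release() does not check who holds the lock.

```python
import time


class LockServer:
    """In-memory model of the lock scripts above; not a redis client."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.expires_at = {}  # lock name -> absolute expiry time in seconds

    def _held(self, key):
        deadline = self.expires_at.get(key)
        if deadline is not None and self.clock() >= deadline:
            del self.expires_at[key]  # lazily expire, like a key with a TTL
            return False
        return deadline is not None

    def acquire(self, key, timeout_ms):
        """Return 0 on success, -1 if the lock is held (as the script does)."""
        if self._held(key):
            return -1
        self.expires_at[key] = self.clock() + timeout_ms / 1000.0
        return 0

    def release(self, key):
        """Release unconditionally; the caller is trusted to be the holder."""
        self.expires_at.pop(key, None)
        return 0
```

If the holder crashes, nobody calls release(), but the next acquire() after the deadline succeeds anyway, which is the behaviour the pexpire call buys. Because the whole check-and-set happens in one call on the server side, two clients cannot both see the lock as free.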


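Chapter 21

Message middleware

When I first met linux system programming, I learned that a message queue is one form of IPC. That kind of communication is normally used only between local processes; the shared-memory "lock-free message queue" is a good piece of middleware of that kind (see here). The message queue discussed in this chapter, however, also called message middleware, is normally used in distributed systems.

Message middleware brings with it the notions of producer and consumer. The middleware receives messages from producers, stores them, and forwards them to the right consumers. Producers can publish messages by topic, and consumers can subscribe to messages by topic. A producer simply pushes messages into the queue without waiting for a consumer's reply; a consumer simply takes data out of the queue and processes it. Availability, reliability and similar concerns are left to the middleware.

Put plainly, distributed message middleware is a server on the network into which we can throw data; the middleware then pushes the data out, or others pull it, so it acts as a relay for data. Producers and consumers usually relate in one of two ways: one producer to one consumer, or one producer to many consumers. This article describes three traits of message middleware: decoupling, asynchrony and parallelism; readers can work through them on their own. For business scenarios that do not need a prompt and reliable response, message middleware can greatly increase the throughput of the layers above.

There are some excellent members of the message middleware family, such as RabbitMQ and Jafka/Kafka. redis can also serve as an entry-level message queue. The one-producer-to-one-consumer case can be built on the blocking list operations of redis (blist), and the one-producer-to-many-consumers case on its pub/sub mode. Note that when redis is used as message middleware and a consumer is disconnected from redis for a while, it will not receive the data published during that time; this follows from how pub/sub is implemented. Message middleware in the strict sense has to guarantee that data is delivered reliably.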

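21.1 A distributed message queue

In everyday development, message queues are among the most common tools. On a single machine, processes can communicate through the message queues provided by the system, or through a circular message queue in shared memory. Processes deployed on different machines need a distributed message queue. Consider this scenario:

The producer, the redis-based message queue and the three worker groups are each deployed on a different machine. The producer quickly pushes what it produces (data to be stored, logs, and so on) to the message queue server, and the worker groups can then consume it.

This can be built on the blocking list operations of redis (blist). Here is sample code in C for a producer and a worker group: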

// comm.h
#ifndef COMM_H__
#define COMM_H__

#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    char ip[32];
    uint16_t port;
    char queue_name[256];
} config_t;

// Print the usage line and abort
void Usage(char *program) {
    printf("Usage: %s -h ip -p port -l test\n", program);
    abort();
}

const size_t max_cmd_len = 512;

#endif


The producer code:

// producer.cc
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include "comm.h"
#include "hiredis/hiredis.h"

void test_redis_client()
{
    redisContext *rc = redisConnect("127.0.0.1", 6379);
    if (NULL == rc || rc->err) {
        fprintf(stderr, "error: %s\n", rc ? rc->errstr : "cannot allocate redis context");
        if (rc) redisFree(rc);
        return;
    }
    // set name
    redisReply *reply = (redisReply *)redisCommand(rc, "set name dylan");
    if (reply) {
        printf("%s\n", reply->str);
        freeReplyObject(reply);
    }
    // get name
    reply = (redisReply *)redisCommand(rc, "get name");
    if (reply) {
        printf("%s\n", reply->str);
        freeReplyObject(reply);
    }
    redisFree(rc);
}

int main(int argc, char *argv[]) {
    if (argc < 7) {
        Usage(argv[0]);
        return -1;
    }
    config_t config;
    for (int i = EOF; (i = getopt(argc, argv, "h:p:l:")) != EOF; ) {
        switch (i) {
        case 'h': snprintf(config.ip, sizeof(config.ip), "%s", optarg); break;
        case 'p': config.port = atoi(optarg); break;
        case 'l': snprintf(config.queue_name, sizeof(config.queue_name), "%s", optarg); break;
        default: Usage(argv[0]); return -1;
        }
    }
    redisContext *rc = redisConnect(config.ip, config.port);
    if (NULL == rc || rc->err) {
        fprintf(stderr, "error: %s\n", rc ? rc->errstr : "cannot allocate redis context");
        if (rc) redisFree(rc);
        return -1;
    }
    redisReply *reply = NULL;
    char cmd[max_cmd_len];
    snprintf(cmd, sizeof(cmd), "LPUSH %s task", config.queue_name);
    printf("cmd=%s\n", cmd);
    int count = 100;
    while (count--) {
        reply = (redisReply *)redisCommand(rc, cmd);
        if (reply && reply->type == REDIS_REPLY_INTEGER) {
            // pushed; the integer reply is the new length of the list
        }
        else {
            printf("LPUSH error\n");
        }
        if (reply) freeReplyObject(reply);
    }
    redisFree(rc);
    return 0;
}

The consumer code:

// consumer.cc
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "comm.h"
#include "hiredis/hiredis.h"

int DoLogic(char *data, size_t len);

int main(int argc, char *argv[]) {
    if (argc < 7) {
        Usage(argv[0]);
        return -1;
    }
    config_t config;
    for (int i = EOF; (i = getopt(argc, argv, "h:p:l:")) != EOF; ) {
        switch (i) {
        case 'h': snprintf(config.ip, sizeof(config.ip), "%s", optarg); break;
        case 'p': config.port = atoi(optarg); break;
        case 'l': snprintf(config.queue_name, sizeof(config.queue_name), "%s", optarg); break;
        default: Usage(argv[0]); return -1;
        }
    }
    redisContext *rc = redisConnect(config.ip, config.port);
    if (NULL == rc || rc->err) {
        fprintf(stderr, "error: %s\n", rc ? rc->errstr : "cannot allocate redis context");
        if (rc) redisFree(rc);
        return -1;
    }
    redisReply *reply = NULL;
    char cmd[max_cmd_len];
    // BRPOP takes one or more keys followed by a timeout in seconds
    snprintf(cmd, sizeof(cmd), "BRPOP %s 30", config.queue_name);
    int seq = 0;
    while (true) {
        reply = (redisReply *)redisCommand(rc, cmd);
        if (reply && reply->type == REDIS_REPLY_STRING) {
            DoLogic(reply->str, reply->len);
        }
        else if (reply && reply->type == REDIS_REPLY_ARRAY) {
            // BRPOP replies with [key, value]; print the popped value
            for (size_t i = 0; i + 1 < reply->elements; i += 2) {
                printf("%d->%s\n", seq++, reply->element[i + 1]->str);
            }
        }
        else if (reply && reply->type == REDIS_REPLY_NIL) {
            // timed out with nothing to pop; wait again
        }
        else {
            printf("BRPOP error, reply->type=%d\n", reply ? reply->type : -1);
            if (reply) freeReplyObject(reply);
            break;
        }
        freeReplyObject(reply);
    }
    redisFree(rc);
    return 0;
}

int DoLogic(char *data, size_t len) {
    printf("reply=%s\n", data);
    return 0;
}


Part V

Miscellaneous

Chapter 22

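In-Memory Data Management

22.1 Shared Objects

When the redis server initializes, it creates a number of frequently used string objects up front, which saves redis from creating the same strings over and over while serving requests. The shared objects live in struct sharedObjectsStruct, excerpted below: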

struct sharedObjectsStruct {
    robj *crlf, *ok, *err, *emptybulk, *czero, *cone, *cnegone, *pong, *space,
    .....
};

For example, "\r\n" is used heavily in the redis protocol. These strings are all initialized in the initServer() function.
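The idea can be sketched in a few lines of C. This is a simplified illustration, not the redis source: `sobj_t`, `create_shared_objects()`, `reply_ok()` and `reply_len()` are hypothetical names, and the objects use static storage where redis allocates robj values. The point is that every caller receives the same pre-built object instead of constructing a new string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for robj: a reference count and a string payload. */
typedef struct {
    int refcount;
    const char *ptr;
} sobj_t;

/* A few of the protocol strings that can be prepared once at startup. */
struct shared_objects {
    sobj_t *crlf, *ok, *pong;
};

static struct shared_objects shared;
static sobj_t crlf_obj = {1, "\r\n"};
static sobj_t ok_obj = {1, "+OK\r\n"};
static sobj_t pong_obj = {1, "+PONG\r\n"};

/* Called once during initialization. */
void create_shared_objects(void) {
    shared.crlf = &crlf_obj;
    shared.ok = &ok_obj;
    shared.pong = &pong_obj;
}

/* Every successful command can reply with the same object. */
const sobj_t *reply_ok(void) {
    assert(shared.ok != NULL);  /* create_shared_objects() must run first */
    return shared.ok;
}

/* Length of the reply payload in bytes. */
size_t reply_len(const sobj_t *o) {
    assert(o != NULL);
    return strlen(o->ptr);
}
```

Calling reply_ok() twice returns the same pointer, so no allocation or copy happens on the request path.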

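22.2 Two Memory Allocation Strategies

redis wraps its memory allocation strategy in zmalloc.c. Four allocators can be used: jemalloc, tcmalloc, the malloc that ships with Apple systems, and the malloc that ships with other systems. Whenever one of the first three is available it is used; the last is the fallback of last resort.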
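The selection happens at compile time. Below is a reduced sketch of that kind of preprocessor chain. USE_TCMALLOC and USE_JEMALLOC are the build macros redis uses; `ALLOCATOR_NAME`, `allocator_name()` and `allocator_reports_block_size()` are hypothetical names introduced for this illustration, and no allocator headers are actually included.

```c
#include <assert.h>
#include <string.h>

/* Pick an allocator label from the build configuration. */
#if defined(USE_TCMALLOC)
#define ALLOCATOR_NAME "tcmalloc"
#elif defined(USE_JEMALLOC)
#define ALLOCATOR_NAME "jemalloc"
#elif defined(__APPLE__)
#define ALLOCATOR_NAME "apple-malloc"
#else
#define ALLOCATOR_NAME "libc"  /* the fallback of last resort */
#endif

/* Name of the allocator chosen at compile time. */
const char *allocator_name(void) {
    return ALLOCATOR_NAME;
}

/* The first three allocators can report the size of a block directly;
 * the plain libc fallback is assumed here not to. */
int allocator_reports_block_size(void) {
    const char *name = allocator_name();
    assert(name != NULL);
    return strcmp(name, "libc") != 0;
}
```

Building with -DUSE_JEMALLOC, for example, switches the label without touching any calling code, which is the benefit of hiding the choice behind a wrapper.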

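jemalloc is the allocator that ships with the FreeBSD operating system. It is fast and optimized for multithreaded programs, and both Firefox and Facebook use it. tcmalloc was developed by Google and comes with many tools for testing memory allocation; the Chrome browser and protobuf are said to use it. Both are well known in the industry, and their performance is comparable. redis is an in-memory database with very demanding access-speed requirements, so a good allocator helps improve redis's performance.

This chapter does not go into either allocator in depth.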

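22.3 Memory-Aware Support

What redis calls "memory aware" is the ability to know how much memory it is using in total: redis can obtain its memory usage in real time and so monitor it. The idea is simple: every allocation and every release updates a global memory-usage counter. Let us start with malloc_size(void *ptr); functions of this kind exist purely to help developers monitor memory.

jemalloc, tcmalloc, and the Apple allocator can all report the size of the block a pointer refers to. When none of the three is available, redis falls back on an approximate method of recording the size of the block, a trick similar to the one used by sds strings.

zmalloc() allocates slightly more than requested, reserving room for one integer (a size_t) in front of the block to store the block's size. This is only a fallback, and the "size of the block" it records is not exact: malloc() may actually hand out more memory than was requested, so part of the memory is wasted, and this method cannot account for that wasted space.

Here is the implementation of zmalloc():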

void *zmalloc(size_t size) {
    // reserve a small prefix in front of the block
    void *ptr = malloc(size+PREFIX_SIZE);

    // out of memory
    if (!ptr) zmalloc_oom_handler(size);

    // update the used-memory counter.
    // jemalloc, tcmalloc and the Apple allocator can report the block size directly
#ifdef HAVE_MALLOC_SIZE
    update_zmalloc_stat_alloc(zmalloc_size(ptr));
    return ptr;
    // otherwise redis records the block size itself, in the prefix
#else
    *((size_t*)ptr) = size;
    update_zmalloc_stat_alloc(size+PREFIX_SIZE);
    return (char*)ptr+PREFIX_SIZE;
#endif
}

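The update_zmalloc_stat_alloc() macro updates the memory-usage counter. Because the counter is global, redis protects it against concurrent updates. Some readers may object that the redis server runs as a single process with a single thread, so no such protection should be needed. However, redis does hand some work to background threads (bio.c, for instance, closes files and fsyncs the AOF in the background), so strictly speaking the protection is necessary.

update_zmalloc_stat_alloc() first checks whether zmalloc_thread_safe is 1. zmalloc_thread_safe defaults to 0, in which case no mutual exclusion is applied; if zmalloc_thread_safe is 1, the counter is updated with an atomic operation or under a lock.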

// update the used-memory counter
#define update_zmalloc_stat_alloc(__n) do { \
    size_t _n = (__n); \
    /* round up to a multiple of sizeof(long) */ \
    if (_n&(sizeof(long)-1)) _n += sizeof(long)-(_n&(sizeof(long)-1)); \
    /* if thread safety is enabled, use the thread-safe update */ \
    if (zmalloc_thread_safe) { \
        /* update used_memory with an atomic operation or under a mutex */ \
        update_zmalloc_stat_add(_n); \
    } else { \
        used_memory += _n; \
    } \
} while(0)

That covers allocation; releasing memory is the reverse.
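To see allocation and release side by side, here is a self-contained sketch of the size-prefix fallback. It is a single-threaded illustration, not the redis code: `tracked_malloc()`, `tracked_free()`, `tracked_used_memory()` and `round_up_long()` are hypothetical names, and the atomic or locked update path is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PREFIX_SIZE (sizeof(size_t))

static size_t used_memory = 0;  /* global usage counter (not thread safe here) */

/* Round n up to a multiple of sizeof(long), as the macro above does. */
static size_t round_up_long(size_t n) {
    size_t mask = sizeof(long) - 1;
    return (n & mask) ? n + (sizeof(long) - (n & mask)) : n;
}

/* Allocate size bytes plus a hidden prefix that remembers the size. */
void *tracked_malloc(size_t size) {
    void *real = malloc(size + PREFIX_SIZE);
    if (!real) return NULL;
    *((size_t *)real) = size;
    used_memory += round_up_long(size + PREFIX_SIZE);
    return (char *)real + PREFIX_SIZE;
}

/* Step back to the prefix, read the size, and undo the accounting. */
void tracked_free(void *ptr) {
    if (!ptr) return;
    void *real = (char *)ptr - PREFIX_SIZE;
    size_t size = *((size_t *)real);
    assert(used_memory >= round_up_long(size + PREFIX_SIZE));
    used_memory -= round_up_long(size + PREFIX_SIZE);
    free(real);
}

/* Current value of the usage counter. */
size_t tracked_used_memory(void) {
    return used_memory;
}
```

The counter records what was requested plus the prefix, rounded up. It knows nothing about any extra space the underlying malloc() may have handed out, which is why the text calls this method approximate.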


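22.4 The zmalloc_get_private_dirty() Function

This function came up in the chapter on RDB persistence; this section looks at it in a little more detail.

The operating system maintains a virtual address space for every process. The virtual address space maps onto the physical address space, and addresses that are contiguous in virtual memory are not necessarily contiguous in physical memory.

In Linux programming, a call to fork() creates a child process. The original approach was to copy the parent's physical memory for the child. For efficiency, the copy can instead be deferred until the parent or the child actually writes to memory (copy-on-write), copying only the affected pages. This saves memory and makes fork() faster.

During RDB persistence, the parent process keeps serving requests while the child writes the RDB file. When the child finishes, zmalloc_get_private_dirty() is called to obtain the amount of memory copied on write, which is the memory the child process consumed during the RDB operation.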
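On Linux this figure can be obtained by adding up the Private_Dirty entries in /proc/self/smaps. The sketch below follows that approach under stated assumptions: `private_dirty_kb()` and `private_dirty_bytes()` are hypothetical helper names, not the redis functions, and the path is passed in so the parser can be pointed at any file in the smaps format.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* If line is a "Private_Dirty:" entry, return its value in kB; otherwise -1. */
long private_dirty_kb(const char *line) {
    static const char field[] = "Private_Dirty:";
    assert(line != NULL);
    if (strncmp(line, field, sizeof(field) - 1) != 0) return -1;
    /* strtol skips the padding spaces and stops at the " kB" suffix */
    return strtol(line + sizeof(field) - 1, NULL, 10);
}

/* Sum every Private_Dirty entry in the given smaps-style file, in bytes.
 * Returns 0 if the file cannot be opened. */
size_t private_dirty_bytes(const char *path) {
    char line[1024];
    size_t total = 0;
    FILE *fp = fopen(path, "r");
    if (!fp) return 0;
    while (fgets(line, sizeof(line), fp) != NULL) {
        long kb = private_dirty_kb(line);
        if (kb > 0) total += (size_t)kb * 1024;
    }
    fclose(fp);
    return total;
}
```

Calling private_dirty_bytes("/proc/self/smaps") in a forked child reports the pages that were dirtied privately since the fork, which corresponds to the copy-on-write cost discussed above.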

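22.5 Summary

redis is an in-memory database and is careful about how it uses memory.

One suggestion: as discussed earlier, a redis server holds multiple databases. When choosing among them, store data in different databases according to the business it belongs to. Concentrating all the data in one or two databases can reduce query efficiency.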


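22.6 redis Logging and Assertions

In the Linux world, the handiest debugging tool is not gdb but logging and printf.

Logging is ubiquitous in software systems, and one of its key roles is locating errors: when something goes wrong, the log is the first place to look, and reading it pinpoints the problem quickly. The logging module in redis is fairly simple; calls to redisLog() appear all over the redis source.

Logs are usually divided into levels. redis has four log levels, plus a REDIS_LOG_RAW modifier, defined in redis.h: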

/* Log levels */
#define REDIS_DEBUG 0       // debug level; produces the most log output
#define REDIS_VERBOSE 1
#define REDIS_NOTICE 2
#define REDIS_WARNING 3
#define REDIS_LOG_RAW (1<<10) /* Modifier to log without timestamp */
#define REDIS_DEFAULT_VERBOSITY REDIS_NOTICE

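In the server configuration structure, struct redisServer.verbosity sets the log level. For example, with the level set to REDIS_NOTICE, log messages at REDIS_VERBOSE and REDIS_DEBUG are not printed.

The lower the numeric value of the configured level, the more verbose the logging and the more log output is produced. Before a product goes live, developers typically set the level to its lowest value to make it easier to find and locate latent problems. Once it is live, the level can be raised to cut down on debug output. If logging is too verbose in production, the sheer volume can affect the service, because writing a log entry is a file write, and system calls take time.

A log aims to record what happened, where, and at what moment, so that when a problem arises the scene can be reconstructed and the cause located quickly. "At what moment" means adding a timestamp; "where" means the position in the program, that is, the source file, line number, function, and so on; "what happened" means recording key data.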

// redis logging function: writes the given data to the log file; used much like printf
void redisLog(int level, const char *fmt, ...) {
    va_list ap;
    char msg[REDIS_MAX_LOGMSG_LEN];

    // return immediately if the level is below the configured verbosity
    if ((level&0xff) < server.verbosity) return;

    va_start(ap, fmt);
    vsnprintf(msg, sizeof(msg), fmt, ap);
    va_end(ap);

    // redisLogRaw() adds the timestamp and process id, then writes to the log file
    redisLogRaw(level,msg);
}
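The level check and the raw modifier can be tried out in isolation. The following is a minimal sketch, not the redis implementation: the `LOG_*` constants mirror the values listed above, while `log_enabled()` and `format_log_line()` are hypothetical helpers that write into a caller-supplied buffer instead of a log file. The exact line layout is an assumption made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

#define LOG_DEBUG   0
#define LOG_VERBOSE 1
#define LOG_NOTICE  2
#define LOG_WARNING 3
#define LOG_RAW     (1 << 10)  /* modifier: log without a timestamp */

/* A message is written only if its level is at or above the verbosity. */
int log_enabled(int level, int verbosity) {
    return (level & 0xff) >= verbosity;
}

/* Build one log line in buf. Returns 0 if the message is filtered out,
 * otherwise the value returned by snprintf. */
int format_log_line(char *buf, size_t len, int level, int verbosity,
                    long pid, const char *msg) {
    static const char marks[] = ".-*#";  /* one marker per level */
    char when[64];
    int n;
    assert(buf != NULL && len > 0 && msg != NULL);
    assert((level & 0xff) <= LOG_WARNING);
    buf[0] = '\0';
    if (!log_enabled(level, verbosity)) return 0;
    if (level & LOG_RAW) {
        /* raw mode: the message only, no pid and no timestamp */
        n = snprintf(buf, len, "%s", msg);
    } else {
        time_t now = time(NULL);
        struct tm *tm_now = localtime(&now);
        if (tm_now == NULL ||
            strftime(when, sizeof(when), "%d %b %H:%M:%S", tm_now) == 0)
            when[0] = '\0';
        n = snprintf(buf, len, "[%ld] %s %c %s", pid, when,
                     marks[level & 0xff], msg);
    }
    return n < 0 ? 0 : n;
}
```

With the verbosity set to LOG_NOTICE, a LOG_DEBUG message produces nothing, while a LOG_WARNING message is formatted with the process id and a timestamp in front.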

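22.7 redis Assertions

Why are assertions needed? When you believe something cannot happen under normal circumstances, it is better to terminate the task as early as possible than to catch the error and attempt a rescue. Likewise in C++, using try...catch() can muddle a program's logic and even make its behavior unpredictable, so use assertions boldly.

redis not only implements assertions; it also prints key information when one fails.

redis.h defines two assertion macros: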

#define redisAssertWithInfo(_c,_o,_e) ((_e)?(void)0 : (_redisAssertWithInfo(_c,_o,#_e,__FILE__,__LINE__),_exit(1)))
#define redisAssert(_e) ((_e)?(void)0 : (_redisAssert(#_e,__FILE__,__LINE__),_exit(1)))

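If the asserted expression is true, the macro does nothing; if it is false, key information is printed and the process exits.

The _redisAssert() function records the failed expression, the file name, and the line number: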

void _redisAssert(char *estr, char *file, int line) {
    // write the bug report header to the log
    bugReportStart();

    // log the file name, line number and failed expression
    redisLog(REDIS_WARNING,"=== ASSERTION FAILED ===");
    redisLog(REDIS_WARNING,"==> %s:%d '%s' is not true",file,line,estr);

    // when available, save the expression, file name and line number so they
    // can be examined after the crash (for example with gdb on the core file)
#ifdef HAVE_BACKTRACE
    server.assert_failed = estr;
    server.assert_file = file;
    server.assert_line = line;
    redisLog(REDIS_WARNING,"(forcing SIGSEGV to print the bug report.)");
#endif

    // force a segmentation fault: the invalid memory access raises SIGSEGV,
    // which produces a core dump for debugging after the crash
    *((char*)-1) = 'x';
}

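One statement here is rather interesting: *((char*)-1) = 'x';

(char*)-1 is a pointer whose address value is -1. The memory it points to is certainly invalid, and writing to it triggers SIGSEGV; after the process dies, a core dump can be produced (if core dumps are enabled) for debugging. With gdb, the executable, and the core dump, the problem can be located quickly even though the process has already crashed.

The _redisAssertWithInfo() function additionally prints information about the client the redis server is currently serving and about a key redis object. See the source for details; they are not covered here.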
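The macro pattern, stringifying the expression and capturing the file and line, is easy to reproduce. The sketch below is an illustration with hypothetical names (`CHECK`, `record_assert_failure()`, `last_assert`, `last_failure_is()`). Unlike redis, it only records the failure and returns, so that the effect can be observed without crashing the process; a real assertion would terminate here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* What the most recent failed check looked like. */
typedef struct {
    const char *expr;  /* text of the failed expression */
    const char *file;  /* source file of the check */
    int line;          /* line number of the check */
    int failures;      /* number of failed checks so far */
} assert_record_t;

static assert_record_t last_assert;

/* Record and report a failed check. */
void record_assert_failure(const char *expr, const char *file, int line) {
    assert(expr != NULL && file != NULL);
    last_assert.expr = expr;
    last_assert.file = file;
    last_assert.line = line;
    last_assert.failures++;
    fprintf(stderr, "==> %s:%d '%s' is not true\n", file, line, expr);
}

/* #_e turns the expression into a string; __FILE__ and __LINE__ expand
 * at the place where the macro is used. */
#define CHECK(_e) ((_e) ? (void)0 : record_assert_failure(#_e, __FILE__, __LINE__))

/* Convenience helper: does the last failure match this expression text? */
int last_failure_is(const char *expr) {
    return last_assert.expr != NULL && strcmp(last_assert.expr, expr) == 0;
}
```

Because the macro expands at the call site, the recorded file and line identify the failing check itself rather than the helper that reports it.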


Chapter 23

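redis and memcache

23.1 Single Process, Single Thread vs. Single Process, Multiple Threads

redis works as a single process with a single thread: all requests are queued and handled one at a time, so the cached data needs no mutual exclusion. memcached works as a single process with multiple threads: when a request arrives, the main thread dispatches it to one of several worker threads, so access to the data must be synchronized.

In raw request-handling capacity the two are on a par. In theory, on a machine that supports multiple threads, memcached's get throughput will be higher than redis's.

So which is better, multithreading or a single thread? Multithreading generally makes program logic more complex: synchronization and mutual exclusion between threads have to be considered, which to some degree lowers the throughput of each thread, because more time is spent waiting on locks. A common recommendation is to design for horizontal scalability and run one thread per process. There is no dogma here, and nothing is black and white; it depends on which approach solves which problem.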

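23.2 Rich vs. Simple Data Structures

redis has a rich set of native data structures, including strings, lists, sets, sorted sets, hashes, and bit arrays, so it suits more scenarios and can be treated as a data-structure database. memcached is weaker in this respect and can only do simple key/value storage.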

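23.3 Other Differences

Beyond the above, compared with memcached:

1. redis natively supports master-slave replication, allowing one master with multiple slaves, which improves availability.
2. redis natively supports two persistence modes, RDB and AOF. The former writes the whole in-memory data set to disk; the latter writes each update to disk, much like the binlog in mysql. memcached has no native persistence.
3. redis supports transactions.
4. redis supports expiry times on key-value pairs.
5. redis 3.0 has begun to support redis cluster.

All things considered, redis is a lot more fun.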

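23.4 Performance Testing

I was once asked which is faster, redis or memcached. For a fair test the conditions must be identical: the same test machine, and clients that are the same in every respect except the logic that builds the protocol messages.

Test environment: Ubuntu; Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz, 4 cores; memcache 1.4.14; redis 3.1.99.

1. The larger the payload, the more read/write performance suffers, for redis especially.
2. redis and memcache (with one worker thread) are about equal in read/write performance, with redis slightly ahead.
3. Once memcache's worker thread count reaches a certain number, read/write performance drops.
4. During testing it proved impossible to saturate the CPU no matter what; in such cases the bottleneck is usually not the CPU but network I/O.

By default, memcached limits key length to 250 bytes and stored values to 1 MB. The -I option changes the default slab page size, and with it the limit on value size, but the memcached project advises against doing so.

There is no black-and-white answer, only which tool fits which scenario better.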