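Exception Handling in the Lua Interpreter

1. lua_pcall in the API

When C code interoperates with Lua, Lua provides three API functions for calling functions written in Lua from C: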


void lua_call  (lua_State *L, int nargs, int nresults);
int  lua_pcall (lua_State *L, int nargs, int nresults, int errfunc);
int  lua_cpcall(lua_State *L, lua_CFunction func, void *ud);

Of these, lua_cpcall merely adds a user-data argument on top of lua_pcall and can be regarded as a special use of lua_pcall:

/* luaconf.h */
#define lua_cpcall(L,f,u)  \
	(lua_pushcfunction(L, (f)), \
	 lua_pushlightuserdata(L,(u)), \
	 lua_pcall(L,1,0,0))

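For that reason this article does not discuss lua_cpcall separately.

lua_call and lua_pcall differ considerably in implementation, and the difference goes to the heart of how the Lua interpreter handles errors. Since lua_call involves no exception handling, this article concentrates on lua_pcall.

First, the definition of lua_pcall: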

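/* lua.h */
#define lua_pcall(L,n,r,f)	lua_pcallk(L, (n), (r), (f), 0, NULL)

As we can see, lua_pcall is only a macro; the function actually called inside Lua is lua_pcallk. With the variable checks, assignments, and other distracting statements removed, lua_pcallk does the following: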

/* lapi.c */
LUA_API int lua_pcallk (lua_State *L, int nargs, int nresults, int errfunc,
                        int ctx, lua_CFunction k) {
	// ……
	lua_lock(L);
	// ……
	if (k == NULL || L->nny > 0) {  /* no continuation or no yieldable? */
		// ……
		status = luaD_pcall(L, f_call, &c, savestack(L, c.func), func);
	}
	else {  /* prepare continuation (call is already protected by 'resume') */
		CallInfo *ci = L->ci;
		ci->u.c.k = k;  /* save continuation */
		// ……
		luaD_call(L, c.func, nresults, 1);  /* do the call */
		// ……
		status = LUA_OK;  /* if it is here, there were no errors */
	}
	// ……
	lua_unlock(L);
	return status;
}

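When the caller supplies a continuation function (lua_CFunction k) and the thread is yieldable, lua_pcallk calls luaD_call directly; as the comment in the source says, that call is already protected by 'resume'. Note that k is a continuation used to carry on after a yield, not an error handler. When k is NULL (or the thread cannot yield), lua_pcallk goes through luaD_pcall, which wraps Lua's own exception handling code.

User-level code rarely supplies k. The lua_pcall macro always passes NULL for it, so an ordinary lua_pcall call takes the luaD_pcall path and relies on the interpreter's built-in exception handling. A typical call looks like this: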

	lua_getglobal(p_l, "rmac_output");
	lua_pushnumber(p_l, 0);
	lua_pushnumber(p_l, 0);
	lua_pushnumber(p_l, 0);
	lua_pushnumber(p_l, 1);
	lua_pushnumber(p_l, 27);
	lua_pushnumber(p_l, 1);
	lua_pushnumber(p_l, 38);
	if( lua_pcall(p_l, 7, 1, 0) != 0 )
	{
		// ……
	}

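Personally, I also think that unless there is a real need, it is best to rely on the exception handling Lua itself provides: it fits the interpreter better and recovers more effectively after an error.

So let us look at luaD_pcall, which is actually very simple: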

/* ldo.c */
int luaD_pcall (lua_State *L, Pfunc func, void *u,
                ptrdiff_t old_top, ptrdiff_t ef) {
	// ……
	status = luaD_rawrunprotected(L, func, u);
	if (status != LUA_OK) {  /* an error occurred? */
		// restore the saved context and tidy up the stack here
	}
	// ……
	return status;
}

Next comes the most important function in the whole mechanism, luaD_rawrunprotected, which is where Lua's exception handling code lives:

/* ldo.c */
int luaD_rawrunprotected (lua_State *L, Pfunc f, void *ud) {
	// ……
	struct lua_longjmp lj;
	// ……
	L->errorJmp = &lj;
	LUAI_TRY(L, &lj,
	 	 (*f)(L, ud);
	);
	// ……
	return lj.status;
}

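2. setjmp and longjmp

Standard C has no exception mechanism like try {……} catch {……} in C++. To implement exception handling in a C program, one usually turns to two interfaces from the standard header <setjmp.h>, setjmp and longjmp. In fact, in the library bundled with the CodeSourcery toolchain we use, <setjmp.h> exports only these two.

Used as a pair, they achieve an effect similar to try and catch in higher-level languages. For example: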

#include <stdio.h>
#include <unistd.h>
#include <setjmp.h>
 
jmp_buf mark;
 
void test_jmp_fun ( void ){
	int count = 0;
	while( 1 ){
		printf("1st Loop num is: %d\r\n", count);
		sleep (2); // simulate a blocking operation
		if( count >= 5 ) {// treat count reaching 5 as the error condition
			longjmp(mark, -1);
		}
		count++;
	}
}
 
int main ( int argc, char* argv[] ){
	if( setjmp(mark) == 0 ) { /* try */
		test_jmp_fun();
	}
	else {/* catch */
		int count = 0;
		printf("Catch error!\r\n");
		while( 1 ){
			printf("2nd Loop num is: %d\r\n", count);
			sleep (2);
			count++;
		}
	}
}

The program above prints:

1st Loop num is: 0
1st Loop num is: 1
1st Loop num is: 2
1st Loop num is: 3
1st Loop num is: 4
1st Loop num is: 5
Catch error!
2nd Loop num is: 0
2nd Loop num is: 1
2nd Loop num is: 2
……

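Also note that setjmp and longjmp can only jump within a single execution stack, so take particular care in multithreaded programs: a jump between two threads is likely to crash the program.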
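The demo above never leaves its catch branch, so here is a smaller variant that returns to its caller and shows how the value passed to longjmp becomes the return value of setjmp. It is only a sketch: the names try_div and checked_div and the error code ERR_DIV_ZERO are made up for illustration. setjmp is used as the controlling expression of a switch, which is one of the contexts the C standard explicitly allows.

```c
#include <setjmp.h>

/* Hypothetical error codes for this sketch */
enum { ERR_NONE = 0, ERR_DIV_ZERO = 2 };

static jmp_buf recover_point;

/* "throw": jump back to the most recent setjmp with a non-zero code */
static int checked_div(int a, int b) {
    if (b == 0)
        longjmp(recover_point, ERR_DIV_ZERO);
    return a / b;
}

/* "try/catch": stores the quotient and returns ERR_NONE, or returns the code */
int try_div(int a, int b, int *out) {
    switch (setjmp(recover_point)) {
    case 0:                       /* try: first return from setjmp */
        *out = checked_div(a, b);
        return ERR_NONE;
    case ERR_DIV_ZERO:            /* catch: we arrived here via longjmp */
        return ERR_DIV_ZERO;
    default:                      /* any other code is unexpected here */
        return -1;
    }
}
```

Calling try_div(6, 3, &q) stores 2 in q and returns ERR_NONE, while try_div(1, 0, &q) returns ERR_DIV_ZERO and leaves q untouched.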

3. LUAI_TRY and setjmp

Now let us look back at luaD_rawrunprotected. It contains a macro named LUAI_TRY, defined as:

/* ldo.c */
#define LUAI_TRY(L,c,a)	if (setjmp((c)->b) == 0) { a }

Expanded inside luaD_rawrunprotected, the macro gives:

/* ldo.c */
int luaD_rawrunprotected (lua_State *L, Pfunc f, void *ud) {
	// ……
	struct lua_longjmp lj;
	// ……
	L->errorJmp = &lj;
	if( setjmp((&lj)->b) == 0) {
	 	 (*f)(L, ud);
	};
	// ……
	return lj.status;
}

The code above also refers to a data structure we have not looked at yet, struct lua_longjmp:

/* ldo.c */
#define luai_jmpbuf		jmp_buf
// ……
/* chain list of long jump buffers */
struct lua_longjmp {
	struct lua_longjmp *previous;
	luai_jmpbuf b;
	volatile int status;  /* error code */
};

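At this point, apart from the statement "(*f)(L, ud);", the path taken when C calls lua_pcall is mostly clear. We can conclude that Lua calls setjmp to save the current execution context and then calls a function that starts interpreting the Lua script.

We can further infer that if anything goes wrong during interpretation, longjmp is called to throw the exception, and control returns straight to the setjmp site.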
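To see why struct lua_longjmp carries a previous pointer, it helps to build a miniature of the same arrangement. The sketch below is not Lua's code: mini_state, mini_longjmp, mini_throw, and mini_runprotected are hypothetical stand-ins, and the two steps that save and restore the old recovery point are my assumption about what hides behind the elided lines of luaD_rawrunprotected. It shows that protected calls can nest, with each error landing at the innermost recovery point.

```c
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical miniature of struct lua_longjmp */
struct mini_longjmp {
    struct mini_longjmp *previous;  /* enclosing recovery point */
    jmp_buf b;
    volatile int status;            /* error code */
};

/* Hypothetical miniature of lua_State */
struct mini_state {
    struct mini_longjmp *errorJmp;  /* innermost active recovery point */
};

typedef void (*mini_pfunc)(struct mini_state *S, void *ud);

/* Record the code and jump to the innermost recovery point */
void mini_throw(struct mini_state *S, int errcode) {
    S->errorJmp->status = errcode;
    longjmp(S->errorJmp->b, 1);
}

/* Run f in protected mode; returns 0 or the thrown code */
int mini_runprotected(struct mini_state *S, mini_pfunc f, void *ud) {
    struct mini_longjmp lj;
    lj.status = 0;
    lj.previous = S->errorJmp;      /* chain the new recovery point */
    S->errorJmp = &lj;
    if (setjmp(lj.b) == 0) {
        f(S, ud);
    }
    S->errorJmp = lj.previous;      /* restore the old recovery point */
    return lj.status;
}

/* Body that throws the code stored in *ud */
static void raise_code(struct mini_state *S, void *ud) {
    mini_throw(S, *(int *)ud);
}

struct nested_ud { int code; int inner_status; };

/* Body that runs a nested protected call and absorbs its error */
static void nested_body(struct mini_state *S, void *ud) {
    struct nested_ud *n = (struct nested_ud *)ud;
    n->inner_status = mini_runprotected(S, raise_code, &n->code);
}

/* Status of a single protected call whose body throws `code` */
int demo_raise(int code) {
    struct mini_state S = { NULL };
    return mini_runprotected(&S, raise_code, &code);
}

/* Status of the outer call; the inner call's status goes to *inner_status */
int demo_nested(int code, int *inner_status) {
    struct mini_state S = { NULL };
    struct nested_ud n = { code, -1 };
    int outer = mini_runprotected(&S, nested_body, &n);
    *inner_status = n.inner_status;
    return outer;
}
```

demo_raise(4) returns 4. demo_nested(4, &inner) returns 0 and sets inner to 4, because the error is caught by the inner call and never reaches the outer one.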

Next we locate Lua's bytecode interpreter function. Tracing back along the call path leads to "f_call", the second argument passed to luaD_pcall, which is very simple:

/* lapi.c */
static void f_call (lua_State *L, void *ud) {
	struct CallS *c = cast(struct CallS *, ud);
	luaD_call(L, c->func, c->nresults, 0);
}

With its various conditional checks stripped out, luaD_call is:

/* ldo.c */
void luaD_call (lua_State *L, StkId func, int nResults, int allowyield) {
	// ……
	luaV_execute(L);  /* call it */
	// ……
}

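So the "(*f)(L, ud);" under LUAI_TRY in luaD_rawrunprotected is ultimately what runs Lua's bytecode interpreter.

4. Where did longjmp go?

Taking the out-of-memory exception I ran into recently as an example, let us find out how Lua throws an exception when memory runs out.

Lua scripts use tables all the time, and using a table always involves allocating and freeing memory. Without further ado, here is the code, starting in the luaV_execute function we have already located: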

/* lvm.c */
void luaV_execute (lua_State *L) {
	// ……
	vmdispatch (GET_OPCODE(i)) {
		// ……
		vmcase(OP_NEWTABLE,
			// ……
			Table *t = luaH_new(L);
			// ……
      )
	// ……
	}
	// ……
}

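The two macros here, vmdispatch and vmcase, are really just switch and case. The interpreter translates a script into a sequence of bytecode instructions, and creating a table is the OP_NEWTABLE instruction, which calls luaH_new.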
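To make the switch/case reading concrete, here is a toy dispatch loop written with the same two macros. The macro bodies are how I recall them from lvm.c, so treat them as an assumption. The opcodes TOY_PUSH1, TOY_DOUBLE, and TOY_HALT and the functions toy_execute and toy_demo are invented for this sketch and have nothing to do with Lua's real instruction set.

```c
#include <stddef.h>

/* Assumed to match the definitions in lvm.c */
#define vmdispatch(o)   switch (o)
#define vmcase(l, b)    case l: { b } break;

/* Hypothetical opcodes for a toy accumulator machine (not Lua's) */
enum toy_op { TOY_PUSH1, TOY_DOUBLE, TOY_HALT };

/* Execute up to n instructions and return the accumulator */
int toy_execute(const enum toy_op *code, size_t n) {
    int acc = 0;
    size_t pc;
    for (pc = 0; pc < n; pc++) {
        vmdispatch (code[pc]) {
            vmcase(TOY_PUSH1,
                acc += 1;
            )
            vmcase(TOY_DOUBLE,
                acc *= 2;
            )
            vmcase(TOY_HALT,
                return acc;
            )
        }
    }
    return acc;
}

/* Runs a fixed program; the instruction after TOY_HALT is never reached */
int toy_demo(void) {
    static const enum toy_op prog[] = {
        TOY_PUSH1, TOY_DOUBLE, TOY_PUSH1, TOY_DOUBLE, TOY_HALT, TOY_PUSH1
    };
    return toy_execute(prog, sizeof prog / sizeof prog[0]);
}
```

toy_demo() steps the accumulator through 1, 2, 3, 6 and returns 6 at TOY_HALT.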

/* ltable.c */
Table *luaH_new (lua_State *L) {
	Table *t = &luaC_newobj(L, LUA_TTABLE, sizeof(Table), NULL, 0)->h;
	// ……
	return t;
}

Continuing the search for the memory operation:

/* lgc.c */
GCObject *luaC_newobj (lua_State *L, int tt, size_t sz, GCObject **list,
                       int offset) {
	// ……
	char *raw = cast(char *, luaM_newobject(L, novariant(tt), sz));
	// ……
}

/* lmem.h */
#define luaM_newobject(L,tag,s)	luaM_realloc_(L, NULL, tag, (s))

At last we reach Lua's memory function:

/* lmem.c */
void *luaM_realloc_ (lua_State *L, void *block, size_t osize, size_t nsize) {
	void *newblock;
 
	// ……
	newblock = (*g->frealloc)(g->ud, block, osize, nsize);
	if (newblock == NULL && nsize > 0) {
		// ……
		luaC_fullgc(L, 1);  /* try to free some memory... */
		newblock = (*g->frealloc)(g->ud, block, osize, nsize); /* try again */
		if (newblock == NULL)
			luaD_throw(L, LUA_ERRMEM);
	}
	// ……
	return newblock;
}

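Leaving (*g->frealloc) aside for now, the logic of this function is clear: it first requests the memory; if that fails (returns NULL), it calls luaC_fullgc to run a garbage collection and then requests the memory again; if that fails as well, it throws an exception with luaD_throw: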

/* ldo.c */
l_noret luaD_throw (lua_State *L, int errcode) {
	if (L->errorJmp) {  /* thread has an error handler? */
		L->errorJmp->status = errcode;  /* set status */
		LUAI_THROW(L, L->errorJmp);  /* jump to it */
	}
	else {  /* thread has no error handler */
		L->status = cast_byte(errcode);  /* mark it as dead */
		if (G(L)->mainthread->errorJmp) {  /* main thread has a handler? */
			/* copy error obj. */
			setobjs2s(L, G(L)->mainthread->top++, L->top - 1);
			/* re-throw in main thread */
			luaD_throw(G(L)->mainthread, errcode);  
		}
		else {  /* no handler at all; abort */
			if (G(L)->panic) {  /* panic function? */
				lua_unlock(L);
				G(L)->panic(L);  /* call it (last chance to jump out) */
			}
			abort();
		}
	}
}
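The allocate, collect, retry, throw sequence of luaM_realloc_ can be tried out in isolation. The following is a rough model, not Lua's code: demo_frealloc is a hypothetical allocator that can be told to fail a number of times, a counter stands in for luaC_fullgc, and a plain longjmp stands in for luaD_throw. The state is kept in a static object so that its values are well defined after the jump.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

enum { DEMO_OK = 0, DEMO_ERRMEM = 4 };   /* 4 mirrors LUA_ERRMEM */

/* Hypothetical allocator state */
struct demo_mem {
    int fail_left;      /* how many upcoming requests should fail */
    int gc_runs;        /* how often the "collector" was invoked */
    jmp_buf on_error;   /* recovery point */
};

static struct demo_mem mem;   /* static, so values survive longjmp */

/* Underlying allocator that fails on demand */
static void *demo_frealloc(struct demo_mem *m, void *block, size_t nsize) {
    if (nsize == 0) { free(block); return NULL; }
    if (m->fail_left > 0) { m->fail_left--; return NULL; }
    return realloc(block, nsize);
}

/* Same shape as luaM_realloc_: allocate, collect and retry once, else throw */
static void *demo_realloc(struct demo_mem *m, void *block, size_t nsize) {
    void *newblock = demo_frealloc(m, block, nsize);
    if (newblock == NULL && nsize > 0) {
        m->gc_runs++;                            /* stands in for luaC_fullgc */
        newblock = demo_frealloc(m, block, nsize);
        if (newblock == NULL)
            longjmp(m->on_error, DEMO_ERRMEM);   /* stands in for luaD_throw */
    }
    return newblock;
}

/* Try one allocation after arranging for `failures` failed requests */
int demo_try_alloc(int failures) {
    mem.fail_left = failures;
    mem.gc_runs = 0;
    switch (setjmp(mem.on_error)) {
    case 0: {
        void *p = demo_realloc(&mem, NULL, 64);
        demo_realloc(&mem, p, 0);                /* release the block again */
        return DEMO_OK;
    }
    default:
        return DEMO_ERRMEM;
    }
}

/* Number of simulated collections during the last demo_try_alloc call */
int demo_gc_runs(void) {
    return mem.gc_runs;
}
```

With one simulated failure the call still succeeds after a single collection; with two, the retry fails as well and the error code comes back through the recovery point.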

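Opening the definition of LUAI_THROW, we finally find longjmp:

/* ldo.c */
#define LUAI_THROW(L,c)	longjmp((c)->b, 1)

That leaves only (*g->frealloc) unexplained. Before a C program can use Lua it has to create a state, typically by calling the API function "luaL_newstate", which is implemented like this: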

/* lauxlib.c */
LUALIB_API lua_State *luaL_newstate (void) {
	lua_State *L = lua_newstate(l_alloc, NULL);
	if (L) lua_atpanic(L, &panic);
	return L;
}

/* lstate.c */
LUA_API lua_State *lua_newstate (lua_Alloc f, void *ud) {
	// ……
	g->frealloc = f;
	// ……
}

The header lua.h contains this definition:

/* lua.h */
/*
** prototype for memory-allocation functions
*/
typedef void * (*lua_Alloc) (void *ud, void *ptr, size_t osize, size_t nsize);

So (*g->frealloc) is the l_alloc function:

/* lauxlib.c */
static void *l_alloc (void *ud, void *ptr, size_t osize, size_t nsize) {
	 (void)ud; (void)osize;  /* not used */
	if (nsize == 0) {
		free(ptr);
		return NULL;
	}
	else
		return realloc(ptr, nsize);
}

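Lua implements free, malloc, and realloc in one function and tells them apart by nsize: when nsize is 0 the block is freed; otherwise realloc is called, which behaves like malloc when ptr is NULL.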
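Because the allocator is just a function with the lua_Alloc signature plus a ud pointer, it is easy to substitute one of your own. As an illustration, here is a sketch of an allocator that enforces a memory budget. The struct mem_budget and the function budget_alloc are hypothetical names. Note that osize is only trusted when ptr is non-NULL, since luaM_newobject passes a type tag in that position for fresh allocations.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping passed through the `ud` pointer */
struct mem_budget {
    size_t used;    /* bytes currently handed out */
    size_t limit;   /* maximum bytes allowed */
};

/* Same signature as lua_Alloc */
void *budget_alloc(void *ud, void *ptr, size_t osize, size_t nsize) {
    struct mem_budget *mb = (struct mem_budget *)ud;
    size_t old = (ptr != NULL) ? osize : 0;  /* osize is a tag when ptr is NULL */
    void *newptr;

    if (nsize == 0) {                        /* free */
        free(ptr);
        mb->used -= old;
        return NULL;
    }
    /* growing past the budget: report failure like a failed realloc */
    if (nsize > old && nsize - old > mb->limit - mb->used)
        return NULL;

    newptr = realloc(ptr, nsize);            /* malloc when ptr is NULL */
    if (newptr != NULL)
        mb->used = mb->used - old + nsize;
    return newptr;
}
```

A pointer to such a function and a pointer to its mem_budget would be passed as the two arguments of lua_newstate, in the same way luaL_newstate passes l_alloc and NULL. Once the budget is exhausted, the NULL return sends luaM_realloc_ down the path described in section 4.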

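5. How an exception is thrown

Although luaD_throw is thoroughly commented, understanding its logic first requires two data structures, L and G(L). To keep coroutines platform independent, Lua maintains inside the interpreter a mechanism that resembles operating-system multithreading:

Lua keeps one data structure (L) for each coroutine, plus one global structure (G) that holds shared data. G has a field named "mainthread" that points to the L of the main coroutine, and every L has a field (l_G) that points back to the global structure G.

The definition of the G(L) macro shows this:

/* lstate.h */
#define G(L)	(L->l_G)

In other words, when there is only one coroutine, as in the structure described above, G(L)->mainthread and L are the same. With that settled, we open the source of luaD_rawrunprotected once more: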

/* ldo.c */
int luaD_rawrunprotected (lua_State *L, Pfunc f, void *ud) {
	// ……
	struct lua_longjmp lj;
	// ……
	L->errorJmp = &lj;
	if( setjmp((&lj)->b) == 0) {
	 	 (*f)(L, ud);
	};
	// ……
	return lj.status;
}

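The variable lj holds the saved-environment buffer lj.b (luai_jmpbuf b), so expanding "LUAI_THROW(L, L->errorJmp);" gives:

longjmp((L->errorJmp)->b, 1)

which pairs exactly with the earlier setjmp to form one exception-handling construct.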

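To summarize, luaD_throw throws an exception as follows:

(1) Check whether the current coroutine has a recovery point; if it does, long-jump to it.
(2) If not, check whether the main coroutine has a recovery point; if it does, long-jump to it.
(3) If no recovery point is found at all, call the panic function if one is set, then terminate the program (abort()).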
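The three steps can be written down as a small decision function over a miniature of the L and G structures. This is only a model of the control flow: toy_thread, toy_global, toy_route, and the throw_route values are invented names, and nothing here actually jumps or aborts.

```c
#include <stddef.h>

struct toy_global;

/* Hypothetical miniature of lua_State: a recovery point and a link to G */
struct toy_thread {
    void *errorJmp;
    struct toy_global *l_G;
};

/* Hypothetical miniature of global_State */
struct toy_global {
    struct toy_thread *mainthread;
    void (*panic)(void);
};

enum throw_route {
    ROUTE_OWN_HANDLER,       /* step 1: jump within the current coroutine */
    ROUTE_MAIN_HANDLER,      /* step 2: re-throw in the main coroutine */
    ROUTE_PANIC_THEN_ABORT,  /* step 3 with a panic function installed */
    ROUTE_ABORT              /* step 3 without one */
};

/* Decide where an error raised in coroutine L would end up */
enum throw_route toy_route(const struct toy_thread *L) {
    if (L->errorJmp != NULL)
        return ROUTE_OWN_HANDLER;
    if (L->l_G->mainthread->errorJmp != NULL)
        return ROUTE_MAIN_HANDLER;
    if (L->l_G->panic != NULL)
        return ROUTE_PANIC_THEN_ABORT;
    return ROUTE_ABORT;
}

static void toy_panic(void) { }

/* Build a main coroutine plus one extra coroutine and route an error */
enum throw_route toy_route_demo(int co_has, int main_has, int has_panic) {
    static int marker;                 /* any non-NULL address will do */
    struct toy_global g;
    struct toy_thread mainth, co;
    g.mainthread = &mainth;
    g.panic = has_panic ? toy_panic : NULL;
    mainth.errorJmp = main_has ? &marker : NULL;
    mainth.l_G = &g;
    co.errorJmp = co_has ? &marker : NULL;
    co.l_G = &g;
    return toy_route(&co);
}
```

For example, toy_route_demo(0, 1, 0) yields ROUTE_MAIN_HANDLER: the coroutine has no recovery point of its own, so the error is handed to the main coroutine.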

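6. What happens after the exception returns?

Again we take out-of-memory as the example.

When Lua throws the out-of-memory exception with luaD_throw in luaM_realloc_ (see section 4), it passes an error code as the second argument: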

/* lmem.c */
void *luaM_realloc_ (lua_State *L, void *block, size_t osize, size_t nsize) {
	// ……
	luaD_throw(L, LUA_ERRMEM);
	// ……
}

This error code is meant for the Lua state machine and indicates Lua's running status. The related status codes are:

/* lua.h */
/* thread status */
#define LUA_OK		0
#define LUA_YIELD	1
#define LUA_ERRRUN	2
#define LUA_ERRSYNTAX	3
#define LUA_ERRMEM	4
#define LUA_ERRGCMM	5
#define LUA_ERRERR	6
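When a C program gets one of these codes back from lua_pcall, a small lookup helper makes logging easier. The sketch below repeats the values from the listing above, guarded in case lua.h is already included. The function name status_text and the wording of each description are my own, not part of the Lua API.

```c
/* Values copied from the lua.h listing; skipped if lua.h was included */
#ifndef LUA_OK
#define LUA_OK        0
#define LUA_YIELD     1
#define LUA_ERRRUN    2
#define LUA_ERRSYNTAX 3
#define LUA_ERRMEM    4
#define LUA_ERRGCMM   5
#define LUA_ERRERR    6
#endif

/* Hypothetical helper: short human-readable text for a status code */
const char *status_text(int status) {
    switch (status) {
    case LUA_OK:        return "ok";
    case LUA_YIELD:     return "coroutine yielded";
    case LUA_ERRRUN:    return "runtime error";
    case LUA_ERRSYNTAX: return "syntax error";
    case LUA_ERRMEM:    return "memory allocation error";
    case LUA_ERRGCMM:   return "error in a __gc metamethod";
    case LUA_ERRERR:    return "error in the message handler";
    default:            return "unknown status";
    }
}
```

In the rmac_output call from section 1, the body of the if block could, for instance, print status_text of the value lua_pcall returned.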

luaD_throw stores this code in the current recovery point (L->errorJmp->status):

/* ldo.c */
l_noret luaD_throw (lua_State *L, int errcode) {
	if (L->errorJmp) {  /* thread has an error handler? */
		L->errorJmp->status = errcode;  /* set status */
		LUAI_THROW(L, L->errorJmp);  /* jump to it */
	}
	// ……
}

and the function that set the recovery point returns it to its caller:

/* ldo.c */
int luaD_rawrunprotected (lua_State *L, Pfunc f, void *ud) {
	// ……
	return lj.status;
}

Finally, luaD_pcall receives the returned status and deals with it:

/* ldo.c */
int luaD_pcall (lua_State *L, Pfunc func, void *u,
                ptrdiff_t old_top, ptrdiff_t ef) {
	// ……
	status = luaD_rawrunprotected(L, func, u);
	if (status != LUA_OK) {  /* an error occurred? */
		StkId oldtop = restorestack(L, old_top);
		luaF_close(L, oldtop);  /* close possible pending closures */
		// ……
		luaD_shrinkstack(L);
	}
	// ……
	return status;
}

One important function here is luaD_shrinkstack, particularly after an out-of-memory exception, because it reallocates the stack:

/* ldo.c */
void luaD_shrinkstack (lua_State *L) {
	// ……
	luaD_reallocstack(L, goodsize);  /* shrink it */
}

We will not descend any further into this function here.
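One detail in luaD_pcall is worth a small experiment: old_top is a ptrdiff_t produced by savestack, not a pointer. Since luaD_reallocstack may move the stack to a new address, a saved offset stays valid where a saved pointer would not. The sketch below reproduces the idea on a hypothetical stack of ints. toy_save and toy_restore follow the shape of savestack and restorestack as I understand them from ldo.h; all other names are made up.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable stack of ints standing in for the Lua stack */
struct toy_stack {
    int *base;
    size_t size;
};

/* Remember a byte offset from the base instead of a raw pointer */
#define toy_save(s, p)     ((char *)(p) - (char *)(s)->base)
#define toy_restore(s, n)  ((int *)((char *)(s)->base + (n)))

/* Resize the stack; the block may move, invalidating old pointers */
int toy_resize(struct toy_stack *s, size_t newsize) {
    int *nb = realloc(s->base, newsize * sizeof(int));
    if (nb == NULL)
        return 0;
    s->base = nb;
    s->size = newsize;
    return 1;
}

/* Store a value, shrink the stack, then find the slot again via its offset */
int toy_offset_demo(void) {
    struct toy_stack s = { NULL, 0 };
    ptrdiff_t saved;
    int value;

    if (!toy_resize(&s, 1024))
        return -1;
    s.base[3] = 42;
    saved = toy_save(&s, &s.base[3]);   /* offset of slot 3 */
    if (!toy_resize(&s, 8)) {           /* shrink; base may change */
        free(s.base);
        return -1;
    }
    value = *toy_restore(&s, saved);    /* still slot 3 */
    free(s.base);
    return value;
}
```

toy_offset_demo() returns 42 whether or not realloc moved the block, which is the property luaD_pcall relies on when it calls restorestack(L, old_top) after an error.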

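7. A final question

The handling that follows an exception also involves a "realloc" memory operation, yet LUAI_TRY is not used again to set a recovery point. In other words, if execution reaches luaM_realloc_ again and there is still not enough memory, then unless an enclosing protected call has left a recovery point on the chain, none will be found and the whole program will be terminated.

My guess is that Lua's designers treat shrinking the stack as the last attempt when handling an out-of-memory exception. If memory is still insufficient after that, the system can no longer support the Lua program, so there is no need to set another recovery point and the program can simply be terminated.

This design, however, may not account for platforms where memory is scarce and where, in a multithreaded setting, another thread may hold memory only temporarily and release it a moment later.

One way to deal with this is a backoff algorithm: after the out-of-memory exception is thrown, call the system sleep function to pause briefly and then run the error handling again. If the memory has been released by then, Lua can carry on running.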
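One place where such a backoff could be tried without modifying the interpreter is the allocator itself, since any function with the lua_Alloc signature can be handed to lua_newstate. Note that this differs from the idea above in where the retry happens: it waits inside the allocation call, before Lua ever sees a failure, rather than after the exception is thrown. The sketch wraps an inner allocator and retries with a doubling delay. backoff_cfg, backoff_alloc, and the test doubles flaky_alloc and fake_pause are hypothetical names; on a real platform the pause callback would wrap the system sleep call.

```c
#include <stddef.h>
#include <stdlib.h>

typedef void *(*alloc_fn)(void *ud, void *ptr, size_t osize, size_t nsize);
typedef void (*sleep_fn)(unsigned milliseconds);

/* Hypothetical configuration for the retrying allocator */
struct backoff_cfg {
    alloc_fn inner;           /* underlying allocator */
    void *inner_ud;
    sleep_fn pause;           /* wrapper around the platform sleep call */
    unsigned max_retries;
    unsigned first_delay_ms;
};

/* lua_Alloc-shaped allocator: on failure, wait and retry with doubled delay */
void *backoff_alloc(void *ud, void *ptr, size_t osize, size_t nsize) {
    struct backoff_cfg *cfg = (struct backoff_cfg *)ud;
    unsigned delay = cfg->first_delay_ms;
    unsigned attempt;
    void *p = cfg->inner(cfg->inner_ud, ptr, osize, nsize);
    for (attempt = 0; p == NULL && nsize > 0 && attempt < cfg->max_retries;
         attempt++) {
        cfg->pause(delay);
        delay *= 2;
        p = cfg->inner(cfg->inner_ud, ptr, osize, nsize);
    }
    return p;
}

/* Test doubles: an allocator that fails a set number of times, a fake sleep */
struct flaky { int fail_left; };
static unsigned total_pause_ms;

static void *flaky_alloc(void *ud, void *ptr, size_t osize, size_t nsize) {
    struct flaky *f = (struct flaky *)ud;
    (void)osize;
    if (nsize == 0) { free(ptr); return NULL; }
    if (f->fail_left > 0) { f->fail_left--; return NULL; }
    return realloc(ptr, nsize);
}

static void fake_pause(unsigned ms) { total_pause_ms += ms; }

/* Returns the total simulated wait in ms, or -1 if allocation still failed */
int backoff_demo(int failures, unsigned max_retries) {
    struct flaky f;
    struct backoff_cfg cfg;
    void *p;
    f.fail_left = failures;
    cfg.inner = flaky_alloc;
    cfg.inner_ud = &f;
    cfg.pause = fake_pause;
    cfg.max_retries = max_retries;
    cfg.first_delay_ms = 10;
    total_pause_ms = 0;
    p = backoff_alloc(&cfg, NULL, 0, 32);
    if (p == NULL)
        return -1;
    backoff_alloc(&cfg, p, 32, 0);   /* free the block */
    return (int)total_pause_ms;
}
```

With two simulated failures and up to three retries, the allocation succeeds after waiting 10 ms and then 20 ms. With five failures the retries run out and NULL is returned, at which point Lua's normal out-of-memory path takes over. Whether blocking inside the allocator is acceptable depends on the application, so this is a starting point rather than a recommendation.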
