一些铺垫(扯淡)
历史上,在Python 2.4以及之前的版本,py代码的执行,也就是从源码到bytecode分为两步:
- 解析py源码成为分析树 (Parser/pgen.c)
- 基于分析树优化缩减bytecode (Python/compile.c)
也是由于Guido van Rossum大叔的历史局限性,Python当年就被实现成这么一个样子了,运行的也挺好。后来随着LLVM党的崛起,大家逐渐认识到这种模式的局限性,后面还会说。
正经说起来,一个标准的“现代”解释器(编译器)应该是“介事儿”的:
- 把py源码解析成分析树 (Parser/pgen.c)
- 把分析树转化成抽象语法树AST(Abstract Syntax Tree) (Python/ast.c)
- 把AST转化成CFG(Control Flow Graph) (Python/compile.c)
- 基于Control Flow Graph优化生成bytecode (Python/compile.c)
从Python 2.5开始,CPython“改邪归正”走上了现代编译器的康庄大道。简单来说就是把原先的第二步拆解成3个步骤。后面我们会着重简要讲一下后面三步的过程:
这里不会阐述太多关于语法解析如何工作的内容,除了必须的有助于我们了解编译过程的内容。如果想要了解Python语法解析的细节,还是建议你去直接阅读CPython的源码。
语法分析树
2.5版以后的Python的语法解释器是基于“龙书”的LL(1) parser的一个标准的实现。Python的语法定义可以在CPython源码的Grammar/Grammar (https://github.com/python/cpython/blob/master/Grammar/Grammar)看到:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 |
# Grammar for Python # NOTE WELL: You should also follow all the steps listed at # https://docs.python.org/devguide/grammar.html # Start symbols for the grammar: # single_input is a single interactive statement; # file_input is a module or sequence of commands read from an input file; # eval_input is the input for the eval() functions. # NB: compound_stmt in single_input is followed by extra NEWLINE! single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE file_input: (NEWLINE | stmt)* ENDMARKER eval_input: testlist NEWLINE* ENDMARKER decorator: '@' dotted_name [ '(' [arglist] ')' ] NEWLINE decorators: decorator+ decorated: decorators (classdef | funcdef | async_funcdef) async_funcdef: ASYNC funcdef funcdef: 'def' NAME parameters ['->' test] ':' suite parameters: '(' [typedargslist] ')'typedargslist: (tfpdef ['=' test] (',' tfpdef ['=' test])* [',' [ '*' [tfpdef] (',' tfpdef ['=' test])* [',' ['**' tfpdef [',']]] | '**' tfpdef [',']]] | '*' [tfpdef] (',' tfpdef ['=' test])* [',' ['**' tfpdef [',']]] | '**' tfpdef [',']) tfpdef: NAME [':' test] varargslist: (vfpdef ['=' test] (',' vfpdef ['=' test])* [',' [ '*' [vfpdef] (',' vfpdef ['=' test])* [',' ['**' vfpdef [',']]] | '**' vfpdef [',']]] | '*' [vfpdef] (',' vfpdef ['=' test])* [',' ['**' vfpdef [',']]] | '**' vfpdef [',']) vfpdef: NAME stmt: simple_stmt | compound_stmt simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE small_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt | import_stmt | global_stmt | nonlocal_stmt | assert_stmt) expr_stmt: testlist_star_expr (annassign | augassign (yield_expr|testlist) | ('=' (yield_expr|testlist_star_expr))*) annassign: ':' test ['=' test] testlist_star_expr: (test|star_expr) (',' (test|star_expr))* [',']augassign: ('+=' | '-=' | '*=' | '@=' | '/=' | '%=' | '&=' | '|=' | '^=' | '<<=' | '>>=' | '**=' | '//=') # For normal and annotated assignments, additional restrictions enforced by the interpreter del_stmt: 'del' exprlist pass_stmt: 'pass' flow_stmt: break_stmt | continue_stmt | return_stmt | raise_stmt | yield_stmt break_stmt: 'break' continue_stmt: 'continue' return_stmt: 'return' [testlist] yield_stmt: yield_expr raise_stmt: 'raise' [test ['from' test]] import_stmt: import_name | import_from import_name: 'import' dotted_as_names # note below: the ('.' | '...') is necessary because '...' is tokenized as ELLIPSIS import_from: ('from' (('.' | '...')* dotted_name | ('.' | '...')+) 'import' ('*' | '(' import_as_names ')' | import_as_names)) import_as_name: NAME ['as' NAME] dotted_as_name: dotted_name ['as' NAME] import_as_names: import_as_name (',' import_as_name)* [','] dotted_as_names: dotted_as_name (',' dotted_as_name)* dotted_name: NAME ('.' NAME)* global_stmt: 'global' NAME (',' NAME)* nonlocal_stmt: 'nonlocal' NAME (',' NAME)* assert_stmt: 'assert' test [',' test] compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt async_stmt: ASYNC (funcdef | with_stmt | for_stmt) if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite] while_stmt: 'while' test ':' suite ['else' ':' suite] for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]try_stmt: ('try' ':' suite ((except_clause ':' suite)+ ['else' ':' suite] ['finally' ':' suite] | 'finally' ':' suite)) with_stmt: 'with' with_item (',' with_item)* ':' suite with_item: test ['as' expr] # NB compile.c makes sure that the default except clause is last except_clause: 'except' [test ['as' NAME]] suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT test: or_test ['if' or_test 'else' test] | lambdef test_nocond: or_test | lambdef_nocond lambdef: 'lambda' [varargslist] ':' test lambdef_nocond: 'lambda' [varargslist] ':' test_nocond or_test: and_test ('or' and_test)* and_test: not_test ('and' not_test)* not_test: 'not' not_test | comparison comparison: expr (comp_op expr)* # <> isn't actually a valid comparison operator in Python. It's here for the# sake of a __future__ import described in PEP 401 (which really works :-) comp_op: '<'|'>'|'=='|'>='|'<='|'<>'|'!='|'in'|'not' 'in'|'is'|'is' 'not' star_expr: '*' expr expr: xor_expr ('|' xor_expr)* xor_expr: and_expr ('^' and_expr)* and_expr: shift_expr ('&' shift_expr)* shift_expr: arith_expr (('<<'|'>>') arith_expr)* arith_expr: term (('+'|'-') term)* term: factor (('*'|'@'|'/'|'%'|'//') factor)* factor: ('+'|'-'|'~') factor | power power: atom_expr ['**' factor] atom_expr: [AWAIT] atom trailer* atom: ('(' [yield_expr|testlist_comp] ')' | '[' [testlist_comp] ']' | '{' [dictorsetmaker] '}' | NAME | NUMBER | STRING+ | '...' | 'None' | 'True' | 'False')testlist_comp: (test|star_expr) ( comp_for | (',' (test|star_expr))* [','] ) trailer: '(' [arglist] ')' | '[' subscriptlist ']' | '.' NAME subscriptlist: subscript (',' subscript)* [','] subscript: test | [test] ':' [test] [sliceop] sliceop: ':' [test] exprlist: (expr|star_expr) (',' (expr|star_expr))* [','] testlist: test (',' test)* [','] dictorsetmaker: ( ((test ':' test | '**' expr) (comp_for | (',' (test ':' test | '**' expr))* [','])) | ((test | star_expr) (comp_for | (',' (test | star_expr))* [','])) ) classdef: 'class' NAME ['(' [arglist] ')'] ':' suite arglist: argument (',' argument)* [','] # The reason that keywords are test nodes instead of NAME is that using NAME # results in an ambiguity. ast.c makes sure it's a NAME. # "test '=' test" is really "keyword '=' test", but we have no such token. # These need to be in a single rule to avoid grammar that is ambiguous # to our LL(1) parser. Even though 'test' includes '*expr' in star_expr, # we explicitly match '*' here, too, to give it proper precedence. # Illegal combinations and orderings are blocked in ast.c: # multiple (test comp_for) arguments are blocked; keyword unpackings # that precede iterable unpackings are blocked; etc. argument: ( test [comp_for] | test '=' test | '**' test | '*' test ) comp_iter: comp_for | comp_if comp_for: [ASYNC] 'for' exprlist 'in' or_test [comp_iter] comp_if: 'if' test_nocond [comp_iter] # not used in grammar, but may appear in "node" passed from Parser to Compiler encoding_decl: NAME yield_expr: 'yield' [yield_arg] yield_arg: 'from' test | testlist |
各种token值的定义都在Include/graminit.h (https://github.com/python/cpython/blob/master/Include/graminit.h) 。组成语法解析树的node * structs定义在Include/node.h(https://github.com/python/cpython/blob/master/Include/node.h)。
从node结构体里查询是由一组定义在 Include/token.h 的一组宏实现的:
- CHILD(node *, int)
Returns the nth child of the node using zero-offset indexing
- RCHILD(node *, int)
Returns the nth child of the node from the right side; use negative numbers!
- NCH(node *)
Number of children the node has
- STR(node *)
String representation of the node; e.g., will return : for a COLON token
- TYPE(node *)
The type of node as specified in Include/graminit.h
- REQ(node *, TYPE)
Assert that the node is the type that is expected
- LINENO(node *)
retrieve the line number of the source code that led to the creation of the parse rule; defined in Python/ast.c
我们可以通过while语法的定义了解到上面所有的示例:
1 |
while_stmt: 'while' test ':' suite ['else' ':' suite] |
这段定义表示这个节点将会有如下等式成立:
1 |
TYPE(node) == while_stmt; |
子节点的编号将会是4或者7,取决于’while’后面是否有一个’else’声明。通过下面的代码我们可以确定后面的第一个 ‘:’ 是否存在以及代表什么。
1 |
(REQ(CHILD(node, 2), COLON) |
抽象语法树 (AST)
抽象语法树(AST)是程序结构的一种高抽象层次的表达,有了它我们并再不需要源代码的存在了。它可以认为是源代码等价的一种抽象表达。CPython采用Zephyr抽象语法定义语言(ASDL)(http://www.cs.princeton.edu/research/techreps/TR-554-97)来描述AST。
Python的AST定义可以在源代码的Parser/Python.asdl (https://github.com/python/cpython/blob/master/Parser/Python.asdl)找到:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 |
-- ASDL's 7 builtin types are: -- identifier, int, string, bytes, object, singleton, constant -- -- singleton: None, True or False -- constant can be None, whereas None means "no value" for object. module Python { mod = Module(stmt* body, string? docstring) | Interactive(stmt* body) | Expression(expr body) -- not really an actual node but useful in Jython's typesystem. | Suite(stmt* body) stmt = FunctionDef(identifier name, arguments args, stmt* body, expr* decorator_list, expr? returns, string? docstring) | AsyncFunctionDef(identifier name, arguments args, stmt* body, expr* decorator_list, expr? returns, string? docstring) | ClassDef(identifier name, expr* bases, keyword* keywords, stmt* body, expr* decorator_list, string? docstring) | Return(expr? value) | Delete(expr* targets) | Assign(expr* targets, expr value) | AugAssign(expr target, operator op, expr value) -- 'simple' indicates that we annotate simple name without parens | AnnAssign(expr target, expr annotation, expr? value, int simple) -- use 'orelse' because else is a keyword in target languages | For(expr target, expr iter, stmt* body, stmt* orelse) | AsyncFor(expr target, expr iter, stmt* body, stmt* orelse) | While(expr test, stmt* body, stmt* orelse) | If(expr test, stmt* body, stmt* orelse) | With(withitem* items, stmt* body) | AsyncWith(withitem* items, stmt* body) | Raise(expr? exc, expr? cause) | Try(stmt* body, excepthandler* handlers, stmt* orelse, stmt* finalbody) | Assert(expr test, expr? msg) | Import(alias* names) | ImportFrom(identifier? module, alias* names, int? level) | Global(identifier* names) | Nonlocal(identifier* names) | Expr(expr value) | Pass | Break | Continue -- XXX Jython will be different -- col_offset is the byte offset in the utf8 string the parser uses attributes (int lineno, int col_offset) -- BoolOp() can use left & right? expr = BoolOp(boolop op, expr* values) | BinOp(expr left, operator op, expr right) | UnaryOp(unaryop op, expr operand) | Lambda(arguments args, expr body) | IfExp(expr test, expr body, expr orelse) | Dict(expr* keys, expr* values) | Set(expr* elts) | ListComp(expr elt, comprehension* generators) | SetComp(expr elt, comprehension* generators) | DictComp(expr key, expr value, comprehension* generators) | GeneratorExp(expr elt, comprehension* generators) -- the grammar constrains where yield expressions can occur | Await(expr value) | Yield(expr? value) | YieldFrom(expr value) -- need sequences for compare to distinguish between -- x < 4 < 3 and (x < 4) < 3 | Compare(expr left, cmpop* ops, expr* comparators) | Call(expr func, expr* args, keyword* keywords) | Num(object n) -- a number as a PyObject. | Str(string s) -- need to specify raw, unicode, etc? | FormattedValue(expr value, int? conversion, expr? format_spec) | JoinedStr(expr* values) | Bytes(bytes s) | NameConstant(singleton value) | Ellipsis | Constant(constant value) -- the following expression can appear in assignment context | Attribute(expr value, identifier attr, expr_context ctx) | Subscript(expr value, slice slice, expr_context ctx) | Starred(expr value, expr_context ctx) | Name(identifier id, expr_context ctx) | List(expr* elts, expr_context ctx) | Tuple(expr* elts, expr_context ctx) -- col_offset is the byte offset in the utf8 string the parser uses attributes (int lineno, int col_offset) expr_context = Load | Store | Del | AugLoad | AugStore | Param slice = Slice(expr? lower, expr? upper, expr? step) | ExtSlice(slice* dims) | Index(expr value) boolop = And | Or operator = Add | Sub | Mult | MatMult | Div | Mod | Pow | LShift | RShift | BitOr | BitXor | BitAnd | FloorDiv unaryop = Invert | Not | UAdd | USub cmpop = Eq | NotEq | Lt | LtE | Gt | GtE | Is | IsNot | In | NotIn comprehension = (expr target, expr iter, expr* ifs, int is_async) excepthandler = ExceptHandler(expr? type, identifier? name, stmt* body) attributes (int lineno, int col_offset) arguments = (arg* args, arg? vararg, arg* kwonlyargs, expr* kw_defaults, arg? kwarg, expr* defaults) arg = (identifier arg, expr? annotation) attributes (int lineno, int col_offset) -- keyword arguments supplied to call (NULL identifier for **kwargs) keyword = (identifier? arg, expr value) -- import name with optional 'as' alias. alias = (identifier name, identifier? asname) withitem = (expr context_expr, expr? optional_vars) } |
每种AST节点(声明,表达式,一些特殊类型例如:列表推导式、异常处理handler)都被ASDL描述定义。大多数的AST定义都对应一种特定的源码结构,例如一个’if’定义或者一个属性查找。这些定义都是和具体实现Python的语言无关的,也就是说无论CPython、PyPy、Jyphon甚至IronPython,都是用的同样的ASDL。
下面的Python ASDL结构展示了Python的函数定义相关语法结构和实现:
1 2 3 4 5 6 7 |
module Python { stmt = FunctionDef(identifier name, arguments args, stmt* body, expr* decorators) | Return(expr? value) | Yield(expr value) attributes (int lineno) } |
上面的示例描述了3种不同的语法结构:def函数定义、return声明、yield声明。这三种语法结构的语法声明被用 ‘|’ 分割。它们接受的参数类型和个数各不相同。
每个参数定义前面还带了不同的修饰符(这点和正则很像)来说明需要多少个参数:
- ‘?’说明这个参数是可有可无的(0~1个);
- ‘*’说明这个参数的数量是0或者更多(≥ 0);
- 没有修饰符表示这个参数是必须且只能有一个的。
例如,函数定义(def)接受一个 identifier 作为函数名,’arguments’作为参数,0或者多个stmt作为函数体,0或者多个expr作为装饰器。
请注意这里的’arguments’,不像大家可能会认为的和stmt一样,它是作为一个node类型会由一个单独的AST定义所表示。
所有的这三种语法结构类型都有一个 ‘attributes’ 参数,所以 ‘attributes’ 前面没有 ‘|’ 。
上面的ASDL会被翻译生成如下的C语言结构体类型:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
typedef struct _stmt *stmt_ty; struct _stmt { enum { FunctionDef_kind=1, Return_kind=2, Yield_kind=3 } kind; union { struct { identifier name; arguments_ty args; asdl_seq *body; } FunctionDef; struct { expr_ty value; } Return; struct { expr_ty value; } Yield; } v; int lineno; } |
一同被生成的还有一些构造函数,它们会给这些语法结构分配一个stmt_ty结构体,并附带一些必要的初始化。’kind’ enum字段表明下面声明的是哪种union。FunctionDef() 的构造函数将会吧’kind’设置成 FunctionDef_kind 并且初始化 ‘name’, ‘args’, ‘body’, 和 ‘attributes’ 字段。
理论结合实战,使学员既可掌握快速从零构建一套实用、完整、可扩展的运维自动化平台:
- 深度结合使用流行的zabbix、Ansible、Git、Docker、Rancher、ELK等开源框架与工具, 以应用最广泛的django框架为基础,构建一站式运维自动化平台
- 通过深度剖析与二次开发定制,结合REST API、运维流程化、运维可视化、运维平台化 思想来构造企业级的运维自动化解决方案
- 在老师带领下大战Zabbix、CMDB、集群自动化部署上线、ELK日志大数据分析、Docker 容器管理平台等多个最新实战,天天实战,招招实用
课程时间:5期运维自动化 8月6日开班
上课模式:网络直播班 线下面授班
咨询QQ(1):979950755 小夏
QQ(2):279312229 ada
报名咨询电话:010-56061309
发表评论