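At its core, the Internet is one enormous data transmission network, and different applications ask different things of it: some care about throughput, that is, speed; some care about the reliability and integrity of messages; some need low latency. With the infrastructure fixed, our goal as programmers, operations engineers, or network engineers is the same: reduce cost as much as possible while improving the quality of the network service.
Cost very often shows up as consumption of computing resources, and the most important of these is CPU. Virtualization and virtual hosting (vHost) let a single physical server carry many more sandboxed sites, which makes it even more important to use as little CPU as possible when moving data. Even setting cost aside, lower CPU consumption during data transfer is a real benefit for latency-sensitive applications: less CPU work generally means less processing time, so data reaches the user sooner.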
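It was against this background that sendfile(2) was added to the Linux kernel (it first appeared in Linux 2.2, around 1999), with support gradually following on the major UNIX-like platforms, including Solaris. The call was designed for copying data from disk to the TCP stack, but it can also be used to copy data between two files (on Linux this requires kernel 2.6.33 or later; before that, out_fd had to be a socket).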
In Linux kernel 2.6, the prototype of this system call is:

```c
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
```
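- in_fd: the file descriptor opened for reading.
- out_fd: the file descriptor opened for writing.
- offset: a pointer to the file offset at which reading from in_fd starts; when the call returns, it has been advanced to the byte following the last byte read.
- count: the number of bytes to "move" between the two descriptors.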
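Because it arrived relatively late, POSIX has not standardized the interface, so implementations differ slightly from platform to platform. That is why you often see compatibility code like the following, which hides the differences behind preprocessor conditionals: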
```c
/**
 * @brief sendfile wrapper
 *
 * @see
 * @note
 * @author auxten <auxtenwpc@gmail.com>
 * @date 2011-8-1
 **/
#define ERR_RW_RETRIABLE(e) \
    ((e) == EINTR || (e) == EAGAIN || (e) == EWOULDBLOCK)

static inline int gsendfile(int out_fd, int in_fd, off_t *offset,
        GKO_UINT64 *count)
{
#if defined (__APPLE__)
    int ret = sendfile(in_fd, out_fd, *offset, (off_t *) count, NULL, 0);
    if (ret == -1 && !ERR_RW_RETRIABLE(errno))
        return (-1);
    return (*count);
#elif defined (__FreeBSD__)
    int ret = sendfile(in_fd, out_fd, *offset, *count, NULL, (off_t *) count, 0);
    if (ret == -1 && !ERR_RW_RETRIABLE(errno))
        return (-1);
    return (*count);
#elif defined(__linux__)
    int ret = sendfile(out_fd, in_fd, offset, *count);
    if (ret == -1 && ERR_RW_RETRIABLE(errno))
    {
        /** if this is EAGAIN or EINTR return 0; otherwise, -1 **/
        return (0);
    }
    return (ret);
#endif
}
```
Excerpted from: gingko/gingko.h at master · auxten/gingko · GitHub
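Before sendfile(2) existed, sending a file to a socket took the following steps:
- read(2) is called, and the file data is copied into a kernel buffer.
- read(2) returns, and the file data is copied from the kernel buffer into the user buffer.
- write(2) is called, and the file data is copied from the user buffer into the kernel buffer associated with the socket.
- The data is copied from the socket buffer to the protocol engine.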
This flow is illustrated in the figure from "Zero Copy I: User-Mode Perspective".
As we can see, the data actually goes through four copy operations along the way:

```
disk -> kernel buffer -> user buffer -> kernel socket buffer -> TCP protocol stack
```
Written as pseudo-code, it looks roughly like this:

```c
int out_fd, in_fd;
char buffer[BUFLEN];
ssize_t n;

n = read(in_fd, buffer, BUFLEN);   /* system call: traps into kernel mode */
if (n > 0)
    write(out_fd, buffer, n);      /* system call: traps into kernel mode */
```
Compared with sendfile(2), then, the "read & write" approach loses performance in two main ways:
- Unnecessary memory copies.
- Extra user-mode/kernel-mode context switches caused by the system calls.
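To make the two costs listed above more concrete, here is a rough sketch of a complete read/write copy loop that also counts how many system calls it issues. The function name copy_read_write, the BUFLEN value, and the counter are all assumptions for illustration. Every chunk crosses into the kernel twice and passes through the user buffer twice, which is the overhead a sendfile(2)-based transfer is meant to avoid.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define BUFLEN 4096   /* assumed chunk size, chosen for illustration */

/* Hypothetical helper: copy everything from in_fd to out_fd through a
 * user-space buffer. On success returns the bytes copied and stores the
 * number of read()/write() calls issued in *syscalls; returns -1 on error. */
static ssize_t copy_read_write(int out_fd, int in_fd, long *syscalls)
{
    char buffer[BUFLEN];
    ssize_t total = 0;
    *syscalls = 0;

    for (;;) {
        /* kernel buffer -> user buffer */
        ssize_t n = read(in_fd, buffer, sizeof buffer);
        (*syscalls)++;
        if (n == 0)
            return total;              /* end of file */
        if (n == -1) {
            if (errno == EINTR)
                continue;              /* interrupted: retry the read */
            return -1;
        }

        ssize_t done = 0;
        while (done < n) {             /* write() may be partial */
            /* user buffer -> kernel (socket or file) buffer */
            ssize_t w = write(out_fd, buffer + done, (size_t)(n - done));
            (*syscalls)++;
            if (w == -1) {
                if (errno == EINTR)
                    continue;          /* interrupted: retry the write */
                return -1;
            }
            done += w;
        }
        total += n;
    }
}
```

With a 4096-byte buffer, copying a 10,000-byte file this way takes at least seven system calls: four reads (the last one only detects end of file) and three writes. The count grows linearly with file size.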