命名空间相关系统调用

和 namespace 相关的系统调用主要有三个：

clone：创建新进程并设置它的 namespace
setns：让进程加入已经存在 namespace
unshare：让进程加入新的 namespace

clone

clone 类似于 fork 系统调用，可以创建一个新的进程，不同的是你可以指定要子进程要执行的函数以及通过参数控制子进程的运行环境（比如这篇文章主要介绍的 namespace）

http://man7.org/linux/man-pages/man2/clone.2.html

下面是 clone(2) 的定义：

#include <sched.h>

int clone(int (*fn)(void *), void *child_stack,
         int flags, void *arg, ...
         /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );

它有四个重要的参数：

fn 参数是一个函数指针，子进程启动的时候会调用这个函数来执行
arg 作为参数传给该函数。当这个函数返回，子进程的运行也就结束，函数的返回结果就是 exit code
child_stack 参数指定了子进程 stack 开始的内存地址，因为 stack 都会从高位到地位增长，所以这个指针需要指向分配 stack 的最高位地址
flags 是子进程启动的一些配置信息，包括信号（子进程结束的时候发送给父进程的信号 SIGCHLD）、子进程要运行的 namespace 信息（上面已经看到的 CLONE_NEWIPC，CLONE_NEWNET、CLONE_NEWIPC等）、其他配置信息（可以参考 clone(2) man page）

clone 示例

创建一个新的进程，并执行 /bin/bash，这样就可以接受命令，方便我们查看新进程的信息

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

// 设置子进程要使用的栈空间
#define STACK_SIZE (1024 * 1024)
static char container_stack[STACK_SIZE];

#define errExit(code, msg) \
    ;                      \
    {                      \
        if (code == -1)    \
        {                  \
            perror(msg);   \
            exit(-1);      \
        }                  \
    }

char *const container_args[] = {"/bin/bash", NULL};

static int container_func(void *arg)
{
    pid_t pid = getpid();
    printf("Container[%d] - inside the container!\n", pid);

    // 用一个新的 bash 来替换掉当前子进程
    // 这样我们就能通过 bash 查看当前子进程的情况
    // bash 退出后，子进程执行完毕
    execv(container_args[0], container_args);

    // 从这里开始的代码将不会被执行到，因为当前子进程已经被上面的 bash 替换掉了;
    // 所以如果执行到这里，一定是出错了
    printf("Container[%d] - oops!\n", pid);
    return 1;
}

int main(int argc, char *argv[])
{
    pid_t pid = getpid();
    printf("Parent[%d] - create a container!\n", pid);

    // 创建并启动子进程，调用该函数后，父进程将继续往后执行，也就是执行后面的 waitpid
    pid_t child_pid =
        clone(
            // 子进程将执行 container_func 这个函数
            container_func,
            container_stack + sizeof(container_stack),
            // 这里 SIGCHLD 是子进程退出后返回给父进程的信号，跟 namespace 无关
            SIGCHLD,
            // 传给child_func的参数
            NULL);
    errExit(child_pid, "clone");

    waitpid(child_pid, NULL, 0); // 等待子进程结束

    printf("Parent[%d] - container exited!\n", pid);
    return 0;
}

这段代码完成：

通过 clone 创建出一个子进程，并设置启动时的参数
在子进程中调用 execv 来执行 /bin/bash，等待用户进行交互
子进程退出之后，父进程也跟着退出

在不同的部分打印出了父进程和子进程的 pid 以及简单的信息，具体的执行效果：

> gcc container.c -o container

# 运行程序，可以看到父进程和子进程的 pid 分别是 31948 和 31949
# 分别在子进程和父进程中查看 /proc/$$/ns 的内容，发现两者完全一样，也就是说它们属于同一个 namespace
# 因为没有在 clone 的时候修改子进程的 namespace
> ./container
Parent[31948] - create a container!
Container[31949] - inside the container!
[root@dev-container-0 test]# ls -al /proc/$$/ns 
total 0
dr-x--x--x 2 root root 0  3月 11 15:09 .
dr-xr-xr-x 9 root root 0  3月 11 15:08 ..
lrwxrwxrwx 1 root root 0  3月 11 15:09 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 ipc -> 'ipc:[4026536660]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 mnt -> 'mnt:[4026536662]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 net -> 'net:[4026536373]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 pid -> 'pid:[4026536663]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 pid_for_children -> 'pid:[4026536663]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 time -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 uts -> 'uts:[4026536659]'
[root@dev-container-0 test]# exit 
exit
Parent[31948] - container exited!

> ls -al /proc/$$/ns 
total 0
dr-x--x--x 2 root root 0  3月 11 15:09 .
dr-xr-xr-x 9 root root 0  3月 11 14:41 ..
lrwxrwxrwx 1 root root 0  3月 11 15:09 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 ipc -> 'ipc:[4026536660]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 mnt -> 'mnt:[4026536662]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 net -> 'net:[4026536373]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 pid -> 'pid:[4026536663]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 pid_for_children -> 'pid:[4026536663]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 time -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0  3月 11 15:09 uts -> 'uts:[4026536659]'

setns

setns 能够把某个进程加入到给定的 namespace，它的定义是这样的：

int setns(int fd, int nstype);

fd 参数是一个文件描述符，指向 /proc/[pid]/ns/ 目录下的某个 namespace，调用这个函数的进程就会被加入到 fd 指向文件所代表的 namespace，fd 可以通过打开 namespace 对应的文件获取

nstype 限定进程可以加入的 namespaces，可能的取值是：

0: 可以加入任意的 namespaces
CLONE_NEWIPC：fd 必须指向 ipc namespace
CLONE_NEWNET：fd 必须指向 network namespace
CLONE_NEWNS：fd 必须指向 mount namespace
CLONE_NEWPID：fd 必须指向 PID namespace
CLONE_NEWUSER： fd 必须指向 user namespace
CLONE_NEWUTS： fd 必须指向 UTS namespace

如果不知道 fd 指向的 namespace 类型（比如 fd 是其他进程打开的，然后通过参数传递过来），然后在应用中希望明确指定特种类型的 namespace，nstype 就非常有用

更详细地说，setns 能够让进程离开现在所在的某个特性的 namespace，加入到另外一个同类型的已经存在的 namespace

需要注意的是：CLONE_NEWPID 和其他 namespace 不同，把进程加入到 PID namespace 并不会修改该进程的 PID namespace，而只修改它所有子进程的 PID namespace

unshare

int unshare(int flags);

unshare 比较简单，只有一个参数 flags，它的含义和 clone 的 flags 相同

unshare 和 setns 的区别是，setns 只能让进程加入到已经存在的 namespace 中，而 unshare 则让进程离开当前的 namespace，加入到新建的 namespace 中

unshare 和 clone 的区别在于：

unshare 是把当前进程进入到新的 namespace
clone 是创建新的进程，然后让新创建的进程（子进程）加入到新的 namespace