Merge branch 'bpf_iter'
Yonghong Song says:

====================
Motivation:
  The current ways to dump kernel data structures are mostly:
    1. /proc system
    2. various specific tools like "ss" which require kernel support.
    3. drgn
  The drawback of the first two is that whenever you want to dump more, you
  need to change the kernel. For example, Martin wants to dump socket local
  storage with "ss". A kernel change is needed for it to work ([1]).
  This is also the direct motivation for this work.

  drgn ([2]) solves this problem nicely, and no kernel change is needed.
  But since drgn is not able to verify the validity of a particular pointer value,
  it might present wrong results in rare cases.

  In this patch set, we introduce the bpf iterator. Initial kernel changes are
  still needed for the kernel data of interest, but a later data structure change
  will not require kernel changes any more. The bpf program itself can adapt
  to new data structure changes. This gives a certain flexibility with
  guaranteed correctness.

  In this patch set, kernel seq_ops is used to facilitate iterating through
  kernel data, similar to current /proc and many other lossless kernel
  dumping facilities. In the future, different iterators can be
  implemented to trade off losslessness for other criteria e.g. no
  repeated object visits, etc.
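
To make the seq_ops idea concrete, here is a minimal userspace sketch of the start/show/next/stop contract that seq_file-style iteration follows. The names mock_seq_ops, int_table and mock_seq_walk are hypothetical and exist only for illustration. The kernel's real struct seq_operations takes a struct seq_file * and a loff_t * rather than the priv pointer and long * used here, and where exactly the bpf program gets invoked is covered by the individual patches, not by this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's struct seq_operations. */
struct mock_seq_ops {
	void *(*start)(void *priv, long *pos);
	void *(*next)(void *priv, void *v, long *pos);
	void (*stop)(void *priv, void *v);
	int (*show)(void *priv, void *v);
};

/* The "kernel data" being walked: a plain array of ints. */
struct int_table {
	const int *items;
	long nr;
	long shown;	/* number of show() calls so far */
};

static void *tbl_start(void *priv, long *pos)
{
	struct int_table *t = priv;

	/* Return the element at *pos, or NULL once past the end. */
	return *pos < t->nr ? (void *)&t->items[*pos] : NULL;
}

static void *tbl_next(void *priv, void *v, long *pos)
{
	(void)v;
	++*pos;
	return tbl_start(priv, pos);
}

static void tbl_stop(void *priv, void *v)
{
	(void)priv;
	(void)v;
}

static int tbl_show(void *priv, void *v)
{
	struct int_table *t = priv;

	t->shown++;
	printf("%d\n", *(const int *)v);
	return 0;
}

const struct mock_seq_ops int_table_ops = {
	.start = tbl_start,
	.next = tbl_next,
	.stop = tbl_stop,
	.show = tbl_show,
};

/* Drive one full pass: start, then show/next until NULL, then stop. */
long mock_seq_walk(const struct mock_seq_ops *ops, void *priv)
{
	long pos = 0, visited = 0;
	void *v;

	assert(ops && ops->start && ops->next && ops->stop && ops->show);
	v = ops->start(priv, &pos);
	while (v) {
		if (ops->show(priv, v))
			break;
		visited++;
		v = ops->next(priv, v, &pos);
	}
	ops->stop(priv, v);
	return visited;
}
```

Calling mock_seq_walk(&int_table_ops, &table) visits every element once and in order, which is the "lossless" property the paragraph above refers to.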

User Interface:
  1. Similar to prog/map/link, the iterator can be pinned into a
     path within a bpffs mount point.
  2. The bpftool command can pin an iterator to a file
         bpftool iter pin <bpf_prog.o> <path>
  3. Use `cat <path>` to dump the contents.
     Use `rm -f <path>` to remove the pinned iterator.
  4. The anonymous iterator can be created as well.
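
Step 3 above boils down to plain read() calls on the pinned path. A minimal sketch in C, assuming an iterator has already been pinned: dump_iter_file is a hypothetical helper name, and whatever path is passed to it (for example, somewhere under a bpffs mount) is an assumption of the caller, not something this sketch sets up.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Copy everything readable from `path` to `out`, like `cat <path>`.
 * Returns the number of bytes copied, or -1 on error (errno is set).
 */
long dump_iter_file(const char *path, FILE *out)
{
	char buf[4096];
	long total = 0;
	ssize_t n;
	int fd, saved;

	assert(path != NULL && out != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* Keep reading until read() reports end of data with 0. */
	while ((n = read(fd, buf, sizeof(buf))) != 0) {
		if (n < 0) {
			if (errno == EINTR)
				continue;
			saved = errno;
			close(fd);
			errno = saved;
			return -1;
		}
		if (fwrite(buf, 1, (size_t)n, out) != (size_t)n) {
			saved = errno;
			close(fd);
			errno = saved;
			return -1;
		}
		total += n;
	}
	close(fd);
	return total;
}
```

The same loop should work on an anonymous iterator if the open() is replaced by an fd obtained elsewhere; that variation is not shown here.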

  Please see patches #19 and #20 for bpf programs and bpf iterator
  output examples.

  Note that certain iterators are namespace aware. For example,
  task and task_file targets only iterate through current pid namespace.
  ipv6_route and netlink will iterate through current net namespace.

  Please see individual patches for implementation details.

Performance:
  The bpf iterator provides in-kernel aggregation abilities
  for kernel data. This can greatly improve performance
  compared to e.g., iterating all process directories under /proc.
  For example, I did an experiment on my VM with an application forking
  different numbers of tasks, with each forked process opening a varying
  number of files. The results, with latency in microseconds:

    # of forked tasks   # of open files    # of bpf_prog calls  # latency (us)
    100                 100                11503                7586
    1000                1000               1013203              709513
    10000               100                1130203              764519

  The number of bpf_prog calls may be more than forked tasks multiplied by
  open files since there are other tasks running on the system.
  The bpf program is a do-nothing program. One million bpf calls take
  less than one second.
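
As a quick check on the table above, dividing each latency by its call count gives the average cost per bpf_prog invocation. The helper below is only a worked-arithmetic sketch (ns_per_call is a hypothetical name); the inputs are the three rows of the table, and the results come out at roughly 660 ns, 700 ns and 676 ns per call.

```c
#include <assert.h>
#include <stdint.h>

/* Average cost of one bpf_prog invocation, in nanoseconds. */
double ns_per_call(uint64_t latency_us, uint64_t nr_calls)
{
	assert(nr_calls > 0);
	return (double)latency_us * 1000.0 / (double)nr_calls;
}
```

For example, ns_per_call(709513, 1013203) covers the middle row. All three rows land well under one microsecond per call, consistent with the "less than one second per million calls" remark.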

  Although the initial motivation is from Martin's sk_local_storage,
  this patch didn't implement tcp6 sockets and sk_local_storage.
  The /proc/net/tcp6 involves three types of sockets, timewait,
  request and tcp6 sockets. Some kind of type casting or other
  mechanism is needed to handle all these socket types in one
  bpf program. This will be addressed in future work.

  Currently, we do not support kernel data generated by modules.
  This requires some BTF work.

  More work is needed for more iterators, e.g., tcp, udp, bpf_map elements, etc.

Changelog:
  v3 -> v4:
    - in bpf_seq_read(), if start() failed with an error, return that
      error to user space (Andrii)
    - in bpf_seq_printf(), if reading kernel memory failed for
      %s and %p{i,I}{4,6}, set buffer to empty string or address 0.
      Documented this behavior in uapi header (Andrii)
    - fix a few error handling issues for bpftool (Andrii)
    - A few other minor fixes and cosmetic changes.
  v2 -> v3:
    - add bpf_iter_unreg_target() to unregister a target, used in the
      error path of the __init functions.
    - handle err != 0 before handling overflow (Andrii)
    - reference count "task" for task_file target (Andrii)
    - remove some redundancy for bpf_map/task/task_file targets
    - add bpf_iter_unreg_target() in ip6_route_cleanup()
    - Handling "%%" format in bpf_seq_printf() (Andrii)
    - implement auto-attach for bpf_iter in libbpf (Andrii)
    - add macros offsetof and container_of in bpf_helpers.h (Andrii)
    - add tests for auto-attach and program-return-1 cases
    - some other minor fixes
  v1 -> v2:
    - removed target_feature, using callback functions instead
    - checking target to ensure program specified btf_id supported (Martin)
    - link_create change with new changes from Andrii
    - better handling of btf_iter vs. seq_file private data (Martin, Andrii)
    - implemented bpf_seq_read() (Andrii, Alexei)
    - percpu buffer for bpf_seq_printf() (Andrii)
    - better syntax for BPF_SEQ_PRINTF macro (Andrii)
    - bpftool fixes (Quentin)
    - a lot of other fixes
  RFC v2 -> v1:
    - rename bpfdump to bpf_iter
    - use bpffs instead of a new file system
    - use bpf_link to streamline and simplify iterator creation.

References:
  [1]: https://lore.kernel.org/bpf/20200225230427.1976129-1-kafai@fb.com
  [2]: https://github.com/osandov/drgn
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov committed May 10, 2020
2 parents 8086fba + 6879c04 commit 180139d
Showing 43 changed files with 2,664 additions and 14 deletions.
19 changes: 19 additions & 0 deletions fs/proc/proc_net.c
@@ -98,6 +98,25 @@ static const struct proc_ops proc_net_seq_ops = {
.proc_release = seq_release_net,
};

int bpf_iter_init_seq_net(void *priv_data)
{
#ifdef CONFIG_NET_NS
struct seq_net_private *p = priv_data;

p->net = get_net(current->nsproxy->net_ns);
#endif
return 0;
}

void bpf_iter_fini_seq_net(void *priv_data)
{
#ifdef CONFIG_NET_NS
struct seq_net_private *p = priv_data;

put_net(p->net);
#endif
}

struct proc_dir_entry *proc_create_net_data(const char *name, umode_t mode,
struct proc_dir_entry *parent, const struct seq_operations *ops,
unsigned int state_size, void *data)
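
The two helpers added in this hunk form a pair: the init callback takes a reference on the current net namespace and the fini callback drops it. The userspace sketch below mimics that pairing with a plain counter. Everything prefixed with mock_ is hypothetical, the initial count of 1 is an arbitrary choice, and real struct net reference counting is more involved than an int.

```c
#include <assert.h>

/* Hypothetical stand-ins for struct net and struct seq_net_private. */
struct mock_net {
	int refcnt;
};

struct mock_seq_net_private {
	struct mock_net *net;
};

/* Plays the role of current->nsproxy->net_ns in this sketch. */
static struct mock_net current_ns = { .refcnt = 1 };

static struct mock_net *mock_get_net(struct mock_net *n)
{
	n->refcnt++;
	return n;
}

static void mock_put_net(struct mock_net *n)
{
	assert(n->refcnt > 0);
	n->refcnt--;
}

/* Mirrors the shape of bpf_iter_init_seq_net(): take a reference. */
int mock_init_seq_net(void *priv_data)
{
	struct mock_seq_net_private *p = priv_data;

	p->net = mock_get_net(&current_ns);
	return 0;
}

/* Mirrors the shape of bpf_iter_fini_seq_net(): drop the reference. */
void mock_fini_seq_net(void *priv_data)
{
	struct mock_seq_net_private *p = priv_data;

	mock_put_net(p->net);
}

/* Lets a caller observe the counter. */
int mock_net_refcnt(void)
{
	return current_ns.refcnt;
}
```

Every successful init must be matched by exactly one fini, otherwise the namespace in this model would never return to its starting count.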
36 changes: 36 additions & 0 deletions include/linux/bpf.h
@@ -31,6 +31,7 @@ struct seq_file;
struct btf;
struct btf_type;
struct exception_table_entry;
struct seq_operations;

extern struct idr btf_idr;
extern spinlock_t btf_idr_lock;
@@ -319,6 +320,7 @@ enum bpf_reg_type {
PTR_TO_TP_BUFFER, /* reg points to a writable raw tp's buffer */
PTR_TO_XDP_SOCK, /* reg points to struct xdp_sock */
PTR_TO_BTF_ID, /* reg points to kernel struct */
PTR_TO_BTF_ID_OR_NULL, /* reg points to kernel struct or NULL */
};

/* The information passed from prog-specific *_is_valid_access
@@ -657,6 +659,7 @@ struct bpf_prog_aux {
bool offload_requested;
bool attach_btf_trace; /* true if attaching to BTF-enabled raw tp */
bool func_proto_unreliable;
bool btf_id_or_null_non0_off;
enum bpf_tramp_prog_type trampoline_prog_type;
struct bpf_trampoline *trampoline;
struct hlist_node tramp_hlist;
@@ -1021,6 +1024,7 @@ static inline void bpf_enable_instrumentation(void)

extern const struct file_operations bpf_map_fops;
extern const struct file_operations bpf_prog_fops;
extern const struct file_operations bpf_iter_fops;

#define BPF_PROG_TYPE(_id, _name, prog_ctx_type, kern_ctx_type) \
extern const struct bpf_prog_ops _name ## _prog_ops; \
@@ -1080,6 +1084,7 @@ int generic_map_update_batch(struct bpf_map *map,
int generic_map_delete_batch(struct bpf_map *map,
const union bpf_attr *attr,
union bpf_attr __user *uattr);
struct bpf_map *bpf_map_get_curr_or_next(u32 *id);

extern int sysctl_unprivileged_bpf_disabled;

@@ -1126,6 +1131,37 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
int bpf_obj_get_user(const char __user *pathname, int flags);

#define BPF_ITER_FUNC_PREFIX "__bpf_iter__"
#define DEFINE_BPF_ITER_FUNC(target, args...) \
extern int __bpf_iter__ ## target(args); \
int __init __bpf_iter__ ## target(args) { return 0; }

typedef int (*bpf_iter_init_seq_priv_t)(void *private_data);
typedef void (*bpf_iter_fini_seq_priv_t)(void *private_data);

struct bpf_iter_reg {
const char *target;
const struct seq_operations *seq_ops;
bpf_iter_init_seq_priv_t init_seq_private;
bpf_iter_fini_seq_priv_t fini_seq_private;
u32 seq_priv_size;
};

struct bpf_iter_meta {
__bpf_md_ptr(struct seq_file *, seq);
u64 session_id;
u64 seq_num;
};

int bpf_iter_reg_target(struct bpf_iter_reg *reg_info);
void bpf_iter_unreg_target(const char *target);
bool bpf_iter_prog_supported(struct bpf_prog *prog);
int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
int bpf_iter_new_fd(struct bpf_link *link);
bool bpf_link_is_iter(struct bpf_link *link);
struct bpf_prog *bpf_iter_get_info(struct bpf_iter_meta *meta, bool in_stop);
int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);

int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value,
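
The DEFINE_BPF_ITER_FUNC macro in this hunk pastes the target name onto a fixed prefix, so a target called "task" would yield a function named __bpf_iter__task. The sketch below reproduces that naming scheme in userspace and shows how a target name could be recovered from such a function name. It is written under assumptions: the macro is restated with standard __VA_ARGS__ and without the kernel-only __init annotation, the target "demo" and the helper iter_target_of are hypothetical, and whether the kernel matches names exactly this way is not shown in this hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BPF_ITER_FUNC_PREFIX "__bpf_iter__"

/* Userspace restatement of the macro, minus the __init annotation. */
#define DEFINE_BPF_ITER_FUNC(target, ...) \
	extern int __bpf_iter__ ## target(__VA_ARGS__); \
	int __bpf_iter__ ## target(__VA_ARGS__) { return 0; }

/* Hypothetical target "demo"; expands to int __bpf_iter__demo(void). */
DEFINE_BPF_ITER_FUNC(demo, void)

/*
 * Return the target name embedded in a function name, or NULL if the
 * name does not start with the iterator prefix.
 */
const char *iter_target_of(const char *func_name)
{
	size_t len = strlen(BPF_ITER_FUNC_PREFIX);

	assert(func_name != NULL);
	if (strncmp(func_name, BPF_ITER_FUNC_PREFIX, len) != 0)
		return NULL;
	return func_name + len;
}
```

With this, iter_target_of("__bpf_iter__task") points at "task", while a name without the prefix yields NULL.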
1 change: 1 addition & 0 deletions include/linux/bpf_types.h
@@ -124,3 +124,4 @@ BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
#ifdef CONFIG_CGROUP_BPF
BPF_LINK_TYPE(BPF_LINK_TYPE_CGROUP, cgroup)
#endif
BPF_LINK_TYPE(BPF_LINK_TYPE_ITER, iter)
3 changes: 3 additions & 0 deletions include/linux/proc_fs.h
@@ -105,6 +105,9 @@ struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mo
void *data);
extern struct pid *tgid_pidfd_to_pid(const struct file *file);

extern int bpf_iter_init_seq_net(void *priv_data);
extern void bpf_iter_fini_seq_net(void *priv_data);

#ifdef CONFIG_PROC_PID_ARCH_STATUS
/*
* The architecture which selects CONFIG_PROC_PID_ARCH_STATUS must
47 changes: 46 additions & 1 deletion include/uapi/linux/bpf.h
@@ -116,6 +116,7 @@ enum bpf_cmd {
BPF_LINK_GET_FD_BY_ID,
BPF_LINK_GET_NEXT_ID,
BPF_ENABLE_STATS,
BPF_ITER_CREATE,
};

enum bpf_map_type {
@@ -218,6 +219,7 @@ enum bpf_attach_type {
BPF_TRACE_FEXIT,
BPF_MODIFY_RETURN,
BPF_LSM_MAC,
BPF_TRACE_ITER,
__MAX_BPF_ATTACH_TYPE
};

@@ -228,6 +230,7 @@ enum bpf_link_type {
BPF_LINK_TYPE_RAW_TRACEPOINT = 1,
BPF_LINK_TYPE_TRACING = 2,
BPF_LINK_TYPE_CGROUP = 3,
BPF_LINK_TYPE_ITER = 4,

MAX_BPF_LINK_TYPE,
};
@@ -612,6 +615,11 @@ union bpf_attr {
__u32 type;
} enable_stats;

struct { /* struct used by BPF_ITER_CREATE command */
__u32 link_fd;
__u32 flags;
} iter_create;

} __attribute__((aligned(8)));

/* The description below is an attempt at providing documentation to eBPF
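
The new iter_create member of union bpf_attr is what a BPF_ITER_CREATE call fills in. A minimal sketch of issuing that command through the raw bpf(2) syscall follows. It assumes installed UAPI headers recent enough to define BPF_ITER_CREATE and iter_create, and it assumes the caller already holds a file descriptor for an iterator link. The wrapper name iter_create_from_link is hypothetical.

```c
#include <assert.h>
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Ask the kernel for an iterator fd backed by an existing iterator
 * link. Returns the new fd, or -1 with errno set on failure.
 */
int iter_create_from_link(int link_fd)
{
	union bpf_attr attr;

	assert(link_fd >= 0);

	/* Zero the whole union so unused bytes are not garbage. */
	memset(&attr, 0, sizeof(attr));
	attr.iter_create.link_fd = (__u32)link_fd;
	attr.iter_create.flags = 0;	/* this sketch passes no flags */

	return (int)syscall(__NR_bpf, BPF_ITER_CREATE, &attr, sizeof(attr));
}
```

On success the returned descriptor can be read with ordinary read() calls and closed with close(). Passing a descriptor that is not an iterator link is expected to fail.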
@@ -3069,6 +3077,41 @@ union bpf_attr {
* See: clock_gettime(CLOCK_BOOTTIME)
* Return
* Current *ktime*.
*
* int bpf_seq_printf(struct seq_file *m, const char *fmt, u32 fmt_size, const void *data, u32 data_len)
* Description
* seq_printf uses seq_file seq_printf() to print out the format string.
* The *m* represents the seq_file. The *fmt* and *fmt_size* are for
* the format string itself. The *data* and *data_len* are format string
* arguments. The *data* are a u64 array and corresponding format string
* values are stored in the array. For strings and pointers where pointees
* are accessed, only the pointer values are stored in the *data* array.
* The *data_len* is the *data* size in term of bytes.
*
* Formats **%s**, **%p{i,I}{4,6}** requires to read kernel memory.
* Reading kernel memory may fail due to either invalid address or
* valid address but requiring a major memory fault. If reading kernel memory
* fails, the string for **%s** will be an empty string, and the ip
* address for **%p{i,I}{4,6}** will be 0. Not returning error to
* bpf program is consistent with what bpf_trace_printk() does for now.
* Return
* 0 on success, or a negative errno in case of failure.
*
* * **-EBUSY** Percpu memory copy buffer is busy, can try again
* by returning 1 from bpf program.
* * **-EINVAL** Invalid arguments, or invalid/unsupported formats.
* * **-E2BIG** Too many format specifiers.
* * **-EOVERFLOW** Overflow happens, the same object will be tried again.
*
* int bpf_seq_write(struct seq_file *m, const void *data, u32 len)
* Description
* seq_write uses seq_file seq_write() to write the data.
* The *m* represents the seq_file. The *data* and *len* represent the
* data to write in bytes.
* Return
* 0 on success, or a negative errno in case of failure.
*
* * **-EOVERFLOW** Overflow happens, the same object will be tried again.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -3196,7 +3239,9 @@ union bpf_attr {
FN(get_netns_cookie), \
FN(get_current_ancestor_cgroup_id), \
FN(sk_assign), \
-FN(ktime_get_boot_ns),
+FN(ktime_get_boot_ns), \
+FN(seq_printf), \
+FN(seq_write),

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
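
The bpf_seq_printf() description in this hunk says that format arguments travel in a u64 array, that for strings only the pointer value is stored, and that data_len is the array size in bytes. The userspace sketch below packs the arguments for a format such as "%d %s\n" in that layout. pack_seq_args is a hypothetical helper written for illustration; it does not call the helper itself, which is only available to bpf programs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pack one integer and one string argument into a u64 array:
 * the integer is widened to 64 bits, and for the string only the
 * pointer value is stored, not the string bytes.
 * Returns data_len, i.e. the number of bytes used in the array.
 */
uint32_t pack_seq_args(uint64_t *data, size_t cap, int pid, const char *comm)
{
	assert(data != NULL && cap >= 2);

	data[0] = (uint64_t)pid;
	data[1] = (uint64_t)(uintptr_t)comm;

	return (uint32_t)(2 * sizeof(uint64_t));
}
```

Two arguments therefore give a data_len of 16. The BPF_SEQ_PRINTF macro mentioned in the changelog presumably hides this packing from program authors.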
2 changes: 1 addition & 1 deletion kernel/bpf/Makefile
@@ -2,7 +2,7 @@
obj-y := core.o
CFLAGS_core.o += $(call cc-disable-warning, override-init)

-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o
obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
obj-$(CONFIG_BPF_SYSCALL) += disasm.o