This document explains how to use the framework, how it works, and what it can do.
It is still a work in progress.
- 1 - Introduction
- 2 - Usage of this framework
- 3 - Collect information of the target project
- 4 - Parse the data
- 5 - Use the information to do something
- 6 - Future work
- 7 - References
- 8 - Greetings
SRCINV is short for source code investigator.
Source code (or binary code) is written not only for the humans who read it, but also for the compiler (or interpreter, or CPU).
This framework aims to audit the code of open source projects automatically.
QA engineers and researchers can use it to find different kinds of bugs in their projects.
The framework may also automatically generate samples that test a bug and patches that fix it.
Directories
- analysis: parses resfile(s) and provides helper functions for hacking to call
- bin: srcinv binaries and modules
- collect: compiler plugins that run in the root directory of the target project and collect the compilation data (AST)
- config: runtime configs. The file currently in use is module.json
- core: the main process; handles user input, parses configs, loads modules, etc.
- doc: documentation, ChangeLog
- hacking: do anything you want here. For example, find the locations where the field gid of struct cred is used, or list the functions in a directory or a single source file
- include: header files
- lib: provides some compiler functions for use in analysis
- testcase: some cases that should be handled in analysis
- tmp: result data: resfile(s), src.saved, log, etc.
The framework runs only on 64-bit GNU/Linux systems, and it needs the personality system call to disable ASLR for the current process.
The colored prompt may not display correctly in every environment.
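The ASLR requirement above can be sketched as follows. This is a minimal illustration, not the code used by si_core: the function names are hypothetical, and it assumes the usual Linux approach of setting ADDR_NO_RANDOMIZE with personality(2) and then re-executing the program so that the new address-space layout takes effect.

```c
#include <unistd.h>
#include <sys/personality.h>

/* Pure helper (hypothetical name): the persona value with ASLR turned off. */
unsigned long persona_without_aslr(unsigned long persona)
{
	return persona | ADDR_NO_RANDOMIZE;
}

/* Returns 1 if ASLR is already disabled for this process, 0 if not, -1 on error. */
int aslr_is_disabled(void)
{
	int cur = personality(0xffffffff);	/* query only, changes nothing */

	if (cur == -1)
		return -1;
	return (cur & ADDR_NO_RANDOMIZE) ? 1 : 0;
}

/*
 * Disable ASLR and restart the program. Returns 0 if ASLR was already
 * disabled (nothing to do), -1 on failure. On success it does not return,
 * because execv() replaces the process image.
 */
int disable_aslr_and_restart(char **argv)
{
	int cur = personality(0xffffffff);

	if (cur == -1)
		return -1;
	if (cur & ADDR_NO_RANDOMIZE)
		return 0;
	if (personality(persona_without_aslr((unsigned long)cur)) == -1)
		return -1;
	execv("/proc/self/exe", argv);		/* only returns on failure */
	return -1;
}
```

A caller would typically invoke disable_aslr_and_restart(argv) at the very start of main(). Some sandboxes filter the personality system call, so the error path should be handled.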
Libraries:
- clib
- ncurses
- readline
- libcapstone
Header files:
- gcc plugin library
For now, the framework supports only C projects compiled with GCC.
This framework has several levels:
- level 0: SRCINV(the main).
- level 1: collect, analysis, hacking.
Commands for level 0:
SRCINV> help
================== USAGE INFO ==================
help
Show this help message
quit
Exit this main process
exit
Exit this main process
do_sh
Execute bash command
showlog
Show current log messages
reload_config
Reload config
load_srcfile
(srcid)
flush_srcfile
Write current src info to srcfile
collect
Enter collect subshell
analysis
Enter analysis subshell
hacking
Enter hacking subshell
================== USAGE END ==================
Commands for level 1 - collect:
<1> collect> help
================== USAGE INFO ==================
help
Show this help message
exit
Return to the previous shell
quit
Return to the previous shell
show
(type) [path]
type: (SRC/BIN)_(KERN/USER)_(LINUX/...)_(...) check si_core.h for more
Pick appropriate module and show comment
================== USAGE END ==================
Commands for level 1 - analysis:
<1> analysis> help
================== USAGE INFO ==================
help
Show this help message
exit
Return to the previous shell
quit
Return to the previous shell
parse
(resfile) (kernel) (builtin) (step) (auto_Y)
Get information of resfile, steps are:
0 Get all information
1 Get information adjusted
2 Get base information
3 Get detail information
4 Prepare for step5
5 Get indirect call information
6 Check if all GIMPLE_CALL are set
getoffs
(resfile) (filecnt)
Count filecnt files and calculate the offset
cmdline
(resfile) (filepath)
Show the command line used to compile the file
================== USAGE END ==================
Commands for level 1 - hacking:
================== USAGE INFO ==================
help
Show this help message
exit
Return to the previous shell
quit
Return to the previous shell
itersn
[output_path]
Traversal all sinodes to stderr/file
havefun
Run hacking modules
load_sibuf
(sibuf_addr)
Loading specific sibuf
list_function
[dir/file] (format)
List functions in dir or file
format:
0(normal output)
1(markdown output)
================== USAGE END ==================
The framework handles a project in a few steps: collect the information, parse the collected data, and use the parsed data.
NOTE: before collecting information about a project, make sure the resfile does not exist yet.
When compiling a source file, GCC handles several data formats: AST, GIMPLE, etc. This framework collects the low-level GIMPLE right at ALL_IPA_PASSES_END; researchers can use this information to analyze the project.
Generally, a GCC plugin can only see information about the file currently being compiled. When we put the information from all compiled files together, we can see the big picture of the project, which makes the project more convenient to analyze.
For each project, the framework uses a src structure to track all information, including:
- sibuf: a list of compiled files
- resfile: a list of all the resfiles for the target project.
For the Linux kernel, there can be one vmlinux resfile, which is builtin, and a bunch of module resfiles, which are non-builtin.
- sinodes: nodes for non-local variables, functions, and data types.
These nodes can be searched by name or by location.
Static file-scope variables, static file-scope functions, and data types that have a location are searched and inserted by location.
Global variables, global functions, and data types that have a name but no location are searched and inserted by name.
Data types that have neither a location nor a name are put into the sibuf.
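The lookup rules above can be summarized in a small decision function. This is only a sketch of the rules as stated in this section; the type and function names are hypothetical and are not taken from si_core.h.

```c
/* Hypothetical stand-ins for the kinds of nodes described above. */
enum node_kind { NODE_VAR, NODE_FUNC, NODE_TYPE };

/* Where a node is looked up and inserted. */
enum search_key { KEY_LOCATION, KEY_NAME, KEY_SIBUF_ONLY };

struct node_desc {
	enum node_kind kind;
	int is_file_static;	/* static file-scope variable or function */
	int has_location;	/* only consulted for data types */
	int has_name;		/* only consulted for data types */
};

enum search_key pick_search_key(const struct node_desc *d)
{
	if (d->kind == NODE_TYPE) {
		if (d->has_location)
			return KEY_LOCATION;	/* type with a location */
		if (d->has_name)
			return KEY_NAME;	/* named type without a location */
		return KEY_SIBUF_ONLY;		/* anonymous, location-less type */
	}
	/* Variables and functions: static ones by location, globals by name. */
	return d->is_file_static ? KEY_LOCATION : KEY_NAME;
}
```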
On the GCC plugin side, the framework gets the low-level GIMPLE.
source file ---> AST ---> High-level GIMPLE ---> Low-level GIMPLE
We collect the GIMPLEs at ALL_IPA_PASSES_END.
all_lowering_passes
|---> useless [GIMPLE_PASS]
|---> mudflap1 [GIMPLE_PASS]
|---> omplower [GIMPLE_PASS]
|---> lower [GIMPLE_PASS]
|---> ehopt [GIMPLE_PASS]
|---> eh [GIMPLE_PASS]
|---> cfg [GIMPLE_PASS]
...
all_ipa_passes
|---> visibility [SIMPLE_IPA_PASS]
...
Each source file has only one chance to call the plugin callback function at ALL_IPA_PASSES_END. This means that the data we have already collected for one function is not modified while we are collecting the data of another function.
For more about GCC plugins, check refs[1] or refs[2].
For C projects, collect/c.cc is used when compiling the target project. It is a GCC plugin and is not used in the si_core process. When the plugin is loaded, it detects the name and full path of the file currently being compiled and registers a callback function for the PLUGIN_ALL_IPA_PASSES_START event. The resfile is PAGE_SIZE aligned.
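Page alignment matters because mmap requires its file offset to be a multiple of the page size, so aligned chunks can be mapped one by one. A minimal sketch of the rounding involved, with hypothetical helper names and assuming the page size is a power of two:

```c
#include <stddef.h>
#include <unistd.h>

/* Round len up to the next multiple of page_size (must be a power of two). */
size_t page_align_up(size_t len, size_t page_size)
{
	return (len + page_size - 1) & ~(page_size - 1);
}

/* Page size of the running system, falling back to 4096 if unknown. */
size_t system_page_size(void)
{
	long ps = sysconf(_SC_PAGESIZE);

	return ps > 0 ? (size_t)ps : 4096;
}
```

For example, with 4096-byte pages a 4097-byte record would occupy 8192 bytes in the file.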
We cannot register PLUGIN_PRE_GENERICIZE to handle each tree_function_decl, because the function may still be incomplete at that point.
static void test_func0(void);
static void test_func1(void)
{
	test_func0();
}
static void test_func0(void)
{
	/* test_func0 body */
}
In this case, when the PLUGIN_PRE_GENERICIZE event happens for test_func1, the tree_function_decl for test_func0 is not initialized yet.
When the registered callback function gets called, the function body has already been moved from tree_function_decl->saved_tree to tree_function_decl->f->cfg:
tree_function_decl->saved_tree --->
tree_function_decl->f->gimple_body --->
tree_function_decl->f->cfg
When collecting the data of a variable, we focus on non-local variables. GCC provides a function to test whether a variable is global.
static inline bool is_global_var (const_tree t)
{
	return (TREE_STATIC(t) || DECL_EXTERNAL(t));
}
For a VAR_DECL, TREE_STATIC checks whether the variable has static storage, and DECL_EXTERNAL checks whether it is defined elsewhere. We add an extra check:
if (is_global_var(node) &&
    ((!DECL_CONTEXT(node)) ||
     (TREE_CODE(DECL_CONTEXT(node)) == TRANSLATION_UNIT_DECL))) {
	objs[start].is_global_var = 1;
}
If DECL_CONTEXT is NULL or its TREE_CODE is TRANSLATION_UNIT_DECL, we treat the variable as a non-local variable.
The test was run on Ubuntu 18.04 with linux-5.3.y and GCC 8.3.0. The collected data is about 30G in size.
Check Chapter 3
Please add the new si_type in include/si_core.h, and do not forget to modify config/module.json and si_module.c.
For the Linux kernel, EXTRA_AFLAGS is needed when running make.
Note that we cannot set a file_content.si_type.* to a BOTH/ANY value; the BOTH/ANY values are for modules in collect/analysis.
Parsing the data is the main feature of the framework, and also its most complicated part.
It is implemented in analysis/analysis.c and analysis/gcc/c.cc (for C projects compiled with GCC).
For a better experience, the framework uses ncurses to show the parse status (check the main Makefile).
To parse the data, we need to load the resfile into memory. However, the resfile for linux-5.3.y is about 30G, so we cannot load it all at once. The solution is:
- Disable ASLR and restart the process.
- Use mmap to load the resfile at RESFILE_BUF_START. If the part of the resfile currently loaded in memory is larger than RESFILE_BUF_SIZE, unmap the last area and do the next mmap right at the end of that area.
- For each compiled file, use a sibuf to track the address at which to load the data, its size, and its offset in the resfile.
- Use resfile__resfile_load() to load the information of a compiled file.
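The point of loading at a known address is that pointers stored in the data stay valid across runs. The sketch below shows one way to request a mapping at an exact address on Linux. It is an illustration under assumptions, not the framework's loader: map_at and next_area are hypothetical names, anonymous memory stands in for the resfile, and MAP_FIXED_NOREPLACE is used only where the headers provide it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory exactly at `want`.
 * Returns `want` on success, NULL if that address could not be used.
 */
void *map_at(void *want, size_t len)
{
	int flags = MAP_PRIVATE | MAP_ANONYMOUS;
	void *got;

#ifdef MAP_FIXED_NOREPLACE
	flags |= MAP_FIXED_NOREPLACE;	/* fail instead of clobbering a mapping */
#endif
	got = mmap(want, len, PROT_READ | PROT_WRITE, flags, -1, 0);
	if (got == MAP_FAILED)
		return NULL;
	if (got != want) {		/* kernel treated the address as a hint */
		munmap(got, len);
		return NULL;
	}
	return got;
}

/* The next area starts right at the end of the previous one. */
void *next_area(void *last, size_t last_len)
{
	return (char *)last + last_len;
}
```

With ASLR disabled, the address range used by the resfile area is less likely to collide with libraries or the stack, which is one reason the two steps go together.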
memory layout:
0x0000000000000000 0x0000000000400000 NULL pages
0x0000000000400000 0x0000000000403000 si_core .text
0x0000000000602000 0x0000000000603000 si_core .rodata ...
0x0000000000603000 0x0000000000605000 si_core .data ...
0x0000000000605000 0x0000000000647000 heap
SRC_BUF_START RESFILE_BUF_START srcfile
RESFILE_BUF_START 0x???????????????? resfiles
0x0000700000000000 0x00007fffffffffff threads libs plugins stack...
If SRC_BUF_START is 0x100000000 and RESFILE_BUF_START is 0x1000000000, the src memory area can be up to 60G (0x1000000000 - 0x100000000), and the resfile area can be 1024G or larger.
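The sizes follow directly from the addresses in the layout. A worked check of the arithmetic, using the example values from the text (HIGH_AREA_START is a hypothetical name for the 0x0000700000000000 boundary in the layout):

```c
#include <stdint.h>

#define SRC_BUF_START		0x100000000ULL		/* 4 GiB, example value */
#define RESFILE_BUF_START	0x1000000000ULL		/* 64 GiB, example value */
#define HIGH_AREA_START		0x700000000000ULL	/* threads, libs, stack... */

/* Convert a byte count to whole GiB. */
uint64_t to_gib(uint64_t bytes)
{
	return bytes >> 30;
}

/* Room available for the srcfile area: 64 GiB - 4 GiB = 60 GiB. */
uint64_t src_area_gib(void)
{
	return to_gib(RESFILE_BUF_START - SRC_BUF_START);
}

/* Upper bound for the resfile area before it reaches the high area. */
uint64_t resfile_area_gib(void)
{
	return to_gib(HIGH_AREA_START - RESFILE_BUF_START);
}
```

With these values the resfile area has an upper bound of 114624 GiB, far more than the 1024G mentioned above.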
The size of src.saved after PHASE6 is 4.7G. Running all six phases takes about 6 hours.
The collected data contains a lot of pointers. We must adjust these pointers, and the locations as well, before we read the data.
Note that location_t in GCC is 4 bytes, so we store an offset value in it instead.
Use *(expanded_location *)(sibuf->payload + loc_value) to get the location.
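The two ideas above, rebasing stored pointers and resolving a 4-byte offset against the payload, can be sketched like this. The structures are simplified, hypothetical stand-ins (demo_loc is not GCC's expanded_location, and demo_sibuf is not the real sibuf), and the rebasing formula is a common technique assumed here rather than taken from the source.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for a location record stored in the payload. */
struct demo_loc {
	int line;
	int column;
};

/* Simplified stand-in for a sibuf: a payload buffer and its size. */
struct demo_sibuf {
	unsigned char *payload;
	size_t size;
};

/* Rebase a pointer saved relative to old_base so it is valid at new_base. */
uintptr_t demo_adjust_ptr(uintptr_t old_ptr, uintptr_t old_base,
			  uintptr_t new_base)
{
	return old_ptr - old_base + new_base;
}

/*
 * Resolve a 4-byte offset into a location record.
 * Returns 0 on success, -1 if the offset falls outside the payload.
 */
int demo_get_loc(const struct demo_sibuf *b, uint32_t loc_value,
		 struct demo_loc *out)
{
	if ((size_t)loc_value + sizeof(*out) > b->size)
		return -1;
	/* memcpy avoids unaligned access on arbitrary offsets */
	memcpy(out, b->payload + loc_value, sizeof(*out));
	return 0;
}
```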
Base info includes:
- TYPE_FUNC_GLOBAL
- TYPE_FUNC_STATIC
- TYPE_VAR_GLOBAL
- TYPE_VAR_STATIC
- TYPE_TYPE
- TYPE_FILE
Check whether the location or name already exists in sinodes. If not, allocate a new sinode. If it exists and the search was by name, check whether the names conflict (weak symbols?).
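A find-or-insert by name could look like the sketch below. Everything here is hypothetical: the table is a fixed array rather than the real sinode storage, and the weak-symbol policy (a strong definition replaces a weak one, two strong definitions conflict) is an assumption that matches common linker behaviour, since the text above leaves that question open.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define DEMO_MAX_NODES 16

struct demo_sinode {
	char name[32];
	int is_weak;
};

struct demo_table {
	struct demo_sinode nodes[DEMO_MAX_NODES];
	size_t count;
};

enum demo_result {
	DEMO_INSERTED,		/* name was new, node allocated */
	DEMO_EXISTS,		/* name known, nothing to change */
	DEMO_REPLACED_WEAK,	/* strong definition replaced a weak one */
	DEMO_CONFLICT,		/* two strong definitions share a name */
	DEMO_FULL
};

struct demo_sinode *demo_find(struct demo_table *t, const char *name)
{
	for (size_t i = 0; i < t->count; i++)
		if (!strncmp(t->nodes[i].name, name, sizeof(t->nodes[i].name) - 1))
			return &t->nodes[i];
	return NULL;
}

enum demo_result demo_insert_by_name(struct demo_table *t, const char *name,
				     int is_weak)
{
	struct demo_sinode *n = demo_find(t, name);

	if (!n) {
		if (t->count == DEMO_MAX_NODES)
			return DEMO_FULL;
		n = &t->nodes[t->count++];
		snprintf(n->name, sizeof(n->name), "%s", name);
		n->is_weak = is_weak;
		return DEMO_INSERTED;
	}
	if (n->is_weak && !is_weak) {
		n->is_weak = 0;		/* assumed policy: strong wins */
		return DEMO_REPLACED_WEAK;
	}
	if (!n->is_weak && !is_weak)
		return DEMO_CONFLICT;
	return DEMO_EXISTS;
}
```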
For TYPE_TYPE, we need to know the type it points to, or its size, or its fields.
For TYPE_VAR_*, we just get its type.
For functions, we get the type of the return value, the argument list, and the function body.
PHASE4: set possible_list for each non-local variable, and mark (use_at_list) the variables and the functions that are used other than in a direct call.
PHASE5: handle the marked functions, then do our best to trace calls.
A call is direct when the second operand of the GIMPLE_CALL statement is the address of a function. If the second operand of the GIMPLE_CALL is a VAR_DECL or a PARM_DECL, it is an indirect call.
NOTE: we are dealing with tree_ssa_name now.
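The direct/indirect rule can be written as a small classifier. The enum values below are hypothetical stand-ins for GCC tree codes, not the real ones, and treating an SSA name as an indirect callee is an assumption based on the note above.

```c
/* Hypothetical stand-ins for the tree code of the callee operand. */
enum demo_op_code {
	DEMO_ADDR_OF_FUNCTION,	/* address of a function */
	DEMO_VAR_DECL,
	DEMO_PARM_DECL,
	DEMO_SSA_NAME,		/* assumption: callee held in an SSA name */
	DEMO_OTHER
};

enum demo_call_kind {
	DEMO_CALL_DIRECT,
	DEMO_CALL_INDIRECT,
	DEMO_CALL_UNKNOWN
};

/* Classify a call by the kind of its callee operand (the second operand). */
enum demo_call_kind demo_classify_call(enum demo_op_code callee_op)
{
	switch (callee_op) {
	case DEMO_ADDR_OF_FUNCTION:
		return DEMO_CALL_DIRECT;
	case DEMO_VAR_DECL:
	case DEMO_PARM_DECL:
	case DEMO_SSA_NAME:
		return DEMO_CALL_INDIRECT;
	default:
		return DEMO_CALL_UNKNOWN;
	}
}
```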
Nothing to do here, do phase5 again.
This section discusses some steps that help locate bugs efficiently.
Run the STEPs one by one, and back up the resfile/src.saved files to resfile.x/src.saved.x (x is the step number), so that we can restore them if parsing crashes.
I suggest backing up resfile/src.saved after STEP1 and STEP3.
If HAVE_CLIB_DBG_FUNC is enabled, the stack trace message shows which thread has crashed. With this, we can find the source file the thread was parsing.
Then, edit the Makefile to use O0, and recompile.
In PARSE mode, we provide a command, one_sibuf, which is used to parse only one source file. Thus, we can run one_sibuf /path/to/the/file STEP 1 to reproduce the crash.
We need to write plugins to use the parsed data. Since we collect the low-level GIMPLE, that is what the hacking plugins handle.
For example (obsolete):
static void test_func(int flag)
{
	int need_free;
	char *buf;
	if (flag) {
		buf = (char *)malloc(0x10);
		need_free = 1;
	}
	/* do something here */
	if (need_free)
		free(buf);
}
plugins/uninit.cc shows how to detect this kind of bug.
The plugin checks all functions one by one:
- generate all possible code_paths of the function
- traverse all local (non-static) variables
- traverse the code_paths and find the first position where the variable is used
- check if the first used statement is to read this variable. The demo
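The last two steps of the list can be sketched on a simplified model, where each code path is a sequence of reads and writes of numbered variables. This is an illustration of the idea only. It does not reflect the data structures in plugins/uninit.cc, and all names are hypothetical.

```c
#include <stddef.h>

enum demo_access { DEMO_READ, DEMO_WRITE };

/* One use of a variable on a code path. */
struct demo_use {
	int var_id;
	enum demo_access access;
};

/* One code path: the uses in execution order. */
struct demo_path {
	const struct demo_use *uses;
	size_t len;
};

/* Returns 1 if the first use of var_id on this path is a read. */
int demo_first_use_is_read(const struct demo_use *uses, size_t len, int var_id)
{
	for (size_t i = 0; i < len; i++) {
		if (uses[i].var_id != var_id)
			continue;
		return uses[i].access == DEMO_READ;
	}
	return 0;	/* never used on this path */
}

/* Returns 1 if any path reads var_id before writing it. */
int demo_maybe_uninit(const struct demo_path *paths, size_t npaths, int var_id)
{
	for (size_t i = 0; i < npaths; i++)
		if (demo_first_use_is_read(paths[i].uses, paths[i].len, var_id))
			return 1;
	return 0;
}
```

For test_func above, the path that skips the if (flag) block reads need_free first, so need_free would be reported. On the path that enters the block, both variables are written before they are read.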
show_detail: show details of variables, functions, and types.
For data types, we sometimes want to know where a field of a structure is used. show_detail xxx.yyy.zzz type [all|src|used|offset|size] outputs that information.
We can do more. If a structure contains fields without a name, we use * for them.
struct xxx {
	struct list_head sibling;
	union {
		struct hlist_node list0;
		struct hlist_node list1;
	};
	/* ... */
};
show_detail xxx.*.list0 type used will show the used-at results.
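Matching a dotted path such as xxx.*.list0 against a chain of members can be sketched as below. This is a hypothetical helper, not the show_detail implementation. It assumes that * matches any single member, named or unnamed, and that an unnamed member is represented by a NULL name.

```c
#include <stddef.h>
#include <string.h>

/*
 * pattern: dotted path such as "xxx.*.list0"
 * names:   actual component names, NULL for an unnamed member
 * Returns 1 if every component matches and both have the same length.
 */
int demo_path_match(const char *pattern, const char *const *names, size_t n)
{
	const char *p = pattern;

	for (size_t i = 0; i < n; i++) {
		const char *dot = strchr(p, '.');
		size_t len = dot ? (size_t)(dot - p) : strlen(p);

		if (len == 0)
			return 0;	/* empty component, e.g. "a..b" */
		if (!(len == 1 && p[0] == '*')) {
			/* a literal component needs an identical name */
			if (!names[i] || strlen(names[i]) != len ||
			    strncmp(names[i], p, len) != 0)
				return 0;
		}
		if (!dot)
			return i == n - 1;	/* pattern ended: lengths must agree */
		p = dot + 1;
	}
	return 0;	/* pattern has more components than the member chain */
}
```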
Some used_at locations still cannot be found.
For example, with hlist_for_each_entry_safe(x,n,head,xxlist), no use of xxlist is found, because xxlist is not used directly; xxlist.next is used instead.
In analysis/gcc/c.cc, in __4_mark_gimple_op() -> __4_mark_component_ref(op), the first operand of op is a COMPONENT_REF while the second one is a FIELD_DECL (next in hlist_node). The first COMPONENT_REF is xxlist, which is an hlist_node, so we do not record that use here.
Test result for clib
<1> hacking> show_detail clib_mm.* type
src: /home/zerons/workspace/clib/include/clib_mm.h 43 19
Used at(sibling):
clib_mm_find 26 2 0x10075b7bbc 1
clib_mm_setup 187 2 0x10075905cc 1
Offset:
0
Size:
128
Used at(desc):
clib_mm_find 27 8 0x10075b7154 1
clib_mm_setup 178 10 0x1007594088 1
Offset:
128
Size:
64
Used at(fd):
clib_mm_init 51 8 0x10075b076f 0
clib_mm_dump 63 7 0x10075ae833 1
clib_mm_expand 120 9 0x10075a87a0 1
Offset:
192
Size:
32
Used at(refcount):
clib_mm_find 28 15 0x10075b6604 1
clib_mm_put 142 26 0x100759f5de 1
Offset:
256
Size:
64
Used at(mm_start):
clib_mm_init 52 14 0x10075b0837 0
clib_mm_expand 103 33 0x10075abb2c 1
clib_mm_expand 114 51 0x10075a7ce0 1
clib_mm_expand 120 54 0x10075a8888 1
clib_mm_cleanup 214 50 0x100757fad2 1
Offset:
320
Size:
64
Used at(mm_head):
clib_mm_init 53 13 0x10075b08ff 0
clib_mm_dump 65 58 0x10075ae043 1
Offset:
384
Size:
64
Used at(mm_cur):
clib_mm_init 54 12 0x10075b09c7 0
clib_mm_dump 65 46 0x10075adfd3 1
clib_mm_expand 76 21 0x10075a5042 1
clib_mm_get 246 16 0x1007572c43 1
clib_mm_get 247 12 0x1007572de3 0
Offset:
448
Size:
64
Used at(mm_tail):
clib_mm_init 55 13 0x10075b0b07 0
clib_mm_expand 76 8 0x10075a53ca 1
clib_mm_expand 94 14 0x10075a4792 1
clib_mm_expand 94 14 0x10075a648a 0
clib_mm_expand 114 38 0x10075a7d50 1
clib_mm_cleanup 214 37 0x1007580582 1
Offset:
512
Size:
64
Used at(mm_end):
clib_mm_init 56 12 0x10075b0c47 0
clib_mm_expand 82 8 0x10075a4ce2 1
clib_mm_expand 120 44 0x10075a88f8 1
Offset:
576
Size:
64
Used at(expandable):
clib_mm_init 57 16 0x10075b0d87 0
clib_mm_expand 97 7 0x10075ac054 1
clib_mm_expand 121 7 0x10075a89d8 1
Offset:
640
Size:
8
Check TODO.md
[1] GNU Compiler Collection Internals
[2] GCC source code
[3] 深入分析GCC
[4] gcc plugins for linux kernel
Thanks to the author of <<深入分析GCC>> for helping me understand the internals of GCC.
Thanks to PYQ and other workmates, with your understanding and support, I can focus on this framework.
Thanks to CG for his support during the development of the framework.
Certainly... Thanks to all the people who managed to read the whole text.
There is still a lot of work to be done. Ideas and contributions are always welcome. Feel free to send pull requests.