Skip to content

silpol/mrsync

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

   Copyright (C)  2008 Renaissance Technologies Corp.
                  main developer: HP Wei <hp@rentec.com>
   Copyright (C)  2006 Renaissance Technologies Corp.
                  main developer: HP Wei <hp@rentec.com>
   Copyright (C)  2005 Renaissance Technologies Corp. 
   Copyright (C)  2001 Renaissance Technologies Corp.
                  main developer: HP Wei <hp@rentec.com>

   This program is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2, or (at your option)
   any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program; see the file COPYING.
   If not, write to the Free Software Foundation,
   59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.  

20060416 version 3.0
   This is a major update from previous version 2.1. New features include:
      -- large file support
         multicaster can now transfer file whose size is larger than 2**31-1.
      -- platform independence (between linux, unix)
         We have tested this independence from a Sun to a Linux machine.
      -- catching slow machine as the feedback monitor dynamically
         This significantly reduces the number of occurence of SIT-OUT-RECEIVING-FILE cases.
      -- removing meta-file-info
         Since our basic algorithm is different from that of Aaron's, the meta-file-info
         is redundant for in our case.
      -- backup feature (as in rsync)
         If the target machines are NFS data servers, backup feature is absolutely
         necessary. The feature is implemented in the low level multica* codes,
         I will incorporate it into mrsync.py in the next month or so.      
      -- mcast IP address and port options
         They are now tested and are working.
         This allows multiple multicasting sessions running simultaneously.
      -- code cleanup in both mrsync.py and *.c
      -- code fixing for 64bit arch;
         tested on Debian 64 bit arch by Nicolas Marot in France
      -- logic flaw which under certain condition
         caused premature dropout due to 
         unsuccessful EOF, CLOSE_FILE 
         and caused unwanranted SIT-OUT case
      -- code provision for IPv6 (but not tested yet)
version 3.1.1
      -- minor bug fixes.
      -- add checking in multicatcher to check if the 
         system is ready for write.
      -- codes for IPv6 are ready (but not tested)
         IPv4 is tested ok.
version 3.2.0
      -- monitor change improvement
      -- handshake improvement (e.g. seq #)
      -- if one machine skips a file, all will NOT close()
version 4.0.0
      -- previously, missing page report is sent back one page
         at a time. I change it to ~60K pages at a time. This
         drastically reduces the network traffic.
         Performance (syncing time) is in general improved.
      -- clean up scheme in choosing a monitor machine.
      -- misc code cleaning up and minor adjustments.

version 4.0.1
      -- Dave Close at ThalesGroup in CA points out
         the type of c in "c = getopt()" should have been "int c"
         [ It had been "char c". ]

The mrsync package includes the following files:
--------------------
README           -- this file (usage example in the end.)
README.more      -- another readme.
GPL.License      -- legal stuff
mrsync.py        -- the driver script that glues everything together
mrsync_config.py -- system dependent configuration files
cmdToTarget.py   -- module to send command to remote machines
Makefile.Sun
Makefile.linux
backup.c
complaint_sender.c
complaints.c
file_operations.c
global.c
id_map.c
multicaster.c
multicatcher.c
page_reader.c
parse_synclist.c
rtt.c
rttcatcher.c
rttcomplaint_sender.c
rttcomplaints.c
rttmissings.c
rttpage_reader.c
rttsends.c
sends.c
set_catcher_mcast.c
set_mcast.c
setup_socket.c
signal.c
timing.c
trFilelist.c
main.h       
rttproto.h
proto.h      
signal.h
rttmain.h

------------
To install, 
------------
(1) copy these files to a directory on the target and
    master machines.
    (All target machines need at least a copy of the
     multicatcher.)
    Or the best (efficient) solution is to install the package
    on a common file server to which all machines can access.
(2) Depending on which platform you are in,
    mv Makefile.Sun (or MakeFile.linux) into Makefile.
(3) edit the system dependent section in Makefile.
    Especially, the bindir variable.
    It is used in the 'install' to copy the executable
    binaries into the destination directory.
(4) make
    to compile and generate the executables. (the warnings for format are harmless). 
(5) make install
    to copy the executables into $bindir.
(6) edit mrsync_config.py
       change the paths for your system.
       mrsync_config.py is read by mrsync.py
(7) check if your system has rsync installed 
    in the directory specified in rsyncPath and remote_rsyncPath
    in the file mrsync_config.py
(8) check if your system has python installed.
    Python, like perl, is included in most linux distribution.
    As distributed, mrsync.py invokes python through the 1st line
    !/usr/bin/env python
    If this does not work for you system, you replace
    the 1st line of mrsync.py with correct one.

------------------------------
Usage of mrsync.py
------------------------------

(1) the function of ping -- 
    This is removed from multicaster.
    Instead, please use rtt which gives more useful info.
    See 'Usage of rtt' below.
     
(2) mrsync.py -m /dirX/target.list 
              -s /dirY/data 
              -t /dirZ/data 

    This is a routine job carried out at Renaissance Technologies Corp.
    The file target.list lists the names of all target machines (one name per line).

    Before carrying out multicasting, mrsync.py invokes rsync, 
    to find those files in the source directory (-s),
    compared with the target directory (-t) in the first target machine, 
    that need to be transfered to all target machines.
    If -t is not specified, then mrysnc assumes that the dest_path
    on target machines is same as that in -s.

    mrsync.py uses rsync with the following minimum options:
    --rsh=rsh -avW --dry-run --delete
    This should work in most of routine cases.
    This minumum option is specified in mrsync_config.py.
    I did not test if other combination of options will work
    so I would advise that you do not change the minimum option.

    However, you can add additional options on top of that minimum
    options. Use '-o' in mrsync.py
    e.g.
    mrsync2.py -m list -s /path/sync/test -t /dest/sync/test -v 1 -o '-H'
    This turns on --hard-link option in rsync.  But hardlink is very tricky
    to deal with.  The minumum options supplied in mrsync_config.py is the one
    that we find works pretty good in terms of dealing with hardlinks.
    I do find that some rare sequence of syncing sessions requires that I repeat 
    another round of mrsync.py with additional
    -o '-H' to set the hardlinks on the targets.

    The output of rsync is stored in a tmp file, which in turn gets
    parsed by trFilelist.  trFilelist translates the list of files into
    a specific format for multicaster.
    The reason for this two step process is to remove direct dependence of
    multicaster on rsync.
    The gain is that we can now do the case as shown in (3).
    
(3) mrsync.py -m machine1,machine2
              -s /dirB/data/ 
              -l *.c,subs/foo*,sub2/path/[0-9][0-9].ext
    
    Here have two new things as compared with the case in (2).
    First, the two target machines are listed directly 
    in -m option in the form of csv list 
    (instead of listing them in a machine_list file).
    Second, the -l option is specified.
    -l option lists the files that we want to be synced.
    i.e.  with this option, we bypass the rsync.
    In this example, we want to sync
       all /dirB/data/*.c files,
       all /dirB/data/subs/foo* files,
       and all /dirB/data/sub2/path/[0-9][0-9].ext
    The pattern specified should be as those recognizable by 'ls'.

    Obviously, this usage is convenient for syncing small number of
    specific files to a couple of machines.

(4) mrsync.py -m /dirA/target_list
              -s /dirB/data
              -A 239.255.70.88
              -P 8010

    Same functionality as in (2).
    But here we specify the multicast IP address and port number
    for use by both multicatcher and multicaster.

    This allows multiple mrsync sessions to run simultaneously
    as long as each session uses its own mcast_address and port_number.
    [ No overlap between sessions. ]

(5) mrsync.py -m /dirA/target_list
              -s /path
              -X
    Same functionality as in (2).
    But (2) is a normal sync. This one invokes the -X.
    In a normal sync, a given file (let's say /path/foo) is synced
    with two steps,
       (a) sync the file to /path/foo.tmp
       (b) mv /path/foo.tmp /path/foo
    With -X option, the same file is synced with three steps,
       (a) rm /path/foo
       (b) sync the file to /path/foo.tmp
       (c) mv /path/foo.tmp /path/foo
    This is useful when the target disk is almost full, and
    could not afford to have foo.tmp and foo co-existing.

(6) By default, mrsync.py prints out only very essential messages.
    If you want to see more messages (related to some detailed status including
    a rtt_histogram at the end of the session),
    you can spcify '-v 1' in the mrsync.py command line argument list.
    '-v 2' will print out even more detailed messages --- for debugging purpose!

(7) other options of mrsync.py
    type 'mrsync.py'  and see them for yourself.

--------------------------------------------------------------------------------
Usage of rtt
----------------
rtt is a utility for testing between master and one machine.
Its main function is to report the (histogram) distribution of the 
round-trip-time interval for pages sent.
I use it to ping a machine
and to measure the typical rtt time and to see if the network is 
connected properly.  System administrators can use the rtt time info
to turn up the network.

Example usage:

(1) on master, I invoke rtt and it gives the report as follows.

master:~ 1>rtt -r ssh -m target -n 1000 -s 64000

using ssh to invoke rttcatcher on target
Sun Apr 16 17:36:58 EDT 2006
ssh target '/path/rttcatcher -a 239.255.67.200 -p 7900 1>/dev/null 2>/dev/null & echo $!'
remote pid = 27283
Sending data...
Missing pages =          0 ( 0.00%) out of total pages = 1000
rtt histogram
msec counts
---- --------
   1 989
   2 3
   3 7
   4 1

 
The option -m specifies the name of the target machine.
-n specifies the number of pages that we want to send.
-s specifies the size of each page.  (maximum = 64512)

In this example, the total number of pages sent is 1000 
which is the sum of the counts in the rtt historgram table.
The histogram shows that the majority of rtt is around 1-2 msec.
If the network is really busy, the distribution of the histogram 
will be broadened (instead of concentrated around one particular value).

(2) more options for rtt.
    type 'rtt'  and see them on the screen.

--------------------------------------------------------------------------------

HP Wei
hp@rentec.com
Renaissance Technologies Corp.