Getting Started with Millstone
This guide walks you through cloning the latest stable Amazon Machine Image (AMI) configured with Millstone. Most new users will want to use this guide. Docs for individuals wishing to configure their instance or modify source code are coming soon.
You need to create an Amazon Web Services (AWS) account. Brad Chapman's getting started guide for CloudBioLinux has a solid first chapter with instructions on getting everything set up: https://github.com/chapmanb/cloudbiolinux/blob/master/doc/intro/gettingStarted_CloudBioLinux.pdf?raw=true
-
In the EC2 console, navigate to Instances, then press the Launch Instance button. This will launch a wizard that walks you through setting up a new instance. The following steps provide instructions for configuring this instance.
-
In the new instance wizard, search for 'Millstone' and choose the latest Millstone AMI. As of this writing, the most current is millstone_combined_2015_02_03. Make sure you're in the N. Virginia region (upper right dropdown).
-
On the 'Choose instance type' tab, select an instance according to your needs. We recommend m3.medium (select General Purpose on the left).
-
In 'Configure instance', the only setting we recommend changing is explicitly setting the Availability Zone (we always use us-east-1a). You can only move EBS volumes (Amazon hard drives) between instances in the same zone, so it's easier to consistently create everything in the same zone.
-
In 'Add storage', increase the size of the root drive to the amount of space that you'll need. For bacterial genomes, about 2 GB per sample should be more than enough (e.g. 100 samples = 200 GB).
-
In 'Tag instance', fill in an informative value for the 'Name' key. I like the name to include the date it was created and a description of what the instance is running (e.g. 2014_04_01_mutate_all_the_things).
-
For the security group, configure a group appropriate to your needs. Most users will want to create a security group with all of the following open (NOTE: This will make your instance publicly visible, but login is still required):
- All ICMP
- All TCP
- All UDP
- SSH
-
Continue to the final tab, where you'll press 'Launch the instance'. Select or create a key. If you create the key, download and save the private key. (NOTE: If you lose the private key, there's no way to ssh back into your instance. You'll have to terminate it and create a new one.)
It takes about 5-10 minutes for the instance to launch and all bootstrapping to finish, after which your Millstone is ready to grind!
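The storage and naming steps in the wizard come down to simple arithmetic and a date stamp, so they can be sketched in a few lines of shell. The function names below are hypothetical helpers, not part of Millstone or AWS; the 2 GB per sample figure is the guide's rule of thumb for bacterial genomes, and the estimate covers sample data only.

```shell
# Hypothetical planning helpers for the launch wizard; not part of Millstone.

# Root volume size in GB for a given number of samples, using the guide's
# rule of thumb of about 2 GB per bacterial sample (override with a 2nd arg).
root_volume_gb() {
    local samples=$1
    local gb_per_sample=${2:-2}
    echo $(( samples * gb_per_sample ))
}

# Instance name in the style <YYYY_MM_DD>_<description>, dated today.
instance_name() {
    local description=$1
    echo "$(date +%Y_%m_%d)_${description}"
}

root_volume_gb 100                     # prints 200
instance_name mutate_all_the_things    # today's date, then the description
```

If your genomes are much larger than bacterial ones, pass a different per-sample figure as the second argument rather than relying on the default.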
In a web browser, visit ec2-xx-xx-xx-xx.compute-1.amazonaws.com (replacing the x's). You can find this address by going to "Instances", finding your created instance, and copying the address under Public DNS. It may take some time for your instance to initialize, so wait until all status checks are completed before attempting to log in.
To ssh in:
ssh -i ~/.ssh/your-key.pem ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
Note: If ssh rejects the key because of its permissions, use chmod to set the private key's permissions to 700 (e.g. chmod 700 ~/.ssh/your-key.pem).
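As a rough sketch of the permissions fix, the helpers below restrict a key file to its owner and confirm that group and others have no access before you connect. The function names are hypothetical, and the key path and host in the usage comment are the guide's placeholders.

```shell
# Hypothetical helper: succeeds only when the key file grants no permissions
# to group or others (true for modes such as 700, 600, or 400).
key_is_private() {
    # Characters 5-10 of the `ls -l` mode string are the group and other bits.
    [ "$(ls -ld "$1" | cut -c5-10)" = "------" ]
}

# Restrict the key to its owner (mode 700, as in the note above), then verify.
secure_key() {
    chmod 700 "$1" && key_is_private "$1"
}

# Usage, with the placeholder key path and host from this guide:
#   secure_key ~/.ssh/your-key.pem &&
#       ssh -i ~/.ssh/your-key.pem ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```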
Once you navigate to the public DNS address, you should be greeted by a splash screen. Register a new user (currently only one user account can be made per server). Then, once you click 'New Project...' and name your project, you will be taken automatically to the New Alignment screen.
Proceed through the alignment steps using the numbered green buttons at the top.
-
First, you'll be asked to set a name for the alignment. An alignment consists of a set of samples aligned to a reference genome.
-
Add a reference genome to your project by clicking the new button and selecting load file from NCBI. Simply fill in the accession number (for instance, U00096.2 for E. coli) and give the reference genome a name. If you'd like to use a custom reference, you can upload the file through the browser. Check that you've got the right accession number by comparing your genome's size to the number of nucleotides present in the reference genome.
NOTE: If you want Millstone to annotate your mutations using SnpEff, your reference genome file must be in GenBank format, not FASTA.
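If you have a local copy of the reference file, the size check described above can be done from the command line. This is only a sketch with hypothetical function names: the FASTA version sums every record in the file, and the GenBank version reads the length declared on the first LOCUS line rather than counting bases.

```shell
# Total sequence length of a FASTA file: drop header lines, strip line
# breaks, and count the remaining characters (all records are summed).
fasta_length() {
    grep -v '^>' "$1" | tr -d '\r\n' | wc -c | tr -d ' '
}

# Length declared on the first LOCUS line of a GenBank file ("<length> bp").
genbank_length() {
    awk '/^LOCUS/ { for (i = 2; i <= NF; i++) if ($i == "bp") { print $(i - 1); exit } }' "$1"
}

# Usage (hypothetical file names):
#   fasta_length my_reference.fasta
#   genbank_length my_reference.gb
```

Compare the printed number with the genome size you expect for your organism before starting the alignment.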
-
Once that's done, move on to the samples section. Each genome sample you upload must consist of a pair of forward and reverse FASTQ files. You can either upload samples through the browser or upload them in batch to the server from the command line via scp. The command line approach is better for large numbers of samples, but is more complicated; it is detailed in the Manual Upload section at the bottom of this guide. Uploading each sample individually through the browser is the less technical approach. For each sample, choose a name and forward and reverse read files.
NOTE: Millstone can work with gzip-ed FASTQ files, and they will be faster to upload.
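Before uploading, it can help to confirm that every sample has both read files and then compress them. The sketch below assumes a hypothetical naming scheme of <sample>_R1.fastq and <sample>_R2.fastq in a single directory; Millstone does not require these names, so adjust the patterns to match your files.

```shell
# Print each forward read file in a directory that has no reverse mate.
# Assumes the hypothetical naming scheme <sample>_R1.fastq / <sample>_R2.fastq.
unpaired_samples() {
    local fwd rev
    for fwd in "$1"/*_R1.fastq; do
        [ -e "$fwd" ] || continue            # the glob matched nothing
        rev="${fwd%_R1.fastq}_R2.fastq"
        [ -e "$rev" ] || echo "$fwd"
    done
}

# gzip every FASTQ file in the directory in place (x.fastq -> x.fastq.gz).
compress_fastq() {
    local fq
    for fq in "$1"/*.fastq; do
        [ -e "$fq" ] || continue
        gzip "$fq"
    done
}

# Usage (hypothetical directory name); run the pair check first, because it
# looks for uncompressed .fastq names:
#   unpaired_samples ./reads
#   compress_fastq ./reads
```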