-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathhow-to-process-a-fastq-file-in-parallel.html
178 lines (150 loc) · 12.8 KB
/
how-to-process-a-fastq-file-in-parallel.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
<!doctype html>
<html lang="en">
<head>
<!-- Required meta tags -->
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<title> How-to: Process a FASTQ file in parallel | Computational Approaches to Biological Problems
</title>
<link rel="canonical" href="./how-to-process-a-fastq-file-in-parallel.html">
<link rel="stylesheet" href="./theme/css/bootstrap.min.css">
<link rel="stylesheet" href="./theme/css/fontawesome.min.css">
<link rel="stylesheet" href="./theme/css/pygments/default.min.css">
<link rel="stylesheet" href="./theme/css/theme.css">
<meta name="description" content="In this post, I will be demonstrating how to read FASTQ files using the htsjdk java library and then process those files using functions that allow for parallel processing. I'll be using JShell as described in a previous post, and to begin, we'll designate the class path to the HTSJDK …">
</head>
<body>
<header class="header">
<div class="container">
<div class="row">
<div class="col-sm-4">
<a href="./">
<img class="img-fluid rounded" src=./images/LucasBoatwrightHeadShot-161x215.jpeg alt="Computational Approaches to Biological Problems">
</a>
</div>
<div class="col-sm-8">
<h1 class="title"><a href="./">Computational Approaches to Biological Problems</a></h1>
<p class="text-muted">and other biological inklings</p>
<ul class="list-inline">
<li class="list-inline-item"><a href="./index.html">Home</a></li>
<li class="list-inline-item"><a href="./archives.html">posts</a></li>
<li class="list-inline-item"><a href="/~https://github.com/jlboat">GitHub</a></li>
<li class="list-inline-item"><a href="https://scholar.google.com/citations?hl=en&user=q5G2bJEAAAAJ&view_op=list_works&gmla=AJsN-F4gfKvkjBIT6Nv8Xy97IIUHuhFi9jF8x5ntZveyPGo_xO2ws-_5NjCw-_9z9CD2mWJnlbL1DeMd-0l-IUNYPsQ1NR4wu8s_l6YjZYlsb9mTQ1d8TX8">Pubs</a></li>
<li class="list-inline-item text-muted">|</li>
<li class="list-inline-item"><a href="./pages/about-me.html">About me</a></li>
</ul>
</div>
</div> </div>
</header>
<div class="main">
<div class="container">
<h1> How-to: Process a FASTQ file in parallel
</h1>
<hr>
<article class="article">
<header>
<ul class="list-inline">
<li class="list-inline-item text-muted" title="2018-11-13T00:00:00-05:00">
<i class="fas fa-clock"></i>
Tue 13 November 2018
</li>
<li class="list-inline-item">
<i class="fas fa-folder-open"></i>
<a href="./category/how-to.html">How-to</a>
</li>
<li class="list-inline-item">
<i class="fas fa-user"></i>
<a href="./author/j-lucas-boatwright.html">J. Lucas Boatwright</a> </li>
<li class="list-inline-item">
<i class="fas fa-tag"></i>
<a href="./tag/bioinformatics.html">#bioinformatics</a>, <a href="./tag/education.html">#education</a>, <a href="./tag/java.html">#java</a> </li>
</ul>
</header>
<div class="content">
<p>In this post, I will be demonstrating how to read FASTQ files using the htsjdk java library and then process those files using functions that allow for parallel processing.</p>
<p>I'll be using JShell as described in a previous post, and to begin, we'll designate the class path to the HTSJDK library.</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">CLASSPATH</span><span class="o">=</span><span class="s2">"/home/lboat/IdeaProjects/htsjdk/build/classes/java/main/"</span>
jshell
</code></pre></div>
<p>Now that JShell is open, we can import all of the necessary functions.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">java.io.File</span><span class="p">;</span>
<span class="kn">import</span> <span class="nn">java.io.BufferedReader</span><span class="p">;</span>
<span class="kn">import</span> <span class="nn">java.io.FileInputStream</span><span class="p">;</span>
<span class="kn">import</span> <span class="nn">java.io.InputStreamReader</span><span class="p">;</span>
<span class="kn">import</span> <span class="nn">java.util.zip.GZIPInputStream</span><span class="p">;</span>
<span class="kn">import</span> <span class="nn">htsjdk.samtools.fastq.FastqRecord</span>
<span class="kn">import</span> <span class="nn">htsjdk.samtools.fastq.FastqReader</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">File</span> <span class="n">file</span> <span class="o">=</span> <span class="k">new</span> <span class="n">File</span><span class="p">(</span><span class="s">"R1.fastq.gz"</span><span class="p">)</span>
<span class="n">FileInputStream</span> <span class="n">fis</span> <span class="o">=</span> <span class="k">new</span> <span class="n">FileInputStream</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
<span class="n">GZIPInputStream</span> <span class="n">gis</span> <span class="o">=</span> <span class="k">new</span> <span class="n">GZIPInputStream</span><span class="p">(</span><span class="n">fis</span><span class="p">)</span>
<span class="n">InputStreamReader</span> <span class="n">isr</span> <span class="o">=</span> <span class="k">new</span> <span class="n">InputStreamReader</span><span class="p">(</span><span class="n">gis</span><span class="p">)</span>
<span class="n">BufferedReader</span> <span class="n">br</span> <span class="o">=</span> <span class="k">new</span> <span class="n">BufferedReader</span><span class="p">(</span><span class="n">isr</span><span class="p">)</span>
<span class="c1">// I'm adding file here (which is optional) so we can see the file name</span>
<span class="n">FastqReader</span> <span class="n">fr</span> <span class="o">=</span> <span class="k">new</span> <span class="n">FastqReader</span><span class="p">(</span><span class="n">file</span><span class="p">,</span><span class="n">br</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">FastqRecord</span> <span class="n">record</span> <span class="o">=</span> <span class="n">fr</span><span class="p">.</span><span class="na">next</span><span class="p">()</span>
<span class="n">record</span><span class="p">.</span><span class="na">getReadLength</span><span class="p">()</span>
<span class="n">record</span><span class="p">.</span><span class="na">getBaseQualities</span><span class="p">()</span>
<span class="n">Spliterator</span><span class="o"><</span><span class="n">FastqRecord</span><span class="o">></span> <span class="n">spliterator</span> <span class="o">=</span> <span class="n">fr</span><span class="p">.</span><span class="na">spliterator</span><span class="p">()</span>
<span class="c1">// Note: You can only iterate over a stream once, </span>
<span class="c1">// so typically you won't save the stream to a variable</span>
<span class="n">Stream</span><span class="o"><</span><span class="n">FastqRecord</span><span class="o">></span> <span class="n">stream</span> <span class="o">=</span> <span class="n">StreamSupport</span><span class="p">.</span><span class="na">stream</span><span class="p">(</span><span class="n">spliterator</span><span class="p">,</span> <span class="kc">true</span><span class="p">)</span>
<span class="n">stream</span><span class="p">.</span><span class="na">isParallel</span><span class="p">()</span>
<span class="c1">// Just want to print the </span>
<span class="c1">//stream.forEach(n -> System.out.println(n.getBaseQualities()))</span>
<span class="n">Map</span><span class="o"><</span><span class="n">String</span><span class="p">,</span> <span class="n">Long</span><span class="o">></span> <span class="n">map</span> <span class="o">=</span> <span class="n">stream</span><span class="p">.</span><span class="na">collect</span><span class="p">(</span><span class="n">groupingBy</span><span class="p">(</span><span class="n">FastqRecord</span><span class="p">::</span><span class="n">getReadString</span><span class="p">,</span> <span class="n">counting</span><span class="p">()));</span>
<span class="c1">// Now, we have all of our reads counted</span>
</code></pre></div>
<p>Now we get the path to the FASTA file and generate FASTA object using a FactoryBuilder (A specialized Java class for generating objects).</p>
<div class="highlight"><pre><span></span><code><span class="c1">// Get the path to our file</span>
<span class="n">Path</span> <span class="n">path</span> <span class="o">=</span> <span class="n">Paths</span><span class="p">.</span><span class="na">get</span><span class="p">(</span><span class="s">"big.fasta"</span><span class="p">)</span>
<span class="c1">// Generate a FASTA object</span>
<span class="n">ReferenceSequenceFile</span> <span class="n">fasta</span> <span class="o">=</span> <span class="n">ReferenceSequenceFileFactory</span><span class="p">.</span><span class="na">getReferenceSequenceFile</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
<span class="c1">// Index our FASTA file</span>
<span class="n">FastaSequenceIndex</span> <span class="n">fsi</span> <span class="o">=</span> <span class="n">FastaSequenceIndexCreator</span><span class="p">.</span><span class="na">buildFromFasta</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
</code></pre></div>
<p>The FASTA file is now indexed and we can readily check some specific metrics for this FASTA.</p>
<div class="highlight"><pre><span></span><code><span class="c1">// Determine the number of sequences in our file</span>
<span class="n">fsi</span><span class="p">.</span><span class="na">size</span><span class="p">()</span>
<span class="c1">// Get sequence-specific information</span>
<span class="n">fsi</span><span class="p">.</span><span class="na">getIndexEntry</span><span class="p">(</span><span class="s">"sequence_5"</span><span class="p">)</span>
<span class="err">#</span> <span class="n">contig</span> <span class="n">sequence_5</span><span class="p">;</span> <span class="n">location</span> <span class="mi">5152</span><span class="p">;</span> <span class="n">size</span> <span class="mi">999</span><span class="p">;</span> <span class="n">basesPerLine</span> <span class="mi">60</span><span class="p">;</span> <span class="n">bytesPerLine</span> <span class="mi">61</span>
<span class="c1">// Notice, our FASTA isn't recognized as indexed</span>
<span class="n">fasta</span><span class="p">.</span><span class="na">isIndexed</span><span class="p">()</span>
<span class="err">#</span> <span class="kc">false</span>
<span class="c1">// Write our index </span>
<span class="n">fsi</span><span class="p">.</span><span class="na">write</span><span class="p">(</span><span class="n">Paths</span><span class="p">.</span><span class="na">get</span><span class="p">(</span><span class="s">"big.fa.fai"</span><span class="p">))</span>
<span class="c1">// Re-open the FASTA using this index</span>
<span class="n">fasta</span> <span class="o">=</span> <span class="n">ReferenceSequenceFileFactory</span><span class="p">.</span><span class="na">getReferenceSequenceFile</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
<span class="c1">// Notice, our FASTA is now indexed</span>
<span class="n">fasta</span><span class="p">.</span><span class="na">isIndexed</span><span class="p">()</span>
<span class="kc">true</span>
<span class="c1">// You may want to check the first sequnce</span>
<span class="n">ReferenceSequence</span> <span class="n">rs</span> <span class="o">=</span> <span class="n">fasta</span><span class="p">.</span><span class="na">nextSequence</span><span class="p">()</span>
<span class="c1">// Look at its name</span>
<span class="n">rs</span><span class="p">.</span><span class="na">getName</span><span class="p">()</span>
<span class="c1">// Look at the sequence</span>
<span class="n">rs</span><span class="p">.</span><span class="na">getBaseString</span><span class="p">()</span>
</code></pre></div>
</div>
</article>
</div>
</div>
<footer class="footer">
<div class="container">
<div class="row">
<ul class="col-sm-6 list-inline">
<li class="list-inline-item"><a href="./authors.html">Authors</a></li>
<li class="list-inline-item"><a href="./archives.html">Archives</a></li>
<li class="list-inline-item"><a href="./categories.html">Categories</a></li>
<li class="list-inline-item"><a href="./tags.html">Tags</a></li>
</ul>
<p class="col-sm-6 text-sm-right text-muted">
Generated by <a href="/~https://github.com/getpelican/pelican" target="_blank">Pelican</a>
/ <a href="/~https://github.com/nairobilug/pelican-alchemy" target="_blank">✨</a>
</p>
</div> </div>
</footer>
</body>
</html>