<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>The Best Or NoThing</title>
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<meta property="og:type" content="website">
<meta property="og:title" content="The Best Or NoThing">
<meta property="og:url" content="http://yoursite.com/index.html">
<meta property="og:site_name" content="The Best Or NoThing">
<meta name="twitter:card" content="summary">
<meta name="twitter:title" content="The Best Or NoThing">
<link rel="alternate" href="/atom.xml" title="The Best Or NoThing" type="application/atom+xml">
<link rel="icon" href="/favicon.png">
<link href="//fonts.googleapis.com/css?family=Source+Code+Pro" rel="stylesheet" type="text/css">
<link rel="stylesheet" href="/css/style.css">
</head>
<body>
<div id="container">
<div id="wrap">
<header id="header">
<div id="banner"></div>
<div id="header-outer" class="outer">
<div id="header-title" class="inner">
<h1 id="logo-wrap">
<a href="/" id="logo">The Best Or NoThing</a>
</h1>
</div>
<div id="header-inner" class="inner">
<nav id="main-nav">
<a id="main-nav-toggle" class="nav-icon"></a>
<a class="main-nav-link" href="/">Home</a>
<a class="main-nav-link" href="/archives">Archives</a>
</nav>
<nav id="sub-nav">
<a id="nav-rss-link" class="nav-icon" href="/atom.xml" title="RSS Feed"></a>
<a id="nav-search-btn" class="nav-icon" title="Search"></a>
</nav>
<div id="search-form-wrap">
<form action="//google.com/search" method="get" accept-charset="UTF-8" class="search-form"><input type="search" name="q" results="0" class="search-form-input" placeholder="Search"><button type="submit" class="search-form-submit"></button><input type="hidden" name="sitesearch" value="http://yoursite.com"></form>
</div>
</div>
</div>
</header>
<div class="outer">
<section id="main">
<article id="post-Tensorflow的数据输入" class="article article-type-post" itemscope itemprop="blogPost">
<div class="article-meta">
<a href="/2017/03/10/Tensorflow的数据输入/" class="article-date">
<time datetime="2017-03-10T02:19:34.000Z" itemprop="datePublished">2017-03-10</time>
</a>
<div class="article-category">
<a class="article-category-link" href="/categories/Tensorflow/">Tensorflow</a>
</div>
</div>
<div class="article-inner">
<header class="article-header">
<h1 itemprop="name">
<a class="article-title" href="/2017/03/10/Tensorflow的数据输入/">Tensorflow的数据输入</a>
</h1>
</header>
<div class="article-entry" itemprop="articleBody">
<h2 id="引文"><a href="#引文" class="headerlink" title="引文"></a>引文</h2><p>数据和网络模型是深度学习最重要的两块内容。在图像识别领域,网络结构往往比较固定,因此数据的往往影响了最终的训练成果。拿到辛苦准备好的数据以后,怎么把数据输入到网络模型中进行训练时非常麻烦的一件事。Tensorflow作为当前最火的深度学习框架,提供了多种数据输入的方法:</p>
<ol>
<li>Feeding (supply data from Python code at every step, via a <code>placeholder</code>)</li>
<li>Reading from files (build an input pipeline that reads data from files at the start of the TensorFlow graph)</li>
<li>Preloaded data (define the data as constants or variables in the graph)</li>
</ol>
<p>This post covers the two mainstream approaches: the most common <code>placeholder</code> feed and the <code>input pipeline</code> method.</p>
<h2 id="Placeholder"><a href="#Placeholder" class="headerlink" title="Placeholder"></a>Placeholder</h2><p><code>placeholder</code>方法用起来非常简单直观。placeholder就是占位的意思,当需要数据的时候,可以人为的用数据填满这个placeholder,tensorflow则开始利用这个数据训练模型。通过下面的代码可以非常直观的看到placeholder是怎么工作的。</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">import</span> tensorflow <span class="keyword">as</span> tf</div><div class="line"></div><div class="line"></div><div class="line">a = tf.placeholder(tf.int16)</div><div class="line">b = tf.placeholder(tf.int16)</div><div class="line"></div><div class="line">add = tf.add(a, b)</div><div class="line">mul = tf.multiply(a, b)</div><div class="line"></div><div class="line"><span class="keyword">with</span> tf.Session() <span class="keyword">as</span> sess:</div><div class="line"> <span class="comment"># Run every operation with variable input</span></div><div class="line"> print(<span class="string">"Addition with variables: %i"</span> % sess.run(add, feed_dict={a: <span class="number">2</span>, b: <span class="number">3</span>}))</div><div class="line"> print(<span class="string">"Multiplication with variables: %i"</span> % sess.run(mul, feed_dict={a: <span class="number">2</span>, b: <span class="number">3</span>}))</div></pre></td></tr></table></figure>
<p>Running the code above produces the following output:</p>
<blockquote>
<p>Addition with variables: 5</p>
<p>Multiplication with variables: 6</p>
</blockquote>
<p>As this example shows, placeholders are simple and convenient, but the approach becomes awkward as the data grows more complex. Performance is the other concern: this way of feeding data is single-threaded, so every step has to wait for Python to finish preparing the data, and the constant switching between Python and the C++ runtime adds further overhead. The recommendation here is therefore: <strong>don't feed data into a TensorFlow session through Python variables!</strong></p>
<h2 id="数据输入管道(Input-Pipeline)"><a href="#数据输入管道(Input-Pipeline)" class="headerlink" title="数据输入管道(Input Pipeline)"></a>数据输入管道(Input Pipeline)</h2><p>通常一个数据管道主要由几部分组成:</p>
<ol>
<li>A list of file names, e.g. a Python list</li>
<li>A queue of file names</li>
<li>A Reader that matches the file format</li>
<li>A Decoder that matches the file format</li>
<li>An example queue</li>
</ol>
<h3 id="队列(Queue)"><a href="#队列(Queue)" class="headerlink" title="队列(Queue)"></a>队列(Queue)</h3><p>下面通过一个例子来解释tensorflow里的队列是如何工作的。</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">import</span> tensorflow <span class="keyword">as</span> tf</div><div class="line"></div><div class="line"></div><div class="line">x_input_data = tf.random_normal([<span class="number">3</span>], mean=<span class="number">-1</span>, stddev=<span class="number">4</span>)</div><div class="line"></div><div class="line"><span class="comment"># We build a FIFOQueue inside the graph </span></div><div class="line"><span class="comment"># You can see it as a waiting line that holds waiting data</span></div><div class="line"><span class="comment"># In this case, a line with only 3 positions</span></div><div class="line">q = tf.FIFOQueue(capacity=<span class="number">3</span>, dtypes=tf.float32)</div><div class="line"></div><div class="line">enqueue_op = q.enqueue_many(x_input_data) <span class="comment"># <- x1 - x2 -x3 |</span></div><div class="line"></div><div class="line">input = q.dequeue() </div><div class="line"></div><div class="line">input = tf.Print(input, data=[q.size()], message=<span class="string">"Nb elements left:"</span>)</div><div class="line"></div><div class="line"><span class="comment"># fake graph: START</span></div><div class="line">y = input + <span class="number">1</span></div><div class="line"><span class="comment"># fake graph: END </span></div><div class="line"></div><div class="line"><span class="comment"># We start the session as usual</span></div><div class="line"><span class="keyword">with</span> tf.Session() <span class="keyword">as</span> sess:</div><div class="line"> <span class="comment"># We first run the enqueue_op to load our data into the queue</span></div><div class="line"> sess.run(enqueue_op)</div><div class="line"> <span class="comment"># Now, our queue holds 3 elements, it's full. </span></div><div class="line"> <span class="comment"># We can start to consume our data</span></div><div class="line"> sess.run(y)</div><div class="line"> sess.run(y) </div><div class="line"> sess.run(y) </div><div class="line"> <span class="comment"># Now our queue is empty, if we call it again, our program will hang right here</span></div><div class="line"> sess.run(y)</div></pre></td></tr></table></figure>
<p>This example first builds a queue with a capacity of 3 and then defines a dequeue operation. Once the session starts, the program performs four dequeues; during the first three, the queue shrinks one element at a time. Because the program is still single-threaded, the fourth call blocks, waiting for new data to be enqueued. TensorFlow provides a simple API that makes multi-threaded queue operations easy, as the next example shows.</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div><div class="line">44</div><div class="line">45</div><div class="line">46</div><div class="line">47</div><div class="line">48</div><div class="line">49</div><div class="line">50</div><div class="line">51</div><div class="line">52</div><div class="line">53</div><div class="line">54</div><div class="line">55</div><div class="line">56</div><div class="line">57</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">import</span> tensorflow <span class="keyword">as</span> tf</div><div class="line"></div><div class="line"></div><div class="line">x_input_data = tf.random_normal([<span class="number">6</span>], mean=<span class="number">-1</span>, stddev=<span class="number">4</span>)</div><div class="line"></div><div class="line"></div><div class="line">q = tf.FIFOQueue(capacity=<span class="number">3</span>, dtypes=tf.float32)</div><div class="line"></div><div class="line"><span class="comment"># To check what is happening in this case:</span></div><div class="line"><span class="comment"># we will print a message each time "x_input_data" is actually computed</span></div><div class="line"><span class="comment"># to be used in the "enqueue_many" operation</span></div><div class="line">x_input_data = tf.Print(x_input_data, data=[x_input_data], message=<span class="string">"Raw inputs data generated:"</span>, summarize=<span class="number">6</span>)</div><div class="line">enqueue_op = q.enqueue_many(x_input_data)</div><div class="line"></div><div class="line"><span class="comment"># To leverage multi-threading we create a "QueueRunner"</span></div><div class="line"><span class="comment"># that will handle the "enqueue_op" outside of the main thread</span></div><div class="line"><span class="comment"># We don't need much parallelism here, so we will use only 1 thread</span></div><div class="line">numberOfThreads = <span class="number">1</span></div><div class="line">qr = tf.train.QueueRunner(q, [enqueue_op] * numberOfThreads)</div><div class="line"><span class="comment"># Don't forget to add your "QueueRunner" to the QUEUE_RUNNERS collection</span></div><div class="line">tf.train.add_queue_runner(qr)</div><div class="line"></div><div class="line">input = q.dequeue()</div><div class="line">input = tf.Print(input, data=[q.size(), input], message=<span class="string">"Nb elements left, input:"</span>)</div><div 
class="line"></div><div class="line"><span class="comment"># fake graph: START</span></div><div class="line">y = input + <span class="number">1</span></div><div class="line"><span class="comment"># fake graph: END</span></div><div class="line"></div><div class="line"><span class="comment"># We start the session as usual ...</span></div><div class="line"><span class="keyword">with</span> tf.Session() <span class="keyword">as</span> sess:</div><div class="line"> <span class="comment"># But now we build our coordinator to coordinate our child threads with</span></div><div class="line"> <span class="comment"># the main thread</span></div><div class="line"> coord = tf.train.Coordinator()</div><div class="line"> <span class="comment"># Beware, if you don't start all your queues before runnig anything</span></div><div class="line"> <span class="comment"># The main threads will wait for them to start and you will hang again</span></div><div class="line"> <span class="comment"># This helper start all queues in tf.GraphKeys.QUEUE_RUNNERS</span></div><div class="line"> threads = tf.train.start_queue_runners(coord=coord)</div><div class="line"></div><div class="line"> <span class="comment"># The QueueRunner will automatically call the enqueue operation</span></div><div class="line"> <span class="comment"># asynchronously in its own thread ensuring that the queue is always full</span></div><div class="line"> <span class="comment"># No more hanging for the main process, no more waiting for the GPU</span></div><div class="line"> sess.run(y)</div><div class="line"> sess.run(y)</div><div class="line"> sess.run(y)</div><div class="line"> sess.run(y)</div><div class="line"> sess.run(y)</div><div class="line"> sess.run(y)</div><div class="line"> sess.run(y)</div><div class="line"> sess.run(y)</div><div class="line"> sess.run(y)</div><div class="line"> sess.run(y)</div><div class="line"></div><div class="line"> <span class="comment"># We request our child threads to stop ...</span></div><div class="line"> coord.request_stop()</div><div class="line"> <span class="comment"># ... and we wait for them to do so before releasing the main thread</span></div><div class="line"> coord.join(threads)</div></pre></td></tr></table></figure>
<p>This example uses a <code>QueueRunner</code> and a <code>Coordinator</code>. <code>QueueRunner</code> takes two arguments: a queue and a list of enqueue operations. <code>Coordinator</code> needs no arguments; it coordinates the threads launched by <code>tf.train.start_queue_runners</code>, which starts every runner registered in the <code>tf.GraphKeys.QUEUE_RUNNERS</code> collection. If you run the program, you will see that whenever the queue has room, the thread started by the <code>QueueRunner</code> automatically runs the enqueue operation to refill it.</p>
<h3 id="Reader和Decoder"><a href="#Reader和Decoder" class="headerlink" title="Reader和Decoder"></a>Reader和Decoder</h3><p>首先需要根据数据的格式选择Reader和Decoder,比如针对csv格式的文件,可以使用<a href="https://www.tensorflow.org/api_docs/python/tf/TextLineReader" target="_blank" rel="external">TextLineReader</a>和<a href="https://www.tensorflow.org/api_docs/python/tf/decode_csv" target="_blank" rel="external">decode_csv</a>。更多的格式可以参考<a href="https://www.tensorflow.org/programmers_guide/reading_data" target="_blank" rel="external">官方文档</a>。下面是一个模版程序:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line"><span class="function"><span class="keyword">def</span> <span class="title">read_my_file_format</span><span class="params">(filename_queue)</span>:</span></div><div class="line"> reader = tf.SomeReader()</div><div class="line"> key, record_string = reader.read(filename_queue)</div><div class="line"> example, label = tf.some_decoder(record_string)</div><div class="line"> processed_example = some_processing(example)</div><div class="line"> <span class="keyword">return</span> processed_example, label</div></pre></td></tr></table></figure>
<h3 id="将数据读入到队列中"><a href="#将数据读入到队列中" class="headerlink" title="将数据读入到队列中"></a>将数据读入到队列中</h3><p>下面这个程序展示了如何使用队列进行读取操作。Reader会记录当前的读取位置,再次调用的时候会返回当前文件的下一个example。同时<code>QueueRunner</code>会负责在当前文件读完以后传入下一个文件。</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div></pre></td><td class="code"><pre><div class="line">image, label = read_my_file_format(filename_queue)</div><div class="line">sess = tf.Session()</div><div class="line"></div><div class="line"><span class="comment"># Required. See below for explanation</span></div><div class="line">init = tf.initialize_all_variables()</div><div class="line">sess.run(init)</div><div class="line">tf.train.start_queue_runners(sess=sess)</div><div class="line"></div><div class="line"><span class="comment"># first example from file</span></div><div class="line">label_val_1, image_val_1 = sess.run([label, image])</div><div class="line"><span class="comment"># second example from file</span></div><div class="line">label_val_2, image_val_2 = sess.run([label, image])</div></pre></td></tr></table></figure>
<p>TensorFlow does a lot of work behind the scenes here, so the program is not entirely easy to follow. In summary, it boils down to the following steps (a sketch of the filename-queue setup that the snippet above omits follows this list):</p>
<ol>
<li>Build a queue that contains all the file names.</li>
<li>Read the data sequentially with a Reader that matches the data format; the Reader remembers how far it has read.</li>
<li>When every example in the current file has been read, the QueueRunner supplies the next file so the Reader can continue.</li>
</ol>
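<p>Here is the promised sketch of the filename-queue setup that the snippet above leaves out, reusing the hypothetical <code>read_my_csv_format</code> function from the previous section. The file names are made up; <code>tf.train.string_input_producer</code> builds the queue and registers the QueueRunner that keeps it filled:</p>
<pre><code>import tensorflow as tf

# Hypothetical list of input files
filenames = ["train_0.csv", "train_1.csv"]

# Builds a queue of file names and registers a QueueRunner that refills it;
# with the default num_epochs=None it cycles through the files indefinitely
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)

example, label = read_my_csv_format(filename_queue)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    # Each run fetches the next example from whichever file the Reader is currently on
    example_val_1, label_val_1 = sess.run([example, label])
    example_val_2, label_val_2 = sess.run([example, label])

    coord.request_stop()
    coord.join(threads)
</code></pre>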
<h3 id="构成Batch"><a href="#构成Batch" class="headerlink" title="构成Batch"></a>构成Batch</h3><p>这里已经解释了大部分内容,但在实际训练中,提供给graph的往往是Batch。Tensorflow提供了很简单的方法去<a href="https://www.tensorflow.org/api_docs/python/tf/train/shuffle_batch" target="_blank" rel="external">构成batch</a>。</p>
<h2 id="总结"><a href="#总结" class="headerlink" title="Summary"></a>Summary</h2>
<p>Feeding is the first input method a TensorFlow beginner encounters, but it is not very efficient and is often a poor fit for larger datasets. The pipeline method involves multiple threads and a lot of behind-the-scenes work by TensorFlow, so it is harder to understand, but it gives a clear performance improvement. The treatment of Readers and Decoders here has been simplified; combined with TensorFlow's own <a href="https://www.tensorflow.org/api_guides/python/python_io#tfrecords_format_details" target="_blank" rel="external">TFRecords</a> format, feeding data becomes even simpler.</p>
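<p>For completeness, a rough sketch of reading from TFRecords is shown below. The feature names <code>image_raw</code> and <code>label</code> are assumptions; they have to match whatever keys were used when the records were written:</p>
<pre><code>import tensorflow as tf

def read_tfrecords(filename_queue):
    # TFRecordReader returns one serialized tf.train.Example per read() call
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    # Feature names and types must match the ones used when writing the file
    features = tf.parse_single_example(
        serialized_example,
        features={
            "image_raw": tf.FixedLenFeature([], tf.string),
            "label": tf.FixedLenFeature([], tf.int64),
        })
    image = tf.decode_raw(features["image_raw"], tf.uint8)
    label = tf.cast(features["label"], tf.int32)
    return image, label
</code></pre>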
<h2 id="参考"><a href="#参考" class="headerlink" title="References"></a>References</h2><p><a href="https://www.tensorflow.org/programmers_guide/reading_data" target="_blank" rel="external">https://www.tensorflow.org/programmers_guide/reading_data</a><br><a href="https://indico.io/blog/tensorflow-data-inputs-part1-placeholders-protobufs-queues/" target="_blank" rel="external">https://indico.io/blog/tensorflow-data-inputs-part1-placeholders-protobufs-queues/</a><br><a href="https://blog.metaflow.fr/tensorflow-how-to-optimise-your-input-pipeline-with-queues-and-multi-threading-e7c3874157e0#.cjzlw0297" target="_blank" rel="external">https://blog.metaflow.fr/tensorflow-how-to-optimise-your-input-pipeline-with-queues-and-multi-threading-e7c3874157e0#.cjzlw0297</a><br><a href="https://www.tensorflow.org/programmers_guide/threading_and_queues" target="_blank" rel="external">https://www.tensorflow.org/programmers_guide/threading_and_queues</a></p>
</div>
<footer class="article-footer">
<a data-url="http://yoursite.com/2017/03/10/Tensorflow的数据输入/" data-id="cj037ccwj00005or2kmpq4rjr" class="article-share-link">Share</a>
<ul class="article-tag-list"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/Coding/">Coding</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/Tensorflow/">Tensorflow</a></li></ul>
</footer>
</div>
</article>
</section>
<aside id="sidebar">
<div class="widget-wrap">
<h3 class="widget-title">Categories</h3>
<div class="widget">
<ul class="category-list"><li class="category-list-item"><a class="category-list-link" href="/categories/Tensorflow/">Tensorflow</a></li></ul>
</div>
</div>
<div class="widget-wrap">
<h3 class="widget-title">Tags</h3>
<div class="widget">
<ul class="tag-list"><li class="tag-list-item"><a class="tag-list-link" href="/tags/Coding/">Coding</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/Tensorflow/">Tensorflow</a></li></ul>
</div>
</div>
<div class="widget-wrap">
<h3 class="widget-title">Tag Cloud</h3>
<div class="widget tagcloud">
<a href="/tags/Coding/" style="font-size: 10px;">Coding</a> <a href="/tags/Tensorflow/" style="font-size: 10px;">Tensorflow</a>
</div>
</div>
<div class="widget-wrap">
<h3 class="widget-title">Archives</h3>
<div class="widget">
<ul class="archive-list"><li class="archive-list-item"><a class="archive-list-link" href="/archives/2017/03/">March 2017</a></li></ul>
</div>
</div>
<div class="widget-wrap">
<h3 class="widget-title">Recent Posts</h3>
<div class="widget">
<ul>
<li>
<a href="/2017/03/10/Tensorflow的数据输入/">Tensorflow的数据输入</a>
</li>
</ul>
</div>
</div>
</aside>
</div>
<footer id="footer">
<div class="outer">
<div id="footer-info" class="inner">
© 2017 tbornt<br>
Powered by <a href="http://hexo.io/" target="_blank">Hexo</a>
</div>
</div>
</footer>
</div>
<nav id="mobile-nav">
<a href="/" class="mobile-nav-link">Home</a>
<a href="/archives" class="mobile-nav-link">Archives</a>
</nav>
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.0.3/jquery.min.js"></script>
<link rel="stylesheet" href="/fancybox/jquery.fancybox.css">
<script src="/fancybox/jquery.fancybox.pack.js"></script>
<script src="/js/script.js"></script>
</div>
</body>
</html>