Merge pull request #84 from Roche/dev

merge new version 1.0.3 from dev
Roche · Nov 6, 2020 · 2a0d9ca
2 parents abc02d7 + 197d527
commit 2a0d9ca

Showing 24 changed files with 4,497 additions and 2,304 deletions.
diff --git a/README.md b/README.md
@@ -40,6 +40,7 @@ the original applications in this regard.**
     - [Reading only the headers](#reading-only-the-headers)
     - [Reading selected columns](#reading-selected-columns)
     - [Reading rows in chunks](#reading-rows-in-chunks)
+    - [Reading files in parallel processes](#reading-files-in-parallel-processes)
     - [Reading value labels](#reading-value-labels)
     - [Missing Values](#missing-values)
       + [SPSS](#spss)
@@ -71,7 +72,8 @@ and you have to specify the encoding otherwise in python 3 instead of strings yo
 
 This package corrects those problems.
 
-**1. Good Performance:** Here a comparison of reading a 190 Mb sas7dat file with different methods. As you can see
+**1. Good Performance:** Here is a comparison of reading a 190 MB sas7bdat file with 202K rows
+by 70 columns (numeric, character and date-like) using different methods. As you can see
 pyreadstat is the fastest for python and matches the speeds of R Haven.
 
 | Method | time  |
@@ -105,6 +107,23 @@ some specific columns, and you want to do it quick. This package offers the poss
 it possible a very fast metadata scraping (Pandas read_sas can also do it if you pass the value iterator=True).
 In addition it offers the capability to read sas7bcat files separately from the sas7bdat files.
 
+More recently there has been a lot of interest from users in using pyreadstat to read SPSS sav files. Following
+improvements in pyreadstat 1.0.3, some benchmarks are presented below. The small file is 200K rows x 100 columns
+(152 MB) containing only numeric columns, and
+the big file is 294K rows x 666 columns (1.5 GB). There are two versions of the big file: one containing numeric
+columns only and one with a mix of numeric and character columns. Pyreadstat gives two ways to read files: reading in
+a single process using read_sav and reading in multiple processes using read_file_multiprocessing (see later
+in the readme for more information).
+
+| Method | small  | big numeric | big mixed |
+| :----- | :----: | :---------: | :-------: |
+| pyreadstat read_sav | 2.3 s | 28 s | 40 s |
+| pyreadstat read_file_multiprocessing | 0.8 s | 10 s | 21 s |
+
+As you can see, pyreadstat's performance degrades when reading a table with both numeric and character types. This
+is because numpy and pandas do not have a native type for strings; they use a generic object type instead, which
+brings a big hit in performance. The situation can be improved, though, by reading files in multiple processes.
+
 
 ## Dependencies
 
@@ -120,7 +139,6 @@ users have sometimes reported problems. In those cases it may help to install li
 on mac). Readstat also depends on zlib; it was reported not to be installed by default on Lubuntu. If you face this problem installing the
 library solves it.
 
-
 ## Installation
 
 ### Using pip
@@ -335,6 +353,20 @@ for df, meta in reader:
     # do some cool calculations here for the chunk
 ```
 
+#### Reading files in parallel processes
+
+Another challenge when reading large files is the time the operation takes. To alleviate this,
+pyreadstat provides a function "read_file_multiprocessing" to read a file in parallel processes using the python multiprocessing library.
+The speedup will depend on a number of factors such as the number of processes available, RAM,
+the content of the file, etc.
+
+```python
+import pyreadstat
+
+fpath = "path/to/file.sav" 
+df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath) 
+```
+
 #### Reading value labels
 
 For sas7bdat files, value labels are stored in separated sas7bcat files. You can use them in combination with the sas7bdat
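
The benchmark table added to the README in this diff can be turned into speedup factors with a little arithmetic. The sketch below only restates the timings from that table; the `speedup` helper and the `timings` dictionary are names introduced here for illustration and are not part of pyreadstat.

```python
# Timings in seconds, copied from the README benchmark table in this diff:
# (read_sav, read_file_multiprocessing) for each test file.
timings = {
    "small": (2.3, 0.8),
    "big numeric": (28.0, 10.0),
    "big mixed": (40.0, 21.0),
}


def speedup(single: float, parallel: float) -> float:
    """Return how many times faster the parallel read is (hypothetical helper)."""
    return single / parallel


# Round to one decimal place, matching the precision of the published timings.
speedups = {name: round(speedup(s, p), 1) for name, (s, p) in timings.items()}
print(speedups)  # {'small': 2.9, 'big numeric': 2.8, 'big mixed': 1.9}
```

On these figures the parallel reader is roughly 2.8 to 2.9 times faster on the numeric files and about 1.9 times faster on the mixed file, which is consistent with the README's remark that character columns carry a performance penalty.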

diff --git a/change_log.md b/change_log.md
@@ -1,3 +1,8 @@
+# 1.0.3 (github, pypi and conda )
+* Improved performance, especially for big files.
+* Added a method to read files in parallel.
+* Added license information to setup.py.
+
 # 1.0.2 (github, pypi and conda 2020.09.05)
 * Updated default widths for DATE and DATETIME formats (from Readstat src). That makes the files readable both in SPSS and PSPP,
   solves issue #69.

diff --git a/docs/_build/doctrees/environment.pickle b/docs/_build/doctrees/environment.pickle
diff --git a/docs/_build/doctrees/index.doctree b/docs/_build/doctrees/index.doctree
diff --git a/docs/_build/html/.buildinfo b/docs/_build/html/.buildinfo
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: b512e4c32556bf1138a5a49da857c454
+config: 4c9fea8c6d7cc40754033f4e92df1160
 tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/docs/_build/html/_static/documentation_options.js b/docs/_build/html/_static/documentation_options.js
@@ -1,6 +1,6 @@
 var DOCUMENTATION_OPTIONS = {
     URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'),
-    VERSION: '1.0.2',
+    VERSION: '1.0.3',
     LANGUAGE: 'None',
     COLLAPSE_INDEX: false,
     BUILDER: 'html',

diff --git a/docs/_build/html/genindex.html b/docs/_build/html/genindex.html
@@ -7,7 +7,7 @@
 
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
 
-  <title>Index &mdash; pyreadstat 1.0.2 documentation</title>
+  <title>Index &mdash; pyreadstat 1.0.3 documentation</title>
 
 
 
@@ -186,6 +186,8 @@ <h2 id="R">R</h2>
       <li><a href="index.html#pyreadstat.pyreadstat.read_dta">read_dta() (in module pyreadstat.pyreadstat)</a>
 </li>
       <li><a href="index.html#pyreadstat.pyreadstat.read_file_in_chunks">read_file_in_chunks() (in module pyreadstat.pyreadstat)</a>
+</li>
+      <li><a href="index.html#pyreadstat.pyreadstat.read_file_multiprocessing">read_file_multiprocessing() (in module pyreadstat.pyreadstat)</a>
 </li>
       <li><a href="index.html#pyreadstat.pyreadstat.read_por">read_por() (in module pyreadstat.pyreadstat)</a>
 </li>

diff --git a/docs/_build/html/index.html b/docs/_build/html/index.html
@@ -7,7 +7,7 @@
 
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
 
-  <title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.0.2 documentation</title>
+  <title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.0.3 documentation</title>
 
 
 
@@ -257,15 +257,39 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
 </dd>
 <dt class="field-even">Yields</dt>
 <dd class="field-even"><ul class="simple">
-<li><p><strong>data_frame</strong> (<em>pandas dataframe</em>) – a pandas data frame with the data (no data in this case, so will be empty)</p></li>
-<li><p><em>metadata</em> – object with metadata. The member value_labels is the one that contains the formats.
+<li><p><strong>data_frame</strong> (<em>pandas dataframe</em>) – a pandas data frame with the data</p></li>
+<li><p><em>metadata</em> – object with metadata.
 Look at the documentation for more information.</p></li>
 <li><p><strong>it</strong> (<em>generator</em>) – A generator that reads the file in chunks.</p></li>
 </ul>
 </dd>
 </dl>
 </dd></dl>
 
+<dl class="py function">
+<dt id="pyreadstat.pyreadstat.read_file_multiprocessing">
+<code class="sig-prename descclassname">pyreadstat.pyreadstat.</code><code class="sig-name descname">read_file_multiprocessing</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyreadstat.pyreadstat.read_file_multiprocessing" title="Permalink to this definition">¶</a></dt>
+<dd><p>Reads a file in parallel using multiprocessing.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>read_function</strong> (<em>pyreadstat function</em>) – a pyreadstat reading function</p></li>
+<li><p><strong>file_path</strong> (<em>string</em>) – path to the file to be read</p></li>
+<li><p><strong>num_processes</strong> (<em>integer</em><em>, </em><em>optional</em>) – number of processes to spawn, by default the total number of cores</p></li>
+<li><p><strong>kwargs</strong> (<em>dict</em><em>, </em><em>optional</em>) – any other keyword argument to pass to the read_function.
+row_limit and row_offset will be discarded if present as they are used internally.</p></li>
+</ul>
+</dd>
+<dt class="field-even">Returns</dt>
+<dd class="field-even"><p><ul class="simple">
+<li><p><strong>data_frame</strong> (<em>pandas dataframe</em>) – a pandas data frame with the data</p></li>
+<li><p><em>metadata</em> – object with metadata. Look at the documentation for more information.</p></li>
+</ul>
+</p>
+</dd>
+</dl>
+</dd></dl>
+
 <dl class="py function">
 <dt id="pyreadstat.pyreadstat.read_por">
 <code class="sig-prename descclassname">pyreadstat.pyreadstat.</code><code class="sig-name descname">read_por</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyreadstat.pyreadstat.read_por" title="Permalink to this definition">¶</a></dt>

diff --git a/docs/_build/html/objects.inv b/docs/_build/html/objects.inv
@@ -2,7 +2,7 @@
 # Project: pyreadstat
 # Version: 
 # The remainder of this file is compressed using zlib.
[zlib-compressed binary content changed; not shown]

diff --git a/docs/_build/html/py-modindex.html b/docs/_build/html/py-modindex.html
@@ -7,7 +7,7 @@
 
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
 
-  <title>Python Module Index &mdash; pyreadstat 1.0.2 documentation</title>
+  <title>Python Module Index &mdash; pyreadstat 1.0.3 documentation</title>
 
 
 

diff --git a/docs/_build/html/search.html b/docs/_build/html/search.html
@@ -7,7 +7,7 @@
 
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
 
-  <title>Search &mdash; pyreadstat 1.0.2 documentation</title>
+  <title>Search &mdash; pyreadstat 1.0.3 documentation</title>
 
 
 

diff --git a/docs/_build/html/searchindex.js b/docs/_build/html/searchindex.js
diff --git a/docs/conf.py b/docs/conf.py
@@ -26,7 +26,7 @@
 # The short X.Y version
 version = ''
 # The full version, including alpha/beta/rc tags
-release = '1.0.2'
+release = '1.0.3'
 
 
 # -- General configuration ---------------------------------------------------

diff --git a/pyreadstat/__init__.py b/pyreadstat/__init__.py
@@ -17,7 +17,7 @@
 from .pyreadstat import read_sas7bdat, read_xport, read_dta, read_sav, read_por, read_sas7bcat
 from .pyreadstat import write_sav, write_dta, write_xport, write_por
 from .pyreadstat import set_value_labels, set_catalog_to_sas
-from .pyreadstat import read_file_in_chunks
+from .pyreadstat import read_file_in_chunks, read_file_multiprocessing
 from ._readstat_parser import ReadstatError, metadata_container
 
-__version__ = "1.0.2"
+__version__ = "1.0.3"