Merge pull request #84 from Roche/dev
merge new version 1.0.3 from dev
ofajardo authored Nov 6, 2020
2 parents abc02d7 + 197d527 commit 2a0d9ca
Showing 24 changed files with 4,497 additions and 2,304 deletions.
36 changes: 34 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -40,6 +40,7 @@ the original applications in this regard.**
- [Reading only the headers](#reading-only-the-headers)
- [Reading selected columns](#reading-selected-columns)
- [Reading rows in chunks](#reading-rows-in-chunks)
- [Reading files in parallel processes](#reading-files-in-parallel-processes)
- [Reading value labels](#reading-value-labels)
- [Missing Values](#missing-values)
+ [SPSS](#spss)
@@ -71,7 +72,8 @@ and you have to specify the encoding otherwise in python 3 instead of strings yo

This package corrects those problems.

**1. Good Performance:** Here a comparison of reading a 190 Mb sas7dat file with different methods. As you can see
**1. Good Performance:** Here is a comparison of reading a 190 MB sas7bdat file with 202K rows
by 70 columns (numeric, character and date-like) using different methods. As you can see,
pyreadstat is the fastest for Python and matches the speed of R Haven.

| Method | time |
@@ -105,6 +107,23 @@ some specific columns, and you want to do it quick. This package offers the poss
it possible a very fast metadata scraping (Pandas read_sas can also do it if you pass the value iterator=True).
In addition it offers the capability to read sas7bcat files separately from the sas7bdat files.

More recently, users have shown a lot of interest in using pyreadstat to read SPSS sav files. Some benchmarks
taken after the improvements in pyreadstat 1.0.3 are presented below. The small file is 200K rows x 100 columns (152 MB)
and contains only numeric columns;
the big file is 294K rows x 666 columns (1.5 GB). There are two versions of the big file: one containing numeric
columns only and one with a mix of numeric and character columns. Pyreadstat offers two ways to read files: in
a single process using read_sav, or in multiple processes using read_file_multiprocessing (see later
in the readme for more information).

| Method | small | big numeric | big mixed |
| :----- | :----: | :---------: | :-------: |
| pyreadstat read_sav | 2.3 s | 28 s | 40 s |
| pyreadstat read_file_multiprocessing | 0.8 s | 10 s | 21 s |

As you can see, pyreadstat is slower when reading a table with both numeric and character types. This
is because numpy and pandas do not have a native type for strings; they use a generic object type instead, which
brings a big hit in performance. The situation can be improved, though, by reading files in multiple processes.
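
To make the comparison concrete, the speedup factors implied by the table can be computed directly. This is only a small arithmetic sketch: the timings are copied from the table, and the `timings` dictionary and `speedup` helper are illustrative names, not part of pyreadstat.

```python
# Timings in seconds, copied from the benchmark table.
timings = {
    "small": {"read_sav": 2.3, "read_file_multiprocessing": 0.8},
    "big numeric": {"read_sav": 28.0, "read_file_multiprocessing": 10.0},
    "big mixed": {"read_sav": 40.0, "read_file_multiprocessing": 21.0},
}


def speedup(single_process_seconds, multi_process_seconds):
    """Return how many times faster the multi-process read was."""
    return single_process_seconds / multi_process_seconds


# Speedup factor per benchmark file.
speedups = {
    name: speedup(t["read_sav"], t["read_file_multiprocessing"])
    for name, t in timings.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster with multiprocessing")
```

The numeric-only files gain close to a factor of three, while the mixed file gains roughly a factor of two, which is consistent with the cost of handling character columns.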


## Dependencies

@@ -120,7 +139,6 @@ users have sometimes reported problems. In those cases it may help to install li
on mac). Readstat also depends on zlib; it was reported not to be installed by default on Lubuntu. If you face this problem installing the
library solves it.


## Installation

### Using pip
@@ -335,6 +353,20 @@ for df, meta in reader:
# do some cool calculations here for the chunk
```

#### Reading files in parallel processes

Another challenge when reading large files is the time the operation takes. To alleviate this,
pyreadstat provides the function `read_file_multiprocessing`, which reads a file in parallel processes using the Python multiprocessing library.
The speedup will depend on a number of factors, such as the number of processes available, RAM,
the content of the file, etc.

```python
import pyreadstat

fpath = "path/to/file.sav"
df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath)
```
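
The function's documentation notes that `row_offset` and `row_limit` are used internally, which suggests that each process reads its own contiguous block of rows. The sketch below shows one way such a split could be computed. It is an assumption for illustration only, not pyreadstat's actual implementation, and `split_rows` is a hypothetical helper.

```python
def split_rows(total_rows, num_processes):
    """Split total_rows into contiguous (row_offset, row_limit) pairs.

    Hypothetical helper: one pair per worker, with earlier workers taking
    one extra row when the rows do not divide evenly.
    """
    if total_rows < 0 or num_processes < 1:
        raise ValueError("total_rows must be >= 0 and num_processes >= 1")
    base, extra = divmod(total_rows, num_processes)
    chunks = []
    offset = 0
    for worker in range(num_processes):
        limit = base + (1 if worker < extra else 0)
        if limit == 0:
            # More workers than rows: the remaining workers get nothing.
            break
        chunks.append((offset, limit))
        offset += limit
    return chunks


if __name__ == "__main__":
    for row_offset, row_limit in split_rows(10, 4):
        print(f"row_offset={row_offset}, row_limit={row_limit}")
```

The number of workers can be set with the documented `num_processes` argument, for example `pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath, num_processes=4)`; by default it is the total number of cores.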

#### Reading value labels

For sas7bdat files, value labels are stored in separate sas7bcat files. You can use them in combination with the sas7bdat
5 changes: 5 additions & 0 deletions change_log.md
@@ -1,3 +1,8 @@
# 1.0.3 (github, pypi and conda )
* Improved performance, especially for big files.
* Added a method to read files in parallel.
* Added license information to setup.py.

# 1.0.2 (github, pypi and conda 2020.09.05)
* Updated default widths for DATE and DATETIME formats (from Readstat src). That makes the files readable both in SPSS and PSPP,
solves issue #69.
Binary file modified docs/_build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/_build/doctrees/index.doctree
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/_build/html/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: b512e4c32556bf1138a5a49da857c454
config: 4c9fea8c6d7cc40754033f4e92df1160
tags: 645f666f9bcd5a90fca523b33c5a78b7
2 changes: 1 addition & 1 deletion docs/_build/html/_static/documentation_options.js
@@ -1,6 +1,6 @@
var DOCUMENTATION_OPTIONS = {
URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'),
VERSION: '1.0.2',
VERSION: '1.0.3',
LANGUAGE: 'None',
COLLAPSE_INDEX: false,
BUILDER: 'html',
4 changes: 3 additions & 1 deletion docs/_build/html/genindex.html
@@ -7,7 +7,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Index &mdash; pyreadstat 1.0.2 documentation</title>
<title>Index &mdash; pyreadstat 1.0.3 documentation</title>



@@ -186,6 +186,8 @@ <h2 id="R">R</h2>
<li><a href="index.html#pyreadstat.pyreadstat.read_dta">read_dta() (in module pyreadstat.pyreadstat)</a>
</li>
<li><a href="index.html#pyreadstat.pyreadstat.read_file_in_chunks">read_file_in_chunks() (in module pyreadstat.pyreadstat)</a>
</li>
<li><a href="index.html#pyreadstat.pyreadstat.read_file_multiprocessing">read_file_multiprocessing() (in module pyreadstat.pyreadstat)</a>
</li>
<li><a href="index.html#pyreadstat.pyreadstat.read_por">read_por() (in module pyreadstat.pyreadstat)</a>
</li>
30 changes: 27 additions & 3 deletions docs/_build/html/index.html
@@ -7,7 +7,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.0.2 documentation</title>
<title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.0.3 documentation</title>



@@ -257,15 +257,39 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
</dd>
<dt class="field-even">Yields</dt>
<dd class="field-even"><ul class="simple">
<li><p><strong>data_frame</strong> (<em>pandas dataframe</em>) – a pandas data frame with the data (no data in this case, so will be empty)</p></li>
<li><p><em>metadata</em> – object with metadata. The member value_labels is the one that contains the formats.
<li><p><strong>data_frame</strong> (<em>pandas dataframe</em>) – a pandas data frame with the data</p></li>
<li><p><em>metadata</em> – object with metadata.
Look at the documentation for more information.</p></li>
<li><p><strong>it</strong> (<em>generator</em>) – A generator that reads the file in chunks.</p></li>
</ul>
</dd>
</dl>
</dd></dl>

<dl class="py function">
<dt id="pyreadstat.pyreadstat.read_file_multiprocessing">
<code class="sig-prename descclassname">pyreadstat.pyreadstat.</code><code class="sig-name descname">read_file_multiprocessing</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyreadstat.pyreadstat.read_file_multiprocessing" title="Permalink to this definition"></a></dt>
<dd><p>Reads a file in parallel using multiprocessing.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>read_function</strong> (<em>pyreadstat function</em>) – a pyreadstat reading function</p></li>
<li><p><strong>file_path</strong> (<em>string</em>) – path to the file to be read</p></li>
<li><p><strong>num_processes</strong> (<em>integer</em><em>, </em><em>optional</em>) – number of processes to spawn, by default the total number of cores</p></li>
<li><p><strong>kwargs</strong> (<em>dict</em><em>, </em><em>optional</em>) – any other keyword argument to pass to the read_function.
row_limit and row_offset will be discarded if present as they are used internally.</p></li>
</ul>
</dd>
<dt class="field-even">Returns</dt>
<dd class="field-even"><p><ul class="simple">
<li><p><strong>data_frame</strong> (<em>pandas dataframe</em>) – a pandas data frame with the data</p></li>
<li><p><em>metadata</em> – object with metadata. Look at the documentation for more information.</p></li>
</ul>
</p>
</dd>
</dl>
</dd></dl>

<dl class="py function">
<dt id="pyreadstat.pyreadstat.read_por">
<code class="sig-prename descclassname">pyreadstat.pyreadstat.</code><code class="sig-name descname">read_por</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyreadstat.pyreadstat.read_por" title="Permalink to this definition"></a></dt>
8 changes: 4 additions & 4 deletions docs/_build/html/objects.inv
@@ -2,7 +2,7 @@
# Project: pyreadstat
# Version:
# The remainder of this file is compressed using zlib.
2 changes: 1 addition & 1 deletion docs/_build/html/py-modindex.html
@@ -7,7 +7,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Python Module Index &mdash; pyreadstat 1.0.2 documentation</title>
<title>Python Module Index &mdash; pyreadstat 1.0.3 documentation</title>



2 changes: 1 addition & 1 deletion docs/_build/html/search.html
@@ -7,7 +7,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Search &mdash; pyreadstat 1.0.2 documentation</title>
<title>Search &mdash; pyreadstat 1.0.3 documentation</title>



2 changes: 1 addition & 1 deletion docs/_build/html/searchindex.js

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion docs/conf.py
@@ -26,7 +26,7 @@
# The short X.Y version
version = ''
# The full version, including alpha/beta/rc tags
release = '1.0.2'
release = '1.0.3'


# -- General configuration ---------------------------------------------------
4 changes: 2 additions & 2 deletions pyreadstat/__init__.py
@@ -17,7 +17,7 @@
from .pyreadstat import read_sas7bdat, read_xport, read_dta, read_sav, read_por, read_sas7bcat
from .pyreadstat import write_sav, write_dta, write_xport, write_por
from .pyreadstat import set_value_labels, set_catalog_to_sas
from .pyreadstat import read_file_in_chunks
from .pyreadstat import read_file_in_chunks, read_file_multiprocessing
from ._readstat_parser import ReadstatError, metadata_container

__version__ = "1.0.2"
__version__ = "1.0.3"