Lucie

A Python package that will automatically import un-importable datasets on the UCI ML repository, by using heuristics to figure out how each dataset is likely to be laid out. Our package name "Lucie" is inspired by its function: loading UCI examples

Preserves the UCIML API to the best of its ability, with one minor change (see below).

Succeeds on 95.4% of the top 250 datasets, and includes manually cleaned datasets to allow all top 100 datasets to be imported.

GitHub: /~https://github.com/ArnaoutLab/Lucie

PyPI Package: pip install lucie

ArXiV Paper: https://arxiv.org/abs/2410.09119

Usage and API Changes

For previously importable datasets, we have one minor change: instead of returning the dataset itself, this program will return a tuple, with the first element indicating the dataset format, and the second representing the dataset.

Example:

from lucie import fetch_ucirepo

# fetch dataset 
- iris = fetch_ucirepo(id=53)
+ iris = fetch_ucirepo(id=53)[1]

# data (as pandas dataframes) 
X = iris.data.features 
y = iris.data.targets 
  
# metadata 
print(iris.metadata) 

# variable information 
print(iris.variables)

For datasets that were previously not importable, but are importable via lucie, we return a dictionary, with keys representing filenames and values representing the associated Pandas dataframe. For example:

from lucie import fetch_ucirepo
diabetes = fetch_ucirepo(name='diabetes')
print(diabetes)

"""
output is:
('custom',
 {'diabetes_complete.tsv':
  0      04-21-1991   9:09  33  009
  1      04-21-1991   9:09  34  013
  2      04-21-1991  17:08  62  119
  3      04-21-1991  17:08  33  007
  ...
 }
)
"""

print(diabetes[1]['diabetes_complete.tsv'])

"""
output is:
0      04-21-1991   9:09  33  009
1      04-21-1991   9:09  34  013
2      04-21-1991  17:08  62  119
3      04-21-1991  17:08  33  007
4      04-21-1991  22:51  48  123
...    ...          ...   .. ...
[29329 rows x 4 columns]
"""

In some rare cases, we instead return a JSON dictionary representing the directory structure. However, in practice we did not encounter this in any of the top 250 datasets. If you would like to see how that API would work, we have created a bogus dataset to test this functionality. You can try it by running this script:

from lucie import extract_and_get_path

extract_and_get_path('bogus_sets/bogus_json.zip') # bogus json dataset

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.github/workflows		.github/workflows
bogus_sets		bogus_sets
coverage		coverage
src		src
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
ourflowchart.pdf		ourflowchart.pdf
ourresults.pdf		ourresults.pdf
paper.bib		paper.bib
paper.md		paper.md
pyproject.toml		pyproject.toml
thedatasetsbyviews.pdf		thedatasetsbyviews.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Lucie

Usage and API Changes

About

Releases

Packages

Contributors 2

Languages

License

ArnaoutLab/Lucie

Folders and files

Latest commit

History

Repository files navigation

Lucie

Usage and API Changes

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages