Add NOAA s3 bucket and reversed time files to test_combine_echodata #830

Merged · 96 commits · Oct 7, 2022
321e53d
start creating the structure for lazy echodata combine
b-reyes Aug 27, 2022
662bca4
Merge branch 'dev' into lazy-comb-files
b-reyes Aug 30, 2022
c1426f2
create PreprocessCallable class and add functionality to lazy_combine
b-reyes Aug 30, 2022
bb59291
finish creating a working version of lazy_combine
b-reyes Aug 31, 2022
e58df72
start working on v2 of combine_lazily
b-reyes Aug 31, 2022
e2b9ec6
get a working version of direct_write in combine_lazily_v2
b-reyes Sep 1, 2022
1d8dffa
make construct_lazy_ds return ds_unwritten
b-reyes Sep 1, 2022
b4d9a13
correctly write all variables and dimensions for the Environment grou…
b-reyes Sep 2, 2022
67877a1
account for the rest of the constant dimensions
b-reyes Sep 2, 2022
44faf4d
add comments and documentation to code in combine_lazily_v2
b-reyes Sep 2, 2022
2a89e6d
make combine_lazily_v2 into a class
b-reyes Sep 6, 2022
71fc731
add mechanism to store dataset attributes and make first attempt at …
b-reyes Sep 6, 2022
6be4dc0
delay region write in direct_write
b-reyes Sep 7, 2022
ce62334
add synchronizer for to_zarr and turn off blosc threads when using com…
b-reyes Sep 8, 2022
36afe2b
Rename class and add attributes from all datasets to the Provenance g…
b-reyes Sep 8, 2022
8e95644
add additional type checks to combine
b-reyes Sep 8, 2022
a7b51e7
rename combine_lazily_v2.py to zarr_combine.py
b-reyes Sep 8, 2022
932355e
start simplifying the logic needed to append data and removal of para…
b-reyes Sep 9, 2022
36768c6
reorganize code and include original compressor in encodings
b-reyes Sep 10, 2022
3d87f0e
document functions and add retries in compute
b-reyes Sep 12, 2022
339ce72
start implementing checks for time and channel coordinates
b-reyes Sep 12, 2022
c2af831
add TODO statements
b-reyes Sep 13, 2022
3665a56
fix pre-commit issues
b-reyes Sep 13, 2022
8eaed23
add routine to check Dataset attributes and drop them if they are num…
b-reyes Sep 15, 2022
7ff0ea1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 15, 2022
b7fd81e
set all variables and dims compressor to be the same in io.py and def…
b-reyes Sep 16, 2022
db4cf9a
merge in origin branch
b-reyes Sep 16, 2022
9bdc0a9
Merge branch 'dev' into lazy-comb-files
b-reyes Sep 16, 2022
14ccb84
change conversion to combination
b-reyes Sep 16, 2022
3c7ad86
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 16, 2022
a34e6c3
set compressor encoding for all types of zarr variables
b-reyes Sep 16, 2022
3740e94
Merge branch 'lazy-comb-files' of /~https://github.com/b-reyes/echopype…
b-reyes Sep 16, 2022
aaedf5a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 16, 2022
64d5242
change Provenance attribute name back to conversion and add zarr comp…
b-reyes Sep 16, 2022
2f84ffa
resolve conflict
b-reyes Sep 16, 2022
b3993bc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 16, 2022
496c470
remove unnecessary import
b-reyes Sep 16, 2022
85c5cc2
Merge branch 'dev' into lazy-comb-files
b-reyes Sep 16, 2022
2735334
remove chunking in Platform group for EK60 set_groups
b-reyes Sep 16, 2022
ace66dc
add todo about filename variable write
b-reyes Sep 16, 2022
efe940d
allow for variables with different sized dims to be written (primaril…
b-reyes Sep 21, 2022
7c76299
Merge branch 'dev' into lazy-comb-files
b-reyes Sep 22, 2022
5e806d1
document and finalize check_channels
b-reyes Sep 22, 2022
7e55531
document and finalize check_ascending_ds_times
b-reyes Sep 22, 2022
41bccab
investigate decompression error and create routines that can identify…
b-reyes Sep 24, 2022
b5f2acd
start working on locking writes to zarr
b-reyes Sep 25, 2022
321c820
remove locking scheme attempt and return to corrupted approach, place…
b-reyes Sep 26, 2022
12f5829
create general get_zarr_compression function in io so that it can be …
b-reyes Sep 26, 2022
f4922b0
change filenames numbering to range(len(eds))
b-reyes Sep 26, 2022
449f4e5
remove old time checks and replace with new time check for combined d…
b-reyes Sep 27, 2022
419b174
remove unused old combine code
b-reyes Sep 27, 2022
74920ec
implement a reverse time check and update zarr and ed_comb appropriat…
b-reyes Sep 27, 2022
fbd88a9
resolve conflict in combine.py
b-reyes Sep 27, 2022
2b3b0af
remove alternative combine .py scripts
b-reyes Sep 27, 2022
5636f52
finish documenting zarr_combine
b-reyes Sep 27, 2022
ddfb5fc
begin documenting the new combine api, create code section to validat…
b-reyes Sep 28, 2022
d495b14
remove commented out lock code and remove reference to dask.distributed
b-reyes Sep 28, 2022
df3fa1a
finalize docs and comments for the combine_echodata api
b-reyes Sep 28, 2022
bd01a28
revise combine_echodata bullet points and code section
b-reyes Sep 28, 2022
6f9b16a
modify Notes bullet points in combine_echodata docs
b-reyes Sep 28, 2022
4290c02
correct and highlight the default zarr_path in combine_echodata docs
b-reyes Sep 28, 2022
e9f1ecd
construct mapping for lock scheme
b-reyes Sep 28, 2022
757825e
remove append dimensions when doing a parallel write to zarr files and…
b-reyes Sep 29, 2022
531f5ad
remove append dimensions from dataset that will be written and add cl…
b-reyes Sep 29, 2022
940cc21
create class variable max_append_chunk_size that sets an upperbound o…
b-reyes Sep 30, 2022
9b8a7b9
resolve conflicts with dev branch
b-reyes Sep 30, 2022
dc1abf7
start documenting chunk mapping functions
b-reyes Sep 30, 2022
8d90363
finish documenting current function that construct the uniform to non…
b-reyes Sep 30, 2022
61c5265
add function that writes all append dimensions and finish documenting…
b-reyes Sep 30, 2022
b9f2b28
remove unnecessary test_cluster_dump folder
b-reyes Oct 1, 2022
42a2934
add back in items in test_data README
b-reyes Oct 1, 2022
b244f4d
modify docs, close client if it was not provided, include duplicate_p…
b-reyes Oct 3, 2022
260215e
add distributed to requirements.txt
b-reyes Oct 3, 2022
6dbde23
change distributed in requirements.txt to the dask specific version
b-reyes Oct 3, 2022
58e842e
import dask.distributed and include dask.distributed in typing of combi…
b-reyes Oct 4, 2022
dbe8b7a
Simplify the logic for checking the input client and printing the das…
b-reyes Oct 4, 2022
d670fef
add overwrite kwarg to combine_echodata and rectify warning caused by…
b-reyes Oct 4, 2022
d18aa6c
modify input to validate_output_path so it will work with s3 buckets
b-reyes Oct 5, 2022
1cee3eb
remove double quotes
b-reyes Oct 5, 2022
ee05db4
allow Path type for zarr_path
b-reyes Oct 5, 2022
3d96012
add union typing
b-reyes Oct 5, 2022
e5334c1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 5, 2022
fe0379b
add storage_options to open_converted call
b-reyes Oct 5, 2022
bf4c48f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 5, 2022
bfb9209
add storage_options to zarr.open_array call
b-reyes Oct 5, 2022
5e5c882
only allow zarr_path to be a string and remove option for Path type
b-reyes Oct 5, 2022
3acce9d
send client dashboard link to the logger instead of printing it
b-reyes Oct 5, 2022
4facc49
set storage_option equal to empty dict in _modify_prov_filenames
b-reyes Oct 5, 2022
e120218
add reversed ping time test data to test_combine_echodata parameters
b-reyes Oct 5, 2022
c08fd7d
add EK60 and EK80 file sets from noaa s3 bucket
b-reyes Oct 5, 2022
d74e98e
remove remote pull of noaa EK80 data and replace it with a reference …
b-reyes Oct 6, 2022
21c3b4a
remove remote pull of noaa ek60 data and point to files in test_data,…
b-reyes Oct 7, 2022
0bcdf9a
add more description in TODO comment
b-reyes Oct 7, 2022
18a6cd8
add space to lines in TODO comment so they get highlighted
b-reyes Oct 7, 2022
b7264d5
resolve conflicts with dev branch
b-reyes Oct 7, 2022
3546908
resolve docstring conflicts with dev branch
b-reyes Oct 7, 2022
190 changes: 126 additions & 64 deletions echopype/tests/echodata/test_echodata_combine.py
@@ -1,4 +1,3 @@
from typing import Any, List, Dict
from textwrap import dedent
from pathlib import Path

@@ -9,7 +8,6 @@
import echopype
from echopype.utils.coding import DEFAULT_ENCODINGS
from echopype.qc import exist_reversed_time
from echopype.core import SONAR_MODELS

import tempfile
from dask.distributed import Client
@@ -35,6 +33,16 @@ def ek60_reversed_ping_time_test_data(test_path):
return [test_path["EK60"].joinpath(*f) for f in files]


@pytest.fixture
def ek60_diff_range_sample_test_data(test_path):
files = [
("ncei-wcsd", "SH1701", "TEST-D20170114-T202932.raw"),
("ncei-wcsd", "SH1701", "TEST-D20170114-T203337.raw"),
("ncei-wcsd", "SH1701", "TEST-D20170114-T203853.raw"),
]
return [test_path["EK60"].joinpath(*f) for f in files]


@pytest.fixture
def ek80_test_data(test_path):
files = [
@@ -46,8 +54,34 @@ def ek80_test_data(test_path):
return [test_path["EK80_NEW"].joinpath(*f) for f in files]


@pytest.fixture
def ek80_broadband_same_range_sample_test_data(test_path):
files = [
("ncei-wcsd", "SH1707", "Reduced_D20170826-T205615.raw"),
("ncei-wcsd", "SH1707", "Reduced_D20170826-T205659.raw"),
("ncei-wcsd", "SH1707", "Reduced_D20170826-T205742.raw"),
]
return [test_path["EK80"].joinpath(*f) for f in files]


@pytest.fixture
def ek80_narrowband_diff_range_sample_test_data(test_path):
files = [
("ncei-wcsd", "SH2106", "EK80", "Reduced_Hake-D20210701-T130426.raw"),
("ncei-wcsd", "SH2106", "EK80", "Reduced_Hake-D20210701-T131325.raw"),
("ncei-wcsd", "SH2106", "EK80", "Reduced_Hake-D20210701-T131621.raw"),
]
return [test_path["EK80"].joinpath(*f) for f in files]


@pytest.fixture
def azfp_test_data(test_path):

# TODO: in the future we should replace these files with another,
# similarly small set of files, for example the files from the location below:
# "https://rawdata.oceanobservatories.org/files/CE01ISSM/R00015/instrmts/dcl37/ZPLSC_sn55076/DATA/202109/*"
# This is because we have lost track of where the current files came from,
# since the filenames do not contain the site identifier.
files = [
("ooi", "18100407.01A"),
("ooi", "18100408.01A"),
@@ -64,22 +98,38 @@ def azfp_test_xml(test_path):
@pytest.fixture(
params=[
{
"sonar_model": "EK60",
"xml_file": None,
"files": "ek60_test_data",
},
# {
# "sonar_model": "EK80",
# "xml_file": None,
# "files": "ek80_test_data",
# },
"sonar_model": "EK60",
"xml_file": None,
"files": "ek60_test_data"
},
{
"sonar_model": "AZFP",
"xml_file": "azfp_test_xml",
"files": "azfp_test_data",
}
"sonar_model": "EK60",
"xml_file": None,
"files": "ek60_diff_range_sample_test_data"
},
{
"sonar_model": "EK60",
"xml_file": None,
"files": "ek60_reversed_ping_time_test_data"
},
{
"sonar_model": "AZFP",
"xml_file": "azfp_test_xml",
"files": "azfp_test_data"
},
{
"sonar_model": "EK80",
"xml_file": None,
"files": "ek80_broadband_same_range_sample_test_data"
},
{
"sonar_model": "EK80",
"xml_file": None,
"files": "ek80_narrowband_diff_range_sample_test_data"
}
],
ids=["ek60", "azfp"] #["ek60", "ek80", "azfp"]
ids=["ek60", "ek60_diff_range_sample", "ek60_reversed_time", "azfp",
"ek80_bb_same_range_sample", "ek80_nb_diff_range_sample"]
)
def raw_datasets(request):
files = request.param["files"]
@@ -93,8 +143,7 @@ def raw_datasets(request):
files,
request.param['sonar_model'],
xml_file,
SONAR_MODELS[request.param['sonar_model']]["concat_dims"],
SONAR_MODELS[request.param['sonar_model']]["concat_data_vars"],
request.node.callspec.id
)


Expand All @@ -103,10 +152,13 @@ def test_combine_echodata(raw_datasets):
files,
sonar_model,
xml_file,
concat_dims,
concat_data_vars,
param_id,
) = raw_datasets

if param_id == "ek80_nb_diff_range_sample":
pytest.xfail("The files in ek80_nb_diff_range_sample cause an error when correcting a reversed time. "
"Once this is fixed, these files should be run.")

eds = [echopype.open_raw(file, sonar_model, xml_file) for file in files]

append_dims = {"filenames", "time1", "time2", "time3", "ping_time"}
@@ -133,52 +185,62 @@
# add dimension for Provenance group
all_drop_dims.append("echodata_filename")

for group_name in combined.group_paths:

# get all Datasets to be combined
combined_group: xr.Dataset = combined[group_name]
eds_groups = [
ed[group_name]
for ed in eds
if ed[group_name] is not None
]

# all grp dimensions that are in all_drop_dims
if combined_group is None:
grp_drop_dims = []
concat_dims = []
else:
grp_drop_dims = list(set(combined_group.dims).intersection(set(all_drop_dims)))
concat_dims = list(set(combined_group.dims).intersection(append_dims))

# concat all Datasets along each concat dimension
diff_concats = []
for dim in concat_dims:

drop_dims = [c_dim for c_dim in concat_dims if c_dim != dim]

diff_concats.append(xr.concat([ed_subset.drop_dims(drop_dims) for ed_subset in eds_groups], dim=dim,
coords="minimal", data_vars="minimal"))

if len(diff_concats) < 1:
test_ds = eds_groups[0] # needed for groups that do not have append dims
else:
# create the full combined Dataset
test_ds = xr.merge(diff_concats, compat="override")

# correctly set filenames values for constructed combined Dataset
if "filenames" in test_ds:
test_ds.filenames.values[:] = np.arange(len(test_ds.filenames), dtype=int)
# add dimension for reversed time that will show up in Provenance group
all_drop_dims.append("platform_old_time1_dim")

# correctly modify Provenance attributes so we can do a direct compare
if group_name == "Provenance":
test_ds.attrs["reversed_ping_times"] = 0

del test_ds.attrs["conversion_time"]
del combined_group.attrs["conversion_time"]
for group_name in combined.group_paths:

if (combined_group is not None) and (test_ds is not None):
assert test_ds.identical(combined_group.drop_dims(grp_drop_dims))
# ignore the Platform group for ek60_reversed_time parameters (these are checked elsewhere)
if (param_id != "ek60_reversed_time") or (group_name != "Platform"):

# get all Datasets to be combined
combined_group: xr.Dataset = combined[group_name]
eds_groups = [
ed[group_name]
for ed in eds
if ed[group_name] is not None
]

# all grp dimensions that are in all_drop_dims
if combined_group is None:
grp_drop_dims = []
concat_dims = []
else:
grp_drop_dims = list(set(combined_group.dims).intersection(set(all_drop_dims)))
concat_dims = list(set(combined_group.dims).intersection(append_dims))

# concat all Datasets along each concat dimension
diff_concats = []
for dim in concat_dims:

drop_dims = [c_dim for c_dim in concat_dims if c_dim != dim]

diff_concats.append(xr.concat([ed_subset.drop_dims(drop_dims) for ed_subset in eds_groups], dim=dim,
coords="minimal", data_vars="minimal"))

if len(diff_concats) < 1:
test_ds = eds_groups[0] # needed for groups that do not have append dims
else:
# create the full combined Dataset
test_ds = xr.merge(diff_concats, compat="override")

# correctly set filenames values for constructed combined Dataset
if "filenames" in test_ds:
test_ds.filenames.values[:] = np.arange(len(test_ds.filenames), dtype=int)

# correctly modify Provenance attributes so we can do a direct compare
if group_name == "Provenance":

if param_id == "ek60_reversed_time":
test_ds.attrs["reversed_ping_times"] = 1
else:
test_ds.attrs["reversed_ping_times"] = 0

del test_ds.attrs["conversion_time"]
del combined_group.attrs["conversion_time"]

if (combined_group is not None) and (test_ds is not None):
assert test_ds.identical(combined_group.drop_dims(grp_drop_dims))

temp_zarr_dir.cleanup()

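The updated test body rebuilds the expected combined dataset by concatenating each group along every append dimension separately and then merging the per-dimension results. In isolation, that pattern looks like the hypothetical helper below; `expected_combined` and the synthetic data are invented for illustration and are not part of echopype.

```python
import xarray as xr

def expected_combined(groups, append_dims):
    """Concat per append-dimension, then merge the pieces (illustrative only)."""
    concat_dims = [d for d in append_dims if any(d in g.dims for g in groups)]
    diff_concats = []
    for dim in concat_dims:
        # Drop the other append dims so each concat handles exactly one of them
        drop = [d for d in concat_dims if d != dim]
        diff_concats.append(
            xr.concat([g.drop_dims(drop) for g in groups], dim=dim,
                      coords="minimal", data_vars="minimal")
        )
    if not diff_concats:
        return groups[0]  # groups without append dims pass through unchanged
    # Stitch the per-dimension concats back into one full dataset
    return xr.merge(diff_concats, compat="override")
```

The merged result can then be compared against the `combine_echodata` output with `Dataset.identical`, after dropping dimensions (such as `echodata_filename`) that only exist in the combined product.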