Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add support for DynamoDB online_read in batches #2371

Merged

Conversation

TremaMiguel
Copy link
Member

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #2247

Context

Given a list of entity_ids set a batch_size parameter to determine the number of items to send in a batch_get_item request to DynamoDB.

DynamoDB can retrieve up to 16MB of data and the record size limit is 400kb, batch_size value could be at most 40.

TremaMiguel and others added 5 commits March 5, 2022 18:50
Signed-off-by: Miguel Trejo <armando.trejo.marrufo@gmail.com>
Signed-off-by: Miguel Trejo <armando.trejo.marrufo@gmail.com>
Signed-off-by: Miguel Trejo <armando.trejo.marrufo@gmail.com>
Signed-off-by: Miguel Trejo <armando.trejo.marrufo@gmail.com>
@TremaMiguel TremaMiguel requested a review from a team as a code owner March 6, 2022 02:20
@TremaMiguel TremaMiguel requested review from mavysavydav and removed request for a team March 6, 2022 02:20
@TremaMiguel TremaMiguel changed the title [python-package] Dynamo db online write read in batches feat: Dynamo db online write read in batches Mar 6, 2022
@codecov-commenter
Copy link

codecov-commenter commented Mar 6, 2022

Codecov Report

Merging #2371 (c7ab086) into master (45db6dc) will increase coverage by 0.09%.
The diff coverage is 97.64%.

@@            Coverage Diff             @@
##           master    #2371      +/-   ##
==========================================
+ Coverage   83.54%   83.64%   +0.09%     
==========================================
  Files         123      125       +2     
  Lines       10749    10820      +71     
==========================================
+ Hits         8980     9050      +70     
- Misses       1769     1770       +1     
Flag Coverage Δ
integrationtests 72.97% <75.60%> (-0.26%) ⬇️
unittests 58.03% <95.29%> (+0.49%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sdk/python/tests/utils/online_store_utils.py 90.47% <90.47%> (ø)
sdk/python/feast/infra/online_stores/dynamodb.py 88.12% <100.00%> (+1.79%) ⬆️
...ts/unit/online_store/test_dynamodb_online_store.py 100.00% <100.00%> (ø)
.../integration/online_store/test_universal_online.py 71.05% <0.00%> (+0.23%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 45db6dc...c7ab086. Read the comment docs.

@TremaMiguel
Copy link
Member Author

TremaMiguel commented Mar 6, 2022

@adchia @vlin-lgtm I didn't found any tests for DynamoDBOnlineStore methods. I'm thinking about adding some using moto.

Signed-off-by: Miguel Trejo <armando.trejo.marrufo@gmail.com>
@woop
Copy link
Member

woop commented Mar 6, 2022

@TremaMiguel thanks for this. Would you mind changing your PR title to some kind of explaination of what the PR does? Something like "Add support for DynamoDB reads and writes in batches". This will become a part of our changelog.

@TremaMiguel TremaMiguel changed the title feat: Dynamo db online write read in batches feat: Add support for DynamoDB online_read in batches Mar 6, 2022
@woop woop added the ok-to-test label Mar 6, 2022
@adchia
Copy link
Collaborator

adchia commented Mar 6, 2022

Another thing is to not use git pull and instead use eg git pull --rebase (you also should rebase changes against master again since there were test breakages that now are fixed)

as far as tests go, technically there is a sets of tests that call into dynamo (test_universal_online for example). Though I don't think there's a test that does a batch fetch.

moto looks interesting though. might be a good idea to use that so we can let contributors run tests locally for aws. would love to see how that looks.

Signed-off-by: Miguel Trejo <armando.trejo.marrufo@gmail.com>
Signed-off-by: Miguel Trejo <armando.trejo.marrufo@gmail.com>
@feast-ci-bot feast-ci-bot added size/L and removed size/M labels Mar 7, 2022
@TremaMiguel
Copy link
Member Author

@adchia added a test for online_read method parametrized to test 5, 50 or 100 samples at a time.

To run these test AWS credential shoud be set

export AWS_ACCESS_KEY_ID=your_access_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_access_key

I think we can open an issue to improve tests for DynamoDBOnlineStore, especially online_write_batch.

adchia and others added 2 commits March 7, 2022 18:06
Signed-off-by: Miguel Trejo <armando.trejo.marrufo@gmail.com>
@TremaMiguel
Copy link
Member Author

TremaMiguel commented Mar 7, 2022

@adchia I've changed tests to mock dynamodb behavior through moto, no need to set AWS credentials.

Some tests are failing due to #2376

Signed-off-by: Miguel Trejo <armando.trejo.marrufo@gmail.com>
@TremaMiguel TremaMiguel force-pushed the feat/dynamo_db_online_write_read branch from 153afe5 to e52a895 Compare March 8, 2022 00:07
Copy link

@vlin-lgtm vlin-lgtm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we hide batch_size from users? So something like this can work:

batch_size = 40
entity_ids = ...
batches = []
for i in range(0, len(entity_ids), batch_size): 
  batches.append(entity_ids[i:i + batch_size])

for batch in batches:
  batch_entity_ids = ...
  ...
  ...
  ...
  response = dynamodb_resource.batch_get_item(RequestItems=batch_entity_ids)
  ...

So this always tries to send at most 40 entities at a time. It is better to maximize the batch size as much as possible so there is less network calls to DynamoDB.

sdk/python/feast/infra/online_stores/dynamodb.py Outdated Show resolved Hide resolved
sdk/python/feast/infra/online_stores/dynamodb.py Outdated Show resolved Hide resolved
@TremaMiguel
Copy link
Member Author

@adchia could you provide guidance on how to run integration tests locally for AWS? It appears there's something missing in CONTRIBUTING.md

AWS
TODO(adchia): flesh out setting up AWS login (or create helper script)
Modify RedshiftDataSourceCreator to use your credentials

@adchia
Copy link
Collaborator

adchia commented Mar 11, 2022

@adchia could you provide guidance on how to run integration tests locally for AWS? It appears there's something missing in CONTRIBUTING.md

AWS
TODO(adchia): flesh out setting up AWS login (or create helper script)
Modify RedshiftDataSourceCreator to use your credentials

Hey!

So what I'd actually do is make a new repo_configuration that uses a local offline store + dynamo as the online store. That's configured at /~https://github.com/feast-dev/feast/blob/master/sdk/python/tests/integration/feature_repos/repo_configuration.py#L88-L88

You can basically just ignore all these and set
DEFAULT_FULL_REPO_CONFIGS = [IntegrationTestRepoConfig(online_store=DYNAMO_CONFIG)]. Then as long as you're properly authenticated to be able to create / teardown new dynamo tables, then tests should work fine.

Normally to run everything, you'd run make test-python-universal, but in this case the only test failures are in test_universal_online, so you can just do:

FEAST_USAGE=False IS_TEST=True python -m pytest -n 8 --integration --universal sdk/python/tests/integration/online_store/

Signed-off-by: Miguel Trejo <armando.trejo.marrufo@gmail.com>
Signed-off-by: Miguel Trejo <armando.trejo.marrufo@gmail.com>
Signed-off-by: Miguel Trejo <armando.trejo.marrufo@gmail.com>
@TremaMiguel
Copy link
Member Author

TremaMiguel commented Mar 12, 2022

@adchia some integrations tests expect results from DynamoDB query to have the same order as input entity_keys, I was reading that BatchGetItem don't return results in order.

I added a sort_response parameter that defaults to True to sort DynamoDB results in the same order as the input entity_keys

Do you have any suggestions to improve the implementation of _sort_dynamodb_response method?

Signed-off-by: Miguel Trejo <armando.trejo.marrufo@gmail.com>
@TremaMiguel
Copy link
Member Author

TremaMiguel commented Mar 12, 2022

@adchia test_online_store_cleanup with the case Redshift as offline store and DynamoDB as online store fails.

After doing some digging, there's a difference in the result of fs.get_online_features operation.

  1. The first one is to check if features are available, returning driver_id and value as results values.
metadata {
  feature_names {
    val: "driver_id"
    val: "value"
  }
}
results {
  values {
    int64_val: 5001
  }
  values {
    float_val: 0.03799890726804733
  }
  statuses: PRESENT
  statuses: PRESENT
  event_timestamps {
  }
  event_timestamps {
    seconds: 1647104400
  }
}
results {
  values {
    int64_val: 5002
  }
  values {
    float_val: 0.8108303546905518
  }
  statuses: PRESENT
  statuses: PRESENT
  event_timestamps {
  }
  event_timestamps {
    seconds: 1647104400
  }
}
  1. Second case check if features for other are available after one FeatureView is deleted, results are similar as the previous one.

  2. This one is different, the third case verify that features for other features are deleted, the result from fs.get_online_features is missing the values from the "value" feature

metadata {
  feature_names {
    val: "driver_id"
    val: "value"
  }
}
results {
  values {
    int64_val: 5001
  }
  values {
  }
  statuses: PRESENT
  statuses: NOT_FOUND
  event_timestamps {
  }
  event_timestamps {
  }
}
results {
  values {
    int64_val: 5002
  }
  statuses: PRESENT
  event_timestamps {
  }
}
results {
  values {
    int64_val: 5003
  }
  statuses: PRESENT
  event_timestamps {
  }
}

do you have an idea why is this happening?

Signed-off-by: Miguel Trejo <armando.trejo.marrufo@gmail.com>
Signed-off-by: Miguel Trejo <armando.trejo.marrufo@gmail.com>
Signed-off-by: Miguel Trejo <armando.trejo.marrufo@gmail.com>
@TremaMiguel
Copy link
Member Author

@adchia just to follow up the conversations, is there something you'd like to add?

TremaMiguel and others added 2 commits March 21, 2022 18:15
Signed-off-by: Miguel Trejo <armando.trejo.marrufo@gmail.com>
Copy link
Collaborator

@adchia adchia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@feast-ci-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adchia, TremaMiguel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@feast-ci-bot feast-ci-bot merged commit 702ec49 into feast-dev:master Mar 23, 2022
@vlin-lgtm
Copy link

I'm late in the game, but I'd like to thank you @TremaMiguel very much for contributing this! This is going improve our feature serving latency by a great deal! ❤️

@TremaMiguel
Copy link
Member Author

Thanks @vlin-lgtm, I believe there's room for improvement related to unit testing the DynamoDb Online store module, we could use something like moto for this, as we've done with the online_read method.

Having these unitary tests plus the local integration tests would definitively help make the code more robust and facilitate future changes.

achals pushed a commit that referenced this pull request Apr 14, 2022
# [0.20.0](v0.19.0...v0.20.0) (2022-04-14)

### Bug Fixes

* Add inlined data sources to the top level registry ([#2456](#2456)) ([356788a](356788a))
* Add new value types to types.ts for web ui ([#2463](#2463)) ([ad5694e](ad5694e))
* Add PushSource proto and Python class ([#2428](#2428)) ([9a4bd63](9a4bd63))
* Add spark to lambda dockerfile ([#2480](#2480)) ([514666f](514666f))
* Added private_key auth for Snowflake ([#2508](#2508)) ([c42c9b0](c42c9b0))
* Added Redshift and Spark typecheck to data_source event_timestamp_col inference ([#2389](#2389)) ([04dea73](04dea73))
* Building of go extension fails ([#2448](#2448)) ([7d1efd5](7d1efd5))
* Bump the number of versions bumps expected to 27 ([#2549](#2549)) ([ecc9938](ecc9938))
* Create __init__ files for the proto-generated python dirs ([#2410](#2410)) ([e17028d](e17028d))
* Don't prevent apply from running given duplicate empty names in data sources. Also fix repeated apply of Spark data source. ([#2415](#2415)) ([b95f441](b95f441))
* Dynamodb deduplicate batch write request by partition keys ([#2515](#2515)) ([70d4a13](70d4a13))
* Ensure that __init__ files exist in proto dirs ([#2433](#2433)) ([9b94f7b](9b94f7b))
* Fix DataSource constructor to unbreak custom data sources ([#2492](#2492)) ([712653e](712653e))
* Fix default feast apply path without any extras ([#2373](#2373)) ([6ba7fc7](6ba7fc7))
* Fix definitions.py with new definition ([#2541](#2541)) ([eefc34a](eefc34a))
* Fix entity row to use join key instead of name ([#2521](#2521)) ([c22fa2c](c22fa2c))
* Fix Java Master ([#2499](#2499)) ([e083458](e083458))
* Fix registry proto ([#2435](#2435)) ([ea6a9b2](ea6a9b2))
* Fix some inconsistencies in the docs and comments in the code ([#2444](#2444)) ([ad008bf](ad008bf))
* Fix spark docs ([#2382](#2382)) ([d4a606a](d4a606a))
* Fix Spark template to work correctly on feast init -t spark ([#2393](#2393)) ([ae133fd](ae133fd))
* Fix the feature repo fixture used by java tests  ([#2469](#2469)) ([32e925e](32e925e))
* Fix unhashable Snowflake and Redshift sources ([cd8f1c9](cd8f1c9))
* Fixed bug in passing config file params to snowflake python connector ([#2503](#2503)) ([34f2b59](34f2b59))
* Fixing Spark template to include source name ([#2381](#2381)) ([a985f1d](a985f1d))
* Make name a keyword arg for the Entity class ([#2467](#2467)) ([43847de](43847de))
* Making a name for data sources not a breaking change ([#2379](#2379)) ([71d7ae2](71d7ae2))
* Minor link fix in `CONTRIBUTING.md` ([#2481](#2481)) ([2917e27](2917e27))
* Preserve ordering of features in _get_column_names ([#2457](#2457)) ([495b435](495b435))
* Relax click python requirement to >=7 ([#2450](#2450)) ([f202f92](f202f92))
* Remove date partition column field from datasources that don't s… ([#2478](#2478)) ([ce35835](ce35835))
* Remove docker step from unit test workflow ([#2535](#2535)) ([6f22f22](6f22f22))
* Remove spark from the AWS Lambda dockerfile ([#2498](#2498)) ([6abae16](6abae16))
* Request data api update ([#2488](#2488)) ([0c9e5b7](0c9e5b7))
* Schema update ([#2509](#2509)) ([cf7bbc2](cf7bbc2))
* Simplify DataSource.from_proto logic ([#2424](#2424)) ([6bda4d2](6bda4d2))
* Snowflake api update ([#2487](#2487)) ([1181a9e](1181a9e))
* Support passing batch source to streaming sources for backfills ([#2523](#2523)) ([90db1d1](90db1d1))
* Timestamp update ([#2486](#2486)) ([bf23111](bf23111))
* Typos in Feast UI error message ([#2432](#2432)) ([e14369d](e14369d))
* Update feature view APIs to prefer keyword args ([#2472](#2472)) ([7c19cf7](7c19cf7))
* Update file api ([#2470](#2470)) ([83a11c6](83a11c6))
* Update Makefile to cd into python dir before running commands ([#2437](#2437)) ([ca32155](ca32155))
* Update redshift api ([#2479](#2479)) ([4fa73a9](4fa73a9))
* Update some fields optional in UI parser ([#2380](#2380)) ([cff7ac3](cff7ac3))
* Use a single version of jackson libraries and upgrade to 2.12.6.1 ([#2473](#2473)) ([5be1cc6](5be1cc6))
* Use dateutil parser to parse materialization times ([#2464](#2464)) ([6c55e49](6c55e49))
* Use the correct dockerhub image tag when building feature servers ([#2372](#2372)) ([0d62c1d](0d62c1d))

### Features

* Add `/write-to-online-store` method to the python feature server ([#2423](#2423)) ([d2fb048](d2fb048))
* Add description, tags, owner fields to all feature view classes ([#2440](#2440)) ([ed5e928](ed5e928))
* Add DQM Logging on GRPC Server with FileLogStorage for Testing ([#2403](#2403)) ([57a97d8](57a97d8))
* Add Feast types in preparation for changing type system ([#2475](#2475)) ([4864252](4864252))
* Add Field class ([#2500](#2500)) ([1279612](1279612))
* Add support for DynamoDB online_read in batches ([#2371](#2371)) ([702ec49](702ec49))
* Add Support for DynamodbOnlineStoreConfig endpoint_url parameter ([#2485](#2485)) ([7b863d1](7b863d1))
* Add templating for dynamodb table name ([#2394](#2394)) ([f591088](f591088))
* Allow local feature server to use Go feature server if enabled ([#2538](#2538)) ([a2ef375](a2ef375))
* Allow using entity's join_key in get_online_features ([#2420](#2420)) ([068c765](068c765))
* Data Source Api Update ([#2468](#2468)) ([6b96b21](6b96b21))
* Go server ([#2339](#2339)) ([d12e7ef](d12e7ef)), closes [#2354](#2354) [#2361](#2361) [#2332](#2332) [#2356](#2356) [#2363](#2363) [#2349](#2349) [#2355](#2355) [#2336](#2336) [#2361](#2361) [#2363](#2363) [#2344](#2344) [#2354](#2354) [#2347](#2347) [#2350](#2350) [#2356](#2356) [#2355](#2355) [#2349](#2349) [#2352](#2352) [#2341](#2341) [#2336](#2336) [#2373](#2373) [#2315](#2315) [#2372](#2372) [#2332](#2332) [#2349](#2349) [#2336](#2336) [#2361](#2361) [#2363](#2363) [#2344](#2344) [#2354](#2354) [#2347](#2347) [#2350](#2350) [#2356](#2356) [#2355](#2355) [#2349](#2349) [#2352](#2352) [#2341](#2341) [#2336](#2336) [#2373](#2373) [#2379](#2379) [#2380](#2380) [#2382](#2382) [#2364](#2364) [#2366](#2366) [#2386](#2386)
* Graduate write_to_online_store out of experimental status ([#2426](#2426)) ([e7dd4b7](e7dd4b7))
* Make feast PEP 561 compliant ([#2405](#2405)) ([3c41f94](3c41f94)), closes [#2420](#2420) [#2418](#2418) [#2425](#2425) [#2426](#2426) [#2427](#2427) [#2431](#2431) [#2433](#2433) [#2420](#2420) [#2418](#2418) [#2425](#2425) [#2426](#2426) [#2427](#2427) [#2431](#2431) [#2433](#2433)
* Makefile for contrib for Issue [#2364](#2364) ([#2366](#2366)) ([a02325b](a02325b))
* Support on demand feature views in go feature server ([#2494](#2494)) ([6edd274](6edd274))
* Switch from Feature to Field ([#2514](#2514)) ([6a03bed](6a03bed))
* Use a daemon thread to monitor the go feature server exclusively ([#2391](#2391)) ([0bb5e8c](0bb5e8c))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DynamoDB batch retrieval does slow sequential gets (online_read)
7 participants