diff --git a/docs/README.md b/docs/README.md index 8d5be25ae2..b838e5fe5b 100644 --- a/docs/README.md +++ b/docs/README.md @@ -65,7 +65,7 @@ Explore the following resources to get started with Feast: * [Quickstart](getting-started/quickstart.md) is the fastest way to get started with Feast * [Concepts](getting-started/concepts/) describes all important Feast API concepts * [Architecture](getting-started/architecture-and-components/) describes Feast's overall architecture. -* [Tutorials](tutorials/tutorials-overview.md) shows full examples of using Feast in machine learning applications. +* [Tutorials](tutorials/tutorials-overview/) shows full examples of using Feast in machine learning applications. * [Running Feast with Snowflake/GCP/AWS](how-to-guides/feast-snowflake-gcp-aws/) provides a more in-depth guide to using Feast. * [Reference](reference/feast-cli-commands.md) contains detailed API and design documents. * [Contributing](project/contributing.md) contains resources for anyone who wants to contribute to Feast. diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md index 2a0edd16b0..410ca6a5c6 100644 --- a/docs/SUMMARY.md +++ b/docs/SUMMARY.md @@ -10,7 +10,7 @@ * [Quickstart](getting-started/quickstart.md) * [Concepts](getting-started/concepts/README.md) * [Overview](getting-started/concepts/overview.md) - * [Data source](getting-started/concepts/data-source.md) + * [Data ingestion](getting-started/concepts/data-ingestion.md) * [Entity](getting-started/concepts/entity.md) * [Feature view](getting-started/concepts/feature-view.md) * [Feature retrieval](getting-started/concepts/feature-retrieval.md) @@ -31,11 +31,11 @@ ## Tutorials -* [Overview](tutorials/tutorials-overview.md) -* [Driver ranking](tutorials/driver-ranking-with-feast.md) -* [Fraud detection on GCP](tutorials/fraud-detection.md) -* [Real-time credit scoring on AWS](tutorials/real-time-credit-scoring-on-aws.md) -* [Driver stats on Snowflake](tutorials/driver-stats-on-snowflake.md) +* [Sample use-case tutorials](tutorials/tutorials-overview/README.md) + * [Driver ranking](tutorials/tutorials-overview/driver-ranking-with-feast.md) + * [Fraud detection on GCP](tutorials/tutorials-overview/fraud-detection.md) + * [Real-time credit scoring on AWS](tutorials/tutorials-overview/real-time-credit-scoring-on-aws.md) + * [Driver stats on Snowflake](tutorials/tutorials-overview/driver-stats-on-snowflake.md) * [Validating historical features with Great Expectations](tutorials/validating-historical-features.md) * [Using Scalable Registry](tutorials/using-scalable-registry.md) * [Building streaming features](tutorials/building-streaming-features.md) @@ -50,12 +50,12 @@ * [Load data into the online store](how-to-guides/feast-snowflake-gcp-aws/load-data-into-the-online-store.md) * [Read features from the online store](how-to-guides/feast-snowflake-gcp-aws/read-features-from-the-online-store.md) * [Running Feast in production](how-to-guides/running-feast-in-production.md) -* [Upgrading from Feast 0.9](https://docs.google.com/document/u/1/d/1AOsr\_baczuARjCpmZgVd8mCqTF4AZ49OEyU4Cn-uTT0/edit) * [Upgrading for Feast 0.20+](how-to-guides/automated-feast-upgrade.md) -* [Adding a customer provider](how-to-guides/creating-a-custom-provider.md) -* [Adding a custom batch materialization engine](how-to-guides/creating-a-custom-materialization-engine.md) -* [Adding a new online store](how-to-guides/adding-support-for-a-new-online-store.md) -* [Adding a new offline store](how-to-guides/adding-a-new-offline-store.md) +* [Customizing 
Feast](how-to-guides/customizing-feast/README.md)
+  * [Adding a custom batch materialization engine](how-to-guides/customizing-feast/creating-a-custom-materialization-engine.md)
+  * [Adding a new offline store](how-to-guides/customizing-feast/adding-a-new-offline-store.md)
+  * [Adding a new online store](how-to-guides/customizing-feast/adding-support-for-a-new-online-store.md)
+  * [Adding a custom provider](how-to-guides/customizing-feast/creating-a-custom-provider.md)
* [Adding or reusing tests](how-to-guides/adding-or-reusing-tests.md)

## Reference

diff --git a/docs/getting-started/architecture-and-components/batch-materialization-engine.md b/docs/getting-started/architecture-and-components/batch-materialization-engine.md
index fb3c83ccb4..7be22fe125 100644
--- a/docs/getting-started/architecture-and-components/batch-materialization-engine.md
+++ b/docs/getting-started/architecture-and-components/batch-materialization-engine.md
@@ -4,7 +4,6 @@ A batch materialization engine is a component of Feast that's responsible for mo

A materialization engine abstracts over specific technologies or frameworks that are used to materialize data. It allows users to use a pure local serialized approach (which is the default LocalMaterializationEngine), or delegates the materialization to separate components (e.g. AWS Lambda, as implemented by the LambdaMaterializationEngine).

-If the built-in engines are not sufficient, you can create your own custom materialization engine. Please see [this guide](../../how-to-guides/creating-a-custom-materialization-engine.md) for more details.
+If the built-in engines are not sufficient, you can create your own custom materialization engine. Please see [this guide](../../how-to-guides/customizing-feast/creating-a-custom-materialization-engine.md) for more details.

Please see [feature\_store.yaml](../../reference/feature-repository/feature-store-yaml.md#overview) for configuring engines.
-
diff --git a/docs/getting-started/architecture-and-components/offline-store.md b/docs/getting-started/architecture-and-components/offline-store.md
index bd6d55fd1e..c59a526a53 100644
--- a/docs/getting-started/architecture-and-components/offline-store.md
+++ b/docs/getting-started/architecture-and-components/offline-store.md
@@ -1,6 +1,6 @@
 # Offline store

-An offline store is an interface for working with historical time-series feature values that are stored in [data sources](../../getting-started/concepts/data-source.md).
+An offline store is an interface for working with historical time-series feature values that are stored in [data sources](../../getting-started/concepts/data-ingestion.md).

The `OfflineStore` interface has several different implementations, such as `BigQueryOfflineStore`, each of which is backed by a different storage and compute engine. For more details on which offline stores are supported, please see [Offline Stores](../../reference/offline-stores/).
diff --git a/docs/getting-started/architecture-and-components/provider.md b/docs/getting-started/architecture-and-components/provider.md
index 9eadf73ded..89f01c4e5b 100644
--- a/docs/getting-started/architecture-and-components/provider.md
+++ b/docs/getting-started/architecture-and-components/provider.md
@@ -1,10 +1,9 @@
 # Provider

-A provider is an implementation of a feature store using specific feature store components \(e.g. offline store, online store\) targeting a specific environment \(e.g. GCP stack\).
+A provider is an implementation of a feature store using specific feature store components (e.g. 
offline store, online store) targeting a specific environment (e.g. GCP stack). -Providers orchestrate various components \(offline store, online store, infrastructure, compute\) inside an environment. For example, the `gcp` provider supports [BigQuery](https://cloud.google.com/bigquery) as an offline store and [Datastore](https://cloud.google.com/datastore) as an online store, ensuring that these components can work together seamlessly. Feast has three built-in providers \(`local`, `gcp`, and `aws`\) with default configurations that make it easy for users to start a feature store in a specific environment. These default configurations can be overridden easily. For instance, you can use the `gcp` provider but use Redis as the online store instead of Datastore. +Providers orchestrate various components (offline store, online store, infrastructure, compute) inside an environment. For example, the `gcp` provider supports [BigQuery](https://cloud.google.com/bigquery) as an offline store and [Datastore](https://cloud.google.com/datastore) as an online store, ensuring that these components can work together seamlessly. Feast has three built-in providers (`local`, `gcp`, and `aws`) with default configurations that make it easy for users to start a feature store in a specific environment. These default configurations can be overridden easily. For instance, you can use the `gcp` provider but use Redis as the online store instead of Datastore. -If the built-in providers are not sufficient, you can create your own custom provider. Please see [this guide](../../how-to-guides/creating-a-custom-provider.md) for more details. +If the built-in providers are not sufficient, you can create your own custom provider. Please see [this guide](../../how-to-guides/customizing-feast/creating-a-custom-provider.md) for more details. Please see [feature\_store.yaml](../../reference/feature-repository/feature-store-yaml.md#overview) for configuring providers. - diff --git a/docs/getting-started/concepts/README.md b/docs/getting-started/concepts/README.md index fd6c8173a7..e805e3b486 100644 --- a/docs/getting-started/concepts/README.md +++ b/docs/getting-started/concepts/README.md @@ -4,8 +4,8 @@ [overview.md](overview.md) {% endcontent-ref %} -{% content-ref url="data-source.md" %} -[data-source.md](data-source.md) +{% content-ref url="data-ingestion.md" %} +[data-ingestion.md](data-ingestion.md) {% endcontent-ref %} {% content-ref url="entity.md" %} diff --git a/docs/getting-started/concepts/data-ingestion.md b/docs/getting-started/concepts/data-ingestion.md new file mode 100644 index 0000000000..8361b5f79f --- /dev/null +++ b/docs/getting-started/concepts/data-ingestion.md @@ -0,0 +1,86 @@ +# Data ingestion + +### Data source + +The data source refers to raw underlying data (e.g. a table in BigQuery). + +Feast uses a time-series data model to represent data. This data model is used to interpret feature data in data sources in order to build training datasets or when materializing features into an online store. + +Below is an example data source with a single entity (`driver`) and two features (`trips_today`, and `rating`). + +![Ride-hailing data source](<../../.gitbook/assets/image (16).png>) + +Feast supports primarily **time-stamped** tabular data as data sources. There are many kinds of possible data sources: + +* **Batch data sources:** ideally, these live in data warehouses (BigQuery, Snowflake, Redshift), but can be in data lakes (S3, GCS, etc). Feast supports ingesting and querying data across both. 
* **Stream data sources**: Feast does **not** have native streaming integrations. It does, however, facilitate making streaming features available in different environments. There are two kinds of sources:
  * **Push sources** allow users to push features into Feast, and make them available for training / batch scoring ("offline"), for real-time feature serving ("online"), or both.
  * **\[Alpha] Stream sources** allow users to register metadata from Kafka or Kinesis sources. The onus is on the user to ingest from these sources, though Feast provides some limited helper methods to ingest directly from Kafka / Kinesis topics.
* **(Experimental) Request data sources:** This is data that is only available at request time (e.g. from a user action that needs an immediate model prediction response). This is primarily relevant as an input into **on-demand feature views**, which allow light-weight feature engineering and combining features across sources.

### Batch data ingestion

Ingesting data from batch sources into the online store is only necessary when powering real-time models. This is done through **materialization**. Under the hood, Feast manages an _offline store_ (to scalably generate training data from batch sources) and an _online store_ (to provide low-latency access to features for real-time models).

A key command in Feast is `materialize_incremental`, which fetches the latest values for all entities in the batch source and ingests these values into the online store.

Materialization can be called programmatically or through the CLI:
<details>

<summary>Code example: programmatic scheduled materialization</summary>

This snippet creates a feature store object which points to the registry (which knows of all defined features) and the online store (DynamoDB in this case), and then calls `materialize_incremental` to load the latest feature values into the online store:

```python
import datetime

from airflow.operators.python import PythonOperator
from feast import FeatureStore, RepoConfig
from feast.infra.online_stores.dynamodb import DynamoDBOnlineStoreConfig
from feast.repo_config import RegistryConfig

# Define Python callable
def materialize():
    repo_config = RepoConfig(
        registry=RegistryConfig(path="s3://[YOUR BUCKET]/registry.pb"),
        project="feast_demo_aws",
        provider="aws",
        offline_store="file",
        online_store=DynamoDBOnlineStoreConfig(region="us-west-2")
    )
    store = FeatureStore(config=repo_config)
    store.materialize_incremental(datetime.datetime.now())

# (In production) Use Airflow PythonOperator
materialize_python = PythonOperator(
    task_id='materialize_python',
    python_callable=materialize,
)
```
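If you are scheduling this with Airflow, the `PythonOperator` above would sit inside a DAG. A minimal sketch, assuming Airflow 2.x (the DAG id and schedule here are illustrative, not part of Feast):

```python
import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="feast_materialize",           # hypothetical DAG name
    schedule_interval="@hourly",          # refresh the online store every hour
    start_date=pendulum.datetime(2022, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    materialize_python = PythonOperator(
        task_id="materialize_python",
        python_callable=materialize,      # the callable defined above
    )
```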
</details>
<details>

<summary>Code example: CLI based materialization</summary>

#### How to run this in the CLI

```bash
CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME
```

#### How to run this on Airflow

```python
import datetime

from airflow.operators.bash import BashOperator

# Use BashOperator
materialize_bash = BashOperator(
    task_id='materialize',
    bash_command=f'feast materialize-incremental {datetime.datetime.now().replace(microsecond=0).isoformat()}',
)
```
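One caveat with the snippet above: the f-string resolves `datetime.datetime.now()` when the DAG file is parsed, not when the task runs. A common alternative (a sketch, not Feast-specific) is to let Airflow template the run timestamp instead:

```python
from airflow.operators.bash import BashOperator

# {{ ts }} is Airflow's templated execution timestamp (ISO 8601),
# resolved per task run rather than when the DAG file is parsed.
materialize_bash = BashOperator(
    task_id="materialize",
    bash_command="feast materialize-incremental {{ ts }}",
)
```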
</details>

### Stream data ingestion

Ingesting from stream sources happens either via a Push API or via a contrib processor that leverages an existing Spark context.

* To **push data into the offline or online stores**: see [push sources](../../reference/data-sources/push.md) for details.
* (experimental) To **use a contrib Spark processor** to ingest from a topic, see [Tutorial: Building streaming features](../../tutorials/building-streaming-features.md).

diff --git a/docs/getting-started/concepts/data-source.md b/docs/getting-started/concepts/data-source.md
deleted file mode 100644
index 46ed24c519..0000000000
--- a/docs/getting-started/concepts/data-source.md
+++ /dev/null
@@ -1,16 +0,0 @@
-# Data source
-
-The data source refers to raw underlying data (e.g. a table in BigQuery).
-
-Feast uses a time-series data model to represent data. This data model is used to interpret feature data in data sources in order to build training datasets or when materializing features into an online store.
-
-Below is an example data source with a single entity (`driver`) and two features (`trips_today`, and `rating`).
-
-![Ride-hailing data source](<../../.gitbook/assets/image (16).png>)
-
-Feast supports primarily **time-stamped** tabular data as data sources. There are many kinds of possible data sources:
-
-* **Batch data sources:** ideally, these live in data warehouses (BigQuery, Snowflake, Redshift), but can be in data lakes (S3, GCS, etc). Feast supports ingesting and querying data across both.
-* **Stream data sources**: Feast does **not** have native streaming integrations. It does however facilitate making streaming features available in different environments. There are two kinds of sources:
-  * **Push sources** allow users to push features into Feast, and make it available for training / batch scoring ("offline"), for realtime feature serving ("online") or both.
-  * **\[Alpha] Stream sources** allow users to register metadata from Kafka or Kinesis sources. The onus is on the user to ingest from these sources, though Feast provides some limited helper methods to ingest directly from Kafka / Kinesis topics.
diff --git a/docs/getting-started/concepts/entity.md b/docs/getting-started/concepts/entity.md
index 54fab50830..1ea3037ef2 100644
--- a/docs/getting-started/concepts/entity.md
+++ b/docs/getting-started/concepts/entity.md
@@ -33,14 +33,11 @@ At _training time_, users control what entities they want to look up, for exampl

At _serving time_, users specify _entity key(s)_ to fetch the latest feature values for to power a real-time model prediction (e.g. a fraud detection model that needs to fetch the transaction user's features).

{% hint style="info" %}
-**Q: Can I retrieve features for **_**all**_** entities in Feast?**
-
-Kind of. \
-
-
-In practice, this is most relevant for _batch scoring models_ (e.g. predict user churn for all existing users) that are offline only. For these use cases, Feast supports generating features for a SQL backed list of entities. There is an [open GitHub issue](/~https://github.com/feast-dev/feast/issues/1611) that welcomes contribution to make this a more intuitive API.
+**Q: Can I retrieve features for all entities?**
+Kind of.
+In practice, this is most relevant for _batch scoring models_ (e.g. predict user churn for all existing users) that are offline only. For these use cases, Feast supports generating features for a SQL-backed list of entities. 
There is an [open GitHub issue](/~https://github.com/feast-dev/feast/issues/1611) that welcomes contributions to make this a more intuitive API.

For _real-time feature retrieval_, there is no out of the box support for this because it would promote expensive and slow scan operations. Users can still pass in a large list of entities for retrieval, but this does not scale well.
{% endhint %}
diff --git a/docs/getting-started/concepts/feature-retrieval.md b/docs/getting-started/concepts/feature-retrieval.md
index 85b7d9c5b7..01dfe96344 100644
--- a/docs/getting-started/concepts/feature-retrieval.md
+++ b/docs/getting-started/concepts/feature-retrieval.md
@@ -1,14 +1,152 @@
 # Feature retrieval

-## Dataset
+## Overview

-A dataset is a collection of rows that is produced by a historical retrieval from Feast in order to train a model. A dataset is produced by a join from one or more feature views onto an entity dataframe. Therefore, a dataset may consist of features from multiple feature views.
+Generally, Feast supports several patterns of feature retrieval:

-**Dataset vs Feature View:** Feature views contain the schema of data and a reference to where data can be found \(through its data source\). Datasets are the actual data manifestation of querying those data sources.
+1. Training data generation (via `feature_store.get_historical_features(...)`)
+2. Offline feature retrieval for batch scoring (via `feature_store.get_historical_features(...)`)
+3. Online feature retrieval for real-time model predictions (via `feature_store.get_online_features(...)`)

-**Dataset vs Data Source:** Datasets are the output of historical retrieval, whereas data sources are the inputs. One or more data sources can be used in the creation of a dataset.
+Each of these retrieval mechanisms accepts:
+
+* some way of specifying entities (to fetch features for)
+* some way to specify the features to fetch (either via [feature services](feature-retrieval.md#feature-services), which group features needed for a model version, or [feature references](feature-retrieval.md#feature-references))
<details>

<summary>How to: generate training data</summary>

Feast abstracts away point-in-time join complexities with the `get_historical_features` API.

It expects an **entity dataframe (or SQL query)** and a **list of feature references (or a feature service)**.

#### **Option 1: using feature references (to pick individual features when exploring data)**

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

entity_df = pd.DataFrame.from_dict(
    {
        "driver_id": [1001, 1002, 1003, 1004, 1001],
        "event_timestamp": [
            datetime(2021, 4, 12, 10, 59, 42),
            datetime(2021, 4, 12, 8, 12, 10),
            datetime(2021, 4, 12, 16, 40, 26),
            datetime(2021, 4, 12, 15, 1, 12),
            datetime.now()
        ]
    }
)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_daily_features:daily_miles_driven"
    ],
).to_df()
print(training_df.head())
```

#### Option 2: using feature services (to version models)

```python
entity_df = pd.DataFrame.from_dict(
    {
        "driver_id": [1001, 1002, 1003, 1004, 1001],
        "event_timestamp": [
            datetime(2021, 4, 12, 10, 59, 42),
            datetime(2021, 4, 12, 8, 12, 10),
            datetime(2021, 4, 12, 16, 40, 26),
            datetime(2021, 4, 12, 15, 1, 12),
            datetime.now()
        ]
    }
)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=store.get_feature_service("model_v1"),
).to_df()
print(training_df.head())
```
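For large training datasets, the same retrieval job can be materialized as an Arrow table instead of a pandas DataFrame; a small sketch, reusing `store` and `entity_df` from above:

```python
# Avoids an extra pandas conversion for large results.
training_table = store.get_historical_features(
    entity_df=entity_df,
    features=store.get_feature_service("model_v1"),
).to_arrow()
```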
</details>
<details>

<summary>How to: retrieve offline features for batch scoring</summary>

The main difference here from training data generation is how to handle timestamps in the entity dataframe. You want to pass in the **current time** to get the latest feature values for all your entities.

#### Option 1: fetching features with entity dataframe

```python
from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path=".")

# Get the latest feature values for unique entities
entity_df = pd.DataFrame.from_dict({"driver_id": [1001, 1002, 1003, 1004, 1005]})
entity_df["event_timestamp"] = pd.to_datetime("now", utc=True)
batch_scoring_features = store.get_historical_features(
    entity_df=entity_df, features=store.get_feature_service("model_v2"),
).to_df()
# predictions = model.predict(batch_scoring_features)
```

#### Option 2: fetching features using a SQL query to generate entities

```python
from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path=".")

# Get the latest feature values for unique entities
batch_scoring_features = store.get_historical_features(
    entity_df="""
    SELECT
        user_id,
        CURRENT_TIME() as event_timestamp
    FROM entity_source_table
    WHERE user_last_active_time BETWEEN '2019-01-01' and '2020-12-31'
    GROUP BY user_id
    """,
    features=store.get_feature_service("model_v2"),
).to_df()
# predictions = model.predict(batch_scoring_features)
```
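`get_historical_features` also accepts a `full_feature_names` flag; setting it keeps the feature view prefix on output columns, which avoids collisions when two feature views define features with the same name. A sketch, reusing `store` and `entity_df` from Option 1:

```python
# Columns come back as e.g. "driver_hourly_stats__conv_rate" instead of "conv_rate".
batch_scoring_features = store.get_historical_features(
    entity_df=entity_df,
    features=store.get_feature_service("model_v2"),
    full_feature_names=True,
).to_df()
```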
</details>
<details>

<summary>How to: retrieve online features for real-time model inference</summary>

Feast will ensure the latest feature values for registered features are available. At retrieval time, you need to supply a list of **entities** and the corresponding **features** to be retrieved. Similar to `get_historical_features`, we recommend using feature services as a mechanism for grouping features in a model version.

_Note: unlike `get_historical_features`, the `entity_rows` **do not need timestamps** since you only want one feature value per entity key._

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

features = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_daily_features:daily_miles_driven",
    ],
    entity_rows=[
        {
            "driver_id": 1001,
        }
    ],
).to_dict()
```

</details>
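The resulting dictionary maps each requested feature name (plus the join key) to a list of values aligned with `entity_rows`. A quick sketch of consuming it, assuming the default non-prefixed feature names:

```python
# One value per entity row; here there is a single row for driver 1001.
model_input = [
    features["conv_rate"][0],
    features["acc_rate"][0],
    features["daily_miles_driven"][0],
]
# prediction = model.predict([model_input])
```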
## Feature Services + A feature service is an object that represents a logical group of features from one or more [feature views](feature-view.md#feature-view). Feature Services allows features from within a feature view to be used as needed by an ML model. Users can expect to create one feature service per model version, allowing for tracking of the features used by models. {% tabs %} @@ -38,6 +176,7 @@ Applying a feature service does not result in an actual service being deployed. Feature services enable referencing all or some features from a feature view. Retrieving from the online store with a feature service + ```python from feast import FeatureStore feature_store = FeatureStore('.') # Initialize the feature store @@ -49,6 +188,7 @@ features = feature_store.get_online_features( ``` Retrieving from the offline store with a feature service + ```python from feast import FeatureStore feature_store = FeatureStore('.') # Initialize the feature store @@ -78,7 +218,7 @@ online_features = fs.get_online_features( ) ``` -It is possible to retrieve features from multiple feature views with a single request, and Feast is able to join features from multiple tables in order to build a training dataset. However, It is not possible to reference \(or retrieve\) features from multiple projects at the same time. +It is possible to retrieve features from multiple feature views with a single request, and Feast is able to join features from multiple tables in order to build a training dataset. However, it is not possible to reference (or retrieve) features from multiple projects at the same time. {% hint style="info" %} Note, if you're using [Feature views without entities](feature-view.md#feature-views-without-entities), then those features can be added here without additional entity values in the `entity_rows` @@ -90,3 +230,10 @@ The timestamp on which an event occurred, as found in a feature view's data sour Event timestamps are used during point-in-time joins to ensure that the latest feature values are joined from feature views onto entity rows. Event timestamps are also used to ensure that old feature values aren't served to models during online serving. +## Dataset + +A dataset is a collection of rows that is produced by a historical retrieval from Feast in order to train a model. A dataset is produced by a join from one or more feature views onto an entity dataframe. Therefore, a dataset may consist of features from multiple feature views. + +**Dataset vs Feature View:** Feature views contain the schema of data and a reference to where data can be found (through its data source). Datasets are the actual data manifestation of querying those data sources. + +**Dataset vs Data Source:** Datasets are the output of historical retrieval, whereas data sources are the inputs. One or more data sources can be used in the creation of a dataset. diff --git a/docs/getting-started/concepts/feature-view.md b/docs/getting-started/concepts/feature-view.md index ce8aa2b0d6..ed5b5f3a9f 100644 --- a/docs/getting-started/concepts/feature-view.md +++ b/docs/getting-started/concepts/feature-view.md @@ -2,7 +2,23 @@ ## Feature views -A feature view is an object that represents a logical group of time-series feature data as it is found in a [data source](data-source.md). Feature views consist of zero or more [entities](entity.md), one or more [features](feature-view.md#feature), and a [data source](data-source.md). 
Feature views allow Feast to model your existing feature data in a consistent way in both an offline (training) and online (serving) environment. Feature views generally contain features that are properties of a specific object, in which case that object is defined as an entity and included in the feature view. If the features are not related to a specific object, the feature view might not have entities; see [feature views without entities](feature-view.md#feature-views-without-entities) below. +{% hint style="warning" %} +**Note**: feature views do not work with non-timestamped data. A workaround is to insert dummy timestamps +{% endhint %} + +A feature view is an object that represents a logical group of time-series feature data as it is found in a [data source](data-ingestion.md). Depending on the kind of feature view, it may contain some lightweight (experimental) feature transformations (see [\[Alpha\] On demand feature views](feature-view.md#alpha-on-demand-feature-views)). + +Feature views consist of: + +* a [data source](data-ingestion.md) +* zero or more [entities](entity.md) + * If the features are not related to a specific object, the feature view might not have entities; see [feature views without entities](feature-view.md#feature-views-without-entities) below. +* a name to uniquely identify this feature view in the project. +* (optional, but recommended) a schema specifying one or more [features](feature-view.md#feature) (without this, Feast will infer the schema by reading from the data source) +* (optional, but recommended) metadata (for example, description, or other free-form metadata via `tags`) +* (optional) a TTL, which limits how far back Feast will look when generating historical datasets + +Feature views allow Feast to model your existing feature data in a consistent way in both an offline (training) and online (serving) environment. Feature views generally contain features that are properties of a specific object, in which case that object is defined as an entity and included in the feature view. {% tabs %} {% tab title="driver_trips_feature_view.py" %} @@ -31,10 +47,6 @@ Feature views are used during * Loading of feature values into an online store. Feature views determine the storage schema in the online store. Feature values can be loaded from batch sources or from [stream sources](../../reference/data-sources/push.md). * Retrieval of features from the online store. Feature views provide the schema definition to Feast in order to look up features from the online store. -{% hint style="info" %} -Feast does not generate feature values. It acts as the ingestion and serving system. The data sources described within feature views should reference feature values in their already computed form. -{% endhint %} - ## Feature views without entities If a feature view contains features that are not related to a specific entity, the feature view can be defined without entities (only event timestamps are needed for this feature view). @@ -132,16 +144,30 @@ trips_today = Field( ) ``` -Together with [data sources](data-source.md), they indicate to Feast where to find your feature values, e.g., in a specific parquet file or BigQuery table. Feature definitions are also used when reading features from the feature store, using [feature references](feature-retrieval.md#feature-references). +Together with [data sources](data-ingestion.md), they indicate to Feast where to find your feature values, e.g., in a specific parquet file or BigQuery table. 
Feature definitions are also used when reading features from the feature store, using [feature references](feature-retrieval.md#feature-references). Feature names must be unique within a [feature view](feature-view.md#feature-view). ## \[Alpha] On demand feature views -On demand feature views allows users to use existing features and request time data (features only available at request time) to transform and create new features. Users define python transformation logic which is executed in both historical retrieval and online retrieval paths: +On demand feature views allows data scientists to use existing features and request time data (features only available at request time) to transform and create new features. Users define python transformation logic which is executed in both historical retrieval and online retrieval paths. + +Currently, these transformations are executed locally. This is fine for online serving, but does not scale well offline. + +### Why use on demand feature views? + +This enables data scientists to easily impact the online feature retrieval path. For example, a data scientist could + +1. Call `get_historical_features` to generate a training dataframe +2. Iterate in notebook on feature engineering in Pandas +3. Copy transformation logic into on demand feature views and commit to a dev branch of the feature repository +4. Verify with `get_historical_features` (on a small dataset) that the transformation gives expected output over historical data +5. Verify with `get_online_features` on dev branch that the transformation correctly outputs online features +6. Submit a pull request to the staging / prod branches which impact production traffic ```python from feast import Field, RequestSource +from feast.on_demand_feature_view import on_demand_feature_view from feast.types import Float64 # Define a request data source which encodes features / information only diff --git a/docs/getting-started/concepts/overview.md b/docs/getting-started/concepts/overview.md index 7134073792..ffbad86c03 100644 --- a/docs/getting-started/concepts/overview.md +++ b/docs/getting-started/concepts/overview.md @@ -1,14 +1,29 @@ # Overview -The top-level namespace within Feast is a [project](overview.md#project). Users define one or more [feature views](feature-view.md) within a project. Each feature view contains one or more [features](feature-view.md#feature). These features typically relate to one or more [entities](entity.md). A feature view must always have a [data source](data-source.md), which in turn is used during the generation of training [datasets](feature-retrieval.md#dataset) and when materializing feature values into the online store. +### Feast project structure -![](../../.gitbook/assets/image%20%287%29.png) +The top-level namespace within Feast is a **project**. Users define one or more [feature views](feature-view.md) within a project. Each feature view contains one or more [features](feature-view.md#feature). These features typically relate to one or more [entities](entity.md). A feature view must always have a [data source](data-ingestion.md), which in turn is used during the generation of training [datasets](feature-retrieval.md#dataset) and when materializing feature values into the online store. -## Project +![](<../../.gitbook/assets/image (7).png>) -Projects provide complete isolation of feature stores at the infrastructure level. This is accomplished through resource namespacing, e.g., prefixing table names with the associated project. 
Each project should be considered a completely separate universe of entities and features. It is not possible to retrieve features from multiple projects in a single request. We recommend having a single feature store and a single project per environment \(`dev`, `staging`, `prod`\). +**Projects** provide complete isolation of feature stores at the infrastructure level. This is accomplished through resource namespacing, e.g., prefixing table names with the associated project. Each project should be considered a completely separate universe of entities and features. It is not possible to retrieve features from multiple projects in a single request. We recommend having a single feature store and a single project per environment (`dev`, `staging`, `prod`). -{% hint style="info" %} -Projects are currently being supported for backward compatibility reasons. Projects may change in the future as we simplify the Feast API. -{% endhint %} +### Data ingestion +For _offline use cases_ that only rely on batch data, Feast does not need to ingest data and can query your existing data (leveraging a compute engine, whether it be a data warehouse or (experimental) Spark / Trino). Feast can help manage **pushing** streaming features to a batch source to make features available for training. + +For _online use cases_, Feast supports **ingesting** features from batch sources to make them available online (through a process called **materialization**), and **pushing** streaming features to make them available both offline / online. We explore this more in the next concept page ([Data ingestion](data-ingestion.md)) + +### Feature registration and retrieval + +Features are _registered_ as code in a version controlled repository, and tie to data sources + model versions via the concepts of **entities, feature views,** and **feature services.** We explore these concepts more in the upcoming concept pages. These features are then _stored_ in a **registry**, which can be accessed across users and services. The features can then be _retrieved_ via SDK API methods or via a deployed **feature server** which exposes endpoints to query for online features (to power real time models). + + + +Feast supports several patterns of feature retrieval. + +| Use case | Example | API | +| :------------------------------------------------------: | :----------------------------------------------------------------------------------------------------: | :-----------------------: | +| Training data generation | Fetching user and item features for (user, item) pairs when training a production recommendation model | `get_historical_features` | +| Offline feature retrieval for batch predictions | Predicting user churn for all users on a daily basis | `get_historical_features` | +| Online feature retrieval for real-time model predictions | Fetching pre-computed features to predict whether a real-time credit card transaction is fraudulent | `get_online_features` | diff --git a/docs/getting-started/faq.md b/docs/getting-started/faq.md index b2438fdf7a..047c888ad0 100644 --- a/docs/getting-started/faq.md +++ b/docs/getting-started/faq.md @@ -10,7 +10,7 @@ We encourage you to ask questions on [Slack](https://slack.feast.dev) or [GitHub ### Do you have any examples of how Feast should be used? -The [quickstart](quickstart.md) is the easiest way to learn about Feast. For more detailed tutorials, please check out the [tutorials](../tutorials/tutorials-overview.md) page. +The [quickstart](quickstart.md) is the easiest way to learn about Feast. 
For more detailed tutorials, please check out the [tutorials](../tutorials/tutorials-overview/) page. ## Concepts @@ -19,13 +19,14 @@ The [quickstart](quickstart.md) is the easiest way to learn about Feast. For mor No, there are [feature views without entities](concepts/feature-view.md#feature-views-without-entities). ### How does Feast handle model or feature versioning? -Feast expects that each version of a model corresponds to a different feature service. -Feature views once they are used by a feature service are intended to be immutable and not deleted (until a feature service is removed). In the future, `feast plan` and `feast apply will throw errors if it sees this kind of behavior. +Feast expects that each version of a model corresponds to a different feature service. + +Feature views once they are used by a feature service are intended to be immutable and not deleted (until a feature service is removed). In the future, `feast plan` and `feast apply` will throw errors if it sees this kind of behavior. ### What is the difference between data sources and the offline store? -The data source itself defines the underlying data warehouse table in which the features are stored. The offline store interface defines the APIs required to make an arbitrary compute layer work for Feast (e.g. pulling features given a set of feature views from their sources, exporting the data set results to different formats). Please see [data sources](concepts/data-source.md) and [offline store](architecture-and-components/offline-store.md) for more details. +The data source itself defines the underlying data warehouse table in which the features are stored. The offline store interface defines the APIs required to make an arbitrary compute layer work for Feast (e.g. pulling features given a set of feature views from their sources, exporting the data set results to different formats). Please see [data sources](concepts/data-ingestion.md) and [offline store](architecture-and-components/offline-store.md) for more details. ### Is it possible to have offline and online stores from different providers? @@ -34,6 +35,7 @@ Yes, this is possible. For example, you can use BigQuery as an offline store and ## Functionality ### How do I run `get_historical_features` without providing an entity dataframe? + Feast does not provide a way to do this right now. This is an area we're actively interested in contributions for. See [GitHub issue](/~https://github.com/feast-dev/feast/issues/1611) ### Does Feast provide security or access control? @@ -49,14 +51,16 @@ Yes. In earlier versions of Feast, we used Feast Spark to manage ingestion from ### Does Feast support feature transformation? There are several kinds of transformations: -- On demand transformations (See [docs](../reference/alpha-on-demand-feature-view.md)) - - These transformations are Pandas transformations run on batch data when you call `get_historical_features` and at online serving time when you call `get_online_features. - - Note that if you use push sources to ingest streaming features, these transformations will execute on the fly as well -- Batch transformations (WIP, see [RFC](https://docs.google.com/document/d/1964OkzuBljifDvkV-0fakp2uaijnVzdwWNGdz7Vz50A/edit#)) - - These will include SQL + PySpark based transformations on batch data sources. 
-- Streaming transformations (RFC in progress) + +* On demand transformations (See [docs](../reference/alpha-on-demand-feature-view.md)) + * These transformations are Pandas transformations run on batch data when you call `get_historical_features` and at online serving time when you call \`get\_online\_features. + * Note that if you use push sources to ingest streaming features, these transformations will execute on the fly as well +* Batch transformations (WIP, see [RFC](https://docs.google.com/document/d/1964OkzuBljifDvkV-0fakp2uaijnVzdwWNGdz7Vz50A/edit)) + * These will include SQL + PySpark based transformations on batch data sources. +* Streaming transformations (RFC in progress) ### Does Feast have a Web UI? + Yes. See [documentation](../reference/alpha-web-ui.md). ### Does Feast support composite keys? @@ -84,7 +88,7 @@ Yes. Specifically: ### Does Feast support X storage engine? -The list of supported offline and online stores can be found [here](../reference/offline-stores/) and [here](../reference/online-stores/), respectively. The [roadmap](../roadmap.md) indicates the stores for which we are planning to add support. Finally, our Provider abstraction is built to be extensible, so you can plug in your own implementations of offline and online stores. Please see more details about custom providers [here](../how-to-guides/creating-a-custom-provider.md). +The list of supported offline and online stores can be found [here](../reference/offline-stores/) and [here](../reference/online-stores/), respectively. The [roadmap](../roadmap.md) indicates the stores for which we are planning to add support. Finally, our Provider abstraction is built to be extensible, so you can plug in your own implementations of offline and online stores. Please see more details about customizing Feast [here](../how-to-guides/customizing-feast/). ### Does Feast support using different clouds for offline vs online stores? @@ -92,7 +96,7 @@ Yes. Using a GCP or AWS provider in `feature_store.yaml` primarily sets default ### How can I add a custom online store? -Please follow the instructions [here](../how-to-guides/adding-support-for-a-new-online-store.md). +Please follow the instructions [here](../how-to-guides/customizing-feast/adding-support-for-a-new-online-store.md). ### Can the same storage engine be used for both the offline and online store? @@ -105,10 +109,6 @@ Yes. There are two ways to use S3 in Feast: * Using Redshift as a data source via Spectrum ([AWS tutorial](https://docs.aws.amazon.com/redshift/latest/dg/tutorial-nested-data-create-table.html)), and then continuing with the [Running Feast with Snowflake/GCP/AWS](../how-to-guides/feast-snowflake-gcp-aws/) guide. See a [presentation](https://youtu.be/pMFbRJ7AnBk?t=9463) we did on this at our apply() meetup. * Using the `s3_endpoint_override` in a `FileSource` data source. This endpoint is more suitable for quick proof of concepts that won't necessarily scale for production use cases. -### How can I use Spark with Feast? - -Feast supports ingestion via Spark (See ) does not support Spark natively. However, you can create a [custom provider](../how-to-guides/creating-a-custom-provider.md) that will support Spark, which can help with more scalable materialization and ingestion. - ### Is Feast planning on supporting X functionality? Please see the [roadmap](../roadmap.md). @@ -119,7 +119,6 @@ Please see the [roadmap](../roadmap.md). 
For more details on contributing to the Feast community, see [here](../community.md) and this [here](../project/contributing.md). - ## Feast 0.9 (legacy) ### What is the difference between Feast 0.9 and Feast 0.10+? @@ -130,7 +129,6 @@ Feast 0.10+ is much lighter weight and more extensible than Feast 0.9. It is des Please see this [document](https://docs.google.com/document/d/1AOsr\_baczuARjCpmZgVd8mCqTF4AZ49OEyU4Cn-uTT0). If you have any questions or suggestions, feel free to leave a comment on the document! - ### What are the plans for Feast Core, Feast Serving, and Feast Spark? -Feast Core and Feast Serving were both part of Feast Java. We plan to support Feast Serving. We will not support Feast Core; instead we will support our object store based registry. We will not support Feast Spark. For more details on what we plan on supporting, please see the [roadmap](../roadmap.md). \ No newline at end of file +Feast Core and Feast Serving were both part of Feast Java. We plan to support Feast Serving. We will not support Feast Core; instead we will support our object store based registry. We will not support Feast Spark. For more details on what we plan on supporting, please see the [roadmap](../roadmap.md). diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md index ba7b871d01..fb73c1ba8c 100644 --- a/docs/getting-started/quickstart.md +++ b/docs/getting-started/quickstart.md @@ -13,15 +13,15 @@ You can run this tutorial in Google Colab or run it on your localhost, following ## Overview -In this tutorial, we use feature stores to generate training data and power online model inference for a ride-sharing driver satisfaction prediction model. Feast solves several common issues in this flow: +In this tutorial, we use feature stores to generate training data and power online model inference for a ride-sharing driver satisfaction prediction model. Feast solves several common issues in this flow: 1. **Training-serving skew and complex data joins:** Feature values often exist across multiple tables. Joining these datasets can be complicated, slow, and error-prone. * Feast joins these tables with battle-tested logic that ensures _point-in-time_ correctness so future feature values do not leak to models. * Feast alerts users to offline / online skew with data quality monitoring -2. **Online feature availability:** At inference time, models often need access to features that aren't readily available and need to be precomputed from other datasources. +2. **Online feature availability:** At inference time, models often need access to features that aren't readily available and need to be precomputed from other datasources. * Feast manages deployment to a variety of online stores (e.g. DynamoDB, Redis, Google Cloud Datastore) and ensures necessary features are consistently _available_ and _freshly computed_ at inference time. 3. **Feature reusability and model versioning:** Different teams within an organization are often unable to reuse features across projects, resulting in duplicate feature creation logic. Models have data dependencies that need to be versioned, for example when running A/B tests on model versions. - * Feast enables discovery of and collaboration on previously used features and enables versioning of sets of features (via _feature services_). + * Feast enables discovery of and collaboration on previously used features and enables versioning of sets of features (via _feature services_). 
* Feast enables feature transformation so users can re-use transformation logic across online / offline usecases and across models. ## Step 1: Install Feast @@ -40,7 +40,7 @@ pip install feast ## Step 2: Create a feature repository -Bootstrap a new feature repository using `feast init` from the command line. +Bootstrap a new feature repository using `feast init` from the command line. {% tabs %} {% tab title="Bash" %} @@ -134,9 +134,9 @@ Valid values for `provider` in `feature_store.yaml` are: * gcp: use BigQuery/Snowflake with Google Cloud Datastore/Redis * aws: use Redshift/Snowflake with DynamoDB/Redis -Note that there are many other sources Feast works with, including Azure, Hive, Trino, and PostgreSQL via community plugins. See [Third party integrations](../getting-started/third-party-integrations.md) for all supported datasources. +Note that there are many other sources Feast works with, including Azure, Hive, Trino, and PostgreSQL via community plugins. See [Third party integrations](third-party-integrations.md) for all supported datasources. -A custom setup can also be made by following [adding a custom provider](../how-to-guides/creating-a-custom-provider.md). +A custom setup can also be made by following [adding a custom provider](../how-to-guides/customizing-feast/creating-a-custom-provider.md). ### Inspecting the raw data @@ -387,8 +387,6 @@ driver_stats_fs = FeatureService( ) ``` -{% tabs %} -{% tab title="Python" %} ```python from feast import FeatureStore feature_store = FeatureStore('.') # Initialize the feature store @@ -430,6 +428,6 @@ One of the ways to view this is with the `feast ui` command. * Read the [Concepts](concepts/) page to understand the Feast data model. * Read the [Architecture](architecture-and-components/) page. -* Check out our [Tutorials](../tutorials/tutorials-overview.md) section for more examples on how to use Feast. +* Check out our [Tutorials](../tutorials/tutorials-overview/) section for more examples on how to use Feast. * Follow our [Running Feast with Snowflake/GCP/AWS](../how-to-guides/feast-snowflake-gcp-aws/) guide for a more in-depth tutorial on using Feast. * Join other Feast users and contributors in [Slack](https://slack.feast.dev) and become part of the community! diff --git a/docs/getting-started/third-party-integrations.md b/docs/getting-started/third-party-integrations.md index ef47a11029..8e6a600aa0 100644 --- a/docs/getting-started/third-party-integrations.md +++ b/docs/getting-started/third-party-integrations.md @@ -5,13 +5,13 @@ We integrate with a wide set of tools and technologies so you can make Feast wor {% hint style="info" %} Don't see your offline store or online store of choice here? Check out our guides to make a custom one! -* [Adding a new offline store](../how-to-guides/adding-a-new-offline-store.md) -* [Adding a new online store](../how-to-guides/adding-support-for-a-new-online-store.md) +* [Adding a new offline store](../how-to-guides/customizing-feast/adding-a-new-offline-store.md) +* [Adding a new online store](../how-to-guides/customizing-feast/adding-support-for-a-new-online-store.md) {% endhint %} ## Integrations -See [Functionality and Roadmap](../../README.md#-functionality-and-roadmap) +See [Functionality and Roadmap](../../#-functionality-and-roadmap) ## Standards @@ -19,7 +19,7 @@ In order for a plugin integration to be highlighted, it must meet the following 1. The plugin must have tests. 
Ideally it would use the Feast universal tests (see this [guide](../how-to-guides/adding-or-reusing-tests.md) for an example), but custom tests are fine. 2. The plugin must have some basic documentation on how it should be used. -3. The author must work with a maintainer to pass a basic code review (e.g. to ensure that the implementation roughly matches the core Feast implementations). +3. The author must work with a maintainer to pass a basic code review (e.g. to ensure that the implementation roughly matches the core Feast implementations). In order for a plugin integration to be merged into the main Feast repo, it must meet the following requirements: diff --git a/docs/how-to-guides/customizing-feast/README.md b/docs/how-to-guides/customizing-feast/README.md new file mode 100644 index 0000000000..91c04e2f35 --- /dev/null +++ b/docs/how-to-guides/customizing-feast/README.md @@ -0,0 +1,24 @@ +# Customizing Feast + +Feast is highly pluggable and configurable: + +* One can use existing plugins (offline store, online store, batch materialization engine, providers) and configure those using the built in options. See reference documentation for details. +* The other way to customize Feast is to build your own custom components, and then point Feast to delegate to them. + +Below are some guides on how to add new custom components: + +{% content-ref url="adding-a-new-offline-store.md" %} +[adding-a-new-offline-store.md](adding-a-new-offline-store.md) +{% endcontent-ref %} + +{% content-ref url="adding-support-for-a-new-online-store.md" %} +[adding-support-for-a-new-online-store.md](adding-support-for-a-new-online-store.md) +{% endcontent-ref %} + +{% content-ref url="creating-a-custom-materialization-engine.md" %} +[creating-a-custom-materialization-engine.md](creating-a-custom-materialization-engine.md) +{% endcontent-ref %} + +{% content-ref url="creating-a-custom-provider.md" %} +[creating-a-custom-provider.md](creating-a-custom-provider.md) +{% endcontent-ref %} diff --git a/docs/how-to-guides/adding-a-new-offline-store.md b/docs/how-to-guides/customizing-feast/adding-a-new-offline-store.md similarity index 85% rename from docs/how-to-guides/adding-a-new-offline-store.md rename to docs/how-to-guides/customizing-feast/adding-a-new-offline-store.md index c548538fce..91b23eaad5 100644 --- a/docs/how-to-guides/adding-a-new-offline-store.md +++ b/docs/how-to-guides/customizing-feast/adding-a-new-offline-store.md @@ -2,7 +2,7 @@ ## Overview -Feast makes adding support for a new offline store easy. Developers can simply implement the [OfflineStore](../../sdk/python/feast/infra/offline\_stores/offline\_store.py#L41) interface to add support for a new store (other than the existing stores like Parquet files, Redshift, and Bigquery). +Feast makes adding support for a new offline store easy. Developers can simply implement the [OfflineStore](../../../sdk/python/feast/infra/offline\_stores/offline\_store.py#L41) interface to add support for a new store (other than the existing stores like Parquet files, Redshift, and Bigquery). In this guide, we will show you how to extend the existing File offline store and use in a feature repo. While we will be implementing a specific store, this guide should be representative for adding support for any new offline store. @@ -22,7 +22,7 @@ The process for using a custom offline store consists of 8 steps: ## 1. Defining an OfflineStore class {% hint style="info" %} - OfflineStore class names must end with the OfflineStore suffix! 
+OfflineStore class names must end with the OfflineStore suffix! {% endhint %} ### Contrib offline stores @@ -31,23 +31,26 @@ New offline stores go in `sdk/python/feast/infra/offline_stores/contrib/`. #### What is a contrib plugin? -- Not guaranteed to implement all interface methods -- Not guaranteed to be stable. -- Should have warnings for users to indicate this is a contrib plugin that is not maintained by the maintainers. +* Not guaranteed to implement all interface methods +* Not guaranteed to be stable. +* Should have warnings for users to indicate this is a contrib plugin that is not maintained by the maintainers. #### How do I make a contrib plugin an "official" plugin? + To move an offline store plugin out of contrib, you need: -- GitHub actions (i.e `make test-python-integration`) is setup to run all tests against the offline store and pass. -- At least two contributors own the plugin (ideally tracked in our `OWNERS` / `CODEOWNERS` file). + +* GitHub actions (i.e `make test-python-integration`) is setup to run all tests against the offline store and pass. +* At least two contributors own the plugin (ideally tracked in our `OWNERS` / `CODEOWNERS` file). #### Define the offline store class -The OfflineStore class contains a couple of methods to read features from the offline store. Unlike the OnlineStore class, Feast does not manage any infrastructure for the offline store. + +The OfflineStore class contains a couple of methods to read features from the offline store. Unlike the OnlineStore class, Feast does not manage any infrastructure for the offline store. To fully implement the interface for the offline store, you will need to implement these methods: * `pull_latest_from_table_or_query` is invoked when running materialization (using the `feast materialize` or `feast materialize-incremental` commands, or the corresponding `FeatureStore.materialize()` method. This method pull data from the offline store, and the `FeatureStore` class takes care of writing this data into the online store. * `get_historical_features` is invoked when reading values from the offline store using the `FeatureStore.get_historical_features()` method. Typically, this method is used to retrieve features when training ML models. -* (optional) `offline_write_batch` is a method that supports directly pushing a pyarrow table to a feature view. Given a feature view with a specific schema, this function should write the pyarrow table to the batch source defined. More details about the push api can be found [here](docs/reference/data-sources/push.md). This method only needs implementation if you want to support the push api in your offline store. +* (optional) `offline_write_batch` is a method that supports directly pushing a pyarrow table to a feature view. Given a feature view with a specific schema, this function should write the pyarrow table to the batch source defined. More details about the push api can be found [here](../docs/reference/data-sources/push.md). This method only needs implementation if you want to support the push api in your offline store. * (optional) `pull_all_from_table_or_query` is a method that pulls all the data from an offline store from a specified start date to a specified end date. This method is only used for **SavedDatasets** as part of data quality monitoring validation. * (optional) `write_logged_features` is a method that takes a pyarrow table or a path that points to a parquet file and writes the data to a defined source defined by `LoggingSource` and `LoggingConfig`. 
@@ -140,29 +143,30 @@ To fully implement the interface for the offline store, you will need to impleme ) # Implementation here. pass - ``` {% endcode %} ### 1.1 Type Mapping Most offline stores will have to perform some custom mapping of offline store datatypes to feast value types. -- The function to implement here are `source_datatype_to_feast_value_type` and `get_column_names_and_types` in your `DataSource` class. + +* The functions to implement here are `source_datatype_to_feast_value_type` and `get_column_names_and_types` in your `DataSource` class. * `source_datatype_to_feast_value_type` is used to convert your DataSource's datatypes to feast value types. * `get_column_names_and_types` retrieves the column names and corresponding datasource types. Add any helper functions for type conversion to `sdk/python/feast/type_map.py`. -- Be sure to implement correct type mapping so that Feast can process your feature columns without casting incorrectly that can potentially cause loss of information or incorrect data. + +* Be sure to implement correct type mapping so that Feast can process your feature columns without casting them incorrectly, which could cause loss of information or incorrect data. ## 2. Defining an OfflineStoreConfig class Additional configuration may be needed to allow the OfflineStore to talk to the backing store. For example, Redshift needs configuration information like the connection information for the Redshift instance, credentials for connecting to the database, etc. -To facilitate configuration, all OfflineStore implementations are **required** to also define a corresponding OfflineStoreConfig class in the same file. This OfflineStoreConfig class should inherit from the `FeastConfigBaseModel` class, which is defined [here](../../sdk/python/feast/repo\_config.py#L44). +To facilitate configuration, all OfflineStore implementations are **required** to also define a corresponding OfflineStoreConfig class in the same file. This OfflineStoreConfig class should inherit from the `FeastConfigBaseModel` class, which is defined [here](../../../sdk/python/feast/repo\_config.py#L44). The `FeastConfigBaseModel` is a [pydantic](https://pydantic-docs.helpmanual.io) class, which parses yaml configuration into python objects. Pydantic also allows the model classes to define validators for the config classes, to make sure that the config classes are correctly defined. -This config class **must** container a `type` field, which contains the fully qualified class name of its corresponding OfflineStore class. +This config class **must** contain a `type` field, which contains the fully qualified class name of its corresponding OfflineStore class. Additionally, the name of the config class must be the same as the OfflineStore class, with the `Config` suffix. @@ -195,7 +199,7 @@ online_store: ``` {% endcode %}
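For reference, a minimal config class for the custom file store used in this guide could look roughly like this sketch (the `uri` option is hypothetical, and the demo repo's actual class may differ):

```python
from typing import Literal, Optional

from feast.repo_config import FeastConfigBaseModel


class CustomFileOfflineStoreConfig(FeastConfigBaseModel):
    """Pydantic model parsed from the offline_store block of feature_store.yaml."""

    # Must be the fully qualified class name of the corresponding OfflineStore.
    type: Literal[
        "feast_custom_offline_store.file.CustomFileOfflineStore"
    ] = "feast_custom_offline_store.file.CustomFileOfflineStore"

    # Any additional options are validated by pydantic and become available to
    # the store's methods via config.offline_store. Hypothetical example:
    uri: Optional[str] = None
```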
-This configuration information is available to the methods of the OfflineStore, via the `config: RepoConfig` parameter which is passed into the methods of the OfflineStore interface, specifically at the `config.offline_store` field of the `config` parameter. This fields in the `feature_store.yaml` should map directly to your `OfflineStoreConfig` class that is detailed above in Section 2. +This configuration information is available to the methods of the OfflineStore, via the `config: RepoConfig` parameter which is passed into the methods of the OfflineStore interface, specifically at the `config.offline_store` field of the `config` parameter. These fields in the `feature_store.yaml` should map directly to your `OfflineStoreConfig` class that is detailed above in Section 2. {% code title="feast_custom_offline_store/file.py" %} ```python @@ -225,7 +229,7 @@ Custom offline stores may need to implement their own instances of the `Retrieva The `RetrievalJob` interface exposes two methods - `to_df` and `to_arrow`. The expectation is for the retrieval job to be able to return the rows read from the offline store as a pandas DataFrame or an Arrow table, respectively. -Users who want to have their offline store support **scalable batch materialization** for online use cases (detailed in this [RFC](https://docs.google.com/document/d/1J7XdwwgQ9dY_uoV9zkRVGQjK9Sy43WISEW6D5V9qzGo/edit#heading=h.9gaqqtox9jg6)) will also need to implement `to_remote_storage` to distribute the reading and writing of offline store records to blob storage (such as S3). This may be used by a custom [Materialization Engine](/~https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/materialization/batch_materialization_engine.py#L72) to parallelize the materialization of data by processing it in chunks. If this is not implemented, Feast will default to local materialization (pulling all records into memory to materialize). +Users who want to have their offline store support **scalable batch materialization** for online use cases (detailed in this [RFC](https://docs.google.com/document/d/1J7XdwwgQ9dY\_uoV9zkRVGQjK9Sy43WISEW6D5V9qzGo/edit#heading=h.9gaqqtox9jg6)) will also need to implement `to_remote_storage` to distribute the reading and writing of offline store records to blob storage (such as S3). This may be used by a custom [Materialization Engine](/~https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/materialization/batch\_materialization\_engine.py#L72) to parallelize the materialization of data by processing it in chunks. If this is not implemented, Feast will default to local materialization (pulling all records into memory to materialize). {% code title="feast_custom_offline_store/file.py" %} ```python @@ -258,7 +262,7 @@ class CustomFileRetrievalJob(RetrievalJob): Before this offline store can be used as the batch source for a feature view in a feature repo, a subclass of the `DataSource` [base class](https://rtd.feast.dev/en/master/index.html?highlight=DataSource#feast.data\_source.DataSource) needs to be defined. This class is responsible for holding information needed by specific feature views to support reading historical values from the offline store. For example, a feature view using Redshift as the offline store may need to know which table contains historical feature values. -The data source class should implement two methods - `from_proto`, and `to_proto`. +The data source class should implement two methods: `from_proto` and `to_proto`. For custom offline stores that are not being implemented in the main feature repo, the `custom_options` field should be used to store any configuration needed by the data source. In this case, the implementer is responsible for serializing this configuration into bytes in the `to_proto` method and reading the value back from bytes in the `from_proto` method.
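A common pattern is to round-trip that configuration through JSON in the proto's `custom_options`. The sketch below assumes a single hypothetical `path` option and elides the remaining `DataSource` methods and constructor arguments, which vary between Feast versions:

```python
import json

from feast.data_source import DataSource
from feast.protos.feast.core.DataSource_pb2 import DataSource as DataSourceProto


class CustomFileDataSource(DataSource):
    """Sketch of a custom data source holding one hypothetical option."""

    def __init__(self, path: str, **kwargs):
        super().__init__(**kwargs)  # name, timestamp_field, ... per your Feast version
        self.path = path

    def to_proto(self) -> DataSourceProto:
        # Serialize implementer-defined configuration into custom_options bytes.
        return DataSourceProto(
            type=DataSourceProto.CUSTOM_SOURCE,
            custom_options=DataSourceProto.CustomSourceOptions(
                configuration=json.dumps({"path": self.path}).encode("utf8")
            ),
        )

    @staticmethod
    def from_proto(data_source: DataSourceProto) -> "CustomFileDataSource":
        # Deserialize the bytes written by to_proto back into configuration.
        options = json.loads(data_source.custom_options.configuration)
        return CustomFileDataSource(path=options["path"])
```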
@@ -317,9 +321,9 @@ class CustomFileDataSource(FileSource): ``` {% endcode %} -## 5. Using the custom offline store +## 5. Using the custom offline store -After implementing these classes, the custom offline store can be used by referencing it in a feature repo's `feature_store.yaml` file, specifically in the `offline_store` field. The value specified should be the fully qualified class name of the OfflineStore. +After implementing these classes, the custom offline store can be used by referencing it in a feature repo's `feature_store.yaml` file, specifically in the `offline_store` field. The value specified should be the fully qualified class name of the OfflineStore. As long as your OfflineStore class is available in your Python environment, it will be imported by Feast dynamically at runtime. @@ -372,17 +376,17 @@ driver_hourly_stats_view = FeatureView( Even if you have created the `OfflineStore` class in a separate repo, you can still test your implementation against the Feast test suite, as long as you have Feast as a submodule in your repo. 1. In order to test against the test suite, you need to create a custom `DataSourceCreator` that implements our testing infrastructure methods, `create_data_source` and optionally, `created_saved_dataset_destination`. - * `create_data_source` should create a datasource based on the dataframe passed in. It may be implemented by uploading the contents of the dataframe into the offline store and returning a datasource object pointing to that location. See `BigQueryDataSourceCreator` for an implementation of a data source creator. - * `created_saved_dataset_destination` is invoked when users need to save the dataset for use in data validation. This functionality is still in alpha and is **optional**. + * `create_data_source` should create a datasource based on the dataframe passed in. It may be implemented by uploading the contents of the dataframe into the offline store and returning a datasource object pointing to that location. See `BigQueryDataSourceCreator` for an implementation of a data source creator. + * `created_saved_dataset_destination` is invoked when users need to save the dataset for use in data validation. This functionality is still in alpha and is **optional**. +2. Make sure that your offline store doesn't break any unit tests first by running: -2. Make sure that your offline store doesn't break any unit tests first by running: ``` make test-python ``` +3. Next, set up your offline store to run the universal integration tests. These are integration tests specifically intended to test offline and online stores against Feast API functionality, to ensure that the Feast APIs work with your offline store. -3. Next, set up your offline store to run the universal integration tests. These are integration tests specifically intended to test offline and online stores against Feast API functionality, to ensure that the Feast APIs works with your offline store. - - Feast parametrizes integration tests using the `FULL_REPO_CONFIGS` variable defined in `sdk/python/tests/integration/feature_repos/repo_configuration.py` which stores different offline store classes for testing. - - To overwrite the default configurations to use your own offline store, you can simply create your own file that contains a `FULL_REPO_CONFIGS` dictionary, and point Feast to that file by setting the environment variable `FULL_REPO_CONFIGS_MODULE` to point to that file.
The module should add new `IntegrationTestRepoConfig` classes to the `AVAILABLE_OFFLINE_STORES` by defining an offline store that you would like Feast to test with. + * Feast parametrizes integration tests using the `FULL_REPO_CONFIGS` variable defined in `sdk/python/tests/integration/feature_repos/repo_configuration.py` which stores different offline store classes for testing. + * To overwrite the default configurations to use your own offline store, you can simply create your own file that contains a `FULL_REPO_CONFIGS` dictionary, and point Feast to it by setting the `FULL_REPO_CONFIGS_MODULE` environment variable. The module should add new `IntegrationTestRepoConfig` classes to the `AVAILABLE_OFFLINE_STORES` by defining an offline store that you would like Feast to test with. A sample `FULL_REPO_CONFIGS_MODULE` looks something like this: @@ -394,8 +398,7 @@ Even if you have created the `OfflineStore` class in a separate repo, you can st AVAILABLE_OFFLINE_STORES = [("local", PostgreSQLDataSourceCreator)] ``` - -4. You should swap out the `FULL_REPO_CONFIGS` environment variable and run the integration tests against your offline store. In the example repo, the file that overwrites `FULL_REPO_CONFIGS` is `feast_custom_offline_store/feast_tests.py`, so you would run: +4. You should swap in your own `FULL_REPO_CONFIGS` by setting the `FULL_REPO_CONFIGS_MODULE` environment variable and run the integration tests against your offline store. In the example repo, the file that overwrites `FULL_REPO_CONFIGS` is `feast_custom_offline_store/feast_tests.py`, so you would run: ```bash export FULL_REPO_CONFIGS_MODULE='feast_custom_offline_store.feast_tests' @@ -403,20 +406,17 @@ Even if you have created the `OfflineStore` class in a separate repo, you can st ``` If the integration tests fail, this indicates that there is a mistake in the implementation of this offline store! - 5. Remember to add your datasource to `repo_config.py` similar to how we added `spark`, `trino`, etc., to the dictionary `OFFLINE_STORE_CLASS_FOR_TYPE` and add the necessary configuration to `repo_configuration.py`. Namely, `AVAILABLE_OFFLINE_STORES` should load your repo configuration module. ### 7. Dependencies -Add any dependencies for your offline store to our `sdk/python/setup.py` under a new `__REQUIRED` list with the packages and add it to the setup script so that if your offline store is needed, users can install the necessary python packages. These packages should be defined as extras so that they are not installed by users by default. -You will need to regenerate our requirements files. To do this, create separate pyenv environments for python 3.8, 3.9, and 3.10. In each environment, run the following commands: +Add any dependencies for your offline store to our `sdk/python/setup.py` under a new `__REQUIRED` list with the packages and add it to the setup script so that if your offline store is needed, users can install the necessary Python packages. These packages should be defined as extras so that they are not installed by users by default. You will need to regenerate our requirements files. To do this, create separate pyenv environments for Python 3.8, 3.9, and 3.10. In each environment, run the following commands: ``` export PYTHON= make lock-python-ci-dependencies ``` - ### 8. Add Documentation Remember to add documentation for your offline store. @@ -425,12 +425,12 @@ Remember to add documentation for your offline store. 2.
You should also add a reference in `docs/reference/data-sources/README.md` and `docs/SUMMARY.md` to these markdown files. **NOTE**: Be sure to document the following things about your offline store: -- How to create the datasource and most what configuration is needed in the `feature_store.yaml` file in order to create the datasource. -- Make sure to flag that the datasource is in alpha development. -- Add some documentation on what the data model is for the specific offline store for more clarity. -- Finally, generate the python code docs by running: + +* How to create the datasource and what configuration is needed in the `feature_store.yaml` file in order to create the datasource. +* Make sure to flag that the datasource is in alpha development. +* Add some documentation on what the data model is for the specific offline store for more clarity. +* Finally, generate the Python code docs by running: ```bash make build-sphinx ``` - diff --git a/docs/how-to-guides/adding-support-for-a-new-online-store.md b/docs/how-to-guides/customizing-feast/adding-support-for-a-new-online-store.md similarity index 86% rename from docs/how-to-guides/adding-support-for-a-new-online-store.md rename to docs/how-to-guides/customizing-feast/adding-support-for-a-new-online-store.md index d1f5986f18..fe16347b73 100644 --- a/docs/how-to-guides/adding-support-for-a-new-online-store.md +++ b/docs/how-to-guides/customizing-feast/adding-support-for-a-new-online-store.md @@ -2,13 +2,12 @@ ## Overview -Feast makes adding support for a new online store (database) easy. Developers can simply implement the [OnlineStore](../../sdk/python/feast/infra/online\_stores/online\_store.py#L26) interface to add support for a new store (other than the existing stores like Redis, DynamoDB, SQLite, and Datastore). +Feast makes adding support for a new online store (database) easy. Developers can simply implement the [OnlineStore](../../../sdk/python/feast/infra/online\_stores/online\_store.py#L26) interface to add support for a new store (other than the existing stores like Redis, DynamoDB, SQLite, and Datastore). In this guide, we will show you how to integrate with MySQL as an online store. While we will be implementing a specific store, this guide should be representative of adding support for any new online store. The full working code for this guide can be found at [feast-dev/feast-custom-online-store-demo](/~https://github.com/feast-dev/feast-custom-online-store-demo). - The process of using a custom online store consists of 6 steps: 1. Defining the `OnlineStore` class. @@ -21,7 +20,7 @@ The process of using a custom online store consists of 6 steps: ## 1. Defining an OnlineStore class {% hint style="info" %} - OnlineStore class names must end with the OnlineStore suffix! +OnlineStore class names must end with the OnlineStore suffix! {% endhint %} ### Contrib online stores @@ -30,19 +29,21 @@ New online stores go in `sdk/python/feast/infra/online_stores/contrib/`. #### What is a contrib plugin? -- Not guaranteed to implement all interface methods -- Not guaranteed to be stable. -- Should have warnings for users to indicate this is a contrib plugin that is not maintained by the maintainers. +* Not guaranteed to implement all interface methods +* Not guaranteed to be stable. +* Should have warnings for users to indicate this is a contrib plugin that is not maintained by the maintainers. #### How do I make a contrib plugin an "official" plugin?
+ To move an online store plugin out of contrib, you need: -- GitHub actions (i.e `make test-python-integration`) is setup to run all tests against the online store and pass. -- At least two contributors own the plugin (ideally tracked in our `OWNERS` / `CODEOWNERS` file). + +* GitHub Actions (i.e. `make test-python-integration`) are set up to run all tests against the online store and pass. +* At least two contributors own the plugin (ideally tracked in our `OWNERS` / `CODEOWNERS` file). The OnlineStore class broadly contains two sets of methods: * One set deals with managing infrastructure that the online store needs for operations -* One set deals with writing data into the store, and reading data from the store. +* One set deals with writing data into the store and reading data from the store. ### 1.1 Infrastructure Methods @@ -50,11 +51,11 @@ There are two methods that deal with managing infrastructure for online stores, * `update` is invoked when users run `feast apply` as a CLI command, or the `FeatureStore.apply()` SDK method. -The `update` method should be used to perform any operations necessary before data can be written to or read from the store. The `update` method can be used to create MySQL tables in preparation for reads and writes to new feature views. +The `update` method should be used to perform any operations necessary before data can be written to or read from the store. The `update` method can be used to create MySQL tables in preparation for reads and writes to new feature views. * `teardown` is invoked when users run `feast teardown` or `FeatureStore.teardown()`. -The `teardown` method should be used to perform any clean-up operations. `teardown` can be used to drop MySQL indices and tables corresponding to the feature views being deleted. +The `teardown` method should be used to perform any clean-up operations. `teardown` can be used to drop MySQL indices and tables corresponding to the feature views being deleted. {% code title="feast_custom_online_store/mysql.py" %} ```python @@ -123,10 +124,10 @@ def teardown( ### 1.2 Read/Write Methods -There are two methods that deal with writing data to and from the online stores.`online_write_batch `and `online_read`. +There are two methods that deal with writing data to and from the online store: `online_write_batch` and `online_read`. -* `online_write_batch `is invoked when running materialization (using the `feast materialize` or `feast materialize-incremental` commands, or the corresponding `FeatureStore.materialize()` method. -* `online_read `is invoked when reading values from the online store using the `FeatureStore.get_online_features()` method. +* `online_write_batch` is invoked when running materialization (using the `feast materialize` or `feast materialize-incremental` commands, or the corresponding `FeatureStore.materialize()` method). +* `online_read` is invoked when reading values from the online store using the `FeatureStore.get_online_features()` method. {% code title="feast_custom_online_store/mysql.py" %} ```python @@ -210,22 +211,24 @@ def online_read( ### 1.3 Type Mapping Most online stores will have to perform some custom mapping of online store datatypes to feast value types. -- The function to implement here are `source_datatype_to_feast_value_type` and `get_column_names_and_types` in your `DataSource` class. + +* The functions to implement here are `source_datatype_to_feast_value_type` and `get_column_names_and_types` in your `DataSource` class. * `source_datatype_to_feast_value_type` is used to convert your DataSource's datatypes to feast value types. * `get_column_names_and_types` retrieves the column names and corresponding datasource types. Add any helper functions for type conversion to `sdk/python/feast/type_map.py`. -- Be sure to implement correct type mapping so that Feast can process your feature columns without casting incorrectly that can potentially cause loss of information or incorrect data. + +* Be sure to implement correct type mapping so that Feast can process your feature columns without casting them incorrectly, which could cause loss of information or incorrect data.
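As an illustration, the converter that `source_datatype_to_feast_value_type` points Feast at (typically a helper in `sdk/python/feast/type_map.py`) can be a simple lookup over your store's type names. The MySQL type names below are a hypothetical, incomplete sample:

```python
from feast.value_type import ValueType

# Hypothetical, incomplete mapping of MySQL column types to Feast value types.
_MYSQL_TO_FEAST_TYPE = {
    "tinyint": ValueType.INT32,
    "int": ValueType.INT32,
    "bigint": ValueType.INT64,
    "float": ValueType.FLOAT,
    "double": ValueType.DOUBLE,
    "varchar": ValueType.STRING,
    "blob": ValueType.BYTES,
    "bool": ValueType.BOOL,
}


def mysql_type_to_feast_value_type(type_name: str) -> ValueType:
    """Convert a MySQL column type name to a Feast ValueType."""
    try:
        return _MYSQL_TO_FEAST_TYPE[type_name.lower()]
    except KeyError:
        raise ValueError(f"Unsupported MySQL type: {type_name}")
```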
## 2. Defining an OnlineStoreConfig class Additional configuration may be needed to allow the OnlineStore to talk to the backing store. For example, MySQL may need configuration information like the host at which the MySQL instance is running, credentials for connecting to the database, etc. -To facilitate configuration, all OnlineStore implementations are **required** to also define a corresponding OnlineStoreConfig class in the same file. This OnlineStoreConfig class should inherit from the `FeastConfigBaseModel` class, which is defined [here](../../sdk/python/feast/repo\_config.py#L44). +To facilitate configuration, all OnlineStore implementations are **required** to also define a corresponding OnlineStoreConfig class in the same file. This OnlineStoreConfig class should inherit from the `FeastConfigBaseModel` class, which is defined [here](../../../sdk/python/feast/repo\_config.py#L44). The `FeastConfigBaseModel` is a [pydantic](https://pydantic-docs.helpmanual.io) class, which parses yaml configuration into python objects. Pydantic also allows the model classes to define validators for the config classes, to make sure that the config classes are correctly defined. -This config class **must** container a `type` field, which contains the fully qualified class name of its corresponding OnlineStore class. +This config class **must** contain a `type` field, which contains the fully qualified class name of its corresponding OnlineStore class. Additionally, the name of the config class must be the same as the OnlineStore class, with the `Config` suffix. @@ -254,7 +257,7 @@ online_store: ``` {% endcode %} -This configuration information is available to the methods of the OnlineStore, via the`config: RepoConfig` parameter which is passed into all the methods of the OnlineStore interface, specifically at the `config.online_store` field of the `config` parameter. +This configuration information is available to the methods of the OnlineStore, via the `config: RepoConfig` parameter which is passed into all the methods of the OnlineStore interface, specifically at the `config.online_store` field of the `config` parameter. {% code title="feast_custom_online_store/mysql.py" %} ```python @@ -281,9 +284,9 @@ def online_write_batch( ``` {% endcode %} -## 3. Using the custom online store +## 3. Using the custom online store -After implementing both these classes, the custom online store can be used by referencing it in a feature repo's `feature_store.yaml` file, specifically in the `online_store` field. The value specified should be the fully qualified class name of the OnlineStore. +After implementing both these classes, the custom online store can be used by referencing it in a feature repo's `feature_store.yaml` file, specifically in the `online_store` field. The value specified should be the fully qualified class name of the OnlineStore.
As long as your OnlineStore class is available in your Python environment, it will be imported by Feast dynamically at runtime. @@ -302,7 +305,7 @@ online_store: ``` {% endcode %} -If additional configuration for the online store is **not **required, then we can omit the other fields and only specify the `type` of the online store class as the value for the `online_store`. +If additional configuration for the online store is **not** required, then we can omit the other fields and only specify the `type` of the online store class as the value for the `online_store`. {% code title="feature_repo/feature_store.yaml" %} ```yaml @@ -319,15 +322,14 @@ online_store: feast_custom_online_store.mysql.MySQLOnlineStore Even if you have created the `OnlineStore` class in a separate repo, you can still test your implementation against the Feast test suite, as long as you have Feast as a submodule in your repo. -1. In the Feast submodule, we can run all the unit tests and make sure they pass: +1. In the Feast submodule, we can run all the unit tests and make sure they pass: + ``` make test-python ``` - - 2. The universal tests, which are integration tests specifically intended to test offline and online stores, should be run against Feast to ensure that the Feast APIs work with your online store. - - Feast parametrizes integration tests using the `FULL_REPO_CONFIGS` variable defined in `sdk/python/tests/integration/feature_repos/repo_configuration.py` which stores different online store classes for testing. - - To overwrite these configurations, you can simply create your own file that contains a `FULL_REPO_CONFIGS` variable, and point Feast to that file by setting the environment variable `FULL_REPO_CONFIGS_MODULE` to point to that file. + * Feast parametrizes integration tests using the `FULL_REPO_CONFIGS` variable defined in `sdk/python/tests/integration/feature_repos/repo_configuration.py` which stores different online store classes for testing. + * To overwrite these configurations, you can simply create your own file that contains a `FULL_REPO_CONFIGS` variable, and point Feast to it by setting the `FULL_REPO_CONFIGS_MODULE` environment variable. A sample `FULL_REPO_CONFIGS_MODULE` looks something like this: @@ -341,10 +343,8 @@ AVAILABLE_ONLINE_STORES = {"postgres": (None, PostgreSQLDataSourceCreator)} ``` {% endcode %} - If you are planning to start the online store up locally (e.g. spin up a local Redis instance) for testing, then the dictionary entry should be something like: - ```python { "sqlite": ({"type": "sqlite"}, None), @@ -352,10 +352,8 @@ If you are planning to start the online store up locally(e.g spin up a local Red } ``` - If you are planning instead to use a Dockerized container to run your tests against your online store, you can define an `OnlineStoreCreator` and replace the `None` object above with your `OnlineStoreCreator` class. - If you create a containerized docker image for testing, developers who are trying to test with your online store will not have to spin up their own instance of the online store for testing. An example of an `OnlineStoreCreator` is shown below: {% code title="sdk/python/tests/integration/feature_repos/universal/online_store/redis.py" %} @@ -381,33 +379,33 @@ export FULL_REPO_CONFIGS_MODULE='feast_custom_online_store.feast_tests' make test-python-universal ``` -- If there are some tests that fail, this indicates that there is a mistake in the implementation of this online store! - +* If there are some tests that fail, this indicates that there is a mistake in the implementation of this online store!
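For illustration, a containerized creator along the lines of the Redis example referenced above could look roughly like this sketch (the test-infrastructure import path, base-class methods, and `testcontainers` usage here are assumptions; check the Feast test suite for the authoritative version):

```python
from testcontainers.redis import RedisContainer

from tests.integration.feature_repos.universal.online_store_creator import (
    OnlineStoreCreator,
)


class RedisOnlineStoreCreator(OnlineStoreCreator):
    def __init__(self, project_name: str, **kwargs):
        super().__init__(project_name)
        self.container = RedisContainer("redis:6.2.6")

    def create_online_store(self) -> dict:
        # Start the container and point the online store config at it.
        self.container.start()
        exposed_port = self.container.get_exposed_port("6379")
        return {"type": "redis", "connection_string": f"localhost:{exposed_port}"}

    def teardown(self):
        self.container.stop()
```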
### 5. Add Dependencies Add any dependencies for your online store to our `sdk/python/setup.py` under a new `_REQUIRED` list with the packages and add it to the setup script so that if your online store is needed, users can install the necessary Python packages. -- You will need to regenerate our requirements files. To do this, create separate pyenv environments for python 3.8, 3.9, and 3.10. In each environment, run the following commands: + +* You will need to regenerate our requirements files. To do this, create separate pyenv environments for Python 3.8, 3.9, and 3.10. In each environment, run the following commands: ``` export PYTHON= make lock-python-ci-dependencies ``` - ### 6. Add Documentation Remember to add the documentation for your online store. -1. Add a new markdown file to `docs/reference/online-stores/`. + +1. Add a new markdown file to `docs/reference/online-stores/`. 2. You should also add a reference in `docs/reference/online-stores/README.md` and `docs/SUMMARY.md`. Add a new markdown document to document your online store functionality similar to how the other online stores are documented. **NOTE**: Be sure to document the following things about your online store: -- Be sure to cover how to create the datasource and what configuration is needed in the `feature_store.yaml` file in order to create the datasource. -- Make sure to flag that the online store is in alpha development. -- Add some documentation on what the data model is for the specific online store for more clarity. -- Finally, generate the python code docs by running: + +* Be sure to cover how to create the datasource and what configuration is needed in the `feature_store.yaml` file in order to create the datasource. +* Make sure to flag that the online store is in alpha development. +* Add some documentation on what the data model is for the specific online store for more clarity. +* Finally, generate the Python code docs by running: ```bash make build-sphinx ``` - diff --git a/docs/how-to-guides/creating-a-custom-materialization-engine.md b/docs/how-to-guides/customizing-feast/creating-a-custom-materialization-engine.md similarity index 92% rename from docs/how-to-guides/creating-a-custom-materialization-engine.md rename to docs/how-to-guides/customizing-feast/creating-a-custom-materialization-engine.md index 935ac3dc99..cca7bd3621 100644 --- a/docs/how-to-guides/creating-a-custom-materialization-engine.md +++ b/docs/how-to-guides/customizing-feast/creating-a-custom-materialization-engine.md @@ -1,4 +1,4 @@ -# Adding a custom materialization engine +# Adding a custom batch materialization engine ### Overview @@ -7,10 +7,10 @@ Feast batch materialization operations (`materialize` and `materialize-increment Custom batch materialization engines allow Feast users to extend Feast to customize the materialization process. Examples include: * Setting up custom materialization-specific infrastructure during `feast apply` (e.g. setting up Spark clusters or Lambda Functions) -* Launching custom batch ingestion \(materialization\) jobs \(Spark, Beam, AWS Lambda\) +* Launching custom batch ingestion (materialization) jobs (Spark, Beam, AWS Lambda) * Tearing down custom materialization-specific infrastructure during `feast teardown` (e.g.
tearing down Spark clusters, or deleting Lambda Functions) -Feast comes with built-in materialization engines, e.g, `LocalMaterializationEngine`, and an experimental `LambdaMaterializationEngine`. However, users can develop their own materialization engines by creating a class that implements the contract in the [BatchMaterializationEngine class](/~https://github.com/feast-dev/feast/blob/6d7b38a39024b7301c499c20cf4e7aef6137c47c/sdk/python/feast/infra/materialization/batch_materialization_engine.py#L72). +Feast comes with built-in materialization engines, e.g., `LocalMaterializationEngine`, and an experimental `LambdaMaterializationEngine`. However, users can develop their own materialization engines by creating a class that implements the contract in the [BatchMaterializationEngine class](/~https://github.com/feast-dev/feast/blob/6d7b38a39024b7301c499c20cf4e7aef6137c47c/sdk/python/feast/infra/materialization/batch\_materialization\_engine.py#L72). ### Guide @@ -79,14 +79,13 @@ class MyCustomEngine(LocalMaterializationEngine): ) for task in tasks ] - ``` Notice how in the above engine we have only overwritten two of the methods on the `LocalMaterializationEngine`, namely `update` and `materialize`. These two methods are convenient to replace if you are planning to launch custom batch jobs. #### Step 2: Configuring Feast to use the engine -Configure your [feature\_store.yaml](../reference/feature-repository/feature-store-yaml.md) file to point to your new engine class: +Configure your [feature\_store.yaml](../../reference/feature-repository/feature-store-yaml.md) file to point to your new engine class: ```yaml project: repo @@ -99,7 +98,7 @@ offline_store: type: file ``` -Notice how the `batch_engine` field above points to the module and class where your engine can be found. +Notice how the `batch_engine` field above points to the module and class where your engine can be found. #### Step 3: Using the engine @@ -109,7 +108,7 @@ Now you should be able to use your engine by running a Feast command: feast apply ``` -```text +``` Registered entity driver_id Registered feature view driver_hourly_stats Deploying infrastructure for driver_hourly_stats diff --git a/docs/how-to-guides/creating-a-custom-provider.md b/docs/how-to-guides/customizing-feast/creating-a-custom-provider.md similarity index 94% rename from docs/how-to-guides/creating-a-custom-provider.md rename to docs/how-to-guides/customizing-feast/creating-a-custom-provider.md index 40ec20ee6a..027ca20c39 100644 --- a/docs/how-to-guides/creating-a-custom-provider.md +++ b/docs/how-to-guides/customizing-feast/creating-a-custom-provider.md @@ -6,8 +6,8 @@ All Feast operations execute through a `provider`. Operations like materializing Custom providers allow Feast users to extend Feast to execute any custom logic. Examples include: -* Launching custom streaming ingestion jobs \(Spark, Beam\) -* Launching custom batch ingestion \(materialization\) jobs \(Spark, Beam\) +* Launching custom streaming ingestion jobs (Spark, Beam) +* Launching custom batch ingestion (materialization) jobs (Spark, Beam) * Adding custom validation to feature repositories during `feast apply` * Adding custom infrastructure setup logic which runs during `feast apply` * Extending Feast commands with in-house metrics, logging, or tracing @@ -87,7 +87,7 @@ It is possible to overwrite all the methods on the provider class.
In fact, it i #### Step 2: Configuring Feast to use the provider -Configure your [feature\_store.yaml](../reference/feature-repository/feature-store-yaml.md) file to point to your new provider class: +Configure your [feature\_store.yaml](../../reference/feature-repository/feature-store-yaml.md) file to point to your new provider class: ```yaml project: repo @@ -100,7 +100,7 @@ offline_store: type: file ``` -Notice how the `provider` field above points to the module and class where your provider can be found. +Notice how the `provider` field above points to the module and class where your provider can be found. #### Step 3: Using the provider @@ -110,7 +110,7 @@ Now you should be able to use your provider by running a Feast command: feast apply ``` -```text +``` Registered entity driver_id Registered feature view driver_hourly_stats Deploying infrastructure for driver_hourly_stats @@ -128,4 +128,3 @@ That's it. You should now have a fully functional custom provider! ### Next steps Have a look at the [custom provider demo repository](/~https://github.com/feast-dev/feast-custom-provider-demo) for a fully functional example of a custom provider. Feel free to fork it when creating your own custom provider! - diff --git a/docs/how-to-guides/running-feast-in-production.md b/docs/how-to-guides/running-feast-in-production.md index 34a5a9ca88..04166809a5 100644 --- a/docs/how-to-guides/running-feast-in-production.md +++ b/docs/how-to-guides/running-feast-in-production.md @@ -9,7 +9,7 @@ Overview of typical production configuration is given below: ![Overview](production-simple.png) {% hint style="success" %} -**Important note:** Feast is highly customizable and modular. Most Feast blocks are loosely connected and can be used independently. Hence, you are free to build your own production configuration. +**Important note:** Feast is highly customizable and modular. Most Feast blocks are loosely connected and can be used independently. Hence, you are free to build your own production configuration. For example, you might not have a stream source and, thus, no need to write features in real-time to an online store. Or you might not need to retrieve online features. Feast also often provides multiple options to achieve the same goal. We discuss tradeoffs below. {% endhint %} @@ -30,14 +30,14 @@ The first step to setting up a deployment of Feast is to create a Git repository ### Setting up CI/CD to automatically update the registry -We recommend typically setting up CI/CD to automatically run `feast plan` and `feast apply` when pull requests are opened / merged. +We typically recommend setting up CI/CD to automatically run `feast plan` and `feast apply` when pull requests are opened / merged. ### Setting up multiple environments Most teams will need to have a feature store deployed to more than one environment. There are two common ways teams approach this: 1. Have separate GitHub branches for each environment -2. Have separate `feature_store.yaml` files that correspond to each environment +2. Have separate `feature_store.yaml` files that correspond to each environment For the second approach, we have created an example repository ([Feast Repository Example](/~https://github.com/feast-dev/feast-ci-repo-example)) which contains two Feast projects, one per environment.
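As a sketch of that layout, the two projects' `feature_store.yaml` files might differ only in their project name and registry path (all values below are hypothetical):

```yaml
# staging/feature_store.yaml
project: my_project_staging
registry: s3://my-bucket/staging/registry.db
provider: aws
online_store:
  type: dynamodb
  region: us-west-2
```

```yaml
# production/feature_store.yaml
project: my_project_production
registry: s3://my-bucket/production/registry.db
provider: aws
online_store:
  type: dynamodb
  region: us-west-2
```

Which file a deployment points Feast at then determines which environment it talks to.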
diff --git a/docs/reference/alpha-web-ui.md b/docs/reference/alpha-web-ui.md index 47fee38fd6..7d21a3d45d 100644 --- a/docs/reference/alpha-web-ui.md +++ b/docs/reference/alpha-web-ui.md @@ -9,7 +9,7 @@ The Feast Web UI allows users to explore their feature repository through a Web * Browsing Feast objects (feature views, entities, data sources, feature services, and saved datasets) and their relationships * Searching and filtering for Feast objects by tags -![Sample UI](ui.png) +![Sample UI](../../ui/sample.png) ## Usage diff --git a/docs/reference/data-sources/README.md b/docs/reference/data-sources/README.md index 223338508b..6ab2e4b083 100644 --- a/docs/reference/data-sources/README.md +++ b/docs/reference/data-sources/README.md @@ -1,6 +1,6 @@ # Data sources -Please see [Data Source](../../getting-started/concepts/data-source.md) for an explanation of data sources. +Please see [Data Source](../../getting-started/concepts/data-ingestion.md) for an explanation of data sources. {% content-ref url="file.md" %} [file.md](file.md) diff --git a/docs/tutorials/driver-ranking-with-feast.md b/docs/tutorials/driver-ranking-with-feast.md deleted file mode 100644 index 4ad34cd9c0..0000000000 --- a/docs/tutorials/driver-ranking-with-feast.md +++ /dev/null @@ -1,25 +0,0 @@ ---- -description: >- - Making a prediction using a linear regression model is a common use case in - ML. This model predicts if a driver will complete a trip based on features - ingested into Feast. ---- - -# Driver ranking - -In this example, you'll learn how to use some of the key functionality in Feast. The tutorial runs in both local mode and on the Google Cloud Platform \(GCP\). For GCP, you must have access to a GCP project already, including read and write permissions to BigQuery. - -## [Driver Ranking Example](/~https://github.com/feast-dev/feast-driver-ranking-tutorial) - -This tutorial guides you on how to use Feast with [Scikit-learn](https://scikit-learn.org/stable/). You will learn how to: - -* Train a model locally \(on your laptop\) using data from [BigQuery](https://cloud.google.com/bigquery/) -* Test the model for online inference using [SQLite](https://www.sqlite.org/index.html) \(for fast iteration\) -* Test the model for online inference using [Firestore](https://firebase.google.com/products/firestore) \(for production use\) - -Try it and let us know what you think! - -| ![](../.gitbook/assets/colab_logo_32px.png)[ Run in Google Colab ](https://colab.research.google.com/github/feast-dev/feast-driver-ranking-tutorial/blob/master/notebooks/Driver_Ranking_Tutorial.ipynb) | ![](../.gitbook/assets/github-mark-32px.png)[ View Source in Github](/~https://github.com/feast-dev/feast-driver-ranking-tutorial/blob/master/notebooks/Driver_Ranking_Tutorial.ipynb) | -| :--- | :--- | - - diff --git a/docs/tutorials/tutorials-overview.md b/docs/tutorials/tutorials-overview.md deleted file mode 100644 index 9432783a69..0000000000 --- a/docs/tutorials/tutorials-overview.md +++ /dev/null @@ -1,15 +0,0 @@ -# Overview - -These Feast tutorials showcase how to use Feast to simplify end to end model training / serving. 
- -{% page-ref page="fraud-detection.md" %} - -{% page-ref page="driver-ranking-with-feast.md" %} - -{% page-ref page="real-time-credit-scoring-on-aws.md" %} - -{% page-ref page="driver-stats-on-snowflake.md" %} - -{% page-ref page="validating-historical-features.md" %} - -{% page-ref page="using-scalable-registry.md" %} diff --git a/docs/tutorials/tutorials-overview/README.md b/docs/tutorials/tutorials-overview/README.md new file mode 100644 index 0000000000..76cb2bea6b --- /dev/null +++ b/docs/tutorials/tutorials-overview/README.md @@ -0,0 +1,19 @@ +# Sample use-case tutorials + +These Feast tutorials showcase how to use Feast to simplify end-to-end model training / serving. + +{% content-ref url="driver-ranking-with-feast.md" %} +[driver-ranking-with-feast.md](driver-ranking-with-feast.md) +{% endcontent-ref %} + +{% content-ref url="fraud-detection.md" %} +[fraud-detection.md](fraud-detection.md) +{% endcontent-ref %} + +{% content-ref url="real-time-credit-scoring-on-aws.md" %} +[real-time-credit-scoring-on-aws.md](real-time-credit-scoring-on-aws.md) +{% endcontent-ref %} + +{% content-ref url="driver-stats-on-snowflake.md" %} +[driver-stats-on-snowflake.md](driver-stats-on-snowflake.md) +{% endcontent-ref %} diff --git a/docs/tutorials/tutorials-overview/driver-ranking-with-feast.md b/docs/tutorials/tutorials-overview/driver-ranking-with-feast.md new file mode 100644 index 0000000000..54f3035319 --- /dev/null +++ b/docs/tutorials/tutorials-overview/driver-ranking-with-feast.md @@ -0,0 +1,23 @@ +--- +description: >- + Making a prediction using a linear regression model is a common use case in + ML. This model predicts if a driver will complete a trip based on features + ingested into Feast. +--- + +# Driver ranking + +In this example, you'll learn how to use some of the key functionality in Feast. The tutorial runs in both local mode and on the Google Cloud Platform (GCP). For GCP, you must have access to a GCP project already, including read and write permissions to BigQuery. + +## [Driver Ranking Example](/~https://github.com/feast-dev/feast-driver-ranking-tutorial) + +This tutorial guides you on how to use Feast with [Scikit-learn](https://scikit-learn.org/stable/). You will learn how to: + +* Train a model locally (on your laptop) using data from [BigQuery](https://cloud.google.com/bigquery/) +* Test the model for online inference using [SQLite](https://www.sqlite.org/index.html) (for fast iteration) +* Test the model for online inference using [Firestore](https://firebase.google.com/products/firestore) (for production use) + +Try it and let us know what you think!
+ +| ![](../../.gitbook/assets/colab\_logo\_32px.png)[ Run in Google Colab](https://colab.research.google.com/github/feast-dev/feast-driver-ranking-tutorial/blob/master/notebooks/Driver\_Ranking\_Tutorial.ipynb) | ![](../../.gitbook/assets/github-mark-32px.png)[ View Source in Github](/~https://github.com/feast-dev/feast-driver-ranking-tutorial/blob/master/notebooks/Driver\_Ranking\_Tutorial.ipynb) | +| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | diff --git a/docs/tutorials/driver-stats-on-snowflake.md b/docs/tutorials/tutorials-overview/driver-stats-on-snowflake.md similarity index 100% rename from docs/tutorials/driver-stats-on-snowflake.md rename to docs/tutorials/tutorials-overview/driver-stats-on-snowflake.md diff --git a/docs/tutorials/fraud-detection.md b/docs/tutorials/tutorials-overview/fraud-detection.md similarity index 51% rename from docs/tutorials/fraud-detection.md rename to docs/tutorials/tutorials-overview/fraud-detection.md index 7bdfde760e..30564d0b0c 100644 --- a/docs/tutorials/fraud-detection.md +++ b/docs/tutorials/tutorials-overview/fraud-detection.md @@ -17,13 +17,9 @@ Our end-to-end example will perform the following workflows: * Building point-in-time correct training datasets from feature data and training a model * Making online predictions from feature data -Here's a high-level picture of our system architecture on Google Cloud Platform \(GCP\): - - - -![](../.gitbook/assets/data-systems-fraud-2x.jpg) - -| ![](../.gitbook/assets/colab_logo_32px.png) [Run in Google Colab](https://colab.research.google.com/github/feast-dev/feast-fraud-tutorial/blob/master/notebooks/Fraud_Detection_Tutorial.ipynb) | ![](../.gitbook/assets/github-mark-32px.png)[ View Source on Github](/~https://github.com/feast-dev/feast-fraud-tutorial/blob/main/notebooks/Fraud_Detection_Tutorial.ipynb) | -| :--- | :--- | +Here's a high-level picture of our system architecture on Google Cloud Platform (GCP): +![](../../.gitbook/assets/data-systems-fraud-2x.jpg) +| ![](../../.gitbook/assets/colab\_logo\_32px.png) [Run in Google Colab](https://colab.research.google.com/github/feast-dev/feast-fraud-tutorial/blob/master/notebooks/Fraud\_Detection\_Tutorial.ipynb) | ![](../../.gitbook/assets/github-mark-32px.png)[ View Source on Github](/~https://github.com/feast-dev/feast-fraud-tutorial/blob/main/notebooks/Fraud\_Detection\_Tutorial.ipynb) | +| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | diff --git a/docs/tutorials/real-time-credit-scoring-on-aws.md b/docs/tutorials/tutorials-overview/real-time-credit-scoring-on-aws.md similarity index 74% rename from docs/tutorials/real-time-credit-scoring-on-aws.md rename to docs/tutorials/tutorials-overview/real-time-credit-scoring-on-aws.md index 43f8c98133..6268aba1f1 100644 --- a/docs/tutorials/real-time-credit-scoring-on-aws.md +++ 
b/docs/tutorials/tutorials-overview/real-time-credit-scoring-on-aws.md @@ -10,20 +10,18 @@ When individuals apply for loans from banks and other credit providers, the deci In this example, we will demonstrate how a real-time credit scoring system can be built using Feast and Scikit-Learn on AWS, using feature data from S3. -This real-time system accepts a loan request from a customer and responds within 100ms with a decision on whether their loan has been approved or rejected. +This real-time system accepts a loan request from a customer and responds within 100ms with a decision on whether their loan has been approved or rejected. ## [Real-time Credit Scoring Example](/~https://github.com/feast-dev/real-time-credit-scoring-on-aws-tutorial) This end-to-end tutorial will take you through the following steps: -* Deploying S3 with Parquet as your primary data source, containing both [loan features](/~https://github.com/feast-dev/real-time-credit-scoring-on-aws-tutorial/blob/22fc6c7272ef033e7ba0afc64ffaa6f6f8fc0277/data/loan_table_sample.csv) and [zip code features](/~https://github.com/feast-dev/real-time-credit-scoring-on-aws-tutorial/blob/22fc6c7272ef033e7ba0afc64ffaa6f6f8fc0277/data/zipcode_table_sample.csv) +* Deploying S3 with Parquet as your primary data source, containing both [loan features](/~https://github.com/feast-dev/real-time-credit-scoring-on-aws-tutorial/blob/22fc6c7272ef033e7ba0afc64ffaa6f6f8fc0277/data/loan\_table\_sample.csv) and [zip code features](/~https://github.com/feast-dev/real-time-credit-scoring-on-aws-tutorial/blob/22fc6c7272ef033e7ba0afc64ffaa6f6f8fc0277/data/zipcode\_table\_sample.csv) * Deploying Redshift as the interface Feast uses to build training datasets * Registering your features with Feast and configuring DynamoDB for online serving * Building a training dataset with Feast to train your credit scoring model * Loading feature values from S3 into DynamoDB * Making online predictions with your credit scoring model using features from DynamoDB -| ![](../.gitbook/assets/github-mark-32px.png)[ View Source on Github](/~https://github.com/feast-dev/real-time-credit-scoring-on-aws-tutorial) | -| :--- | - - +| ![](../../.gitbook/assets/github-mark-32px.png)[ View Source on Github](/~https://github.com/feast-dev/real-time-credit-scoring-on-aws-tutorial) | +| ---------------------------------------------------------------------------------------------------------------------------------------------- |