
# Configuration and environment variables in Feathr

Feathr uses a YAML file and a few environment variables to give end users more flexibility. See `feathr_config.yaml` for an example of the configurations described below.

This YAML file contains the configurations used by Feathr. Every configuration can be overwritten by an environment variable whose name concatenates the layers of the config file with `__` (double underscore). For example, `feathr_runtime_location` for Databricks can be overwritten by setting the environment variable `SPARK_CONFIG__DATABRICKS__FEATHR_RUNTIME_LOCATION`. You can set it in Python:

```python
import os

os.environ['SPARK_CONFIG__DATABRICKS__FEATHR_RUNTIME_LOCATION'] = "https://azurefeathrstorage.blob.core.windows.net/public/feathr-assembly-LATEST.jar"
```

or in a shell environment:

```bash
export SPARK_CONFIG__DATABRICKS__FEATHR_RUNTIME_LOCATION=https://azurefeathrstorage.blob.core.windows.net/public/feathr-assembly-LATEST.jar
```
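The mapping between a YAML path and its environment variable name is mechanical, so it can be derived programmatically. A minimal sketch (the helper name is ours, not part of the Feathr API):

```python
def config_path_to_env_var(yaml_path: str) -> str:
    """Convert a feathr_config.yaml path such as
    'spark_config.databricks.feathr_runtime_location' to the
    corresponding environment variable name."""
    return "__".join(yaml_path.split(".")).upper()

# Prints: SPARK_CONFIG__DATABRICKS__FEATHR_RUNTIME_LOCATION
print(config_path_to_env_var("spark_config.databricks.feathr_runtime_location"))
```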

Overriding configurations through environment variables allows end users to store them in a secure way, for example in Kubernetes secrets, a key vault, etc. All the configurations available to end users are listed below.

## Default behaviors

Feathr resolves each configuration in the following order (a sketch of this lookup appears after the list):

1. If the key is set in the environment, Feathr uses the value of that environment variable.
2. If it is not set in the environment, the value is retrieved from the `feathr_config.yaml` file under the same config key.
3. If it is not available in the `feathr_config.yaml` file, Feathr tries to retrieve the value from a key vault service. Currently only Azure Key Vault is supported.
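The lookup can be pictured as a small fallback chain. The sketch below is illustrative rather than Feathr's actual implementation; it assumes PyYAML and a local `feathr_config.yaml`, and leaves the Key Vault fallback as a comment:

```python
import os

import yaml  # PyYAML, assumed available


def get_from_yaml(key, path="feathr_config.yaml"):
    """Walk the nested YAML using the '__' layering, e.g.
    SPARK_CONFIG__SPARK_CLUSTER -> config['spark_config']['spark_cluster']."""
    with open(path) as f:
        node = yaml.safe_load(f)
    for part in key.lower().split("__"):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def get_config(key):
    """Resolve a config key: environment variable first, then the YAML file."""
    value = os.environ.get(key.upper())
    if value is not None:
        return value
    value = get_from_yaml(key)
    if value is not None:
        return value
    return None  # a real client would fall back to Azure Key Vault here
```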

## A list of environment variables that Feathr uses

| Environment Variable | Description | Required? |
| --- | --- | --- |
| SECRETS__AZURE_KEY_VAULT__NAME | Name of the Azure Key Vault service so that Feathr can get credentials from that service. | Optional |
| AZURE_CLIENT_ID | Client ID for authentication into Azure services. Read here for more details. | Required if you are using a Service Principal to log in with Feathr. |
| AZURE_TENANT_ID | Tenant ID for authentication into Azure services. Read here for more details. | Required if you are using a Service Principal to log in with Feathr. |
| AZURE_CLIENT_SECRET | Client secret for authentication into Azure services. Read here for more details. | Required if you are using a Service Principal to log in with Feathr. |
| OFFLINE_STORE__ADLS__ADLS_ENABLED | Whether to enable ADLS as an offline store or not. | Optional |
| ADLS_ACCOUNT | ADLS account that you connect to. | Required if using ADLS as an offline store. |
| ADLS_KEY | ADLS key that you connect to. | Required if using ADLS as an offline store. |
| OFFLINE_STORE__WASB__WASB_ENABLED | Whether to enable Azure Blob Storage as an offline store or not. | Optional |
| WASB_ACCOUNT | Azure Blob Storage account that you connect to. | Required if using Azure Blob Storage as an offline store. |
| WASB_KEY | Azure Blob Storage key that you connect to. | Required if using Azure Blob Storage as an offline store. |
| S3_ACCESS_KEY | AWS S3 access key for the S3 account. | Required if using AWS S3 Storage as an offline store. |
| S3_SECRET_KEY | AWS S3 secret key for the S3 account. | Required if using AWS S3 Storage as an offline store. |
| OFFLINE_STORE__S3__S3_ENABLED | Whether to enable S3 as an offline store or not. | Optional |
| OFFLINE_STORE__S3__S3_ENDPOINT | S3 endpoint. If you use an S3 endpoint, you also need to provide the access key and secret key as environment variables. | Required if using AWS S3 Storage as an offline store. |
| OFFLINE_STORE__JDBC__JDBC_ENABLED | Whether to enable JDBC as an offline store or not. | Optional |
| OFFLINE_STORE__JDBC__JDBC_DATABASE | If using a JDBC endpoint as an offline store, this config specifies the JDBC database to read from. | Required if using JDBC sources as an offline store. |
| OFFLINE_STORE__JDBC__JDBC_TABLE | If using a JDBC endpoint as an offline store, this config specifies the JDBC table to read from. Same as JDBC_TABLE. | Required if using JDBC sources as an offline store. |
| JDBC_TABLE | If using a JDBC endpoint as an offline store, this config specifies the JDBC table to read from. | Required if using JDBC sources as an offline store. |
| JDBC_USER | If using a JDBC endpoint as an offline store, this config specifies the JDBC user. | Required if using JDBC sources as an offline store. |
| JDBC_PASSWORD | If using a JDBC endpoint as an offline store, this config specifies the JDBC password. | Required if using JDBC sources as an offline store. |
| KAFKA_SASL_JAAS_CONFIG | If using EventHub as a streaming input source, this configures the Kafka stream. If using EventHub, read here for how to get this string from the existing connection string in the Azure Portal. | Required if using Kafka/EventHub as a streaming source input. |
| PROJECT_CONFIG__PROJECT_NAME | Configures the project name. | Required |
| OFFLINE_STORE__SNOWFLAKE__URL | Configures the Snowflake URL. Usually it's something like dqllago-ol19457.snowflakecomputing.com. | Required if using Snowflake as an offline store. |
| OFFLINE_STORE__SNOWFLAKE__USER | Configures the Snowflake user. | Required if using Snowflake as an offline store. |
| OFFLINE_STORE__SNOWFLAKE__ROLE | Configures the Snowflake role. Usually it's something like ACCOUNTADMIN. | Required if using Snowflake as an offline store. |
| JDBC_SF_PASSWORD | Configures the Snowflake password. | Required if using Snowflake as an offline store. |
| SPARK_CONFIG__SPARK_CLUSTER | Choice of Spark runtime. Currently supports: azure_synapse, databricks. The Databricks configs will be ignored if azure_synapse is set, and vice versa. | Required |
| SPARK_CONFIG__SPARK_RESULT_OUTPUT_PARTS | Configures the number of parts for the Spark output of feature generation jobs. | Required |
| SPARK_CONFIG__AZURE_SYNAPSE__DEV_URL | Dev URL of the Synapse cluster. Usually it's something like https://yourclustername.dev.azuresynapse.net | Required if using Azure Synapse. |
| SPARK_CONFIG__AZURE_SYNAPSE__POOL_NAME | Name of the Spark pool that you are going to use. | Required if using Azure Synapse. |
| SPARK_CONFIG__AZURE_SYNAPSE__WORKSPACE_DIR | A location that Synapse has access to. This workspace dir stores all the required configuration files and the jar resources. All the feature definitions will be uploaded here. | Required if using Azure Synapse. |
| SPARK_CONFIG__AZURE_SYNAPSE__EXECUTOR_SIZE | Specifies the executor size for the Azure Synapse cluster. Currently the options are Small, Medium, Large. | Required if using Azure Synapse. |
| SPARK_CONFIG__AZURE_SYNAPSE__EXECUTOR_NUM | Specifies the number of executors for the Azure Synapse cluster. | Required if using Azure Synapse. |
| SPARK_CONFIG__AZURE_SYNAPSE__FEATHR_RUNTIME_LOCATION | Specifies the Feathr runtime location. Supports local paths, paths starting with http(s)://, and paths starting with abfss:/. If not set, the Feathr package published in Maven will be used. | Required if using Azure Synapse. |
| SPARK_CONFIG__DATABRICKS__WORKSPACE_INSTANCE_URL | Workspace instance URL of your Databricks cluster. Will be something like https://adb-6885802458123232.12.azuredatabricks.net/ | Required if using Databricks. |
| SPARK_CONFIG__DATABRICKS__CONFIG_TEMPLATE | Config string including runtime information, Spark version, machine size, etc. See below for more details. | Required if using Databricks. |
| SPARK_CONFIG__DATABRICKS__WORK_DIR | Workspace dir for storing all the required configuration files and the jar resources. All the feature definitions will be uploaded here. | Required if using Databricks. |
| SPARK_CONFIG__DATABRICKS__FEATHR_RUNTIME_LOCATION | Feathr runtime location. Supports local paths, paths starting with http(s)://, and paths starting with dbfs:/. If not set, the Feathr package published in Maven will be used. | Required if using Databricks. |
| ONLINE_STORE__REDIS__HOST | Redis host name to access the Redis cluster. | Required if using Redis as an online store. |
| ONLINE_STORE__REDIS__PORT | Redis port number to access the Redis cluster. | Required if using Redis as an online store. |
| ONLINE_STORE__REDIS__SSL_ENABLED | Whether SSL is enabled to access the Redis cluster. | Required if using Redis as an online store. |
| REDIS_PASSWORD | Password for the Redis cluster. | Required if using Redis as an online store. |
| FEATURE_REGISTRY__PURVIEW__PURVIEW_NAME | Configures the name of the Purview endpoint. | Required if using Purview as the endpoint. |
| FEATURE_REGISTRY__PURVIEW__DELIMITER | See below for more details. | Required |
| FEATURE_REGISTRY__PURVIEW__TYPE_SYSTEM_INITIALIZATION | Controls whether the type system (think of this as the "schema" for the registry) will be initialized or not. Usually this only needs to be set to True once to initialize the schema; you can then set it to False to shorten the initialization time. | Required |
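Putting a few of these together, a minimal Databricks-plus-Redis setup might set the following before creating a Feathr client. This is a sketch; all values below are placeholders, not working credentials:

```python
import os

# Placeholder values -- substitute your own resources.
os.environ["PROJECT_CONFIG__PROJECT_NAME"] = "my_feathr_project"
os.environ["SPARK_CONFIG__SPARK_CLUSTER"] = "databricks"
os.environ["SPARK_CONFIG__DATABRICKS__WORKSPACE_INSTANCE_URL"] = "https://adb-1234567890123456.7.azuredatabricks.net/"
os.environ["ONLINE_STORE__REDIS__HOST"] = "my-redis.redis.cache.windows.net"
os.environ["ONLINE_STORE__REDIS__PORT"] = "6380"
os.environ["ONLINE_STORE__REDIS__SSL_ENABLED"] = "True"
os.environ["REDIS_PASSWORD"] = "<your-redis-password>"
```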

## Explanation for selected configurations

### SPARK_CONFIG__DATABRICKS__CONFIG_TEMPLATE

Essentially it's a compact JSON string that represents the important configurations for the Databricks cluster that you use. The parts marked as "FEATHR_FILL_IN" will be filled in by Feathr, but all the other parts are customizable.

The config template represents what will be submitted to the Databricks cluster, and you can see the structure of this configuration template in the Databricks job runs API documentation.

The most important and useful part is the `new_cluster` section. For example, you can change `spark_version`, `node_type_id`, `num_workers`, etc. based on your environment:

{"run_name":"FEATHR_FILL_IN","new_cluster":{"spark_version":"9.1.x-scala2.12","node_type_id":"Standard_D3_v2","num_workers":2,"spark_conf":{"FEATHR_FILL_IN":"FEATHR_FILL_IN"}},"libraries":[{"jar":"FEATHR_FILL_IN"}],"spark_jar_task":{"main_class_name":"FEATHR_FILL_IN","parameters":["FEATHR_FILL_IN"]}}

Another use case is to use `instance_pool_id`: instead of creating the Spark cluster from scratch every time, you can reuse a pool to run the job and shorten the run time:

{"run_name":"FEATHR_FILL_IN","new_cluster":{"spark_version":"9.1.x-scala2.12","num_workers":2,"spark_conf":{"FEATHR_FILL_IN":"FEATHR_FILL_IN"},"instance_pool_id":"0403-214809-inlet434-pool-l9dj3kwz"},"libraries":[{"jar":"FEATHR_FILL_IN"}],"spark_jar_task":{"main_class_name":"FEATHR_FILL_IN","parameters":["FEATHR_FILL_IN"]}}

Other advanced settings include `idempotency_token`, which guarantees the idempotency of job run requests.

### FEATURE_REGISTRY__PURVIEW__DELIMITER

The delimiter indicates how the project name, feature names, etc. are delimited. By default it is `__`. This is for global reference (mainly for feature sharing). For example, if we set up a project called `foo` with an anchor called `taxi_driver` and a feature called `f_daily_trips`, the feature will have the globally unique name `foo__taxi_driver__f_daily_trips`.
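In other words, the globally unique name is simply the project, anchor, and feature names joined by the delimiter:

```python
delimiter = "__"  # the default
project, anchor, feature = "foo", "taxi_driver", "f_daily_trips"
global_name = delimiter.join([project, anchor, feature])
print(global_name)  # foo__taxi_driver__f_daily_trips
```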

## A note on using Azure Key Vault to store credentials

Feathr has native integration with Azure Key Vault to make access to resources more secure. However, Azure Key Vault doesn't allow underscores (`_`) in secret names, so Feathr automatically converts underscores (`_`) to dashes (`-`). For example, Feathr will look for `ONLINE-STORE--REDIS--HOST` in Azure Key Vault if the actual environment variable is `ONLINE_STORE__REDIS__HOST`.

Please refer to akv_client.py to understand the secret name conversion logic.
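The conversion itself is a simple character substitution. Below is a sketch of the idea, assuming the azure-identity and azure-keyvault-secrets packages; the authoritative logic lives in akv_client.py:

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient


def to_akv_secret_name(env_var: str) -> str:
    """Azure Key Vault secret names cannot contain underscores,
    so ONLINE_STORE__REDIS__HOST becomes ONLINE-STORE--REDIS--HOST."""
    return env_var.replace("_", "-")


def get_secret(key_vault_name: str, env_var: str) -> str:
    """Fetch the Key Vault secret corresponding to an environment variable."""
    client = SecretClient(
        vault_url=f"https://{key_vault_name}.vault.azure.net",
        credential=DefaultAzureCredential(),
    )
    return client.get_secret(to_akv_secret_name(env_var)).value
```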

Azure Key Vault is not case sensitive, so `online_store__redis__host` and `ONLINE_STORE__REDIS__HOST` result in the same request to Azure Key Vault and yield the same result.

You can automatically store resource credentials at the provisioning stage, just like azure_resource_provision.json does for `REDIS-PASSWORD`.