Copyright 2021 The TensorFlow Authors.¶
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Data validation using TFX Pipeline and TensorFlow Data Validation¶
Note: We recommend running this tutorial in a Colab notebook, with no setup required! Just click "Run in Google Colab".
In this notebook-based tutorial, we will create and run TFX pipelines to validate input data and create an ML model. This notebook is based on the TFX pipeline we built in Simple TFX Pipeline Tutorial. If you have not read that tutorial yet, you should read it before proceeding with this notebook.
The first task in any data science or ML project is to understand and clean the data, which includes:
- Understanding the data types, distributions, and other information (e.g., mean value, or number of uniques) about each feature
- Generating a preliminary schema that describes the data
- Identifying anomalies and missing values in the data with respect to a given schema
In this tutorial, we will create two TFX pipelines.
First, we will create a pipeline to analyze the dataset and generate a preliminary schema of the given dataset. This pipeline will include two new components, StatisticsGen and SchemaGen.
Once we have a proper schema of the data, we will create a pipeline to train an ML classification model based on the pipeline from the previous tutorial. In this pipeline, we will use the schema from the first pipeline and a new component, ExampleValidator, to validate the input data.
The three new components, StatisticsGen, SchemaGen and ExampleValidator, are TFX components for data analysis and validation, and they are implemented using the TensorFlow Data Validation library.
Please see Understanding TFX Pipelines to learn more about various concepts in TFX.
try:
  import colab
  !pip install --upgrade pip
except:
  pass
Install TFX¶
!pip install -U tfx
Did you restart the runtime?¶
If you are using Google Colab, the first time that you run the cell above, you must restart the runtime by clicking the "RESTART RUNTIME" button above or by using the "Runtime > Restart runtime ..." menu. This is because of the way that Colab loads packages.
Check the TensorFlow and TFX versions.
import tensorflow as tf
print('TensorFlow version: {}'.format(tf.__version__))
from tfx import v1 as tfx
print('TFX version: {}'.format(tfx.__version__))
Set up variables¶
There are some variables used to define a pipeline. You can customize these variables as you want. By default all output from the pipeline will be generated under the current directory.
import os
# We will create two pipelines. One for schema generation and one for training.
SCHEMA_PIPELINE_NAME = "penguin-tfdv-schema"
PIPELINE_NAME = "penguin-tfdv"
# Output directory to store artifacts generated from the pipeline.
SCHEMA_PIPELINE_ROOT = os.path.join('pipelines', SCHEMA_PIPELINE_NAME)
PIPELINE_ROOT = os.path.join('pipelines', PIPELINE_NAME)
# Path to a SQLite DB file to use as an MLMD storage.
SCHEMA_METADATA_PATH = os.path.join('metadata', SCHEMA_PIPELINE_NAME,
                                    'metadata.db')
METADATA_PATH = os.path.join('metadata', PIPELINE_NAME, 'metadata.db')
# Output directory where created models from the pipeline will be exported.
SERVING_MODEL_DIR = os.path.join('serving_model', PIPELINE_NAME)
from absl import logging
logging.set_verbosity(logging.INFO) # Set default logging level.
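As a quick check of the layout these variables produce, the following sketch rebuilds the same relative paths and prints them. The helper `pipeline_paths` is hypothetical and exists only for this illustration; the pipeline names are restated so the snippet runs on its own, and nothing is created on disk.

```python
import os

# Same names as in the cell above, restated so this sketch runs standalone.
SCHEMA_PIPELINE_NAME = "penguin-tfdv-schema"
PIPELINE_NAME = "penguin-tfdv"


def pipeline_paths(name: str) -> dict:
  """Returns the relative output locations used for one pipeline.

  Hypothetical helper, not part of TFX.
  """
  return {
      "pipeline_root": os.path.join("pipelines", name),
      "metadata_path": os.path.join("metadata", name, "metadata.db"),
  }


for name in (SCHEMA_PIPELINE_NAME, PIPELINE_NAME):
  for key, path in pipeline_paths(name).items():
    print(f"{name}: {key} = {path}")
```

Keeping one root and one metadata file per pipeline means the schema pipeline and the training pipeline do not overwrite each other's outputs.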
Prepare example data¶
We will download the example dataset for use in our TFX pipeline. The dataset we are using is the Palmer Penguins dataset, which is also used in other TFX examples.
There are four numeric features in this dataset:
- culmen_length_mm
- culmen_depth_mm
- flipper_length_mm
- body_mass_g
All features were already normalized to have range [0,1]. We will build a classification model which predicts the species of penguins.
Because the TFX ExampleGen component reads inputs from a directory, we need to create a directory and copy the dataset to it.
import urllib.request
import tempfile
DATA_ROOT = tempfile.mkdtemp(prefix='tfx-data') # Create a temporary directory.
_data_url = 'https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/penguin/data/labelled/penguins_processed.csv'
_data_filepath = os.path.join(DATA_ROOT, "data.csv")
urllib.request.urlretrieve(_data_url, _data_filepath)
Take a quick look at the CSV file.
!head {_data_filepath}
You should be able to see five feature columns. species is one of 0, 1 or 2, and all other features should have values between 0 and 1. We will create a TFX pipeline to analyze this dataset.
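Before handing the file to a pipeline, the expectations stated above (label in {0, 1, 2}, every other column in [0, 1]) can be checked by hand. The sketch below uses only the standard library; `check_penguin_csv` is a hypothetical helper, and the sample rows are made up for illustration rather than copied from the real dataset.

```python
import csv
import io


def check_penguin_csv(lines, label_key="species"):
  """Returns a list of human-readable problems found in the CSV rows."""
  problems = []
  reader = csv.DictReader(lines)
  for row_number, row in enumerate(reader, start=1):
    for name, raw in row.items():
      value = float(raw)
      if name == label_key:
        # The label is expected to be one of three class indices.
        if value not in (0, 1, 2):
          problems.append(f"row {row_number}: {name}={raw} is not 0, 1 or 2")
      elif not 0.0 <= value <= 1.0:
        # Every other column is expected to be normalized to [0, 1].
        problems.append(f"row {row_number}: {name}={raw} is outside [0, 1]")
  return problems


# Illustrative rows only; these are NOT values from the real dataset.
sample = io.StringIO(
    "species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g\n"
    "0,0.25,0.67,0.15,0.29\n"
    "2,0.61,0.18,1.20,0.74\n")
print(check_penguin_csv(sample))
```

To run the same check on the downloaded file, pass an open file handle for `_data_filepath` instead of the in-memory sample.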
Generate a preliminary schema¶
TFX pipelines are defined using Python APIs. We will create a pipeline to generate a schema from the input examples automatically. This schema can be reviewed by a human and adjusted as needed. Once the schema is finalized it can be used for training and example validation in later tasks.
In addition to CsvExampleGen, which is used in Simple TFX Pipeline Tutorial, we will use StatisticsGen and SchemaGen:
- StatisticsGen calculates statistics for the dataset.
- SchemaGen examines the statistics and creates an initial data schema.
See the guides for each component or the TFX components tutorial to learn more about these components.
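To make the division of labor between the two components concrete, here is a conceptual sketch in plain Python. It is not how TFDV is implemented: `compute_statistics` and `infer_schema` are hypothetical stand-ins that only mirror the idea of "summarize each column first, then derive a type per column from the summary". The toy rows are made up.

```python
import csv
import io
import statistics


def _is_int(text):
  """True if the raw CSV text parses as an integer."""
  try:
    int(text)
    return True
  except ValueError:
    return False


def compute_statistics(lines):
  """Computes simple per-column statistics from CSV text."""
  columns = {}
  for row in csv.DictReader(lines):
    for name, raw in row.items():
      columns.setdefault(name, []).append(raw)
  stats = {}
  for name, values in columns.items():
    numbers = [float(v) for v in values]
    stats[name] = {
        "count": len(numbers),
        "min": min(numbers),
        "max": max(numbers),
        "mean": statistics.fmean(numbers),
        "num_unique": len(set(numbers)),
        "all_integers": all(_is_int(v) for v in values),
    }
  return stats


def infer_schema(stats):
  """Derives a toy schema (one type per column) from the statistics."""
  return {
      name: "INT" if s["all_integers"] else "FLOAT"
      for name, s in stats.items()
  }


# Made-up rows for illustration only.
toy = io.StringIO("species,body_mass_g\n0,0.25\n1,0.75\n")
toy_stats = compute_statistics(toy)
print(toy_stats["body_mass_g"])
print(infer_schema(toy_stats))
```

The point of the two-step design is that the schema is inferred from the summary alone, so the second step never has to read the raw examples again.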
Write a pipeline definition¶
We define a function to create a TFX pipeline. A Pipeline object represents a TFX pipeline which can be run using one of the pipeline orchestration systems that TFX supports.
def _create_schema_pipeline(pipeline_name: str,
                            pipeline_root: str,
                            data_root: str,
                            metadata_path: str) -> tfx.dsl.Pipeline:
  """Creates a pipeline for schema generation."""
  # Brings data into the pipeline.
  example_gen = tfx.components.CsvExampleGen(input_base=data_root)

  # NEW: Computes statistics over data for visualization and schema generation.
  statistics_gen = tfx.components.StatisticsGen(
      examples=example_gen.outputs['examples'])

  # NEW: Generates schema based on the generated statistics.
  schema_gen = tfx.components.SchemaGen(
      statistics=statistics_gen.outputs['statistics'], infer_feature_shape=True)

  components = [
      example_gen,
      statistics_gen,
      schema_gen,
  ]

  return tfx.dsl.Pipeline(
      pipeline_name=pipeline_name,
      pipeline_root=pipeline_root,
      metadata_connection_config=tfx.orchestration.metadata
      .sqlite_metadata_connection_config(metadata_path),
      components=components)
Run the pipeline¶
We will use LocalDagRunner as in the previous tutorial.
tfx.orchestration.LocalDagRunner().run(
  _create_schema_pipeline(
      pipeline_name=SCHEMA_PIPELINE_NAME,
      pipeline_root=SCHEMA_PIPELINE_ROOT,
      data_root=DATA_ROOT,
      metadata_path=SCHEMA_METADATA_PATH))
You should see "INFO:absl:Component SchemaGen is finished." if the pipeline finished successfully.
We will examine the output of the pipeline to understand our dataset.
Review outputs of the pipeline¶
As explained in the previous tutorial, a TFX pipeline produces two kinds of outputs, artifacts and a metadata DB (MLMD) which contains metadata of artifacts and pipeline executions. We defined the location of these outputs in the above cells. By default, artifacts are stored under the pipelines directory and metadata is stored as a SQLite database under the metadata directory.
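If you only want to see which files a run left behind, without going through MLMD, a plain directory walk is enough. The sketch below is a minimal standard-library helper; `list_artifact_files` is a hypothetical name, and the directory layout in the demonstration is made up for illustration, not a statement about the exact layout TFX writes.

```python
import os
import tempfile


def list_artifact_files(root):
  """Returns sorted relative paths of every file found under root."""
  found = []
  for dirpath, _, filenames in os.walk(root):
    for filename in filenames:
      full_path = os.path.join(dirpath, filename)
      found.append(os.path.relpath(full_path, root))
  return sorted(found)


# Demonstration on a throwaway directory with a hypothetical layout.
with tempfile.TemporaryDirectory() as demo_root:
  demo_dir = os.path.join(demo_root, "SchemaGen", "schema", "1")
  os.makedirs(demo_dir)
  with open(os.path.join(demo_dir, "schema.pbtxt"), "w") as f:
    f.write("")
  print(list_artifact_files(demo_root))
```

After a real run, calling the helper on the pipeline root directory lists the files actually written. A directory walk cannot tell you which execution produced which file; that lineage is what the metadata DB records.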
You can use MLMD APIs to locate these outputs programmatically. First, we will define some utility functions to search for the output artifacts that were just produced.
from ml_metadata.proto import metadata_store_pb2
# Non-public APIs, just for showcase.
from tfx.orchestration.portable.mlmd import execution_lib
# TODO(b/171447278): Move these functions into the TFX library.
def get_latest_artifacts(metadata, pipeline_name, component_id):
  """Output artifacts of the latest run of the component."""
  context = metadata.store.get_context_by_type_and_name(
      'node', f'{pipeline_name}.{component_id}')
  executions = metadata.store.get_executions_by_context(context.id)
  latest_execution = max(executions,
                         key=lambda e:e.last_update_time_since_epoch)
  return execution_lib.get_output_artifacts(metadata, latest_execution.id)

# Non-public APIs, just for showcase.
from tfx.orchestration.experimental.interactive import visualizations

def visualize_artifacts(artifacts):
  """Visualizes artifacts using standard visualization modules."""
  for artifact in artifacts:
    visualization = visualizations.get_registry().get_visualization(
        artifact.type_name)
    if visualization:
      visualization.display(artifact)

from tfx.orchestration.experimental.interactive import standard_visualizations
standard_visualizations.register_standard_visualizations()
Now we can examine the outputs from the pipeline execution.
# Non-public APIs, just for showcase.
from tfx.orchestration.metadata import Metadata
from tfx.types import standard_component_specs
metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(
    SCHEMA_METADATA_PATH)

with Metadata(metadata_connection_config) as metadata_handler:
  # Find output artifacts from MLMD.
  stat_gen_output = get_latest_artifacts(metadata_handler, SCHEMA_PIPELINE_NAME,
                                         'StatisticsGen')
  stats_artifacts = stat_gen_output[standard_component_specs.STATISTICS_KEY]

  schema_gen_output = get_latest_artifacts(metadata_handler,
                                           SCHEMA_PIPELINE_NAME, 'SchemaGen')
  schema_artifacts = schema_gen_output[standard_component_specs.SCHEMA_KEY]
It is time to examine the outputs from each component. As described above, TensorFlow Data Validation (TFDV) is used in StatisticsGen and SchemaGen, and TFDV also provides visualization of the outputs from these components.
In this tutorial, we will use the visualization helper methods in TFX which use TFDV internally to show the visualization.
Examine the output from StatisticsGen¶
# docs-infra: no-execute
visualize_artifacts(stats_artifacts)
You can see various stats for the input data. These statistics are supplied to SchemaGen to construct an initial schema of data automatically.
Examine the output from SchemaGen¶
visualize_artifacts(schema_artifacts)
This schema is automatically inferred from the output of StatisticsGen. You should be able to see 4 FLOAT features and 1 INT feature.
Export the schema for future use¶
We need to review and refine the generated schema. The reviewed schema needs to be persisted to be used in subsequent pipelines for ML model training. In other words, you might want to add the schema file to your version control system for actual use cases. In this tutorial, we will just copy the schema to a predefined filesystem path for simplicity.
import shutil
_schema_filename = 'schema.pbtxt'
SCHEMA_PATH = 'schema'
os.makedirs(SCHEMA_PATH, exist_ok=True)
_generated_path = os.path.join(schema_artifacts[0].uri, _schema_filename)
# Copy the 'schema.pbtxt' file from the artifact uri to a predefined path.
shutil.copy(_generated_path, SCHEMA_PATH)
The schema file uses Protocol Buffer text format and is an instance of the TensorFlow Metadata Schema proto.
print(f'Schema at {SCHEMA_PATH}-----')
!cat {SCHEMA_PATH}/*
You should be sure to review and possibly edit the schema definition as needed. In this tutorial, we will just use the generated schema unchanged.
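When reviewing a schema, a compact name-to-type summary is often easier to scan than the full text file. The sketch below is a rough, standard-library-only reader; `summarize_schema_text` is a hypothetical helper that assumes one field per line and top-level `feature { ... }` blocks, which is an assumption about how the generated file is laid out. The sample text is illustrative. For anything beyond a quick look, parse the file into a `schema_pb2.Schema` message with protobuf's `text_format.Parse` instead.

```python
import re


def summarize_schema_text(text):
  """Returns {feature_name: type} from schema text in pbtxt layout.

  Rough line-based reader: assumes one field per line and that features
  appear as top-level `feature { ... }` blocks.
  """
  features = {}
  depth = 0
  current = None
  for line in text.splitlines():
    stripped = line.strip()
    if depth == 0 and stripped.startswith("feature {"):
      current = {}
    elif depth == 1 and current is not None:
      # Only fields directly inside the feature block are of interest.
      match = re.match(r'(name|type):\s*"?([^"\s]+)"?$', stripped)
      if match:
        current[match.group(1)] = match.group(2)
    depth += stripped.count("{") - stripped.count("}")
    if depth == 0 and current is not None:
      features[current.get("name")] = current.get("type")
      current = None
  return features


# Illustrative schema text, not the exact file produced by SchemaGen.
sample_text = '''feature {
  name: "species"
  type: INT
  presence {
    min_count: 1
  }
}
feature {
  name: "body_mass_g"
  type: FLOAT
}
'''
print(summarize_schema_text(sample_text))
```

Comparing such a summary before and after an edit is a cheap way to confirm that a manual change touched only the features you intended.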
Validate input examples and train an ML model¶
We will go back to the pipeline that we created in Simple TFX Pipeline Tutorial, to train an ML model and use the generated schema for writing the model training code.
We will also add an ExampleValidator component which will look for anomalies and missing values in the incoming dataset with respect to the schema.
Write model training code¶
We need to write the model code as we did in Simple TFX Pipeline Tutorial.
The model itself is the same as in the previous tutorial, but this time we will use the schema generated from the previous pipeline instead of specifying features manually. Most of the code was not changed. The only difference is that we do not need to specify the names and types of features in this file. Instead, we read them from the schema file.
_trainer_module_file = 'penguin_trainer.py'
%%writefile {_trainer_module_file}
from typing import List
from absl import logging
import tensorflow as tf
from tensorflow import keras
from tensorflow_transform.tf_metadata import schema_utils

from tfx import v1 as tfx
from tfx_bsl.public import tfxio
from tensorflow_metadata.proto.v0 import schema_pb2

# We don't need to specify _FEATURE_KEYS and _FEATURE_SPEC any more.
# That information can be read from the given schema file.

_LABEL_KEY = 'species'

_TRAIN_BATCH_SIZE = 20
_EVAL_BATCH_SIZE = 10

def _input_fn(file_pattern: List[str],
              data_accessor: tfx.components.DataAccessor,
              schema: schema_pb2.Schema,
              batch_size: int = 200) -> tf.data.Dataset:
  """Generates features and label for training.

  Args:
    file_pattern: List of paths or patterns of input tfrecord files.
    data_accessor: DataAccessor for converting input to RecordBatch.
    schema: schema of the input data.
    batch_size: representing the number of consecutive elements of returned
      dataset to combine in a single batch

  Returns:
    A dataset that contains (features, indices) tuple where features is a
      dictionary of Tensors, and indices is a single Tensor of label indices.
  """
  return data_accessor.tf_dataset_factory(
      file_pattern,
      tfxio.TensorFlowDatasetOptions(
          batch_size=batch_size, label_key=_LABEL_KEY),
      schema=schema).repeat()

def _build_keras_model(schema: schema_pb2.Schema) -> tf.keras.Model:
  """Creates a DNN Keras model for classifying penguin data.

  Returns:
    A Keras Model.
  """
  # The model below is built with Functional API, please refer to
  # https://www.tensorflow.org/guide/keras/overview for all API options.

  # ++ Changed code: Uses all features in the schema except the label.
  feature_keys = [f.name for f in schema.feature if f.name != _LABEL_KEY]
  inputs = [keras.layers.Input(shape=(1,), name=f) for f in feature_keys]
  # ++ End of the changed code.

  d = keras.layers.concatenate(inputs)
  for _ in range(2):
    d = keras.layers.Dense(8, activation='relu')(d)
  outputs = keras.layers.Dense(3)(d)

  model = keras.Model(inputs=inputs, outputs=outputs)
  model.compile(
      optimizer=keras.optimizers.Adam(1e-2),
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      metrics=[keras.metrics.SparseCategoricalAccuracy()])

  model.summary(print_fn=logging.info)
  return model

# TFX Trainer will call this function.
def run_fn(fn_args: tfx.components.FnArgs):
  """Train the model based on given args.

  Args:
    fn_args: Holds args used to train the model as name/value pairs.
  """
  # ++ Changed code: Reads in schema file passed to the Trainer component.
  schema = tfx.utils.parse_pbtxt_file(fn_args.schema_path, schema_pb2.Schema())
  # ++ End of the changed code.

  train_dataset = _input_fn(
      fn_args.train_files,
      fn_args.data_accessor,
      schema,
      batch_size=_TRAIN_BATCH_SIZE)
  eval_dataset = _input_fn(
      fn_args.eval_files,
      fn_args.data_accessor,
      schema,
      batch_size=_EVAL_BATCH_SIZE)

  model = _build_keras_model(schema)
  model.fit(
      train_dataset,
      steps_per_epoch=fn_args.train_steps,
      validation_data=eval_dataset,
      validation_steps=fn_args.eval_steps)

  # The result of the training should be saved in `fn_args.serving_model_dir`
  # directory.
  model.save(fn_args.serving_model_dir, save_format='tf')
Now you have completed all preparation steps to build a TFX pipeline for model training.
Write a pipeline definition¶
We will add two new components, Importer and ExampleValidator. Importer brings an external file into the TFX pipeline. In this case, it is a file containing the schema definition. ExampleValidator will examine the input data and validate whether all input data conforms to the data schema we provided.
def _create_pipeline(pipeline_name: str, pipeline_root: str, data_root: str,
                     schema_path: str, module_file: str, serving_model_dir: str,
                     metadata_path: str) -> tfx.dsl.Pipeline:
  """Creates a pipeline using predefined schema with TFX."""
  # Brings data into the pipeline.
  example_gen = tfx.components.CsvExampleGen(input_base=data_root)

  # Computes statistics over data for visualization and example validation.
  statistics_gen = tfx.components.StatisticsGen(
      examples=example_gen.outputs['examples'])

  # NEW: Import the schema.
  schema_importer = tfx.dsl.Importer(
      source_uri=schema_path,
      artifact_type=tfx.types.standard_artifacts.Schema).with_id(
          'schema_importer')

  # NEW: Performs anomaly detection based on statistics and data schema.
  example_validator = tfx.components.ExampleValidator(
      statistics=statistics_gen.outputs['statistics'],
      schema=schema_importer.outputs['result'])

  # Uses user-provided Python function that trains a model.
  trainer = tfx.components.Trainer(
      module_file=module_file,
      examples=example_gen.outputs['examples'],
      schema=schema_importer.outputs['result'],  # Pass the imported schema.
      train_args=tfx.proto.TrainArgs(num_steps=100),
      eval_args=tfx.proto.EvalArgs(num_steps=5))

  # Pushes the model to a filesystem destination.
  pusher = tfx.components.Pusher(
      model=trainer.outputs['model'],
      push_destination=tfx.proto.PushDestination(
          filesystem=tfx.proto.PushDestination.Filesystem(
              base_directory=serving_model_dir)))

  components = [
      example_gen,

      # NEW: Following three components were added to the pipeline.
      statistics_gen,
      schema_importer,
      example_validator,

      trainer,
      pusher,
  ]

  return tfx.dsl.Pipeline(
      pipeline_name=pipeline_name,
      pipeline_root=pipeline_root,
      metadata_connection_config=tfx.orchestration.metadata
      .sqlite_metadata_connection_config(metadata_path),
      components=components)
Run the pipeline¶
tfx.orchestration.LocalDagRunner().run(
    _create_pipeline(
        pipeline_name=PIPELINE_NAME,
        pipeline_root=PIPELINE_ROOT,
        data_root=DATA_ROOT,
        schema_path=SCHEMA_PATH,
        module_file=_trainer_module_file,
        serving_model_dir=SERVING_MODEL_DIR,
        metadata_path=METADATA_PATH))
You should see "INFO:absl:Component Pusher is finished." if the pipeline finished successfully.
Examine outputs of the pipeline¶
We have trained the classification model for penguins, and we also have validated the input examples in the ExampleValidator component. We can analyze the output from ExampleValidator as we did with the previous pipeline.
metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(
    METADATA_PATH)

with Metadata(metadata_connection_config) as metadata_handler:
  ev_output = get_latest_artifacts(metadata_handler, PIPELINE_NAME,
                                   'ExampleValidator')
  anomalies_artifacts = ev_output[standard_component_specs.ANOMALIES_KEY]
ExampleAnomalies from the ExampleValidator can be visualized as well.
visualize_artifacts(anomalies_artifacts)
You should see "No anomalies found" for each split of examples. Because we used the same data which was used for the schema generation in this pipeline, no anomaly is expected here. If you run this pipeline repeatedly with new incoming data, ExampleValidator should be able to find any discrepancies between the new data and the existing schema.
If any anomalies are found, review your data to check whether any examples violate your assumptions. Outputs from other components, such as StatisticsGen, might be useful. However, any anomalies which are found will NOT block further pipeline executions.
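Because anomalies do not stop the pipeline, any "fail fast" behavior has to be added by you. The sketch below illustrates the idea in plain Python and is deliberately simplified: ExampleValidator compares dataset statistics against the schema, whereas this toy version checks individual rows against a dictionary of expected Python types. `find_anomalies` and `require_no_anomalies` are hypothetical helpers, and the rows are made up.

```python
def find_anomalies(rows, schema):
  """Compares rows against a toy schema and returns {feature: [messages]}."""
  anomalies = {}
  for index, row in enumerate(rows):
    for name, expected_type in schema.items():
      if name not in row or row[name] is None:
        anomalies.setdefault(name, []).append(f"row {index}: value missing")
      elif not isinstance(row[name], expected_type):
        anomalies.setdefault(name, []).append(
            f"row {index}: expected {expected_type.__name__}, "
            f"got {type(row[name]).__name__}")
    # Features present in the data but absent from the schema.
    for name in row.keys() - schema.keys():
      anomalies.setdefault(name, []).append(f"row {index}: unexpected feature")
  return anomalies


def require_no_anomalies(anomalies):
  """Raises ValueError so that a caller can stop before training."""
  if anomalies:
    raise ValueError(f"Data validation failed for: {sorted(anomalies)}")


# Toy schema and rows for illustration only.
toy_schema = {"species": int, "body_mass_g": float}
toy_rows = [
    {"species": 0, "body_mass_g": 0.29},
    {"species": 1},  # body_mass_g is missing in this row.
]
found = find_anomalies(toy_rows, toy_schema)
print(found)
try:
  require_no_anomalies(found)
except ValueError as error:
  print(error)
```

Separating detection from the decision to stop mirrors the pipeline's own design: validation produces a report, and a later step chooses how strictly to act on it.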
Next steps¶
You can find more resources on https://www.tensorflow.org/tfx/tutorials.
Please see Understanding TFX Pipelines to learn more about various concepts in TFX.