Data Quality
Data Quality plugins inspect the transformed data and provide a list of data quality issues. These can be anything from extra newlines, to decimals with too many places, to the presence of special characters.
Need to modify the data before it's ever touched by the Transforms? Use File IO.
Need to modify the data, maps, options, etc. after Transforms has successfully loaded the data into a table? Use the Transform Process.
Need to generate data quality reporting? Use Data Quality.
Use Cases
A company file doesn't allow for decimals with more than 2 places
We need to verify that every group of transactions in the file sums to a positive amount.
The combination of multiple fields cannot be greater than n length.
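The rules behind these three use cases can be sketched with nothing but the .NET base class library. This is a minimal illustration of the checks themselves, not of the Perigee plugin API: the class `UseCaseRules` and its method names are hypothetical, and a real plugin would report each failure through the callback described later on this page.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class; none of these names come from Perigee.
public static class UseCaseRules
{
    // True when the value has non-zero digits beyond the second decimal place.
    public static bool HasMoreThanTwoPlaces(decimal value) =>
        decimal.Round(value, 2) != value;

    // Returns the keys of groups whose amounts do not sum to a positive total.
    public static List<string> NonPositiveGroups(IEnumerable<(string Group, decimal Amount)> rows) =>
        rows.GroupBy(r => r.Group)
            .Where(g => g.Sum(r => r.Amount) <= 0m)
            .Select(g => g.Key)
            .ToList();

    // True when the concatenated fields are longer than maxLength (null counts as empty).
    public static bool ExceedsCombinedLength(int maxLength, params string[] fields) =>
        string.Concat(fields).Length > maxLength;
}
```

Comparing a value against its rounded form ignores trailing zeros, so 1.500 passes the two-place rule while 1.505 does not.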
Creating a plugin project
If you would like to create a plugin library (a DLL project), follow these steps first and we'll put our code there. Otherwise, skip this step and create the code directly within your project.
Create a new DLL project and, for the time being, set the framework to net6.0.
Install the latest version of Perigee using install-package perigee, or use the NuGet Package Manager.
Open the .csproj file by double-clicking the DLL project in your code editor. You should see the XML for the project.
The two changes you need to make are:
Add <EnableDynamicLoading>true</EnableDynamicLoading> to the PropertyGroup tag.
For the PackageReferences, add <Private>false</Private> and <ExcludeAssets>runtime</ExcludeAssets>.
That's it! You've created a new DLL project that, when built, will produce a plugin.dll that Transforms is able to hot reload and run dynamically at runtime.
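For reference, a project file with both changes applied might look roughly like the sketch below. The overall layout is the standard SDK-style project format; the `Version` value is a placeholder, so keep whatever version the package install wrote into your own file.

```xml
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <TargetFramework>net6.0</TargetFramework>
    <!-- Change 1: build the library so it can be loaded as a plugin at runtime -->
    <EnableDynamicLoading>true</EnableDynamicLoading>
  </PropertyGroup>

  <ItemGroup>
    <!-- Version is a placeholder; use the version you actually installed -->
    <PackageReference Include="perigee" Version="x.y.z">
      <!-- Change 2: do not copy the host's own assemblies into the plugin output -->
      <Private>false</Private>
      <ExcludeAssets>runtime</ExcludeAssets>
    </PackageReference>
  </ItemGroup>

</Project>
```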
Authoring the plugin
The plugin can contain many Data Quality checks. Each process is defined by a method and an attribute. Here's what a new process for AmountCannotBeZero looks like:
Attribute
The [attribute] tells the system several important things. In the order shown above, they are:
Active? - Should the plugin loader use this plugin, or is it still in development or otherwise unavailable?
Name - What name is this plugin given? This is shown in the data quality report and should be short and descriptive.
Description - May be used in the report to further explain the check.
Validate At - This will most commonly be set to validate at the table level, which is also the most performant option. The other option relates to validating Set level transforms.
Partition Keys - This is a very important field to fill out. It specifies which files (partitions) the data quality checks run against, so you can limit a check to certain types of files. You may provide multiple keys in a comma-separated list, like so:
"yardi, finance, FinanceFileA"
The value can be:
Blank ("") - The check can always run.
The DataTableName (TransformGroup) - The check is selected automatically when running that specific map.
A generic key (like yardi, custom, finance, etc.) - During the transform process, you specify which keys you'd like to run. See the MapTo section for more info on running with partition keys.
Other optional attribute values you can supply are:
IsPostTransform (false|true) - This is typically true, meaning this process is run after the transformation occurs.
IsPreTransform (false|true) - This is typically false. When true, this process is run before the transformation occurs. This is less common, as you typically validate the data after it's been modified and mapped.
Interface
The IDataQualityValidator interface gives the method all of the required data it needs to process the file.
The main method you'll use in the TransformDataQualityContext is the Process method. This method automatically processes the entire dataset in parallel and provides an easy-to-use callback to add validation rows.
The end result of any callback should be adding a new DataQualityValidationRow for every quality issue that is found. Finishing the implementation for our amount zero check, we'll look at any AMOUNT columns that do not convert and read as 0.0m.
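The core logic of such a check can be sketched independently of the plugin API. The example below works on a plain System.Data.DataTable and does not use IDataQualityValidator, Process, or DataQualityValidationRow, whose exact signatures are covered on the SDK page. The `QualityIssue` record and `AmountChecks` class are hypothetical stand-ins, and it assumes one reading of the rule: a value that fails to convert is treated the same as a literal zero.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Globalization;

// Hypothetical stand-in for a reported issue; not the Perigee DataQualityValidationRow type.
public record QualityIssue(int RowIndex, string Column, string Message);

public static class AmountChecks
{
    // Flags every row whose AMOUNT value is empty, unparseable, or equal to 0.0m.
    public static List<QualityIssue> AmountCannotBeZero(DataTable table, string column = "AMOUNT")
    {
        var issues = new List<QualityIssue>();
        if (!table.Columns.Contains(column)) return issues; // nothing to check

        for (int i = 0; i < table.Rows.Count; i++)
        {
            string raw = Convert.ToString(table.Rows[i][column], CultureInfo.InvariantCulture) ?? "";

            // TryParse leaves amount at 0.0m on failure, so unparseable values are flagged too.
            decimal.TryParse(raw, NumberStyles.Number, CultureInfo.InvariantCulture, out decimal amount);

            if (amount == 0.0m)
                issues.Add(new QualityIssue(i, column, $"Amount is zero or not a number: '{raw}'"));
        }
        return issues;
    }
}
```

In a real plugin, each `QualityIssue` added here would instead become a validation row added from inside the processing callback.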
One more example:
Here's another example of a check that validates that no newline characters are present. The exact same pattern is followed: we use the helper method ColumnsOfType to find the string columns, then iterate over those and report.
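As a library-independent sketch of that pattern, the example below scans a plain System.Data.DataTable. It filters on `DataColumn.DataType` where a plugin would call the ColumnsOfType helper, and it returns simple tuples where a plugin would add validation rows. The `NewlineChecks` class is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Data;

public static class NewlineChecks
{
    // Returns (row index, column name) for every string cell containing CR or LF.
    public static List<(int Row, string Column)> FindNewlines(DataTable table)
    {
        var hits = new List<(int Row, string Column)>();
        char[] newlineChars = { '\r', '\n' };

        foreach (DataColumn col in table.Columns)
        {
            if (col.DataType != typeof(string)) continue; // only string columns are checked

            for (int i = 0; i < table.Rows.Count; i++)
            {
                // DBNull cells fail the 'is string' test and are skipped.
                if (table.Rows[i][col] is string s && s.IndexOfAny(newlineChars) >= 0)
                    hits.Add((i, col.ColumnName));
            }
        }
        return hits;
    }
}
```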
SDK
To see all of the available methods, properties, and helpers, check out the SDK page:
Running Data Quality Manually (SDK)
If you're running DQ as part of a transform it's baked right into the process. See MapTo.
If you want to run DQ modules outside of this process, here's an example of running them manually:
Installation in Client App
If you created a plugin.dll project: Compile the project and drop the .dll into the Plugins/DQ folder.
If you wrote the process in the same project you're running, the plugin loader will automatically scan the assembly and make the plugin available for use.