📢Data Quality

Data Quality plugins inspect the transformed data and provide a list data quality issues. These can be anything from extra newlines, to decimals with too many places, or the presence special characters.

  • Need to modify the data before it's ever touched by the Transforms? Use File IO.

  • Need to modify the data, maps, options, etc after Transforms has successfully loaded the data into a table? Use the Transform Process

  • Need to generate data quality reporting? Use Data Quality.

Use Cases

  • A company file doesn't allow for decimals with more than 2 places

  • We need to verify that every group of transactions in the file sum to positive amounts

  • The combination of multiple fields cannot be greater than n length.

Creating a plugin project

If you would like to actually create a plugin library (dll project), follow these steps first and we'll put our code here. Otherwise, skip this step, and create the code directly within your project.

  1. Create a new DLL project, and for the time being, set the framework to net6.0.

  2. Install the latest version of Perigee using install-package perigee - OR use Nuget Package Manager.

  3. Open the .proj file by double clicking on the DLL project in your code editor. You should see the XML for the project below.

  4. The two changes you need to make are:

    • Add the <EnableDynamicLoading>true</EnableDynamicLoading> to the PropertyGroup tag

    • For the PackageReferences, add <Private>false</Private and <ExcludeAssets>runtime</ExcludeAssets>

<Project Sdk="Microsoft.NET.Sdk">

	<PropertyGroup>
		<EnableDynamicLoading>true</EnableDynamicLoading>
		<TargetFramework>net6.0</TargetFramework>
		<ImplicitUsings>enable</ImplicitUsings>
		<Nullable>enable</Nullable>
	</PropertyGroup>

	<ItemGroup>
		<PackageReference Include="perigee" Version="24.6.1.1">
			<Private>false</Private>
			<ExcludeAssets>runtime</ExcludeAssets>
		</PackageReference>
	</ItemGroup>

</Project>

That's it! You've created a new DLL Project that when built, will produce a plugin.dll that Transforms is able hot reload and run dynamically at runtime.

Authoring the plugin

The plugin can contain many Data Quality checks. Each process is defined by a method, and an attribute. Here's what a new process for AmountCannotBeZero looks like:

[DataQualityCheck(true, "Amount Zero", "A check to determine if any AMOUNT columns are zero", validateAtTable: true, partition: "yardi")]
public class DQ_AmountCannotBeZero : IDataQualityValidator
{
    public void Validate(TransformDataQualityContext data, ref DataQualityValidationResult result)
    {
        
    }
}

Attribute

The [attribute] tells the system several important things, in the order shown above, they are:

  1. Active? - Should the plugin loader use this plugin, is it active? Or is this in development or unavailable.

  2. Name - What name is this plugin given? This is shown in the data quality report and should be short and descriptive

  3. Description - May be used in the report to further explain the check

  4. Valdiate At - This is going to the most common use case, validating at the table level. It's also the most performant. The other option is related to validating Set level transforms.

  5. Partition Keys - This is a very important field to fill out. This specifies under what files (partitions) to run the data quality checks. You can partition them for only certain types of files. You may provide multiple keys in a comma separated list like so: "yardi, finanace, FinanceFileA"

    • It's either blank, "" - which means it can always run.

    • It has the DataTableName (TransformGroup) - Which can automatically be selected when running that specific map.

    • It has a generic key (like yardi, custom, finance, etc), for which you can specify during the transform process which keys you'd like to run. See the MapTo section for more info on running with partition keys

Other optional attribute values you can supply are:

  • IsPostTransform (false|true) - This is typically true, meaning this process is run after the transformation occurs.

  • IsPreTransform (false|true) - This is typically false, meaning this process is run before the transformation occurs. Less common, as typically you validate the data after it's been modified and mapped.

Interface

The IDataQualityValidator interface gives the method all of the required data it needs to process the file.

The main method you'll use in the TransformDataQualityContext is the Process method. This method automatically parallel processes the entire dataset and provides an easy to use callback to add validation rows.

The end result of any callback should be adding a new DataQualityValidationRow for every quality issue that is found. Finishing the implementation for our amount zero check, we'll look at any AMOUNT columns that do not convert and read as 0.0m.

[DataQualityCheck(true, "Amount Zero", "A check to determine if any AMOUNT columns are zero", validateAtTable: true, partition: "yardi")]
public class DQ_AmountCannotBeZero : IDataQualityValidator
{
    public void Validate(TransformDataQualityContext data, ref DataQualityValidationResult result)
    {
        data.Process(result, () => data.RequiredColumns("AMOUNT"), (header, row, indx, bag) =>
        {
            var amount = row["AMOUNT"];
            if (amount != DBNull.Value && Convert.ToDecimal(amount) == 0.0m)
                bag.Add(new DataQualityValidationRow("AMOUNT", amount.ToString(), indx));

        }, 1);
    }
}

One more example:

Here's another example of a check that validates no newline characters are present. You can see the exact same pattern is followed, we just use the helper method ColumnsOfType to determine any string columns, then iterate those and report.

[DataQualityCheck(true, "Newline Columns", "If the cell contains two or more lines separated by a newline", validateAtTable: true, partition: "yardi")]
public class DQ_NewlineColumns : IDataQualityValidator
{
    public void Validate(TransformDataQualityContext data, ref DataQualityValidationResult result)
    {
        var dc = data.ColumnsOfType(typeof(string));
        var nlc = new char[] { '\r', '\n' };
        data.Process(result, () => dc.Any(), (header, row, indx, bag) =>
        {
            foreach (var col in dc)
            {
                if (row[col] == null || row[col] == DBNull.Value) continue;
                if (row[col]?.ToString()?.IndexOfAny(nlc) != -1) bag.Add(new DataQualityValidationRow(col.ColumnName, row[col]?.ToString(), indx));

            }
        }, dc.Count);
    }
}

SDK

To see all of the available methods, properties, and helpers, check out the SDK page:

Running Data Quality Manually (SDK)

If you're running DQ as part of a transform it's baked right into the process. See MapTo.

If you are wanting to run DQ modules outside of this process, here's an example of running them manually:

//Get data and map spec first
var sourceData = new DataTable();
var mapSpec = Transformer.GetMapFromFile("map.xlsx", out var mrpt).FirstOrDefault();

//Iterate modules defined in assembly
foreach (var dqTable in DataQuality.GetModules().Where(f => f.dataQualityCheckAttribute.Active))
{
    var dqresult = dqTable.RunForInstance(dqTable.Instance, 
        new TransformDataQualityContext(dqTable.dataQualityCheckAttribute, sourceData, mapSpec, null));

    if (!dqresult.Ignored)
    {
        //Do whatever you like with the results
    }
}

Installation in Client App

If you created a plugin.dll project: Compile the project and drop the .dll into the Plugins/DQ folder.

If you wrote the process in the same project as you're running, the plugin loader will automatically take a scan of the assembly and the plugin is available for use.

Last updated