Parallel Processing Reference
Parallel Processing basics
Let's take a look at a few things we need to know before jumping into parallel processing libraries.
Ruleset
The rules for parallel processing are as follows. (Don't get too lost in them; reading through will help you understand how to design your parallel processing architecture.)
Don't write to anything that isn't a concurrent-safe type. (Only write to concurrent-safe types!)
Don't call instance methods in parallel (unless you KNOW it's safe to do so). Prefer static methods when calling in parallel loops to avoid concurrency issues.
Use synchronization constructs, like locks or semaphores, to protect shared resources and avoid race conditions.
Ensure that the workload is balanced across all parallel tasks to maximize efficiency and minimize the chance of bottlenecks.
Be mindful of the trade-offs between parallelism and performance, as not all tasks will benefit from parallelization. In some cases, the overhead of creating and managing parallel tasks may outweigh any potential speedup.
Test your parallel code thoroughly to ensure it is working correctly and efficiently, as debugging parallel code can be more difficult than debugging sequential code.
Keep in mind potential issues with exception handling, as exceptions thrown in parallel tasks may be propagated and require proper handling in the calling code.
Consider the target hardware and environment when designing parallel code, as factors like the number of available cores and memory limitations can impact performance and scalability.
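As a minimal sketch of the synchronization rule above, here is a lock protecting a plain (non-concurrent) shared counter inside a parallel loop; the variable names are illustrative:

```csharp
using System;
using System.Threading.Tasks;

int total = 0;
object gate = new object();

// Without the lock, `total += i` is a read-modify-write race.
Parallel.For(0, 1000, i =>
{
    lock (gate)
    {
        total += i; // only one thread mutates the counter at a time
    }
});

Console.WriteLine(total); // 499500: the sum 0 + 1 + ... + 999
```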
Hashing
Hashing is the process of converting data into a fixed-size string of bytes using a mathematical algorithm such as SHA-256 (older algorithms like MD5 are no longer considered secure). The output, called a hash, acts as a practically unique fingerprint of the input data. A small change in the input results in a completely different hash, making hashing useful for verifying data integrity, storing password digests, and creating unique identifiers for data.
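A small sketch of those two properties, the fixed output size and the sensitivity to input changes, using .NET's built-in SHA-256 (the input strings are arbitrary):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// SHA256.HashData is a static, thread-safe helper (available since .NET 5).
string HashOf(string input) =>
    Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(input)));

string a = HashOf("parallel");
string b = HashOf("Parallel"); // a single character changed

Console.WriteLine(a.Length); // 64 hex characters: the output size is fixed
Console.WriteLine(a == b);   // False: a tiny input change flips the hash
```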
Hash based data structures, such as hash tables or hash sets, provide a fast and efficient way to look up or verify existing data. These data structures use hashing to determine the location of an item in memory, allowing for quick access and retrieval even in large datasets. This is particularly useful in parallel processing, where multiple threads or processes may be accessing the same data simultaneously.
Perigee fully embraces hash structures as a way of providing better parallel processing for data sets.
Parallel write-safe structures
C# only has a few concurrent write-safe structures. You can find them under System.Collections.Concurrent.
The two primary concurrent classes we are interested in for this practice are:
ConcurrentBag<T>
ConcurrentDictionary<K,V>
Both of these structures provide a thread-safe, concurrent way of writing data back during a parallel loop. We'll use both of them while processing data in parallel.
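A minimal sketch of writing to both collections from a parallel loop (the data here is arbitrary):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

var bag = new ConcurrentBag<int>();
var dict = new ConcurrentDictionary<string, int>();

// Both collections accept concurrent writes without any extra locking.
Parallel.For(0, 100, i =>
{
    bag.Add(i);
    dict.AddOrUpdate(i % 2 == 0 ? "even" : "odd", 1, (_, count) => count + 1);
});

Console.WriteLine(bag.Count);    // 100
Console.WriteLine(dict["even"]); // 50
Console.WriteLine(dict["odd"]);  // 50
```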
Intro to Perigee Parallel
There are three main parallel concepts within Perigee:
Parallel processing of lookup data for rapid existence checks and retrieval
Parallel transformation of records into a new type
Parallel iteration without a defined output
1 - Parallel Lookup Processing
We have two classes for easy parallel lookups:
GroupProcessor
SingleProcessor
Group Processor
The GroupProcessor allows for parallel execution over a list, DataTable, or other enumerable input. The "Group" is because it performs a "GroupBy" on the input data, so that one input key may point to multiple input items. Original input items are retained, and the "key" is used as a hashed lookup. Let's look at some data and an example:
If you were to use a GroupProcessor on the data above, it would look as so:
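Conceptually, a GroupProcessor builds a grouped, hashed lookup where one key maps to all input rows sharing that key. Below is a plain-C# sketch of that equivalent behavior; the sample data and names are assumptions for illustration, not Perigee's confirmed API:

```csharp
using System;
using System.Linq;

// Illustrative input rows; one "Customer" key maps to multiple items.
var orders = new[]
{
    (Customer: "Ada",  Item: "Ribeye"),
    (Customer: "Ada",  Item: "Sirloin"),
    (Customer: "Alan", Item: "Filet"),
};

// One key -> ALL retained input items with that key, hashed for O(1) lookups.
var lookup = orders.GroupBy(o => o.Customer)
                   .ToDictionary(g => g.Key, g => g.ToList());

Console.WriteLine(lookup.ContainsKey("Ada")); // True  (existence check)
Console.WriteLine(lookup["Ada"].Count);       // 2     (both rows retained)
```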
Once the data has been processed, you can perform existence checks and data retrieval all using hashed values. This means it's parallel existence check safe and parallel data retrieval safe.
Using our parallel execution loops to produce hashed data lookups has made certain production processes thousands of times faster.
Single Processor
The SingleProcessor allows for parallel execution over a list, DataTable, or other enumerable input. The "Single" part expects only a single input key to exist within the input data, so that one input key points to a single input item. Original input items are retained, and the "key" is used as a hashed lookup. Let's look at some data and an example:
If you were to use a SingleProcessor on the data above, it would look as so:
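Conceptually, a SingleProcessor assumes each key appears exactly once, so one key maps to one retained input item. A plain-C# sketch of that equivalent behavior (sample data and names are assumptions, not Perigee's API):

```csharp
using System;
using System.Linq;

// Illustrative input: every "Id" key is unique.
var customers = new[]
{
    (Id: 1, Name: "Ada"),
    (Id: 2, Name: "Alan"),
};

// One key -> exactly one retained input item.
var lookup = customers.ToDictionary(c => c.Id);

Console.WriteLine(lookup.ContainsKey(2)); // True (existence check)
Console.WriteLine(lookup[1].Name);        // Ada  (data retrieval)
```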
Once the data has been processed, you can perform existence checks and data retrieval all using hashed values. This means it's parallel existence check safe and parallel data retrieval safe.
A note on Mesh-Keys
Because of the nature of hashed values, you can combine multiple inputs into a single "mesh" key.
Option 1:
Supply the key as a delimited list. An example might be:
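A plain-C# sketch of the delimited approach, using an assumed '|' delimiter and illustrative data:

```csharp
using System;
using System.Linq;

// Both FirstName AND LastName must match for a lookup to hit.
var people = new[]
{
    (FirstName: "Grace", LastName: "Hopper"),
    (FirstName: "Grace", LastName: "Murray"),
};

// The mesh key is simply the two fields joined by a delimiter.
var lookup = people.ToDictionary(p => $"{p.FirstName}|{p.LastName}");

Console.WriteLine(lookup.ContainsKey("Grace|Hopper")); // True
Console.WriteLine(lookup.ContainsKey("Grace|Smith"));  // False: last name differs
```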
This would make the hashed lookup a mesh key of FirstName, LastName. And the retrieval of those records would require both values to match.
Option 2:
Supply the key as a hash itself. The hash can be calculated from a database, or local depending on your needs. An example using a local hash function for SHA512:
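A sketch of a local SHA-512 mesh-key helper (field names, delimiter, and data are assumptions); note that the HashAlgorithm instance is only ever used sequentially, never from inside a parallel loop:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// A single shared SHA512 instance: safe ONLY because we hash sequentially.
using var sha = SHA512.Create();

string MeshKey(string first, string last) =>
    Convert.ToHexString(sha.ComputeHash(Encoding.UTF8.GetBytes($"{first}|{last}")));

// Pre-compute the hashed keys up front, before any parallel work begins.
var keys = new HashSet<string> { MeshKey("Grace", "Hopper") };

Console.WriteLine(MeshKey("Grace", "Hopper").Length);         // 128 hex chars = 64 bytes
Console.WriteLine(keys.Contains(MeshKey("Grace", "Hopper"))); // True
```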
Notice how we computed the hash outside of a parallel loop? The one caveat to local hashing is that hashes CANNOT be computed in parallel on the same instance of a HashAlgorithm. For this reason I would recommend, when possible, going with Option 1.
Hashing the mesh-key is really most useful in two conditions:
The hash is pre-calculated on input data, especially when coming from a database or a different system
You need to compute and add together multiple hashed keys into a single lookup, so that you can perform rapid existence checks on multiple types of mesh keys.
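A sketch of that second condition: differently shaped mesh keys pre-hashed into one lookup set, so a single existence check covers them all (all names and data here are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Static, thread-safe hashing helper (SHA256.HashData, .NET 5+).
string Hash(string s) =>
    Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(s)));

// Two key shapes live in ONE hashed set: a name-based mesh key
// and an id-based key that might arrive pre-hashed from another system.
var known = new HashSet<string>
{
    Hash("Grace|Hopper"),
    Hash("EMP-1906"),
};

Console.WriteLine(known.Contains(Hash("EMP-1906")));    // True
Console.WriteLine(known.Contains(Hash("Grace|Smith"))); // False
```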
2 - Parallel Transformation Processing
This type of parallel processing is a bit different. Instead of purely processing data for rapid lookup and retrieval, we're taking the input data and transforming it to a new type, class, structure, or even multiple new items to store again later.
For this, we want to use a parallel loop that returns a concurrent safe class. For our example, let's use a concurrent bag:
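A plain-C# sketch of a parallel transformation into a concurrent bag, with assumed sample data chosen to be consistent with the bag contents described just below (two distinct steak types, three distinct rarities):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// Illustrative input rows.
var steaks = new[]
{
    (SteakType: "Ribeye",  SteakRarity: "Rare"),
    (SteakType: "Sirloin", SteakRarity: "Medium"),
    (SteakType: "Ribeye",  SteakRarity: "WellDone"),
};

var bag = new ConcurrentBag<(string Column, int DistinctCount)>();

// Transform each column into a (name, distinct count) pair in parallel;
// the concurrent bag safely collects results from both threads.
Parallel.Invoke(
    () => bag.Add(("SteakType",   steaks.Select(s => s.SteakType).Distinct().Count())),
    () => bag.Add(("SteakRarity", steaks.Select(s => s.SteakRarity).Distinct().Count())));

foreach (var (column, count) in bag.OrderBy(x => x.Column))
    Console.WriteLine($"{column}, {count}");
// SteakRarity, 3
// SteakType, 2
```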
Bag now contains two items: (SteakType, 2) and (SteakRarity, 3). There's a lot of performance to gain when performing parallel operations on larger datasets.
P.S. Ready for dinner yet? I am...
3 - Parallel Iteration
This one is very similar to what's above, but this time we aren't capturing a direct output; instead we're sending items to a concurrent dictionary. We'll also introduce the ExceptionRows callback.
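Perigee's actual loop and ExceptionRows signature aren't reproduced here; this sketch approximates the behavior with Parallel.ForEach and a manual exception bag (the row shape and data are assumptions):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Illustrative input rows; the last one is invalid.
var rows = new (string TypeKey, int Count)[]
{
    ("A", 1), ("B", 2), ("C", 3), (null, 0),
};

var returned = new ConcurrentDictionary<string, int>();
var exceptionRows = new ConcurrentBag<((string TypeKey, int Count) Row, string Message)>();

Parallel.ForEach(rows, row =>
{
    try
    {
        if (string.IsNullOrEmpty(row.TypeKey))
            throw new InvalidOperationException("TypeKey is null or empty");
        returned[row.TypeKey] = row.Count;
    }
    catch (Exception ex)
    {
        // Roughly what an ExceptionRows-style callback hands back:
        // the failed item plus the exception that stopped it.
        exceptionRows.Add((row, ex.Message));
    }
});

foreach (var (row, message) in exceptionRows)
    Console.WriteLine($"Item {row} had an exception thrown: {message}");

Console.WriteLine(returned.Count); // 3 good rows made it into the dictionary
```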
After running the code, you'll see one item logged to the console, as expected:
Item {"rCount":0,"TypeKey":null,"Name":null} had an exception thrown: TypeKey is null or empty
We've also created 3 items in the ReturnedDictionary, with the .value property being the number of items from the grouped lookup.
Finally, I've added the JsonCompress class to serialize and deserialize the ParallelProcessException items, so that they can be stored, sent, or retrieved later.
Summary
We've looked at three different ways we can use parallel processing to boost the speed of our data conversion lookups and loops. We've taken a look at what hashing is and how to effectively use it. We've seen several examples of using the Single and Group processors. We also got to see how the ExceptionRows callback allows us to iterate over failed parallel execution items.