Obfusware

Obfusware is specifically designed to meet your Big Data masking requirements. Big Data has become an important part of most businesses, as analyzing big data sets has become a necessity for staying competitive in today's rapidly evolving markets. With the emergence of AI and Machine Learning, and the huge datasets required for training models, managing customer data privacy and regulatory compliance risk is a new challenge. It requires a solution designed for the novel processes and systems being developed to keep your business competitive.

Designed for Big Data

Legacy Data Masking

Existing data masking systems are designed for handling data in traditional relational database systems such as PostgreSQL, MySQL, and Oracle. When pressed into service on big data, they face several challenges.

  • Scaling - Processing big data requires horizontal scaling. Legacy data masking systems were built for vertical scaling and single-system processing architectures, which can handle terabyte-scale data but not the petabyte-scale data that defines Big Data.
  • Data Storage - Big Data lives in Data Lakes and is stored in the cloud using data storage formats such as Parquet, ORC, Avro, CSV, and JSON, which are designed for the data volume, velocity, and variety that are the hallmarks of big data.
  • Big Data Tools - Legacy data masking systems are not designed to work with distributed big data processing tools such as AWS Glue and Apache Spark, which enable fault-tolerant clusters of thousands of machines to rapidly handle big data jobs.

AWS Glue

Amazon Web Services (AWS) is the number one cloud service for hosting big data. Together, AWS S3 for data storage and AWS Glue for data integration provide a powerful solution for big data management. AWS Glue is a serverless data integration service that provides data extraction, transformation, and loading using both custom scripting and visual data pipeline development.

Obfusware has been designed and built to tightly integrate with AWS Glue and provide data masking transforms that perform like core AWS Glue transforms. 

Obfusware integrates with the Glue Data Catalog so that data masking can be configured in the Catalog itself. Jobs can then be created without knowledge of the data masking requirements, and the correct masking is guaranteed to be applied to each table.

Obfusware leverages the AWS Glue Custom Visual Transform (CVT) interface to allow Obfusware-enabled data pipelines to be developed using AWS Glue Studio, the graphical interface used to create, run, and monitor Glue jobs.

Apache Spark

Apache Spark is the premier open-source tool for processing big data. Big Data platforms such as AWS Glue and Databricks have built their solutions around Apache Spark.

Apache Spark's core concept is the DataFrame, which represents the data being processed. The DataFrame enables advanced processing strategies, such as lazy evaluation and Spark's DAG execution engine, which optimize complex chains of operations to maximize processing performance.

Obfusware interfaces directly with the DataFrame API, which ensures that Obfusware receives all the benefits of the advanced data processing capabilities that have made Spark the most widely used engine for scalable computing.

Apache Spark is used by thousands of companies, including 80% of the Fortune 500.

Comprehensive Data Masking Algorithms

Obfusware provides an extensive set of data masking, or obfuscation, functions based on proven algorithms:

  • Substitution - One of the most versatile and useful algorithms, substitution works by replacing data with a similar but different data value. The most common example would be to obfuscate a last name by replacing it with a different last name. For example, the last name "Smith" might be replaced with the last name "Brown". One of the advantages of this algorithm is that realistic data can be used allowing for more natural user interactions with the masked data.

  • Variance - Some data, like dates, has a very specific meaning based on its value. Replacing the data item with an unrelated value, even one of the same type, can result in unrealistic data. For instance, if a date of birth (DOB) is replaced with a random date, it could look like a toddler has heart disease instead of an 80-year-old. In this case, modifying the value by a set amount (for example, 1/31/1963 becomes 1/5/1963) keeps the data realistic while maintaining the privacy a DOB requires.

  • Redacting - The simplest data masking method is to redact the data value, either by replacing it with characters such as 'X' or '*', by replacing it with a string such as "<REDACTED>", or by simply nulling out the data. While this does ensure data privacy, it does not provide realistic data or preserve referential integrity. It also does not work for non-text data types like dates or numbers.

  • Masking out - This is a variation of redacting in which only part of the data is masked. The most common examples are Social Security Numbers, where the last 4 digits are not masked (e.g. "***-**-1234"), and credit card numbers (e.g. "**** **** **** 1234").

  • Encryption - This masking method has the advantage of being reversible. Most masking or obfuscation methods are one-way: there is no way to recover the original value from the masked value. Encryption replaces the data with an encrypted token. This token can be used to recover the original data value by applying the appropriate decryption algorithm.
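As a rough illustration of the simpler algorithms above, here is a minimal Python sketch of redacting, masking out, and variance. The function names and parameters are hypothetical and chosen only for this example; they are not the Obfusware API.

```python
from datetime import date, timedelta


def redact(value: str, label: str = "<REDACTED>") -> str:
    """Redacting: replace the whole value with a fixed label."""
    return label


def mask_out(value: str, keep_last: int = 4, mask_char: str = "*") -> str:
    """Masking out: hide every letter or digit except the last `keep_last`,
    leaving separators such as '-' or ' ' in place."""
    total = sum(ch.isalnum() for ch in value)
    to_mask = max(total - keep_last, 0)
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


def vary_date(value: date, offset_days: int) -> date:
    """Variance: shift a date by a set number of days so it stays plausible."""
    return value + timedelta(days=offset_days)


print(redact("Smith"))                       # <REDACTED>
print(mask_out("123-45-1234"))               # ***-**-1234
print(mask_out("4111 1111 1111 1234"))       # **** **** **** 1234
print(vary_date(date(1963, 1, 31), -26))     # 1963-01-05
```

A production variance masker would also need to choose the offset itself, for example so that the result stays within the same month and year.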

Referential Integrity

Maintaining data referential integrity is one of the key characteristics of useful data masking algorithms

Data does not usually exist in isolation. It is most often related to other data items, and it is very often duplicated. When masking data, it is very important to maintain referential integrity: if a data value is masked in one table or file, it must be masked to the same value in every other table or file. This means data masking methods need to be deterministic, always returning the same masked value for a given input value.
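The idea of deterministic masking can be sketched in a few lines of Python. This is an illustration only, not Obfusware's implementation: the replacement list and function name are hypothetical, and a plain unkeyed hash is used purely for brevity.

```python
import hashlib

# Hypothetical replacement pool; a real substitution list would be far larger.
LAST_NAMES = ["Brown", "Garcia", "Johnson", "Lee", "Miller", "Nguyen", "Patel", "Wilson"]


def substitute_last_name(value: str) -> str:
    """Map a last name to a replacement that depends only on the input."""
    normalized = value.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(LAST_NAMES)
    return LAST_NAMES[index]


# Two separate "tables" that share the same last names.
customers = [{"id": 1, "last_name": "Smith"}, {"id": 2, "last_name": "Jones"}]
orders = [{"order": 100, "last_name": "Smith"}, {"order": 101, "last_name": "Jones"}]

masked_customers = [{**row, "last_name": substitute_last_name(row["last_name"])} for row in customers]
masked_orders = [{**row, "last_name": substitute_last_name(row["last_name"])} for row in orders]

# Matching originals still match after masking, so joins between the tables survive.
assert masked_customers[0]["last_name"] == masked_orders[0]["last_name"]
assert masked_customers[1]["last_name"] == masked_orders[1]["last_name"]
```

Because the masked value is a pure function of the input, the two tables can be masked by different jobs at different times and still line up.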

Semi-Structured Data

Big Data often includes semi-structured and structured data

Data formats often used with Big Data, such as Parquet, JSON, and XML, are characterized as semi-structured because they do not conform to the strict requirements of the structured data used by an RDBMS. Semi-structured data is often self-describing, including tags to describe fields instead of conforming to a set schema. Semi-structured data is often nested, and each data element may contain different fields.

Obfusware supports semi-structured nested data by allowing nested fields to be addressed using JSON-like field selectors (e.g. "object.field1.subfield"). This removes the need to transform the data with techniques such as flattening before masking, avoids the potential loss of data structure and information during such transformations, and still delivers high performance.
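To make the selector idea concrete, the following Python sketch walks a dotted selector through a nested record and masks only the addressed field. The function `mask_nested` and its behavior for missing fields are assumptions made for this example, not a description of Obfusware's selector engine.

```python
from typing import Any, Callable


def mask_nested(record: dict, selector: str, masker: Callable[[Any], Any]) -> dict:
    """Return a copy of `record` with the field at the dotted `selector` masked.

    Records that lack the addressed field are returned unchanged, since
    semi-structured elements may not all contain the same fields.
    """
    head, _, rest = selector.partition(".")
    if head not in record:
        return dict(record)
    result = dict(record)
    if rest:
        child = record[head]
        if isinstance(child, dict):
            # Descend one level and rebuild only the dictionaries on the path.
            result[head] = mask_nested(child, rest, masker)
    else:
        result[head] = masker(record[head])
    return result


record = {
    "customer": {"name": {"first": "Ann", "last": "Smith"}, "tier": "gold"},
    "total": 12.5,
}

masked = mask_nested(record, "customer.name.last", lambda _: "<REDACTED>")
print(masked["customer"]["name"])   # {'first': 'Ann', 'last': '<REDACTED>'}
print(record["customer"]["name"])   # original record is left untouched
```

The nested structure is preserved end to end, so no flattening or re-nesting step is needed around the masking call.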

Context Dependent

It is important that data masking not return the same results for every organization

One of the requirements of data masking is that it should be impossible to determine the original data value from the masked value. If data masking methods returned the same value for every organization, a third party could use the software to discover which original values mask to a given result. This would potentially allow them to determine the original value, invalidating the privacy promised by data masking.

Obfusware creates a unique context for each organization using the data masking software. Using cryptographic techniques, the context is used to generate a unique mapping from the original data value to the masked data value. The result is deterministic, so the masking for a given organizational context will not change over time.
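One common way to build such a context-dependent, deterministic mapping is a keyed hash (HMAC). The Python sketch below assumes each organization holds its own secret key; the key values and the function name `masked_token` are hypothetical, and the source does not state which cryptographic technique Obfusware actually uses.

```python
import hashlib
import hmac


def masked_token(value: str, context_key: bytes) -> str:
    """Derive a masked token from a value and an organization-specific key."""
    digest = hmac.new(context_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


# Hypothetical per-organization secrets (in practice, generated and stored securely).
org_a_key = b"organization-a-secret"
org_b_key = b"organization-b-secret"

token_a = masked_token("Smith", org_a_key)
token_b = masked_token("Smith", org_b_key)

# Same organization, same input: the result never changes.
assert token_a == masked_token("Smith", org_a_key)
# Different organizations: the same input masks to different values.
assert token_a != token_b
```

Without the organization's key, a third party running the same software cannot reproduce the mapping and so cannot work backward from a masked value.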

Configurable Maskers

Obfusware data maskers offer many configuration options to produce varied results to meet requirements

Obfusware offers over a dozen data masking algorithms, each of which can be configured to create numerous data maskers to meet specific criteria. Out of the box, Obfusware offers two dozen pre-configured maskers to meet the most common data masking requirements.

  1. Pre-configured data maskers
    1. IdentityMasker - Returns the original data value.  Useful for testing and data cleansing using common masker attributes to control whitespace, case, and default values.
    2. RedactMasker - Replaces text with a "<Redacted>" label
    3. HiddenMasker - Replaces characters of the data value with the '*' character.
    4. LoremIpsumMasker - Replaces text with Lorem Ipsum text
    5. EmailMasker - Replaces email address with an email address for the domain @example.com
    6. USLastNameMasker - Substitutes a culturally appropriate US last name for the data value
    7. USFirstNameMasker - Substitutes a culturally appropriate US first name for the data value
    8. USFullNameMasker - Substitutes a culturally appropriate US full name ("firstname lastname") for the data value
    9. USCommaNameMasker - Substitutes a culturally appropriate US full name in "lastname, firstname" format for the data value
    10. USStreetAddress1Masker - Substitutes a US street address for the data value 
    11. USCityMasker - Substitutes a US city/town name for the data value 
    12. USCountyMasker - Substitutes a US county name for the data value 
    13. USStateMasker - Substitutes a US State name for the data value 
    14. USStateAbbrMasker - Substitutes the two character US State abbreviation for the data value
    15. CountryMasker - Substitutes a Country name for the data value
    16. CountryAbbrMasker - Substitutes a 2 letter Country abbreviation for the data value
    17. USVariableDateMasker - Replaces the date with a new date that has the same month and year but a different day, using the "m/d/yyyy" format
    18. SSNMasker - Replaces a Social Security Number (SSN) with a new SSN, preserving the first 3 digits
    19. SSNLast4Masker - Masks a Social Security Number (SSN) with the '*' character and leaves the last 4 digits unchanged
    20. CreditCardMasker - Generates a new credit card number, preserving the first 6 digits and ending with a valid Luhn check digit
    21. CCLast4Masker - Masks a credit card number with the '*' character and leaves the last 4 digits unchanged
    22. US555TelephoneMasker - Generates a new US telephone number, replacing the 3 exchange digits with '555' and generating new last four digits
    23. USExchangeTelephoneMasker - Generates a new US telephone number, replacing the 3 exchange digits and the last four digits
    24. VariableZipcodeMasker - Generates a new ZIP code, preserving the first 3 digits and replacing the 4th and 5th digits and any ZIP+4 digits
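To show what "preserving the first 6 digits and a valid Luhn digit" involves, here is a hedged Python sketch of a credit card masker. It is not the CreditCardMasker implementation: the function names, the key parameter, and the use of HMAC to derive the replacement digits are all assumptions made for this example.

```python
import hashlib
import hmac


def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn check digit for a string of digits."""
    total = 0
    # Walk right to left; the rightmost payload digit is doubled first.
    for position, ch in enumerate(reversed(payload)):
        digit = int(ch)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def luhn_is_valid(number: str) -> bool:
    """Check that the last digit matches the Luhn digit of the rest."""
    return luhn_check_digit(number[:-1]) == int(number[-1])


def mask_credit_card(number: str, key: bytes) -> str:
    """Keep the first 6 digits, replace the middle, and append a valid check digit."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) < 8:
        raise ValueError("expected at least 8 digits")
    middle_len = len(digits) - 7  # everything except the 6-digit prefix and check digit
    digest = hmac.new(key, digits.encode("utf-8"), hashlib.sha256).hexdigest()
    middle = str(int(digest, 16) % 10 ** middle_len).zfill(middle_len)
    payload = digits[:6] + middle
    return payload + str(luhn_check_digit(payload))


masked_card = mask_credit_card("4111 1111 1111 1111", b"hypothetical-secret-key")
assert masked_card.startswith("411111")
assert luhn_is_valid(masked_card)
```

Keeping the leading digits and a valid check digit means the masked number still passes the format validation that downstream systems commonly apply.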

Custom Maskers

Extend Obfusware data masking with custom data maskers using the Obfusware masker API

While Obfusware's highly configurable built-in data masking algorithms cover the vast majority of data masking requirements, there are sometimes custom requirements that the built-in algorithms cannot meet. For these special cases, Obfusware provides the ability to add custom data masking algorithms using the masker API, which makes creating a data masker with all the capabilities of a built-in masker as simple as writing a Java class with a few methods implementing the desired behavior.

Masking Statistics

Obfusware provides extensive, configurable statistics gathering on masking operations

Obfusware gathers statistics on the time spent masking data, including count, min, max, mean, variance, and standard deviation, as well as throughput and error rates. Obfusware is able to aggregate statistics from distributed operations to provide an overview of masking performance.
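Aggregating statistics across distributed workers is possible without collecting every timing in one place, because count, mean, and variance can be merged from per-partition summaries. The Python sketch below uses the standard streaming update and pairwise merge formulas; the class name `TimingStats` is hypothetical and this is not a description of Obfusware's internals.

```python
import math
from dataclasses import dataclass


@dataclass
class TimingStats:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def add(self, x: float) -> None:
        """Fold one timing into the summary (streaming update)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    def merge(self, other: "TimingStats") -> "TimingStats":
        """Combine two partition summaries into one, without the raw values."""
        if other.count == 0:
            return TimingStats(self.count, self.mean, self.m2, self.minimum, self.maximum)
        if self.count == 0:
            return TimingStats(other.count, other.mean, other.m2, other.minimum, other.maximum)
        n = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return TimingStats(n, mean, m2, min(self.minimum, other.minimum),
                           max(self.maximum, other.maximum))

    @property
    def variance(self) -> float:
        """Population variance of the timings seen so far."""
        return self.m2 / self.count if self.count else 0.0

    @property
    def stddev(self) -> float:
        return math.sqrt(self.variance)


# Two workers each summarize their own partition of timings (milliseconds).
worker_1, worker_2 = TimingStats(), TimingStats()
for t in [1.0, 2.0, 3.0]:
    worker_1.add(t)
for t in [4.0, 5.0, 6.0, 7.0]:
    worker_2.add(t)

overall = worker_1.merge(worker_2)
print(overall.count, overall.mean, overall.variance, overall.stddev)
```

Each worker only ships five numbers to the driver, which keeps the overhead of statistics gathering small even on very large jobs.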

Big Data Masking

High-performance data masking to manage the compliance risk of your Big Data assets. Provides the Big Data tool support needed to integrate seamlessly with your enterprise Data Lake.

GET Obfusware