Obfusware

Obfusware is designed to run efficiently and simply within AWS Glue, making it ideal for use with Big Data. Developed by engineers who were dissatisfied with the slow speeds, complicated setups, and the high licensing costs of older solutions, Obfusware addresses these challenges directly. As organizations increasingly rely on data masking to meet strict security and privacy standards, Obfusware offers a high-performance, user-friendly solution. Whether you're migrating systems to the cloud, working in a fast-paced CI/CD environment, or using proprietary data to train an LLM, Obfusware serves as your go-to data masking tool on AWS.

FREE 30-Day Trial

Obfusware offers a 30-day free trial. Contact us to learn how to make data masking an integral part of your AWS Glue jobs.

Request FREE Trial

Designed for AWS Glue and Big Data

Legacy Data Masking

Existing data masking systems are designed for handling data in traditional relational database systems such as PostgreSQL, MySQL, and Oracle. When pressed into service on big data, they face several challenges:

  • Scaling - processing big data requires horizontal scaling. Legacy data masking systems have been built for vertical scaling and single system processing architectures that stall at the petabyte-scale data that defines Big Data.
  • Data Storage - Big Data lives in Data Lakes and is stored in the cloud using data storage formats such as Parquet, ORC, Avro, CSV, and JSON, which are designed for the data volume, velocity, and variety that are the hallmark of big data.
  • Big Data Tools - Legacy data masking tools become cumbersome and complex when working with a wide variety of data sources. Being native to AWS Glue, Obfusware is compatible with any data source supported by Glue, from a Data Lake using Parquet files for storage to your favorite RDBMS, such as MySQL, PostgreSQL, or Oracle.

AWS Glue

Amazon Web Services (AWS) is the number one cloud service provider for hosting big data.

Obfusware has been designed and built to tightly integrate with AWS Glue and provide data masking transforms that perform like core AWS Glue transforms. 

Obfusware integrates with the Glue Data Catalog to enable configuring data masking in the Catalog so that Jobs can be created without having to know the data masking requirements.

Obfusware leverages the AWS Glue Custom Visual Transform (CVT) interface to allow Obfusware enabled data pipelines to be developed using AWS Glue Studio.

Apache Spark

Apache Spark is the premier open source tool for processing big data. Big Data solutions such as AWS Glue and Databricks have built their solutions around Apache Spark.

Obfusware interfaces directly with the Spark DataFrame API, so it benefits from the advanced data processing capabilities that make Spark the most widely used engine for scalable computing.


AWS Glue Integration

Designed to work within AWS Glue.

Leverages all the inherent features of AWS Glue, including access to all data sources.

Familiar environment, easy to integrate workflows, quick to deploy for practical use.

Integrates with Glue Data Catalog.

Leverages the CVT interface to allow data pipelines to be developed using Glue Studio.

Code executes in native Glue environments.

Code can be observed in standard AWS run logs.

Comprehensive Data Masking Algorithms

Obfusware provides an extensive set of data masking or obfuscation functions based on proven algorithms:

  • Substitution - One of the most versatile and useful algorithms, substitution works by replacing data with a similar but different data value. The most common example would be to obfuscate a last name by replacing it with a different last name. For example, the last name "Smith" might be replaced with the last name "Brown". One of the advantages of this algorithm is that realistic data can be used allowing for more natural user interactions with the masked data.

  • Anonymization - Some data, like dates, has a very specific meaning based on its value. Replacing the data item with an unrelated value, even one of the same type, can result in unrealistic data. For instance, if a date of birth (DOB) is replaced with a random date, it could look like a toddler has heart disease instead of an 80-year-old. In this case, modifying the value by a set amount, so that 1/31/1963 becomes 1/5/1963, preserves realistic meaning while maintaining the privacy a DOB requires. This data realism is critical for any analysis as well as AI.

  • Redacting - The simplest data masking method is to redact the data value, either by replacing it with characters such as 'X' or '*', by replacing it with a string such as "<REDACTED>", or by nulling out the data. While this does ensure data privacy, it does not provide realistic data or preserve referential integrity. It also does not work for non-text data types like dates or numbers.

  • Masking out - This is a variation of redacting in which not all of the data is masked. The most common examples are Social Security Numbers, where the last 4 digits are not masked (e.g. "***-**-1234"), and credit cards (e.g. "**** **** **** 1234").
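The four approaches above can be illustrated with a short Python sketch. This is not Obfusware code: the function names, the tiny LAST_NAMES list, and the hash-based selection are assumptions made purely for illustration.

```python
import calendar
import hashlib
from datetime import date

# Hypothetical replacement pool; a real masker would use a much larger list.
LAST_NAMES = ["Brown", "Garcia", "Johnson", "Lee", "Miller", "Nguyen", "Patel", "Wilson"]


def _digest_int(value: str) -> int:
    """Turn a string into a stable integer so the same input always maps the same way."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


def substitute_last_name(name: str) -> str:
    """Substitution: replace a last name with another realistic last name."""
    return LAST_NAMES[_digest_int(name) % len(LAST_NAMES)]


def vary_day(d: date) -> date:
    """Date variation: keep the month and year, move the day to a different day."""
    days_in_month = calendar.monthrange(d.year, d.month)[1]
    offset = 1 + _digest_int(d.isoformat()) % (days_in_month - 1)  # never zero
    return d.replace(day=(d.day - 1 + offset) % days_in_month + 1)


def redact(value: str) -> str:
    """Redacting: discard the value entirely."""
    return "<REDACTED>"


def mask_out(value: str, keep_last: int = 4, mask_char: str = "*") -> str:
    """Masking out: hide all but the last `keep_last` letters/digits, keeping separators."""
    to_mask = max(sum(ch.isalnum() for ch in value) - keep_last, 0)
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


print(mask_out("123-45-1234"))          # ***-**-1234
print(mask_out("4111 1111 1111 1234"))  # **** **** **** 1234
```

Substitution and date variation keep the data realistic, while redaction and masking out trade realism for simplicity.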

Referential Integrity

Maintaining data referential integrity is one of the key characteristics of useful data masking algorithms

Data does not exist in isolation. It is most often related to other data items, and is very often duplicated. When masking data, it is critical to maintain referential integrity. If a data value is masked in one table or file, then it must be masked to the same value in every other table or file. This means data masking methods need to be deterministic, always returning the same masked value for a given input value.
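As a minimal sketch of why determinism preserves referential integrity, the Python below masks a shared identifier in two tables and shows that a join still produces the same result. The table contents, the function names, and the unkeyed SHA-256 hash are illustrative assumptions only; the Context Dependent section explains why a production masker should not use an unkeyed mapping.

```python
import hashlib


def mask_customer_id(value: str) -> str:
    """Deterministic masker: the same input always yields the same masked ID."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return "CUST-" + digest[:10].upper()


def orders_per_state(customers: list, orders: list) -> dict:
    """Join orders to customers on customer_id and count orders per state."""
    state_by_id = {c["customer_id"]: c["state"] for c in customers}
    counts = {}
    for order in orders:
        state = state_by_id[order["customer_id"]]
        counts[state] = counts.get(state, 0) + 1
    return counts


# Hypothetical sample data: two tables that share the customer_id field.
customers = [
    {"customer_id": "A1001", "state": "NH"},
    {"customer_id": "B2002", "state": "NY"},
]
orders = [
    {"order_id": 1, "customer_id": "A1001"},
    {"order_id": 2, "customer_id": "B2002"},
    {"order_id": 3, "customer_id": "A1001"},
]

# Mask the shared field independently in each table.
masked_customers = [{**c, "customer_id": mask_customer_id(c["customer_id"])} for c in customers]
masked_orders = [{**o, "customer_id": mask_customer_id(o["customer_id"])} for o in orders]

# The join gives the same answer before and after masking.
print(orders_per_state(customers, orders))
print(orders_per_state(masked_customers, masked_orders))
```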

Multi-field masking Consistency

Enables data consistency for related or dependent fields when masked

Often there are dependencies and relationships between fields. A simple example is the relationship between the State and Zipcode fields in an address. Zipcodes are strongly related to states. All New Hampshire zipcodes start with the digits 03, while New York zipcodes start with the digits 10, 11, 12, 13, or 14. If a person sees an address with New Hampshire 12543, it does not make sense. If there is data validation code, it may reject the record or throw an error. Obfusware has the ability to use a key field for masking. If two fields specify the same key field, then the data relationship or dependency can be maintained even in the masked field values. This is essential for AI LLM training.
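One way to picture key-field masking is sketched below in Python. It is an assumption about how such a feature could work, not a description of Obfusware internals: both maskers derive their choice from the same key field, so they select the same row of a lookup table. The LOCATIONS table, function names, and record layout are hypothetical, and the generated zipcodes are not guaranteed to be assigned codes.

```python
import hashlib

# Hypothetical lookup of consistent (state, zipcode prefix) pairs,
# using the New Hampshire and New York prefixes mentioned above.
LOCATIONS = [("NH", "03"), ("NY", "10"), ("NY", "11"), ("NY", "12"), ("NY", "13"), ("NY", "14")]


def _pick(key: str, salt: str, modulus: int) -> int:
    """Derive a stable number from the key field value and a purpose-specific salt."""
    data = f"{salt}:{key}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % modulus


def mask_state(record: dict, key_field: str) -> str:
    """Choose the masked state from the row selected by the key field."""
    state, _ = LOCATIONS[_pick(str(record[key_field]), "location", len(LOCATIONS))]
    return state


def mask_zipcode(record: dict, key_field: str) -> str:
    """Choose the zipcode prefix from the same row, then generate the last 3 digits."""
    key = str(record[key_field])
    _, prefix = LOCATIONS[_pick(key, "location", len(LOCATIONS))]
    return f"{prefix}{_pick(key, 'zip-suffix', 1000):03d}"


record = {"customer_id": "A1001", "state": "NH", "zipcode": "03101"}
masked = {
    **record,
    "state": mask_state(record, "customer_id"),
    "zipcode": mask_zipcode(record, "customer_id"),
}
print(masked["state"], masked["zipcode"])  # the state and zipcode prefix always agree
```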

Semi-structured data

Big Data often includes semi-structured and structured data

Data formats often used with Big Data, such as Parquet, JSON, and XML, are characterized as semi-structured because they do not conform to the strict requirements of the structured data used by an RDBMS. Semi-structured data is often self-describing, including tags to describe fields instead of conforming to a set schema. Semi-structured data is often nested, and each data element may contain different fields.

Obfusware supports semi-structured nested data by allowing nested fields to be addressed using JSON-like field selectors (e.g. "object.field1.subfield"). This removes the need to transform the data with techniques such as flattening before masking can be applied efficiently, avoids the potential loss of data structure and information during such transformations, and still delivers high performance.
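The idea of addressing a nested field with a dotted selector can be sketched in plain Python as follows. The function names and the sample record are hypothetical, and this sketch works on in-memory dictionaries rather than Spark DataFrames, so it only illustrates the selector concept.

```python
import copy
from typing import Any, Callable


def mask_nested(record: dict, selector: str, masker: Callable[[Any], Any]) -> dict:
    """Return a copy of `record` with the field named by a dotted selector masked."""
    result = copy.deepcopy(record)  # leave the caller's record untouched
    _apply(result, selector.split("."), masker)
    return result


def _apply(node: Any, parts: list, masker: Callable[[Any], Any]) -> None:
    if isinstance(node, list):
        # A selector step applies to every element of a nested array.
        for item in node:
            _apply(item, parts, masker)
        return
    if not isinstance(node, dict) or parts[0] not in node:
        return  # elements may contain different fields; skip those without this one
    head, rest = parts[0], parts[1:]
    if rest:
        _apply(node[head], rest, masker)
    else:
        node[head] = masker(node[head])


record = {
    "customer": {
        "name": {"first": "Ann", "last": "Smith"},
        "cards": [{"number": "4111111111111234"}, {"label": "no number here"}],
    }
}

masked = mask_nested(record, "customer.name.last", lambda value: "<REDACTED>")
masked = mask_nested(masked, "customer.cards.number", lambda value: "*" * 12 + value[-4:])
print(masked)
```

The record keeps its original shape, so no flattening or re-nesting step is needed around the masking.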

Context Dependent

It is important that data masking not return the same results for every organization

One of the requirements of data masking is that it should be impossible to determine the original data value from the masked value. If data masking methods returned the same value for every organization, then it would be possible for a third party to use the software to determine which original values mask to a given result value. This would potentially allow them to determine the original value, thus invalidating the privacy promised by data masking.

Obfusware creates a unique context for each organization using the data masking software. Using cryptographic techniques, the context is used to generate a unique mapping from the original data value to the masked data value. The result is deterministic, so the masking for a given organizational context will not change over time.
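The text does not say which cryptographic technique Obfusware uses. One common way to obtain a per-organization deterministic mapping is a keyed hash (HMAC), sketched below with Python's standard library. The key values, function names, and name list are assumptions for illustration.

```python
import hashlib
import hmac

# Hypothetical replacement pool; a real masker would use a much larger list.
LAST_NAMES = ["Brown", "Garcia", "Johnson", "Lee", "Miller", "Nguyen", "Patel", "Wilson"]


def context_digest(value: str, context_key: bytes) -> bytes:
    """Keyed hash of the value: stable for one context, different across contexts."""
    return hmac.new(context_key, value.encode("utf-8"), hashlib.sha256).digest()


def mask_last_name(value: str, context_key: bytes) -> str:
    """Map the keyed digest onto a realistic replacement value."""
    index = int.from_bytes(context_digest(value, context_key)[:8], "big") % len(LAST_NAMES)
    return LAST_NAMES[index]


# Hypothetical secrets; each organization would hold its own.
org_a_key = b"organization-a-secret"
org_b_key = b"organization-b-secret"

# Deterministic within one organization's context.
print(mask_last_name("Smith", org_a_key) == mask_last_name("Smith", org_a_key))  # True

# The underlying mapping differs between contexts, so one organization's
# results reveal nothing about another's.
print(context_digest("Smith", org_a_key) == context_digest("Smith", org_b_key))  # False
```

With a small replacement pool, two contexts can still land on the same name by chance for a given input; the mappings as a whole differ.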

Configurable Maskers

Obfusware data maskers offer many configuration options to produce varied results to meet requirements

Obfusware offers over a dozen data masking algorithms, each of which can be configured to create numerous data maskers to meet specific criteria. Out of the box, Obfusware offers two dozen pre-configured maskers to meet the most common data masking requirements.

  1. Pre-configured data maskers
    1. IdentityMasker - Returns the original data value.  Useful for testing and data cleansing using common masker attributes to control whitespace, case, and default values.
    2. RedactMasker - Replaces text with the "<Redacted>" label
    3. HiddenMasker - Replaces characters of the data value with the '*' character.
    4. LoremIpsumMasker - Replaces text with Lorem Ipsum text
    5. EmailMasker - Replaces email address with an email address for the domain @example.com
    6. USLastNameMasker - Substitutes a culturally US last name for the data value 
    7. USFirstNameMasker - Substitutes a culturally US first name for the data value 
    8. USFullNameMasker - Substitutes a culturally US full name ("firstname lastname") for the data value 
    9. USCommaNameMasker - Substitutes a culturally US full name in "lastname, firstname" format for the data value 
    10. USStreetAddress1Masker - Substitutes a US street address for the data value 
    11. USCityMasker - Substitutes a US city/town name for the data value 
    12. USCountyMasker - Substitutes a US county name for the data value 
    13. USStateMasker - Substitutes a US State name for the data value 
    14. USStateAbbrMasker - Substitutes the two character US State abbreviation for the data value
    15. CountryMasker - Substitutes a Country name for the data value
    16. CountryAbbrMasker - Substitutes a 2 letter Country abbreviation for the data value
    17. USVariableDateMasker - Replaces the date with a new date using the same month and year but a different day, in the "m/d/yyyy" format
    18. SSNMasker - Replaces a Social Security Number (SSN) with a new SSN, preserving the first 3 digits
    19. SSNLast4Masker - Masks a Social Security Number (SSN) with the '*' character and leaves the last 4 digits unchanged
    20. CreditCardMasker - Generates a new credit card number, preserving the first 6 digits and producing a valid Luhn check digit
    21. CCLast4Masker - Masks a credit card number with the '*' character and leaves the last 4 digits unchanged
    22. US555TelephoneMasker - Generates a new US telephone number, replacing the 3 exchange digits with '555' and generating new last four digits
    23. USExchangeTelephoneMasker - Generates a new US telephone number, replacing the 3 exchange digits and the last four digits
    24. VariableZipcodeMasker - Generates a new zipcode, preserving the first 3 digits and replacing the 4th & 5th digits and any plus4 digits
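To make the CreditCardMasker description concrete, here is a hedged Python sketch of the general technique: keep the first 6 digits, generate new middle digits, and append a Luhn check digit so the result still passes card-number validation. The function names and the HMAC-based digit generation are assumptions, not Obfusware's implementation, and the sketch returns digits only without the original spacing.

```python
import hashlib
import hmac


def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit for a number that does not yet include it."""
    total = 0
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:  # every second digit, starting next to the check digit
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def is_luhn_valid(number: str) -> bool:
    """True if the final digit is the correct Luhn check digit for the rest."""
    return luhn_check_digit(number[:-1]) == number[-1]


def mask_credit_card(number: str, secret: bytes) -> str:
    """Keep the first 6 digits, replace the rest, and finish with a valid check digit."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) < 8:
        raise ValueError("expected at least 8 digits")
    body_len = len(digits) - 7  # everything between the 6-digit prefix and the check digit
    digest = hmac.new(secret, digits.encode("utf-8"), hashlib.sha256).digest()
    body = str(int.from_bytes(digest, "big") % 10**body_len).zfill(body_len)
    partial = digits[:6] + body
    return partial + luhn_check_digit(partial)


print(luhn_check_digit("7992739871"))     # 3
print(is_luhn_valid("4111111111111111"))  # True
masked = mask_credit_card("4111 1111 1111 1111", b"hypothetical-secret")
print(masked[:6], is_luhn_valid(masked))  # 411111 True
```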

Custom Maskers

Extend Obfusware data masking with custom data maskers using the Obfusware masker API

While Obfusware's built-in data masking algorithms cover the vast majority of data masking requirements, there are situations where an organization's custom requirements cannot be met using standard masking algorithms. For these cases, Obfusware provides the ability to add custom data masking algorithms using the masker API. It's as simple as writing a Java class with a few methods that implement the desired behavior.

Masking Statistics

Obfusware provides extensive, configurable statistics gathering on masking operations

Obfusware gathers statistics on the time spent masking data, including count, min, max, mean, variance, and standard deviation, as well as throughput and error rates. Obfusware is able to aggregate statistics from distributed operations to provide an overview of masking performance (see Apache Spark).
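Aggregating count, mean, and variance from distributed workers without collecting every raw timing is a well-known problem. The Python sketch below uses the standard pairwise merge of running statistics (Welford's update for single values and the Chan et al. formula for combining partitions). The class name and fields are hypothetical, and the source does not state that Obfusware uses this method.

```python
import math
from dataclasses import dataclass


@dataclass
class MaskingStats:
    """Running statistics for masking durations on one worker or partition."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared differences from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def add(self, duration: float) -> None:
        """Welford's online update for a single observation."""
        self.count += 1
        delta = duration - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (duration - self.mean)
        self.minimum = min(self.minimum, duration)
        self.maximum = max(self.maximum, duration)

    def merge(self, other: "MaskingStats") -> "MaskingStats":
        """Combine two partitions' statistics without revisiting the raw values."""
        n = self.count + other.count
        if n == 0:
            return MaskingStats()
        delta = other.mean - self.mean
        return MaskingStats(
            count=n,
            mean=self.mean + delta * other.count / n,
            m2=self.m2 + other.m2 + delta * delta * self.count * other.count / n,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )

    @property
    def variance(self) -> float:
        """Sample variance; zero when fewer than two observations exist."""
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0

    @property
    def stddev(self) -> float:
        return math.sqrt(self.variance)


# Hypothetical per-partition timings in milliseconds.
partition_a, partition_b = MaskingStats(), MaskingStats()
for t in [1.0, 2.0, 3.0]:
    partition_a.add(t)
for t in [4.0, 5.0]:
    partition_b.add(t)

overall = partition_a.merge(partition_b)
print(overall.count, overall.mean, overall.variance)  # 5 3.0 2.5
```

Because the merge needs only a few numbers per partition, each worker can report a small summary and the driver can combine them in any order.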

Masking for AI

High performance data masking to manage your compliance risk.  Provides the Big Data tools support to seamlessly integrate with your enterprise Data Lake for AI model training.

Purchase Obfusware

FREE 30-Day Trial

Obfusware offers a 30-day free trial. Contact us to learn how to make data masking an integral part of your AWS Glue jobs.

Request FREE Trial