Algorithm
The anonym algorithm is designed to anonymize data in a DataFrame. It works by replacing real data with fake data, while maintaining the structure and format of the original data. Here’s a step-by-step explanation of how it works:
1. Initialization: The anonym class is initialized with a language parameter (default is ‘dutch’) and a verbosity level (default is ‘info’). The language parameter is used to load the appropriate language model for named entity recognition (NER), and the verbosity level sets the logger’s verbosity.
2. Data Import: The import_data method is used to import a dataset from a given file path. The data is read into a pandas DataFrame.
3. Data Anonymization: The anonymize method is the core of the algorithm. It takes a DataFrame and optional parameters for specifying columns to fake or not to fake, and a NER blacklist. The method works as follows:
   1. It calls the extract_entities function to extract all entities from the DataFrame. This function uses the spacy library’s NER capabilities to identify entities in the data. If a column is specified in the fakeit parameter, the entities in that column are replaced with the specified fake replacement. If a column is specified in the do_not_fake parameter, it is left untouched. Otherwise, NER is performed on each row of the column.
   2. The generate_fake_labels function is then called to generate fake labels for the extracted entities. This function uses the faker library to generate fake data that matches the type of the original data (e.g., names, companies, dates, cities, etc.).
   3. The replace_label_with_fake function is then used to replace the original entities in the DataFrame with the generated fake labels.
4. Data Export: The to_csv method is used to write the anonymized DataFrame to a CSV file.
5. Example Data Import: The import_example method is used to import example datasets from a GitHub source or a specified URL.
Start
|
v
Initialize `anonym` class
|
v
Import data using `import_data` method
|
v
Anonymize data using `anonymize` method
| |
| v
| Extract entities using `extract_entities` function
| |
| v
| Generate fake labels using `generate_fake_labels` function
| |
| v
| Replace original labels with fake ones using `replace_label_with_fake` function
v
Export anonymized data using `to_csv` method
|
v
End
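The column handling in the anonymization step (forced replacement, untouched, or NER per row) can be sketched as a small routing function. This is a minimal illustration and not the library's implementation: `route_columns` is a hypothetical helper, and it assumes that fakeit behaves like a mapping from column name to a replacement type and that do_not_fake is a list of column names.

```python
def route_columns(columns, fakeit=None, do_not_fake=None):
    """Return a per-column action: ('replace', kind), ('skip', None) or ('ner', None).

    Hypothetical helper; the argument shapes are assumptions for illustration.
    """
    fakeit = fakeit or {}
    untouched = set(do_not_fake or [])
    plan = {}
    for col in columns:
        if col in fakeit:
            # Column is forced to a specific fake replacement type.
            plan[col] = ("replace", fakeit[col])
        elif col in untouched:
            # Column is left exactly as it is.
            plan[col] = ("skip", None)
        else:
            # Default: run NER on every row of this column.
            plan[col] = ("ner", None)
    return plan


plan = route_columns(
    ["Name", "Ticket", "Comment"],
    fakeit={"Name": "name"},
    do_not_fake=["Ticket"],
)
for column, action in plan.items():
    print(column, action)
```

Only the columns routed to 'ner' would go through entity extraction in this sketch; the other two branches bypass it.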
The algorithm also includes several utility functions for text cleaning, preprocessing, filtering values, checking the spacy model, and setting the logger. The main function at the end of the script demonstrates how to use the anonym class to import an example dataset, anonymize it, and plot the results.
Named Entity Recognition Categories
The available Named Entity Recognition (NER) categories in the anonym algorithm are determined by the underlying spacy library. Here are the categories used in the anonym algorithm:
Abbreviation | Description |
---|---|
PERSON | People, including fictional. |
ORG | Companies, agencies, institutions, etc. |
DATE | Absolute or relative dates or periods. |
LOC | Non-GPE locations, mountain ranges, bodies of water. |
MONEY | Monetary values, including unit. |
NORP | Nationalities or religious or political groups. |
ADDRESS | Addresses of any kind. |
GPE | Countries, cities, states. |
EVENT | Named hurricanes, battles, wars, sports events, etc. |
WORK_OF_ART | Titles of books, songs, etc. |
LAW | Named documents made into laws. |
Each of these categories is associated with a specific fake data generation function from the faker library. For example, ‘PERSON’ entities are replaced with fake names, ‘ORG’ entities are replaced with fake company names, ‘DATE’ entities are replaced with fake dates, and so on.
The algorithm also accepts a NER blacklist: a list of NER labels to ignore during anonymization. By default, this list includes ‘CARDINAL’, ‘GPE’, ‘PRODUCT’, and ‘DATE’, so entities with these labels, including the GPE and DATE categories listed above, are left unchanged unless the blacklist is overridden.
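The link between NER labels, fake generators, and the blacklist can be illustrated with a small dispatch table. The sketch below is an assumption-laden stand-in: it uses fixed pools of made-up values instead of faker, the function name `generate_fake_labels_sketch` is hypothetical, and reusing the same fake for a repeated entity is a design choice of this example rather than documented behaviour.

```python
import random

# Hypothetical pools standing in for faker generators, keyed by NER label.
FAKE_POOLS = {
    "PERSON": ["Alex Jansen", "Robin de Vries", "Sam Bakker"],
    "ORG": ["Acme BV", "Globex NV", "Initech"],
    "LOC": ["North Ridge", "Lake Meadow"],
}

# Default blacklist as stated in the text above.
DEFAULT_BLACKLIST = ("CARDINAL", "GPE", "PRODUCT", "DATE")


def generate_fake_labels_sketch(entities, blacklist=DEFAULT_BLACKLIST, seed=0):
    """Map each (text, label) entity to a fake value, skipping blacklisted labels."""
    rng = random.Random(seed)
    fakes = {}
    for text, label in entities:
        if label in blacklist or label not in FAKE_POOLS:
            # Blacklisted or unsupported labels keep their original text.
            continue
        if text not in fakes:
            # Assumption: a repeated entity always gets the same fake value.
            fakes[text] = rng.choice(FAKE_POOLS[label])
    return fakes


entities = [("Jan Smit", "PERSON"), ("2020-01-01", "DATE"), ("Philips", "ORG")]
fakes = generate_fake_labels_sketch(entities)
print(sorted(fakes))  # the DATE entity is absent because it is blacklisted
```

Passing an empty blacklist would make the DATE entity eligible, but it would still be skipped here because this toy table has no DATE pool.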
Faking the data entries
The data is faked in the anonym algorithm using the faker library, which is a Python package that generates fake data. Here’s a summary of how the data is faked:
1. Entity Extraction: The algorithm first identifies entities in the data using Named Entity Recognition (NER) from the spacy library. This process identifies pieces of text that represent certain types of information, such as names, dates, or addresses.
2. Fake Data Generation: For each identified entity, the algorithm generates a corresponding piece of fake data. The type of fake data generated depends on the type of the entity. For example, if the entity is a person’s name, the algorithm generates a fake name. If the entity is a date, the algorithm generates a fake date. This is done using the faker library, which has functions for generating various types of fake data.
3. Replacement: Once the fake data is generated, the algorithm replaces the original entities in the data with the generated fake data. This is done in such a way that the structure and format of the data remain the same, only the actual content is changed.
4. Blacklist: The algorithm also allows for a blacklist of entity types that should not be faked. If an entity type is in the blacklist, it will be ignored and left as is in the data.
By using this process, the anonym algorithm can effectively anonymize data by replacing real, potentially sensitive information with fake, nonsensitive information.
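To make the replacement step concrete, here is a minimal sketch of swapping entities inside a text cell while leaving the surrounding text intact. The helper `replace_entities` is hypothetical and is not the library's replace_label_with_fake function; matching longer entities first is an assumption added so that a short entity cannot clobber part of a longer one.

```python
import re


def replace_entities(text, mapping):
    """Replace every known entity in `text` with its fake counterpart."""
    if not mapping:
        return text
    # Longest entities first, so "Jan Smit" is matched before "Jan".
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(entity) for entity in ordered))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


cell = "Meeting with Jan Smit at Philips on Monday."
fake_cell = replace_entities(cell, {"Jan Smit": "Alex Jansen", "Philips": "Acme BV"})
print(fake_cell)  # Meeting with Alex Jansen at Acme BV on Monday.
```

Everything outside the matched entities, including punctuation and word order, is carried over unchanged, which is what keeps the structure of the cell the same.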