Python Faker: how to generate synthetic data

Becca Weng - February 8th, 2023

There’s rarely enough data to run every test and analysis you’d like, and synthetic data may be the future of AI. Synthetic data can be useful for product demos, training, and internal testing, and particularly for organizations that handle sensitive data, such as financial or medical institutions. Although we’re generating real data every day, it is sometimes still necessary to augment what we already have. In this post, I’ll introduce the Python Faker library, which generates fake data. I’ll cover the profile provider and how to customize it, as well as the DynamicProvider class for further customization.

Install and import Python Faker

%pip install faker
from faker import Faker
import pandas as pd # for data manipulation

We'll use the pandas library to convert the results from the Faker library into DataFrames for easier data manipulation and exploration.

Generate synthetic data with Python Faker

Example 1: fake.name(), fake.address(), fake.email(), fake.phone_number()

# Instantiate Faker() instance
fake = Faker()

# Create a dataset including names, addresses, emails, and phone numbers
data = []
for _ in range(10):
    data.append([fake.name(), fake.address(), fake.email(), fake.phone_number()])

# Convert to DataFrame
df = pd.DataFrame(data, columns=['Name', 'Address', 'Email', 'Phone'])
df.head()

Output:

[Image: Python Faker output table: name, address, email, phone]

Faker groups its data-generating methods into providers, and each method returns a random value when called. In the above code chunk, I used methods from four built-in providers to generate names (person), physical mailing addresses (address), email addresses (internet), and phone numbers (phone_number).

Example 2: fake.profile()

Let's use another provider: profile, and see what data we can generate.

# Create a list of fake profiles
profiles = []

for _ in range(10):
    profiles.append(fake.profile())

# Save as a DataFrame
df2 = pd.DataFrame(profiles, columns = profiles[0].keys())
df2.head()

Output:

[Image: Python Faker profile() output table]

As you can see from the output, there's a lot of information. Let's take a look at an individual profile:

fake.profile()

Output:

Out[5]: 
{'job': 'Proofreader',
 'company': 'Gomez-Warren',
 'ssn': '610-78-7480',
 'residence': 'Unit 2431 Box 1222\nDPO AA 08184',
 'current_location': (Decimal('37.419820'), Decimal('-115.668193')),
 'blood_group': 'B-',
 'website': ['http://www.snyder-marquez.com/'],
 'username': 'btrujillo',
 'name': 'Karen Douglas',
 'sex': 'F',
 'address': '057 Karen Ports\nSouth Zacharystad, KS 29980',
 'mail': 'sergio52@yahoo.com',
 'birthdate': datetime.date(1939, 5, 11)}

Example 3: customize fake.profile(fields = [])

Depending on the columns you actually want for your fake profiles, you can list whichever attributes you're interested in using the fields argument.

# Create fake profiles using specific columns
profiles2 = []

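# Field names must match the keys of a full profile (use "job" for occupation);
# an unrecognized name such as "occupation" is silently dropped from the result
for _ in range(10):
    profiles2.append(fake.profile(fields = ["name", "sex", "blood_group", "birthdate"]))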

df3 = pd.DataFrame(profiles2, columns = profiles2[0].keys())
df3.head()

Output:

[Image: Python Faker profile() output table with custom fields: blood_group, name, sex, birthdate]

DynamicProvider: customizable provider

from faker.providers import DynamicProvider

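# Get unique list of museum names from existing dataset
# (df here is assumed to hold a museum dataset with a "Museum Name" column,
# not the fake data from Example 1)
museum_list = list(set(df["Museum Name"]))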

# Create museum_provider
museum_provider = DynamicProvider(
     provider_name = "museum_provider",
     elements = museum_list,
)

# Instantiate new Faker() instance
fake_more = Faker()

# Add new provider
fake_more.add_provider(museum_provider)

# Use new provider
fake_more.museum_provider()

Output:

Out[16]: 'BENNINGTON HISTORICAL SOCIETY'

In this dummy example, I took an existing dataset on museums, extracted just the names, and in a couple of lines of code created a new provider that returns a randomly chosen museum name from the data I provided. The same approach can be applied to any other existing dataset that you have.

Python Faker providers: standard vs. community

To see which other providers are available, you can use the following line of code. Note that we're accessing the providers attribute of a Faker() instance, here called fake. The methods of all these providers can be called directly on the instance, as we did above, without any additional import statements.

# Get full list of built-in providers
fake.providers

Output:

Out[6]: 
[<faker.providers.user_agent.Provider at 0x7f2b79bd0b20>,
 <faker.providers.ssn.en_US.Provider at 0x7f2b79bd36d0>,
 <faker.providers.python.Provider at 0x7f2b79bd3070>,
 <faker.providers.profile.Provider at 0x7f2b79bd3370>,
 <faker.providers.phone_number.en_US.Provider at 0x7f2b79bd2e90>,
 <faker.providers.person.en_US.Provider at 0x7f2b79bd3010>,
 <faker.providers.misc.en_US.Provider at 0x7f2b79bd0be0>,
 <faker.providers.lorem.en_US.Provider at 0x7f2b79bd2ec0>,
 <faker.providers.job.en_US.Provider at 0x7f2b79bd0c10>,
 <faker.providers.isbn.Provider at 0x7f2b79bd0d30>,
 <faker.providers.internet.en_US.Provider at 0x7f2b79bd0a90>,
 <faker.providers.geo.en_US.Provider at 0x7f2b79bd0bb0>,
 <faker.providers.file.Provider at 0x7f2b79bd1060>,
 <faker.providers.date_time.en_US.Provider at 0x7f2b79bd08b0>,
 <faker.providers.currency.en_US.Provider at 0x7f2b79bd0970>,
 <faker.providers.credit_card.en_US.Provider at 0x7f2b79bd3400>,
 <faker.providers.company.en_US.Provider at 0x7f2b79bd3190>,
 <faker.providers.color.en_US.Provider at 0x7f2b79bd3610>,
 <faker.providers.barcode.en_US.Provider at 0x7f2b79aa0ca0>,
 <faker.providers.bank.en_GB.Provider at 0x7f2b79aa3760>,
 <faker.providers.automotive.en_US.Provider at 0x7f2b79aa2b60>,
 <faker.providers.address.en_US.Provider at 0x7f2b79aa2f50>]

Beyond the basic providers, there are also community-developed providers, such as:

  • faker_airtravel: airport and flight information
  • faker_music: music genres, subgenres, and instrument information
  • faker_vehicle: year, make, model, and other vehicle information

Community providers must be installed and imported separately:

%pip install faker_airtravel

from faker import Faker
from faker_airtravel import AirTravelProvider

fake = Faker()
fake.add_provider(AirTravelProvider)

Check out Python Faker's GitHub repository and full documentation for more.
