6 Resources to Generate Test Data That’s Realistic
Needing to generate test data or synthetic data can be stressful. I like to produce sample data that contains realistic values, such as professional-sounding names, so that the people in a training session have a frame of reference and do not get tripped up by the data. You can create realistic datasets by following the steps and using the resources listed below.
There are times when, as a consultant or trainer, you have to produce a demo and need a demo dataset that is unrelated to your client’s data. This can be due to security, confidentiality, or the timing of your data availability. Creating these datasets for Power BI Portfolio Projects can be pretty helpful and is usually defined in the testing plan.
1. Need a Database Schema?
You often know what schema you wish to generate but need some interesting ideas for enhancing the data model. There was a site, DatabaseAnswers.org, with over 1,500 data models presented as database diagrams. According to the Reddit post "What happened to Database Answers?" (reddit.com), the person who maintained the site may have passed away.
Most of the links are preserved on Archive.org's Wayback Machine: List of All Data Models from DatabaseAnswers.org (archive.org). The archive may not be complete, but this site was so helpful to me back in the day that I want to honor its creator here.
Note: There is a mirror of the site at https://fordnox.github.io/databaseanswers/data_models/index.htm and a GitHub project at https://github.com/fordnox/databaseanswers/
2. Getting Realistic People’s Names
Most databases require people’s names at some point, such as employees, customers, or sales reps. These can be difficult to get right and make believable. A great resource is the Random Word Generator (name), which lets you generate up to one hundred names at a time with various options: male, female, or both, and common or rare names. This site also has other random generation options and features.
3. Getting Sample Product Names
Product names can be the hardest to invent, but I found a site that lets you generate fantasy object names, Fantasy-Name-Generator.com. This generator provides ten random names based on real and fictional artifact names, normally used for relics, artifacts, and other special trinkets. This adds a bit of fun; however, you can filter out any “non-professional” sounding names depending on your audience.
4. Building Business Names and Addresses
As with people’s names, getting realistic business names can be challenging. Fantasy Name Generators.com allows you to select how many names you need and hit the Randomize button. You can also choose an industry group and get a set of names for that sector. North American Address Generator can provide a list of random addresses you can add to the company name dataset. It also provides random phone numbers. However, I usually replace most of the digits with an “x” in case someone tries to dial them.
Note: I have removed the sites that provided fake addresses and names, as I did not trust most of the ones I could find. See the demo below, where I use Python to generate addresses.
5. Get GPS Locations For Sample Maps
Locations matter because you can capture the latitude and longitude of each one and plot it on a map. The Latitude and Longitude Finder on Map Get Coordinates (latlong.net) will let you select a point on a map and provide its latitude and longitude. This is useful if you need to select points in a specific area, such as actual client sites.
If you are looking for a group of random locations, the Random Point Generator, pictured below, is a great place to start. It allows you to gather locations while providing several options.
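If you would rather script the random locations, the idea can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not how the Random Point Generator site works. The function name random_points and the bounding box values are assumptions made up for this example. Sampling latitude and longitude uniformly is only a rough approximation, and points may land on water or other unsuitable spots, so review them before using them in a demo.

```python
import random


def random_points(min_lat, max_lat, min_lon, max_lon, count, seed=None):
    """Return `count` (latitude, longitude) pairs inside a bounding box."""
    rng = random.Random(seed)  # passing a seed makes the demo data repeatable
    points = []
    for _ in range(count):
        lat = round(rng.uniform(min_lat, max_lat), 6)
        lon = round(rng.uniform(min_lon, max_lon), 6)
        points.append((lat, lon))
    return points


# Hypothetical bounding box roughly covering the Denver, Colorado area.
demo_points = random_points(39.6, 39.9, -105.1, -104.8, count=5, seed=42)
for lat, lon in demo_points:
    print(lat, lon)
```

Each pair can go straight into latitude and longitude columns of a demo table.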
These are resources that I have used to generate realistic data for demos, training, and learning databases. Please share any others you have used.
6. How to use AI or Python to Generate Test Data
Test data can be generated in Python with a combination of libraries and techniques that produce realistic but randomized data. The Faker library is a useful tool for generating fake data, including names, addresses, and other common data types.
Python’s built-in random module can handle picking a position at random and generating yearly salaries. To illustrate this, we will create a company payroll table with 50 employees and four attributes: “Employee Number,” “Employee Name,” “Position,” and “Yearly Salary”.
Tutorial: Create an Employee Data table in Python.
In the example below, I needed a payroll table for a data analytics company. Each employee record has four fields: “Employee Number,” “Employee Name,” “Position,” and “Yearly Salary.” The table needed to have 50 employees.
First, you must install the Faker Python library if you haven’t already done so. You can install it using pip:
pip install Faker
Now, let’s create a Python script to generate the company payroll table:
import random

from faker import Faker

fake = Faker()

# Job titles to choose from for a data analytics company
positions = ["Data Analyst", "Data Scientist", "Data Engineer", "Machine Learning Engineer", "Business Analyst", "Database Administrator", "BI Developer", "Data Architect"]


def generate_employee_record(employee_number):
    # Build one employee row with a fake name, a random position, and a random salary
    employee_name = fake.name()
    position = random.choice(positions)
    yearly_salary = round(random.uniform(50000, 120000), 2)
    return {"Employee Number": employee_number, "Employee Name": employee_name, "Position": position, "Yearly Salary": yearly_salary}


def generate_payroll_table(num_employees):
    # Employee numbers run from 1 to num_employees, so each one is unique
    payroll_table = []
    for i in range(1, num_employees + 1):
        employee_record = generate_employee_record(i)
        payroll_table.append(employee_record)
    return payroll_table


payroll_table = generate_payroll_table(50)
for employee in payroll_table:
    print(employee)
This script creates a company payroll table with 50 employees, each with a unique employee number, a randomly generated name, a position chosen at random from the positions list, and a random yearly salary between $50,000 and $120,000. The picture below is from a Jupyter Notebook in the Anaconda environment on a Mac.
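Printing dictionaries is fine for a quick check, but to load the table into a tool such as Power BI you will probably want a file. One possible approach, sketched here with only the standard library, is to turn the list of dictionaries into CSV text. The function name payroll_to_csv and the two sample rows are placeholders invented for this example; in practice you would pass in the payroll_table list from the script above.

```python
import csv
import io

FIELDS = ["Employee Number", "Employee Name", "Position", "Yearly Salary"]


def payroll_to_csv(rows):
    """Serialize a list of payroll dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows in the same shape as the script's output (made-up values).
sample_rows = [
    {"Employee Number": 1, "Employee Name": "Pat Example", "Position": "Data Analyst", "Yearly Salary": 61250.5},
    {"Employee Number": 2, "Employee Name": "Sam Sample", "Position": "Data Engineer", "Yearly Salary": 98400.0},
]

csv_text = payroll_to_csv(sample_rows)
print(csv_text)
```

To save the result, open a file with newline="" and write csv_text to it; the file name is up to you.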
Tutorial: Create a Business Data table in Python
Let’s create a Business dimension table.
Using the same format, this Python code uses the Faker package.
import random

from faker import Faker

fake = Faker('en_US')


def generate_business_record(business_number):
    # Build one business row with a fake company name, address, and phone number
    business_name = fake.company()
    address = fake.address()
    phone_number = fake.phone_number()

    # Replace two random digits in the phone number with 'X' to prevent real calls
    phone_number_chars = list(phone_number)
    digit_indices = [i for i, char in enumerate(phone_number_chars) if char.isdigit()]
    if len(digit_indices) >= 2:
        indices_to_replace = random.sample(digit_indices, 2)
        for idx in indices_to_replace:
            phone_number_chars[idx] = 'X'
    phone_number = ''.join(phone_number_chars)

    return {
        "Business Number": business_number,
        "Business Name": business_name,
        "Address": address,
        "Phone Number": phone_number
    }


def generate_business_table(num_businesses):
    # Business numbers run from 1 to num_businesses
    business_table = []
    for i in range(1, num_businesses + 1):
        business_record = generate_business_record(i)
        business_table.append(business_record)
    return business_table


business_table = generate_business_table(50)
for business in business_table:
    print(business)
The results below can be modified depending on your requirements. Being able to modify the code to get close to the results you are looking for is a big benefit. For example, the code replaces digits in the phone number with an “X” to make sure that you are not using actual phone numbers.
Partial Results:
{'Business Number': 1, 'Business Name': 'Hodges-Mcdonald', 'Address': '7494 Underwood Point\nGreenmouth, MD 72139', 'Phone Number': 'X01-342-321-5538x01X'}
{'Business Number': 2, 'Business Name': 'Williams, Burke and Best', 'Address': '871 Ward Dam Apt. 577\nSouth Jamestown, PR 91999', 'Phone Number': '(690)961-98X6x4X6'}
{'Business Number': 3, 'Business Name': 'Oliver Inc', 'Address': '005 Peterson Manors\nNew Anthonyhaven, GU 54803', 'Phone Number': '001-50X-518-77X9'}
{'Business Number': 4, 'Business Name': 'Gordon-Jones', 'Address': '1005 Colon Fords Apt. 631\nSmithhaven, ME 77481', 'Phone Number': '+1-X7X-803-8786'}
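The script above masks only two random digits. If you prefer to replace most of the digits, one possible variation is sketched below. The function name mask_phone_number, the keep parameter, and the sample numbers are assumptions made up for this example, and extension digits are masked along with the rest.

```python
def mask_phone_number(phone_number, keep=3):
    """Keep the first `keep` digits and replace every later digit with 'X'."""
    seen = 0
    masked_chars = []
    for char in phone_number:
        if char.isdigit():
            seen += 1
            masked_chars.append(char if seen <= keep else "X")
        else:
            # Punctuation and extension markers are left untouched
            masked_chars.append(char)
    return "".join(masked_chars)


# Made-up sample numbers in formats similar to the results above.
print(mask_phone_number("(555)555-0123"))
print(mask_phone_number("+1-555-555-0187x42"))
```

To use it, the digit-replacement step inside generate_business_record could be swapped for a call to this function.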
Conclusion
It seems many things have changed since I wrote the original version of this article in 2017. There is still a need to create demo data, and with the rise of data science and machine learning, there is now a need to generate test data as well. Python offers another way to get this done.
How do you generate test data?