Filtering a data frame means selecting part of it: for example, its unique values, its duplicate values, or the rows that satisfy a condition.
The examples use the following data frame:
import pandas as pd

# Expanded data dictionary with 31 samples and new departments
data = {
    "WorkerID": list(range(31)),
    "Age": [25, None, 35, 30, 24, 28, None, 32, 27, None, 33, 29, 40, 50,
            45, 37, 31, 34, 26, 38, 48, 27, 41, 32, 29, 47, 46, 33, 39, 36, 30],
    "Salary": [50000, 54000, None, 58000, 45000, 60000, 49000, None, None,
               47000, 55000, 60000, 52000, 64000, 51000, 47000, 58000, 54000,
               57000, 53000, 60000, 55000, 56000, 59000, 55000, 62000, 61000, 53000, 58000, 56000, 60000],
    "Department": ["HR", "Finance", "IT", "HR", "IT", "Finance", "IT",
                   "HR", "Finance", "HR", "AI", "Marketing", "Business",
                   "Finance", "IT", "Marketing", "AI", "HR", "Business",
                   "Finance", "IT", "AI", "HR", "Business", "Marketing", "HR",
                   "Finance", "IT", "Business", "AI", "HR"]
}
# Convert to DataFrame
# fillna(30) fills the missing values in every column with 30, including Salary
df = pd.DataFrame(data).fillna(30)
# Output the DataFrame (first 10 rows)
print(df.head(10))
Output:
   WorkerID   Age   Salary  Department
0         0  25.0  50000.0          HR
1         1  30.0  54000.0     Finance
2         2  35.0     30.0          IT
3         3  30.0  58000.0          HR
4         4  24.0  45000.0          IT
5         5  28.0  60000.0     Finance
6         6  30.0  49000.0          IT
7         7  32.0     30.0          HR
8         8  27.0     30.0     Finance
9         9  30.0  47000.0          HR
Unique values are the distinct values in a column: each one is listed once, no matter how many times it appears in the dataset.
# unique() works on a single column (a Series), so loop over the columns
for i in df.columns:
    print(f"Unique values for column {i}: {df[i].unique()}")
Output:
Unique values for column WorkerID: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30]
Unique values for column Age: [25. 30. 35. 24. 28. 32. 27. 33. 29. 40. 50. 45. 37. 31. 34. 26. 38. 48.
41. 47. 46. 39. 36.]
Unique values for column Salary: [5.0e+04 5.4e+04 3.0e+01 5.8e+04 4.5e+04 6.0e+04 4.9e+04 4.7e+04 5.5e+04
5.2e+04 6.4e+04 5.1e+04 5.7e+04 5.3e+04 5.6e+04 5.9e+04 6.2e+04 6.1e+04]
Unique values for column Department: ['HR' 'Finance' 'IT' 'AI' 'Marketing' 'Business']
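When only the number of distinct values matters, or when the values are needed for further processing instead of printing, the loop above can be complemented by `nunique()` and a dictionary comprehension. The following is a minimal sketch on a small hypothetical frame (`toy` is not part of the worker data above):

```python
import pandas as pd

# Hypothetical toy frame, much smaller than the worker data above
toy = pd.DataFrame({
    "Department": ["HR", "IT", "HR", "Finance", "IT"],
    "Age": [25, 30, 25, 41, 30],
})

# Number of distinct values in each column
distinct_counts = toy.nunique()

# Distinct values per column, in order of first appearance
distinct_values = {col: toy[col].unique().tolist() for col in toy.columns}

print(distinct_counts)
print(distinct_values)
```

`unique()` keeps the values in the order they first appear and does not sort them, which is why the `Department` output above starts with 'HR'.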
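# fillna(30) also wrote 30 into the missing Salary entries;
# replace that placeholder with a more plausible salary of 40000
df['Salary'] = df['Salary'].replace(30, 40000)
# unique() works on a single column (a Series), so loop over the columns
for i in df.columns:
    print(f"Unique values for column {i}: {df[i].unique()}")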
Output:
Unique values for column WorkerID: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30]
Unique values for column Age: [25. 30. 35. 24. 28. 32. 27. 33. 29. 40. 50. 45. 37. 31. 34. 26. 38. 48.
41. 47. 46. 39. 36.]
Unique values for column Salary: [50000. 54000. 40000. 58000. 45000. 60000. 49000. 47000. 55000. 52000.
64000. 51000. 57000. 53000. 56000. 59000. 62000. 61000.]
Unique values for column Department: ['HR' 'Finance' 'IT' 'AI' 'Marketing' 'Business']
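The two-step fix above (fill everything with 30, then replace 30 in `Salary`) can also be avoided by giving each column its own fill value, since `fillna` accepts a dictionary that maps column names to values. A minimal sketch on a hypothetical three-row frame, reusing the placeholders 30 and 40000 from the example:

```python
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "Age": [25, None, 35],
    "Salary": [50000, 54000, None],
})

# Map each column to its own fill value
filled = toy.fillna({"Age": 30, "Salary": 40000})
print(filled)
```

This also sidesteps a risk of the `replace(30, 40000)` approach, which would overwrite any genuine salary that happened to equal 30.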