import pandas as pd
from string_grouper import group_similar_strings
We have already seen that string_grouper has a function group_similar_strings() that partitions a Series of strings into groups based on their degree of mutual similarity. To represent each group, group_similar_strings() chooses one member of the group. The default choice is the so-called centroid of the group.
The centroid of a group of similar strings is that string in the group which has the highest similarity aggregate.
The similarity aggregate of a string is the sum of all the cosine similarities between it and the strings that it matches.
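The centroid selection can be sketched in a few lines of pandas. This is only an illustration of the definitions above (not string_grouper's internal implementation), using a hypothetical table of pairwise matches with made-up similarity values:

```python
import pandas as pd

# Hypothetical pairwise matches within one group (toy cosine-similarity values):
matches = pd.DataFrame({
    'left':       ['A', 'A', 'B', 'B', 'C', 'C'],
    'right':      ['B', 'C', 'A', 'C', 'A', 'B'],
    'similarity': [0.9, 0.7, 0.9, 0.8, 0.7, 0.8],
})

# The similarity aggregate of each string is the sum of its similarities:
aggregates = matches.groupby('left')['similarity'].sum()

# The centroid is the string with the highest similarity aggregate:
centroid = aggregates.idxmax()
```

Here `'B'` would be chosen as centroid, since its similarity aggregate (0.9 + 0.8 = 1.7) exceeds those of `'A'` and `'C'`.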
This choice can also be specified explicitly by setting the keyword argument group_rep='centroid' of group_similar_strings().
group_similar_strings() has one alternative choice of group-representative, specified by setting group_rep='first'. This choice is simply the first member of the group according to its index (that is, its position in the order of appearance of members in the group). Though somewhat arbitrary, this choice is the fastest and can be used for large datasets whenever the choice of group-representative is not important.
But the user may not be satisfied with the only two choices group_similar_strings() makes available. For example, they might prefer the earliest recorded string in the group to represent the group (if timestamp metadata is available). Fortunately, three further choices are provided by an auxiliary module named string_grouper_utils, which is included in the package and can be imported whenever necessary:
from string_grouper_utils import new_group_rep_by_highest_weight, \
new_group_rep_by_earliest_timestamp, new_group_rep_by_completeness
string_grouper_utils provides three high-level functions: new_group_rep_by_highest_weight(), new_group_rep_by_earliest_timestamp(), and new_group_rep_by_completeness(). These functions change the group-representatives of data that have already been grouped (by group_similar_strings(), for example).
Let us create a DataFrame with some artificial timestamped records:
customers_df = pd.DataFrame(
[
('BB016741P', 'Mega Enterprises Corporation', 'Address0', 'Tel0', 'Description0', 0.2, '2014-12-30 10:55:00-02:00'),
('CC082744L', 'Hyper Startup Incorporated', '', 'Tel1', '', 0.5, '2017-01-01 20:23:15-05:00'),
('AA098762D', 'Hyper Startup Inc.', 'Address2', 'Tel2', 'Description2', 0.3, '2020-10-20 15:29:30+02:00'),
('BB099931J', 'Hyper-Startup Inc.', 'Address3', 'Tel3', 'Description3', 0.1, '2013-07-01 03:34:45-05:00'),
('HH072982K', 'Hyper Hyper Inc.', 'Address4', '', 'Description4', 0.9, '2005-09-11 11:56:00-07:00'),
('EE059082Q', 'Mega Enterprises Corp.', 'Address5', 'Tel5', 'Description5', 1.0, '1998-04-14 09:21:11+00:00')
],
columns=('Customer ID', 'Customer Name', 'Address', 'Tel', 'Description', 'weight', 'timestamp')
).set_index('Customer ID')
NB. These 'timestamps' are not actual pandas Timestamp datatypes --- they are strings. If we liked, we could convert them to pandas Timestamp datatypes or datetime datatypes (from the python module datetime), but this is not necessary: string_grouper_utils can deal with these strings just as they are, automatically parsing them into (localized) pandas Timestamp datatypes internally for comparison, as we shall soon see.
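To see that such strings do parse cleanly, one can run them through dateutil's parser (the kind of parsing string_grouper_utils performs internally). Taking one of the timestamp strings from the DataFrame above:

```python
from datetime import timedelta
from dateutil.parser import parse

# One of the timestamp strings above parses directly to a timezone-aware datetime:
ts = parse('2014-12-30 10:55:00-02:00')

# ts is a datetime.datetime with a UTC offset of -2 hours
assert ts.utcoffset() == timedelta(hours=-2)
```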
Also, in this example we have used the most general timestamps: each string carries a date together with time-of-day and timezone information. This is not always necessary; for example, each string may contain only date information if desired.
Let us display the DataFrame:
customers_df
| Customer ID | Customer Name | Address | Tel | Description | weight | timestamp |
|---|---|---|---|---|---|---|
| BB016741P | Mega Enterprises Corporation | Address0 | Tel0 | Description0 | 0.2 | 2014-12-30 10:55:00-02:00 |
| CC082744L | Hyper Startup Incorporated |  | Tel1 |  | 0.5 | 2017-01-01 20:23:15-05:00 |
| AA098762D | Hyper Startup Inc. | Address2 | Tel2 | Description2 | 0.3 | 2020-10-20 15:29:30+02:00 |
| BB099931J | Hyper-Startup Inc. | Address3 | Tel3 | Description3 | 0.1 | 2013-07-01 03:34:45-05:00 |
| HH072982K | Hyper Hyper Inc. | Address4 |  | Description4 | 0.9 | 2005-09-11 11:56:00-07:00 |
| EE059082Q | Mega Enterprises Corp. | Address5 | Tel5 | Description5 | 1.0 | 1998-04-14 09:21:11+00:00 |
With the following command, we can create a mapping table with the groupings that group_similar_strings() finds. Here the keyword argument group_rep is not explicitly set; it therefore takes its default value 'centroid'.
customers_df[['group rep ID', 'group rep']] = \
group_similar_strings(
customers_df['Customer Name'],
min_similarity=0.6)
Let's display the mapping table:
customers_df
| Customer ID | Customer Name | Address | Tel | Description | weight | timestamp | group rep ID | group rep |
|---|---|---|---|---|---|---|---|---|
| BB016741P | Mega Enterprises Corporation | Address0 | Tel0 | Description0 | 0.2 | 2014-12-30 10:55:00-02:00 | BB016741P | Mega Enterprises Corporation |
| CC082744L | Hyper Startup Incorporated |  | Tel1 |  | 0.5 | 2017-01-01 20:23:15-05:00 | AA098762D | Hyper Startup Inc. |
| AA098762D | Hyper Startup Inc. | Address2 | Tel2 | Description2 | 0.3 | 2020-10-20 15:29:30+02:00 | AA098762D | Hyper Startup Inc. |
| BB099931J | Hyper-Startup Inc. | Address3 | Tel3 | Description3 | 0.1 | 2013-07-01 03:34:45-05:00 | AA098762D | Hyper Startup Inc. |
| HH072982K | Hyper Hyper Inc. | Address4 |  | Description4 | 0.9 | 2005-09-11 11:56:00-07:00 | HH072982K | Hyper Hyper Inc. |
| EE059082Q | Mega Enterprises Corp. | Address5 | Tel5 | Description5 | 1.0 | 1998-04-14 09:21:11+00:00 | BB016741P | Mega Enterprises Corporation |
Let's try this again, this time with group_rep='first':
customers_df[['group rep ID', 'group rep']] = \
group_similar_strings(
customers_df['Customer Name'],
group_rep='first',
min_similarity=0.6)
Displaying the new mapping table shows the differences from the result above:
customers_df
| Customer ID | Customer Name | Address | Tel | Description | weight | timestamp | group rep ID | group rep |
|---|---|---|---|---|---|---|---|---|
| BB016741P | Mega Enterprises Corporation | Address0 | Tel0 | Description0 | 0.2 | 2014-12-30 10:55:00-02:00 | BB016741P | Mega Enterprises Corporation |
| CC082744L | Hyper Startup Incorporated |  | Tel1 |  | 0.5 | 2017-01-01 20:23:15-05:00 | CC082744L | Hyper Startup Incorporated |
| AA098762D | Hyper Startup Inc. | Address2 | Tel2 | Description2 | 0.3 | 2020-10-20 15:29:30+02:00 | CC082744L | Hyper Startup Incorporated |
| BB099931J | Hyper-Startup Inc. | Address3 | Tel3 | Description3 | 0.1 | 2013-07-01 03:34:45-05:00 | CC082744L | Hyper Startup Incorporated |
| HH072982K | Hyper Hyper Inc. | Address4 |  | Description4 | 0.9 | 2005-09-11 11:56:00-07:00 | HH072982K | Hyper Hyper Inc. |
| EE059082Q | Mega Enterprises Corp. | Address5 | Tel5 | Description5 | 1.0 | 1998-04-14 09:21:11+00:00 | BB016741P | Mega Enterprises Corporation |
Note that the groups themselves are unchanged; only the group names (representatives) have changed.
As mentioned above, there are still more choices of group-representative available. Let's use the new_group_rep_by_earliest_timestamp() function:
customers_df.reset_index(inplace=True)
customers_df[['group rep ID', 'group rep']] = \
new_group_rep_by_earliest_timestamp(
grouped_data=customers_df,
group_col='group rep ID',
record_id_col='Customer ID',
record_name_col='Customer Name',
timestamps='timestamp',
dayfirst=False
)
Notice that this time the function operates on already grouped data (such as the mapping table output by group_similar_strings() above). The column of the input grouped data containing the groups (here either 'group rep ID' or 'group rep') must therefore be specified as argument group_col, in addition to the column containing the group members (here either 'Customer ID' or 'Customer Name') as argument record_id_col.
Argument record_name_col is optional; if specified, it appears in the output alongside the new group-representatives chosen from record_id_col.
The keyword argument dayfirst used here is the same one used by the parser.parse() function of the python module dateutil. It specifies whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the day (True) or the month (False). If the keyword argument yearfirst is set to True, dayfirst instead distinguishes between YDM and YMD.
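The effect of dayfirst is easy to demonstrate directly with dateutil on the ambiguous date from the description above:

```python
from datetime import datetime
from dateutil.parser import parse

# '01/05/09' is ambiguous: interpreted month-first by default,
# day-first when dayfirst=True (yearfirst is False in both cases).
month_first = parse('01/05/09')               # January 5, 2009
day_first = parse('01/05/09', dayfirst=True)  # May 1, 2009

assert month_first == datetime(2009, 1, 5)
assert day_first == datetime(2009, 5, 1)
```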
The other possible keyword arguments that can be used are detailed in the docstring (help) of new_group_rep_by_earliest_timestamp():
help(new_group_rep_by_earliest_timestamp)
Help on function new_group_rep_by_earliest_timestamp in module string_grouper_utils.string_grouper_utils:
new_group_rep_by_earliest_timestamp(grouped_data: pandas.core.frame.DataFrame, group_col: Union[str, int], record_id_col: Union[str, int], timestamps: Union[pandas.core.series.Series, str, int], record_name_col: Union[str, int, NoneType] = None, parserinfo=None, **kwargs) -> Union[pandas.core.frame.DataFrame, pandas.core.series.Series]
Selects the oldest string in each group as group-representative.
:param grouped_data: The grouped DataFrame
:param group_col: The name or positional index of the column in grouped_data containing the groups
:param record_id_col: The name or positional index of the column in grouped_data with all groups' members' IDs
(This will appear in the output)
:param timestamps: pandas.Series or the column name (str) or column positional index (int) in grouped_data
This contains the timestamps of the strings to be grouped.
:param record_name_col: (Optional) The name or positional index of the column in grouped_data with
all groups' members' names. (This will appear in the output.)
:param parserinfo: (See below.)
:param **kwargs: (See below.)
parserinfo and kwargs are the same arguments as those you would pass to dateutil.parser.parse. They help in
interpreting the string inputs which are to be parsed into datetime datatypes.
FYI, the dateutil.parser.parse documentation for these arguments follows:
:param parserinfo:
A :class:`parserinfo` object containing parameters for the parser.
If ``None``, the default arguments to the :class:`parserinfo`
constructor are used.
The ``**kwargs`` parameter takes the following keyword arguments:
:param default:
The default datetime object, if this is a datetime object and not
``None``, elements specified in the strings containing the date/time-stamps replace elements in the
default object.
:param ignoretz:
If set ``True``, time zones in parsed strings are ignored and a naive
:class:`datetime` object is returned.
:param tzinfos:
Additional time zone names / aliases which may be present in the
string. This argument maps time zone names (and optionally offsets
from those time zones) to time zones. This parameter can be a
dictionary with timezone aliases mapping time zone names to time
zones or a function taking two parameters (``tzname`` and
``tzoffset``) and returning a time zone.
The timezones to which the names are mapped can be an integer
offset from UTC in seconds or a :class:`tzinfo` object.
.. doctest::
:options: +NORMALIZE_WHITESPACE
>>> from dateutil.parser import parse
>>> from dateutil.tz import gettz
>>> tzinfos = {"BRST": -7200, "CST": gettz("America/Chicago")}
>>> parse("2012-01-19 17:21:00 BRST", tzinfos=tzinfos)
datetime.datetime(2012, 1, 19, 17, 21, tzinfo=tzoffset(u'BRST', -7200))
>>> parse("2012-01-19 17:21:00 CST", tzinfos=tzinfos)
datetime.datetime(2012, 1, 19, 17, 21,
tzinfo=tzfile('/usr/share/zoneinfo/America/Chicago'))
This parameter is ignored if ``ignoretz`` is set.
:param dayfirst:
Whether to interpret the first value in an ambiguous 3-integer date
(e.g. 01/05/09) as the day (``True``) or month (``False``). If
``yearfirst`` is set to ``True``, this distinguishes between YDM and
YMD. If set to ``None``, this value is retrieved from the current
:class:`parserinfo` object (which itself defaults to ``False``).
:param yearfirst:
Whether to interpret the first value in an ambiguous 3-integer date
(e.g. 01/05/09) as the year. If ``True``, the first number is taken to
be the year, otherwise the last number is taken to be the year. If
this is set to ``None``, the value is retrieved from the current
:class:`parserinfo` object (which itself defaults to ``False``).
:param fuzzy:
Whether to allow fuzzy parsing, allowing for string like "Today is
January 1, 2047 at 8:21:00AM".
:param fuzzy_with_tokens:
If ``True``, ``fuzzy`` is automatically set to True, and the parser
will return a tuple where the first element is the parsed
:class:`datetime.datetime` datetimestamp and the second element is
a tuple containing the portions of the string which were ignored:
.. doctest::
>>> from dateutil.parser import parse
>>> parse("Today is January 1, 2047 at 8:21:00AM", fuzzy_with_tokens=True)
(datetime.datetime(2047, 1, 1, 8, 21), (u'Today is ', u' ', u'at '))
customers_df
|   | Customer ID | Customer Name | Address | Tel | Description | weight | timestamp | group rep ID | group rep |
|---|---|---|---|---|---|---|---|---|---|
| 0 | BB016741P | Mega Enterprises Corporation | Address0 | Tel0 | Description0 | 0.2 | 2014-12-30 10:55:00-02:00 | EE059082Q | Mega Enterprises Corp. |
| 1 | CC082744L | Hyper Startup Incorporated |  | Tel1 |  | 0.5 | 2017-01-01 20:23:15-05:00 | BB099931J | Hyper-Startup Inc. |
| 2 | AA098762D | Hyper Startup Inc. | Address2 | Tel2 | Description2 | 0.3 | 2020-10-20 15:29:30+02:00 | BB099931J | Hyper-Startup Inc. |
| 3 | BB099931J | Hyper-Startup Inc. | Address3 | Tel3 | Description3 | 0.1 | 2013-07-01 03:34:45-05:00 | BB099931J | Hyper-Startup Inc. |
| 4 | HH072982K | Hyper Hyper Inc. | Address4 |  | Description4 | 0.9 | 2005-09-11 11:56:00-07:00 | HH072982K | Hyper Hyper Inc. |
| 5 | EE059082Q | Mega Enterprises Corp. | Address5 | Tel5 | Description5 | 1.0 | 1998-04-14 09:21:11+00:00 | EE059082Q | Mega Enterprises Corp. |
Here the group-member with the earliest timestamp has been chosen as group-representative for each group. Notice that even though the timestamp data is input as strings, the function is able to treat them as if they were datetime (or pandas Timestamp) datatypes.
The other two utility functions, new_group_rep_by_highest_weight() and new_group_rep_by_completeness(), operate in a similar way to new_group_rep_by_earliest_timestamp():

- new_group_rep_by_highest_weight() chooses the group-member with the highest weight as group-representative for each group. The weight of each member is assigned as desired by the user and provided as an argument to the function; the weights can also be a specified column of the input grouped data (mapping table).
- new_group_rep_by_completeness() chooses the group-member with the most filled-in fields in its row as group-representative for each group.
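As a rough sketch of the completeness criterion (an illustration only, not the library's actual implementation), one can count each member's non-empty fields and pick the member with the most, treating empty strings as missing:

```python
import pandas as pd

# Toy group with some empty fields (empty strings count as missing here):
group = pd.DataFrame({
    'Customer ID': ['CC082744L', 'AA098762D'],
    'Address': ['', 'Address2'],
    'Tel': ['Tel1', 'Tel2'],
    'Description': ['', 'Description2'],
}).set_index('Customer ID')

# Count filled-in fields per member and pick the most complete one:
filled = group.replace('', pd.NA).notna().sum(axis=1)
most_complete = filled.idxmax()
```

In this toy group, 'AA098762D' (three filled fields) would beat 'CC082744L' (one filled field) as group-representative.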
customers_df[['group rep ID', 'group rep']] = \
new_group_rep_by_highest_weight(
grouped_data=customers_df,
group_col='group rep ID',
record_id_col='Customer ID',
weights='weight',
record_name_col='Customer Name'
)
customers_df
|   | Customer ID | Customer Name | Address | Tel | Description | weight | timestamp | group rep ID | group rep |
|---|---|---|---|---|---|---|---|---|---|
| 0 | BB016741P | Mega Enterprises Corporation | Address0 | Tel0 | Description0 | 0.2 | 2014-12-30 10:55:00-02:00 | EE059082Q | Mega Enterprises Corp. |
| 1 | CC082744L | Hyper Startup Incorporated |  | Tel1 |  | 0.5 | 2017-01-01 20:23:15-05:00 | CC082744L | Hyper Startup Incorporated |
| 2 | AA098762D | Hyper Startup Inc. | Address2 | Tel2 | Description2 | 0.3 | 2020-10-20 15:29:30+02:00 | CC082744L | Hyper Startup Incorporated |
| 3 | BB099931J | Hyper-Startup Inc. | Address3 | Tel3 | Description3 | 0.1 | 2013-07-01 03:34:45-05:00 | CC082744L | Hyper Startup Incorporated |
| 4 | HH072982K | Hyper Hyper Inc. | Address4 |  | Description4 | 0.9 | 2005-09-11 11:56:00-07:00 | HH072982K | Hyper Hyper Inc. |
| 5 | EE059082Q | Mega Enterprises Corp. | Address5 | Tel5 | Description5 | 1.0 | 1998-04-14 09:21:11+00:00 | EE059082Q | Mega Enterprises Corp. |
customers_df[['group rep ID', 'group rep']] = \
new_group_rep_by_completeness(
grouped_data=customers_df,
group_col='group rep ID',
record_id_col='Customer ID',
record_name_col='Customer Name',
tested_cols=['Address', 'Tel', 'Description']
)
N.B. If argument tested_cols is not given, new_group_rep_by_completeness() will test the filled-in status of all the fields of grouped_data for each group member.
customers_df
|   | Customer ID | Customer Name | Address | Tel | Description | weight | timestamp | group rep ID | group rep |
|---|---|---|---|---|---|---|---|---|---|
| 0 | BB016741P | Mega Enterprises Corporation | Address0 | Tel0 | Description0 | 0.2 | 2014-12-30 10:55:00-02:00 | BB016741P | Mega Enterprises Corporation |
| 1 | CC082744L | Hyper Startup Incorporated |  | Tel1 |  | 0.5 | 2017-01-01 20:23:15-05:00 | AA098762D | Hyper Startup Inc. |
| 2 | AA098762D | Hyper Startup Inc. | Address2 | Tel2 | Description2 | 0.3 | 2020-10-20 15:29:30+02:00 | AA098762D | Hyper Startup Inc. |
| 3 | BB099931J | Hyper-Startup Inc. | Address3 | Tel3 | Description3 | 0.1 | 2013-07-01 03:34:45-05:00 | AA098762D | Hyper Startup Inc. |
| 4 | HH072982K | Hyper Hyper Inc. | Address4 |  | Description4 | 0.9 | 2005-09-11 11:56:00-07:00 | HH072982K | Hyper Hyper Inc. |
| 5 | EE059082Q | Mega Enterprises Corp. | Address5 | Tel5 | Description5 | 1.0 | 1998-04-14 09:21:11+00:00 | BB016741P | Mega Enterprises Corporation |