Everything should be made as simple as possible, but not simpler. (Albert Einstein)

Thursday, December 29, 2016

K-means Clustering: Geolocation EDA & Inference Using Cell-phone CDRs

I played with (real) CDRs data, doing some exploratory data analysis / EDA and inferential (clustering), in order to find out approximate geolocation of certain phone number. 

Sometimes we're curious e.g. "who is this woman?"
Does she work in a bank, or studio? Maybe she's a bus driver? ;)



Data science + machine learning tools: Python (pandas, sklearn, matplotlib).

Call Detail Records (CDRs) is cell phone usage information collected by cell phone service providers.
Providers record every voice call or text (SMS) message exchange between two cell phone numbers. Information collected includes:

  • calls and messages: phone number, reciprocal phone number, time, duration, length of messages, etc., relational to users' identity.
  • cell towers: transceiver station CID, LAC (location area code), etc., relational to latitude and longitude data.
  • devices: IMEI, MAC, name and type of device, etc.
  • wifis: name, MAC of base station, etc.
  • ... and other data for resource provisioning and billing.

It's a comprehensive data, and it's available almost in real-time (within minutes). 

We can also do similar EDA to social media's geotagged data. 


Now imagine we have this CDRs data in which every user can produce thousand of CDRs observations per month.
Based on the time and location data, we can track approximate geolocation of a user at certain time period of the call/SMS.
With the help of unsupervised machine learning we can do clustering to location of the cell towers. Then based on time-space analysis we can evaluate approximately a user's living place and/or working place, with around several square miles accuracy.

Finding out someone's geolocation can be useful for serious case, or just for fun ;) 
(For the sake of law enforcement, judge may order service provider to disclose information, i.e. identity of a user based on the registered number. Otherwise, if we are not law officers then data science + machine learning may come to a bit of rescue.)



Disclaimers:
Sure there are so many phone trackers with GPS mobile locator out there.
But in most cases we must install an app (or spyware) into the target/victim's cell phone, or tapping.
E.g. to track lost cell phone, to track cheating spouse(s), FBI's surveillance, etc..

NO, this is NOT a kind of those trackers. I just played with a simple EDA and inference, not involving dedicated hardware, I just worked with CDRs data, because this is about "data science", not spying ;) 


Also, this is a basic machine learning article just for fun and for the purpose of our records only.


Question raised, how can we obtain CDRs data?
For real CDR data definitely it's confidential, unless a hacker steals the data, or someone leaks the data (either someone related to a telecom company or intelligence agency such as NSA). The leakage could be illegal or legal.

Examples of legal disclosure, e.g. for academic usage, or for a machine learning competition (e.g. Kaggle.com). Although usually for ethical reason any personal information would be obfuscated (anonymized) for privacy protection.

There are also some resources to generate (unreal / simulated) CDRs data, such as this gedis-studio.

Hacking parts is beyond our scope.
Let's pretend we 'magically' found the CDRs data ;))
So we shall focus only on data science + machine learning parts.

Here is a "minified" and "simplified" example. Let's play...

I can not reveal CDRs data, it should be anonymized. There are (real) phone numbers which should be hidden here, because we are civilized people, (albeit there were some educated people who blatantly shared other's phone numbers in unethical manner, e.g. for idiotic reason such as hatred. You recognized this data should know what I meant.



Python in action for the exploratory data analysis and inference:

Import necessary packages,

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from matplotlib.ticker import ScalarFormatter, FormatStrFormatter
from sklearn.cluster import KMeans


For pretty plot we use ggplot style,

matplotlib.style.use('ggplot')

Read CDRs data, print description and data types,

# read CDRs data
df = pd.read_csv('./Datasets/CDR.csv')

print df.columns
print df.describe()
print df.dtypes



Index([u'In', u'Out', u'Direction', u'CallDate', 
       u'CallTime', u'DOW', u'Duration', u'TowerID', 
       u'TowerLat', u'TowerLon'],
       dtype='object')

                 In           Out      TowerLat      TowerLon
count  5.318800e+04  5.318800e+04  53188.000000  53188.000000
mean   3.587986e+09  4.761601e+09     32.792888    -96.836699
std    2.298312e+09  2.846774e+09      0.087926      0.082759
min    1.559411e+09  1.015282e+08     32.663889    -97.036583
25%    1.884183e+09  2.288701e+09     32.721944    -96.910389
50%    2.894366e+09  4.570681e+09     32.772361    -96.840667
75%    4.638472e+09  7.216980e+09     32.870944    -96.802528
max    8.549533e+09  9.946220e+09     33.015250    -96.604444

In             int64
Out            int64
Direction     object
CallDate      object
CallTime      object
DOW           object
Duration      object
TowerID       object
TowerLat     float64
TowerLon     float64

 

We have 53188 records (observations).
Convert CallDate to datetime and CallTime to timedelta,

df['CallDate'] = pd.to_datetime(df['CallDate'], format='%Y-%m-%d')
df['CallTime'] = pd.to_timedelta(df['CallTime'])

 

Get unique phone number,

# unique phone number
phone_in = df['In'].unique().tolist()
print phone_in

[46384xxxxx, 15594xxxxx, 49315xxxxx, 24199xxxxx, 18841xxxxx,
 36880xxxxx, 45550xxxxx, 20686xxxxx, 28943xxxxx, 85495xxxxx]

 

We have 10 phone numbers. Let's do clustering with some of this phone numbers,
We are interested in phone number #9:

# filter certain user
u = 9
filt = df['In'] == phone_in[u]
user = df.loc[filt, :]
print user.shape
(7741, 10)

 

User #9 made 7741 phone calls connected to nearest cell towers.
Plot the cell towers connected to user #9:

fig = plt.figure()
ax = fig.add_subplot(111)
ax.xaxis.set_major_formatter(FormatStrFormatter('%.3f'))
ax.yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Cell Towers (all calls)')
ax.scatter(user.TowerLon, user.TowerLat, c='g', marker='o', alpha=0.2, s=80)
plt.show()



Note: the greener dot means user connected to the same tower several times more.
Notice we set nodes' transparency to alpha=0.2. 

Remember there are 7741 cell tower locations in the filtered data, many of them are overlapped. There are only around 22 unique locations actually.

Some intuitive behaviours on weekends: People do not go to work or school, big chances they are at home at late hours during weekend.


Some intuitive behaviours on weekdays: People are at work or school between morning - evening. They are at home in early morning, then commuting on the road, in the office/school during working hours, go home in evening or early night.

So, for weekend let's filter the data based on DOW (day of week).

filt = user['DOW'].isin(['Sat', 'Sun'])
user = user.loc[filt, :]
print user.shape
(2305, 10)


Because of people are at home during late night, let's filter again the data based on time.
We want to investigate locations between [10 PM till midnight], or between [midnight till 6 AM].
Note, in the code we use military time string format (24 hour), so 01:00 PM is represented as '13:00', 10:00 PM as '22:00' and so on.

filt = (user['CallTime'] > '22:00:00') | (user['CallTime'] < '06:00:00')
user = user.loc[filt, :]
print user.shape
(64, 10)


We have 64 cell tower locations connected to user #9 when in weekend at between late night till morning.

For the K-means clustering, we're only interested only in longitude and latitude features:

user = user[['TowerLat', 'TowerLon']]

 
Run K-means clustering with number of cluster K=1:

K = 1
kmeans = KMeans(n_clusters=K)
kmeans.fit(user)
centroids = kmeans.cluster_centers_
print centroids

[[32.77609108 -96.78028103]]


In this case, centroid represents center of geolocation of the cell towers.
Let's plot the cell towers' geolocation and the centroid:

fig = plt.figure()
ax = fig.add_subplot(111)
ax.xaxis.set_major_formatter(FormatStrFormatter('%.3f'))
ax.yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Weekend Calls (
10pm-6am)')
ax.scatter(user_weekend.TowerLon, user_weekend.TowerLat,
           c='g', marker='o', alpha=0.2, s=80)
ax.scatter(centroids[:,1], centroids[:,0], marker='x',
           c='red', alpha=0.8, linewidths=2, s=160)
plt.show()


The greener dot means user connected to the same tower (overlapped) several times more. Red cross is the centroid.

Note there are outliers in the plot (nodes in pale transparent green, top-left). 

About the outlier location: e.g. it can be user was so excited meeting a lovely person, so she decided to stay overnight in another romantic place ;)
Or maybe she was just too late...

With the help of Google Maps, locate the centroid of cell towers: [lat,long] = [32.77609108 -96.78028103]
We suspected user's homeplace is at around this centroid, maybe in radius of 5-50 miles, depends on the cell tower's coverage area. 







Remember, the nodes represent geolocation of cell towers connected to, NOT geolocation of the user.

Note: From here, if necessary we can analyze more deeper with the help of our "hacking" expertise to find out exact geolocation of the user.
But it's beyond scope of this article, i.e. for ethical reason among others.

Here we're going to do only intuitive experiments with Google Maps to find nearest settlement.
For example, we can find nearest apartments. In the Google Maps, click "Nearby" then enter "apartments".





 
In the same way we can also find nearest real estate, or other similar settlement or habitation. Also don't forget there's possibility user lives in an ordinary house.

As shown in the above map, looks like nearest apartments to the centroid are "Alta Farmers Market" and "Camden Farmers Market".

But we also see in Google Earth, there are houses in the neighborhood of the centroid.
 

Note: If we want to go deeper, then prepare our sophisticated investigation tools, tapper tool, etc..., don't forget wearing Sherlock Holmes hat.
Or, we may just want to hire nearest private detective.



Let's continue by doing exploratory data analysis during weekdays:
Filter based on DOW:

# weekdays:
filt = user['DOW'].isin(['Mon', 'Tue', 'Wed', 'Thr', 'Fri'])
user_weekday0 = user.loc[filt, :]
print user_weekday0.shape
(5436, 10)


Looks like our user is very busy, as she made 5436 phone calls ;)
 

We assume user is at home before 6 AM. Next, user is off to commute, this time user may show specific pattern of errands as she moves on the way from home to work or school every morning. After working hard, we assume user goes home at 5 PM.

# between 6 AM - 5 PM
filt = (user_weekday0['CallTime'] > '06:00:00') & 

       (user_weekday0['CallTime'] < '17:00:00')
user_weekday0 = user_weekday0.loc[filt, :]
print user_weekday0.shape
(3049, 10)


Do clustering with K=3, assuming there are 3 areas of concentration: home - commute - work/school.

# K-means clustering
user_weekday = user_weekday0[['TowerLat', 'TowerLon']]
K = 3
kmeans = KMeans(n_clusters=K)
kmeans.fit(user_weekday)
centroids2 = kmeans.cluster_centers_
print centroids2

[[ 32.98500948 -96.80262338]
 [ 32.77240335 -96.7789695 ]
 [ 32.93033402 -96.80360535]]


Note: In practice, there may be more than 3 areas, e.g. depends on density (crowdedness) of nearest cell tower, it can be also user moving to different buildings in the company or school/campus.


There are 3 clusters, intuitively:

  • cluster with most nodes, is workplace,
  • cluster with second most nodes, is home,
  • cluster with least nodes, is places between workplace and home (commuting).

If our intuition is correct, then cluster with least nodes must be in the morning between 6 AM till 9 AM, that's when user is commuting on the road.
Let's find out:

# commuting time is predicted as cluster with minimum number of nodes:
mincluster_idx = np.argmin(np.array([(kmeans.labels_ == i).sum() for i in range(K)]))
user_weekday_commute = user_weekday0[kmeans.labels_ == mincluster_idx]
commuting_time = user_weekday_commute['CallTime']
print 'min index:', mincluster_idx
print 'mean:', commuting_time.mean()

min index: 2
mean: 0 days 07:51:00.414270


Indeed, cluster with least nodes is happened when in commuting time, at around 7:51 AM.

Now let's predict working time, as the cluster with maximum number of nodes:

# working time is predicted as cluster with maximum number of nodes:
maxcluster_idx = np.argmax(np.array([(kmeans.labels_ == i).sum() for i in range(K)]))
user_weekday_work = user_weekday0[kmeans.labels_ == maxcluster_idx]
working_time = user_weekday_work['CallTime']
print 'max index:', maxcluster_idx
print 'mean:', working_time.mean()
print 'range:', working_time.min(), '-', working_time.max()

max index: 1
mean: 0 days 10:23:52.941197
range: 0 days 06:42:07.500972 - 0 days 16:59:57.351640


Again, our intuition is correct, cluster with most nodes represent working time, with mean at 10:23 AM, until 4:59 PM.
Note, there are some outliers in this cluster with time less than 9 AM.

From previous results, we conclude:

[[ 32.98500948 -96.80262338]  --- is centroid of weekend (home)
 [ 32.77240335 -96.7789695 ]  --- is centroid of working time
 [ 32.93033402 -96.80360535]] --- is centroid of commuting time



Plot location of cell towers connected at weekdays:

fig = plt.figure()
ax = fig.add_subplot(111)
ax.xaxis.set_major_formatter(FormatStrFormatter('%.3f'))
ax.yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Weekday Calls (6am - 5pm)')
ax.scatter(user_weekday_commute.TowerLon, user_weekday_commute.TowerLat,
           c='c', marker='^', alpha=0.2, s=80, label='workday: commute')
ax.scatter(user_weekday_work.TowerLon, user_weekday_work.TowerLat,
           c='b', marker='^', alpha=0.2, s=80, label='workday: work')
ax.legend(loc='upper left')
centroids3 = centroids2[[mincluster_idx, maxcluster_idx]]
ax.scatter(centroids3[:,1], centroids3[:,0], marker='x', c='red',
           alpha=0.8, linewidths=2, s=160)
plt.show()



Now we know approximate location of nearest cell towers when user is commuting (nodes in cyan) and working (nodes in blue).

Ask Google Maps for generous helps:






At the surrounding area of "working time" centroid, we see banks, restaurants, supermarket, studio, school, ...
Does she work in a bank, or studio? Or else..? 

One thing for sure, apparently she's NOT a bus driver ;)) 


To get illustration of whole scenario, let's plot all of the connected cell towers, both in weekdays and weekends:

fig = plt.figure()
ax = fig.add_subplot(111)
ax.xaxis.set_major_formatter(FormatStrFormatter('%.3f'))
ax.yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Weekday (6am-5pm) + Weekend (10pm-6am) Calls')
ax.scatter(user_weekend.TowerLon, user_weekend.TowerLat,
           c='g', marker='o', alpha=0.2, s=80, label='weekend: home')
ax.scatter(user_weekday_commute.TowerLon, user_weekday_commute.TowerLat,
           c='c', marker='^', alpha=0.2, s=80, label='workday: commute')
ax.scatter(user_weekday_work.TowerLon, user_weekday_work.TowerLat,
           c='b', marker='^', alpha=0.2, s=80, label='workday: work')
ax.legend(loc='upper left')
ax.scatter(centroids[:,1], centroids[:,0], marker='x', c='red',
           alpha=0.8, linewidths=2, s=160)
ax.scatter(centroids2[:,1], centroids2[:,0], marker='x', c='red',
           alpha=0.8, linewidths=2, s=160)
plt.show()



Reminder again, remember that nodes and centroids are location of (nearest) connected cell towers, not location of user.
So when doing intuitive analysis we must consider nearest settlements, nearest roads, and nearest workplaces/schools, in certain radius from the centroids.

Do we see any pattern?

  • Starting point is home (green nodes), 
  • then commuting (cyan nodes) toward North direction, 
  • and finally arriving at workplace (blue nodes).
Again Google Maps please.. Click 'Directions' between:
(1) home [32.77609108 -96.78028103] and (2) work [32.98500948, -96.80262338]




The path is between around Dawson Street and Preston Road.


We notice it's similar to our previous plot of cell towers, user is commuting toward North direction.


Note: In some other practical cases, distance between home and work places can be close enough, then likely user will rarely make phone calls when commuting.



Another question: Is the CDRs data real? Let's find out:
First, let's obtain most connected cell towers during working time:

# towers nearby workplace
print user_weekday_work.drop_duplicates(subset=['TowerLat', 'TowerLon'])[['TowerLat', 'TowerLon']]
print user_weekday_work['TowerLat'].value_counts()

        TowerLat   TowerLon
187    32.997222 -96.798889
192    32.985083 -96.802528
244    33.011361 -96.824944
303    32.966500 -96.803778
320    32.985083 -96.802556
2743   33.015250 -96.831472
39695  32.978611 -96.784444

32.985083    1978
32.966500      89
32.997222      87
33.015250      13
33.011361       2
32.978611       1
Name: TowerLat, dtype: int64


Most connected cell tower (1978 times) is at [32.997222, -96.798889].


Let's zoom in,


Do some tilts in 3D view,


So, it's indeed a real cell tower. Google Earth won't lie! Conclusion: I used real CDRs data ;))

What's about cell tower nearby home?

print user_weekend.drop_duplicates(subset=['TowerLat', 'TowerLon'])[['TowerLat', 'TowerLon']]
print user_weekend['TowerLat'].value_counts()

        TowerLat   TowerLon
157    32.772361 -96.777278
6012   32.771667 -96.800278
52127  33.015250 -96.831472

32.772361    57
32.771667     6
33.015250     1
Name: TowerLat, dtype: int64

 
Most connected cell tower (57 times) is at [32.772361, -96.777278].
Say 'please' to Google Maps:


Let's zoom in, then tilt in 3D view:

Again, it's a real cell tower. Now I believe it's a real CDRs data ;) 




End note: Do no harm, primium non cocere,
non-maleficence. 
///