Everything should be made as simple as possible, but not simpler. (Albert Einstein)

Thursday, December 29, 2016

K-means Clustering: Geolocation EDA & Inference Using Cell-phone CDRs

I played with real CDR data, doing some exploratory data analysis (EDA) and inference (clustering) to find the approximate geolocation of a given phone number.

Sometimes we're curious, e.g. "who is this woman?"
Does she work in a bank or a studio? Maybe she's a bus driver? ;)



Data science + machine learning tools: Python (pandas, sklearn, matplotlib).

Call Detail Records (CDRs) are cell phone usage records collected by cell phone service providers.
Providers record every voice call or text (SMS) message exchanged between two cell phone numbers. The information collected includes:

  • calls and messages: phone number, reciprocal phone number, time, duration, length of messages, etc., which can be linked to users' identities.
  • cell towers: transceiver station CID, LAC (location area code), etc., which can be linked to latitude and longitude data.
  • devices: IMEI, MAC, name and type of device, etc.
  • Wi-Fi: name and MAC of the base station, etc.
  • ... and other data for resource provisioning and billing.

The data is comprehensive, and it is available almost in real time (within minutes).
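To make the clustering idea concrete, here is a minimal sketch of k-means on tower coordinates. Everything in it is an assumption for illustration: the points are synthetic (two made-up locations called `home` and `work`, not real CDRs), `k = 2` is picked by hand, and the algorithm is a plain-Python Lloyd's loop with a farthest-point initialisation rather than the sklearn version used in the post. It also treats latitude and longitude as flat Euclidean coordinates, which is only a rough approximation over a small area.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two (lat, lon) pairs."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm on 2-D points; returns (centroids, labels)."""
    # Deterministic farthest-point initialisation (an illustrative choice).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels

# Synthetic tower hits around two hypothetical places (NOT real data).
rng = random.Random(42)
home, work = (-6.20, 106.80), (-6.30, 106.90)
points = [(rng.gauss(home[0], 0.005), rng.gauss(home[1], 0.005)) for _ in range(60)]
points += [(rng.gauss(work[0], 0.005), rng.gauss(work[1], 0.005)) for _ in range(40)]

centroids, labels = kmeans(points, k=2)
for c, (lat, lon) in enumerate(centroids):
    print(f"cluster {c}: lat={lat:.4f}, lon={lon:.4f}, hits={labels.count(c)}")
```

With real CDRs, each point would be the latitude and longitude of the tower that handled a call or message, and the centroid of the busiest cluster gives a rough guess at where the phone spends most of its time.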

We can also apply similar EDA to geotagged social media data.

Thursday, December 8, 2016

Probabilistic Graphical Models: Bayesian Networks Example With R, Python, SAMIAM

Here I work through the Bayesian network examples from section 3.2 of “Probabilistic Graphical Models: Principles and Techniques” by Daphne Koller (Stanford University) and Nir Friedman (Hebrew University).

In the book, the final probability values are given as they are, without deriving the formulas. Here I expand the formulas and verify the results with the help of R, Python and SamIam.

Course: PGM Specialization (MATLAB / Octave). 
Tools: R, Python, SamIam.  
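As a rough sketch of what such a verification can look like in Python, the snippet below computes queries on the student network by brute-force enumeration of the joint distribution, using the factorisation P(D, I, G, S, L) = P(D) P(I) P(G | I, D) P(S | I) P(L | G). The CPD values are my transcription of the book's student example and are not shown in the text above, so treat them as an assumption and check them against the book. The helper names `joint` and `prob` are made up for this illustration.

```python
from itertools import product

# CPDs of the student network (assumed transcription of the book's values).
P_D = {0: 0.6, 1: 0.4}                      # Difficulty
P_I = {0: 0.7, 1: 0.3}                      # Intelligence
P_G = {                                     # P(G | I, D), grades 1..3
    (0, 0): {1: 0.30, 2: 0.40, 3: 0.30},
    (0, 1): {1: 0.05, 2: 0.25, 3: 0.70},
    (1, 0): {1: 0.90, 2: 0.08, 3: 0.02},
    (1, 1): {1: 0.50, 2: 0.30, 3: 0.20},
}
P_S = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}                        # P(S | I)
P_L = {1: {0: 0.1, 1: 0.9}, 2: {0: 0.4, 1: 0.6}, 3: {0: 0.99, 1: 0.01}}  # P(L | G)

NAMES = ("D", "I", "G", "S", "L")
DOMAINS = ((0, 1), (0, 1), (1, 2, 3), (0, 1), (0, 1))

def joint(d, i, g, s, l):
    """Joint probability via the chain rule for this DAG."""
    return P_D[d] * P_I[i] * P_G[(i, d)][g] * P_S[i][s] * P_L[g][l]

def prob(query, evidence=None):
    """P(query | evidence) by summing the joint over all assignments."""
    evidence = evidence or {}
    num = den = 0.0
    for values in product(*DOMAINS):
        a = dict(zip(NAMES, values))
        if any(a[k] != v for k, v in evidence.items()):
            continue  # inconsistent with the evidence
        p = joint(*values)
        den += p
        if all(a[k] == v for k, v in query.items()):
            num += p
    return num / den

p_l1 = prob({"L": 1})                # about 0.502
p_l1_i0 = prob({"L": 1}, {"I": 0})   # about 0.389
print(round(p_l1, 3), round(p_l1_i0, 3))
```

Enumeration is fine for five small variables; for larger networks a tool such as SamIam or a proper inference library is the better choice.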

The DAG of the student BN example is: