CDR data analysis — Location approximation

Wageesha Erangi · Analytics Vidhya · Feb 6, 2021

You may wonder what CDR data analysis means. Call Detail Record (CDR) data is primarily collected by telecommunication companies to monitor their subscribers' usage. Many researchers and developers have used these data for new research projects by analyzing combinations of data sets such as call records, cell tower records, and message records. Data scientists use various libraries and tools to make the analysis process as efficient as possible, since these data are highly valuable to the industry when it comes to understanding customer behavior.

Introduction to Python data analysis libraries and tools

Python is one of the most widely used programming languages in the world when it comes to data science and data analysis. Because the language is highly readable and the number of lines needed to accomplish a task is small, it saves development time.

There are many data analysis tools based on Python. Here are some of the tools and libraries needed in the process of analyzing Call Detail Record (CDR) data.

NumPy

The NumPy (Numerical Python) package is used for scientific computing with vectorized data in Python. Mathematical and logical operations, array shape creation and manipulation, and operations related to linear algebra can all be performed with this library.

Some advantages of using NumPy instead of vanilla Python lists and objects are:

  • Lower memory usage, because NumPy arrays are fixed-size and homogeneously typed.
  • Less time spent in the data analysis process, thanks to vectorized operations on multidimensional arrays and fast data filtering (see the sketch after this list).
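As a small illustration of the vectorization advantage, the loop below and the one-line NumPy expression compute the same result; the variable names are hypothetical.

import numpy as np

durations = [120, 45, 300, 60]  # hypothetical call durations in seconds

# vanilla Python: explicit loop
total = 0
for d in durations:
    total += d

# NumPy: a vectorized sum over a fixed-size, homogeneously typed array
np_durations = np.array(durations)
total_np = np_durations.sum()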

NumPy code base: https://github.com/numpy/numpy

NumPy (PyPI): https://pypi.org/project/numpy/

The NumPy library can be installed into the Python working environment using pip:

pip install numpy

OR

pip3 install numpy

Then import it into the Python working environment:

import numpy as np

All NumPy-based analysis can then be carried out through the np alias.

np_arr = np.array(python_list)

Pandas

Pandas (Python Data Analysis Library) is a fast, powerful, flexible, and widely used open-source Python library for data manipulation and analysis.

Pandas official website: https://pandas.pydata.org/

Pandas code base: https://github.com/pandas-dev/pandas

Pandas (PyPI): https://pypi.org/project/pandas/

The pandas library can be installed into the Python 3 working environment using pip:

pip install pandas

Then import it into the Python working environment:

import pandas as pd

Pandas can read csv, tsv, xlsx, and json files and handles data using DataFrames, a built-in two-dimensional data structure. The library also provides the pandas Series, a one-dimensional array that can store data of any type.

pd.read_csv("filename") # read csv files
pd.read_json("filename") # read json files

Data objects built with vanilla Python can be converted into pandas DataFrames or Series and then used with the analysis functionality provided by the library.

pd.DataFrame(python_list)
pd.Series(python_list)

Bandicoot

Bandicoot is an open-source Python CDR data analysis toolkit that contains many built-in functions to analyze the given phone metadata. The library can also visualize user data in a dashboard built with the React frontend framework.

Bandicoot mainly focuses on analyzing individual user data, organized into individual features, spatial features, and social network features. Each category contains a set of indicators that can be computed from user objects.

Bandicoot official website: https://cpg.doc.ic.ac.uk/bandicoot/

Bandicoot code base: https://github.com/computationalprivacy/bandicoot

Bandicoot (PyPI): https://pypi.org/project/bandicoot/

The bandicoot library can be installed into the Python working environment using pip:

pip install bandicoot

Then import it into the Python working environment:

import bandicoot as bc
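As a quick sketch of the workflow described in the bandicoot documentation (the user id and file paths below are assumptions), a user object is loaded from record and antenna files, and the built-in indicators are computed on it:

U = bc.read_csv("user_id", "records/", "antennas.csv") # load one user's records
bc.utils.all(U) # compute the full set of indicators
bc.spatial.percent_at_home(U) # e.g. the fraction of interactions at the inferred home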

Usage of the libraries in the location approximation process

Common CDR data files come in formats such as txt, csv, json, and xlsx. The data analysis tools must be able to read these file types in order to process the relevant data.

Different libraries provide different functionality in the data analysis process, and by combining two or more of them we can achieve the CDR analysis functionality we need.

The NumPy library, while not specifically developed for CDR data analysis, provides many useful capabilities such as reading data from files and writing to files, converting data arrays to NumPy arrays, and handling n-dimensional arrays.

  • Opening csv files using the genfromtxt() method
np_array = np.genfromtxt("file.csv",
                         delimiter=",",
                         skip_header=1)
  • Saving as txt and csv files
np.savetxt("file.txt", np_array)
np.savetxt("file.csv", np_array)
  • Converting a list to a NumPy array
np_array = np.array(python_list)
  • Converting a NumPy array to a list
python_list = np_array.tolist()
  • Filtering the items in an array based on conditions
arr_user1 = arr[arr == user1_number]

Even though NumPy is an efficient way of handling n-dimensional arrays in Python, the library is not a good fit for arrays that contain mixed data types, for example int and str.

The pandas library, however, has built-in support for different data types in the same table. After reading a given file, the data are processed and converted into a DataFrame. Data printing, column dropping, row filtering, DataFrame concatenation, merging, slicing, and joining are other important functions provided by pandas.
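As a small illustration (the column names are hypothetical), a single DataFrame can hold mixed data types side by side, and unneeded columns can be dropped:

df = pd.DataFrame({"user": [7123456789], "antenna_id": ["A42"], "duration": [120]})
print(df.dtypes) # int64, object, and int64 columns coexist in one frame
df = df.drop(columns=["duration"]) # drop a column that is not needed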

Bandicoot can read different types of data files, clean data, handle missing data, and calculate the home location for a given user. One drawback of the library, however, is that the data files must follow a specific format; example files are provided in the bandicoot demo tutorial.
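Continuing the earlier bandicoot sketch (method names as in the bandicoot docs; the paths are assumptions), the inferred home location can be read directly off the user object:

U = bc.read_csv("user_id", "records/", "antennas.csv")
U.recompute_home() # infers home from the most frequent location during nights
print(U.home)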

Choosing proper datasets

Choosing proper data sets is an important part of the location calculation process. To approximate home and work locations, the following two data sets are essential.

Call data set

When selecting the call data set files, the user, timestamp, and antenna id columns are essential: the user's mobile number identifies the relevant user, the timestamp categorizes records relative to the user's work start and end times, and the antenna id links each record to a latitude and longitude.

[Image: an example call data set file]
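A hypothetical layout for such a file (the column names are taken from the code later in this article; the values are made up):

user,other,direction,duration,cost,timestamp,antenna_id
7123456789,7098765432,out,120,5,2021-02-01 08:15:00,A42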

Antenna data set

When selecting the antenna data set files, the latitude and longitude columns are essential, as the location is calculated from them. The antenna id should also be included in the file, as each call record is associated with an antenna id.

[Image: an example antenna data set file]
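A hypothetical layout for the antenna file (again, the column names come from the code below and the values are made up):

antenna_id,latitude,longitude
A42,6.9271,79.8612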

Approximate the home and work location of a user

The following steps can be used to calculate the approximate home and work locations of a selected user with pandas.

Get the call and antenna data frames by reading the csv files with the read_csv() method, parsing the timestamp column into datetimes along the way:

call_df = pd.read_csv("calls.csv", parse_dates=["timestamp"])
antenna_df = pd.read_csv("antennas.csv")

Filter the data frames down to the relevant columns by dropping unneeded ones with the drop(list_of_col_names) method.

c_drop = ["direction", "duration", "cost"]
call_df.drop(c_drop, inplace=True, axis=1)

Define the user contact number and the work start and end times for the user. The home and work locations are calculated based on this time period.

user1_no = 7123456789
work_start_time = 7 #7am
work_end_time = 19 #7pm

In the calculation, the user is assumed to be at work when work_start_time <= hour < work_end_time on a weekday; otherwise the user is assumed to be at home.

Filter the rows of the call data frame by the selected user's contact number and collect all records related to the user into user1_call_records_df.

filter1 = call_df[call_df['user'] == user1_no]
filter2 = call_df[call_df['other'] == user1_no]
user1_call_records_df = pd.concat([filter1, filter2])

Merge the call and antenna dataframes with an inner join on the shared antenna_id column, so that each call record gains the latitude and longitude of its antenna.

call_antenna_df = pd.merge(user1_call_records_df, antenna_df, on='antenna_id')

Check the timestamp against the given work start and end times and return True if the call record falls within the working period, and False otherwise. The method may vary based on the format of the timestamp column in the csv file.

def check_timestamp_for_work(timestamp):
    # weekday() returns 0-4 for Monday-Friday and 5-6 for the weekend
    if timestamp.weekday() >= 5:
        return False
    # at work when work_start_time <= hour < work_end_time
    return work_start_time <= timestamp.hour < work_end_time

The home and work locations are calculated from the call_antenna_df dataframe. The following functions find the geographic location with the maximum record count, separately for home and for work, and return it as an array of the form [latitude, longitude].

def compute_home_location(call_antenna_df):
    # count call records per antenna location outside working hours
    location_dict = {}
    for record in call_antenna_df.itertuples():
        at_home = not check_timestamp_for_work(record.timestamp)
        if at_home:
            location = str(record.latitude) + "," + str(record.longitude)
            location_dict[location] = location_dict.get(location, 0) + 1
    if len(location_dict) > 0:
        # the most frequent location is taken as the home approximation
        latitude, longitude = map(float, max(location_dict,
                                             key=location_dict.get).split(','))
        return [latitude, longitude]
    return []

def compute_work_location(call_antenna_df):
    # count call records per antenna location inside working hours
    location_dict = {}
    for record in call_antenna_df.itertuples():
        at_work = check_timestamp_for_work(record.timestamp)
        if at_work:
            location = str(record.latitude) + "," + str(record.longitude)
            location_dict[location] = location_dict.get(location, 0) + 1
    if len(location_dict) > 0:
        # the most frequent location is taken as the work approximation
        latitude, longitude = map(float, max(location_dict,
                                             key=location_dict.get).split(','))
        return [latitude, longitude]
    return []

home_location = compute_home_location(call_antenna_df)
work_location = compute_work_location(call_antenna_df)

The accuracy of the approximated locations increases with the number of records available for the selected user.

If the data set is large and the calculation takes a long time, NumPy-style vectorized operations can help reduce the processing time.
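As a hedged sketch of that idea, the per-record loop above can be replaced with vectorized pandas operations on call_antenna_df (this assumes the timestamp column was parsed into datetimes and that at least one record falls on each side of the mask):

mask_work = ((call_antenna_df['timestamp'].dt.weekday < 5)
             & (call_antenna_df['timestamp'].dt.hour >= work_start_time)
             & (call_antenna_df['timestamp'].dt.hour < work_end_time))
# pick the (latitude, longitude) pair with the highest record count on each side
work_location = list(call_antenna_df[mask_work]
                     .groupby(['latitude', 'longitude']).size().idxmax())
home_location = list(call_antenna_df[~mask_work]
                     .groupby(['latitude', 'longitude']).size().idxmax())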

Now you can code with the pandas and bandicoot libraries and analyze the data files to find various user patterns, such as user trips and the population around antennas. Moreover, you can use Python data visualization libraries like Matplotlib and Folium to visualize the structured data.
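For example, a minimal Folium sketch (assuming the home_location and work_location lists computed above) puts both approximated locations on an interactive map:

import folium

m = folium.Map(location=home_location, zoom_start=13)
folium.Marker(home_location, tooltip="home").add_to(m)
folium.Marker(work_location, tooltip="work").add_to(m)
m.save("locations.html") # open the saved file in a browser to inspect the map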
