CDR data analysis — Location approximation
You may wonder what CDR data analysis means. Call Detail Record (CDR) data is originally collected by telecommunication companies to monitor customer usage. Researchers and developers have since applied these data to many new research projects by analyzing combinations of data sets such as call records, cell tower records, and message records. Data scientists use various libraries and tools to analyze these data efficiently, and the results are highly valuable to the industry for understanding customer behavior.
Introduction to Python data analysis libraries and tools
Python is one of the most widely used programming languages in the world for data science and data analysis. Because the language is highly readable and the number of lines needed to accomplish a task is small, it saves time.
There are many data analysis tools based on Python. The following tools and libraries are needed in the process of analyzing Call Detail Record (CDR) data.
Numpy
The NumPy (Numerical Python) package is used for scientific computing with vectorized data in Python. Mathematical and logical operations, array creation and shape manipulation, and operations related to linear algebra can all be achieved with this library.
Some advantages of using NumPy instead of vanilla Python lists and objects are:
- Lower memory usage, because NumPy arrays are fixed-size and homogeneously typed.
- Less time spent in the data analysis process, thanks to vectorized operations, multidimensional array operations, and fast data filtering.
Numpy code base : https://github.com/numpy/numpy
Numpy (PyPI) : https://pypi.org/project/numpy/
The NumPy library needs to be installed into the Python working environment using pip:
pip install numpy
OR
pip3 install numpy
Then it has to be imported to the python working environment.
import numpy as np
All NumPy-based analysis can then be carried out through the "np" alias.
np_arr = np.array(list)
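As a minimal sketch of the vectorized operations mentioned above (the call-duration values below are hypothetical, for illustration only):

```python
import numpy as np

# Hypothetical call durations in seconds, for illustration only
durations = np.array([30, 420, 15, 600, 95])

# Vectorized arithmetic: convert every duration to minutes in one step
minutes = durations / 60

# Boolean-mask filtering: keep only calls longer than one minute
long_calls = durations[durations > 60]
print(long_calls)  # [420 600  95]
```

Both operations run in compiled code over the whole array at once, which is what makes NumPy faster than looping over a plain Python list.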
Pandas
Pandas (Python Data Analysis) is a fast, powerful, flexible, and widely used open-source data analysis and manipulation library for Python.
Pandas official website: https://pandas.pydata.org/
Pandas code base : https://github.com/pandas-dev/pandas
Pandas (PyPI) : https://pypi.org/project/pandas/
The Pandas library needs to be installed into the Python 3 working environment using pip:
pip install pandas
Then it has to be imported to the python working environment.
import pandas as pd
Pandas can read csv, tsv, xlsx, and json files and handle the data using DataFrames, a built-in two-dimensional data structure. The library also provides the Pandas Series, a one-dimensional array that can store data of any type.
pd.read_csv("filename") # read csv files
pd.read_json("filename") # read json files
Data objects built with vanilla Python can be converted into Pandas DataFrames or Series and then used with the analysis functionality provided by the library.
pd.DataFrame(list)
pd.Series(list)
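For example, a minimal sketch of converting plain Python data into Pandas structures (the records and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical CDR-style records; column names are illustrative
records = [
    ["7123456789", "2020-01-01 08:15:00", 101],
    ["7123456789", "2020-01-01 20:40:00", 102],
]
call_df = pd.DataFrame(records, columns=["user", "timestamp", "antenna_id"])

# A Series is a single one-dimensional labeled array
users = pd.Series(["7123456789", "7987654321"])

print(call_df.shape)  # (2, 3)
```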
Bandicoot
Bandicoot is an open-source Python CDR data analysis tool that contains many built-in functions for analyzing phone metadata. The library can also visualize user data in a user interface built with the React frontend framework.
Bandicoot mainly focuses on analyzing individual user data, categorized into individual features, spatial features, and social networks. Each category contains a set of functions that can be accessed through user objects.
Bandicoot Official website : https://cpg.doc.ic.ac.uk/bandicoot/
Bandicoot Code base : https://github.com/computationalprivacy/bandicoot
Bandicoot (PyPI) : https://pypi.org/project/bandicoot/
The Bandicoot library needs to be installed into the Python working environment using pip:
pip install bandicoot
Then it has to be imported to the python working environment.
import bandicoot as bc
Usage of the libraries in the location approximation process
Common CDR data files come in formats such as txt, csv, json, and xlsx. The data analysis tools must be able to read these file types in order to process the relevant data.
Different libraries provide different functionality for the analysis process, and by combining the functionality of two or more libraries we can achieve the required CDR analysis.
Although the NumPy library was not developed specifically for CDR data analysis, it provides many useful functions, such as reading data from files and writing to files, converting data arrays to NumPy arrays, and handling n-dimensional arrays.
- Opening csv files using the genfromtxt() method
np_array = np.genfromtxt("file.csv",
                         delimiter=",",
                         skip_header=1)
- Save as txt and csv files
np.savetxt("file.txt", np_array)
np.savetxt("file.csv", np_array)
- Convert list to numpy array
np_array = np.array(python_list)
- Convert numpy array to list
python_list = np_array.tolist()
- Filtering the items in the array based on conditions.
arr_user1 = arr[arr == user1_number]
Even though NumPy is an efficient way of handling n-dimensional arrays in Python, the library is not a good solution for arrays that contain mixed data types, for example int and str.
The Pandas library, however, has built-in functionality for handling different data types in the same table. After reading a given file, the data is processed and converted into a DataFrame. In addition, data printing, column dropping, row filtering, DataFrame appending, merging, slicing, and joining are other important functions provided by Pandas.
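A brief sketch of those operations on made-up records (all values are illustrative):

```python
import pandas as pd

# Made-up call records mixing int and str columns in one frame
df = pd.DataFrame({
    "user": ["711", "722", "711"],
    "duration": [30, 45, 60],
    "direction": ["in", "out", "in"],
})

dropped = df.drop(columns=["direction"])    # column dropping
filtered = df[df["user"] == "711"]          # row filtering
combined = pd.concat([filtered, filtered])  # appending two frames
print(len(filtered), len(combined))  # 2 4
```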
Bandicoot can read different types of data files, clean data, handle missing data, and calculate the home location for a given user. One drawback of the library, however, is that the data files must be in a specific format; example data files are provided in the Bandicoot demo tutorial.
Choosing proper datasets
Choosing proper data sets is an important part of the location calculation process. To approximate home and work locations, the following two data sets are essential.
Call data set
When selecting call data files, the user, timestamp, and antenna id columns are essential: the user's mobile number identifies the relevant user, the timestamp categorizes records relative to the user's work start and end times, and the antenna id links each record to the latitude and longitude of the antenna.
Antenna data set
When selecting antenna data files, the latitude and longitude columns are essential, as the location is calculated from them. Moreover, the antenna id should be included in the data file, as each call record is associated with an antenna id.
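To make the expected columns concrete, here is a minimal sketch of the two data sets (all values and column names are hypothetical; real CDR files may differ):

```python
import pandas as pd
from io import StringIO

# Hypothetical call data set: user, other party, timestamp, antenna id
calls_csv = StringIO(
    "user,other,timestamp,antenna_id\n"
    "7123456789,7987654321,2020-01-06 08:15:00,101\n"
)

# Hypothetical antenna data set: antenna id, latitude, longitude
antennas_csv = StringIO(
    "antenna_id,latitude,longitude\n"
    "101,6.9271,79.8612\n"
)

call_df = pd.read_csv(calls_csv)
antenna_df = pd.read_csv(antennas_csv)
print(call_df.columns.tolist())  # ['user', 'other', 'timestamp', 'antenna_id']
```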
Approximate the home and work location of a user
The following methods can be used to calculate the approximate home and work location of a selected user using Pandas.
Get the call and antenna data frames by reading the csv files using the read_csv() method.
call_df = pd.read_csv("calls.csv")
antenna_df = pd.read_csv("antennas.csv")
Filter the call data frame by dropping the unneeded columns with the drop(list_of_col_names) method.
c_drop = ["direction", "duration", "cost"]
call_df.drop(c_drop, inplace=True, axis=1)
Define the user contact number, the work start time and the work end time for the user. The home and work locations are calculated based on the defined time period.
user1_no = 7123456789
work_start_time = 7 #7am
work_end_time = 19 #7pm
In the calculation, the user is considered to be at work when work_start_time <= timestamp hour < work_end_time on a weekday; otherwise the user is considered to be at home.
Filter the rows of the call data frame based on the selected user contact number and get all the records related to the user as user1_call_records_df.
filter1 = call_df[call_df['user'] == user1_no]
filter2 = call_df[call_df['other'] == user1_no]
user1_call_records_df = pd.concat([filter1, filter2])
Join the call and antenna dataframes by taking the inner join of the two dataframe objects on the shared antenna_id column.
call_antenna_df = user1_call_records_df.merge(antenna_df, on='antenna_id', how='inner')
Check the timestamp against the given work start and end times and return True if the call record falls inside the working time period, otherwise False. The method may vary based on the format of the timestamp column in the csv file.
def check_timestamp_for_work(timestamp):
    # isoweekday(): Monday = 1 ... Sunday = 7, so > 5 means weekend
    if timestamp.isoweekday() > 5:
        return False
    return work_start_time <= timestamp.hour < work_end_time
The home and work locations are calculated using the call_antenna_df dataframe. The following functions count records per geographic location, separately for home and work hours, and return the location with the maximum record count as an array in the form [latitude, longitude].
def compute_home_location(call_antenna_df):
    location_dict = {}
    for record in call_antenna_df.itertuples():
        at_home = not check_timestamp_for_work(record.timestamp)
        if at_home:
            location = str(record.latitude) + "," + str(record.longitude)
            if location in location_dict:
                location_dict[location] += 1
            else:
                location_dict[location] = 1
    if len(location_dict) > 0:
        latitude, longitude = map(float, max(location_dict,
                                             key=location_dict.get).split(','))
        return [latitude, longitude]
    return []
def compute_work_location(call_antenna_df):
    location_dict = {}
    for record in call_antenna_df.itertuples():
        at_work = check_timestamp_for_work(record.timestamp)
        if at_work:
            location = str(record.latitude) + "," + str(record.longitude)
            if location in location_dict:
                location_dict[location] += 1
            else:
                location_dict[location] = 1
    if len(location_dict) > 0:
        latitude, longitude = map(float, max(location_dict,
                                             key=location_dict.get).split(','))
        return [latitude, longitude]
    return []
home_location = compute_home_location(call_antenna_df)
work_location = compute_work_location(call_antenna_df)
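The whole walkthrough can be exercised end to end on a tiny in-memory data set (all coordinates and timestamps below are made up). This sketch uses a groupby-based shortcut that is equivalent to the per-location counting done above:

```python
import pandas as pd

work_start_time = 7   # 7am
work_end_time = 19    # 7pm

def check_timestamp_for_work(ts):
    # isoweekday(): Monday = 1 ... Sunday = 7, so > 5 means weekend
    if ts.isoweekday() > 5:
        return False
    return work_start_time <= ts.hour < work_end_time

# Made-up records: two working-hour calls at antenna 1, one evening call at antenna 2
calls = pd.DataFrame({
    "antenna_id": [1, 1, 2],
    "timestamp": pd.to_datetime([
        "2020-01-06 09:00",  # Monday, working hours
        "2020-01-06 10:00",  # Monday, working hours
        "2020-01-06 21:00",  # Monday, evening
    ]),
})
antennas = pd.DataFrame({
    "antenna_id": [1, 2],
    "latitude": [6.90, 6.95],
    "longitude": [79.85, 79.90],
})

# Inner join on antenna_id attaches coordinates to each call record
merged = calls.merge(antennas, on="antenna_id", how="inner")

# Most frequent location during and outside working hours
at_work = merged["timestamp"].apply(check_timestamp_for_work)
work_location = merged[at_work].groupby(["latitude", "longitude"]).size().idxmax()
home_location = merged[~at_work].groupby(["latitude", "longitude"]).size().idxmax()
print(work_location)  # (6.9, 79.85)
print(home_location)  # (6.95, 79.9)
```

With only three records the result is trivial, but the same flow scales to a full CDR file: the location seen most often during working hours is taken as the workplace, and the one seen most often outside them as the home.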
The accuracy of the computed locations increases with the number of records in the data file for the selected user.
If the data set is large and the calculation takes too long, the NumPy library can help minimize the processing time.
Now you can code with the Pandas and Bandicoot libraries and analyze data files to find various user patterns, such as user trips and the population around antennas. Moreover, you can use Python data visualization libraries such as Matplotlib and Folium to visualize the structured data.