Cross-company customer churn prediction in telecommunication: A comparison of data transformation methods

Cross-Company Churn Prediction (CCCP) is a research domain in which a company (the target) that lacks sufficient data uses data from another company (the source) to predict customer churn successfully. To support CCCP, the cross-company data is usually transformed to a distribution similar to that of the target company data prior to building a CCCP model. However, it is still unclear which data transformation method is most effective for CCCP. Moreover, the impact of data transformation methods on CCCP model performance under different classifiers has not been comprehensively explored in the telecommunication sector. In this study, we devised a model for CCCP using data transformation methods (i.e., log, z-score, rank, and box-cox) and not only presented an extensive comparison to validate the impact of these transformation methods on CCCP, but also evaluated the performance of the underlying baseline classifiers (i.e., Naive Bayes (NB), K-Nearest Neighbour (KNN), Gradient Boosted Tree (GBT), Single Rule Induction (SRI), and Deep learner Neural net (DP)) for customer churn prediction in the telecommunication sector using the above-mentioned data transformation methods. We performed experiments on publicly available datasets related to the telecommunication sector. The results demonstrated that most of the data transformation methods (e.g., log, rank, and box-cox) improve the performance of CCCP significantly. However, the z-score data transformation method could not achieve better results than the rest of the data transformation methods in this study. Moreover, we found that the CCCP model based on NB outperforms the others on transformed data, DP, KNN, and GBT performed on average, while the SRI classifier did not show significant results in terms of the commonly used evaluation measures (i.e., probability of detection, probability of false alarm, area under the curve, and g-mean).

DT methods, when applied to both source and target company data, have the potential to mitigate the aforementioned issues. DT methods can also improve prediction performance, as observed in our previous work (Adnan et al., 2018), in which we applied only two DT methods (i.e., rank and log), a single classifier (Naïve Bayes), and WCCP. This study extends that work into a more comprehensive study by applying four DT methods, namely log, rank, box-cox, and z-score, and multiple state-of-the-art classifiers (i.e., KNN, NB, GBT, SRI, and DP), producing multiple models through different iterations of applying the classifiers to the datasets obtained from the log, rank, box-cox, and z-score transformations. Accordingly, we study the following research questions: RQ1: What is the effect of DT methods (i.e., Log, Rank, Box-Cox, and Z-Score) on data normality in CCCP?
RQ2: What impact do the DT methods have on the performance of different classifiers?
RQ3: Does the application of different DT methods exhibit significant performance differences?
The rest of the paper is organized as follows: Section 2 presents customer churn, CCP modeling, and the four studied transformation methods. The evaluation measures and evaluation setup of the proposed study are presented in Section 3. The results and comparison are explained in Section 4. The contribution to existing knowledge, implications for practice, future directions, and threats to the validity of the proposed work are discussed in Section 5. We conclude the study in Section 6.

Customer churn and CCP modeling
Customer churn is one of the most common applications of ML in the service-based industry, in particular the telecommunication sector. Due to intense competition and the high costs associated with new customer acquisition compared to retention of existing customers (Liu, Guo, & Lee, 2011), many competitive businesses (with telecommunication at the top) are developing Customer Churn Prediction (CCP) and retention strategies using ML (Culbert, Fu, Brownlow, & Chu, 2018). While there is no standard accepted definition of customer churn, it is generally considered the contract termination (financial relationship) between a customer and a business (Popovic & Basic, 2009). CCP modeling refers to the statistical and computational procedures used to derive hidden customer behavior from the CRM. The main purpose of CCP modeling is to prioritize the relationships and marketing strategies that entice the riskiest customers to stay with the company or keep using the service or package (Charles, Guandong, James, & Bin, 2017; Culbert et al., 2018).
CCP is an important research area for business and marketing, helping to automate the prevention of churn by identifying customers who are likely to leave. It is important for businesses to retain existing customers for the following reasons (De Caigny et al., 2018; Deng, Lu, Wei, & Zhang, 2010): (i) the best companies focus on their customers' needs instead of acquiring new customers, because they rely on long-term relationships with their existing (loyal) customers; (ii) churning customers influence other people within their social network to adopt the same churn behavior; and (iii) loyal customers both maximize profit and minimize cost, because loyal customers buy more and market the company through positive word of mouth in their social networks.
CCP models are usually evaluated solely on their predictive performance, i.e., the models' ability to correctly identify customer churns and non-churns separately and accurately (De Caigny et al., 2018). Recent studies in the field have focused on class imbalance techniques, namely SMOTE (Chawla, Bowyer, Hall, & Kegelmeyer, 2002), MTDF (D.-C. Li, Wu, Tsai, & Lina, 2007), ICOTE (Ai et al., 2015), MWMOTE (Barua, Islam, Yao, & Murase, 2014), ADASYN (H. He, Bai, Garcia, & Li, 2008), TRkNN (M. Tsai & Yu, 2016), and Cube (Deville & Tille, 2004), for improving the prediction performance of CCP in the telecommunication sector, because research has identified a bias towards the majority class (Culbert et al., 2018; Burez & Van den Poel, 2009). A number of studies have focused on the comprehensibility of the CCP model for customer retention management, where they try to understand why customers are churning (De Bock & Van den Poel, 2012; De Caigny et al., 2018; Verbeke et al., 2011). It is clear from the above discussion that CCP is a serious problem for competitive sectors. Therefore, the research community has focused on this problem and proposed various studies in several important domains. However, these proposed studies are mostly for WCCP; little or no attention has been given to handling the customer churn problem by utilizing the advantages of CCCP. The contributions of this paper are relevant to both practitioners and academia, and to the best of our knowledge, this paper is the first study that deeply articulates CCCP in the telecommunication domain using DT methods. The next subsection provides detail on the WCCP and CCCP concepts.

WCCP and CCCP
WCCP: within-company customer churn prediction allows competitive companies to take advantage of conventional early churn prediction techniques, and it can be more cost-effective because CCP models are built with within-company data, i.e., data available in the local CRM (Peters et al., 2013). In WCCP, the dataset comprises multiple independent variables and one dependent variable (i.e., the class attribute, sometimes called the prediction or target attribute). The dependent variable holds labels or values which indicate whether customers are churn or non-churn. The predictive model is trained on the local data of the company using one or more state-of-the-art classification algorithms (Tim, Jeremy, & Art, 2007); a new unseen instance is then assigned a label of churn or non-churn by the trained classifier. However, a major limitation of WCCP arises when data is lacking, for companies that are either newly established or do not have historical data for some other reason. This lack of data hinders the building/training of the prediction model (Poon et al., 2017). To handle this limitation of WCCP, the research community has introduced cross-company concepts.

CCCP:
When data can be shared between companies, a mature company's data is used as the training set, while the data of a company that lacks historical data is used as the test set (Poon et al., 2017). In CCCP, the target data is acquired from a company that possesses little or no historical data or is newly established, while the source data is obtained from a mature company that has enough historical customer data. CCCP has an advantage over WCCP in that the new company may benefit from the experience gained by the older company. However, it has also been reported that the performance of CCCP was either very poor or inconclusive (Kitchenham, Emilia, Guilherme, & Travassos, 2007). Similarly, according to Poon et al. (2017), cross-company predictions have mostly shown poor final results. Additionally, Zhang et al. (2013) reported that cross-company prediction faces a great challenge in dealing with the heterogeneity between source and target company data. One major issue is dataset shift between the source and target companies, which creates a problem during the training and testing process of ML techniques (Turhan, Tosun Mısırlı, & Bener, 2013).
Recently, various studies (Amin, Al-Obeidat et al., 2017; Li et al., 2017) successfully implemented the cross-company concept where the target company was unable to provide sufficient data for the ML classifier's training phase. Z. He, Fengdi, Ye, Mingshu, and Qing (2012) reported that finding the most suitable training data, showing the same patterns as the target company, can yield more suitable results. Additionally, Zhang, Audris, Iman, and Ying (2014) reported that if DT is applied to both the training and target datasets, it can handle the heterogeneity between the shared companies' datasets. In the literature, researchers have widely used rank and log transformations to improve classifier performance on the source and target companies' datasets (Zhang et al., 2017). Therefore, in this study, we present an exploratory work to observe whether using multiple different DT methods, which retain distinct properties, is beneficial to CCCP. The next subsection provides detail on these DT methods.

DT methods
Different DT methods retain the information of the original dataset from different perspectives, particularly in the cross-company setting (Zhang et al., 2017). Therefore, we have used different DT methods to implement the CCCP concept using multiple classifiers on telecommunication company data.
Log data transformation: a mathematical procedure that computes the natural logarithm of all attributes in the source and target datasets. It is one of the most widely used DT methods in software defect prediction (Tim, Alex, & Justin, 2007). Due to its ease of use, the log DT method is very popular, which has also made it vulnerable to misuse; its common use has led to incorrect interpretation, even by experts. For instance, the two-sample t-test is commonly applied to compare the means of two independent samples of (approximately normally distributed) data, but many researchers take this critical assumption for granted without verifying it. Proper use of the log DT method transforms skewed data so that it follows a normal or near-normal distribution, which improves data normality (Feng et al., 2014). The literature reveals that the log DT method is widely applied in developing cross-company prediction models (Tim, Alex et al., 2007). This DT method is restricted to transforming only numerical values higher than zero (0), due to the limitation of the ln(x) function. In case a data variable contains zero, a constant is commonly added, as in the ln(x + 1) function. Another solution for dealing with zeros is to replace all zero values by 0.00001. The often-used log DT method can then be expressed as:

x' = ln(x) (1)

where x is the value of any attribute in a given dataset.
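As an illustration, the zero handling and the log DT method described above can be sketched as follows (a minimal example assuming NumPy; the study itself does not prescribe a toolkit for this step):

```python
import numpy as np

def log_transform(values, shift=1.0):
    """Apply the natural-log DT method to one attribute.

    ln(x) is undefined for x <= 0, so the ln(x + 1) workaround
    mentioned above (adding a constant) is used here.
    """
    x = np.asarray(values, dtype=float)
    return np.log(x + shift)

# Zero values are handled by the +1 shift: ln(0 + 1) = 0.
transformed = log_transform([0.0, 1.0, 9.0, 99.0])
```

With shift set to 0, one would instead replace zeros by 0.00001 before taking the logarithm, which is the alternative mentioned in the text.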
Rank data transformation: it replaces the original values of any variable in a given dataset by their statistically calculated rank values. This DT method is recommended for data values with heavy-tailed distributions or high kurtosis (Bishara & Hittner, 2015). It is observed in the prediction modeling literature that the rank DT method can improve the performance of ML classifiers (e.g., the Naïve Bayes algorithm; Zhang et al., 2017). Furthermore, it has been successfully used to mitigate the heterogeneity of source data across companies in cross-company prediction models (Zhang et al., 2014). In this study, we followed Zhang et al. (2017) to transform the original values of each attribute in a given dataset into ten (10) ranks, using every 10th percentile of the corresponding attribute's values:

rank(x) = k, if Q(k-1) < x <= Q(k), k = 1, ..., 10 (2)

where Q(k) is the (k × 10)th percentile of the corresponding metric in the combination (union) of the source and target companies' data, and Q(0) and Q(10) are taken as -∞ and +∞, respectively.

Box-Cox data transformation: it belongs to the family of power transformation methods. As with the incorrect interpretation of the log DT method discussed above, the Box-Cox DT method can be misused by applying it without being aware of the proportionality assumption that is required for the core inference (Feng et al., 2014). The Box-Cox DT method can be mathematically expressed as:

x' = (x^λ - 1) / λ if λ ≠ 0; x' = ln(x) if λ = 0 (3)

where λ is the configuration parameter and x is the value of any attribute in a given dataset.
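The two schemes above can be sketched as follows (a hedged example assuming NumPy and SciPy; pooling the companies' data for the percentile cut points follows the description in the text, while the maximum-likelihood fit of λ is SciPy's default, not necessarily the study's exact procedure):

```python
import numpy as np
from scipy import stats

def rank_transform(source, target):
    """Map each value to one of ten ranks using percentile cut points
    computed on the union (pool) of source and target data, so that
    both companies share one rank scale (Q_0 = -inf, Q_10 = +inf)."""
    pooled = np.concatenate([source, target])
    cuts = np.percentile(pooled, np.arange(10, 100, 10))  # Q_1 .. Q_9
    # rank k is assigned when Q_{k-1} < x <= Q_k
    rank = lambda a: np.searchsorted(cuts, a, side="left") + 1
    return rank(np.asarray(source)), rank(np.asarray(target))

src = np.arange(1.0, 101.0)           # toy source attribute
tgt = np.array([5.0, 50.0, 95.0])     # toy target attribute
ranks_src, ranks_tgt = rank_transform(src, tgt)

# Box-Cox via SciPy's maximum-likelihood fit; data must be positive.
x = np.array([1.0, 2.0, 5.0, 10.0])
y, lam = stats.boxcox(x)   # lam is the fitted lambda parameter
```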
This method is usually applied for data transformation of the attributes that used a power law distribution. Thus, this DT method is suggested to use for increasing the variance of homogeneity to improve the precision of estimation resulting in the simplifying the proposed models as well (Zhang et al., 2017).
Z-score data transformation: this DT method for normalizing data is a familiar statistical transformation used in various domains, including neuroimaging, psychology, and pattern recognition, among others. The z-score DT process can easily be applied to larger and more complex datasets (Cheadle, Vawter, Freed, & Becker, 2003). The z-score indicates how many standard deviations a sample in a dataset lies above or below the mean. The formula is given below:

Z-Score = (x - sample mean) / sample standard deviation (4)

where x is the value of an attribute to be transformed by Eq. (4).
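Eq. (4) is straightforward to sketch (assuming NumPy; whether the sample or population standard deviation was used is not stated in the text, so the sample version, ddof=1, is assumed here):

```python
import numpy as np

def z_score(values):
    """Standardize one attribute: (x - mean) / standard deviation,
    i.e. the z-score DT method of Eq. (4)."""
    x = np.asarray(values, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)  # sample standard deviation

z = z_score([2.0, 4.0, 6.0, 8.0])
```

After the transformation, the attribute has mean 0 and unit sample standard deviation by construction.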

Empirical setup
In this section, we describe the subject datasets, then present the proposed CCCP model for this study, followed by the state-of-the-art evaluation measures.

Subject datasets
In this study, we chose two publicly available datasets, which are widely used for the CCP problem in the telecommunication sector. These publicly available datasets were chosen so that the proposed study may be reproduced and verified by the research community in the same domain. In addition, they allow a simple comparison with other existing studies. Table 1 describes the details of the subject datasets.

Data preparation
We consider the subject datasets (described in Table 1), where dataset-1 is used as the target company dataset and dataset-2 is used as the source company dataset, because dataset-1 has fewer samples and attributes than the source company dataset. The source company dataset contains 15,760 non-churn customers and 2,240 churn customers, while the target company dataset contains 2,850 non-churn customers and 483 churn customers. During the data preparation phase, we applied the following important steps:
- Ignored all attributes for which the corresponding attribute in the source or target company dataset is not available.
- Ignored all attributes with unique values in the subject datasets.
- To handle naming conflicts between the attributes of the subject datasets, we assigned arbitrary titles to all the mapped attributes, because in cross-company datasets different attribute names might be used for the same dimension (Rahm & Do, 2000).
- To avoid redundancy between the attributes of the source and target company datasets, Zhang et al. (2017) applied Spearman's statistical test to measure the correlation of attributes. Similarly, we measured the correlation between the attributes of the subject datasets using Spearman's correlation (Sheskin, 2007), which is robust to outliers and preferred in the presence of ties. We retained only one attribute from each pair of strongly correlated attributes, following the suggestion provided in our previous study (Adnan et al., 2018). For example, in the source dataset, attributes such as Avg_Mins with Avg_Mins_local, and Avg_Call with Avg_call_local, showed a strong correlation with each other. Therefore, we did not consider the Avg_Mins and Avg_Call attributes.
- Manually included derived attributes in the target dataset, based on domain expert opinion, which match attributes in the source dataset. For example, the attributes Avg_mins_intran and Avg_call_intran exist in the source company dataset, but no similar attributes exist in the target company dataset. In such a situation, if either attribute contains a value, we consider the customer to have an international plan, which corresponds to the attribute Intl_Plan in the target dataset.
- The continuous-valued attributes were discretized, as this can significantly improve classification performance in ML. We used the approximate equal-frequency discretization technique for converting continuous values into discrete values, because this method is easy to implement and has almost linear computational complexity on both small and large datasets (Jiang et al., 2009). The final list of selected attributes, with titles, is provided in Table 2.
Table 2 reflects the list of finally selected attributes.
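The correlation filtering and equal-frequency discretization steps above can be sketched as follows (hypothetical data; the attribute names mirror the Avg_Mins example in the text, and NumPy/SciPy are assumed rather than the study's actual toolkit):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical attribute columns mirroring the Avg_Mins example.
avg_mins = np.random.default_rng(0).normal(300.0, 50.0, 1000)
avg_mins_local = avg_mins * 0.8 + np.random.default_rng(1).normal(0.0, 5.0, 1000)

# A strong Spearman correlation (|rho| near 1) suggests keeping only
# one attribute of the pair, as done for Avg_Mins / Avg_Mins_local.
rho, p = spearmanr(avg_mins, avg_mins_local)

def equal_frequency_bins(values, n_bins=5):
    """Approximate equal-frequency discretization: each bin receives
    roughly the same number of samples."""
    cuts = np.percentile(values, np.linspace(0, 100, n_bins + 1)[1:-1])
    return np.searchsorted(cuts, values, side="right")

bins = equal_frequency_bins(avg_mins, n_bins=5)
```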

Evaluation measures
The confusion matrix is usually applied to evaluate the performance of binary classifiers and prediction models. It contains the following measures: (i) True Positive (TP): a customer churn that has been correctly predicted; (ii) True Negative (TN): a customer non-churn that has been correctly predicted; (iii) False Positive (FP): a customer non-churn that has been incorrectly predicted as churn; and (iv) False Negative (FN): a customer churn that has been incorrectly predicted as non-churn. The computed values of the confusion matrix are further used in the base evaluation measures (probability of detection, probability of false alarm, area under the curve, and G-mean) selected for this study.
Since the subject datasets are unbalanced in terms of customer churn and non-churn, following the suggestions in (Tim, Alex et al., 2007), we did not use the common accuracy and precision evaluation measures. In this study, we used the following performance metrics for evaluating the performance of the classifiers.
- Probability of Detection (POD): the proportion of actual churners that are correctly predicted, i.e., POD = TP / (TP + FN). The higher the POD, the better the model detects churners.
- Probability of False Alarm (POF): the proportion of non-churners that are incorrectly predicted as churners, i.e., POF = FP / (FP + TN). The lower the POF, the better.
- Area Under the Curve (AUC) (Zhang et al., 2017): the advantage of using AUC as a base evaluation measure is that it provides a single measure of the overall POF and POD performance of a prediction model, since it incorporates both values. The higher the AUC, the better the performance.
- G-Mean (GM): it has been suggested that precision is not an appropriate measure (Tim, Alex et al., 2007), and since the F-measure is based on precision and recall, the F-measure is not considered a main evaluation measure in this study. Instead, the G-mean (Shuo & Xin, 2013) is more appropriate as a performance measure, as reported in (Poon et al., 2017): GM = sqrt(POD × (1 - POF)).
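From a confusion matrix, the base measures can be computed as follows (a minimal sketch; AUC is omitted because it requires ranked prediction scores rather than a single-threshold confusion matrix):

```python
import math

def cccp_measures(tp, fn, fp, tn):
    """Compute the study's base measures from a confusion matrix:
    POD (churn recall), POF (false-alarm rate), and G-mean."""
    pod = tp / (tp + fn)          # probability of detection
    pof = fp / (fp + tn)          # probability of false alarm
    gm = math.sqrt(pod * (1.0 - pof))
    return pod, pof, gm

# Toy confusion matrix: 40 churners caught, 10 missed,
# 20 false alarms, 80 non-churners correctly kept.
pod, pof, gm = cccp_measures(tp=40, fn=10, fp=20, tn=80)
```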

Empirical setup
To empirically validate the proposed approach, we conducted extensive experiments comparing the credibility of the proposed CCCP model. Guided by the three research questions mentioned in Section 1, we present the empirical setup of the proposed study. We conducted a series of experiments using the log, rank, box-cox, and z-score DT methods for data normality and built the CCCP model with multiple state-of-the-art ML and data mining techniques (NB, KNN, GBT, SRI, and DP).
The proposed study builds its CCCP model separately with the log, rank, box-cox, and z-score DT methods, each providing a transformed dataset. After applying the DT methods, we applied a discretization process to group the large number of values into a small number of discrete groups. Finally, we obtained transformed datasets which have an observable impact on the improvement of classification performance.
Approach: to evaluate the performance of the proposed model, an empirical setup and experiments were conducted for a fair comparison of all the DT methods. First, a raw model was built without applying any DT method to the attributes of the subject datasets. This was followed by the creation of models using each DT method, i.e., applying Eq. (1) (for the log model), Eq. (2) (for the rank DT based model), Eq. (3) (for the box-cox DT based model), and Eq. (4) (for the z-score DT based model) to each attribute of the subject datasets. This resulted in five baseline models. We then separately performed the data discretization process on each of the datasets obtained from the above five baseline models, using the different cross-family classifiers listed in Table 3. The prepared source company dataset is used as the training set, while the prepared target company dataset is used as the test set. Fig. 1 illustrates the overall empirical design and setup.
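The overall train-on-source, test-on-target flow can be sketched as follows (synthetic stand-in data, and scikit-learn's GaussianNB as an assumed stand-in for the study's NB classifier; the actual study uses the prepared datasets of Table 1):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(42)
# Synthetic stand-ins for the prepared source (training) and target
# (test) company datasets.
X_src, y_src = rng.normal(2.0, 1.0, (500, 4)), rng.integers(0, 2, 500)
X_tgt, y_tgt = rng.normal(2.0, 1.0, (200, 4)), rng.integers(0, 2, 200)

# One DT method (log here), applied identically to both companies;
# a shared shift keeps all values strictly positive.
shift = 1.0 - min(X_src.min(), X_tgt.min())
X_src_t, X_tgt_t = np.log(X_src + shift), np.log(X_tgt + shift)

# Train on the source company, test on the target company.
model = GaussianNB().fit(X_src_t, y_src)
pred = model.predict(X_tgt_t)
tn, fp, fn, tp = confusion_matrix(y_tgt, pred).ravel()
```

The resulting confusion matrix feeds the POD/POF/GM measures described in Section 3.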

Baseline Classifiers:
In order to address RQ2, we introduce a more systematic approach of applying different classifiers selected from different model types or families. We then used these arbitrarily selected cross-family classifiers separately in the CCCP model construction process. Table 3 provides a description of the baseline classifiers.

Validation method: we used the k-fold cross-validation method for CCCP model evaluation because it is commonly used in applied ML to compare and identify the best prediction model. It is easy to implement and generally has a lower bias than other validation methods. This method has a single parameter k that refers to the number of folds into which a given dataset is split. We set the following properties for applying the k-fold cross-validation resampling procedure:
- Randomly shuffle the samples in the subject datasets.
- Set k = 10, which splits the dataset into 10 folds.
- Select the stratified property for splitting the subject dataset samples into the 10 folds, so that each fold has the same ratio of observations with a given categorical value.
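The three properties above correspond to a shuffled, stratified 10-fold split, which can be sketched with scikit-learn (an assumed toolkit; the paper does not name its implementation):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.default_rng(7).normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)   # imbalanced, like churn labels

# shuffle=True and n_splits=10 mirror the listed properties.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
fold_counts = []
for train_idx, test_idx in skf.split(X, y):
    # each test fold preserves the 80/20 class ratio of the full set
    fold_counts.append(np.bincount(y[test_idx]).tolist())
```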
Finally, in order to address the three research questions (discussed in Section 1), we carried out a number of experiments.
We conducted the first experiment without applying any DT method to the subject datasets and built a raw CCCP model. The baseline classifiers in the raw CCCP model were trained on the source company dataset and tested on the target company dataset. Table 4 presents the findings achieved with the raw CCCP model.
In the second experiment, we designed log DT based CCCP models using the preprocessed data obtained from the baseline approach (as discussed earlier in Section 3.4). We then separately built CCCP models on the datasets transformed by the log DT method; the cross-company data was used as the training set and the target data as the test set for validating the performance of the classifiers. Table 5 reflects the performances of all CCCP models using the log DT method. We conducted another experiment in which the subject datasets were transformed by the rank DT method. The classifiers of the CCCP model were trained on the cross-company dataset and tested on the target-company dataset. The performances of all the classifiers are given in Table 6.
In a further experiment, the Box-Cox DT method was applied to the subject datasets to transform and prepare all data for the CCCP models. All the baseline classifiers were then evaluated in the CCCP models. Table 7 provides details on the performance of the baseline classifiers using the state-of-the-art evaluation measures.
In the final experiment, the subject datasets were preprocessed and transformed through the z-score DT method. A separate CCCP model was built with each baseline classifier on the z-score transformed datasets. All the classifiers were trained and tested on the cross-company and target-company datasets, respectively. The performance evaluation of the baseline classifiers is given in Table 8.

Results
In this section, we explore the results obtained in Section 3 and provide a brief discussion. We structure the discussion according to the three RQs described in Section 1. In order to answer RQ1 (What is the effect of DT methods (i.e., Log, Rank, Box-Cox, and Z-Score) on data normality in CCCP?), we applied the Log, Rank, Box-Cox, and Z-Score DT methods to the subject datasets. We used the following normality measures (i.e., skewness and kurtosis), which are commonly used in data normality tests (Zhang et al., 2017), computed with the SPSS statistical toolkit (https://www.ibm.com/analytics/data-science/predictive-analytics/spss-statistical-software).
Skewness: it measures the level of symmetry in the probability distribution of the values of each attribute in the subject datasets. The obtained skewness can be either positive or negative. Positive skewness indicates a long tail to the right side, negative skewness indicates a long tail to the left side of the data distribution, and zero (0) represents equal or balanced tails on both sides. Although the ideal level of skewness ranges from -0.80 to +0.80, an acceptable range for skewness falls between -3 and +3.
Kurtosis: it measures the width of the peak(s) in the probability distribution of the values in each attribute of the subject datasets. The kurtosis measure can be either positive or negative, where a positive sign reflects a more acute peak and a negative sign indicates a lower, wider peak. A leptokurtic curve is highly arched at the mean with short tails, while the acceptable range for kurtosis is between -10 and +10 (https://www.sciencedirect.com/topics/neuroscience/kurtosis). However, the ideal kurtosis value is zero (0).
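Skewness and kurtosis before and after a DT method can be checked as follows (a sketch using SciPy rather than the SPSS toolkit used in the study; the right-skewed log-normal sample is illustrative only):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(3)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # right-skewed

raw_skew, raw_kurt = skew(raw), kurtosis(raw)        # excess kurtosis
logged = np.log(raw)                                 # log DT method
log_skew, log_kurt = skew(logged), kurtosis(logged)
# After the log transform, both statistics should sit near the ideal
# value of 0 cited above, i.e. close to a normal distribution.
```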
To assess data normality, the Q-Q (quantile-quantile) plot helps to visually evaluate the probability of normality. Fig. 2 shows that the data in some cases is normally distributed (i.e., iv, v, vi, viii, xii, xiii, xiv, xvii, xix, xxvi) and in some cases is not (i.e., ii, vii, xi, xvi, xxi, xxv, xxvii, xxviii, xxix). We can also see in Fig. 2 (Attr. A) that the raw data have a skewness of 0.079 and a kurtosis of -0.108, which are an acceptable sign of data normality. On the other hand, applying the log DT method to the same attribute (Attr. A) produced skewness and kurtosis values of -2.691 and 14.035, respectively. Hence, comparing raw versus log transformation, it is clear that the log DT method did not increase data normality in this particular case. This is not the case for the rest of the DT methods, because the Z-Score DT method increased the data normality when applied to the same attribute, compared to all the other cases with and without DT methods (i.e., raw). Similarly, we can see from Fig. 2 that in some cases the DT methods improved data normality and in some cases they did not. Additionally, we visualized the performance of the DT methods with boxplots. Fig. 2 illustrates data normality using the normal Q-Q plot for each attribute and each target DT method. Fig. 3 visualizes the impact of the DT methods using boxplots for the target-company dataset. Highly skewed data may still display reasonable symmetry in its box and whiskers, with many values appearing as outliers beyond the whiskers on either side of the box.
Similarly, we included the graphical representation of data normality (Q-Q plots) for the cross-company dataset in Fig. 4, and Fig. 5 visualizes the results in boxplots, which easily reveal the unusual data samples in each attribute that lie beyond the whiskers on either side. It can be observed from Fig. 4 (cross-company dataset) that the Rank DT method significantly improves data normality: before the transformation, the skewness and kurtosis values of an attribute such as Attr-D were 5.456 and 74.074, respectively, which clearly indicates that both skewness and kurtosis are far beyond the acceptable ranges. After applying the DT methods, some (but not all) significantly improved data normality; for example, the Rank DT method obtained a skewness value of 0.000 (the ideal value) and a kurtosis value of -1.200 (within the normal range, a platykurtic curve).
To address RQ2 (What impact do the DT methods have on the performance of different classifiers?), we carried out a series of experiments and studied five classifiers, namely NB, KNN, GBT, SRI, and DP. To investigate whether our techniques can improve the performance of CCCP, we evaluated the baseline classifiers with four commonly used evaluation measures (i.e., POD, POF, GM, and AUC).
Initially, we tested the baseline classifiers without any DT method applied to the subject datasets. Table 4 reflects the results obtained from the experiments. The NB classifier obtained the lowest POF value of 0.124, and the GBT classifier obtained the highest POD value (0.176). Overall, GBT outperformed the rest of the target classifiers because it also achieved the best results in terms of AUC = 0.52 and GM = 0.293. From this perspective, we can observe that the GBT classifier showed the best performance in predicting correct customer churn, while the NB classifier performed well in avoiding incorrect churn predictions.
In the second experiment, we tested the baseline classifiers on the subject datasets transformed through the log DT method. Table 5 describes the results achieved. In this experiment, the DP classifier obtained the highest values for POD = 0.198, GM = 0.323, and AUC = 0.53. In the third experiment, the NB and DP classifiers performed similarly in terms of GM, AUC, and POF, while overall the DP classifier outperformed the rest because it showed a higher ability to correctly predict customer churn (i.e., POD = 0.927, which is close to the ideal case). In the fourth experiment, the CCCP model based on the GBT classifier obtained the best performance among all the baseline classifiers on the subject datasets transformed through the Box-Cox DT method. However, in this experiment, the NB classifier also showed good performance in terms of fewer incorrect churn predictions, as it achieved the lowest POF. Finally, the last experiment evaluated the performance of all classifiers with the Z-Score DT method applied to the subject datasets. The SRI based CCCP model outperformed the others in terms of all evaluation measures in this final experiment. Fig. 6 is a pictorial representation of the performances of all baseline classifiers and DT methods in terms of the target evaluation measures (as discussed earlier).
In order to address RQ3 (Does the application of different DT methods exhibit a significant performance difference?), Fig. 7 depicts the obtained values of the primary evaluation measure (AUC) for the different DT methods applied to the subject datasets of the telecommunication companies. It is clear from Fig. 7 and Tables 5-8 that the application of the DT methods exhibits significantly different performance: NB is overall the best on Log, Rank, and Box-Cox (AUC of 0.51, 0.513, and 0.529, respectively), while it does not produce a good result with Z-Score (0.459). Similarly, SRI shows the worst performance on Log, Rank, and Box-Cox, but it outperforms with Z-Score (AUC = 0.541), where the rest of the baseline classifiers do not perform well. So it is clear that the application of different DT methods exhibits significantly different performance, which answers RQ3.

Fig. 2. The normal Q-Q plot for the target DT methods (Log, Rank, Box-Cox, and Z-Score), including raw data, on the target company dataset.

Discussion
In this section, we present contributions to existing knowledge and the implications of our findings for practice and future research efforts.
Finally, we conclude this section with threats to validity.

Contribution to the existing knowledge
The major contributions of the proposed study are as follows: (i) to the best of our knowledge, this is the first study on CCCP using the DT methods (i.e., log, rank, box-cox, and z-score) in the telecommunication sector; (ii) we proposed parameter settings for the DT methods: (a) replacing all zero-valued attributes with 0.00001 before applying the log DT method, and (b) applying the 10th percentile DT method so that the source and target company data are transformed into a more appropriate form; and (iii) we recommend the NB classifier because, on average, it performed best across the DT methods in CCCP in the target domain.
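To make these transformations concrete, the four DT methods can be sketched as below. This is a minimal Python illustration, not the study's implementation; in particular, the Box-Cox lambda is assumed to be given, whereas in practice it is usually estimated by maximum likelihood:

```python
import math

EPS = 1e-5  # the study's zero-replacement constant for the log transform

def log_transform(xs):
    # replace zero-valued attributes with 0.00001 before taking the log
    return [math.log(x if x != 0 else EPS) for x in xs]

def zscore(xs):
    # center to mean 0 and scale to unit (population) standard deviation
    mu = sum(xs) / len(xs)
    sd = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mu) / sd for x in xs]

def rank_transform(xs):
    # map each value to its 1-based rank; ties keep order of appearance
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def boxcox(xs, lam):
    # Box-Cox for strictly positive x; lam = 0 reduces to the log transform
    return [math.log(x) if lam == 0 else (x ** lam - 1) / lam for x in xs]
```

Each method is applied column-wise to both the source and target company data before the classifier is trained, so that the two distributions become more comparable.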

Implications for practice
Our results have several implications for the practice of cross-company churn prediction in telecommunication companies using data transformation methods. Firstly, the present work is an extension of our previous work (Adnan et al., 2018) and brings consistency to it. The extension and consistency are reflected through empirical evidence and implications for the development of CCCP models in the telecommunication industry, as the existing studies (see Section 2) are mostly designed for WCCP. Our findings suggest that a company lacking the required data for classifier training, for any of the reasons discussed earlier, can use the data from a mature company to predict customer churn successfully.
Secondly, the importance of this study can be viewed in light of a report (Takeuchi, 2016) predicting that by 2020 the digitalization of business practices will more than double, from 22% to 50% of existing initiatives. This study therefore also provides methodological practices through which telecommunication companies can take advantage of existing and future data transformation methods.
Thirdly, data needs to be considered from the researcher's perspective as well, because cross-company data should be managed in a pattern that is more amenable to empirical analysis and to the development of CCCP models in the future. For this reason, we applied DT methods (i.e., log, rank, box-cox, and z-score) to cross-company data in the telecommunication sector in a way that takes account of such practice. This provides researchers a blueprint for using cross-company data in the target company. It is important to note that DT methods do not lose the information of the original data in the cross-company context (Zhang et al., 2017). Our results on DT methods shed light on data normalization in general, and our approach specifically provides a clear picture of the expected consequences of increased data normality. Generally, it can be expected that improving normality in cross-company data will lead to an increase in the performance of the predictive model. However, it can also be noticed that it is not possible to bring uniform improvements through normalization, as observed during the second experiment depicted in Table 5. That experiment reveals that the baseline classifiers did not show uniform improvement: the transformation improved some classifiers (e.g., NB), while others (e.g., SRI) did not improve at all, leading to variation in the improvement of data normality due to data transformation. This addresses RQ1. Similarly, our findings suggest that it is not a necessary condition for DT methods to increase the data normality in every data sample; this implication also supports RQ1. In most cases, the DT methods can increase data normality, which can improve CCCP performance, but it is also found (see Section 4 and Fig. 4) that this is not true for all the DT methods.
For example, the CCCP model based on the Z-score DT method neither reduced the classifiers' error rates nor obtained further improvement through the transformation. In Section 4, the results are obtained by following the evaluation measures (i.e., skewness and kurtosis) recommended in Zhang et al. (2017).
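The normality check itself is straightforward: moment-based skewness and excess kurtosis can be computed before and after a transformation to verify whether normality actually improved. A minimal Python sketch (the sample data is hypothetical, not from the study's datasets):

```python
import math

def skew_kurt(xs):
    # moment-based skewness and excess kurtosis (both 0 for a perfect normal)
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

raw = [1, 1, 1, 2, 2, 50]            # right-skewed hypothetical sample
logged = [math.log(x) for x in raw]  # the log DT should pull skewness toward 0
```

Comparing `skew_kurt(raw)` against `skew_kurt(logged)` makes the improvement (or lack of it, as with Z-score, which rescales but cannot change the shape of a distribution) directly measurable.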
Fourthly, this study investigated the performance of cross-family (Bayesian, Lazy, Trees, Rules-based, and Neural Nets) baseline classifiers (i.e., NB, KNN, GBT, SRI, and DP) in terms of the POF, POD, AUC, and GM evaluation measures. The findings of this study also provide direction for evaluating, in practice, the impact of DT methods on the performance of different cross-family baseline classifiers. From this perspective, we observed that, overall, the NB-based CCCP model achieved on average higher performance (see Tables 4-8 and Fig. 7) than the rest of the target classifiers. Furthermore, the results also show that NB is not a good choice for data transformed with the z-score DT method. On the other hand, the z-score DT method suits the SRI classifier exceptionally well in the CCCP model, whereas the rest of the classifiers do not obtain good results with it.
Additionally, from a data management perspective, an important question has always been to what extent one should use different classifiers, and how many, for better CCCP. Our empirical evaluation setup tested alternative baseline classifiers originating from different classifier families and experimented with different sets of DT methods. We suggest picking the single best classifier and DT method before deploying a final CCCP model in the telecommunication sector and discarding the rest of the approaches used in this study's series of experiments. With respect to future directions, we suggest the following: (i) experimenting with our technique/model and applying ensemble learning methods; (ii) studying alternative approaches, or applying the same approach, when building predictive models for domains other than telecommunication; and (iii) considering other evaluation measures for measuring data normality and validating the predictive model.

Threats to validity
In this section, we present the threats to the validity of the proposed study in the following terms:
- External validity: We only considered two publicly available benchmark datasets related to the telecommunication sector. Dataset-1 is used as the cross-company training set and dataset-2 as the target-company dataset. Hence, our empirical results may not generalize to all companies within or outside the target domain. We could have used dataset-2 as the cross-company dataset and dataset-1 as the target-company dataset; in that case, the generated results might have been different. Nonetheless, replicating the proposed CCCP model on more datasets from different companies may provide more fruitful results.
- Baseline classifiers: We arbitrarily selected five classifiers from cross-family model types to evaluate the performance of the proposed model. However, using different classifiers instead of our baseline classifiers could show different results.
- Internal validity: For most parameters, the defaults of RapidMiner Studio Educational 8.1.001 are used for the baseline classifiers, while for discretization by frequency, the bin size was evaluated empirically and the optimum value (bin size = 3) was obtained from a series of test values. A different bin size, or any other discretization method, may yield different results.

Fig. 4. The normal Q-Q plot for target DT methods (Log, Rank, Box-Cox and Z-Score), including raw data on the cross-company dataset.
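For reference, the equal-frequency discretization mentioned above (bin size = 3) can be sketched as follows. This is an illustrative Python sketch of the general technique, not RapidMiner's implementation:

```python
def discretize_by_frequency(xs, bins=3):
    # assign each value a bin label 0..bins-1 so that the bins hold
    # (roughly) equal numbers of values, ordered by magnitude
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    labels = [0] * len(xs)
    for pos, i in enumerate(order):
        labels[i] = min(pos * bins // len(xs), bins - 1)
    return labels
```

Because the bin boundaries depend on the data's empirical quantiles rather than on fixed cut points, a different bin size changes which values are grouped together, which is why the choice of bin size can affect the downstream results.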

Conclusion
CCCP is still a challenging and rapidly growing problem for competitive businesses in general and for telecommunication companies in particular, due to the heterogeneity between the cross-company and target-company datasets. In this paper, four different data transformation methods, namely log, rank, box-cox, and Z-score, were applied to build a CCCP predictive model based on multiple classifiers (NB, KNN, GBT, SRI, and DP) belonging to different ML classifier families (Bayesian, Lazy, Trees, Rules-based, and Neural Nets). The proposed study not only presented a CCCP model for the telecommunication sector, where one company lacks historical data, is newly established, has recently started using CRM, or has lost data for any reason, but also investigated the performance of multiple classifiers and data transformation methods in the target domain. The empirical results show that in most cases the data transformation methods increased data normality and thereby improved the classifiers' performance. Tables 4-8 illustrate the results obtained by all the classifiers on data transformed with the different DT methods. Further, it is observed that the NB classifier achieved the highest AUC values (0.51, 0.51, and 0.513 in raw, log, and Box-Cox, respectively) compared to the rest of the classifiers. It is also observed that the single rule induction (SRI) classifier achieved its maximum performance (AUC = 0.541) with a single DT method (Z-Score) and obtained the lowest performance of all the classifiers with the other DT methods (AUC values of 0.45, 0.44, 0.357, and 0.455 in raw, log, rank, and box-cox, respectively), while the DP and GBT classifiers performed on average across all the DT methods.
Moreover, we also revealed that the SRI-based CCCP model did not achieve the best results in terms of the POD evaluation measure, which means it cannot improve the accurate prediction of correct customer churns. Additionally, from the experiments we observed that Z-Score-based data transformation could not achieve better POD results across all the classifiers, while its POF values, in most cases, are higher than its POD values. This shows that Z-Score-based data transformation neither achieved the best result in improving the classifiers' prediction performance nor obtained the best result in reducing the error, when compared to the rest of the methods. As future directions, we recommend experimenting with our technique/model and applying ensemble learning methods. In addition, it could be worth studying alternative approaches, or applying the same approach, when building predictive models for domains other than telecommunication.