CEU eTD Collection (2022); Jancó István: Survival Analysis for a Hungarian E-commerce Firm

CEU Electronic Theses and Dissertations, 2022
Author Jancó István
Title Survival Analysis for a Hungarian E-commerce Firm
Summary The project was created for a B2C e-commerce company based in Hungary. The Firm (also referred to as the Client) operates an online marketplace for various types of handmade and vintage products.
The analysis aimed to give the Firm insight into expected seller attrition and the variables that might affect it. At the Firm, seller accounts become inactive when there are no purchase offers from clients. The Firm's management was interested in the probability distribution of the sellers' activity rate and lifetime. The resulting distribution and variable analysis would serve as a base for the Firm to understand the possible reasons behind attrition. While one of the Firm's main goals is to scale up its operation and enter new markets, the seller attrition rate remains one of the most significant constraints.
One of the main goals was to identify the seller attributes that might contribute most to a seller's departure from the platform, so that management can mitigate attrition going forward with preventive tactics (e.g., reminders). As the sellers are the main source of revenue, it is crucial to understand what factors might drive them away from the Firm's marketplace and possibly to a competitor. To provide a stronger base for analysis, a database of approximately ten thousand sellers was assembled, with additional data including sales records, product groups, and buyers. No previous analysis of seller survival exists at the Firm. The ability to forecast changes in the seller base more accurately enables the company to establish a more proactive resource allocation strategy and to adjust its business model accordingly. The project was intended to serve as a base for further retention or failure analysis.
The data provided for the exercise related to vendors' marketplace accounts. Two of the most important variables are the seller's registration time and the time when the seller's last posting expired. These variables made it possible to establish the time boundaries of the analysis and provided insight into the state of vendors at different times.
Apart from the time-related data, additional covariate data was derived. A pool of possible explanatory variables was created and then narrowed down using various methods (e.g., Recursive Feature Elimination).
Once the most predictive features were determined, Survival Analysis was used to estimate the distribution of vendors' survival probabilities over time. Survival Analysis is the analysis of data involving times to some event of interest. In the context of this project, the event of interest was a seller's departure from the Firm's marketplace. Since the exact time at which a vendor decides to leave cannot be observed, the following logic was used to establish a proxy for the time of departure: if a seller has not made a sale in the past six months, the time of departure is taken to be the date of the seller's last sale. This logic identified the accounts that could be considered "dead". Accounts for which the above condition does not hold are considered censored. In the context of Survival Analysis, censoring means that a subject did not experience the event of interest during the timeframe of the study but might still experience it in the future. The presence of censoring in the data added some unique challenges to the analysis.
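The departure proxy and the resulting censoring can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the 183-day approximation of "six months", and the example dates are assumptions made for illustration, and sellers without any sale are not handled. The Kaplan-Meier estimator is used here only as a standard non-parametric way to turn such (duration, event) pairs into a survival curve; the abstract does not state which estimator was used.

```python
from datetime import date, timedelta

# Assumption: "six months" is approximated as 183 days.
INACTIVITY_WINDOW = timedelta(days=183)


def label_seller(registered, last_sale, study_end):
    """Return (duration_in_days, event_observed) for one seller.

    event_observed is True when the seller made no sale in the six months
    before study_end ("dead"); otherwise the seller is censored.
    """
    if study_end - last_sale > INACTIVITY_WINDOW:
        # Departure proxy: the date of the last sale.
        return (last_sale - registered).days, True
    # Still active: we only know the seller survived up to study_end.
    return (study_end - registered).days, False


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each observed event time."""
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            # Multiply by the share of at-risk sellers that survive time t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both departed and censored sellers drop out of the risk set.
        at_risk -= leaving
    return curve


# Hypothetical sellers: (registration date, date of last sale).
study_end = date(2022, 1, 1)
sellers = [
    (date(2020, 1, 1), date(2020, 4, 10)),
    (date(2020, 1, 1), date(2021, 12, 1)),
    (date(2021, 1, 1), date(2021, 3, 2)),
]
labels = [label_seller(reg, last, study_end) for reg, last in sellers]
durations = [d for d, _ in labels]
events = [e for _, e in labels]
for t, s in kaplan_meier(durations, events):
    print(f"day {t}: estimated survival {s:.2f}")
```

Censored sellers stay in the risk set until their last observed day, so they lower the weight of later departures without being counted as departures themselves.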
Various statistical approaches were used to validate the outcomes of the model. During development, the data was split into training and test samples to avoid overfitting. The log-rank test and the concordance index were used to validate the performance and predictions of the model. Furthermore, the model was compared against a Random Survival Forest to discover any possible shortcomings of the developed algorithm. The performance of the model proved to be comparable to that of the Random Survival Forest.
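As an illustration of how a concordance index treats censored data, the following sketch computes Harrell's C in plain Python. The function name and the example values are hypothetical, tied durations are skipped for simplicity, and the abstract does not say which implementation the project used.

```python
def concordance_index(durations, events, risk_scores):
    """Harrell's C: share of comparable pairs that are ranked correctly.

    A pair (i, j) is comparable when the seller with the shorter duration
    had an observed departure. It is concordant when that seller also has
    the higher predicted risk; ties in risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        for j in range(n):
            # Seller i left first, and the departure was actually observed.
            if durations[i] < durations[j] and events[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical example: higher risk scores for sellers who left earlier.
c = concordance_index([60, 100, 731], [True, True, False], [0.9, 0.5, 0.1])
print(c)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. Pairs in which the shorter duration is censored are ignored, because it is unknown which of the two sellers left first.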
Several milestones were accomplished during the project. In the first part, the data cleaning and enrichment processes were established and mostly automated, increasing the reproducibility of the analysis for new data. Subsequently, the features with the most predictive potential were determined; these included several binary indicators as well as continuous variables. The pool of features was narrowed down using statistical methods and modeling (e.g., Logistic Regression). The features were then used as covariates in a model capable of predicting both survival and hazard for one or multiple shops. The study concluded that one of the most predictive features was a binary variable indicating whether a shop is also registered as a business enterprise. This variable appears to have a positive effect on the survival probability of shops.
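For reference, survival and hazard are linked by the standard definitions below, where T denotes a shop's time to departure. The notation is an assumption for illustration and is not taken from the thesis.

```latex
S(t) = P(T > t), \qquad
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}, \qquad
S(t) = \exp\!\left( -\int_0^t h(u)\, du \right)
```

A covariate with a positive effect on survival therefore corresponds to a lower hazard at each point in time.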
The established theoretical background served as a basis for automating the analysis. A separate Jupyter Notebook was created to automate the Survival Analysis and make it easily reproducible and scalable. This was achieved by defining several functions that conduct the exercise based on user-selected inputs and criteria.
Supervisor Lieli Robert
Department Economics MSc
Full text https://www.etd.ceu.edu/2022/janco_istvan.pdf

