Step 1 – Benchmarking Data

Making sure the data is correct prevents the "Garbage In = Garbage Out" problem.

This blog is part 1 of a 3-part series.

Introduction

When analyzing any model, the first step is to benchmark the data used to build it. The old adage "Garbage in = Garbage out" cannot be emphasized strongly enough when building ML models. To achieve this, we chose a specific vintage. Choosing a single vintage let us avoid many of the data noise issues (incomplete data, changing fields, changing definitions) that arise when comparing combined data spread across vintages.

To benchmark the data we used the 2014-Q4 vintage. This vintage has gone through the complete cycle for 36-month loans and has completed most of the cycle for 60-month loans. This aging let us compare not only loan principal but also interest payments, principal payments, charge-offs and NAR (Net Annualized Return) calculations. While the 2014-Q4 vintage is more than 3 years old, it is still recent enough that we can have confidence in the quality of our overall data.

Lending Club Data

The first step in benchmarking the data is to collect the data from the Lending Club site (copied as of June 5, 2018).

Croudify’s Modeling Data

Before evaluating Croudify’s data, there are five important points to note that might make some comparisons slightly off:

1. No FG Loans: Since Lending Club no longer originates these types of loans, we did not model them and did not benchmark this category.
2. Data Lag: Our modeling data usually lags by one month (in this case, two months), so as of June 5 we are still benchmarking against end-of-March data. For example, when we model to rate June 2018 loans, we will have data through the end of April.
3. No Fees & Adjustments in NAR: In its Adjusted NAR calculation, Lending Club subtracts fees and also makes loan valuation adjustments. We did not do either in our calculations.
4. No Recoveries (Charged Off (Net)): In its net charged-off calculations, Lending Club adds back recoveries. In our models we consider recoveries to be virtually 0, so we ignore them for all modeling purposes. We therefore excluded them from all our benchmarking calculations, which might make our numbers slightly off.
5. NAR off by at most 0.7%: Adding up the differences from points 3 and 4, we expect our NAR to differ from Lending Club's by at most 0.7% for any rating. If our calculation fell within this range of the LC NAR, we treated the data as matching and moved forward.

With all that in mind, below is our analysis data (data as of March 30, 2018).

Our detailed NAR calculation sheet is here.

A few things to note from the comparison of the two data sets:

- Croudify’s total loan population matches Lending Club’s total loan population exactly.
- Croudify’s Total Principal Received, Total Interest Received and Chargeoffs are very close to Lending Club’s data (the small discrepancies are just timing issues).
- Croudify’s NAR calculations are very close to Lending Club’s net NAR once you account for the differences between the data sets described above.

This benchmarking exercise gave us complete confidence in the validity of our data and our calculations. In the next step we will separate the 36-month and 60-month populations and benchmark the 36-month population (our recommended loan type).

Also published on Medium.
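
For readers who want to reproduce the comparison, the 0.7% tolerance rule described above can be sketched in a few lines of Python. This is only a minimal illustration and not our actual benchmarking code: the function names are hypothetical, the sketch assumes NAR values are expressed in percent and that 0.7% means an absolute gap of 0.7 percentage points, and the numbers in the example are made-up placeholders rather than real Lending Club or Croudify figures.

```python
from typing import Dict

# Maximum expected NAR gap stated in the post, assumed here to be
# an absolute difference in percentage points.
NAR_TOLERANCE_PCT_POINTS = 0.7


def nar_matches(our_nar: float, lc_nar: float,
                tolerance: float = NAR_TOLERANCE_PCT_POINTS) -> bool:
    """Return True when two NAR values (in percent) differ by at most `tolerance`."""
    return abs(our_nar - lc_nar) <= tolerance


def compare_by_grade(ours: Dict[str, float],
                     lc: Dict[str, float]) -> Dict[str, bool]:
    """Apply the tolerance check to every grade present in both data sets."""
    shared_grades = sorted(ours.keys() & lc.keys())
    return {grade: nar_matches(ours[grade], lc[grade]) for grade in shared_grades}


# Made-up placeholder values, NOT real Lending Club or Croudify figures.
example_ours = {"A": 5.0, "B": 6.0, "C": 7.0}
example_lc = {"A": 5.5, "B": 6.25, "C": 8.0}

result = compare_by_grade(example_ours, example_lc)
print(result)
```

With these placeholder inputs, grades A and B fall inside the tolerance and grade C does not, so C is the one that would need a closer look before moving forward.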