Tec de Monterrey, 2020
Project - Predictive Modeling for Airbnb Pricing in Mexico City
-Data Treatment, Data Modelling, Predictive Analytics





GREAT DEALS WITH PREDICTIVE MODELS
This project tackled a practical, real-world data problem: helping a homeowner (Carina) determine the optimal nightly price to rent out her apartment on Airbnb. The team approached it as a full data analytics pipeline, from raw data collection and cleaning to building and validating a multiple linear regression model capable of predicting rental prices based on key features.
​
The project began with a Project Charter, which defined clear SMART objectives and a five-week plan. The team aimed to:
- Build a clean, structured database by Week 3, removing irrelevant and inconsistent variables.
- Use exploratory data analysis (EDA), including histograms, boxplots, and correlation matrices, to understand distributions and detect outliers.
- Construct a predictive regression model with a coefficient of determination of R² > 90%, using key features such as neighborhood, room type, number of beds, bathrooms, ratings, and host status.
​
The scope was limited to Mexico City properties, acknowledging that results could vary in other regions and that qualitative factors (like host personality) could not be captured quantitatively.
​
The raw dataset contained 72 variables and 19,180 samples, which were systematically reduced to 10 key predictors relevant to price:
- Host status (superhost or not)
- Neighborhood (coded from 1–16 based on desirability)
- Room type (private, shared, entire home, hotel)
- Number of guests, beds, bedrooms, and bathrooms
- Availability throughout the year
- Number of reviews in the last 12 months and overall rating
- Instant bookability

The team:
- Removed 62 irrelevant variables (e.g., host description text, amenities not affecting price).
- Standardized categorical data, converting text fields to numeric codes.
- Handled missing values by filling with median/mode and removed outliers (e.g., luxury hotels skewing prices) to avoid bias.
- Verified distributions through boxplots and histograms, identifying skewness that would later require transformation.
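The cleaning steps above can be sketched in a few lines of Python. This is only an illustration, not the team's actual workflow (which ran in Minitab): the sample rows, the field names, the numeric room-type codes, and the 1.5 × IQR outlier rule are all assumptions made for the example, since the write-up does not say how outliers were identified.

```python
from statistics import median, mode, quantiles

# Hypothetical mini-sample: field names and values are illustrative only.
listings = [
    {"price": 800.0, "beds": 2, "room_type": "Entire home"},
    {"price": 650.0, "beds": None, "room_type": "Private room"},
    {"price": 900.0, "beds": 1, "room_type": None},
    {"price": 720.0, "beds": 2, "room_type": "Entire home"},
    {"price": 15000.0, "beds": 3, "room_type": "Hotel"},  # extreme price
    {"price": 700.0, "beds": 1, "room_type": "Private room"},
    {"price": 850.0, "beds": 2, "room_type": "Entire home"},
    {"price": 780.0, "beds": 2, "room_type": "Shared room"},
]

# Assumed numeric coding for room type; the project's real codes are not given.
ROOM_TYPE_CODES = {"Shared room": 1, "Private room": 2, "Entire home": 3, "Hotel": 4}


def fill_missing(rows, field, how):
    """Fill None values in `field` with the median or mode of the observed values."""
    observed = [row[field] for row in rows if row[field] is not None]
    fill_value = median(observed) if how == "median" else mode(observed)
    return [
        {**row, field: fill_value if row[field] is None else row[field]}
        for row in rows
    ]


def drop_iqr_outliers(rows, field, k=1.5):
    """Keep rows whose `field` lies within k * IQR of the quartiles (assumed rule)."""
    q1, _, q3 = quantiles([row[field] for row in rows], n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [row for row in rows if low <= row[field] <= high]


rows = fill_missing(listings, "beds", "median")    # numeric field: median
rows = fill_missing(rows, "room_type", "mode")     # categorical field: mode
rows = [{**row, "room_type_code": ROOM_TYPE_CODES[row["room_type"]]} for row in rows]
cleaned = drop_iqr_outliers(rows, "price")

print(len(listings), "->", len(cleaned), "listings after cleaning")
```

In this toy sample, only the extreme-priced hotel row falls outside the IQR fences, mirroring the kind of luxury listing the team removed.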
​
Four regression models were built and iteratively refined using Minitab:
- Initial Model (with outliers):
  - Low predictive power (R² ≈ 9%).
  - High Variance Inflation Factor (VIF) for certain variables (e.g., “review_scores_rating”), suggesting multicollinearity.
  - Residual plots showed strong non-normality and clustering, confirming model inadequacy.
- Second Model (without outliers):
  - Applied Box-Cox transformation (λ = 0.0578).
  - Achieved R² ≈ 99.9% with excellent normality, but VIF values remained too high, meaning variables were too correlated.
- Third Model (with constant added):
  - Reduced VIFs to acceptable levels (<10).
  - R² dropped to ~49%, but residuals showed good distribution and linearity, indicating a more reliable though less “perfect” model.
- Final Model:
  - Removed statistically insignificant variables (P-Value > 0.05) to improve parsimony.
  - Maintained R² ≈ 49% with balanced residuals and acceptable VIFs.
  - Produced a regression equation predicting nightly price as a function of the selected variables.
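To make the Box-Cox step more concrete, here is a minimal sketch of the transformation using the reported λ = 0.0578. It uses the textbook form (y^λ − 1)/λ; some statistical packages apply the simpler power form y^λ instead, and the write-up does not state which form was used, so treat the functions below as an assumption. The example prices are hypothetical.

```python
import math

LAMBDA = 0.0578  # value reported for the second model


def box_cox(y, lam=LAMBDA):
    """Textbook Box-Cox transform; requires a strictly positive input."""
    if y <= 0:
        raise ValueError("Box-Cox is only defined for positive values")
    return math.log(y) if lam == 0 else (y ** lam - 1) / lam


def inverse_box_cox(z, lam=LAMBDA):
    """Map a value on the transformed scale back to the original scale."""
    return math.exp(z) if lam == 0 else (lam * z + 1) ** (1 / lam)


# Hypothetical nightly prices, not taken from the project data.
for price in (500.0, 1000.0, 5000.0):
    z = box_cox(price)
    print(f"price={price:>7.1f}  transformed={z:.3f}  back={inverse_box_cox(z):.1f}")
```

Because λ is close to zero, the transform compresses high prices in much the same way a logarithm would, which is consistent with the skewness noted during data exploration. Any prediction made on the transformed scale has to be passed through the inverse to read it as a price.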
The final model was validated using test scenarios:
- In some cases, predictions deviated by up to 23%, showing room for improvement.
- In others, the error margin was as low as 1%, supporting the model’s usefulness for initial price-setting decisions.

This variability was acknowledged as a limitation: the model should be used as a guideline, not a definitive pricing tool, because external sociocultural and seasonal factors are not captured in the dataset.
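The error margins quoted above are presumably absolute percentage errors between predicted and actual prices. A small sketch of that calculation follows; the scenario names and prices are hypothetical, chosen only to reproduce the 23% and 1% figures, and are not the team's actual test cases.

```python
def percent_error(predicted, actual):
    """Absolute percentage error of a prediction relative to the actual value."""
    if actual == 0:
        raise ValueError("actual price must be non-zero")
    return abs(predicted - actual) / abs(actual) * 100


# Hypothetical validation scenarios: (label, predicted price, actual price).
scenarios = [
    ("scenario A", 1230.0, 1000.0),
    ("scenario B", 990.0, 1000.0),
]

for label, predicted, actual in scenarios:
    print(f"{label}: {percent_error(predicted, actual):.1f}% error")
```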
​
Key takeaways:
- Tradeoff between R² and VIF: A very high R² can be misleading if multicollinearity is present, so balancing predictive power and variable independence was crucial.
- Data cleaning is critical: Outlier removal and transformation dramatically improved model performance.
- Practical impact: The final model gives Airbnb hosts a data-driven starting point to price their listings competitively while maximizing income.
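For readers unfamiliar with VIF, the idea behind the first point can be sketched briefly. The VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on the others; with only two predictors, that R² is simply their squared Pearson correlation. The example below uses that two-predictor shortcut and hypothetical values for beds, guests, and availability, so it illustrates the concept rather than reproducing the Minitab output.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(xs, ys):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r_squared = pearson_r(xs, ys) ** 2
    return math.inf if r_squared >= 1 else 1 / (1 - r_squared)


# Hypothetical predictor values, not taken from the project data.
beds = [1, 2, 2, 3, 4, 5]
guests = [2, 4, 3, 6, 8, 9]                    # moves almost in step with beds
availability = [300, 50, 200, 120, 340, 90]    # largely unrelated to beds

print(f"VIF beds vs guests:       {vif_two_predictors(beds, guests):.1f}")
print(f"VIF beds vs availability: {vif_two_predictors(beds, availability):.2f}")
```

Predictors that carry nearly the same information, such as beds and guests in this toy data, push the VIF well above the threshold of 10 used in the project, while unrelated predictors stay close to 1.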
Team members reflected on the experience as a comprehensive introduction to data science: learning to wrangle big datasets, interpret statistical outputs, use tools like Minitab, and appreciate the ethics of data handling (privacy, accuracy, and proper cleaning practices).
​
The relevant project documents are attached below. (Note: most, if not all, of the documents are in Spanish.)
