Tec de Monterrey, 2020

Project - Predictive Modeling for Airbnb Pricing in Mexico City

-Data Treatment, Data Modelling, Predictive Analytics

GREAT DEALS WITH PREDICTIVE MODELS

This project tackled a practical, real-world data problem: helping a homeowner (Carina) determine the optimal nightly price to rent out her apartment on Airbnb. The team approached it as a full data analytics pipeline, from raw data collection and cleaning to building and validating a multiple linear regression model capable of predicting rental prices based on key features.

The project began with a Project Charter, which defined clear SMART objectives and a five-week plan. The team aimed to:

Build a clean, structured database by Week 3, removing irrelevant and inconsistent variables.
Use exploratory data analysis (EDA) — histograms, boxplots, and correlation matrices — to understand distribution and detect outliers.
Construct a predictive regression model with a target coefficient of determination of R² > 90%, using key features such as neighborhood, room type, number of beds, bathrooms, ratings, and host status.
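For reference, the R² figure used as a target here is the share of price variance that a model explains. The sketch below shows the standard calculation in plain Python; the price lists are made-up illustrative values, not project data, and the function name is an assumption.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    # Residual sum of squares: what the model fails to explain.
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    # Total sum of squares: variation of the prices around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical nightly prices (MXN) and model predictions.
actual = [400.0, 650.0, 900.0, 1200.0]
predicted = [450.0, 600.0, 950.0, 1150.0]
score = r_squared(actual, predicted)
```

A value near 1 means the predictions track the observed prices closely; it says nothing by itself about whether the predictors are independent of each other.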

The scope was limited to Mexico City properties, acknowledging that results could vary in other regions and that qualitative factors (like host personality) could not be captured quantitatively.

The raw dataset contained 72 variables and 19,180 samples, which were systematically reduced to 10 key predictors relevant to price:

Host status (superhost or not)
Neighborhood (coded from 1–16 based on desirability)
Room type (private, shared, entire home, hotel)
Number of guests, beds, bedrooms, and bathrooms
Availability throughout the year
Number of reviews in the last 12 months and overall rating
Instant bookability

The team:

Removed 62 irrelevant variables (e.g., host description text, amenities not affecting price).
Standardized categorical data, converting text fields to numeric codes.
Handled missing values by filling with median/mode and removed outliers (e.g., luxury hotels skewing prices) to avoid bias.
Verified distributions through boxplots and histograms, identifying skewness that would later require transformation.
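The cleaning steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the field names, the numeric room-type codes, and the 1.5 × IQR outlier rule are all hypothetical, since the source does not say which outlier criterion the team applied.

```python
from statistics import median, quantiles

# Hypothetical mini-sample; field names and values are assumptions.
listings = [
    {"room_type": "Entire home", "beds": 2, "price": 900.0},
    {"room_type": "Private room", "beds": None, "price": 450.0},
    {"room_type": "Entire home", "beds": 3, "price": 1200.0},
    {"room_type": "Shared room", "beds": 1, "price": 250.0},
    {"room_type": "Hotel room", "beds": 2, "price": 25000.0},  # luxury outlier
    {"room_type": "Private room", "beds": 1, "price": 500.0},
]

# Assumed numeric codes for the categorical room-type field.
ROOM_CODES = {"Shared room": 1, "Private room": 2, "Entire home": 3, "Hotel room": 4}


def clean(rows):
    # 1. Fill missing bed counts with the median of the observed values
    #    and convert the text category to its numeric code.
    beds_median = median(r["beds"] for r in rows if r["beds"] is not None)
    filled = [
        {
            **r,
            "beds": beds_median if r["beds"] is None else r["beds"],
            "room_type": ROOM_CODES[r["room_type"]],
        }
        for r in rows
    ]
    # 2. Drop rows whose price falls outside the 1.5 * IQR fences.
    q1, _, q3 = quantiles([r["price"] for r in filled], n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [r for r in filled if low <= r["price"] <= high]


cleaned = clean(listings)
```

On this toy sample the luxury hotel row is dropped and the missing bed count is replaced by the median of the remaining values.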

Four regression models were built and iteratively refined in Minitab:

Initial Model (with outliers):
- Low predictive power (R² ≈ 9%).
- High Variance Inflation Factor (VIF) for certain variables (e.g., “review_scores_rating”), suggesting multicollinearity.
- Residual plots showed strong non-normality and clustering, confirming model inadequacy.
Second Model (without outliers):
- Applied Box-Cox transformation (λ = 0.0578).
- Achieved R² ≈ 99.9% with excellent normality — but VIF values remained too high, meaning variables were too correlated.
Third Model (with constant added):
- Reduced VIFs to acceptable levels (<10).
- R² dropped to ~49%, but the residuals were well distributed and showed linearity, indicating a more reliable though less “perfect” model.
Final Model:
- Removed statistically insignificant variables (P-Value > 0.05) to improve parsimony.
- Maintained R² ≈ 49% with balanced residuals and acceptable VIFs.
- Produced a regression equation predicting nightly price as a function of the selected variables.
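The Box-Cox step in the second model can be illustrated with the textbook form of the transform and its inverse, which is what turns a prediction on the transformed scale back into a nightly price. This is a sketch, not the team's Minitab procedure: statistical packages differ in the exact form they apply (some use a simple power y^λ), and the function names and the example price are assumptions. Only λ = 0.0578 comes from the project.

```python
import math

LAMBDA = 0.0578  # value reported for the second model


def box_cox(y, lam=LAMBDA):
    """Textbook Box-Cox transform of a strictly positive value."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def inverse_box_cox(z, lam=LAMBDA):
    """Map a transformed value back to the original price scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


price = 850.0  # hypothetical nightly price
z = box_cox(price)
recovered = inverse_box_cox(z)
```

Because λ is close to zero, the transform behaves much like a logarithm, which is a common remedy for right-skewed price data.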

The final model was validated using test scenarios:

In some cases, predictions deviated by up to 23%, showing room for improvement.
In others, the error margin was as low as 1%, supporting the model’s usefulness for initial price-setting decisions.

This variability was acknowledged as a limitation — the model should be used as a guideline, not a definitive pricing tool, due to external sociocultural and seasonal factors not captured in the dataset.
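The deviations quoted above are percent errors between predicted and actual prices. A minimal sketch of that calculation follows; the price pairs are invented to reproduce errors of roughly 1% and 23% and are not the project's test scenarios.

```python
def percent_error(predicted, actual):
    """Absolute prediction error as a percentage of the actual price."""
    if actual == 0:
        raise ValueError("actual price must be non-zero")
    return abs(predicted - actual) / abs(actual) * 100.0


# Hypothetical (predicted, actual) nightly prices.
scenarios = [(990.0, 1000.0), (770.0, 1000.0)]
errors = [percent_error(p, a) for p, a in scenarios]
```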

Tradeoff between R² and VIF: A very high R² can be misleading if multicollinearity is present, so balancing predictive power and variable independence was crucial.
Data cleaning is critical: Outlier removal and transformation dramatically improved model performance.
Practical impact: The final model gives Airbnb hosts a data-driven starting point to price their listings competitively while maximizing income.
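The R²/VIF tradeoff can be made concrete with the definition VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the other predictors. With only two predictors, R_j² is simply their squared Pearson correlation, which the sketch below uses. The data are hypothetical and the two-predictor simplification is an assumption made for brevity; Minitab computes VIFs across all predictors at once.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(xs, ys):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r2 = pearson_r(xs, ys) ** 2
    return 1.0 / (1.0 - r2)


# Hypothetical listing features.
beds = [1, 2, 2, 3, 4, 5]
bedrooms = [1, 1, 2, 2, 3, 4]                 # moves almost in step with beds
availability = [300, 45, 210, 90, 365, 120]   # largely unrelated to beds

vif_correlated = vif_two_predictors(beds, bedrooms)
vif_independent = vif_two_predictors(beds, availability)
```

Strongly related predictors such as beds and bedrooms produce a large VIF, while unrelated ones stay close to 1, the minimum possible value.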

Team members reflected on the experience as a comprehensive introduction to data science: learning to wrangle big datasets, interpret statistical outputs, use tools like Minitab, and appreciate the ethics of data handling (privacy, accuracy, and proper cleaning practices).

The relevant project documents are attached below. (Note: most, if not all, of the documents are in Spanish.)

FILES
