Tec de Monterrey, 2020

Project - Predictive Modeling for Airbnb Pricing in Mexico City

-Data Treatment, Data Modelling, Predictive Analytics

Image by Andrea Davis

GREAT DEALS WITH PREDICTIVE MODELS

This project tackled a practical, real-world data problem: helping a homeowner (Carina) determine the optimal nightly price to rent out her apartment on Airbnb. The team approached it as a full data analytics pipeline, from raw data collection and cleaning to building and validating a multiple linear regression model capable of predicting rental prices based on key features.


The project began with a Project Charter, which defined clear SMART objectives and a five-week plan. The team aimed to:

  • Build a clean, structured database by Week 3, removing irrelevant and inconsistent variables.

  • Use exploratory data analysis (EDA), including histograms, boxplots, and correlation matrices, to understand distributions and detect outliers.

  • Construct a predictive regression model with a coefficient of determination (R²) above 90%, using key features such as neighborhood, room type, number of beds, bathrooms, ratings, and host status.

The scope was limited to Mexico City properties, acknowledging that results could vary in other regions and that qualitative factors (like host personality) could not be captured quantitatively.


The raw dataset contained 72 variables and 19,180 samples, which were systematically reduced to 10 key predictors relevant to price:

  • Host status (superhost or not)

  • Neighborhood (coded from 1–16 based on desirability)

  • Room type (private, shared, entire home, hotel)

  • Number of guests, beds, bedrooms, and bathrooms

  • Availability throughout the year

  • Number of reviews in the last 12 months and overall rating

  • Instant bookability

The team:

  • Removed 62 irrelevant variables (e.g., host description text, amenities not affecting price).

  • Standardized categorical data, converting text fields to numeric codes.

  • Handled missing values by filling them with the median or mode, and removed outliers (e.g., luxury hotels skewing prices) to avoid bias.

  • Verified distributions through boxplots and histograms, identifying skewness that would later require transformation.
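
The cleaning steps above can be sketched in a few lines of Python. This is only an illustration, not the team's actual procedure: the project used Minitab, the source does not say which outlier rule was applied (the 1.5 × IQR fence below is an assumption), and the function names and sample values are hypothetical.

```python
from statistics import median, mode, quantiles


def fill_missing_numeric(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def fill_missing_categorical(values):
    """Replace None entries with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in values]


def drop_outliers(values, k=1.5):
    """Keep values within k * IQR of the quartiles (assumed rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical nightly prices (not project data); None marks a missing value.
prices = [500, 650, None, 700, 800, 900, 750, 25000]
filled_prices = fill_missing_numeric(prices)
cleaned_prices = drop_outliers(filled_prices)

# Hypothetical room types with one missing entry.
room_types = ["entire home", None, "private", "entire home"]
filled_room_types = fill_missing_categorical(room_types)
```

In this toy sample the single extreme price falls outside the upper fence and is dropped, while the ordinary listings are kept.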


Four regression models were built and iteratively refined in Minitab:

  1. Initial Model (with outliers):

    • Low predictive power (R² ≈ 9%).

    • High Variance Inflation Factor (VIF) for certain variables (e.g., “review_scores_rating”), suggesting multicollinearity.

    • Residual plots showed strong non-normality and clustering, confirming that the model was inadequate.

  2. Second Model (without outliers):

    • Applied Box-Cox transformation (λ = 0.0578).

    • Achieved R² ≈ 99.9% with approximately normal residuals, but VIF values remained too high, indicating that the predictors were strongly correlated with one another.

  3. Third Model (with constant added):

    • Reduced VIFs to acceptable levels (<10).

    • R² dropped to ~49%, but the residuals were well distributed and showed good linearity, indicating a more reliable though less “perfect” model.

  4. Final Model:

    • Removed statistically insignificant variables (p-value > 0.05) to improve parsimony.

    • Maintained R² ≈ 49% with balanced residuals and acceptable VIFs.

    • Produced a regression equation predicting nightly price as a function of the selected variables.
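
To make the Box-Cox step in the second model more concrete, here is a minimal Python sketch of the transformation and its inverse, using the reported λ = 0.0578. It assumes the textbook form (y^λ − 1)/λ; statistical packages differ on whether they apply this scaling or the simpler power form y^λ, and the source does not say which one was used. The function names and the sample price are hypothetical.

```python
import math

LAMBDA = 0.0578  # value reported for the second model


def box_cox(y, lam=LAMBDA):
    """Textbook Box-Cox transform of a strictly positive value."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def inverse_box_cox(z, lam=LAMBDA):
    """Map a transformed value back to the original price scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


# Hypothetical nightly price, used only to show the round trip.
price = 850.0
transformed = box_cox(price)
recovered = inverse_box_cox(transformed)
```

Because λ is close to zero, the transform behaves much like a natural logarithm, which compresses the long right tail typical of price data. A model fitted on the transformed scale produces predictions that must be passed through the inverse before they can be read as prices.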


The final model was validated using test scenarios:

  • In some cases, predictions deviated by up to 23%, showing room for improvement.

  • In others, the error margin was as low as 1%, suggesting the model is useful for initial price-setting decisions.

This variability was acknowledged as a limitation: the model should be used as a guideline, not a definitive pricing tool, because external sociocultural and seasonal factors are not captured in the dataset.
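
The deviations quoted above are percentage errors between predicted and actual prices. A minimal sketch of that calculation follows, assuming the error is measured relative to the actual price (the source does not state the exact formula). The function name and the prices are hypothetical and were chosen only to mirror the reported 1% and 23% cases.

```python
def percent_error(predicted, actual):
    """Absolute deviation of a prediction, as a percentage of the actual value."""
    if actual == 0:
        raise ValueError("actual price must be non-zero")
    return abs(predicted - actual) / abs(actual) * 100.0


# Hypothetical test scenarios as (predicted price, actual price) pairs.
scenarios = [(990.0, 1000.0), (770.0, 1000.0)]
errors = [percent_error(pred, act) for pred, act in scenarios]
```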


Key lessons from the project:

  • Tradeoff between R² and VIF: A very high R² can be misleading if multicollinearity is present, so balancing predictive power and variable independence was crucial.

  • Data cleaning is critical: Outlier removal and transformation dramatically improved model performance.

  • Practical impact: The final model gives Airbnb hosts a data-driven starting point to price their listings competitively while maximizing income.
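
The multicollinearity point in the first lesson can be illustrated with the definition VIF = 1 / (1 − R²), where R² comes from regressing one predictor on the others. The sketch below covers only the simplest case of two predictors, where that R² equals the squared Pearson correlation. The function names and the beds/guests values are hypothetical, not taken from the project data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_from_r_squared(r_squared):
    """Variance inflation factor for a predictor with the given R²."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R² must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


# Hypothetical listings: bed count and guest capacity tend to move together.
beds = [1, 1, 2, 2, 3, 4]
guests = [2, 2, 4, 3, 6, 8]

r = pearson_r(beds, guests)
vif_beds = vif_from_r_squared(r ** 2)
```

Under this definition, the threshold of 10 used for the third model corresponds to a predictor whose variance is 90% explained by the other predictors. Two variables as tightly linked as beds and guests in this toy sample exceed it easily.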

Team members reflected on the experience as a comprehensive introduction to data science: learning to wrangle big datasets, interpret statistical outputs, use tools like Minitab, and appreciate the ethics of data handling (privacy, accuracy, and proper cleaning practices).


The relevant project documents are attached below. (Note: most, if not all, of the documents are in Spanish.)
