Mubawab.ma Housing Scraper
Pricing: from $1.00 / 1,000 results
Scrapes Moroccan real estate listings from mubawab.ma and outputs a structured dataset ready for ML model training (price prediction, classification).
Developer: LIAICHI MUSTAPHA
Last modified: 15 days ago
The Moroccan Housing Dataset — an open-source Apify Actor that scrapes real estate listings from mubawab.ma and produces a flat, ML-ready dataset modelled after the classic California Housing dataset (Géron, Hands-On ML, Chapter 2).
Table of Contents
- What it does
- Output dataset
- Quick start
- Input configuration
- Apify Console output
- ML usage example
- Architecture
- Cities & property types covered
- Contributing
- License
What it does
Morocco's real estate market lacks structured, machine-readable public data. This actor closes that gap by crawling mubawab.ma — Morocco's largest property portal — and extracting every listing into a single CSV/JSON dataset suitable for:
- 🏠 Price prediction models (regression)
- 📍 Geo-spatial analysis by city and neighborhood
- 📊 Market trend dashboards
- 🤖 AI / LLM-powered property assistants
The scraper uses a two-phase Playwright crawl (search results → detail pages) and persists output through the Apify storage API so you can export CSV/JSON directly from the platform or via API with zero extra tooling.
Output dataset
Every scraped listing maps to one row with these fields:
| Field | Type | Description |
|---|---|---|
| `priceDh` | number \| null | Target variable — price in Moroccan Dirhams (MAD) |
| `pricePerM2` | number \| null | Derived: price ÷ surface (MAD/m²) |
| `surfaceM2` | number \| null | Living area in m² |
| `numRooms` | integer \| null | Bedrooms |
| `numBathrooms` | integer \| null | Bathrooms |
| `floor` | integer \| null | Floor (0 = ground / RDC) |
| `propertyType` | string \| null | `appartement`, `villa`, `maison`, `riad`, … |
| `standing` | string \| null | `economique`, `moyen_standing`, `haut_standing` |
| `state` | string \| null | `neuf`, `bon_etat`, `a_renover`, `en_cours_de_construction` |
| `city` | string \| null | Lowercase ASCII name, e.g. `casablanca` |
| `neighborhood` | string \| null | Sub-area within the city |
| `transactionType` | string \| null | `vente` or `location` |
| `url` | string | Direct link to the listing on mubawab.ma |
| `title` | string \| null | Raw listing title |
| `scrapedAt` | string | ISO-8601 scrape timestamp |
Sample record
{"priceDh": 1250000,"pricePerM2": 12500,"surfaceM2": 100,"numRooms": 3,"numBathrooms": 2,"floor": 3,"propertyType": "appartement","standing": "moyen_standing","state": "bon_etat","city": "casablanca","neighborhood": "maârif","transactionType": "vente","url": "https://www.mubawab.ma/fr/a/12345/appartement-a-vendre-casablanca","title": "Appartement à vendre à Maârif, Casablanca","scrapedAt": "2025-03-27T14:32:00.000Z"}
Quick start
Option A — Run on Apify (no setup needed)
- Open the actor on the Apify Store
- Click Try for free
- Configure inputs in the visual form
- Click Start → export results as CSV or JSON once the run completes
Option B — Run locally
Prerequisites: Node.js 20+, Apify CLI
```bash
# 1. Install the CLI
npm install -g apify-cli

# 2. Clone this repo
git clone https://github.com/MuLIAICHI/Mubawab-Housing-Scraper.git
cd Mubawab-Housing-Scraper

# 3. Install dependencies
npm install

# 4. Quick test — 10 listings only
apify run --input='{"maxListings": 10, "transactionType": "vente"}'

# 5. Full run — all 9 cities, up to 5 000 listings
apify run
```
Results are saved locally under `storage/datasets/mubawab-housing/`.
Option C — Deploy to your Apify account
```bash
apify login   # Enter your Apify API token
apify push    # Build & upload the actor
```
Then run and schedule from console.apify.com.
Input configuration
Configure the actor via the Apify Console form or by passing a JSON input:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `transactionType` | string | `"vente"` | `"vente"` · `"location"` · `"both"` |
| `cities` | string[] | (all 9 cities) | Filter to specific cities, e.g. `["casablanca", "rabat"]` |
| `propertyTypes` | string[] | 4 main types | `appartements` · `villas` · `maisons` · `riads` · `terrains` · `bureaux` · `commerces` |
| `maxListings` | integer | 5000 | Hard cap on detail pages scraped (0 = unlimited) |
| `maxConcurrency` | integer | 5 | Parallel browser tabs (max 20) |
| `startUrls` | array | `[]` | Override seed URLs; leave empty for auto-generation |
| `proxyConfiguration` | object | Apify Residential | Proxy settings — residential proxy is strongly recommended |
Example input
{"transactionType": "vente","cities": ["casablanca", "marrakech", "rabat"],"propertyTypes": ["appartements", "villas"],"maxListings": 1000,"maxConcurrency": 5,"proxyConfiguration": {"useApifyProxy": true,"apifyProxyGroups": ["RESIDENTIAL"]}}
Apify Console output
After a run completes, the Output tab in Apify Console shows four named links:
| Output | Description |
|---|---|
| Housing listings (Overview) | All scraped records in a table view (city, type, price, surface, rooms, URL) |
| ML-ready dataset | Same records restricted to the 12 ML feature columns — export this as CSV for model training |
| Run statistics | JSON with total listings, pages visited, null-rates per field, elapsed time |
| Debug HTML snapshots | HTML captured when a page could not be parsed — useful for debugging after site updates |
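The per-field null-rates in the Run statistics output measure how often a field could not be extracted. A minimal sketch of how that metric can be computed over exported records (not the actor's own code, just the same calculation in Python):

```python
def null_rates(records):
    """Fraction of records with a null (or missing) value, per field."""
    fields = {key for record in records for key in record}
    n = len(records)
    return {f: sum(1 for r in records if r.get(f) is None) / n for f in fields}

# Three toy records: standing is missing twice, priceDh once
sample = [
    {"priceDh": 1250000, "standing": "moyen_standing"},
    {"priceDh": None, "standing": None},
    {"priceDh": 890000, "standing": None},
]
rates = null_rates(sample)
print(rates["standing"])  # 2 of 3 records lack standing → 0.666...
```

High null-rates on optional fields like `standing` or `state` usually point at a selector that no longer matches the live site.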
ML usage example (Python)
import pandas as pdfrom sklearn.model_selection import train_test_splitfrom sklearn.ensemble import RandomForestRegressorfrom sklearn.metrics import r2_score, mean_absolute_error# 1. Load dataset exported from Apify as CSV (ML Dataset view)df = pd.read_csv("mubawab_dataset.csv")# 2. Drop rows missing the target variabledf = df.dropna(subset=["priceDh", "surfaceM2"])# 3. Encode categoricalsdf = pd.get_dummies(df, columns=["propertyType", "standing", "state", "city", "transactionType"])# 4. Feature engineering — Géron-style derived featuresdf["roomsPerM2"] = df["numRooms"] / df["surfaceM2"]feature_cols = [c for c in df.columns if c not in ["priceDh", "pricePerM2", "neighborhood", "url", "title", "scrapedAt"]]X = df[feature_cols].fillna(0)y = df["priceDh"]# 5. Train & evaluateX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)model = RandomForestRegressor(n_estimators=200, random_state=42)model.fit(X_train, y_train)y_pred = model.predict(X_test)print(f"R² : {r2_score(y_test, y_pred):.3f}")print(f"MAE : {mean_absolute_error(y_test, y_pred):,.0f} MAD")
Architecture
```
.
├── .actor/
│   ├── actor.json                   ← Actor metadata + schema references
│   ├── input_schema.json            ← Typed input form for Apify Console
│   ├── output_schema.json           ← Output tab links (dataset + KV store)
│   ├── dataset_schema.json          ← Field definitions + two table views
│   └── key_value_store_schema.json  ← KV store collections (stats / snapshots)
│
├── src/
│   ├── main.js      ← Entry point: reads input, seeds URLs, starts crawler
│   ├── router.js    ← Crawlee router with LISTING_PAGE + DETAIL_PAGE labels
│   ├── parsers/
│   │   ├── listingPage.js  ← Extracts listing URLs + next-page link from search results
│   │   └── detailPage.js   ← Extracts all 15 schema fields from a property detail page
│   └── utils/
│       └── normalize.js    ← Pure functions: parsePrice(), parseSurface(), normalizeCity()
│
├── Dockerfile      ← Apify Playwright image (Node.js 20 + Chromium)
├── package.json
└── README.md
```
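`normalize.js` holds pure cleaning helpers. Python equivalents sketching the kind of transformations involved (the exact behavior of the JS originals is an assumption; these mirror the field formats documented above):

```python
import re
import unicodedata

def parse_price(text):
    """'1 250 000 DH' -> 1250000; None when no digits are present."""
    digits = re.sub(r"[^\d]", "", text or "")
    return int(digits) if digits else None

def parse_surface(text):
    """'100 m²' -> 100.0; None when unparseable."""
    match = re.search(r"(\d+(?:[.,]\d+)?)", text or "")
    return float(match.group(1).replace(",", ".")) if match else None

def normalize_city(name):
    """'Fès' -> 'fes': lowercase ASCII, matching the dataset's city field."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    return ascii_name.lower().strip()

print(parse_price("1 250 000 DH"), parse_surface("100 m²"), normalize_city("Fès"))
```

Keeping these as pure functions is what makes the "unit tests for normalize.js" good-first-issue below straightforward.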
Crawl flow
```
main.js ──builds seed URLs──► LISTING_PAGE handler
                                      │
                ┌─────▼──────────────────────────┐
                │ Parse search result page       │
                │ Extract listing URLs           │
                │ Follow rel="next" pagination   │
                └─────┬──────────────────────────┘
                      │ enqueue detail URLs
                ┌─────▼──────────────────────────┐
                │ DETAIL_PAGE handler            │
                │ detailPage.js extracts fields  │
                │ normalize.js cleans values     │
                │ Actor.pushData() → dataset     │
                └────────────────────────────────┘
```
Key technical decisions
- Playwright (not Cheerio) — mubawab.ma is JS-rendered; a headless browser is required
- Multiple CSS selector fallbacks — the site uses different HTML structures for individual listings vs. project/ensemble listings
- Polite delays — 500–800 ms between requests to avoid rate-limiting
- Named dataset `mubawab-housing` — makes the output easy to find and retrieve via API
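The selector-fallback decision amounts to first-match-wins over an ordered selector list. A Python sketch of the idea, using a dict as a stand-in for `page.query_selector()` (the selector names are hypothetical, not the actor's real ones):

```python
def first_match(lookup, selectors):
    """Return the first selector's value that resolves to non-None text."""
    for selector in selectors:
        value = lookup(selector)
        if value is not None:
            return value
    return None

# Stub page: individual listings use one markup, project/ensemble
# listings another, so the first selector misses and the second hits.
page = {".price-box .value": None, ".project-price": "1 250 000 DH"}
price_text = first_match(page.get, [".price-box .value", ".project-price"])
print(price_text)  # falls through to the project-listing selector
```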
Cities & property types covered
Cities (default): Casablanca · Marrakech · Rabat · Agadir · Tanger · Fès · Meknès · Oujda · Tétouan
Property types: Appartements · Villas · Maisons · Riads · Terrains · Bureaux · Commerces
Pass any subset via the cities and propertyTypes input fields.
Proxy recommendation
mubawab.ma blocks datacenter IPs. Using Apify Residential Proxy (the default) is strongly recommended for production runs. A free Apify account includes a proxy trial.
Without a proxy, you will encounter CAPTCHAs and 403 errors.
Contributing
Contributions are welcome! Here is how to get started:
- Fork this repository
- Create a feature branch: `git checkout -b feat/your-feature`
- Make your changes and run a quick local test: `apify run --input='{"maxListings": 5}'`
- Open a Pull Request with a clear description of what changed and why
Good first issues
- Add support for additional Moroccan cities (`agadir`, `beni-mellal`, `laayoune`, …)
- Improve the null-rate for `standing` and `state` fields on project listings
- Add `listing_id` extraction from the URL slug
- Write unit tests for `normalize.js` (Jest or Vitest)
Please open an issue before starting large changes.
License
See LICENSE · © 2025 Mustapha LIAICHI
Built with Crawlee · Playwright · Apify SDK
