# Introduction {#intro}
<!-- You can label chapter and section titles using `{#label}` after them, e.g., we can reference Chapter \@ref(intro). If you do not manually label them, there will be automatic labels anyway, e.g., Chapter \@ref(methods).-->
Real estate websites advertise properties for sale and for rent and give users the means to browse and continuously track new listings. Moreover, according to Google, 9 out of 10 homebuyers have used such portals as the first "broker" in their search, and this share has surged over the last few years. However, these portals still fail to provide investors, buyers and tenants with historical data and with price comparisons relative to the specific market context. The gap matters all the more because real estate data are decentralized: they are spread across multiple websites, each with its own structure and way of presenting information. Having multiple sources also affects data quality, since missing values, redundancies, inconsistencies and noise become more likely. Hence the need for an automated and structured mechanism to collect data and keep track of the market.
The most trusted player in Italy for both sales and rentals is immobiliare.it, and the collection mechanism, i.e. scraping, is built on top of it.
Scraping functions (\@ref(scraping)) are essentially interpreters of a website's source code that gather data and arrange them in a structured way. URLs are the way websites organize content, and they are also the only argument these functions require. Therefore, if there is a scheme, i.e. a further function, that composes URLs at will so that only certain content is displayed, then scraping can be called on those URLs to extract the desired data according to the parameters of the composing function. Scraping is handled in R by the `rvest` package, which is inspired by popular Python scraping libraries such as Beautiful Soup. `rvest` follows a simple workflow: the HTML source code is first parsed, then the requested data are gathered from a given CSS location. Since some important information is nested in several locations of the source code, e.g. sometimes in hidden JSON objects, custom search strategies through the source code are adopted. Scraping can harm websites through insistent requests to the hosting servers; this is strongly discouraged, and a way to balance the interests of both parties is proposed. Accordingly, a web _etiquette_ is observed, taking inspiration from the `polite` R package and revisiting its concepts from a more consumer-side perspective. This is done through delayed request rates, rotating User Agents (web IDs) and failure handlers (functions that can deal with failure, in the spirit of `tryCatch`) provided by the R `purrr` stack. Scraping functions can take a while to run, so parallelism is required, i.e. scraping asynchronously. The selected back-end embodies a programming concept, the future, through the `future` R package, which distinguishes between "unresolved" and "resolved" tasks. Tasks are sent in an unresolved state to workers running as background sessions and are then resolved simultaneously.
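
The workflow described above (compose a URL, parse, extract at a CSS location, handle failures) can be sketched as follows. This is a minimal illustration, not the implementation discussed in section \@ref(scraping): the base URL, the `pag` query parameter, the `.listing-price` selector and all function names are hypothetical, and the demonstration runs on an inline HTML fragment so that no request is sent to any server.

```r
library(rvest)
library(purrr)

# Hypothetical URL composer: the base URL and the `pag` query parameter are
# illustrative assumptions, not the scheme used by any real website.
compose_url <- function(city, page = 1,
                        base = "https://www.example.com/affitto-case") {
  sprintf("%s/%s/?pag=%d", base, city, as.integer(page))
}

# Gather the text found at a CSS location of an already parsed page.
extract_prices <- function(page, css = ".listing-price") {
  html_text2(html_elements(page, css))
}

# Failure handler: return NA instead of stopping when extraction fails.
safe_extract <- possibly(extract_prices, otherwise = NA_character_)

# Fetch with a fixed delay before each request (defined but not called here,
# so the example runs without network access).
polite_fetch <- function(url, delay = 2) {
  Sys.sleep(delay)
  read_html(url)
}

# Offline demonstration on an inline HTML fragment.
demo_page <- minimal_html(
  '<p class="listing-price">1200</p><p class="listing-price">950</p>'
)
urls   <- map_chr(1:2, function(p) compose_url("milano", p))
prices <- safe_extract(demo_page)  # text of the two matching nodes
failed <- safe_extract(NULL)       # extraction error is absorbed into NA
```

Wrapping the extractor with `possibly()` keeps a long run alive when a single page is malformed, which is the role the failure handlers play in the text.
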
So far this may be sufficient for personal use, but the idea is to propose production-grade open source software that interacts with different stakeholders. The list of _desiderata_ also includes a service that shares information frictionlessly, i.e. an API (\@ref(Infrastructure)), and that is relatively cheap, portable and highly functional, with authentication. As a result, the proposed technology stack serves a RESTful API through the `plumber` framework, exposing 2 endpoints, each of which calls the parallel scraping functions set out in section \@ref(scraping). The precautions taken concern sanitization of user inputs, anti-DoS strategies and log monitoring. The software environment is containerized with Docker and then _composed_ with Docker Compose, together with a further container housing an NGINX proxy server for load balancing and authentication. SSL certificates enable HTTPS communication, which is nowadays a compulsory standard for any service. An AWS free tier EC2 server hosts the whole system, and its IP is made Elastic. Furthermore, software CI/CD is automated by simply connecting cloud services that trigger sequential builds. Each technology cited is addressed individually in its dedicated chapter.
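
As an illustration of input sanitization in a `plumber` endpoint, the sketch below validates query parameters before any scraping would start. The route `/scrape`, the argument names and the cap of 10 pages are assumptions made for this example, not the endpoints defined in \@ref(Infrastructure); the scraping call itself is replaced by a placeholder.

```r
# Hypothetical handler for a plumber endpoint; route, arguments and limits
# are illustrative assumptions.

#* Return rental listings for a city
#* @param city City slug, letters and hyphens only
#* @param npages Number of result pages to scrape (1 to 10)
#* @get /scrape
scrape_handler <- function(city = "milano", npages = 1) {
  # Query-string values arrive as character strings, so validate before use.
  city <- tolower(trimws(as.character(city)))
  if (length(city) != 1 || !grepl("^[a-z-]+$", city)) {
    stop("`city` must contain only letters and hyphens")
  }
  # Capping the number of pages bounds the work a single request can trigger.
  npages <- suppressWarnings(as.integer(npages))
  if (length(npages) != 1 || is.na(npages) || npages < 1 || npages > 10) {
    stop("`npages` must be an integer between 1 and 10")
  }
  # Placeholder for the call to the parallel scraping functions.
  list(city = city, npages = npages)
}
```

Saved in a file such as `plumber.R`, a handler like this could be served locally with `plumber::pr_run(plumber::pr("plumber.R"), port = 8000)`.
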
That residential location affects house prices as well as rents is axiomatic, and it is usually epitomized by the catch phrase "location, location, location". Hedonic price modelling provides the economic theoretical foundation that relates property value or rent to each of the individual house characteristics. The literature has widely shown that the various models incorporating spatial location in different settings are outperformed only by spatio-temporal ones.
Hedonic Price Models (HPM) help, through theory, to set up the linear predictor; the critical part of these models, however, is always _estimation_.
As a matter of fact, traditional spatial Bayesian methods are generally very flexible and allow a certain degree of subjectivity, but at the same time they are computationally intensive and slow. The computational burden refers in particular to the inefficiency of linear algebra operations on large dense covariance matrices, whose cost scales as $\mathcal{O}(n^3)$. Unfortunately, this is even worse in spatial contexts, where house price dynamics may be regarded as a stochastic process indexed on a continuous surface, i.e. a Gaussian Process, whose covariance matrix has dimension $n \times n$ for $n$ observations.
Integrated Nested Laplace Approximation (INLA, \@ref(inla)) lends a hand by providing an alternative and faster deterministic algorithm for a special class of models called Latent Gaussian Models (LGM). In short, _INLA_ combines Laplace approximations with numerical methods for sparse matrices to speed up computation. The under-the-hood work in INLA revolves around three main statistical and mathematical concepts: LGM, Gaussian Markov Random Fields (GMRF) and the Laplace approximation. Latent Gaussian Models are a class of models for which three further elements need to be specified: the likelihood for a given set of observations, the Gaussian Markov Random Field (GMRF, where the parameters live) and the prior distributions. Two of the three are familiar; the GMRF is itself a fairly simple structure, which distributes its arguments (in this case all the latent effects) according to a multivariate Normal density with zero mean and a given precision matrix, together with a further conditional independence assumption (hence the word "Markov"). The Markov property of a GMRF is encoded in its precision matrix, which is therefore very sparse; most of the computational savings come from here. Once the model is set up, the ultimate Bayesian goal is to find the posterior distributions of the parameters, and this is not going to happen by estimating the whole joint posterior distribution of the three aforementioned elements. Instead, INLA approximates the posterior marginals one at a time, starting from the easiest and most coherent Gaussian approximation and then quickly computing all the others according to special grid strategies.
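
In symbols, the three elements of an LGM can be sketched as a hierarchy. The notation is assumed here for illustration and is not taken from the text: $\boldsymbol{y}$ denotes the $n$ observations, $\boldsymbol{x}$ the latent field, $\boldsymbol{\theta}$ the hyperparameters and $\boldsymbol{Q}$ the precision matrix.

$$
\begin{aligned}
\boldsymbol{y} \mid \boldsymbol{x}, \boldsymbol{\theta} &\sim \prod_{i=1}^{n} \pi(y_i \mid x_i, \boldsymbol{\theta}) \\
\boldsymbol{x} \mid \boldsymbol{\theta} &\sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{Q}(\boldsymbol{\theta})^{-1}\right) \\
\boldsymbol{\theta} &\sim \pi(\boldsymbol{\theta})
\end{aligned}
$$

Under this notation the Markov property reads $x_i \perp x_j \mid \boldsymbol{x}_{-ij} \iff Q_{ij} = 0$, which is why $\boldsymbol{Q}$ is sparse. The quantities of interest are the posterior marginals rather than the joint posterior:

$$
\pi(x_i \mid \boldsymbol{y}) = \int \pi(x_i \mid \boldsymbol{\theta}, \boldsymbol{y}) \, \pi(\boldsymbol{\theta} \mid \boldsymbol{y}) \, d\boldsymbol{\theta},
\qquad
\pi(\theta_k \mid \boldsymbol{y}) = \int \pi(\boldsymbol{\theta} \mid \boldsymbol{y}) \, d\boldsymbol{\theta}_{-k}
$$
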
The spatial component in the extracted data is assumed to be a discrete realization of an unobserved, continuous Gaussian Process (GP), fully characterized by a mean structure and a covariance matrix. Two main assumptions are made on the Gaussian Process, stationarity and isotropy, which allow a versatile covariance function, the Matérn, to be used. A GMRF representation of a GP with Matérn covariance can be obtained from the solution of a Stochastic Partial Differential Equation (SPDE) through a basis function representation. This is achieved by triangulating the spatial domain around the discrete realizations of the process. The benefit, once again, of switching from the GP to the GMRF is that the latter enjoys strong analytical properties. The model is then applied to Milan real estate rental data and tested with R-INLA, and the resulting posterior distributions of the parameters are reported. Prior choice is a critical step in the Bayesian decision process, since priors inject subjectivity into the analysis. Penalized complexity priors are a set of guidelines for building proper priors, and in this setting they are shown to behave well. Finally, the model is fitted and cross-validated, and spatial prediction on a grid offers intuition on house prices at locations not yet observed.
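
For reference, the Matérn covariance and its SPDE link can be written as follows. The symbols are assumptions introduced for this sketch: $\xi(s)$ is the spatial process, $\sigma^2$ its marginal variance, $\kappa$ a scale parameter, $\nu$ the smoothness, $K_\nu$ the modified Bessel function of the second kind, $d$ the dimension of the domain and $\mathcal{W}(s)$ Gaussian white noise.

$$
\operatorname{Cov}\left(\xi(s_i), \xi(s_j)\right) = \frac{\sigma^2}{2^{\nu - 1}\Gamma(\nu)} \left(\kappa \lVert s_i - s_j \rVert\right)^{\nu} K_{\nu}\left(\kappa \lVert s_i - s_j \rVert\right)
$$

A process with this covariance is the stationary solution of the SPDE

$$
\left(\kappa^2 - \Delta\right)^{\alpha / 2} \left(\tau \, \xi(s)\right) = \mathcal{W}(s), \qquad \alpha = \nu + d/2,
$$

and the basis function representation over the $G$ vertices of the triangulation is

$$
\xi(s) = \sum_{g=1}^{G} \psi_g(s) \, w_g, \qquad \boldsymbol{w} \sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{Q}^{-1}\right),
$$

where the $\psi_g$ are piecewise linear basis functions and the weights $\boldsymbol{w}$ form a GMRF with sparse precision matrix $\boldsymbol{Q}$.
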