Povcalnet Stata command, fresh from the oven


The povcalnet command, which we developed with colleagues from DEC at the World Bank (Andrés Castañeda, Christoph Lakner, Espen B. Prydz, Ruoxuan Wu and Qinghua Zhao), has just been officially released.

For those of you who don’t know povcalnet yet, it’s a tool that lets us calculate poverty and inequality numbers in real time for most countries and economies in the world. The Stata command works by bridging the information on the World Bank PovcalNet website and your computer through an open API. You can check the working paper here.

So if you have Stata and would like to try it out, you can install it by typing:

ssc install povcalnet

Afro-descendants in Latin America


“About one in four Latin Americans self-identify as Afro-descendants today. They comprise a highly heterogeneous population and are unevenly distributed across the region, but share a common history of displacement and exclusion. Despite significant gains over the past decade, Afro-descendants still are overrepresented among the poor and are underrepresented in decision-making positions, both in the private and the public sector.”

Check out my report on Afro-descendants in Latin America, written with my World Bank colleagues Carolina Diaz-Bonilla, Germán Freire, Steven Schwartz and Flavia Carbonari.

The report offers an overview of Afro-descendants in LAC from different perspectives: a historical and sociological view of the origins, recognition and identification of the population; an economic and social view of their agency, their political representation, and their poverty and deprivation indicators; and, finally, a public policy view of what is on the agenda, which policies can catalyse development and what has worked.

If you’re curious to know more about Afro-descendants in Latin America, check out these dynamic infographics developed by our team.

These resources are also available in Spanish (visualizaciones, libro) and Portuguese (livro).

The figures of the week

I followed the format of FiveThirtyEight’s Significant Digits, which presents the most important numbers from the day’s news.


culture

7 years

is how long it would take me to visit all the museums and similar institutions in Italy, seeing two a day for two hours each, not counting travel (4,976 sites ÷ 2 a day ≈ 2,488 days, roughly 6.8 years).

There are 4,976 museums and similar institutions, of which 4,158 are museums, galleries or collections; 282 archaeological areas or parks; and 536 monuments or monumental complexes. [Istat]


politics

3.13%

is the share of the Italian population that follows Matteo Salvini on Facebook (assuming all followers are unique Italians), or 1.9 million people. By comparison, Matteo Renzi and Luigi Di Maio each have about 1 million followers, or 1.6% of the population. [Facebook]


science

82.7

is life expectancy in Italy (2000-2015). We are second in the European Union, after Spain, and sixth in the world, after Japan, Switzerland, Spain, Singapore and Australia. [WHO]


economy

€37,607.26

is the total that every single Italian would have to pay to settle the overall debt of the public administration. [Repubblica]


sport

21

is the number of goals Italy scored in the first round of the European qualifiers for Russia 2018. By comparison, Germany and Belgium scored 43 goals each. Italy will go to the play-off round, to be played in November. [FIFA]

My experience connecting R with MySQL to work with large files

For writing this post I borrowed heavily from:

  • Several Stack Overflow threads, including http://stackoverflow.com/questions/4785933/adding-rmysql-package-to-r-fails
  • Some basic manuals on MySQL available online

R is an impressive tool for analyzing data, but it falls a little short when we want to work with large databases. This is because the data is loaded entirely into RAM, which imposes major hardware constraints. In contrast, I’ve seen Stata handle big databases remarkably well for regressions and general statistical analysis. In general, using reasonably simple functions, you can expect R to behave well with datasets that occupy at most 60%-70% of your total RAM.
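A rough way to check this before loading a file (a sketch; the file name is a placeholder and memory.limit() is Windows-only):

file_size_mb <- file.info("bigdata.csv")$size / 1024^2   # size of the raw file in MB (placeholder name)
memory_mb <- memory.limit()                              # Windows-only: memory available to R, in MB
file_size_mb / memory_mb                                 # much above 0.6-0.7 and you are likely in trouble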

Now, there are several packages that let you work on large databases with methods that get around the RAM limitation; these include the ff package (http://ff.r-forge.r-project.org/bit&ff2.1-2_WU_Vienna2010.pdf), as well as more robust and specialized software such as Revolution Analytics (http://www.revolutionanalytics.com/).

However, one of the more elegant solutions in certain applications is to let MySQL manage the large database and pull just enough of it into R through queries. This not only exploits the capabilities of a native relational database manager like MySQL, but also reserves RAM for the data-processing tasks in R. Several packages explore these possibilities, but here we concentrate on RMySQL.

The RMySQL package, which connects R with MySQL, cannot be easily installed from a CRAN repository on Windows (if you’re on Mac or Linux there is no problem). After searching several forums for a simple solution, I found one that works on the computer I use, whose configuration is as follows:

  • Operating System: Windows 7 64-bit (remember that a 32-bit OS inherently limits file sizes, so it would not let you work with files larger than 4 GB)
  • Database Manager: MySQL 5.5 64-bit
  • R 2.15.1 “Roasted Marshmallows” 64-bit

We then proceed to make the required changes to the system (remember that the paths on your system may vary because of the language or drive letter):

1. Create the paths in the system variables:

What is it? System Variables are values in the Windows operating system that programs refer to in order to carry out certain tasks. For example, in our case, adding the following paths tells any program that asks Windows for “Path” the location of certain folders.
How to proceed? Click Start > This PC (right click) > Properties > Advanced System Settings > Advanced tab > Environment Variables.
Here we make sure that the system variable named Path contains the following paths:
C:\Program Files\MySQL\MySQL Server 5.5\bin , C:\Program Files\R\R-2.15.1\bin\x64

If they are not already there, append them to the existing value.
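A quick way to verify the change from within a fresh R session (R reads the environment when it starts):

Sys.which("mysql")                               # should return the full path to mysql.exe
grepl("MySQL Server 5.5", Sys.getenv("PATH"))    # TRUE if the MySQL bin folder is on the Path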

2. Copy the MySQL libraries to a place known to RMySQL

What is it? A library contains relevant information that other programs refer to when carrying out a task. For example, say you are a program that operates an oven, and a user tells you to bake a cake. You would then go to the “cake” library and get the functions related to temperature, cooking times and other information needed to bake the cake.

In this sense, copying these libraries to the path we just added simply allows R to find the “cookbook” with the information it needs to connect to MySQL.

How to proceed?

  • Copy libmysql.dll from C:\Program Files\MySQL\MySQL Server 5.5\lib to C:\Program Files\MySQL\MySQL Server 5.5\bin.
  • Copy libmysql.dll and libmysql.lib from C:\Program Files\MySQL\MySQL Server 5.5\lib to C:\Program Files\MySQL\MySQL Server 5.5\lib\opt. The latter folder must be created if it’s not there.
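To double-check the copies, assuming the default install location (adjust drive letter and language as needed):

file.exists("C:/Program Files/MySQL/MySQL Server 5.5/bin/libmysql.dll")
file.exists("C:/Program Files/MySQL/MySQL Server 5.5/lib/opt/libmysql.dll")
file.exists("C:/Program Files/MySQL/MySQL Server 5.5/lib/opt/libmysql.lib")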

3. Generate the R environment file

What is it? We need to tell R where the MySQL files live.

How to proceed?

We create a text file called “Renviron.site” in C:\Program Files\R\R-2.15.1\etc, containing the following:

MYSQL_HOME=C:/Program Files/MySQL/MySQL Server 5.5
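After restarting R, you can confirm the file was picked up:

Sys.getenv("MYSQL_HOME")   # should print C:/Program Files/MySQL/MySQL Server 5.5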

4. Install RMySQL 0.8 and DBI

Finally, we install the needed packages in R:
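Something along these lines should do it (a sketch; at the time, RMySQL on Windows typically had to be built from source against the MySQL libraries we just configured):

install.packages("DBI")
install.packages("RMySQL", type = "source")   # compiled locally, using the MYSQL_HOME configured above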

That’s all! You can now use the RMySQL package.

A simple query from R:


library("RMySQL")
driver <- dbDriver("MySQL")                          # load the MySQL driver
con <- dbConnect(driver, host = "localhost", dbname = "MyDatabase",
                 username = "root", password = "pass")   # open the connection to the database
query <- dbGetQuery(con, statement = "SELECT * FROM mytable WHERE cusno = 2")   # run a query; results come back as a data frame
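Since the whole point is working with large files, here is a rough sketch of pulling a big result set in chunks using the standard DBI functions (the table name is a placeholder):

res <- dbSendQuery(con, "SELECT * FROM mytable")   # send the query without fetching everything
while (!dbHasCompleted(res)) {
  chunk <- fetch(res, n = 50000)                   # bring 50,000 rows into RAM at a time
  # ...process or aggregate the chunk here...
}
dbClearResult(res)
dbDisconnect(con)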

For more information on the RMySQL package and how to use it: http://cran.r-project.org/web/packages/RMySQL/RMySQL.pdf

You may also want to take a look at these websites for more information:

Using ggplot for presenting Dynamic Linear Models (DLM) in R

As you may know, Dynamic Linear Models, or more generally State Space Models, are a class of models for non-stationary time series. Given their flexibility, they can be used to model trend, seasonality, AR components and even multivariate, correlated time series.

The methodology draws from Bayesian statistics, which means that the data itself helps determine the parameters of the distributions, and that new information “renews” the process of updating those parameters. If you are interested in how the math and statistics work in this kind of model, I suggest the book by Petris, Petrone and Campagnoli (shown below) or presentations like this one.
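For reference, the two models used below can be written in state-space form (standard textbook notation). The Local Level model treats the series as a random walk observed with noise:

$$y_t = \mu_t + v_t, \qquad v_t \sim N(0, V)$$
$$\mu_t = \mu_{t-1} + w_t, \qquad w_t \sim N(0, W)$$

The Linear Growth model adds a slowly evolving slope $\beta_t$ to the level equation:

$$\mu_t = \mu_{t-1} + \beta_{t-1} + w_{1,t}, \qquad \beta_t = \beta_{t-1} + w_{2,t}$$

These correspond to dlmModPoly(order=1) and dlmModPoly(order=2) in the dlm package.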

Now, when presenting time series and their forecasts, I like to plot them with ggplot. So I wrote a function that uses Local Level and Linear Growth models to forecast the behavior of a given time series. For this purpose I borrowed heavily from:

The idea is the following: the file OilProduction.csv contains Colombian oil production, in average bbl/day, for every month from 1993 to May 2013. This is a non-stationary time series, even after double differencing.

First, we create a time series object with the Oil Production file.


Oil.Production <- read.table("Oil Production.csv", header=T, quote="\"")   # read the raw production figures
Oil.Production <- ts(Oil.Production, start=c(1993,1), freq=12)             # monthly time series starting January 1993
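As a quick check of that claim, the tseries package (loaded below anyway) provides the augmented Dickey-Fuller test; whether the unit-root null is rejected of course depends on the actual data:

library("tseries")
prod <- as.numeric(Oil.Production)           # plain numeric vector of the monthly values
adf.test(prod)                               # unit-root test on the levels
adf.test(diff(prod, differences = 2))        # and on the twice-differenced series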

The function ForecastLL() generates the following graph:

[Plot: Local Level forecast of the oil production series]

library("reshape")
library("dlm")
library("forecast")
library("tseries")
library("ggplot2")
ForecastLL <- function(ts_object=Oil.Production, n.ahead=12, CI=.95, MLEC="Y", dV=0, dW=0){

 #Construct the function and find the MLE (if necessary)
 if(MLEC=="Y"){
 buildFunction<-function(beta){dlmModPoly(order=1,dV=exp(beta[1]),dW=exp(beta[2]))}
 fit<-dlmMLE(ts_object, rep(1,2),buildFunction)
 fit$convergence
 vu<-unlist(buildFunction(fit$par)[c("V","W")])
 vu=as.data.frame(vu)
 dV=as.numeric(vu[1,1])
 dW=as.numeric(vu[2,1])
 }

 ###Model
 mod <- dlmModPoly(order=1, dV, dW, m0=ts_object[1])   # local level model, started at the first observation
 modFilt <- dlmFilter(ts_object, mod)

 modFore <- dlmForecast(modFilt,n.ahead)
 alpha <- 1-CI
 Qse <- sqrt(unlist(modFore$Q))
 Foreca<-modFore$f
 Forecal<-modFore$f - qnorm(1-alpha/2)*Qse
 Forecau<-modFore$f + qnorm(1-alpha/2)*Qse

 for_values<-data.frame(time=round(time(Foreca), 3), value_forecast=as.data.frame(Foreca), dev=as.data.frame(Forecau)-as.data.frame(Foreca))
 actual_values<-data.frame(time=round(time(ts_object), 3), Actual=c(ts_object))
 fitted_values<-data.frame(time=round(time(modFilt$f), 3), value_fitted=as.data.frame(modFilt$f))

 graphset<-merge(actual_values, fitted_values, by='time', all=FALSE)
 graphset<-merge(graphset, for_values, all=T, by='time')
 graphset.melt<-melt(graphset[, c('time', 'Actual', 'Series.1', 'Oilproduction')], id='time')

 graphset.melt <- graphset.melt[complete.cases(graphset.melt),]   # drop rows with NAs before plotting
 p <- ggplot(graphset.melt, aes(x=time, y=value)) +
   geom_line(aes(colour=variable), size=1) +
   xlab('Time') + ylab('Value') +
   theme(legend.position="bottom") +
   labs(title=paste("Local Level Forecasts for ", as.character(n.ahead), " periods ahead")) +
   scale_colour_hue('Legend', breaks=levels(graphset.melt$variable), labels=c('Actual', 'Forecast', 'Fitted Values')) +
   geom_vline(xintercept=max(actual_values$time), lty=2) +
   geom_ribbon(data=graphset, aes(x=time, y=Series.1, ymin=Series.1-Series.1.1, ymax=Series.1+Series.1.1), alpha=.2, fill="green")
 return(p)
}

Notice the function arguments:

  • ts_object= the time series object
  • n.ahead= number of periods ahead for the forecast
  • CI=Confidence interval
  • MLEC= If Y, calculates the initial values of V and W through MLE
  • dV, dW = the observation and state variances, used when MLEC is not "Y"

Refer to the dlm package documentation for more information.
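As a sketch of how the function might be called (the dV/dW values in the second call are arbitrary placeholders, not estimates):

# Let dlmMLE estimate the variances (the default behaviour):
ForecastLL(Oil.Production, n.ahead = 12, CI = .95)

# Or skip the MLE step and pass the variances yourself (placeholder values):
ForecastLL(Oil.Production, n.ahead = 24, MLEC = "N", dV = 1e6, dW = 1e4)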

The function ForecastLG(), which uses a Linear Growth model, generates the following graph:

[Plot: Linear Growth forecast of the oil production series]


ForecastLG <- function(ts_object=Oil.Production, n.ahead=12, CI=.95, MLEC="Y", vu=0){

 #Construct the function and find the MLE (if necessary)
 if(MLEC=="Y"){
 buildFunction<-function(beta){dlmModPoly(order=2,dV=exp(beta[1]),dW=c(exp(beta[2]),exp(beta[3])))}

 fit <- dlmMLE(ts_object, rep(2,3), buildFunction)   # MLE of the variances for the order-2 model

 fit$convergence
 vu<-unlist(buildFunction(fit$par)[c("V","W")])
 vu=as.data.frame(vu)
 }

 ###Model
 mod <- dlmModPoly(dV = vu[1,1], dW = c(vu[2,1], vu[5,1]))
 modFilt <- dlmFilter(ts_object, mod)

 modFore <- dlmForecast(modFilt,n.ahead)
 alpha <- 1-CI
 Qse <- sqrt(unlist(modFore$Q))
 Foreca<-modFore$f
 Forecal<-modFore$f - qnorm(1-alpha/2)*Qse
 Forecau<-modFore$f + qnorm(1-alpha/2)*Qse

 #Graph

 for_values<-data.frame(time=round(time(Foreca), 3), value_forecast=as.data.frame(Foreca), dev=as.data.frame(Forecau)-as.data.frame(Foreca))
 actual_values<-data.frame(time=round(time(ts_object), 3), Actual=c(ts_object))
 fitted_values<-data.frame(time=round(time(modFilt$f), 3), value_fitted=as.data.frame(modFilt$f))

 graphset<-merge(actual_values, fitted_values, by='time', all=T)
 graphset<-merge(graphset, for_values, all=T, by='time')

 graphset.melt<-melt(graphset[, c('time', 'Actual', 'Series.1', 'Oilproduction')], id='time')

 graphset.melt <- graphset.melt[complete.cases(graphset.melt),]   # drop rows with NAs before plotting
 p <- ggplot(graphset.melt, aes(x=time, y=value)) +
   geom_line(aes(colour=variable), size=1) +
   xlab('Time') + ylab('Value') +
   theme(legend.position="bottom") +
   labs(title=paste("Linear Growth Forecasts for ", as.character(n.ahead), " periods ahead")) +
   scale_colour_hue('Legend', breaks=levels(graphset.melt$variable), labels=c('Actual', 'Forecast', 'Fitted Values')) +
   geom_vline(xintercept=max(actual_values$time), lty=2) +
   geom_ribbon(data=graphset, aes(x=time, y=Series.1, ymin=Series.1-Series.1.1, ymax=Series.1+Series.1.1), alpha=.2, fill="green")
 return(p)

}

Notice that in this case vu must be a data frame containing the variance values to pass to the dlm filter. As with the previous function, this is not required if the parameters are estimated by MLE.
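For instance, a sketch of both ways of calling it (the variance values are placeholders; the row layout of vu mirrors the unlisted V/W object built in the MLE branch, with V in row 1 and the two diagonal elements of W in rows 2 and 5):

# MLE branch (default):
ForecastLG(Oil.Production, n.ahead = 12)

# Passing the variances directly as a one-column data frame (placeholder values):
vu <- data.frame(vu = c(1e6, 1e4, 0, 0, 1e2))
ForecastLG(Oil.Production, n.ahead = 12, MLEC = "N", vu = vu)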

20137 – Advanced Statistics

I realize that this is probably one of the most difficult courses in our program (ESS, Bocconi). In fact, I invested a great deal of time studying for it.

Here you can find some materials I put together while taking the Advanced Statistics course with Prof. Marco Bonetti during the fall semester of 2012:

-My handwritten notes (161 MB) taken with Notability on my iPad: Here

-A review of distributions that you should know by heart: Distributions

-A review of the most important formulas that you should know by heart (also): Important Formulas for Statistics

If you want to know how I take notes on my iPad, visit my post Using Notability as your everyday notebook.

Using Notability as your everyday notebook

I never quite understood why Apple used the word “revolution” and its variants so many times when talking about the iPad and the iPhone. However, after I was given an iPad for Christmas 2011, I was surprised to see how well made, versatile and dynamic the device is. Moreover, after discovering several useful apps designed especially for education, I must admit that it has replaced many of the tools I originally used as a student. For example, PDF readers and annotators are just awesome when you need to review a paper; the agenda is well designed and syncs automatically with the cloud; there’s even an official app from my university (Bocconi) that keeps me updated about my classrooms, exams and university communications.

But none of these “features” is a novelty (compared to what I can do with a browser or a normal PC). That’s why the really striking feature of the iPad, for me, was the capacitive screen in conjunction with a note-taking app. I originally got this idea from a good friend, who was already doing this when I got my iPad. He recommended some note-taking apps, but I finally decided to go for Notability.

Pretty much all note-taking applications follow the same simple idea: you write on the screen with a capacitive pen as if it were a big piece of paper, and the app lets you change the style of the pen, change pages, add a background, etc. Many apps also offer automatic uploading to the cloud, voice recording and even conversion of the notes into PDFs. But what really makes Notability shine is that, apart from all these features, it offers a seamless handwriting experience: you zoom into one spot and write in big letters, and they appear as normal (smaller) handwriting when you zoom out (no need to buy an additional expensive pen), and it scrolls automatically when you run out of space. It’s simple, easy, and makes the transition between physical and virtual notebooks smooth and quick.

Here are some examples of handwritten notes:

This note is from my Time Series Analysis class; download the whole PDF: Lesson Mar 12, 2013

This is a note taken in my class of “Comparative Financial Systems”; to see the whole PDF: Lesson 25-feb-2013

Now, I must point out some concerns. First, as with any technology, this will never be free of bugs; although Notability is quite stable, don’t expect 100% reliability. This doesn’t mean it will hang every 3 minutes; in fact, in my almost 2 years of usage the software has frozen only a couple of times (and it promptly saved my work, so I didn’t lose anything I had written). Second, I still find it hard to do math exercises on it, especially when there’s a lot of algebra and I need to flip back and forth between pages constantly; in that case I opt for traditional pencil and paper. Third, technology is moving fast and new devices are being designed to enhance the virtual note-taking experience, e.g. the Sony 13.3-inch e-ink notebook, soon to be launched in Japan and already being trialled at three Japanese universities, so if you are not an early adopter, I would advise you to wait.

Resources:

-Download Notability here.

-A good pen (although somewhat expensive): Adonit Jot Pro