Michael Tsikerdekis

What is the probability of getting an academic job?

2017-04-16T00:00:00-07:00

It is an interesting thought experiment that can be dissected in several ways depending on how it is addressed.

So, I figured why not just create a Shiny R application that builds estimates depending on the input. The current formula is a crude estimation: it assumes a population of N positions, each of which becomes available with an independent probability p, and a competing pool of w applicants from which hires are chosen at random. For CS and CE jobs in the US and Canada, the estimate based on CRA's survey data is around 50%.
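The formula used by the app is not reproduced in this post, so the following is only a minimal Python sketch of one way to read the description above, not the app's actual code. It assumes that the number of openings K follows a Binomial(N, p) distribution and that openings are filled by drawing uniformly from the w applicants, so one applicant's chance is E[min(K, w)] / w. The function name and the example numbers are illustrative.

```python
from math import comb


def job_probability(n_positions, p_open, n_applicants):
    """Chance that one applicant is hired, under the assumptions stated above."""
    if n_applicants <= 0:
        raise ValueError("n_applicants must be positive")
    expected_filled = 0.0
    for k in range(n_positions + 1):
        # Probability that exactly k of the N positions open up (binomial)
        prob_k = comb(n_positions, k) * p_open**k * (1 - p_open) ** (n_positions - k)
        # There cannot be more hires than applicants
        expected_filled += prob_k * min(k, n_applicants)
    return expected_filled / n_applicants


if __name__ == "__main__":
    # Illustrative input: 100 positions, each opening with p = 0.5, 100 applicants
    print(round(job_probability(100, 0.5, 100), 3))
```

With more applicants than positions the estimate shrinks roughly in proportion to N * p / w, which is the behavior one would expect from a crude model like this.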

The Shiny R application for playing around with the input can be found here: https://tsikerdekis.shinyapps.io/AcademicJobProbability/

The source can be found here: https://github.com/tsikerdekis/AcademicJobProbability (Pull requests more than welcome)

Hiding your real IP from the Internet using a proxy and securing it through your firewall

2015-07-14T00:00:00-07:00

This article is not about setting up a proxy in your browser, Skype or torrent client. That part is easy: go to your application's options and set up your proxy. You should of course have access to a proxy server that ideally does not retain any log files. If it does, anyone who gains access to those log files can see your traffic history. So, you have your proxy server (paid or free) with no log files, and you have its port and authentication details (if any). You set everything up in your browser or torrent application and that's it, right? All your traffic passes through the proxy, and others see the proxy's IP address while you stay hidden behind it. Well, that is not always the case, and the reason boils down to the programming of the application that you are using. This is a guide on how to add an additional layer of protection by applying firewall rules.

Proxies

Before I go into more detail, I need to clarify the obvious. Proxies (such as SOCKS5) do not encrypt your traffic, so they do not hide it from your ISP. If you were to look at the packets transmitted from your computer on the way to the proxy server, you could see the final destination of each packet. After packets leave the proxy server, your IP cannot be discovered unless it is included, unencrypted, in the payload sent to the destination server. Notice I am using the term packet loosely here. So if you don't want a website, a torrent swarm or its peers, or Skype to know your IP address, proxies will do the trick. Your ISP will still be able to see what you are doing. The only workaround for this is an SSH tunnel or a VPN, which is not going to be discussed in this post.

How does your IP leak?

This could happen in a number of ways. It all boils down to bad programming on the side of applications. Some are designed in such a way that when your proxy server is down, they just redirect all traffic through the normal route. Other times, some of the traffic is sent through the proxy while some packets are sent outside it. All it takes is one leaked packet and you have failed at what you were attempting to do (hiding your IP from the destination server).

Solution

I am providing a solution for Ubuntu, but a Windows solution would work in the same way. Mac users may also be able to follow this guide by using the ipfw command instead, which is similar to iptables (Linux's firewall).

The solution has two parts: a) block all outgoing traffic that leaks past the proxy and b) don't answer any incoming connections that don't come from the proxy. The latter is not necessary for browsing, but it is for torrents if you really want to appear to the outside world as if you are not running a torrent client.

Blocking incoming

Blocking an incoming connection is relatively easy. I am assuming that if you are behind a router, you have already forwarded the relevant port for your application to the computer running the application. Sometimes, UPnP takes care of that. So let's say that port 8000 is the one for your application. All you need to do is tell your firewall to accept packets to this port only when they come from your proxy and drop the rest. Let's say that your proxy's IP is 10.10.10.10. As root, you just run:

iptables -F
iptables -A INPUT -p tcp -s 10.10.10.10 --dport 8000 -j ACCEPT
iptables -A INPUT -p udp -s 10.10.10.10 --dport 8000 -j ACCEPT
iptables -A INPUT -p udp --dport 8000 -j DROP
iptables -A INPUT -p tcp --dport 8000 -j DROP

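The first command flushes all existing rules on the firewall; by default there aren't any. The ACCEPT rules come before the DROP rules because iptables applies the first rule that matches a packet.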

Blocking outgoing packets

Windows makes it a bit easier to restrict rules to one application. Linux doesn't. My solution for this is to run the application as another user and apply rules to that user. It is definitely safer this way, but it takes a bit of work. I won't go into details on how to create a new user and run the application as that user, but you can find guides online. Assuming you have this ready and have verified using ps -faux that your application runs as that user (IMPORTANT, since the rules will apply only to that user), you can type the following as root.

iptables -A OUTPUT -p tcp -m owner --uid-owner testing -d 10.10.10.10 -j ACCEPT
iptables -A OUTPUT -p udp -m owner --uid-owner testing -d 10.10.10.10 -j ACCEPT
iptables -A OUTPUT -p udp -m owner --uid-owner testing -d 192.168.0.0/24 -j ACCEPT
iptables -A OUTPUT -p tcp -m owner --uid-owner testing -d 192.168.0.0/24 -j ACCEPT
iptables -A OUTPUT -p tcp -m owner --uid-owner testing -d 127.0.0.1 -j ACCEPT
iptables -A OUTPUT -m owner --uid-owner testing -j DROP

Basically, these rules accept outgoing traffic from this user to 10.10.10.10, to all IPs in the LAN (you don't have to allow this, though) and to localhost. The localhost rule is there because some programs use it to communicate with other programs. You have to adjust the rules to your setup, but the important part is that the final rule DROPs packets sent to any IP that is not explicitly allowed. If you try to do anything as that user, you will find that no websites will open unless a proxy is configured in your browser.
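To reason about which destinations these rules let through, it can help to model them outside the firewall. The following Python sketch only imitates the first-match behavior of the OUTPUT chain for destination addresses; it ignores protocol and user matching, the rule list simply mirrors the example addresses used in this post, and the test address 93.184.216.34 is an arbitrary outside IP chosen for illustration.

```python
import ipaddress

# Ordered (destination network, action) pairs mirroring the OUTPUT rules above
RULES = [
    ("10.10.10.10/32", "ACCEPT"),  # the proxy
    ("192.168.0.0/24", "ACCEPT"),  # the LAN (optional)
    ("127.0.0.1/32", "ACCEPT"),    # localhost
    ("0.0.0.0/0", "DROP"),         # everything else from this user
]


def decide(destination, rules=RULES):
    """Return the action of the first rule whose network contains destination."""
    addr = ipaddress.ip_address(destination)
    for network, action in rules:
        if addr in ipaddress.ip_network(network):
            return action
    return "ACCEPT"  # chain policy when no rule matches


if __name__ == "__main__":
    for dest in ("10.10.10.10", "192.168.0.42", "93.184.216.34"):
        print(dest, decide(dest))
```

Because the catch-all DROP entry is last, moving it to the top of the list would block the proxy as well, which is why the order of the rules matters.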

If you combine all of the incoming and outgoing rules into one file, make it executable and place it in /etc/network/if-pre-up.d/, the rules will be reapplied on every boot, so your firewall settings will survive a reboot.

Verifying that it works

A way to see what packets are hitting your interface is to use tcpdump. This shows incoming packets before they pass through the firewall and outgoing packets that already passed through the firewall.

sudo tcpdump port 8000 -i wlan0

Here is a sample of what you would expect to see:

17:30:57.219187 IP michael-netbook.local.8000 > 10.10.10.10.42869: UDP, length 111
17:30:57.430905 IP 10.10.10.10.42869 > michael-netbook.local.8000: UDP, length 30
17:30:57.461266 IP 10.10.10.10.42869 > michael-netbook.local.8000: UDP, length 380
17:30:57.461473 IP michael-netbook.local.8000 > 10.10.10.10.42869: UDP, length 30
17:30:57.492072 IP 10.10.10.10.42869 > michael-netbook.local.8000: UDP, length 380
17:30:57.492286 IP michael-netbook.local.8000 > 10.10.10.10.42869: UDP, length 30
17:30:57.502889 IP 10.10.10.10.42869 > michael-netbook.local.8000: UDP, length 380
17:30:57.503056 IP michael-netbook.local.8000 > 10.10.10.10.42869: UDP, length 30
17:30:57.517659 IP 10.10.10.10.42869 > michael-netbook.local.8000: UDP, length 380
17:30:57.517858 IP michael-netbook.local.8000 > 10.10.10.10.42869: UDP, length 33

It is likely that you will still see incoming traffic. This can happen because a) you had an open connection before applying the rules and activating the proxy (this will persist for a while) or b) machines on the internet initiated port scans for whatever reason. If your IP is dynamic, you are likely to see traffic of type (b), mainly aimed at other users that had your IP before you got it.

But is the firewall working? Well, let's see:

michael@michael-netbook:~$ sudo iptables -nvx -L INPUT
Chain INPUT (policy ACCEPT 5814 packets, 2633217 bytes)
pkts  bytes    target  prot  opt  in  out  source       destination
0     0        ACCEPT  tcp   --   *   *    10.10.10.10  0.0.0.0/0    tcp dpt:8000
4210  3556857  ACCEPT  udp   --   *   *    10.10.10.10  0.0.0.0/0    udp dpt:8000
0     0        DROP    udp   --   *   *    0.0.0.0/0    0.0.0.0/0    udp dpt:8000
0     0        DROP    tcp   --   *   *    0.0.0.0/0    0.0.0.0/0    tcp dpt:8000

Ideally, you will not see any dropped packets in the counters, but even if you do, that is a good thing. It means that someone tried to send packets to port 8000 directly at your IP and your firewall blocked them. For the rest of the world, the port appears closed, as if you don't have an application listening.

How about your outgoing traffic?

michael@michael-netbook:~$ sudo iptables -nvx -L OUTPUT
Chain OUTPUT (policy ACCEPT 1028 packets, 388980 bytes)
pkts  bytes   target  prot  opt  in  out  source     destination
2585  652144  ACCEPT  tcp   --   *   *    0.0.0.0/0  10.10.10.10     owner UID match 130
4074  336713  ACCEPT  udp   --   *   *    0.0.0.0/0  10.10.10.10     owner UID match 130
0     0       ACCEPT  udp   --   *   *    0.0.0.0/0  192.168.0.0/24  owner UID match 130
552   314414  ACCEPT  tcp   --   *   *    0.0.0.0/0  192.168.0.0/24  owner UID match 130
826   423050  ACCEPT  tcp   --   *   *    0.0.0.0/0  127.0.0.1       owner UID match 130
0     0       DROP    all   --   *   *    0.0.0.0/0  0.0.0.0/0       owner UID match 130

Ideally, this should also show no dropped packets, but even if it does, it just means that everything is working. It also means that your application attempted to send something while bypassing the proxy, but your firewall blocked its attempts.
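Checking the counters by eye gets tedious if you do it often. As a rough sketch, the Python function below sums the pkts column of every DROP rule in a listing such as the ones shown above. It assumes the exact-counter format produced by the -x flag (plain integers in the first two columns); the function name is illustrative and the sample text is the INPUT listing from this post.

```python
def dropped_packets(listing):
    """Sum the pkts column of every DROP rule in `iptables -nvx -L` output."""
    total = 0
    for line in listing.splitlines():
        fields = line.split()
        # Rule lines start with two integer counters followed by the target
        if len(fields) >= 3 and fields[0].isdigit() and fields[1].isdigit():
            if fields[2] == "DROP":
                total += int(fields[0])
    return total


SAMPLE = """\
Chain INPUT (policy ACCEPT 5814 packets, 2633217 bytes)
pkts bytes target prot opt in out source destination
0 0 ACCEPT tcp -- * * 10.10.10.10 0.0.0.0/0 tcp dpt:8000
4210 3556857 ACCEPT udp -- * * 10.10.10.10 0.0.0.0/0 udp dpt:8000
0 0 DROP udp -- * * 0.0.0.0/0 0.0.0.0/0 udp dpt:8000
0 0 DROP tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:8000
"""

if __name__ == "__main__":
    print(dropped_packets(SAMPLE))
```

A non-zero result on the OUTPUT chain is the interesting case: it means the application tried to send something around the proxy.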

But don't take my word for it. Set up these rules and then remove the proxy. Try using your application and monitor the traffic. Does anything work? If not, then your firewall is doing its job, allowing traffic only through the proxy even if programs attempt to bypass your settings.

Importing Wikipedia Dumps to MySQL

2013-07-14T00:00:00-07:00

It can be quite frustrating to add Wikipedia dumps to a local database. For some Wikipedias, such as the English Wikipedia, it takes a long time. This is a collection of scripts I've used to import Wikipedia dumps into MySQL.

Note: This guide is based on an Ubuntu server setup
Warning: The restoring process is likely to take weeks for large Wikipedias such as the English Wikipedia

Step 1: Install MediaWiki

This is the quickest way to create the almost blank MediaWiki database that Wikipedia uses. You will need a typical LAMP server.

MediaWiki uses InnoDB tables. By default, all InnoDB tables are stored in a single file on disk. That file cannot shrink and can cause problems. It's best to use the MySQL option that creates a separate file on disk per InnoDB table. To do this, open the MySQL configuration file in an editor (nano is used here as an example):

	sudo nano /etc/my.cnf

Find the [mysqld] part in the config file and add:

	innodb_file_per_table=1

Save the file and then restart MySQL from the terminal:

	sudo service mysql restart

Note that any existing InnoDB tables will remain in the large ibdata file, but any newly created tables will each get their own file on disk.

You will need to do some fine tuning on mysql for variables such as: innodb_buffer_pool_size, innodb_log_buffer_size, innodb_additional_mem_pool_size. You will have to investigate a bit to see what's best.

Now you need to install MediaWiki by following its Quick Installation Guide. For the most part, if you know your MySQL root password, MediaWiki can automatically set up the database, the tables and the user (in case you don't want root to be the user accessing the database).

At this point, if you know exactly which columns of a table you are going to need, you may want to change some of the other fields to smaller types, so that they still exist and avoid errors but don't occupy as much space. A good example is the text table, which contains two blob fields that track changes. If you are not interested in these changes, you could turn these blob fields into varchar(2) or something similar and save space.

Step 2: Find dumps and retrieve the list

After the installation, you will need to figure out which dump contains the data that you want. There are many and they contain dumps for different tables. You can look at some of the dumps here. If you want another language Wikipedia, you have to change "enwiki" to reflect the prefix of the language that you are interested in (e.g., elwiki for Greek Wikipedia, cswiki for Czech Wikipedia, eswiki for Spanish Wikipedia).

Use this script to create a filelist.txt file containing all files to be downloaded. You will need a proper regular expression to capture the names of the files automatically. As an alternative, you could type all file names manually into a file named filelist.txt. You will also need to set the url variable to the directory containing the dumps that interest you.

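	import re
	import urllib.request

	# Main URL of the dump directory to retrieve files from
	url = "http://dumps.wikimedia.org/enwiki/20130102/"

	# Download the directory listing (urllib.request replaces Python 2's urllib2)
	with urllib.request.urlopen(url) as f:
	    data = f.read().decode("utf-8")

	# Capture the names of the 7z history dumps listed on the page
	m = re.findall(r"(enwiki-\d*?-pages-meta-history\d+?\.xml-p.+?\.7z)", data, re.DOTALL)

	# Keep the first occurrence of each file name, preserving order
	a = []
	for item in m:
	    if item not in a:
	        a.append(item)

	print(a)

	# Write one file name per line
	with open("filelist.txt", "w") as f:
	    for name in a:
	        f.write(name + "\n")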
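Before pointing the script at a real dump directory, it can be useful to try the regular expression on a small piece of text. The sketch below uses the same pattern as the script (with the dot before xml escaped). The HTML fragment and the file names in it are made up for illustration and only imitate the naming pattern the expression expects; they are not taken from a real dump listing.

```python
import re

PATTERN = re.compile(r"(enwiki-\d*?-pages-meta-history\d+?\.xml-p.+?\.7z)", re.DOTALL)

# Hypothetical listing fragment; real dump pages may look different
sample = (
    '<a href="enwiki-20130102-pages-meta-history1.xml-p000000010p000002000.7z">'
    'enwiki-20130102-pages-meta-history1.xml-p000000010p000002000.7z</a>\n'
    '<a href="enwiki-20130102-pages-articles.xml.bz2">other file</a>\n'
)


def unique_matches(text):
    """Return matching file names in first-seen order without duplicates."""
    seen = []
    for name in PATTERN.findall(text):
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    print(unique_matches(sample))
```

Each name appears twice in a typical link (once in the href and once as the link text), which is why duplicates are filtered out before writing filelist.txt.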

Step 3: Start retrieving and importing

For this step you will need to save a couple of scripts. You will also need the filelist.txt file from the previous step. Some of the files come with instructions that you need to follow. You may also need to install the p7zip Ubuntu package.

Save all of the following on your disk (same directory).

preimport.sql (source Brian Stempin)

	SET autocommit=0;
	SET unique_checks=0;
	SET foreign_key_checks=0;
	BEGIN;

postimport.sql (source Brian Stempin)

	COMMIT;
	SET autocommit=1;
	SET unique_checks=1;
	SET foreign_key_checks=1;

mwimport.pl (original source here)

You can drastically speed up the import process by commenting out the INSERT line that adds rows to the text table. Look for "Comment this to save time" in the code below.

	#!/usr/bin/perl -w

	=head1 NAME

	mwimport -- quick and dirty mediawiki importer

	=head1 SYNOPSIS

	cat pages.xml | mwimport [-s N|--skip=N]

	=cut

	use strict;

	use Getopt::Long;

	use Pod::Usage;

	my ($cnt_page, $cnt_rev, %namespace, $ns_pattern);

	my $committed = 0;

	my $skip = 0;

	## set this to 1 to match "mwdumper --format=sql:1.5" as close as possible

	sub Compat() { 0 }

	# 512kB is what mwdumper uses, but 4MB gives much better performance here

	my $Buffer_Size = Compat ? 512*1024 : 4*1024*1024;

	sub textify($)

	{

	  my $l;

	  for ($_[0]) {

	    if (defined $_) {

	      s/&quot;/"/ig;

	      s/&lt;/</ig;

	      s/&gt;/>/ig;

	      /&(?!amp;)(.*?;)/ and die "textify: does not know &$1";

	      s/&amp;/&/ig;

	      $l = length $_;

	      s/\\/\\\\/g;

	      s/\n/\\n/g;

	      s/'/\\'/ig;

	      Compat and s/"/\\"/ig;

	      $_ = "'$_'";

	    } else {

	      $l = 0;

	      $_ = "''";

	    }

	  }

	  return $l;

	}

	sub getline()

	{

	  $_ = <>;

	  defined $_ or die "eof at line $.\n";

	}

	sub ignore_elt($)

	{

	  m|^\s*<$_[0]>.*?</$_[0]>\n$| or die "expected $_[0] element in line $.\n";

	  getline;

	}

	sub simple_elt($$)

	{

	  if (m|^\s*<$_[0]\s*/>\n$|) {

	    $_[1]{$_[0]} = '';

	  } elsif (m|^\s*<$_[0]>(.*?)</$_[0]>\n$|) {

	    $_[1]{$_[0]} = $1;

	  } else {

	    die "expected $_[0] element in line $.\n";

	  }

	  getline;

	}

	sub simple_opt_elt($$)

	{

	  if (m|^\s*<$_[0]\s*/>\n$|) {

	    $_[1]{$_[0]} = '';

	  } elsif (m|^\s*<$_[0]>(.*?)</$_[0]>\n$|) {

	    $_[1]{$_[0]} = $1;

	  } else {

	    return;

	  }

	  getline;

	}

	sub redirect_elt($)

	{

	  if (m|^\s*<redirect\s*title="([^"]*)"\s*/>\n$|) { # " -- GeSHI syntax highlighting breaks on this line

	    $_[0]{redirect} = $1;

	  } else {

	    simple_opt_elt redirect => $_[0];

	    return;

	  }

	  getline;

	}

	sub opening_tag($)

	{

	  m|^\s*<$_[0]>\n$| or die "expected $_[0] element in line $.\n";

	  getline;

	}

	sub closing_tag($)

	{

	  m|^\s*</$_[0]>\n$| or die "$_[0]: expected closing tag in line $.\n";

	  getline;

	}

	sub si_nss_namespace()

	{

	  m|^\s*<namespace key="(-?\d+)"[^/]*?/>()\n|

	    or m|^\s*<namespace key="(-?\d+)"[^>]*?>(.*?)</namespace>\n|

	    or die "expected namespace element in line $.\n";

	  $namespace{$2} = $1;

	  getline;

	}

	sub si_namespaces()

	{

	  opening_tag("namespaces");

	  eval {

	    while (1) {

	      si_nss_namespace;

	    }

	  };

	  # note: $@ is always defined

	  $@ =~ /^expected namespace element / or die "namespaces: $@";

	  $ns_pattern = '^('.join('|',map { quotemeta } keys %namespace).'):';

	  closing_tag("namespaces");

	}

	sub siteinfo()

	{

	  opening_tag("siteinfo");

	  eval {

	    my %site;

	    simple_elt sitename => \%site;

	    simple_elt base => \%site;

	    simple_elt generator => \%site;

	    $site{generator} =~ /^MediaWiki 1.20wmf1$/

	      or warn("siteinfo: untested generator '$site{generator}',",

	              " expect trouble ahead\n");

	    simple_elt case => \%site;

	    si_namespaces;

	    print "-- MediaWiki XML dump converted to SQL by mwimport

	BEGIN;

	-- Site: $site{sitename}

	-- URL: $site{base}

	-- Generator: $site{generator}

	-- Case: $site{case}

	--

	-- Namespaces:

	",map { "-- $namespace{$_}: $_\n" }

	  sort { $namespace{$a} <=> $namespace{$b} } keys %namespace;

	  };

	  $@ and die "siteinfo: $@";

	  closing_tag("siteinfo");

	}

	sub pg_rv_contributor($)

	{

	  if (m|^\s*<contributor deleted="deleted"\s*/>\s*\n|) {

	    getline;

	  } else {

	    opening_tag "contributor";

	    my %c;

	    eval {

	      simple_elt username => \%c;

	      simple_elt id => \%c;

	      $_[0]{contrib_user} = $c{username};

	      $_[0]{contrib_id}   = $c{id};

	    };

	    if ($@) {

	      $@ =~ /^expected username element / or die "contributor: $@";

	      eval {

	        simple_elt ip => \%c;

	        $_[0]{contrib_user} = $c{ip};

	      };

	      $@ and die "contributor: $@";

	    }

	    closing_tag "contributor";

	  }

	}

	sub pg_rv_comment($)

	{

	  if (m|^\s*<comment\s*/>\s*\n|) {

	    getline;

	  } elsif (m|^\s*<comment deleted="deleted"\s*/>\s*\n|) {

	    getline;

	  } elsif (s|^\s*<comment>([^<]*)||g) {

	    while (1) {

	      $_[0]{comment} .= $1;

	      last if $_;

	      getline;

	      s|^([^<]*)||;

	    }

	    closing_tag "comment";

	  } else {

	    return;

	  }

	}

	sub pg_rv_text($)

	{

	  if (m|^\s*<text xml:space="preserve"\s*/>\s*\n|) {

	    $_[0]{text} = '';

	    getline;

	  } elsif (m|^\s*<text deleted="deleted"\s*/>\s*\n|) {

	    $_[0]{text} = '';

	    getline;

	  } elsif (s|^\s*<text xml:space="preserve">([^<]*)||g) {

	    while (1) {

	      $_[0]{text} .= $1;

	      last if $_;

	      getline;

	      s|^([^<]*)||;

	    }

	    closing_tag "text";

	  } else {

	    die "expected text element in line $.\n";

	  }

	}

	my $start = time;

	sub stats()

	{

	  my $s = time - $start;

	  $s ||= 1;

	  printf STDERR "%9d pages (%7.3f/s), %9d revisions (%7.3f/s) in %d seconds\n",

	    $cnt_page, $cnt_page/$s, $cnt_rev, $cnt_rev/$s, $s;

	}

	### flush_rev($text, $rev, $page)

	sub flush_rev($$$)

	{

	  $_[0] or return;

	  for my $i (0,1,2) {

	    $_[$i] =~ s/,\n?$//;

	  }

	  print "INSERT INTO text(old_id,old_text,old_flags) VALUES $_[0];\n"; #Comment this to save time

	  $_[2] and print "INSERT INTO page(page_id,page_namespace,page_title,page_restrictions,page_counter,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len) VALUES $_[2];\n";

	  print "INSERT INTO revision(rev_id,rev_page,rev_text_id,rev_comment,rev_user,rev_user_text,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id) VALUES $_[1];\n";

	  for my $i (0,1,2) {

	    $_[$i] = '';

	  }

	}

	### flush($text, $rev, $page)

	sub flush($$$)

	{

	  flush_rev $_[0], $_[1], $_[2];

	  print "COMMIT;\n";

	  $committed = $cnt_page;

	}

	### pg_revision(\%page, $skip, $text, $rev, $page)

	sub pg_revision($$$$$)

	{

	  my $rev = {};

	  opening_tag "revision";

	  eval {

	    my %revision;

	    simple_elt id => $rev;

	    simple_opt_elt parentid => $rev;

	    simple_elt timestamp => $rev;

	    pg_rv_contributor $rev;

	    simple_opt_elt minor => $rev;

	    pg_rv_comment $rev;

	    pg_rv_text $rev;

	    simple_opt_elt sha1 => $rev;

	    simple_opt_elt model => $rev;

	    simple_opt_elt format => $rev;

	  };

	  $@ and die "revision: $@";

	  closing_tag "revision";

	  $_[1] and return;

	  $$rev{id} =~ /^\d+$/ or return

	    warn("page '$_[0]{title}': ignoring bogus revision id '$$rev{id}'\n");

	  $_[0]{latest_len} = textify $$rev{text};

	  for my $f (qw(comment contrib_user)) {

	    textify $$rev{$f};

	  }

	  $$rev{timestamp} =~

	    s/^(\d\d\d\d)-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$/'$1$2$3$4$5$6'/

	      or return warn("page '$_[0]{title}' rev $$rev{id}: ",

	                     "bogus timestamp '$$rev{timestamp}'\n");

	  $_[2] .= "($$rev{id},$$rev{text},'utf-8'),\n";

	  $$rev{minor} = defined $$rev{minor} ? 1 : 0;

	  $_[3] .= "($$rev{id},$_[0]{id},$$rev{id},$$rev{comment},"

	    .($$rev{contrib_id}||0)

	    .",$$rev{contrib_user},$$rev{timestamp},$$rev{minor},0,$_[0]{latest_len},$_[0]{latest}),\n";

	  $_[0]{latest} = $$rev{id};

	  $_[0]{latest_start} = substr $$rev{text}, 0, 60;

	  if (length $_[2] > $Buffer_Size) {

	    flush_rev $_[2], $_[3], $_[4];

	    $_[0]{do_commit} = 1;

	  }

	  ++$cnt_rev % 1000 == 0 and stats;

	}

	### page($text, $rev, $page)

	sub page($$$)

	{

	  opening_tag "page";

	  my %page;

	  ++$cnt_page;

	  eval {

	    simple_elt title => \%page;

	    simple_opt_elt ns => \%page;

	    simple_elt id => \%page;

	    redirect_elt \%page;

	    simple_opt_elt restrictions => \%page;

	    $page{latest} = 0;

	    while (1) {

	      pg_revision \%page, $skip, $_[0], $_[1], $_[2];

	    }

	  };

	  # note: $@ is always defined

	  $@ =~ /^expected revision element / or die "page: $@";

	  closing_tag "page";

	  if ($skip) {

	    --$skip;

	  } else {

	    $page{id} =~ /^\d+$/

	      or warn("page '$page{title}': bogus id '$page{id}'\n");

	    my $ns;

	    if ($page{title} =~ s/$ns_pattern//o) {

	      $ns = $namespace{$1};

	    } else {

	      $ns = 0;

	    }

	    for my $f (qw(title restrictions)) {

	      textify $page{$f};

	    }

	    if (Compat) {

	      $page{redirect} = $page{latest_start} =~ /^'#(?:REDIRECT|redirect) / ?

	        1 : 0;

	    } else {

	      $page{redirect} = $page{latest_start} =~ /^'#REDIRECT /i ? 1 : 0;

	    }

	    $page{title} =~ y/ /_/;

	    if (Compat) {

	      $_[2] .= "($page{id},$ns,$page{title},$page{restrictions},0,"

	        ."$page{redirect},0,RAND(),"

	          ."DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,"

	            ."$page{latest},$page{latest_len}),\n";

	    } else {

	      $_[2] .= "($page{id},$ns,$page{title},$page{restrictions},0,"

	        ."$page{redirect},0,RAND(),NOW()+0,$page{latest},$page{latest_len}),\n";

	    }

	    if ($page{do_commit}) {

	      flush $_[0], $_[1], $_[2];

	      print "BEGIN;\n";

	    }

	  }

	}

	sub terminate

	{

	  die "terminated by SIG$_[0]\n";

	}

	my $SchemaVer = '0.8';

	my $SchemaLoc = "http://www.mediawiki.org/xml/export-$SchemaVer/";

	my $Schema    = "http://www.mediawiki.org/xml/export-$SchemaVer.xsd";

	my $help;

	GetOptions("skip=i"             => \$skip,

	           "help"               => \$help) or pod2usage(2);

	$help and pod2usage(1);

	getline;

	m|^<mediawiki \Qxmlns="$SchemaLoc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="$SchemaLoc $Schema" version="$SchemaVer"\E xml:lang="..">$|

	  or die "unknown schema or invalid first line\n";

	getline;

	$SIG{TERM} = $SIG{INT} = \&terminate;

	siteinfo;

	my ($text, $rev, $page) = ('', '', '');

	eval {

	  while (1) {

	    page $text, $rev, $page;

	  }

	};

	$@ =~ /^expected page element / or die "$@ (committed $committed pages)\n";

	flush $text, $rev, $page;

	stats;

	m|</mediawiki>| or die "mediawiki: expected closing tag in line $.\n";

	=head1 COPYRIGHT

	Copyright 2007 by Robert Bihlmeyer

	This program is free software; you can redistribute it and/or modify

	it under the terms of the GNU General Public License as published by

	the Free Software Foundation; either version 2 of the License, or

	(at your option) any later version.

	You may also redistribute and/or modify this software under the terms

	of the GNU Free Documentation License without invariant sections, and

	without front-cover or back-cover texts.

	This program is distributed in the hope that it will be useful,

	but WITHOUT ANY WARRANTY; without even the implied warranty of

	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

	GNU General Public License for more details.

adddb.sh - You will need to change the urll variable to the URL that you will be using. Wikipedia releases some files under bz2 compression and other files under 7z. I left a commented line in this script that you need to enable in case your files are bz2 and not 7z (don't forget to comment out the line underneath it that deals with 7z). Finally, you will need to add your MySQL username, password and database.

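	#!/bin/bash

	urll='http://dumps.wikimedia.org/enwiki/20130102/'

	for i in $( cat filelist.txt ); do
	        echo "retrieving: $i"
	        wget "$urll$i"
	        echo "mwimport: $i"
	        # Enable the next line (and comment out the 7za line) if your files are bz2
	        #bzcat "$i" | perl mwimport.pl > temp.sql
	        7za e -so "$i" | perl mwimport.pl > temp.sql
	        # A successful conversion ends with a COMMIT; statement
	        result=$( tail -n1 temp.sql )
	        result1='COMMIT;'
	        echo checking...
	        echo "$result"
	        if [ "$result" = "$result1" ]
	        then
	                echo "OKAY: $i"
	                rm "$i"
	                # Add your MySQL username after -u, your password after -p and your database after --database=
	                cat preimport.sql temp.sql postimport.sql | mysql -f -u  -p --database=
	                rm temp.sql
	        else
	                echo "FAIL: $i"
	                exit 1
	        fi
	done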

Step 4: Run "sh adddb.sh"

The script will start retrieving the first file in filelist.txt, extract it and process it using mwimport.pl, check on whether the extraction is successful and finally add the information to the database. After that, it will delete the file and carry on with the second file in filelist.txt all the way to the end.

You should preferably do this inside a screen session. If you don't have screen installed, install it first, then start a session and run the script:

	sudo aptitude install screen

	screen -t Restoringwiki

	sh adddb.sh

Ctrl+A followed by D will detach the screen. It stays active in the background. To view the progress, you can re-enter that screen by typing "screen -r " and hitting Tab to complete the screen number automatically.

Bayes Factor

2012-07-14T00:00:00-07:00

In NHST you need a p value along with an effect size to determine whether there is an effect and how big it is. The Bayes factor answers both of these questions at once. In other words, one simple result determines which hypothesis is supported and by how much, according to your data.

You determine which hypothesis is more likely given the data based on the Bayes factor. A general Bayes factor is interpreted as follows. If a Bayes factor is denoted as BFxy, then you say that the data are n times more likely under Hx than under Hy. For example, a BF10 of 0.53 means that the data are 0.53 times as likely under H1 as under H0. If we use BF01 instead, we would say that the data are 1.87 times more likely under H0 than under H1 (which is more meaningful). Basically, one version of the Bayes factor (e.g. BF01) is the reciprocal of the other (e.g. BF10). Pick the version of the Bayes factor that is above 1.

Interpreting a Bayes Factor

There is a scale used to determine how strong the evidence presented by a Bayes factor is. The scale was developed by Harold Jeffreys in his book "Theory of Probability" (H. Jeffreys (1961). The Theory of Probability (3 ed.). Oxford. p. 432).

Bayes Factor	Strength of Evidence
< 1:1	Negative (Supports the opposite model)
1:1 to 3:1	Barely worth mentioning
3:1 to 10:1	Substantial
10:1 to 30:1	Strong
30:1 to 100:1	Very strong
> 100:1	Decisive
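
The scale above is easy to apply mechanically. The following is a small Python sketch that takes the reciprocal of a Bayes factor and looks up its label in the table. The function names are illustrative, and since the table does not say which category a boundary value such as exactly 3:1 belongs to, assigning it to the lower category is an assumption made here.

```python
def invert(bf):
    """BF01 is the reciprocal of BF10 (and vice versa)."""
    return 1.0 / bf


def strength(bf):
    """Label a Bayes factor using the ranges in the table above."""
    if bf < 1:
        return "Negative (supports the opposite model)"
    if bf <= 3:
        return "Barely worth mentioning"
    if bf <= 10:
        return "Substantial"
    if bf <= 30:
        return "Strong"
    if bf <= 100:
        return "Very strong"
    return "Decisive"


if __name__ == "__main__":
    bf10 = 0.4660114  # an example value below 1
    bf01 = invert(bf10)
    print(bf01, strength(bf01))
```

Reporting the version of the Bayes factor that is above 1 keeps the lookup in the informative part of the table rather than in the "Negative" row.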

Bayesian Test of Independence

2012-07-14T00:00:00-07:00

Introduction

The following test behaves a lot like the chi-square test of independence. It can work with ordinal, categorical and even dichotomous variables (any case that can give you a contingency table).

References

This method is based on the book "Bayesian Computation With R" by Jim Albert. If you want to learn more about the model and the code you can read the book or the article.

For the procedure you need R and the LearnBayes package, which can be installed in R using the command install.packages('LearnBayes').

Procedure Highlights

Input

You need to type the data for your contingency table or feed it to your tabledata variable. Additionally, you need to specify the rows and columns for your table.

	#---------------INPUT DATA------------------

	tabledata = c(6,9,40,34) # Enter data column by column (R fills a matrix one column at a time)

	tablerows = 2 # rows in the contingency table

	tablecolumns = 2 # columns in the contingency table

	#-------------------------------------------
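
R's matrix() fills a table one column at a time by default, which is why the order of the numbers in tabledata matters. To make that ordering concrete, here is a small Python sketch that arranges a flat list of counts the same way; the function name is illustrative and the numbers are the ones from the input above.

```python
def fill_by_column(values, n_rows, n_cols):
    """Arrange a flat list into rows the way R's matrix() does (column-major)."""
    if len(values) != n_rows * n_cols:
        raise ValueError("number of values must equal rows * columns")
    return [[values[col * n_rows + row] for col in range(n_cols)]
            for row in range(n_rows)]


if __name__ == "__main__":
    for row in fill_by_column([6, 9, 40, 34], 2, 2):
        print(row)
```

The first two numbers end up in the first column and the last two in the second column, so the table should be typed in by going down each column in turn.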

The hypotheses are:

H0: There is no dependency between the two variables
H1: There is a dependency between the two variables
H~0: There is almost no dependency. This hypothesis tests for a model close to independence.

Output

	Your table: 

	     [,1] [,2]

	[1,]    6   40

	[2,]    9   34

	The uniform table to be compared with your table: 

	     [,1] [,2]

	[1,]    1    1

	[2,]    1    1

	-----------Results--------------

	Bayes Factor(BF10) for H1 Dependence over H0 Independence:  0.4660114 

	Bayes Factor(BF01) for H0 Independence over H1 Dependence:  2.14587 

	-----------Additional Model Results--------------

	Bayes factor in support of the model close to independence versus the model of independence:

	  log.K log.BF   BF

	1     2  -1.76 0.17

	2     3  -0.50 0.61

	3     4  -0.25 0.78

	4     5  -0.07 0.93

	5     6  -0.02 0.98

	6     7   0.00 1.00

	-------Results--------------

	Bayes Factor(BF10) for H~0 Close to Independence over H0 Independence:  0.9954232 

	Bayes Factor(BF01) for H0 Independence over H~0 Close to Independence:  1.004598

Always verify that your table looks the way it should. The code is still a bit buggy and sometimes rows and columns get swapped. If your table comes out transposed, just swap the values of tablerows and tablecolumns.

The code performs two analyses. The first tests the independence hypothesis against the dependence hypothesis. The second tests the hypothesis of independence against the hypothesis close to independence.

Read the Bayes Factor page for how you should interpret these results.

In this example, the data are about 2.15 times more likely under H0 than under H1. The evidence is not really strong, however. Additionally, the second test failed to provide support for a model close to independence.

Code

	# clears workspace:  

	rm(list=ls(all=TRUE))

	#---------------INPUT DATA------------------

	tabledata = c(6,9,40,34) # Enter data column by column (R fills a matrix one column at a time)

	tablerows = 2 # rows in the contingency table

	tablecolumns = 2 # columns in the contingency table

	#-------------------------------------------

	library(LearnBayes)

	tablesize = c(tablerows,tablecolumns)

	data=matrix(tabledata,nrow=tablerows,ncol=tablecolumns) # rows and columns are passed by name so they cannot be swapped

	cat("\r\nYour table: \r\n")

	print(data)

	#chisq.test(data)

	#fisher.test(data)

	totalrowscolumns = tablerows * tablecolumns

	a=matrix(rep(1,totalrowscolumns),nrow=tablerows,ncol=tablecolumns)

	cat("\r\nThe uniform table to be compared with your table: \r\n")

	print(a)

	BF10 = ctable(data,a) #BF in support of the dependence hypothesis

	BF01 =  1 /BF10

	cat("\r\n-----------Results--------------\r\n")

	cat("Bayes Factor(BF10) for H1 Dependence over H0 Independence: ",BF10,"\r\n")

	cat("Bayes Factor(BF01) for H0 Independence over H1 Dependence: ",BF01,"\r\n")

	log.K=seq(2,7)

	compute.log.BF=function(log.K)

	  log(bfindep(data,exp(log.K),100000)$bf)

	log.BF=sapply(log.K,compute.log.BF)

	BF=exp(log.BF)

	#BF in support of the alternative model close to independence

	#Bayes factor against independence assuming alternatives close to independence

	cat("\r\n-----------Additional Model Results--------------\r\n")

	cat("Bayes factor in support of the model close to independence versus the model of independence:\r\n")

	print(round(data.frame(log.K,log.BF,BF),2))

	#Plotting

	plot(log.K,log.BF)

	lines(log.K,log.BF)

	cat("\r\n-----------Results--------------\r\n")

	cat("Bayes Factor(BF~00) for H~0 Close to Independence over H0 Independence: ",max(BF),"\r\n")

	cat("Bayes Factor(BF0~0) for H0 Independence over H~0 Close to Independence: ",1/max(BF),"\r\n")

Bayesian t-test hypothesis testing for two independent groups

2012-07-14T00:00:00-07:00

Introduction

This method can be used in the same circumstances as the regular independent t-test: when you want to statistically compare the means of two groups. The data in both groups should be normally distributed.
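One common way to check the normality assumption, which is not part of the original procedure, is the Shapiro-Wilk test from base R. The sketch below applies it to the example data from the Input section further down; the names check1 and check2 are illustrative.

```r
# Example data copied from the Input section of this page.
group1 <- c(49, 13, 50, 64, 21, 23, 15, 7, 32, 8, 17, 15, 19, 16, 33)
group2 <- c(41, 20, 53, 41, 20, 44, 31, 24, 24, 44, 15, 35, 32, 25, 35)

# Shapiro-Wilk test (stats package, loaded by default).
# A small p-value suggests a departure from normality.
check1 <- shapiro.test(group1)
check2 <- shapiro.test(group2)

cat("group1: W =", round(check1$statistic, 3), "p =", round(check1$p.value, 3), "\n")
cat("group2: W =", round(check2$statistic, 3), "p =", round(check2$p.value, 3), "\n")
```

With samples this small the test has little power, so a look at a histogram or a normal Q-Q plot is a sensible complement.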

References

This method is based on the book "A Practical Course in Bayesian Graphical Modeling" by Michael Lee and Eric-Jan Wagenmakers. Additionally, a published scientific article can be found here. Either or both are good to cite when using this method. Some of the code may have been changed to make the analysis easier to apply. If you want to learn more about the model and the code, you can read the book or the article.

For the procedure you need R and OpenBUGS.

Procedure highlights

Input

	setwd("/the/directory/for/both/of/your/files/") #this will help the program find the location of the model

	group1 = c(49,13,50,64,21,23,15,7,32,8,17,15,19,16,33) #your set of values for group 1

	group2 = c(41,20,53,41,20,44,31,24,24,44,15,35,32,25,35) #your set of values for group 2

	openbugsmodel = "Ttest_2.txt" #specify the location for the file that contains the OpenBugs code

	priorforh1 = dcauchy(0) #prior for the case of delta<>0

	priorforh2 = 2*dcauchy(0) #prior for the case of delta<0

	priorforh3 = 2*dcauchy(0) #prior for the case of delta>0

	#Advanced input

	itterations = 30000 #Don't change these unless you know what you're doing

	burnin = 3000 #Don't change these unless you know what you're doing

Set the first three lines according to your setup and data, or feed the variables your own data. You also need to set the file that contains the OpenBUGS model, which you can find at the end of this page. You can change the priors if you wish; just remember that the priors for hypothesis 2 and hypothesis 3 must be adjusted so that they apply only to negative or only to positive values, respectively. You can also change the number of iterations and the burn-in if you want to improve your results. These need to be reported in your paper later on.
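The factor of 2 in the priors for hypotheses 2 and 3 comes from restricting the Cauchy prior to one side of zero: all the prior mass is folded onto that side, so the density there doubles. The following base R sketch checks this numerically; the names prior_h1, area_h2 and so on are illustrative.

```r
# Height of the standard Cauchy density at zero, which equals 1 / pi.
prior_h1 <- dcauchy(0)

# Restricting the prior to one side of zero doubles the density on that side.
prior_h2 <- 2 * dcauchy(0)  # delta < 0
prior_h3 <- 2 * dcauchy(0)  # delta > 0

# Check that each one-sided prior is still a proper density (area of 1).
area_h2 <- integrate(function(x) 2 * dcauchy(x), lower = -Inf, upper = 0)$value
area_h3 <- integrate(function(x) 2 * dcauchy(x), lower = 0, upper = Inf)$value

cat("Prior heights at 0:", prior_h1, prior_h2, prior_h3, "\n")
cat("Areas of the one-sided priors:", area_h2, area_h3, "\n")
```

If you switch to a different prior, the same rule applies: the one-sided version is twice the two-sided density on the side that is kept.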

Output

The hypotheses for this test are:

H0 - δ = 0: there is no difference between the two groups
H1 - δ <> 0: there is a difference between the two groups
H2 - δ < 0: group 2 has larger values than group 1
H3 - δ > 0: group 1 has larger values than group 2

The output produces a set of results in text along with the probability distribution plots for each one of them. Both are useful for making a decision about your hypothesis. As an example, you can use the graph to determine whether, at the point δ=0, your posterior (your belief after seeing the data) is higher or lower than the prior (your initial belief). If the posterior is higher than the prior at δ=0, the data reinforce the belief that the null hypothesis (H0) is true. If the posterior is lower than the prior, the data weaken your belief that the null hypothesis is true. Bayes factors are automatically reported on the graphs.

	-----------Results--------------

	Bayes Factor(BF10) for H1 delta<>0 over H0 delta=0:  0.5346907 

	Bayes Factor(BF01) for H0 delta=0 over H1 delta<>0:  1.87024 

	---

	Bayes Factor(BF20) for H2 delta<0 over H0 delta=0:  0.9432402 

	Bayes Factor(BF02) for H0 delta=0 over H2 delta<0:  1.060175 

	---

	Bayes Factor(BF30) for H3 delta>0 over H0 delta=0:  0.1256324 

	Bayes Factor(BF03) for H0 delta=0 over H3 delta>0:  7.959729 

	---
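The comparison of prior and posterior heights at δ=0 described above is also how the R code further down obtains each Bayes factor (posterior/prior and its inverse). Here is a minimal sketch of that idea, under two loudly stated assumptions: the "posterior" draws are simulated from a normal distribution instead of coming from OpenBUGS, and a kernel density estimate from base R stands in for the logspline fit used in the full script.

```r
set.seed(1)

# Assumption: stand-in posterior draws for delta. A real analysis would
# use the MCMC samples returned by OpenBUGS instead.
delta_draws <- rnorm(20000, mean = 0.3, sd = 0.35)

# Kernel density estimate of the posterior (the full script uses logspline).
kde <- density(delta_draws)
posterior_at_0 <- approx(kde$x, kde$y, xout = 0)$y

# Height of the Cauchy(0, 1) prior at delta = 0.
prior_at_0 <- dcauchy(0)

BF01 <- posterior_at_0 / prior_at_0  # evidence for H0 over H1
BF10 <- prior_at_0 / posterior_at_0  # evidence for H1 over H0

cat("BF01:", BF01, " BF10:", BF10, "\n")
```

In this made-up example the posterior at zero is higher than the prior, so BF01 comes out above 1 and the data lean towards H0.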

Please read the Bayes Factor page for how to interpret these values.

In the first two cases the evidence for H0 is "Barely worth mentioning". The third result (BF03 = 7.96), however, is considered "Substantial" evidence in favor of H0: when we ask whether group 1 has larger values than group 2, there is substantial evidence that this is unlikely (supporting H0).
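The labels used above can be assigned mechanically. The sketch below assumes the cut-offs of Jeffreys' scale (10^0.5, 10, 10^1.5 and 100); these are an assumption here and should be checked against the Bayes Factor page. The helper bf_label is hypothetical and not part of the analysis code.

```r
# Assumption: cut-offs follow Jeffreys' scale; verify them against the
# Bayes Factor page before relying on the labels.
bf_label <- function(bf) {
  stopifnot(all(bf >= 1))  # pass the Bayes factor in the favored direction
  cuts <- c(1, 10^0.5, 10, 10^1.5, 100, Inf)
  labels <- c("Barely worth mentioning", "Substantial", "Strong",
              "Very strong", "Decisive")
  as.character(cut(bf, breaks = cuts, labels = labels, right = FALSE))
}

# Bayes factors in favor of H0 taken from the output above.
bf_h0 <- c(BF01 = 1.87024, BF02 = 1.060175, BF03 = 7.959729)
print(data.frame(BF = bf_h0, evidence_for_H0 = bf_label(bf_h0)))
```

A Bayes factor below 1 should be inverted first, so that the label describes the hypothesis the data actually favor.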

When publishing, you will also have to report the process by which you obtained your results: the number of iterations for the MCMC sampling, the burn-in value, and the number of chains (in this case 3).

Code

R code

	# clears workspace:  

	rm(list=ls(all=TRUE))

	#---------------INPUT DATA------------------

	setwd("/the/directory/for/both/of/your/files/") #this will help the program find the location of the model

	group1 = c(1,5,4,2,1,2,3,1,2,3,1,2,5,4,7,8,9,7,5,3,2,3,4,6,7,6,4,7) #your set of values for group 1

	group2 = c(5,4,3,3,4,6,6,8,7,6,5,7,7,6,5,6,7,5,4,5,6,7,8,5,6,7,8,9) #your set of values for group 2

	openbugsmodel = "Ttest_2.txt"

	priorforh1 = dcauchy(0) #prior for the case of delta<>0

	priorforh2 = 2*dcauchy(0) #prior for the case of delta<0

	priorforh3 = 2*dcauchy(0) #prior for the case of delta>0

	#Advanced input

	itterations = 30000 #Don't change these unless you know what you're doing

	burnin = 3000 #Don't change these unless you know what you're doing

	#-------------------------------------------

	library(R2OpenBUGS)

	n1 = length(group1)

	n2 = length(group2)

	# Rescale:

	group2 = group2-mean(group1)

	group2 = group2/sd(group1)

	group1 = as.vector(scale(group1))

	data=list("group1", "group2", "n1", "n2") # to be passed on to OpenBUGS

	inits=function()

	{

	    list(delta=rnorm(1,0,1),mu=rnorm(1,0,1),sigma=runif(1,0,5))

	}

	# Parameters to be monitored

	parameters=c("delta")

	# The following command calls OpenBUGS with specific options.

	# For a detailed description see Sturtz, Ligges, & Gelman (2005).

	samples = bugs(data, inits, parameters,

	                model.file =openbugsmodel,

	                n.chains=3, n.iter=itterations, n.burnin=burnin, n.thin=1,

	                DIC=T, 

	                codaPkg=F, debug=F)

	# Now the values for the monitored parameters are in the "samples" object, 

	# ready for inspection.

	samples$summary # overview

	# Please work through the analyses below one at a time

	######################################################

	# Analysis 1. H1: delta ~ Cauchy (unrestricted case)

	######################################################

	# Collect posterior samples across all chains:

	delta.posterior  = samples$sims.list$delta  

	#============ BFs based on logspline fit ===========================

	library(polspline) # this package can be installed from within R

	fit.posterior = logspline(delta.posterior)

	# 95% credible interval:

	x0=qlogspline(0.025,fit.posterior)

	x1=qlogspline(0.975,fit.posterior)

	posterior     = dlogspline(0, fit.posterior) # this gives the pdf at point delta = 0

	prior         = priorforh1                   # height of the unrestricted prior at delta = 0

	BF10          = prior/posterior

	BF01          = posterior/prior

	cat("-----------Results--------------\r\n")

	cat("Bayes Factor(BF10) for H1 delta<>0 over H0 delta=0: ",BF10,"\r\n")

	cat("Bayes Factor(BF01) for H0 delta=0 over H1 delta<>0: ",BF01,"\r\n")

	cat("---\r\n")

	if (BF10>=BF01){

	  BFplot=BF10

	  BFtext = bquote(BF[1][0])

	  }else{

	    BFplot=BF01

	    BFtext = bquote(BF[0][1])

	  }

	BFplot=round(BFplot,2)

	#============ Plot Prior and Posterior  ===========================

	par(cex.main = 1.5, mar = c(5, 6, 4, 5) + 0.1, mgp = c(3.5, 1, 0), cex.lab = 1.5,

	    font.lab = 2, cex.axis = 1.3, bty = "n", las=1)

	xlow  = -3

	xhigh = 3

	yhigh = 4

	Nbreaks = 80

	y = hist(delta.posterior, Nbreaks, prob=T, border="white", ylim=c(0,yhigh), xlim=c(xlow,xhigh), lwd=2, lty=1, ylab="Density", xlab=" ", main=" ", axes=F) 

	#white makes the original histogram -- with unwanted vertical lines -- invisible

	lines(c(y$breaks, max(y$breaks)), c(0,y$density,0), type="S", lwd=2, lty=1) 

	axis(1, at = c(-4,-3,-2,-1,0,1,2,3,4), lab=c("-4","-3","-2","-1","0", "1", "2", "3", "4"))

	axis(2)

	mtext(expression(delta), side=1, line = 2.8, cex=2)

	#now bring in log spline density estimation:

	par(new=T)

	plot(fit.posterior, ylim=c(0,yhigh), xlim=c(xlow,xhigh), lty=1, lwd=1, axes=F)

	points(0, dlogspline(0, fit.posterior),pch=19, cex=2)

	# plot the prior:

	par(new=T)

	plot ( function( x ) dcauchy( x, 0, 1 ), xlow, xhigh, ylim=c(0,yhigh), xlim=c(xlow,xhigh), lwd=1, lty=1, ylab=" ", xlab = " ", axes=F) 

	axis(1, at = c(-4,-3,-2,-1,0,1,2,3,4), lab=c("-4","-3","-2","-1","0", "1", "2", "3", "4"))

	axis(2)

	points(0, dcauchy(0), pch=19, cex=2)

	text(-1,3.5, expression(H[0]:  delta == 0),cex=2)

	text(-1,3, expression(H[1]:  delta != 0),cex=2)

	text(-1,2.5, bquote(.(BFtext)  == .(BFplot)),cex=2)

	###########################################################################

	# Analysis 2. H2: delta ~ Cauchy^- (restricted case, negative values only)

	###########################################################################

	# Collect posterior samples across all chains:

	delta.posterior  = samples$sims.list$delta

	# selects only negative delta's:

	delta.posterior  = delta.posterior[which(delta.posterior<0)]

	#============ BFs based on logspline fit ===========================

	fit.posterior = logspline(delta.posterior,ubound=0) # NB. note the bound

	# 95% credible interval:

	x0=qlogspline(0.025,fit.posterior)

	x1=qlogspline(0.975,fit.posterior)

	posterior     = dlogspline(0, fit.posterior) # this gives the pdf at point delta = 0

	prior         = priorforh2                 # height of order--restricted prior at delta = 0

	BF10          = prior/posterior

	BF01          = posterior/prior

	cat("Bayes Factor(BF20) for H2 delta<0 over H0 delta=0: ",BF10,"\r\n")

	cat("Bayes Factor(BF02) for H0 delta=0 over H2 delta<0: ",BF01,"\r\n")

	cat("---\r\n")

	if (BF10>=BF01){

	  BFplot=BF10

	  BFtext = bquote(BF[2][0])

	}else{

	  BFplot=BF01

	  BFtext = bquote(BF[0][2])

	}

	BFplot=round(BFplot,2)

	#============ Plot Prior and Posterior  ===========================

	par(cex.main = 1.5, mar = c(5, 6, 4, 5) + 0.1, mgp = c(3.5, 1, 0), cex.lab = 1.5,

	    font.lab = 2, cex.axis = 1.3, bty = "n", las=1)

	xlow  = -3

	xhigh = 0

	yhigh = 12

	Nbreaks = 80

	y = hist(delta.posterior, Nbreaks, prob=T, border="white", ylim=c(0,yhigh), xlim=c(xlow,xhigh), lwd=2, lty=1, ylab="Density", xlab=" ", main=" ", axes=F) 

	#white makes the original histogram -- with unwanted vertical lines -- invisible

	lines(c(y$breaks, max(y$breaks)), c(0,y$density,0), type="S", lwd=2, lty=1) 

	axis(1, at = c(-3,-2,-1,0), lab=c("-3","-2","-1","0"))

	axis(2)

	mtext(expression(delta), side=1, line = 2.8, cex=2)

	#now bring in log spline density estimation:

	par(new=T)

	plot(fit.posterior, ylim=c(0,yhigh), xlim=c(xlow,xhigh), lty=1, lwd=1, axes=F)

	points(0, dlogspline(0, fit.posterior),pch=19, cex=2)

	# plot the prior:

	par(new=T)

	plot ( function( x ) 2*dcauchy( x, 0, 1 ), xlow, xhigh, ylim=c(0,yhigh), xlim=c(xlow,xhigh), lwd=1, lty=1, ylab=" ", xlab = " ", axes=F) 

	axis(1, at = c(-3,-2,-1,0), lab=c("-3","-2","-1","0"))

	axis(2)

	points(0, 2*dcauchy(0), pch=19, cex=2)

	text(-2,10, expression(H[0]:  delta == 0),cex=2)

	text(-2,8, expression(H[2]:  delta < 0),cex=2)

	text(-2,6, bquote(.(BFtext)  == .(BFplot)),cex=2)

	###########################################################################

	# Analysis 3. H3: delta ~ Cauchy^+ (restricted case, positive values only)

	###########################################################################

	# Collect posterior samples across all chains:

	delta.posterior  = samples$sims.list$delta  

	# selects only positive delta's:

	delta.posterior  = delta.posterior[which(delta.posterior>0)]

	#============ BFs based on logspline fit ===========================

	fit.posterior = logspline(delta.posterior,lbound=0) # NB. note the bound

	# 95% credible interval:

	x0=qlogspline(0.025,fit.posterior)

	x1=qlogspline(0.975,fit.posterior)

	posterior     = dlogspline(0, fit.posterior) # this gives the pdf at point delta = 0

	prior         = priorforh3                   # height of order--restricted prior at delta = 0

	BF10          = prior/posterior # Bayes factor for H3 over H0; its inverse (BF01) is the Bayes factor for H0 over H3

	BF01          = posterior/prior

	cat("Bayes Factor(BF30) for H3 delta>0 over H0 delta=0: ",BF10,"\r\n")

	cat("Bayes Factor(BF03) for H0 delta=0 over H3 delta>0: ",BF01,"\r\n")

	cat("---\r\n")

	if (BF10>=BF01){

	  BFplot=BF10

	  BFtext = bquote(BF[3][0])

	}else{

	  BFplot=BF01

	  BFtext = bquote(BF[0][3])

	}

	BFplot=round(BFplot,2)

	#============ Plot Prior and Posterior  ===========================

	par(cex.main = 1.5, mar = c(5, 6, 4, 5) + 0.1, mgp = c(3.5, 1, 0), cex.lab = 1.5,

	    font.lab = 2, cex.axis = 1.3, bty = "n", las=1)

	xlow  = 0

	xhigh = 3

	yhigh = 12

	Nbreaks = 80

	y = hist(delta.posterior, Nbreaks, prob=T, border="white", ylim=c(0,yhigh), xlim=c(xlow,xhigh), lwd=2, lty=1, ylab="Density", xlab=" ", main=" ", axes=F) 

	#white makes the original histogram -- with unwanted vertical lines -- invisible

	lines(c(y$breaks, max(y$breaks)), c(0,y$density,0), type="S", lwd=2, lty=1) 

	axis(1, at = c(0,1,2,3,4), lab=c("0", "1", "2", "3", "4"))

	axis(2)

	mtext(expression(delta), side=1, line = 2.8, cex=2)

	#now bring in log spline density estimation:

	par(new=T)

	plot(fit.posterior, ylim=c(0,yhigh), xlim=c(xlow,xhigh), lty=1, lwd=1, axes=F)

	points(0, dlogspline(0, fit.posterior),pch=19, cex=2)

	# plot the prior:

	par(new=T)

	plot ( function( x ) 2*dcauchy( x, 0, 1 ), xlow, xhigh, ylim=c(0,yhigh), xlim=c(xlow,xhigh), lwd=1, lty=1, ylab=" ", xlab = " ", axes=F) 

	axis(1, at = c(0,1,2,3,4), lab=c("0", "1", "2", "3", "4"))

	axis(2)

	points(0, 2*dcauchy(0), pch=19, cex=2)

	text(2,10, expression(H[0]:  delta == 0),cex=2)

	text(2,8, expression(H[3]:  delta > 0),cex=2)

	text(2,6, bquote(.(BFtext)  == .(BFplot)),cex=2)

OpenBugs code

Ttest_2.txt

	model

	{ 

	  for (i in 1:n1)

	  {

	    group1[i] ~ dnorm(muX,lambdaXY)

	  }

	  for (i in 1:n2)

	  {

	    group2[i] ~ dnorm(muY,lambdaXY)

	  }

	  lambdaXY <- pow(sigmaXY,-2)

	  delta       ~ dnorm(0,lambdaDelta)

	  lambdaDelta ~ dchisqr(1)

	  sigma    ~ dnorm(0,sigmaChi)

	  sigmaChi ~ dchisqr(1)

	  sigmaXY <- abs(sigma)

	  mu    ~ dnorm(0,muChi)

	  muChi ~ dchisqr(1)

	  alpha <- delta*sigmaXY    

	  muX <- mu + alpha*0.5 

	  muY <- mu - alpha*0.5 

	}

Additional information

The model specifications are:

	Group 1 ~ Normal distribution ( μ + α/2, σ^2) #These are the only known values

	Group 2 ~ Normal distribution ( μ - α/2, σ^2) #These are the only known values

	σ ~ Cauchy distribution(0,1)+

	μ ~ Cauchy distribution(0,1)

	α = σ * δ

	δ ~ Cauchy distribution(0,1) # This is assumed for H1 for the model while H0 is considered δ=0
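The OpenBUGS code above never calls a Cauchy distribution directly. Instead, it draws δ from a normal distribution whose precision follows a chi-square distribution with 1 degree of freedom, and this mixture is a standard Cauchy. The simulation below is a small sketch of that equivalence in base R; the sample size and seed are arbitrary choices.

```r
set.seed(42)
n <- 100000

# Mirror the BUGS lines: lambdaDelta ~ dchisqr(1); delta ~ dnorm(0, lambdaDelta).
# BUGS parameterizes dnorm by precision (1 / variance), so sd = 1 / sqrt(lambda).
lambda <- rchisq(n, df = 1)
delta  <- rnorm(n, mean = 0, sd = 1 / sqrt(lambda))

# Compare the simulated quantiles with those of a standard Cauchy.
probs <- c(0.25, 0.5, 0.75, 0.9)
print(round(rbind(simulated = quantile(delta, probs),
                  cauchy    = qcauchy(probs)), 2))
```

The same construction is used for μ, while σ is the absolute value of such a draw, which gives the positive-only Cauchy noted in the specification.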

Bayesian Statistical Hypothesis Testing for HCI

2012-01-01T00:00:00-08:00

Disclaimer (Please read this first!)

Evidence for and against the null hypothesis is possible :-)

Found anything interesting? Any comments or errors? Contact me :-)

I started this wiki to gather as many of the procedures (and code) that currently exist in Bayesian statistics as I can. The goal is to create an easy to read, easy to apply guide for each method depending on your data and your design. Although this is geared towards HCI research, most of these methods can be applied in other scientific disciplines such as the social sciences, psychology and others. The philosophy behind this guide is to always keep things simple. Just as I don't ask visitors to this website to understand HTTP requests, the same should apply to someone who wants to perform Bayesian statistics. You only need to know what your input is and how to interpret the output. Therefore, the emphasis here is taken away from the mathematical aspects of Bayesian statistics.

My inspiration for developing such content was the site Statistics for HCI Research by Koji Yatani. It is an excellent guide for NHST analysis for HCI.

Keep in mind that I am not an expert in statistics. The content provided here is basically what I learned from my experience of HCI research and from reading different online/offline materials. I always double-check the content before posting, but it may still be not 100% accurate or even wrong. So, use the content on this website at your discretion. I take no responsibility for any kind of consequences, such as you having done a wrong analysis after reading my wiki, your papers not getting into a conference or a journal, or your adviser not liking your analysis.

I also strongly recommend you get a second opinion on your analysis from other kinds of resources before you really perform a test. If you have found any factual errors, please email me (tsikerdekis@gmail.com). Your comments would be greatly appreciated. Also, I am always looking for R (MATLAB, Stata) code that can perform hypothesis testing, so don't hesitate to let me know about it.

Basics of statistics (A quick introduction to things you need to know)

There are 4 types of variables that you need to know and identify.

Interval/Numerical/Ratio are ordered sets of data (usually numbers) that maintain an equal distance between their values (e.g., the distance between 2 and 3 is equal to the distance between 3 and 4).
Ordinal are ordered sets of data that do not show an equal distance between their elements (e.g., "very strong" is definitely higher than "strong" and the same applies for "extremely strong", but the distance between these elements is not necessarily equal).
Nominal/Categorical are sets of data with no order (e.g., countries).
Dichotomous are categorical variables that have only two levels (e.g., sex can have only the values male and female).
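As a rough illustration, the four types map onto R data types as sketched below. The survey data frame and all of its values are hypothetical and made up for this example.

```r
# Hypothetical survey data; all names and values are made up for illustration.
survey <- data.frame(
  task_time   = c(12.5, 8.1, 15.0, 9.7),                          # interval/ratio
  agreement   = factor(c("strong", "very strong", "strong", "extremely strong"),
                       levels = c("strong", "very strong", "extremely strong"),
                       ordered = TRUE),                            # ordinal
  country     = factor(c("Greece", "Japan", "Canada", "Japan")),  # nominal
  used_before = factor(c("yes", "no", "no", "yes"))               # dichotomous
)

str(survey)

# Ordered factors support comparisons between levels; plain factors do not.
print(survey$agreement > "strong")

# A dichotomous variable is simply a factor with two levels.
print(nlevels(survey$used_before))
```

Identifying the type of each variable first makes it easier to pick a row and column in the table of tests further down.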

You will also need a general understanding of the Bayes Factor. However, I have also linked to it from every procedure's interpretation section.

Finally, Bayesian procedures have their pros and cons, just like NHST analysis (guide development in progress), but the single most appealing thing for me is the power to provide evidence for the null hypothesis. Yes, with Bayesian methods you can do it!

What statistical test should I use?

While with NHST analysis the answers are straightforward, Bayesian statistics is still a field under development. This is especially true when it comes to hypothesis testing. The following is a set of techniques that I managed to gather.

	Types of your dependent/independent variables
	Interval/Ratio	Interval/Ratio, Ordinal	Ordinal, Categorical	Dichotomous
Compare two unpaired groups	Bayesian t-test	Bayesian Mann-Whitney test	Bayesian test of independence	Bayesian Binomial
Compare two paired groups	--	--	--	--
Find relationship between two variables	--	--	--	--