So, I figured why not just create a Shiny R application that builds estimates depending on the input. The current formula is a crude estimate: it assumes there is a population of N positions, each of which becomes available with an independent probability p, and a competing pool of w applicants who are chosen at random. For CS and CE jobs in the US and Canada, the figure based on CRA's survey data is around 50%.
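To make the assumptions concrete, here is a minimal sketch of one possible reading of that model: the expected number of open positions is N times p, and each of the w applicants is equally likely to fill one. This is an illustration only and not necessarily the formula used in the Shiny application; job_probability is a hypothetical helper name.

```shell
# Hypothetical helper, NOT the formula from the Shiny app: one simple reading
# of the model described above.
# $1 = N (positions), $2 = p (probability a position opens), $3 = w (applicants)
job_probability() {
    awk -v n="$1" -v p="$2" -v w="$3" 'BEGIN {
        expected_open = n * p        # expected number of positions that open
        prob = expected_open / w     # share of applicants that can be hired
        if (prob > 1) prob = 1       # a probability cannot exceed 1
        printf "%.3f\n", prob
    }'
}
```

For example, job_probability 100 0.5 200 evaluates the case of 100 positions, a 50% chance that each one opens, and 200 applicants.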
The Shiny R application for playing around with the input can be found here: https://tsikerdekis.shinyapps.io/AcademicJobProbability/
The source can be found at: https://github.com/tsikerdekis/AcademicJobProbability (pull requests are more than welcome)
Before I go into more details, I need to clarify the obvious. Proxies (such as SOCKS5) do not encrypt your traffic, so they hide nothing from your ISP. If you were to look at the packets transmitted from your computer on the way to the proxy server, you could see the final destination of each packet. After packets leave the proxy server, your IP cannot be discovered unless it is included, unencrypted, in the payload of the packet sent to the server. Notice I am using the term packet loosely here. So if you don't want a website, a torrent swarm or its peers, or Skype to know your IP address, proxies will do the trick. Your ISP will still be able to see what you are doing. The only workaround for this is an SSH tunnel or a VPN, which is not going to be discussed in this post.
Your IP can leak in a number of ways. It all boils down to bad programming on the side of applications. Some are designed in such a way that when your proxy server is down, they just redirect all traffic through the normal route. Other times, some of the traffic is sent through the proxy while some packets are sent outside the proxy. All it takes is one leaked packet and you have failed at what you were attempting to do (hiding your IP from the destination server).
I am providing a solution for Ubuntu, but a Windows solution would work in the same way. Mac users may also be able to follow this guide, using the ipfw command instead, which is similar to iptables (Linux's firewall).
The solution has two parts: a) block all outgoing leaking traffic and b) don't answer any connections that don't come from the proxy. The latter is not necessary for browsing, but it is for torrents if you really want to appear to the outside world as if you are not running a torrent client.
Blocking an incoming connection is relatively easy. I am assuming that if you are behind a router, you have already forwarded the relevant port for your application to the computer running it. Sometimes UPnP takes care of that. So let's say that port 8000 is the one for your application. All you need to do is tell your firewall to accept packets to this port only when they come from your proxy and drop the rest. Let's say that your proxy's IP is 10.10.10.10. As root you just run:
iptables -F
iptables -A INPUT -p tcp -s 10.10.10.10 --dport 8000 -j ACCEPT
iptables -A INPUT -p udp -s 10.10.10.10 --dport 8000 -j ACCEPT
iptables -A INPUT -p udp --dport 8000 -j DROP
iptables -A INPUT -p tcp --dport 8000 -j DROP
The first command flushes (deletes) all existing rules in the firewall; by default there aren't any.
Windows makes it a bit easier to restrict rules to a single application. Linux does not. My solution is to run the application as another user and apply the rules to that user. It is definitely safer this way, but it takes a bit of work. I won't go into details on how to create a new user and run the application as that user, but you can find guides online. Assuming you have this ready and have verified using ps -faux that your application runs as that user (IMPORTANT, since the rules will apply only to that user), you can type the following as root.
iptables -A OUTPUT -p tcp -m owner --uid-owner testing -d 10.10.10.10 -j ACCEPT
iptables -A OUTPUT -p udp -m owner --uid-owner testing -d 10.10.10.10 -j ACCEPT
iptables -A OUTPUT -p udp -m owner --uid-owner testing -d 192.168.0.0/24 -j ACCEPT
iptables -A OUTPUT -p tcp -m owner --uid-owner testing -d 192.168.0.0/24 -j ACCEPT
iptables -A OUTPUT -p tcp -m owner --uid-owner testing -d 127.0.0.1 -j ACCEPT
iptables -A OUTPUT -m owner --uid-owner testing -j DROP
Basically, these rules accept outgoing traffic from this user to 10.10.10.10, to all IPs in the LAN (you don't have to allow this, though) and to localhost. The localhost rule is there because some programs use it to communicate with others. You have to adjust the rules to your own setup, but the important part is that you DROP packets sent to any other IP. If you try to do anything else as that user, you will find that no websites will open unless the browser is configured to use the proxy.
If you combine all of the incoming and outgoing rules into one file, make it executable and place it in /etc/network/if-pre-up.d/, your firewall settings will be reapplied after every reboot.
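As a rough sketch of what such a combined file could look like, the function below replays the incoming and outgoing rules from this post. The variable names (PROXY_IP, APP_PORT, APP_USER, LAN, IPT) and the function name apply_proxy_rules are my own for illustration; the values are the example ones used above. Note that it starts with iptables -F, so it wipes any other rules you may have.

```shell
#!/bin/sh
# Sketch of a rules file for /etc/network/if-pre-up.d/ (names are illustrative).
PROXY_IP="10.10.10.10"     # example proxy address used in this post
APP_PORT="8000"            # example application port
APP_USER="testing"         # user the application runs as
LAN="192.168.0.0/24"       # local network (optional to allow)
# Set IPT="echo iptables" to print the rules instead of applying them.
IPT="${IPT:-iptables}"

apply_proxy_rules() {
    $IPT -F    # flush all existing rules
    for proto in tcp udp; do
        # Incoming: only the proxy may reach the application port.
        $IPT -A INPUT -p "$proto" -s "$PROXY_IP" --dport "$APP_PORT" -j ACCEPT
        $IPT -A INPUT -p "$proto" --dport "$APP_PORT" -j DROP
        # Outgoing: the application user may talk to the proxy and the LAN.
        $IPT -A OUTPUT -p "$proto" -m owner --uid-owner "$APP_USER" -d "$PROXY_IP" -j ACCEPT
        $IPT -A OUTPUT -p "$proto" -m owner --uid-owner "$APP_USER" -d "$LAN" -j ACCEPT
    done
    # Localhost traffic, which some programs use to talk to each other.
    $IPT -A OUTPUT -p tcp -m owner --uid-owner "$APP_USER" -d 127.0.0.1 -j ACCEPT
    # Everything else from the application user is dropped.
    $IPT -A OUTPUT -m owner --uid-owner "$APP_USER" -j DROP
}
```

In the installed file, the last line would be a call to apply_proxy_rules. Running it first with IPT="echo iptables" lets you read the generated commands before anything touches the firewall.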
A way to see what packets are hitting your interface is to use tcpdump. This shows incoming packets before they pass through the firewall and outgoing packets that already passed through the firewall.
sudo tcpdump port 8000 -i wlan0
Here is a sample of what you would expect to see:
17:30:57.219187 IP michael-netbook.local.8000 > 10.10.10.10.42869: UDP, length 111
17:30:57.430905 IP 10.10.10.10.42869 > michael-netbook.local.8000: UDP, length 30
17:30:57.461266 IP 10.10.10.10.42869 > michael-netbook.local.8000: UDP, length 380
17:30:57.461473 IP michael-netbook.local.8000 > 10.10.10.10.42869: UDP, length 30
17:30:57.492072 IP 10.10.10.10.42869 > michael-netbook.local.8000: UDP, length 380
17:30:57.492286 IP michael-netbook.local.8000 > 10.10.10.10.42869: UDP, length 30
17:30:57.502889 IP 10.10.10.10.42869 > michael-netbook.local.8000: UDP, length 380
17:30:57.503056 IP michael-netbook.local.8000 > 10.10.10.10.42869: UDP, length 30
17:30:57.517659 IP 10.10.10.10.42869 > michael-netbook.local.8000: UDP, length 380
17:30:57.517858 IP michael-netbook.local.8000 > 10.10.10.10.42869: UDP, length 33
It is likely that you will still see incoming traffic that does not come from the proxy. This can happen because a) you had an open connection before applying the rules and activating the proxy (this will persist for a while) or b) machines on the internet initiated port scans for whatever reason. If your IP is dynamic, you are likely to see (b) traffic, mainly intended for other users who had your IP before you got it.
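To focus on exactly that non-proxy traffic, tcpdump's filter language can exclude the proxy host. A small sketch, where leak_filter is a hypothetical helper that only builds the filter string:

```shell
# Hypothetical helper: build a tcpdump filter that matches traffic on the
# application port which does NOT involve the proxy.
# $1 = application port, $2 = proxy IP
leak_filter() {
    printf 'port %s and not host %s\n' "$1" "$2"
}
```

It could be used as sudo tcpdump -n -i wlan0 "$(leak_filter 8000 10.10.10.10)". Keep in mind that this only watches the application port; outgoing connections that use other source ports would need a broader filter.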
But is the firewall working? Well, let's see:
michael@michael-netbook:~$ sudo iptables -nvx -L INPUT
Chain INPUT (policy ACCEPT 5814 packets, 2633217 bytes)
pkts bytes target prot opt in out source destination
0 0 ACCEPT tcp -- * * 10.10.10.10 0.0.0.0/0 tcp dpt:8000
4210 3556857 ACCEPT udp -- * * 10.10.10.10 0.0.0.0/0 udp dpt:8000
0 0 DROP udp -- * * 0.0.0.0/0 0.0.0.0/0 udp dpt:8000
0 0 DROP tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:8000
Ideally, you will not see dropped packets in the counters, but even if you do, that is a good thing. It means that someone tried to send you packets on port 8000 directly to your IP and your firewall blocked them. For the rest of the world, the port appears closed, as if you don't have an application listening.
How about your outgoing traffic?
michael@michael-netbook:~$ sudo iptables -nvx -L OUTPUT
Chain OUTPUT (policy ACCEPT 1028 packets, 388980 bytes)
pkts bytes target prot opt in out source destination
2585 652144 ACCEPT tcp -- * * 0.0.0.0/0 10.10.10.10 owner UID match 130
4074 336713 ACCEPT udp -- * * 0.0.0.0/0 10.10.10.10 owner UID match 130
0 0 ACCEPT udp -- * * 0.0.0.0/0 192.168.0.0/24 owner UID match 130
552 314414 ACCEPT tcp -- * * 0.0.0.0/0 192.168.0.0/24 owner UID match 130
826 423050 ACCEPT tcp -- * * 0.0.0.0/0 127.0.0.1 owner UID match 130
0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 owner UID match 130
Ideally, this should also show no dropped packets, but even if it does, it just means that everything is working: your application attempted to send something that bypassed the proxy and your firewall blocked the attempt.
But don't take my word for it. Set up these rules and then remove the proxy. Try using your application and monitor the traffic. Does anything work? If not, then your firewall is doing its job: allowing traffic only through the proxy, even if programs attempt to bypass your settings.
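If you want to check the counters from a script rather than by eye, the listing can be summed with awk. This is a small sketch that assumes the column layout shown in the listings above (pkts, bytes, target); dropped_packets is a hypothetical helper name.

```shell
# Hypothetical helper: sum the packet counters of all DROP rules.
# Reads the output of "iptables -nvx -L <CHAIN>" on standard input.
dropped_packets() {
    awk '$3 == "DROP" { total += $1 } END { print total + 0 }'
}
```

For example: sudo iptables -nvx -L OUTPUT | dropped_packets. A number that keeps growing while the proxy is down is the firewall catching leak attempts.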
This is the quickest way to build an almost blank copy of the MediaWiki database used by Wikipedia. You will need a typical LAMP server.
MediaWiki uses InnoDB tables. By default, all InnoDB tables are stored in a single file on disk. That file cannot shrink and can cause problems. It is best to use the MySQL option that creates a separate file on disk for each InnoDB table. To do this, you need to do the following:
Find the [mysqld] part in the config file and add:
Save the file and then on the terminal restart mysql:
Note that any existing InnoDB tables will remain in the large ibdata file, but any newly created tables will get their own file on disk.
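As a minimal sketch of the config change, assuming the option in question is MySQL's innodb_file_per_table setting and that your config file has a [mysqld] section, the edit could be scripted like this. enable_file_per_table is a hypothetical helper name, and the path you pass in depends on your installation.

```shell
# Hypothetical helper: add innodb_file_per_table under [mysqld] in the given
# MySQL config file. $1 = path to the config file
enable_file_per_table() {
    cnf="$1"
    # Do nothing if the option is already present.
    grep -q '^innodb_file_per_table' "$cnf" && return 0
    # Copy every line and add the option right after the [mysqld] header.
    awk '{ print } /^\[mysqld\]/ { print "innodb_file_per_table = 1" }' "$cnf" > "$cnf.tmp" &&
        mv "$cnf.tmp" "$cnf"
}
```

Run as root, this would be called as enable_file_per_table /etc/mysql/my.cnf (the path is an assumption for Ubuntu), followed by a restart of MySQL, typically sudo service mysql restart.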
You will need to do some fine tuning on mysql for variables such as: innodb_buffer_pool_size, innodb_log_buffer_size, innodb_additional_mem_pool_size. You will have to investigate a bit to see what's best.
Now you need to install MediaWiki by following its Quick Installation Guide. For the most part, if you know your root password for MySQL, MediaWiki can automatically set up the database, the tables and the user (if you don't want root to be the user accessing the database).
At this point, if you know exactly which columns of a table you are going to need, you may want to shrink some of the other fields, so that they still exist and avoid errors but don't occupy as much space. A good example is the text table, which contains two blob fields that track changes. If you are not interested in these changes, you could turn these blob fields into varchar(2) or something similar and save space.
After the installation, you will need to figure out which dump contains the data that you want. There are many, and they contain dumps for different tables. You can browse some of the dumps here. If you want a Wikipedia in another language, you have to change "enwiki" to the prefix of the language that you are interested in (e.g., elwiki for Greek Wikipedia, cswiki for Czech Wikipedia, eswiki for Spanish Wikipedia).
Use this script to create a filelist.txt file containing all the files to be downloaded. You will need a proper regular expression to capture the names of the files automatically. As an alternative, you could type all file names manually into a file named filelist.txt. You will also need to set the url variable to the directory containing the dumps that interest you.
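To give an idea of the kind of regular expression involved, here is a minimal sketch that is not the linked script. It reads the HTML of a dump directory listing and prints the matching file names. The helper name list_dump_files and the "pages-meta-history" file name pattern are assumptions; adjust the pattern to the dump files you actually need.

```shell
# Hypothetical helper: read a dump index page on standard input and print the
# matching dump file names, one per line, without duplicates.
# $1 = wiki prefix (e.g. enwiki). The file name pattern is an assumption.
list_dump_files() {
    grep -oE "$1-[0-9]+-pages-meta-history[0-9]*\.xml[-p0-9]*\.(7z|bz2)" | sort -u
}
```

With the url variable pointing at the dump directory, something like curl -s "$url" | list_dump_files enwiki > filelist.txt would produce the list.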
For this step you will need to save a couple of scripts. You will also need the filelist.txt file from the previous step. Some of the files come with instructions that you need to follow. You may also need to install the p7zip Ubuntu package.
Save all of the following on your disk (same directory).
preimport.sql (source Brian Stempin)
postimport.sql (source Brian Stempin)
mwimport.pl (original source here)
adddb.sh - You will need to change the urll variable to the url that you will be using. Wikipedia releases some files under bz2 compression and other files under 7z. I left a line commented out in this script that you need to enable in case your files are bz2 and not 7z (don't forget to comment out the line underneath it that deals with 7z). Finally, you will need to add your MySQL username, password and database.
The script will retrieve the first file in filelist.txt, extract it and process it using mwimport.pl, check whether the extraction was successful and finally add the information to the database. After that, it will delete the file and carry on with the second file in filelist.txt, all the way to the end.
You will preferably want to do this inside a screen session. If you don't have screen installed:
Ctrl+A followed by D will detach the screen. It stays active in the background. To view the progress you can enter that screen again by typing "screen -r " and hitting tab to fill in the number of the screen automatically.
There is a scale used to determine how strong the evidence presented by the Bayes factor is. The scale was developed by Harold Jeffreys in his book "Theory of Probability" (H. Jeffreys (1961). The Theory of Probability (3 ed.). Oxford. p. 432).
Bayes Factor | Strength of Evidence |
---|---|
< 1:1 | Negative (Supports the opposite model) |
1:1 to 3:1 | Barely worth mentioning |
3:1 to 10:1 | Substantial |
10:1 to 30:1 | Strong |
30:1 to 100:1 | Very strong |
> 100:1 | Decisive |
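The scale is easy to apply mechanically. A small sketch, where jeffreys_label is a hypothetical helper and the handling of values that fall exactly on a boundary (assigned to the higher category) is my own assumption:

```shell
# Hypothetical helper: map a Bayes factor (e.g. 7.67 for 7.67:1) to the label
# from the table above. Boundary values go to the higher category (assumption).
jeffreys_label() {
    awk -v bf="$1" 'BEGIN {
        if      (bf < 1)   print "Negative (Supports the opposite model)"
        else if (bf < 3)   print "Barely worth mentioning"
        else if (bf < 10)  print "Substantial"
        else if (bf < 30)  print "Strong"
        else if (bf < 100) print "Very strong"
        else               print "Decisive"
    }'
}
```

For example, jeffreys_label 7.67 falls in the 3:1 to 10:1 band.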
The following test behaves a lot like the chi-square test of independence. It can work with ordinal, categorical and even dichotomous variables (any case that can give you a contingency table).
This method is based on the book "Bayesian Computation With R" by Jim Albert. If you want to learn more about the model and the code you can read the book or the article.
For the procedure you need R and the LearnBayes package, which can be installed in R using the command install.packages('LearnBayes').
You need to type the data for your contingency table or feed it to your tabledata variable. Additionally, you need to specify the rows and columns for your table.
The hypotheses are:
Always verify that your table looks the way it should. The code is still a bit buggy and sometimes rows and columns get swapped. In case your table comes out transposed, just swap the numbers for rows and columns.
The code performs two analyses. The first tests the independence hypothesis against the dependence hypothesis. The second tests the independence hypothesis against a hypothesis close to independence.
Read the Bayes Factor page for how you should interpret these results.
In this example, H0 is 2.14 times more likely than H1. The evidence is not really strong, however. Additionally, the second test failed to provide support for a model close to independence.
This method can be used in the same circumstances in which one would use the regular independent t-test: when you want to statistically compare the means of two groups. The data in both groups should be normally distributed.
This method is based on the book "A Practical Course in Bayesian Graphical Modeling" by Michael Lee and Eric-Jan Wagenmakers. Additionally, a published scientific article can be found here. Either or both are good to cite when using this method. Some of the code may have been changed in order to make the analysis easier to apply. If you want to learn more about the model and the code you can read the book or the article.
For the procedure you need R and Openbugs.
Set the first three lines according to your setup and data, or feed the variables your own data. You also need to set the file that contains the Openbugs model, which you can find at the end of this page. If you wish to change the priors you can; just remember that you need to adjust the priors for hypothesis 2 and hypothesis 3 so that they apply only to positive or negative numbers. You can also change the number of iterations and the burn-in if you want to improve your results. These need to be reported in your paper later on.
The hypotheses for this test are:
The output produces a set of results in text along with the probability distribution plots for each one of them. Both are useful for making a decision about your hypothesis. As an example, you can use the graph to determine whether, at the point δ=0, your posterior (your belief after seeing the data) is higher or lower than the prior (your initial belief). If the posterior is higher than the prior at δ=0, the data strengthen your belief that the null hypothesis (H0) is true. If the posterior is lower than the prior, the data weaken your belief that the null hypothesis is true. Bayes factors are automatically reported on the graphs.
Please read the Bayes Factor page for how to interpret it.
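The comparison of the two curves at δ=0 can be turned into a number. A minimal sketch, assuming the Bayes factor in favor of H0 is computed as the ratio of the posterior height to the prior height at δ=0 (the Savage-Dickey density ratio); this is my reading of the method, not code taken from the procedure itself, and bf01 is a hypothetical helper name.

```shell
# Hypothetical helper: Bayes factor in favor of H0 as the ratio of posterior
# to prior density at delta = 0 (Savage-Dickey ratio, assumed here).
# $1 = posterior height at delta = 0, $2 = prior height at delta = 0
bf01() {
    awk -v post="$1" -v prior="$2" 'BEGIN { printf "%.2f\n", post / prior }'
}
```

A result above 1 means the posterior is higher than the prior at δ=0, which supports H0; a result below 1 supports the alternative.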
In the first two cases the evidence for H0 is "Barely worth mentioning". The third result (7.67), however, is considered "Substantial" evidence in favor of H0: when we consider whether group 1 has a bigger effect than group 2, there is substantial evidence that this is unlikely (supporting H0).
When publishing, you will also have to report the process by which you obtained your results, and the number of iterations for the MCMC test along with the burn-in value and the number of chains (in this case 3).
The model specifications are:
Found anything interesting? Any comments or errors? Contact me :-)
I started this wiki so that I can try to gather as many of the procedures (and code) that currently exist in Bayesian statistics as I can. The goal is to create an easy to read, easy to apply guide for each method depending on your data and your design. Although this is geared towards HCI research, most of these methods can be applied in other scientific disciplines such as the social sciences, psychology and others. The philosophy behind this guide is to always keep things simple. Just as I don't ask the visitors of this website to understand HTTP requests, the same should apply to someone who wants to perform Bayesian statistics. You only need to know what your input is and how to interpret the output. Therefore, the emphasis here is taken away from the math aspects of Bayesian statistics.
My inspiration for developing such content was the site Statistics for HCI Research by Koji Yatani. It is an excellent guide for NHST analysis for HCI.
Keep in mind that I am not an expert in statistics. The content provided here is basically what I learned from my experience in HCI research and from reading different online and offline materials. I always double-check the content before posting, but it may still be not 100% accurate or even wrong. So, use the content on this website at your own discretion. I take no responsibility for any kind of consequences, such as you doing a wrong analysis after reading my wiki, your papers not getting into a conference or a journal, or your adviser not liking your analysis.
I also strongly recommend that you get a second opinion on your analysis from other resources before you actually perform a test. If you have found any factual errors, please email me (tsikerdekis@gmail.com). Your comments would be greatly appreciated. Also, I am always looking for R (Matlab, Stata) code that can perform hypothesis testing, so don't hesitate to let me know about it.
There are 4 types of variables that you need to know and identify.
You will also need a general understanding of the Bayes Factor. However, I have connected the link to every procedure's interpretation section as well.
Finally, Bayesian procedures have their pros and cons, just as NHST analysis does (guide development in progress), BUT the single most appealing thing for me is the power to provide evidence for the null hypothesis. Yes, with Bayesian methods you can do it!
While with NHST analysis the answers are straightforward, Bayesian statistics is still a field under development. This is especially true when it comes to hypothesis testing. The following is a set of techniques that I managed to gather.
Types of your dependent/independent variables | Interval/Ratio | Interval/Ratio, Ordinal | Ordinal, Categorical | Dichotomous |
---|---|---|---|---|
Compare two unpaired groups | Bayesian t-test | Bayesian Mann-Whitney test | Bayesian test of independence | Bayesian Binomial |
Compare two paired groups | -- | -- | -- | -- |
Find relationship between two variables | -- | -- | -- | -- |