How to evaluate your project
This post is about how to evaluate your project. Ideally you will do this about halfway through your work, but more realistically towards the end, just before you write up. However, it's important that you plan your evaluation early on, otherwise you run the risk of getting almost to the end of your work and finding that your evaluation process isn't going to give you any good results. So, the main thing you need to understand when you are planning is how evaluation works in the sciences and how to apply that understanding to your own projects.
Proof
Sometimes we see student dissertations which make claims like "this experiment has proved that ...". Almost always the student has used the word "proof" incorrectly, and will lose marks because they have given the impression that they have not understood the contribution that their work has made to the field they are working in. Very often students in this situation simply haven't understood how scientists use the word "proof".
Proof, in a scientific context, is a mathematical argument that is used to convince other mathematicians or scientists that a theorem (or a mathematical idea) is true. Proofs must never involve evidence or experiments, only arguments. There's an example proof of a simple theorem at the end of this post.
Once mathematicians are convinced that a proof is correct (and sometimes that is difficult in itself, if the proof is several hundred pages long) then it is irrefutable. This is very different to the sort of science that is advanced by experiments, where another scientist can find new data or evidence that shows that an old idea was wrong.
So, we generally say that a theorem can be proved correct whereas a hypothesis (or guess!) can only be tested via experiments. A hypothesis might turn out to be wrong if experimental data cannot be found to support the hypothesis, or contradictory evidence is found. If a lot of evidence is found to support a hypothesis we might call it a theory. Even so, a theory cannot be proved correct in all cases. For example, if you came up with a theory that said all atoms have a particular shape, you might invent a special microscope to look at atoms and find out if they have your shape. This would provide some evidence to support your theory. You couldn't, however, test every single atom in the Universe, so your hypothesis might well become a theory, but it can never be "proved" correct.
[Aside: there's a long literature in this sort of philosophy of science. If you are really interested, read Karl Popper on Falsification, AJ Ayer on Verification and Paul Feyerabend on scientific revolutions and Imre Lakatos on Proof.]
Scientific method
Scientific method is the way that scientists decide whether a particular hypothesis (or guess) is likely to be a good model for the way the world works. If most scientists accept that the hypothesis is likely to be true, then we call it a theory. Of course, even theories have limitations, and it may be that as more experiments are carried out we find that a different theory fits the evidence better, or that the theory only works in certain circumstances. This is exactly what happened in physics to Newton's laws of motion. It turns out that Newton's laws describe the world pretty well in most cases; they can certainly tell you when your train is likely to arrive at its destination. In other circumstances, for example when you are travelling very fast, close to the speed of light, or when you are dealing with very small particles like quarks, other theories (like Einstein's theories or quantum mechanics) better fit the data we have gathered. Of course, much of this work in areas like physics is driven by what we can measure and observe. Better telescopes mean better theories of cosmology, and so on.
In computer science we also have hypotheses that we can test. For example "functional programming languages can run just as efficiently as imperative languages", "online learning increases student engagement", "objects and inheritance improve code reuse in software companies", and so on.
To be a true hypothesis, and not just the opinion of the author, a statement must be refutable, that is, it must be possible for experiments to determine that the hypothesis is incorrect. The opposite statement to a hypothesis is called an alternative hypothesis. Examples for the hypotheses listed above would be "functional languages are necessarily slower (or faster!) than imperative ones", "online learning has no effect on student engagement" and "objects and inheritance have no effect on code reuse in software companies".
So, to evaluate your own research questions, you need to do the following:
- Devise a hypothesis.
- Form your alternative hypothesis.
- Plan an experiment that tests whether the hypothesis or the alternative hypothesis is true.
- Conduct your experiment.
- Analyse the results of your experiments.
- If the results are conclusive, STOP. Else, re-run the experiments, or devise a better experiment and repeat.
In a student project, you may not have time to repeat your experiments, especially if they involve people, but you should design your evaluation in such a way that this would be possible, were you to continue the work.
About experiments
A good experiment should test one variable and one variable only. So, if your hypothesis is "neural network algorithms run faster in C than C++" then you will probably want to implement some neural network algorithms in both languages. You should make sure that the programs are as similar as possible, except for the language you are using. If you implement slightly different algorithms, it may be the algorithm and not the language which is causing any change in performance you observe. In this case, the programming language is the independent variable, the algorithms are the controlled variable, and the speed is the dependent variable, which is the one being measured.
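To make the three kinds of variable concrete, here is a minimal sketch of a timing experiment in Python. It is not the C versus C++ comparison described above: the two dot-product functions, the input size and the repeat counts are all assumptions invented for illustration. The only thing that changes between runs is which implementation is called.

```python
import random
import timeit


def dot_loop(a, b):
    """Dot product using an explicit index loop."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_zip(a, b):
    """Dot product using zip and a generator expression."""
    return sum(x * y for x, y in zip(a, b))


def run_experiment(implementations, size=1000, repeats=5, number=200, seed=42):
    """Time each implementation on identical inputs.

    Independent variable: which implementation is called.
    Controlled variables: the input data (fixed seed), size, repeats, number.
    Dependent variable: the best elapsed time, in seconds.
    """
    rng = random.Random(seed)
    a = [rng.random() for _ in range(size)]
    b = [rng.random() for _ in range(size)]
    results = {}
    for name, func in implementations.items():
        times = timeit.repeat(lambda: func(a, b), repeat=repeats, number=number)
        results[name] = min(times)
    return results


if __name__ == "__main__":
    results = run_experiment({"loop": dot_loop, "zip": dot_zip})
    for name, seconds in results.items():
        print(f"{name:>6}: {seconds:.4f} s")
```

Even in a sketch like this there are variables that are not controlled, such as whatever else the machine is doing at the time, which is one reason to repeat the measurement several times.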
Bad science? The case of Usability tests
Many students undertaking projects who have developed a web application, web site or content management system (often in response to a client brief) ask about the most suitable evaluation method for their project. In these cases precisely what to evaluate is often less clear cut than an experiment with an independent variable and a controlled variable (as in the "neural network algorithms run faster in C than C++" example above).
In order to settle on an evaluation method for such cases it is often necessary to return to the client and question them about their goals for the application you have built for them. The first question you need to ask is “Can my application be evaluated by automatic means?” For applications that are evaluated for technical performance (network performance, speed of execution, resource efficiency, memory performance, robustness against attack, etc.) the answer is usually 'yes it can', and your evaluation may consist of running a number of automated performance tests and collating the results into comparison tables.
However, in cases where the client is interested in how humans (users) interact with the application and 'perform better' because of it, the evaluation solution is usually some form of 'user based test'. Your client may be interested in how well the application you have built supports users in achieving their goals. There are a number of parameters that could be tested:
- Effectiveness: Can users actually perform and complete the (desired) specified task?
- Efficiency: Can users do it quickly, without getting bored or frustrated?
- Satisfaction: Is it fun, or at least pleasant to use?
- Learnability: Can users 'pick up' the application without reaching for a manual or asking for help? Does it support learning? (see ISO 9241 section 11)
Different applications will have different emphases in terms of what you need to evaluate. Games, for example, need to be satisfying or challenging most of all. Terminal applications for call centres need to be efficient most of all. Most kinds of 'stand-alone' technology (car park ticket machines, vending machines, ATMs) need to be learnable most of all. All applications need to be effective.
At this point it is important to stress that working with real users introduces a large number of uncontrolled variables that cannot easily be 'designed out' of any user test you may undertake. The fundamental question about the 'scientific validity' of usability tests (i.e. whether you are 'really' evaluating user performance rather than the application's performance in the test) is very difficult (but not impossible) to answer.
There are two specific limitations with usability test data.
The reliability limitation: Is the data reliable?
No, if you are testing users who are not typical of the true intended user group, or if there are significant individual variations within the test group (this is made worse by small sample sizes*)
The validity limitation: Are the conditions under which the data was recorded reproducible?
No, if the test is run differently each time (users briefed differently, different equipment used, not quite the same test questions asked)
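The effect of small sample sizes can be illustrated with a short simulation. This is only a sketch: the population of task times (centred on 60 seconds with a 20 second spread) is an invented assumption, not data from any real test. It runs many imaginary usability tests and looks at how much the average task time jumps around from one test to the next.

```python
import random
import statistics


def spread_of_sample_means(sample_size, trials=2000, seed=1):
    """Return the standard deviation of the mean task time across many
    simulated usability tests, each with `sample_size` participants."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Hypothetical population: task times around 60 s with a 20 s spread.
        times = [rng.gauss(60, 20) for _ in range(sample_size)]
        means.append(statistics.mean(times))
    return statistics.stdev(means)


if __name__ == "__main__":
    for n in (5, 30):
        spread = spread_of_sample_means(n)
        print(f"n={n:>2}: spread of mean task time = {spread:.1f} s")
```

With five participants the average moves around far more between tests than it does with thirty, because individual differences are not averaged out.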
As Jeffrey Rubin points out [usability] “testing is always artificial” (Rubin, 1994, p.27). [Aside: There is a fascinating and long running debate on the scientific status of usability between Jakob Nielsen and Rolf Molich, particularly in relation to the small sample sizes championed by Nielsen. See here or here]
There is some good news about usability testing, however. Often it is entirely unnecessary to worry about the number of uncontrolled variables in a usability test because you are looking for indicators rather than proofs. In fact, if you think of a usability test as a 'design tool' rather than an 'experimental tool' you are closer to the way tests are used in the commercial world. This does not absolve you of the responsibility for designing and recording a test in which you have addressed the reliability and validity limitations, or for setting appropriate benchmark metrics for judging the success (or not) of the application in supporting effectiveness, efficiency, satisfaction or learnability. It does mean, however, that usability tests can, if designed properly, be a very good evaluation method for your project.
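As a sketch of what benchmark metrics might look like in practice, the Python below summarises a set of test sessions and compares them against targets agreed before the test. Everything here is hypothetical: the benchmark values, the field names and the five example sessions are placeholders invented for illustration, not recommended thresholds or real results.

```python
from statistics import mean

# Hypothetical benchmarks, agreed with the client before the test is run.
BENCHMARKS = {"completion_rate": 0.8, "mean_time_s": 90.0, "mean_satisfaction": 3.5}


def summarise(sessions):
    """Summarise sessions, each a dict with 'completed' (bool),
    'time_s' (float) and 'satisfaction' (a 1 to 5 rating)."""
    completed = [s for s in sessions if s["completed"]]
    return {
        # Effectiveness: share of users who finished the task.
        "completion_rate": len(completed) / len(sessions),
        # Efficiency: mean time taken by those who finished.
        "mean_time_s": mean(s["time_s"] for s in completed) if completed else float("inf"),
        # Satisfaction: mean rating across all users.
        "mean_satisfaction": mean(s["satisfaction"] for s in sessions),
    }


def meets_benchmarks(summary, benchmarks=BENCHMARKS):
    """Report, metric by metric, whether the benchmark was met."""
    return {
        "completion_rate": summary["completion_rate"] >= benchmarks["completion_rate"],
        "mean_time_s": summary["mean_time_s"] <= benchmarks["mean_time_s"],
        "mean_satisfaction": summary["mean_satisfaction"] >= benchmarks["mean_satisfaction"],
    }


if __name__ == "__main__":
    # Placeholder sessions, invented for this example.
    sessions = [
        {"completed": True, "time_s": 70.0, "satisfaction": 4},
        {"completed": True, "time_s": 85.0, "satisfaction": 4},
        {"completed": False, "time_s": 120.0, "satisfaction": 2},
        {"completed": True, "time_s": 95.0, "satisfaction": 5},
        {"completed": True, "time_s": 60.0, "satisfaction": 3},
    ]
    summary = summarise(sessions)
    print(summary)
    print(meets_benchmarks(summary))
```

The important design choice is that the benchmarks are written down before any user sits the test, so that the results are judged against them rather than the other way round.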
Mind your language: 'user friendly' and 'significant'.
There are two phrases you need to be very careful about using in your test reports.
User-friendly. Never use this phrase! Software cannot be friendly, it is not (yet) sentient. This is called ANTHROPOMORPHISM and should be avoided. To claim that an interface is 'user friendly' is also subjective and not testable. To claim, however, that an interface is USABLE is more sustainable as long as we measure against some predetermined metrics.
Significant. In normal English, "significant" means "important", while in statistics "significant" means "unlikely to be due to chance". For example, if I answer all the questions in a multiple-choice quiz randomly, sometimes I will still pass the test. However, if I take the quiz many times, the small percentage of attempts that I pass should show that those passes just happened "by chance" and were therefore "not significant" in statistical terms. A research finding may be true without being important. When statisticians say a result is "highly significant" they mean it is very unlikely that the result happened by chance. They do not (necessarily) mean it is highly important. Be careful when you claim to have found 'significant' results.
(http://www.surveysystem.com/signif.htm)
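The quiz example can be worked through with a little arithmetic. The sketch below assumes a quiz of 10 questions with 4 options each and a pass mark of 5; those numbers are made up for illustration. It uses the binomial distribution to work out how often pure guessing would reach the pass mark.

```python
from math import comb


def chance_of_passing(questions, options, pass_mark):
    """Probability of getting at least `pass_mark` answers right
    when every answer is a random guess."""
    p = 1 / options  # chance of guessing any one question correctly
    return sum(
        comb(questions, k) * p**k * (1 - p) ** (questions - k)
        for k in range(pass_mark, questions + 1)
    )


if __name__ == "__main__":
    p = chance_of_passing(questions=10, options=4, pass_mark=5)
    print(f"Chance of passing by guessing: {p:.3f}")  # about 0.078
```

So under these assumptions a guesser passes roughly 8 times in every 100 attempts. A single pass on this quiz is therefore weak evidence that the student knew the material, which is exactly the kind of judgement a significance test formalises.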
Interpreting your results: correlation does not imply causation
Correlation by xkcd
When you perform an experiment, you are hoping that the outcome will lend some evidence to either your hypothesis or your alternative hypothesis. Going back to the example above, the hypothesis "neural network algorithms run faster in C than C++" has an alternative hypothesis "neural network algorithms run no faster in C than in C++". If we run an experiment to test this, and assume it's a fair experiment, and the results are that all our algorithms run faster in C, what has this told us? A naive answer would be that the experiments have confirmed the hypothesis that C is the faster language for this sort of algorithm. A more subtle answer would be that efficient neural networks are correlated with neural networks written in C. That means that when the algorithm is written in C it's likely to run quickly, which is what the experiment reported. This does not necessarily mean that the algorithms implemented in C ran quickly because they were written in C; it may be that there was some other factor involved that the experiment didn't effectively control.
In experimental work it is very important to understand this subtle distinction, otherwise you can easily fool yourself into believing that your experiments have discovered something far more conclusive than is actually possible.
To give you a better idea of how this distinction between correlation and causation works, below are some examples of incorrect conclusions drawn from perfectly reasonable correlations. See if you can work out why the conclusions are unreasonable:
- Children with bigger feet have higher reading ages. Therefore, people with bigger feet are more intelligent.
- Teenagers who text late at night have poor motivation in class (see news reports here). Therefore, using mobile phones leads to poor performance in class (see also a more skeptical analysis here).
- In the last 150 years there has been a dramatic increase in the number of people who report being abducted by aliens. There has also been a trend towards global warming. Therefore, alien abductions cause global warming.
In your own work, just be honest and straightforward about your results. If they aren't conclusive then say so and demonstrate your understanding by describing what future work could be done to gather more data.
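One way to see how a correlation can appear without any causation is to simulate it. The sketch below is an invented model of the foot size example, not real data: each child's age drives both their foot length and their reading age, and foot length has no direct effect on reading at all. All of the numbers in the model are assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_children(count=2000, seed=7):
    """Return (age, foot_cm, reading_age) tuples from an invented model
    in which age is the only thing that affects the other two values."""
    rng = random.Random(seed)
    children = []
    for _ in range(count):
        age = rng.randint(5, 11)
        foot_cm = 12 + age + rng.gauss(0, 1)  # feet grow with age
        reading_age = age + rng.gauss(0, 1)   # reading improves with age
        children.append((age, foot_cm, reading_age))
    return children


if __name__ == "__main__":
    children = simulate_children()
    overall = pearson([c[1] for c in children], [c[2] for c in children])
    eights = [c for c in children if c[0] == 8]
    within = pearson([c[1] for c in eights], [c[2] for c in eights])
    print(f"All children:     r = {overall:.2f}")
    print(f"Eight-year-olds:  r = {within:.2f}")
```

Across all the children, foot length and reading age are strongly correlated. Among children of the same age the correlation all but disappears, because the hidden third factor (age) has been held constant.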
Some basic dos and don'ts
This is some more specific advice, based on good and bad practice we have seen from students over the years:
- DO be clear and honest about what results your evaluation has obtained.
- DON'T claim to have "proven" anything if you haven't written a formal, mathematical proof.
- DO use an appropriate experiment for your hypothesis. For example, if your work is about evaluating the performance or security of a technique, there is no need to involve real users in your evaluation. If your hypothesis is about usability you really must involve real users.
- DON'T use questionnaires unless you can guarantee a large sample of answers (always well above thirty) and you understand the statistics needed to analyse the results. If you are in any doubt at all about this then seek the advice of a qualified statistician before you start your project. If you can't do that, think about using an alternative evaluation method such as semi-structured interviews.
Appendix
Example proof: The square root of 2 cannot be written as a fraction of whole numbers
Theorem
The square root of 2 cannot be written as a fraction of two whole numbers. (This is sometimes called the Theorem of Theaetetus)
Proof (by contradiction)
Imagine we could write the square root of 2 as a fraction of two whole numbers, say x/y where x and y are integers.
Let's say that x and y don't have any factors in common, so x/y is already written in its simplest form and no numbers can be "cancelled out" of the fraction.
So, we can also say that (x/y)*(x/y)=2
Therefore (x*x)/(y*y)=2
Therefore (x*x)=2*(y*y)
So we now know that x*x is even, since x*x is 2 times another number.
Since x*x is even, we also know that x is even (by the "Lemma" or little theorem that squares of odd numbers are never even: if x were odd we could write x=2*k+1, and then x*x=4*k*k+4*k+1, which is odd).
Therefore, there must be a number, which we'll call z such that x=2*z
So, (2*z)*(2*z)=2*(y*y)
Or, more simply, 2*z*z=y*y
So y*y is 2 times another number, which means y*y is even, and y must also be even, by the same argument that we used to say that x is even.
If y is also even, there must be some number, which we'll call w such that y=2*w
But if x/y=(2*z)/(2*w) then x and y have the factor 2 in common, so the fraction x/y was not in its simplest form, as we had assumed it was.
This contradicts our initial assumptions, which must have been wrong.
So, the square root of 2 cannot be written as a fraction of whole numbers.
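It is worth contrasting the proof above with what an experiment could tell us. The Python sketch below searches for whole numbers x and y with x*x equal to 2*y*y, for every y up to a limit. The function name and the limit are assumptions made for this example.

```python
from math import isqrt


def fractions_squaring_to_two(max_denominator):
    """Return every pair (x, y) with 1 <= y <= max_denominator
    and x*x == 2*y*y, i.e. every fraction x/y whose square is 2."""
    found = []
    for y in range(1, max_denominator + 1):
        # The only candidate for x is the integer square root of 2*y*y.
        x = isqrt(2 * y * y)
        if x * x == 2 * y * y:
            found.append((x, y))
    return found


if __name__ == "__main__":
    limit = 100_000
    print(f"Fractions found with denominator up to {limit}:",
          fractions_squaring_to_two(limit))
```

The search comes back empty, but that is only evidence: it says nothing about denominators larger than the limit. The proof by contradiction covers every possible x and y at once, which is exactly the difference between testing a hypothesis and proving a theorem.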
