How to evaluate your project

Image by mpeterke on Flickr

This post is all about how to evaluate your project. This is something you will ideally do about halfway through your work, but more realistically towards the end, just before you write up. However, it's important that you plan your evaluation early on, otherwise you run the risk of getting almost to the end of your work and finding that your evaluation process isn't going to give you any good results. So, the main thing you need to understand when you are planning is how evaluation works in the sciences and how to apply that understanding to your own projects.

Proof

Sometimes we see student dissertations which make claims like "this experiment has proved that ...". Almost always the student has used the word "proof" incorrectly, and will lose marks because they have given the impression that they have not understood the contribution that their work has made to the field they are working in. Very often students in this situation simply haven't understood how scientists use the word "proof".

Proof, in a scientific context, is a mathematical argument that is used to convince other mathematicians or scientists that a theorem (or a mathematical idea) is true. Proofs must never involve evidence or experiments, only arguments. There's an example proof of a simple theorem at the end of this post. 

Once mathematicians are convinced that a proof is correct (and sometimes that is difficult in itself, if the proof is several hundred pages long) then it is irrefutable. This is very different to the sort of science that is advanced by experiments, where another scientist can find new data or evidence that shows that an old idea was wrong. 

So, we generally say that a theorem can be proved correct whereas a hypothesis (or guess!) can only be tested via experiments. A hypothesis might turn out to be wrong if experimental data cannot be found to support the hypothesis, or contradictory evidence is found. If a lot of evidence is found to support a hypothesis we might call it a theory. Even so, a theory cannot be proved correct in all cases. For example, if you came up with a theory that said all atoms have a particular shape, you might invent a special microscope to look at atoms and find out if they have your shape. This would provide some evidence to support your theory. You couldn't, however, test every single atom in the Universe, so your hypothesis might well become a theory, but it can never be "proved" correct. 
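To see why a pile of supporting evidence is not a proof, here is a small sketch in Python. It is not part of the original argument: it tests the well-known observation (usually credited to Euler) that n*n + n + 41 is prime for small whole numbers n, and the function names are invented for this illustration.

```python
def is_prime(n: int) -> bool:
    """Trial division; slow but fine for the small numbers used here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def first_counterexample(limit: int):
    """Return the first n in [0, limit) for which n*n + n + 41 is NOT prime.

    Returns None if every value tested supports the hypothesis.
    """
    for n in range(limit):
        if not is_prime(n * n + n + 41):
            return n
    return None


if __name__ == "__main__":
    # Hypothesis: "n*n + n + 41 is prime for every whole number n".
    print(first_counterexample(40))   # forty tests, all of them supportive
    print(first_counterexample(100))  # a wider experiment refutes the hypothesis
```

The first forty tests (n = 0 to 39) all agree with the hypothesis, yet the next one refutes it, because 40*40 + 40 + 41 = 1681 = 41*41. However many cases an experiment checks, the next case might still be the counterexample.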

[Aside: there's a long literature in this sort of philosophy of science. If you are really interested, read Karl Popper on falsification, AJ Ayer on verification, Paul Feyerabend on scientific revolutions and Imre Lakatos on proof.]

Scientific method

Scientific method is the way that scientists decide whether a particular hypothesis (or guess) is likely to be a good model for the way the world works. If most scientists accept that the hypothesis is likely to be true, then we call it a theory. Of course, even theories have limitations, and it may be that as more experiments are carried out we find that a different theory fits the evidence better, or that the theory only works in certain circumstances. This is exactly what happened in physics to Newton's laws of motion. It turns out that Newton's laws describe the world pretty well in most cases; they can certainly tell you when your train is likely to arrive at its destination. In other circumstances, for example when you are travelling very fast, close to the speed of light, or dealing with very small particles like quarks, other theories (like Einstein's theories or quantum mechanics) better fit the data we have gathered. Of course, much of this work in areas like physics is driven by what we can measure and observe. Better telescopes mean better theories of cosmology, and so on.

In computer science we also have hypotheses that we can test. For example "functional programming languages can run just as efficiently as imperative languages", "online learning increases student engagement", "objects and inheritance improve code reuse in software companies", and so on.

To be a true hypothesis, and not just the opinion of the author, a statement must be refutable, that is, it must be possible for experiments to determine that the hypothesis is incorrect. The opposite statement to a hypothesis is called an alternative hypothesis. (Statisticians usually call a "no effect" statement of this kind the null hypothesis.) Examples for the hypotheses listed above would be "functional languages are necessarily slower (or faster!) than imperative ones", "online learning has no effect on student engagement" and "objects and inheritance have no effect on code reuse in software companies".

So, to evaluate your own research questions, you need to do the following:

  1. Devise a hypothesis.
  2. Form your alternative hypothesis.
  3. Plan an experiment that tests whether the hypothesis or the alternative hypothesis is true. 
  4. Conduct your experiment.
  5. Analyse the results of your experiments.
  6. If the results are conclusive, STOP. Else, re-run the experiments, or devise a better experiment and repeat.

In a student project, you may not have time to repeat your experiments, especially if they involve people, but you should design your evaluation in such a way that this would be possible, were you to continue the work.

About experiments

A good experiment should test one variable and one variable only. So, if your hypothesis is "neural network algorithms run faster in C than in C++" then you will probably want to implement some neural network algorithms in both languages. You should make sure that the programs are as similar as possible, except for the language you are using. If you implement slightly different algorithms, it may be the algorithm and not the language which is causing any change in performance you observe. In this case, the programming language is the independent variable (the one thing you deliberately change), the algorithms are the controlled variable (kept the same throughout) and the speed is the dependent variable (the thing being measured). 
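The shape of such an experiment can be sketched in a single language. The Python below is only an illustration under stated assumptions: the two dot-product functions stand in for "the same algorithm in two languages", and every name in it (dot_loop, dot_builtin, run_experiment) is invented for this example. The variant being timed is the independent variable, the input data and repeat counts are the controlled variables, and the measured time is the dependent variable.

```python
import random
import statistics
import timeit


def dot_loop(xs, ys):
    """Variant A: explicit accumulation loop."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


def dot_builtin(xs, ys):
    """Variant B: the same calculation using the built-in sum()."""
    return sum(x * y for x, y in zip(xs, ys))


def run_experiment(variants, size=10_000, repeats=5, number=20, seed=0):
    """Time every variant on identical inputs.

    Returns {variant name: [seconds per call, one entry per repeat]}.
    """
    # Controlled variables: one fixed seed, so every variant sees the same data,
    # and the same repeat/number settings for every variant.
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(size)]
    ys = [rng.random() for _ in range(size)]

    results = {}
    for name, fn in variants.items():  # independent variable: which variant runs
        timings = timeit.repeat(lambda: fn(xs, ys), repeat=repeats, number=number)
        results[name] = [t / number for t in timings]  # dependent variable: time
    return results


if __name__ == "__main__":
    results = run_experiment({"loop": dot_loop, "builtin": dot_builtin})
    for name, timings in results.items():
        print(f"{name:8s} median {statistics.median(timings):.6f} s per call")
```

Reporting the median of several repeats, rather than a single run, reduces the influence of whatever else the machine happened to be doing at the time, which is itself an uncontrolled variable.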

Bad science? The case of Usability tests

Many students undertaking projects who have developed a web application, web site or content management system (often in response to a client brief) ask about the most suitable evaluation method for their project.  In these cases precisely what to evaluate is often less clear cut than an experiment with an independent variable and a controlled variable (as in the "neural network algorithms run faster in C than C++" example above).

In order to settle on an evaluation method for such cases it is often necessary to return to the client and question them about their goals for the application you have built for them.  The first question you need to ask is “Can my application be evaluated by automatic means?” For applications that are evaluated for technical performance (network performance, speed of execution, resource efficiency, memory performance, robustness against attack, etc.) the answer is usually 'yes it can', and your evaluation may consist of running a number of automated performance tests and collating the results into comparison tables. 

However, in cases where the client might be interested in how humans (users) interact with the application and 'perform better' because of it, the evaluation solution is usually some form of 'user based test'.  Your client may be interested in the performance of the application you have built in supporting users to achieve their goals.  There are a number of parameters that could be tested:

    Effectiveness:  Can users actually perform and complete the (desired) specified task?

    Efficiency: Can users do it quickly, without getting bored or frustrated?

    Satisfaction: Is it fun, or at least pleasant to use?

    Learnability:  Can users 'pick up' the application without reaching for a manual or asking for help? Does it support learning? (see ISO 9241, part 11) 

Different applications will have different emphases in terms of what you need to evaluate.  Games, for example, need to be satisfying or challenging most of all.  Terminal applications for call centres need to be efficient most of all.  Most kinds of 'stand-alone' technology (car park ticket machines, vending machines, ATMs) need to be learnable most of all. All applications need to be effective. 
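As a rough sketch of how these four parameters might be turned into numbers, the Python below summarises a set of test sessions. Everything in it is an assumption made for illustration: the Session fields, the 1 to 5 satisfaction scale, the choice of "no help requests" as a stand-in for learnability, and the four made-up sessions. Your own metrics should come from your client's goals.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class Session:
    """One participant attempting one task (hypothetical record layout)."""
    participant: str
    completed: bool      # effectiveness: did they finish the task?
    seconds: float       # efficiency: how long did the attempt take?
    help_requests: int   # learnability: did they need help or a manual?
    satisfaction: int    # satisfaction: questionnaire score, 1 (low) to 5 (high)


def summarise(sessions):
    """Return one summary figure per usability parameter."""
    done = [s for s in sessions if s.completed]
    times = [s.seconds for s in done]
    return {
        "effectiveness": len(done) / len(sessions),
        "efficiency_median_seconds": median(times) if times else None,
        "satisfaction_mean": mean([s.satisfaction for s in sessions]),
        "learnability_unaided": sum(s.help_requests == 0 for s in sessions)
        / len(sessions),
    }


if __name__ == "__main__":
    # Made-up sessions, purely to show the calculation.
    sessions = [
        Session("P1", True, 42.0, 0, 4),
        Session("P2", True, 65.0, 1, 3),
        Session("P3", False, 120.0, 2, 2),
        Session("P4", True, 50.0, 0, 5),
    ]
    for metric, value in summarise(sessions).items():
        print(f"{metric}: {value}")
```

Deciding on the metrics and their target values before you run the test gives you a benchmark against which success can be judged.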

At this point it is important to stress that working with real users introduces a large number of uncontrolled variables that cannot easily be 'designed out' of any user test you may undertake. The fundamental question about the 'scientific validity' of usability tests (i.e. whether you are 'really' evaluating user performance rather than the application's performance in the test) is very difficult (but not impossible) to answer. 

There are two specific limitations with usability test data.

The reliability limitation: Is the data reliable?

    No, if you are testing users who are not typical of the true intended user group, or if there are significant individual variations within the test group (this is made worse by small sample sizes*)

The validity limitation:  Are the conditions under which the data was recorded reproducible?

    No, if the test is run differently each time (users briefed differently, different equipment used, not quite the same test questions asked)

As Jeffrey Rubin points out, [usability] “testing is always artificial” (Rubin, 1994, p.27).   [Aside:  There is a fascinating and long running debate on the scientific status of usability between Jakob Nielsen and Rolf Molich, particularly in relation to the small sample sizes championed by Nielsen.]

There is some good news about usability testing, however.  Often it is entirely unnecessary to worry about the number of uncontrolled variables in a usability test because you are looking for indicators rather than proofs.  In fact, if you think of a usability test as a 'design tool' rather than an 'experimental tool' you are closer to the way tests are used in the commercial world.  This does not absolve you of the responsibility for designing and recording a test in which you have addressed the reliability and validity limitations, or for setting appropriate benchmark metrics for judging the success (or not) of the application in supporting effectiveness, efficiency, satisfaction or learnability. It does mean, however, that usability tests can, if designed properly, be a very good evaluation method for your project.

Mind your language: 'user friendly' and 'significant'.

There are two phrases you need to be very careful about using in your test reports.

    User-friendly.  Never use this phrase! Software cannot be friendly, it is not (yet) sentient. This is called ANTHROPOMORPHISM and should be avoided. To claim that an interface is 'user friendly' is also subjective and not testable. To claim, however, that an interface is USABLE is more sustainable as long as we measure against some predetermined metrics. 

    Significant. In normal English, "significant" means 'important', while in statistics "significant" means unlikely to be due to chance. For example, if I answer all questions in a multiple-choice quiz randomly, sometimes I will still pass the test. However, the number of times that I pass, as a percentage of the total number of attempts, should show that passing the test a few times just happened "by chance" and was therefore "not significant" in statistical terms. A research finding may be true without being important. When statisticians say a result is "highly significant" they mean it is very unlikely that the result happened by chance. They do not (necessarily) mean it is highly important. Be careful when you claim to have found 'significant' results. 
     
    (http://www.surveysystem.com/signif.htm)
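The quiz example can be made concrete with a short calculation. This is a sketch only, and the quiz itself is an assumption: 20 questions, 4 options each, and a pass mark of 10 correct answers. The function name is invented for the example.

```python
from math import comb


def chance_of_passing(questions: int, options: int, pass_mark: int) -> float:
    """Probability of getting at least pass_mark answers right by pure guessing.

    Each guess is right with probability 1/options, so the number of correct
    answers follows a binomial distribution.
    """
    p = 1 / options
    return sum(
        comb(questions, k) * p**k * (1 - p) ** (questions - k)
        for k in range(pass_mark, questions + 1)
    )


if __name__ == "__main__":
    # Hypothetical quiz: 20 questions, 4 options each, 10 correct needed to pass.
    print(f"{chance_of_passing(20, 4, 10):.4f}")
```

For this assumed quiz the probability works out at a little over 1%. So a single pass is weak evidence either way, but a student who passes again and again is very unlikely to be guessing, and that is the sense in which a statistician would call the result significant.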

Interpreting your results: correlation does not imply causation

Correlation by xkcd

When you perform an experiment, you are hoping that the outcome will lend some evidence to either your hypothesis or your alternative hypothesis. Going back to the example above, the hypothesis "neural network algorithms run faster in C than in C++" has an alternative hypothesis "neural network algorithms run no faster in C than in C++". If we run an experiment to test this, and assume it's a fair experiment, and the results are that all our algorithms run faster in C, what has this told us? A naive answer would be that the experiments have confirmed the hypothesis that C is the faster language for this sort of algorithm. A more subtle answer would be that efficient neural networks are correlated with neural networks written in C. That means that when the algorithm is written in C it's likely to run quickly, which is what the experiment reported. This does not necessarily mean that the algorithms implemented in C ran quickly because they were written in C; it may be that there was some other factor involved that the experiment didn't effectively control.

In experimental work it is very important to understand this subtle distinction, otherwise you can easily fool yourself into believing that your experiments have discovered something far more conclusive than is actually possible. 

To give you a better idea of how this distinction between correlation and causation works, below are some examples of incorrect conclusions drawn from perfectly reasonable correlations. See if you can work out why the conclusions are unreasonable:

  • Children with bigger feet have higher reading ages. Therefore, people with bigger feet are more intelligent.
  • Teenagers who text late at night have poor motivation in class, according to news reports. Therefore, using mobile phones leads to poor performance in class.
  • In the last 150 years there has been a dramatic increase in the number of people who report being abducted by aliens. There has also been a trend towards global warming. Therefore, alien abductions cause global warming.
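If you want to check your answer to the first example, the simulation below sketches one way a strong correlation can appear with no causal link at all. It is an illustration built on invented numbers, not real data: each simulated child's age drives both foot length and reading age, and neither of those two influences the other. The function names and the formulae relating age to foot length are assumptions.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate_children(n=2000, seed=1):
    """Return (age, foot_cm, reading_age) rows for n simulated children."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.randint(5, 11)               # the hidden common cause
        foot_cm = 12 + age + rng.gauss(0, 1)    # depends on age only
        reading_age = age + rng.gauss(0, 1)     # depends on age only
        rows.append((age, foot_cm, reading_age))
    return rows


def correlations(rows, age=8):
    """Correlation of foot size with reading age: overall, and at one fixed age."""
    overall = pearson([r[1] for r in rows], [r[2] for r in rows])
    same_age = [r for r in rows if r[0] == age]
    within = pearson([r[1] for r in same_age], [r[2] for r in same_age])
    return overall, within


if __name__ == "__main__":
    overall, within = correlations(simulate_children())
    print(f"all children:      r = {overall:.2f}")
    print(f"eight-year-olds:   r = {within:.2f}")
```

Across all the simulated children the correlation is strong, but once age is held fixed it all but disappears. Age is the factor the naive conclusion failed to control.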

In your own work, just be honest and straightforward about your results. If they aren't conclusive then say so and demonstrate your understanding by describing what future work could be done to gather more data. 

Some basic dos and don'ts

This is some more specific advice, based on good and bad practice we have seen from students over the years:

  • DO be clear and honest about what results your evaluation has obtained.
  • DON'T claim to have "proven" anything if you haven't written a formal, mathematical proof.
  • DO use an appropriate experiment for your hypothesis. For example, if your work is about evaluating the performance or security of a technique, there is no need to involve real users in your evaluation. If your hypothesis is about usability you really must involve real users.
  • DON'T use questionnaires unless you can guarantee to get a large sample size of answers (always well above thirty) and you understand the statistics needed to analyse the results. If you are in any doubt at all about this then seek the advice of a qualified statistician before you start your project. If you can't do that, think about using an alternative evaluation method such as semi-structured interviews.
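To give a feel for why small questionnaire samples are a problem, here is a minimal sketch of one standard calculation: the approximate 95% margin of error for a yes/no question, using the normal approximation to the binomial distribution. The function name and the sample sizes are chosen only for illustration, and this is no substitute for a statistician's advice.

```python
from math import sqrt


def margin_of_error(n: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """Approximate margin of error for a proportion estimated from n answers.

    z = 1.96 corresponds to a 95% confidence level. proportion = 0.5 is the
    worst case, giving the widest margin.
    """
    return z * sqrt(proportion * (1 - proportion) / n)


if __name__ == "__main__":
    for n in (10, 30, 100, 400):
        print(f"n = {n:4d}: plus or minus {margin_of_error(n) * 100:.1f} percentage points")
```

With only 30 answers the margin is roughly 18 percentage points either way, so a result of "60% of respondents preferred the new design" is compatible with most users actually preferring the old one.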

Appendix

Example proof: The square root of 2 cannot be written as a fraction of whole numbers

Theorem

The square root of 2 cannot be written as a fraction of two whole numbers. (This is sometimes called the Theorem of Theaetetus)

Proof (by contradiction)

Imagine we could write the square root of 2 as a fraction of two whole numbers, say x/y where x and y are integers.

Let's say that x and y don't have any factors in common, so x/y is already written in its simplest form and no numbers can be "cancelled out" of the fraction.

So, we can also say that (x/y)*(x/y)=2

Therefore (x*x)/(y*y)=2

Therefore (x*x)=2*(y*y) 

So we now know that x*x is even, since x*x is 2 times another number.

Since x*x is even, we also know that x is even (by the "Lemma" or little theorem that squares of odd numbers are never even).

Therefore, there must be a number, which we'll call z such that x=2*z

So, (2*z)*(2*z)=2*(y*y)

Or, more simply, 2*z*z=y*y

y must also be even, by the same argument that we used to say that x is even.

If y is also even, there must be some number, which we'll call w such that y=2*w

But if x/y=2z/2w then the fraction x/y was not in its simplest form, as we assumed above.

This contradicts our initial assumptions, which must have been wrong.

So, the square root of 2 cannot be written as a fraction of whole numbers.
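The proof relies on the lemma that squares of odd numbers are never even. For completeness, that lemma has a one-line proof of its own, sketched here in LaTeX notation (the symbol k is introduced only for this derivation). Take any odd whole number x:

```latex
\begin{align*}
x   &= 2k + 1 \quad \text{for some whole number } k \\
x^2 &= (2k + 1)^2 = 4k^2 + 4k + 1 = 2(2k^2 + 2k) + 1
\end{align*}
```

So x*x is one more than an even number, which makes it odd. Turning this around, if x*x is even then x cannot be odd, so x must be even, which is exactly how the lemma is used in the proof.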

How to choose a good BSc or MSc project

Planning stuff...

A critical part of the success or failure of any thesis project is the initial choice of what to work on. This is a surprisingly difficult part of any project, in some ways the most difficult part, and it's something that we see students struggle with year on year. Nothing is as disappointing as marking a project and coming to the realisation that with some better decisions at the beginning of the year a failing project could have passed. This is a trap to avoid, and by avoiding it you will not only improve your chances of passing your project, you will greatly improve your chances of getting a first. In fact, projects are pretty straightforward to do well in, so long as you fully understand what is expected of you. This post takes you through what you need to focus on and avoid right at the start of your project journey.

 

Do something you are interested in

A final year or MSc project is a six-month, single-person project and in most Universities students will have to study several other modules concurrently. This is a long time to be working on a single piece of coursework, so it is important to choose a project which will hold your attention for that length of time. Moreover, you will be working on other things at the same time, so ideally you need to choose a project that is compelling enough that you want to work on it, in preference to doing other things. 

 

How to know what you are interested in

This might seem like a rather unnecessary topic -- what is "interesting" is very personal and individual. However, estimating what you might find interesting in several months' time, when you are under pressure to meet deadlines, is not easy. One trick to weight the odds in your favour is to choose a project which you do not, at the start of the project, entirely know how to complete. As the saying often attributed to Einstein goes: "If we knew what we were doing, it wouldn't be called research". Obviously, don't choose something that is completely outside your area of expertise. If you have spent two years studying bioinformatics then don't suddenly decide to try a dissertation in ceramics, but equally, if you know exactly how to complete every part of the work that you will need to do for your project then your idea is not "big" enough in scope. This point really is the key to finding a project and much of the rest of this post expands upon it: a thesis or dissertation is not simply a long piece of coursework, it is an individual, self-contained work which should stand on its own. Think of it as a sort of "first job". When you leave University and apply for further study or a job, then the results of your project will be part of the professional portfolio of work you can use to convince a future employer to take you on.

 

Project difficulty: a difficult project is an easy project

By far and away the biggest mistake that we regularly see from students writing project proposals is choosing a project which is far, far too easy. The train of thought seems to go ... projects are difficult, I want to make the project easier, therefore, I will choose a simple idea to work on. The classic examples of this in Computer Science are "a website with a database" -- usually for a family member or friend who runs a small business -- or occasionally a website or database on their own. What's wrong with this? Well... so many things:

  1. By the time a student has reached the final year of their degree, they will likely already have written several databases, websites and at least a couple of websites-with-a-database. Therefore, the project is something that the student has already been awarded credit for. This means that the student will not be demonstrating that they can learn independently, and go beyond what has been taught in lectures, which is one of the main purposes of the project.
  2. Because the proposal is about the same size and quality as an individual module coursework, it is not large enough in scope to gain many marks.
  3. An individual website, in ASP, PHP, or similar, for an SME is a very old problem for which there exist a large number of "turn-key" solutions -- that is, off the shelf products that can be used to create the product. These include content management systems such as Joomla, cloud-based solutions such as Google Sites, Posterous, Tumblr and so on, wikis, and a number of other technologies. A straightforward website-with-a-database is, therefore, in no way a demonstration of the student's ability to work at the cutting edge of their field.
  4. A website-with-a-database is not a problem, it's a solution to a problem. A project proposal should propose an interesting problem, with a suggested strategy for solving that problem during the progress of the project. 

Having said this, I have seen and indeed supervised a number of excellent projects, for which the student implemented some sort of website and some sort of database. So, it's not that websites or databases are inherently bad choices as solutions to the problems posed by a project proposal, but a proposal MUST overcome the four problems outlined above.

The heading for this section said (rather confusingly) that "a difficult project is an easy project". What I mean by this is that the "difficulty" of a project is something that will be uppermost in the mind of the staff marking your thesis. A "difficult" project is likely to be looked upon favourably because it will be a bigger step away from what you have already been taught, you will need to be reading more academic literature, you will be showing more independent learning, and so on. These are some of the most important factors in getting a good grade, and far outweigh factors such as finishing every part of your practical work. The upshot of this is that if you choose a "difficult" project and complete it quite poorly, you are likely to get better marks than a student who chooses an "easy" project and completes all of their practical work. If you did choose to work on a website-with-a-database-for-an-sme then the proposal will be so easy that you will really have to complete every part of the project perfectly just to get a pass. So, choose a small but difficult project.

 

Have a research question

In the last section I said that a project proposal should pose a problem, not a solution to a problem. Ideally, it is best to phrase this as a research question, such as the following:

  • is algorithm X more efficient than algorithm Y?
  • is it possible to implement product Z on the cloud?
  • can feature L be added to programming language P?
  • can theorem T be proven?
  • can algorithm Z be adapted to be used in conditions C?

and so on. There are several advantages to this. Firstly, this is a standard form of writing in academia, and your project will be marked against academic criteria. Secondly, if the aim of your project is to answer a question then you leave the issue of how to answer that question reasonably open ended. It may be that you have a very clear idea, at the start of the project, what you are going to do. That's fine, but as you progress through the project you may well find literature that changes your view of how your question can be answered. Thirdly, your answer to the question may not be what you expect. That's fine: it's OK to find out that actually, your algorithm isn't as efficient as you thought, or the theorem cannot be proved, so long as you give solid, convincing evidence for your answer. 

 

Do something practical

If you are working in the sciences, it really is important that you do something practical as part of your work. For these purposes "practical" can mean experimental work or mathematical work -- it's OK to prove a theorem, for example, as the main part of the "practical" content of your work. What you should avoid, though, is vague, nebulous thought-pieces, which have no clear results and cannot be evaluated. Avoid anything with a title like "an investigation into X" or "a dissertation on Y". These sorts of writing are well accepted in the humanities, but for a scientific piece of work you need to propose a question and find some answer to it. Equally, a literature review is not really a project in itself; it needs a research question and an evaluation with it to form a complete project.

 

Focus on evaluation from the start

Evaluating your work will likely be the last practical work you complete before finishing your dissertation writing. However, you should know from the start of your project how you plan to do this. As with unit-testing, it is best to have designed your evaluation in as much detail as possible before you start your practical work. That way, you know that what you are aiming for is something that can be evaluated in the manner in which you have planned. Remember, the purpose here is to determine whether your project has answered your original research question.

In general, your evaluation will fall into one of the following categories:

  • Performance evaluation: testing the speed, memory footprint, scalability, load-balancing, or other aspect of the performance of a program or system. This is often the easiest form of evaluation -- it can be performed by a program and so automated, the results can be analysed and presented using statistics and you will not be reliant on users. Work in programming languages, networking, operating systems, databases, and hardware tends to suit this sort of evaluation well.
  • User-acceptance testing and usability: if your project involves creating a product for end-users to test, especially if you have an industrial client, then it is essential that you perform some sort of user acceptance testing. Good options for this are the talk-aloud protocol or semi-structured interviews. NEVER, EVER, EVER think that a "heuristic" evaluation is sufficient. Heuristic methods only catch basic errors, they tell you nothing about how your users will actually experience your product.
  • Formal or semi-formal methods: such as proving a theorem, using a model checker (such as SPIN), or using a formal method such as B or Z to show that your work is free of particular types of errors.

 

Take (academic) advantage of your supervisor

Every student will have at least one supervisor, who will usually be actively involved in research, consultancy or something similar. This sort of work can provide a wealth of good ideas for projects and has several advantages. Firstly, your supervisor will propose projects that have the right scope and difficulty for your degree course. Secondly, if your supervisor has an interest in what you are doing, they will have a vested interest in seeing you succeed and of course will have a lot of relevant expertise with which they can advise you. Lastly, it is likely that your work will be used by other members of a research group which will give you access to feedback on what you have done.

Be flexible (within reason)

Remember that a project is a marathon, not a sprint. It may well be that you get part way along the journey and find out that what you had first set out to do is actually impossible, or impossible within the scope of the project. Or it may be that you find some other way of answering your research question, or you uncover some literature which shows that the question can actually be answered very simply. In this case, you should speak with your supervisor and find a way to reword or even completely change your original research question. This is quite a reasonable thing to do and happens often in "real" research projects, so you should not be worried about it. Your final project does not have to match the original proposal exactly, but you should be able to explain why the changes you made were necessary.

 

Summary

  • DO choose a project that will hold your interest for the duration of the project.
  • DO NOT choose a project that is the same size or scope as a coursework, or something that is very similar to work you have been set in a module.
  • DO propose a "difficult" problem -- it is easier to pass a challenging project than an "easy" one!
  • DO propose a research question, and an idea for solving it.
  • DO propose a project with some sort of practical or mathematical component, DO NOT set out to write a commentary on a topic.
  • DO have a very clear plan for how you will evaluate your project. This should clearly state how you will determine whether or not you have answered your research question.
  • DO NOT evaluate an end-user product with only heuristic methods.
  • DO test end-user products with real users.
  • DO take advantage of the expertise of your project supervisor and their research interests.
  • DO be flexible, if you find that your original research question cannot be answered, or if you find that a more "interesting" research question emerges during your project.