This is a group project for class HCIN 600: Research Methods. We were asked to conduct an experimental study on a topic of our choice.
In this group project, I did the following:
Intelligent personal assistants such as Siri, Google Now and Cortana have gained increasing interest in recent years. People can use their voice to access some of the functionality on their mobile devices.
In this study, we wanted to find out which of Siri and Google Now performs tasks faster and with fewer errors.
We conducted an experiment in which participants were asked to perform tasks with Siri and Google Now. We used the time taken to complete all tasks and the number of failures along the way as two dependent variables to compare the two systems. Our results show that there is no significant difference between Siri and Google Now in terms of time, but Google Now is less error-prone than Siri.
Independent variables:
two different intelligent personal assistants: Siri by Apple, Google Now by Google.
Dependent variables:
time used to complete simple tasks using intelligent personal assistant
number of failures before completing the tasks
Hypothesis 1:
H0: There is no difference in terms of time used to complete tasks between Siri and Google Now
Ha: There is a difference in terms of time used to complete tasks between Siri and Google Now
Hypothesis 2:
H0: There is no difference in terms of the number of failures that occurred before tasks are completed between Siri and Google Now
Ha: There is a difference in terms of the number of failures that occurred before tasks are completed between Siri and Google Now
The experiment will use a within-subject design for the following reasons:
We have a small sample. A within-subject design allows us to apply both conditions (Siri and Google Now) to the same group of participants. If we only recruit people from this class, we will have only around 15 participants. With a between-subject design, each group would have only 7-8 participants.
We don’t need to manage variance between groups. Because people in this class vary widely in English proficiency, individual differences would be a big problem and would create a lot of noise in a between-subject design. The within-subject design avoids this problem.
Learning Effect:
One problem with a within-subject design is the learning effect. Since each participant will repeat the tasks on both Siri and Google Now, participants will become better at a task when they perform it the second time. To counterbalance this effect, we plan to alternate the order of the two conditions across participants. Half of the participants will use Siri first and then Google Now; the other half will use Google Now first and then Siri.
Fatigue:
Another problem is fatigue. As participants get tired, their performance is negatively affected. To reduce the effect of fatigue, we keep the experiment very short by limiting the number of tasks to four. Across the two conditions (Siri and Google Now), each participant will perform 8 tasks in total. In our pilot study, it took 3 to 5 minutes to complete all the tasks, so we expect fatigue to have little impact.
After obtaining informed consent from the participant, we will ask him/her to complete the following 4 tasks using both Siri and Google Now.
1. What’s the weather like today?
2. Set an alarm for 7:30 pm today.
3. Text Andy “How’s it going?”.
4. Schedule a meeting at 2 pm on Tuesday for Research Methods.
* Participants will receive no instructions during the tasks. They will be shown one task at a time so that they don’t see any upcoming tasks.
* We will provide an iPhone and an Android phone for participants to use.
* One experimenter will monitor the task, notify the participant when the task is complete, and count the participant’s failures. Another experimenter will measure the time with a smartphone. The number of failures is the number of times the participant starts the task over.
* A task is considered complete if the phone responds with exactly the desired outcome.
* We will test the devices each time before giving them to the participants.
* To counterbalance, the experiment will alternate between two orders: (Siri, Google Now) and (Google Now, Siri).
Subject 1: Siri, Google Now
Subject 2: Google Now, Siri
Subject 3: Siri, Google Now
Subject 4: Google Now, Siri
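The alternating assignment listed above can be sketched in a few lines of Python. This is only an illustration: the function name `assign_orders` and the choice to start with Siri for the first subject are assumptions, not part of the original protocol.

```python
def assign_orders(n_participants, conditions=("Siri", "Google Now")):
    """Alternate the order of the two conditions across participants.

    Hypothetical helper: odd-numbered subjects get the conditions in the
    given order, even-numbered subjects get them reversed.
    """
    first, second = conditions
    orders = []
    for index in range(n_participants):
        if index % 2 == 0:
            orders.append((first, second))
        else:
            orders.append((second, first))
    return orders


# Print the schedule in the same "Subject N: A, B" form used above.
for subject, order in enumerate(assign_orders(4), start=1):
    print(f"Subject {subject}: {', '.join(order)}")
```

With an even number of participants, each order is used equally often; with an odd number, one order is used once more than the other.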
We will examine the collected data against the three assumptions for parametric testing.
Independence: we treat our data as independent. First, data (time and number of failures) from one participant are not correlated with any other participant’s data. Second, each participant performs the same tasks on both Siri and Google Now; the learning effect introduces a correlation between the Siri data and the Google Now data, but the counterbalancing explained in the previous section is meant to offset it. Third, we have to assume that the sample is randomly selected, although it is not truly random because all participants are from this class.
Normality: when the sample is small (<30), the sample data must be approximately normally distributed in order to do parametric testing. Our sample will have fewer than 15 participants, so we will check whether the data satisfy this assumption. If the sample distribution is normal, we will do parametric testing; otherwise, we will do non-parametric testing.
Homogeneity of variance: because we use within-subject design, we do not need to consider variance between different groups. This assumption does not apply.
*note: the textbook also mentions that data must have an equal-interval scale. We measure time and number of failures and they both have an equal-interval scale.
Therefore, it boils down to the second assumption: normality. If the sample is normally distributed, we will use parametric tests. If the sample is not normally distributed, we will use non-parametric tests. We couldn’t find any material in either the textbook or the lectures on how to decide whether sample data is normally distributed. We found on the Internet that the Anderson-Darling test can be used to check whether a sample follows a given distribution.
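As a rough illustration of the normality check, here is a minimal Anderson-Darling sketch in plain Python. It is not the procedure we ran. The function name, the sample values, the small-sample adjustment (1 + 0.75/n + 2.25/n^2) and the 5% critical value of 0.752 (the commonly tabulated value when mean and variance are both estimated from the data) are assumptions to verify against a statistics reference. In a paired design the check is usually applied to the per-participant differences.

```python
import math
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(sample):
    """Return the small-sample-adjusted Anderson-Darling statistic.

    The normal distribution is fitted with the sample mean and the
    sample standard deviation (both estimated from the data).
    """
    n = len(sample)
    xs = sorted(sample)
    fitted = NormalDist(mean(xs), stdev(xs))
    cdf = [fitted.cdf(x) for x in xs]
    total = 0.0
    for i in range(n):
        # (2i + 1) with 0-based i equals (2k - 1) with 1-based k.
        total += (2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
    a_squared = -n - total / n
    return a_squared * (1 + 0.75 / n + 2.25 / n ** 2)


# Hypothetical per-participant time differences in seconds (not our data).
differences = [4.1, -2.3, 0.8, 6.5, -1.2, 3.3, 1.9, -0.4, 2.7, 5.0]
statistic = anderson_darling_normal(differences)

CRITICAL_5_PERCENT = 0.752  # assumed tabulated value, see note above
print("statistic:", round(statistic, 3))
print("reject normality at 5%:", statistic > CRITICAL_5_PERCENT)
```

A statistics package such as Minitab reports this statistic together with a p-value, which is easier to interpret than a single critical value.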
If we choose parametric tests, we will use paired t-test because we have one independent variable and the variable has two conditions/means and the experiment is within-subject. We will use the p-value to decide whether to reject the null hypothesis or not. Since it is a two-tailed test, if 2*p-value
If we choose non-parametric tests, we will use the Wilcoxon signed-rank test because we have one independent variable with two conditions and the experiment is within-subject. The Wilcoxon test is not covered in the textbook, so we will use the material at this link to do the test if there are no better resources. http://vassarstats.net/textbook/ch12a.html
The sample distribution is not normal. Therefore a non-parametric test (the Wilcoxon signed-rank test) was used to analyze the data.
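To make the mechanics of the signed-rank test concrete, here is a minimal sketch in plain Python. It uses the normal approximation with no continuity or tie correction, so it is an assumption-laden illustration and not the Minitab procedure. The function name and the sample times are hypothetical, and for a sample this small an exact table or a statistics package is preferable.

```python
import math
from statistics import NormalDist


def wilcoxon_signed_rank(first, second):
    """Return (W, z, two-tailed p) for paired samples.

    W is the sum of signed ranks of the non-zero differences; z uses the
    plain normal approximation (no continuity or tie correction).
    """
    diffs = [a - b for a, b in zip(first, second) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    w = sum(r if d > 0 else -r for r, d in zip(ranks, diffs))
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 6)
    z = w / sd
    p = 2 * NormalDist().cdf(-abs(z))
    return w, z, p


# Hypothetical completion times in seconds (not our data).
siri = [42.0, 55.5, 38.2, 61.0, 47.3, 50.1, 44.8, 58.9]
google_now = [40.1, 49.0, 39.5, 52.4, 45.0, 51.6, 40.2, 50.3]
w, z, p = wilcoxon_signed_rank(siri, google_now)
print(f"W = {w}, z = {z:.4f}, p = {p:.4f}")
```

Zero differences are dropped before ranking, which is the usual convention for this test.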
(performed on Minitab)
The first test indicated that there was no significant difference in the time needed to complete the tasks using Siri or Google Now (Z = -1.5799 and p = 0.1141) at significance level 0.05.
The second test indicated that there was a significant difference in the number of failures in the tasks completed using Siri or Google Now (Z = -2.5205 and p = 0.014) at significance level 0.05.
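The decision rule behind these two results can be sketched as follows: convert each Z value to a two-tailed p-value under the standard normal distribution and compare it with the significance level. The helper name `two_tailed_p` is hypothetical. The plain normal approximation used here need not match a package's reported p-value exactly, since software may apply tie or continuity corrections.

```python
from statistics import NormalDist

ALPHA = 0.05  # significance level used in the analysis


def two_tailed_p(z):
    """Two-tailed p-value for a standard normal test statistic."""
    return 2 * NormalDist().cdf(-abs(z))


# Z values as reported above.
for label, z in [("time", -1.5799), ("failures", -2.5205)]:
    p = two_tailed_p(z)
    decision = "reject H0" if p < ALPHA else "fail to reject H0"
    print(f"{label}: Z = {z}, p = {p:.4f}, {decision}")
```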