Siri vs. Google Now:

A Comparative Evaluation

(An Experimental Research Project)

This is a group project for class HCIN 600: Research Methods. We were asked to conduct an experimental study on a topic of our choice.

In this group project, I did the following:

  • Designed the experiment
  • Took part in conducting the experiment
  • Collected and analyzed data
  • Wrote the Abstract, Introduction, and Methods sections of the paper

About the Study

Intelligent personal assistants such as Siri, Google Now and Cortana have gained increasing interest in recent years. People can use their voice to access some of the functionality of their mobile devices.

In this study, we set out to determine which of Siri and Google Now performs tasks faster and with fewer errors.

We conducted an experiment in which participants were asked to perform tasks with Siri and Google Now. We used two dependent variables to compare the systems: the time taken to complete all tasks and the number of failures along the way. Our results show no significant difference between Siri and Google Now in terms of time, but Google Now is less error-prone than Siri.

Experimental Design

Research Hypotheses

Within Subject Design

The experiment will use a within-subject design for the following reasons:


Learning Effect:

One problem with a within-subject design is the learning effect. Since each participant will repeat the tasks on both Siri and Google Now, participants will perform better the second time. To offset this effect, we plan to counterbalance the order of the two conditions: half of the participants will use Siri first and then Google Now; the other half will use Google Now first and then Siri.

Fatigue:

Another problem is fatigue. As participants tire, their performance suffers. To reduce the effect of fatigue, we keep the experiment very short by limiting the number of tasks to four. Across the two conditions (Siri and Google Now), each participant will perform 8 tasks in total. In our pilot study, it took 3 to 5 minutes to complete all the tasks, so we expect fatigue to have little impact.


After obtaining informed consent from the participant, we will ask them to complete the following 4 tasks using both Siri and Google Now.

* Participants will receive no instructions during the tasks. They will be shown one task at a time so that they do not see any upcoming tasks.

* We will provide an iPhone and an Android phone for participants to use.

* One experimenter will monitor the task, notify the participant when the task is complete, and count the participant's failures. Another experimenter will measure the time with a smartphone. The number of failures is the number of times the participant starts the task over.

* A task is considered complete if the phone responds with exactly the desired outcome.

* We will test the devices each time before giving them to the participants.

* To counterbalance, the experiment will alternate between two orders: Google Now then Siri, and Siri then Google Now.
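The alternating assignment described above could be sketched in Python roughly as follows. This is only an illustration: the participant IDs and the function name assign_orders are hypothetical and are not part of the original study materials.

```python
def assign_orders(participant_ids):
    """Alternate the two condition orders across participants.

    Participants at even positions get one order and those at odd
    positions get the other, so each order is used about equally often.
    """
    orders = [("Siri", "Google Now"), ("Google Now", "Siri")]
    return {pid: orders[i % 2] for i, pid in enumerate(participant_ids)}


# Hypothetical participant IDs, in order of arrival.
schedule = assign_orders(["P1", "P2", "P3", "P4"])
for pid, order in schedule.items():
    print(pid, "->", " then ".join(order))
```

With an even number of participants, each order is used exactly the same number of times.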

Data Collection Sheet

Data Analysis

Plan for Data Analysis

We will examine the collected data against the three assumptions for parametric testing.

*note: the textbook also mentions that data must have an equal-interval scale. We measure time and number of failures and they both have an equal-interval scale.

Therefore, it boils down to the second assumption: normality. If the sample is normally distributed, we will use parametric tests. If it is not, we will use non-parametric tests. We could not find any material in the textbook or the lectures on how to decide whether sample data is normally distributed. We found on the Internet that the Anderson-Darling test can be used to check whether a sample follows a given distribution.
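As a rough illustration of the normality check mentioned above, here is a minimal Python sketch of the Anderson-Darling statistic for a normal distribution whose mean and standard deviation are estimated from the sample. It assumes the commonly tabulated small-sample adjustment and the 5% critical value of 0.752 for that case. The two samples are synthetic and are not the study's data, and the function names are our own.

```python
import math
from statistics import NormalDist, mean, stdev

# Commonly tabulated 5% critical value for the adjusted statistic when
# both mean and standard deviation are estimated from the sample.
CRITICAL_5_PERCENT = 0.752


def anderson_darling_normal(sample):
    """Return the small-sample-adjusted A^2 statistic for normality."""
    n = len(sample)
    xs = sorted(sample)
    fitted = NormalDist(mean(xs), stdev(xs))
    cdf = [fitted.cdf(x) for x in xs]
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - total / n
    return a2 * (1 + 0.75 / n + 2.25 / n ** 2)


def looks_normal(sample):
    """True if normality is NOT rejected at the 5% level."""
    return anderson_darling_normal(sample) < CRITICAL_5_PERCENT


# Synthetic samples (not the study's data): one built from normal
# quantiles, one strongly right-skewed.
normal_like = [NormalDist(60, 10).inv_cdf((k + 0.5) / 20) for k in range(20)]
skewed = [2 ** k for k in range(12)]

print("normal_like passes:", looks_normal(normal_like))
print("skewed passes:", looks_normal(skewed))
```

A statistic at or above the critical value suggests the sample departs from normality, which in our plan would point to a non-parametric test.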

If we choose parametric tests, we will use a paired t-test because we have one independent variable with two conditions (and therefore two means to compare) and the experiment is within-subject. We will use the p-value to decide whether to reject the null hypothesis. Since it is a two-tailed test, if 2*p-value
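As an illustration of the paired t-test named above, the following Python sketch computes the t statistic and degrees of freedom from per-participant differences. The completion times are made up for the example and are not the study's data. Turning t into a p-value needs the t distribution, which the Python standard library does not provide, so this sketch stops at the statistic, which can then be compared against a t table.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for paired samples."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical completion times in seconds, one pair per participant.
siri_times = [12.0, 15.0, 11.0, 14.0, 18.0]
google_times = [11.0, 13.0, 8.0, 10.0, 13.0]

t, df = paired_t_statistic(siri_times, google_times)
print(f"t = {t:.4f}, df = {df}")
```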

If we choose non-parametric tests, we will use the Wilcoxon signed-rank test because we have one independent variable with two conditions and the experiment is within-subject. The Wilcoxon test is not covered in the textbook, so we will use the materials from this link to do the test if there are no better resources.
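For readers unfamiliar with the Wilcoxon signed-rank test, the Python sketch below shows the basic computation using a normal approximation. It is a simplified illustration rather than the procedure Minitab uses: it drops zero differences, gives tied absolute differences their average rank, and applies no tie or continuity correction. The failure counts are made up, and the approximation is crude for a sample this small.

```python
import math
from statistics import NormalDist


def wilcoxon_signed_rank(first, second):
    """Return (W+, W-, z, two-tailed p) via the normal approximation."""
    diffs = [a - b for a, b in zip(first, second) if a != b]
    n = len(diffs)
    # Indices of the differences, ordered by absolute size.
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while (end + 1 < n
               and abs(diffs[ordered[end + 1]]) == abs(diffs[ordered[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[ordered[k]] = average_rank
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mu) / sigma
    p = 2 * NormalDist().cdf(-abs(z))
    return w_plus, w_minus, z, p


# Hypothetical failure counts, one pair per participant.
siri_failures = [3, 1, 5, 6, 7, 8]
google_failures = [2, 3, 2, 2, 2, 2]

w_plus, w_minus, z, p = wilcoxon_signed_rank(siri_failures, google_failures)
print(f"W+ = {w_plus}, W- = {w_minus}, z = {z:.4f}, p = {p:.4f}")
```

Because the smaller of the two rank sums is used, z is never positive in this sketch; the two-tailed p-value is compared against the chosen significance level.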

Data Collected


No. of Failures

Non-parametric Tests

The sample distribution is not normal. Therefore, a non-parametric test (the Wilcoxon signed-rank test) was used to analyze the data.

Results from Wilcoxon Signed-Rank Tests

(performed on Minitab)


No. of Failures

The first test indicated that there was no significant difference in the time needed to complete the tasks using Siri or Google Now (Z = -1.5799, p = 0.1141) at the 0.05 significance level.

The second test indicated that there was a significant difference in the number of failures when completing the tasks using Siri or Google Now (Z = -2.5205, p = 0.014) at the 0.05 significance level.


I wrote the Abstract, Introduction, and Methods sections of the paper.