## Wilson Study Comparison
The Wilson study is conducted every even year, but it also asks the participants about their plans for the next year. Based on this, we can create statistics for both odd and even years. Figure \@ref(fig:wilson-study-2018-2020) shows the framework usage for the years 2018 - 2020 [@wilson18; @wilson20].
```{r wilson-study-2018-2020, fig.cap='Framework users according to the Wilson Study', echo=FALSE, out.width = '85%', fig.align='center'}
include_svg("img/wilson_study_2018_2020.svg")
```
In addition to the percentage numbers, Figure \@ref(fig:wilson-study-2018-2020) also contains the 95% confidence interval for each measurement. Using the method of overlapping confidence intervals, we can see few significant changes between 2018 and 2020. **Only the increase in the number of RVM users is a statistically established trend.**
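The overlap check itself is simple to reproduce. Below is a minimal sketch (not the code behind the figure), assuming a normal-approximation 95% confidence interval for a survey proportion and made-up participant counts:

```python
import math

def proportion_ci(k, n, z=1.96):
    """95% normal-approximation confidence interval for a survey proportion."""
    p = k / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - margin), min(1.0, p + margin)

def overlap(ci_a, ci_b):
    """True if the two confidence intervals overlap."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Hypothetical counts: 120 of 800 users in 2018 versus 160 of 850 users in 2020.
ci_2018 = proportion_ci(120, 800)
ci_2020 = proportion_ci(160, 850)
print(ci_2018, ci_2020)
print("no significant change" if overlap(ci_2018, ci_2020) else "significant change")
```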
The Wilson and GitHub studies focus on different sets of frameworks, which limits what can be compared. UVM, OSVVM, and UVVM are comparable, but can the Python-based solutions measured by the Wilson study be compared to the cocotb and VUnit measurements in the GitHub study? Unfortunately, they cannot, and there are a number of reasons for that:
1. The Wilson study asked the participants if they are using a "*Python-Based Methodology (e.g., cocotb, etc.)*". The problem with that question is that we cannot know whether that category contains only cocotb users, only non-cocotb users, or some mix of the two.
2. cocotb is Python-based, while VUnit is not. VUnit is based on VHDL and SystemVerilog testbenches but uses Python for automating the HDL-external tasks of a testing flow. VUnit is Python-aided rather than Python-based, and it's unclear how VUnit users responded to this question.
3. Participants using frameworks not among the predefined choices tend not to tick the "*other*" option. This was noted in the 2016 Wilson study [@wilson16] and can also be seen in Figure \@ref(fig:wilson-study-2018-2020). The "*other*" category contains 5-6% of the participants in the data from 2018 and 2019 [@wilson18]. This is where we would expect to find the users of Python-based methodologies. When Python became a separate category in 2020, it got 14% of the users [@wilson20]. We would expect such a significant portion to show up as a drop in the "*other*" category, but instead we see an increase to about 9%.
4. The Wilson study also asks what verification and testbench languages are being used. Among the ASIC projects, 27% answered that they use Python for verification and testbenches, but only 11% use a Python-based methodology. This shows a difference between Python-based and Python-aided, but it's unclear how many participants make that distinction.
Going forward, we will compare the two studies with respect to UVM, OSVVM, and UVVM.
Before we can compare the studies, we also need to compensate for the differences in what's being measured. The Wilson study measures the framework usage in FPGA and ASIC projects. When combined, they yield the statistics in Figure \@ref(fig:wilson-study-2018-2020). The GitHub study, on the other hand, only measures framework usage for VHDL designs. The VHDL statistics are not provided directly by the Wilson study, but are estimated using the following approach (a sketch of the calculation follows the list).
* OSVVM and UVVM target the VHDL community, while the other frameworks target both VHDL and (System)Verilog. It is certainly possible to verify a (System)Verilog design with OSVVM and/or UVVM, but it is assumed to be rare enough to be ignored. With this assumption, all OSVVM and UVVM users in Figure \@ref(fig:wilson-study-2018-2020) are also verifying VHDL designs.
* The Wilson study provides the total number of study participants, as well as the number of participants verifying VHDL designs. Based on the assumption above, we can use this information to calculate how many VHDL users use frameworks other than OSVVM and UVVM. We also assume that the relative portions of the other frameworks are the same as in Figure \@ref(fig:wilson-study-2018-2020).
* The sum of the percentages in Figure \@ref(fig:wilson-study-2018-2020) is more than 100% because some users use more than one framework. We assume that this sum remains the same.
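The sketch below shows one way to read the description above, with made-up participant totals and framework shares standing in for the real study data; the actual calculation is performed by the script referenced in the How-To section at the end of this chapter.

```python
# Hypothetical totals and framework shares standing in for the real Wilson data.
total_participants = 1000          # all Wilson participants (FPGA + ASIC)
vhdl_participants = 450            # participants verifying VHDL designs

# Framework usage in the combined FPGA/ASIC data, as fractions of all participants.
combined = {"UVM": 0.48, "OSVVM": 0.17, "UVVM": 0.14, "Python-based": 0.14, "Other": 0.12}
vhdl_only = {"OSVVM", "UVVM"}      # assumed to be used exclusively on VHDL designs

# Absolute user counts in the combined data.
users = {fw: share * total_participants for fw, share in combined.items()}

# Assume the sum of the percentages (multi-framework usage) is the same among VHDL users.
vhdl_mentions = sum(combined.values()) * vhdl_participants

# All OSVVM/UVVM users verify VHDL designs; the remaining framework mentions are split
# among the other frameworks in the same relative proportions as in the combined data.
vhdl_users = {fw: users[fw] for fw in vhdl_only}
remaining = vhdl_mentions - sum(vhdl_users.values())
other_total = sum(users[fw] for fw in combined if fw not in vhdl_only)
vhdl_users.update({fw: remaining * users[fw] / other_total
                   for fw in combined if fw not in vhdl_only})

# Normalize so that the reported shares sum to 100%.
total = sum(vhdl_users.values())
print({fw: round(100 * n / total, 1) for fw, n in vhdl_users.items()})
```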
Using this approach, we get the result shown in Figure \@ref(fig:wilson-study-vhdl-2018-2020). The numbers have also been normalized to sum to 100%.
```{r wilson-study-vhdl-2018-2020, fig.cap='Framework users for VHDL designs according to the Wilson Study', echo=FALSE, out.width = '85%', fig.align='center'}
include_svg("img/wilson_study_vhdl_2018_2020.svg")
```
The results in Figure \@ref(fig:wilson-study-vhdl-2018-2020) can be compared with the GitHub study. The results for 2020 are shown in Figure \@ref(fig:github-wilson-comparison-2020).
```{r github-wilson-comparison-2020, fig.cap='Comparison between the Wilson and GitHub studies', echo=FALSE, out.width = '85%', fig.align='center'}
include_svg("img/github_wilson_comparison_2020.svg")
```
Since the confidence intervals for each group of framework users overlap between the studies, there is no statistically significant difference between the two. However, judging significance by looking at overlapping confidence intervals comes with a number of problems:
* The method is not exact. There can be significant differences despite a small overlap between the confidence intervals. A better approach is to calculate the confidence interval for the difference between the measurements; if zero is outside of that confidence interval, we can say that there is a significant difference (a sketch of this check follows the list).
* Even with a more exact confidence interval, we cannot look at the frameworks independently since they must sum to 100%.
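As a minimal sketch of the difference-based check mentioned in the first point, assuming a normal approximation for the difference between two independent proportions and made-up counts:

```python
import math

def diff_ci(k1, n1, k2, n2, z=1.96):
    """95% confidence interval for the difference between two independent proportions."""
    p1, p2 = k1 / n1, k2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - z * se, (p1 - p2) + z * se

# Hypothetical counts for one framework in the two studies.
low, high = diff_ci(120, 800, 25, 120)
print("significant difference" if not (low <= 0.0 <= high) else "no significant difference")
```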
For these reasons, we are going to derive a single metric for judging the similarity between the two studies. We assume that the two studies are based on a random set of users taken from the same population (the null hypothesis). With that assumption, we can use the multinomial distribution to calculate the probability of each of the two study results, and the combined probability by taking the product of the two. If the combined probability is among the 5% least probable of all possible outcomes of a pair of studies, we say that there is a significant difference and the assumption must be incorrect (the null hypothesis is rejected).
We don't know the true distribution of users in the population, but we can estimate it using the most probable distribution given the two study results we have. This is known as maximum likelihood estimation, and we find the optimum by combining the participants of the two studies. The result is shown in Figure \@ref(fig:github-wilson-combined-comparison-2020).
```{r github-wilson-combined-comparison-2020, fig.cap='Wilson and GitHub studies combined', echo=FALSE, out.width = '85%', fig.align='center'}
include_svg("img/github_wilson_combined_comparison_2020.svg")
```
The result of the combined study is closer to the Wilson study. That is expected, since the Wilson study has the larger sample size and large deviations from it are less likely.
With the distribution estimate, we can calculate the probability for all 1.4 billion possible pairs of study results given the sample sizes we have. The accumulated probability of the study outcomes less likely than the original study combination is 22.5%. In other words, about one in four study pairs will have a result less probable than what we see. For that reason we cannot reject the null hypothesis that the Wilson and GitHub studies are sampling from the same population.
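The similarity metric can also be approximated without the full enumeration. The sketch below uses Monte Carlo sampling instead of enumerating all possible pairs, and made-up UVM/OSVVM/UVVM counts instead of the real study data; the 22.5% above comes from the exact enumeration, which a sampled version only approximates.

```python
import numpy as np
from scipy.stats import multinomial

rng = np.random.default_rng(seed=0)

# Hypothetical UVM/OSVVM/UVVM user counts for the two studies.
github = np.array([30, 12, 10])
wilson = np.array([400, 150, 130])

# Maximum likelihood estimate of the common distribution: pool the participants.
p = (github + wilson) / (github.sum() + wilson.sum())

def joint_logpmf(a, b):
    """Log probability of a pair of study results under the pooled distribution."""
    return multinomial.logpmf(a, a.sum(), p) + multinomial.logpmf(b, b.sum(), p)

observed = joint_logpmf(github, wilson)

# Similarity: probability of a pair of studies at least as unlikely as the observed pair.
draws = 100_000
sim_github = rng.multinomial(github.sum(), p, size=draws)
sim_wilson = rng.multinomial(wilson.sum(), p, size=draws)
log_pairs = (multinomial.logpmf(sim_github, github.sum(), p)
             + multinomial.logpmf(sim_wilson, wilson.sum(), p))
similarity = np.mean(log_pairs <= observed)
print(f"similarity: {similarity:.1%}")
```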
Not being able to reject the null hypothesis doesn't mean that we have proven that the two populations are without differences. There may exist null hypotheses which assume a relative bias but still give a higher similarity. In the following sections we will examine some plausible biases and how they affect the similarity.
### Temporal Bias
The GitHub study presents data accumulated over time, while the Wilson study presents data from a specific point in time (mid 2020). This is a problem since we cannot determine if a GitHub user of a specific framework is still a user today. A user that made their last framework-related commit 5 years ago may still be using that framework outside of GitHub, while a user that made their last commit 5 days ago may have decided to stop using that framework. However, the more recent the last commit, the more likely it is that the user is still an active user.
Examining a limited and recent time period on GitHub gives more confidence that the data only contains active users. However, a shorter time span also decreases the number of study participants, which lowers the statistical power. Also, unless we can show that there is a significant difference between recent and past data, we can't confidently exclude the possibility that any difference we see is caused by chance alone.
Figure \@ref(fig:temporal-bias-analysis-mid-2018) shows a comparison of all, past, and recent GitHub users with the latest Wilson study. Past users are those who made their last commit before the previous Wilson study in mid 2018, and recent users are those who made their last commit after that point in time.
```{r temporal-bias-analysis-mid-2018, fig.cap='Temporal bias analysis in mid 2018', echo=FALSE, out.width = '100%', fig.align='center'}
include_svg("img/temporal_bias_analysis_mid_2018.svg")
```
Compared to the overall GitHub data, the more recent data has a higher portion of UVVM users, which is more consistent with the Wilson study, and the similarity has increased to 75%. At the same time, the similarity between the past and recent GitHub data is low, 3.2%. We can also see that the confidence intervals for the recent data are wider than for the overall GitHub data. As a measure of the overall uncertainty, we can sum all confidence intervals and express the sum as a percentage of the total number of users. The uncertainty for the overall GitHub data is 52%, while it's 72% for the recent data. More insight into these numbers can be gained by analyzing how they depend on the date separating past and recent GitHub data. This is shown in Figure \@ref(fig:temporal-bias-analysis-2020).
```{r temporal-bias-analysis-2020, fig.cap='Temporal bias analysis', echo=FALSE, out.width = '100%', fig.align='center'}
include_svg("img/temporal_bias_analysis_2020.svg")
```
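A minimal sketch of the past/recent split and the uncertainty measure described above, assuming a hypothetical list of users with their framework and last framework-related commit date (the real analysis may compute the confidence intervals differently):

```python
from datetime import date
import math

# Hypothetical user records: (framework, date of last framework-related commit).
users = [
    ("uvm", date(2017, 3, 1)), ("osvvm", date(2019, 6, 15)),
    ("uvvm", date(2020, 1, 10)), ("uvm", date(2018, 9, 2)),
    ("vunit", date(2016, 11, 5)), ("cocotb", date(2019, 2, 20)),
]

split = date(2018, 7, 1)  # separation between past and recent users (mid 2018)
recent = [fw for fw, last in users if last >= split]
past = [fw for fw, last in users if last < split]

def uncertainty(frameworks, z=1.96):
    """Sum of the 95% confidence interval widths, as a percentage of the number of users."""
    n = len(frameworks)
    widths = [2 * z * math.sqrt((frameworks.count(fw) / n) * (1 - frameworks.count(fw) / n) / n)
              for fw in set(frameworks)]
    return 100 * sum(widths)

print(f"recent: {uncertainty(recent):.0f}%, past: {uncertainty(past):.0f}%")
```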
In general, we can see that the recent data has a higher similarity to the Wilson study than the overall GitHub data. The similarity in the most recent data is lower than the peak, but that time span is also associated with an accelerating uncertainty as the number of included users goes down. When the separation between past and recent data is set to early 2018 or later, we also have a consistently low similarity between the two data sets, which indicates that the separation is meaningful. Setting the separation date at mid 2018 doesn't yield the highest Wilson similarity. Yet, it's in the higher region and has some margin to where the uncertainty starts to accelerate rapidly and where the similarity to the past data increases. **For these reasons we will exclude GitHub data before mid 2018 when comparing with the Wilson study.**
### Regional Bias
Another possible bias is how the study participants are distributed between the regions of the world. The Wilson study noted that:
> (...) the 2020 study demographics, as shown in fig. 2, saw an 11 percentage points decline in participation from North America but an increase in participation from Europe and India.
To conclude that there is a bias in the distribution between regions, we want to exclude the possibility that there is a bias within the regions themselves. The Wilson study doesn't provide detailed data in this area, but the presentation of the study included data for FPGA users in Europe as an example of regional differences. We can extract the VHDL-related data using the previously described method and then compare with the GitHub data for the European and African timezones. The result is shown in Figure \@ref(fig:github-wilson-europe).
```{r github-wilson-europe, fig.cap='Wilson and GitHub comparison for Europe', echo=FALSE, out.width = '100%', fig.align='center'}
include_svg("img/github_wilson_europe.svg")
```
The match is almost perfect, with a similarity of 99.9%, but this comparison is not ideal since the GitHub study cannot separate:
1. the European users from African users.
2. the FPGA users from the ASIC users.
With these problems in mind, and without Wilson measurements for the other regions, we need another method for evaluating regional bias. One such approach is to artificially bias the GitHub study and examine how that affects the similarity when comparing with the Wilson study. We do this by weighting users differently depending on their origin. For example, each user from Europe can be counted as 1.2 while users from other regions are counted with a value less than one, creating a bias towards Europe. Figure \@ref(fig:github-wilson-biased-comparison-2020) shows the result of this approach. Each point represents a study biased to have a specific portion of the participants from each of the regions. Note that the portion of users from Asia/Australia follows implicitly from the portions of users from the other regions, as given by the x and y axes.
```{r github-wilson-biased-comparison-2020, fig.cap='Wilson and GitHub studies with bias', echo=FALSE, out.width = '100%', fig.align='center'}
include_svg("img/region_bias_similarity.svg")
```
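The reweighting itself is straightforward. Below is a minimal sketch with hypothetical per-user region labels and region weights; the real analysis derives regions from user timezones and sweeps the weights to produce the figure.

```python
from collections import defaultdict

# Hypothetical user records: (framework, region derived from the user's timezone).
framework_users = [
    ("uvm", "america"), ("osvvm", "europe_africa"), ("uvvm", "europe_africa"),
    ("uvm", "asia_australia"), ("vunit", "america"), ("cocotb", "europe_africa"),
]

# Weights above 1 bias the study towards a region, weights below 1 bias it away from it.
weights = {"america": 0.8, "europe_africa": 1.2, "asia_australia": 1.0}

weighted_counts = defaultdict(float)
for framework, region in framework_users:
    weighted_counts[framework] += weights[region]

total = sum(weighted_counts.values())
biased_shares = {fw: 100 * count / total for fw, count in weighted_counts.items()}
print(biased_shares)  # framework shares (%) in the artificially biased study
```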
The blue point in Figure \@ref(fig:github-wilson-biased-comparison-2020) marks the unbiased GitHub study. We can improve the 75.3% similarity by moving in the direction of fewer American users, which is consistent with the decline noted in the Wilson study. The Wilson study also noted that the loss of American participants was mostly compensated by an increase of participants from Europe and India. However, while Indian participation increased, it also dropped in East Asia, leading to a relatively small overall increase of 2% in Asia. This is not consistent with the GitHub data in Figure \@ref(fig:github-wilson-biased-comparison-2020). In order for the similarity to increase, the drop in American participation must be compensated by an increase in Asian/Australian participation with European/African participation unchanged.
We can also compare the Wilson data from the 2018 study, before the decrease in American participation, with the recent GitHub data of that time. Are the studies more similar with an increased Wilson study participation from North America? Figure \@ref(fig:temporal-bias-analysis-2018) shows that this is not the case. The similarity in 2018 was actually lower, but the difference isn't large enough to indicate anything but normal fluctuations caused by the uncertainty at hand.
```{r temporal-bias-analysis-2018, fig.cap='Temporal bias analysis', echo=FALSE, out.width = '100%', fig.align='center'}
include_svg("img/temporal_bias_analysis_2018.svg")
```
**Our conclusion is that the decrease in North American participation in the Wilson study isn't significant to the comparison with the GitHub study.**
### Classification Bias
Figure \@ref(fig:github-academic-and-professional-comparison-2020) shows how the Wilson study compares with recent academic and professional users of the GitHub study.
```{r github-academic-and-professional-comparison-2020, fig.cap='Wilson and GitHub studies compared', echo=FALSE, out.width = '85%', fig.align='center'}
include_svg("img/github_academic_and_professional_comparison_2020.svg")
```
We see that the academic subset is very similar to the Wilson study, with a similarity of 99.9%, while the professional subset is very different, with a similarity of 1.6%. This raises a number of questions. What do we know about the user classification in the Wilson study, and why are the GitHub professional and academic subsets so different?
The high similarity between the Wilson study and the academic GitHub users is not in itself a surprise. We can expect academic users to have less experience in general, but that is more related to how tools are used than to what tools are used. The cooperation between EDA vendors and universities also brings professional influences when it comes to choosing the right tools for verification.
The participants in the Wilson study were selected to represent a broad set of design markets and regions. Given that information, and the fact that the major EDA companies have programs reaching tens of thousands of university students, we can expect a mix of academic and professional participants. There is no public information about that mix, but we find it unlikely that the difference from the professional GitHub subset can be explained by a very high percentage of academic users in the Wilson study. It is the deviation of the professional GitHub subset that requires extra analysis.
Looking at past similarity numbers, we see that they are highly volatile. The 1.6% in 2020 was 79% in 2019 and 26% in 2018. We can also get a higher-resolution view by studying the similarity between the academic and professional GitHub data. This is shown in Figure \@ref(fig:github-academic-professional-comparison). Note that even though the GitHub academic subset is almost identical to the Wilson study in 2020, the similarity between the GitHub academic and professional subsets is higher than the similarity between the Wilson study and the GitHub professional subset. The reason is that the academic subset has a smaller sample size than the Wilson study, and that makes larger differences more probable.
```{r github-academic-professional-comparison, fig.cap='GitHub academic and professional subset similarity', echo=FALSE, out.width = '85%', fig.align='center'}
include_svg("img/github_academic_professional_comparison.svg")
```
The absence of stability in the similarity suggests that the low similarity found in 2020 is largely caused by chance and not by any bias. For that reason we've kept the professional data in the comparison with the Wilson study.
### Combining the Studies
By removing bias to make the studies comparable, we are also in a position where the studies can be combined. A combined study broadens our knowledge when the studies measure different things, and it increases the statistical confidence for the things measured by several studies. The latter effect is due to the increased sample size.
In this case, the results of the Wilson study are broadened by adding data for VUnit and cocotb from the GitHub study. The statistical confidence increases for UVM, OSVVM, and UVVM, which are included in both studies.
Figure \@ref(fig:github-wilson-full-combined-comparison) shows the two studies combined. It was created by taking the relative sizes of the VUnit/cocotb group and the UVM/OSVVM/UVVM group from the GitHub study. The relative sizes of the frameworks within the latter group were calculated by combining the data from the two studies.
```{r github-wilson-full-combined-comparison, fig.cap='Full GitHub and Wilson study comparison', echo=FALSE, out.width = '85%', fig.align='center'}
include_svg("img/github_wilson_full_combined_comparison.svg")
```
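A minimal sketch of that combination step, with made-up user counts standing in for the real study data:

```python
# Hypothetical user counts; the GitHub counts cover all five frameworks, the Wilson
# counts (converted to VHDL user counts) cover only the shared UVM/OSVVM/UVVM group.
github = {"VUnit": 55, "cocotb": 35, "UVM": 25, "OSVVM": 15, "UVVM": 14}
wilson_vhdl = {"UVM": 300, "OSVVM": 160, "UVVM": 150}

github_only = {"VUnit", "cocotb"}
github_total = sum(github.values())

# The split between the VUnit/cocotb group and the UVM/OSVVM/UVVM group
# comes from the GitHub study alone.
github_only_share = sum(github[fw] for fw in github_only) / github_total
shared_share = 1.0 - github_only_share

# Within the shared group, the relative sizes combine the participants of both studies.
shared_counts = {fw: github[fw] + wilson_vhdl[fw] for fw in wilson_vhdl}
shared_total = sum(shared_counts.values())

combined = {fw: 100 * github_only_share * github[fw] / sum(github[f] for f in github_only)
            for fw in github_only}
combined.update({fw: 100 * shared_share * n / shared_total for fw, n in shared_counts.items()})
print(combined)  # combined framework shares, summing to 100%
```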
The data shows that the frequencies of OSVVM and UVVM users are indistinguishable from each other but significantly lower than those of cocotb and UVM users. The frequencies of cocotb and UVM users are in turn indistinguishable from each other but significantly lower than that of VUnit users.
### How-To
[`wilson.py`](https://(ref:repoTree)/py/wilson.py) provides all the results related to the comparison with the Wilson study. The script is called without any arguments.
``` console
python wilson.py
```