Pokémon Data Analysis

Overview

This project explores patterns in Pokémon types across the first six generations and examines potential statistical differences in base Speed stats between Flying and Ground types. The analysis was conducted using a csv containing information about each Pokémon’s name, type(s), base stats, and generation. The hypothesis testing was conducted using a SQlite database. The project is divided into three key components:

Most Common Pokémon Types Across Generations

Analysis to identify the most commonly occurring Pokémon types over all 6 generations. This reveals trends in game design and type distribution for the franchise.
Type Frequency by Generation

To further understand how type popularity evolves, the distribution types were analyzed for each generation and each type. Visualizations such as static and interactive bar plots were used to illustrate shifts and patterns.
Hypothesis Testing: Speed Comparison Between Flying and Ground Types

A statistical hypothesis test (Mann-Whitney U test) was conducted to assess whether Flying type Pokémon have significantly higher base Speed stats than Ground type Pokémon. The null hypothesis assumes no difference in the distributions of Speed stats between the Pokémon of the two groups, while the alternative hypothesis states that Flying types’ distribution rank significantly higher than Ground types when the two are put together. The results provide insights into how type might relate to base stat design decisions.

This project demonstrates the use of R, Plotly, and SQLite for data wrangling, visualization, and statistical analysis, with a focus on uncovering meaningful insights from data.

library(datasets)

Pokemon = read.csv(file = 'data/Pokemon.csv')
head(Pokemon)

##   X.                  Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1  1             Bulbasaur  Grass Poison   318 45     49      49      65
## 2  2               Ivysaur  Grass Poison   405 60     62      63      80
## 3  3              Venusaur  Grass Poison   525 80     82      83     100
## 4  3 VenusaurMega Venusaur  Grass Poison   625 80    100     123     122
## 5  4            Charmander   Fire          309 39     52      43      60
## 6  5            Charmeleon   Fire          405 58     64      58      80
##   Sp..Def Speed Generation Legendary
## 1      65    45          1     False
## 2      80    60          1     False
## 3     100    80          1     False
## 4     120    80          1     False
## 5      50    65          1     False
## 6      65    80          1     False

Exploratory Data Analysis

What is the most common type of Pokémon?

Need to address missing values in Type.2 since some Pokémon only have 1 type.
What the data shows for Pokemon that only have one type

# this is the Type.2 for charmander
# a Pokemon that has a singular "Fire" type

Pokemon$Type.2[5]

## [1] ""

The data displays a “” or empty for a Pokémon that has no Type.2. This will muddy the data when looking for the most common type.

Tables for Type.1 and Type.2 variables.

table(Pokemon$Type.1)

## 
##      Bug     Dark   Dragon Electric    Fairy Fighting     Fire   Flying 
##       69       31       32       44       17       27       52        4 
##    Ghost    Grass   Ground      Ice   Normal   Poison  Psychic     Rock 
##       32       70       32       24       98       28       57       44 
##    Steel    Water 
##       27      112

table(Pokemon$Type.2)

## 
##               Bug     Dark   Dragon Electric    Fairy Fighting     Fire 
##      386        3       20       18        6       23       26       12 
##   Flying    Ghost    Grass   Ground      Ice   Normal   Poison  Psychic 
##       97       14       25       35       14        4       34       33 
##     Rock    Steel    Water 
##       14       22       14

# how many types are in Type.1
length(table(Pokemon$Type.1))

## [1] 18

# how many types are in Type.2
length(table(Pokemon$Type.2))

## [1] 19

There is an extra type in table(Pokemon$Type.2) that is “” with the number 386 to represent all of the observations/Pokémon with “” for Type.2.

This is the combined table of Type.1 and Type.2, minus the empty “” type. However this does not represent every Pokemon. The sum of the table shows 1214 when we only have 800 observations/Pokémon.

This 1214 occurs because it is counting a Pokémon like Bulbasaur(grass/poison) twice, once as a grass type and again as a poison type, but for the purposes of finding the most common type thats okay.

combinedTypeTable = (table(Pokemon$Type.1) + table(Pokemon$Type.2)[-1])

combinedTypeTable

## 
##      Bug     Dark   Dragon Electric    Fairy Fighting     Fire   Flying 
##       72       51       50       50       40       53       64      101 
##    Ghost    Grass   Ground      Ice   Normal   Poison  Psychic     Rock 
##       46       95       67       38      102       62       90       58 
##    Steel    Water 
##       49      126

paste("The sum is: ", sum(combinedTypeTable))

## [1] "The sum is:  1214"

The max value in the table and the name associated with it.

# finds the max(table), and then finds the index of the max value, then returns both the number and its name, in our case the type

combinedTypeTable[which(combinedTypeTable == max(combinedTypeTable))]

## Water 
##   126

The most common type for a Pokémon is Water.

combinedTypeTable = sort(combinedTypeTable, decreasing=TRUE)
barplot(combinedTypeTable[1:5], main = "TOP 5 MOST COMMON TYPES IN POKEMON", ylab="Frequency", col=c("lightblue"))

How many of each type of Pokémon is there in every Generation?

How many different generations are present in the data

table(Pokemon$Generation)

## 
##   1   2   3   4   5   6 
## 166 106 160 121 165  82

There are 6 different generations of Pokémon listed in the data, with a varying number of Pokemon for each generation.

Parse the data to see how many of each type is in Generation 1.

# table for the number of each Type.1 type for all entries where the Generation is 1
table(Pokemon$Type.1[which(Pokemon$Generation == 1)])

## 
##      Bug   Dragon Electric    Fairy Fighting     Fire    Ghost    Grass 
##       14        3        9        2        7       14        4       13 
##   Ground      Ice   Normal   Poison  Psychic     Rock    Water 
##        8        2       24       14       11       10       31

# table for the number of each Type.2 type for all entries where the Generation is 1
table(Pokemon$Type.2[which(Pokemon$Generation == 1)])

## 
##              Dark   Dragon    Fairy Fighting   Flying    Grass   Ground 
##       88        1        1        3        2       23        2        6 
##      Ice   Poison  Psychic     Rock    Steel    Water 
##        3       22        7        2        2        4

length(table(Pokemon$Type.1[which(Pokemon$Generation == 1)]))

## [1] 15

length(table(Pokemon$Type.2[which(Pokemon$Generation == 1)]))

## [1] 14

But due to the way that Pokémon types are set up there is an issue. A Pokémon has one or two types, and the Type.2 is valued the same as Type.1. This causes discrepancies like above, where some types are missing from the Type.2 table and the Type.1 table .

Make a function that fills in the missing “types” and returns one vector with the combined values for both Type.1 and Type.2

# this function will help us combine the two Type tables int o one table with all unique types from both Type.1 and Type.2, as well as add up duplicates


combineTypeTblByGen = function(generation){
  
  # merging the table of Type.1 and Type.2 for the generation
  mergedTypeTable = merge(table(Pokemon$Type.1[which(Pokemon$Generation == generation)]), table(Pokemon$Type.2[which(Pokemon$Generation == generation)])[-1], all= TRUE)
  
  
  # vector version of the merged tables above
  typeVec = mergedTypeTable[[2]]
  names(typeVec) = mergedTypeTable[[1]]
  
  
  
  # for loop that adds up the repeat values for the types, and then removes the duplicate name+value from the vector
  len = length(typeVec) - 1
  for(i in 1:len){
  
    if(i >= length(typeVec)){break}
  
    else if(names(typeVec[i]) == names(typeVec[i + 1])){

      typeVec[i+1] = typeVec[i] + typeVec[i+1]
      #print(typeVec[i])

      typeVec = typeVec[-c(i)]
  
    }

  }
  
  return(typeVec)
  
  
}


barplot(combineTypeTblByGen(1), las=2, main="Frequency of Types for Generation 1", col=c("lightblue"))

The plot above combines frequency of “types” in both Type1 and Type2 for generation 1, and the function combineTypeTblByGen() replicate this for every other generation.

This next function outputs a vector that shows how many of one specific type is in every Generation. For example it’ll show the number of “Bug” types there are in Generation 1 through 6.

# function takes in a character argument that represents a Pokemon Type and returns a vector of length 6, with each index representing the number of that type of pokemon are in that generation number

compareTypeAcrossGen = function(type){
  
  oneTypeAllGensVec = c()
  
  for(i in 1:6){
    
    #print(combineTypeTblByGen(i)[type])
    
    oneTypeAllGensVec[i] = combineTypeTblByGen(i)[type]
   
    
  }

  names(oneTypeAllGensVec) = c("Gen 1", "Gen 2", "Gen 3", "Gen 4", "Gen 5", "Gen 6")
  
  return(oneTypeAllGensVec)

}


compareTypeAcrossGen("Bug")

## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    14    12    14    11    18     3

compareTypeAcrossGen("Dark")

## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     1     8    13     7    16     3

barplot(compareTypeAcrossGen("Bug"), main="Frequency of Bug Types in Each Generation(Gen)", las=2, col = "lightgreen")

Use the Plotly package to create an interactive barplot that shows the frequency of each type across the 6 Generations

typeByGenTraces[]

## [[1]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    14    12    14    11    18     3 
## 
## [[2]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     4     2    15     4    12     9 
## 
## [[3]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     9     9     5    12    12     3 
## 
## [[4]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     5     8     8     1     3    14 
## 
## [[5]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     9     2     9    10    17     4 
## 
## [[6]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    14    11     9     6    16     8 
## 
## [[7]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     4     1     8     9     9    15 
## 
## [[8]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    15    10    18    17    20    15 
## 
## [[9]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    14    11    16    12    12     2 
## 
## [[10]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     5     5     7     8     9     2 
## 
## [[11]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    24    15    18    18    19     4 
## 
## [[12]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    36     4     5     8     7     2 
## 
## [[13]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    18    10    28    10    16     8 
## 
## [[14]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    12     8    12     7    10     9 
## 
## [[15]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    35    18    31    15    18     9 
## 
## [[16]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     1     8    13     7    16     3 
## 
## [[17]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    23    19    14    16    21     8 
## 
## [[18]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     2     3    12    12    12     5

q2FinalGraph

Database

library(DBI)
library(RSQLite)
drv = dbDriver("SQLite")

pokemonDatabase = dbConnect(drv, dbname = "./data/veekun-pokedex.sqlite")
dbExecute(pokemonDatabase, "PRAGMA foreign_keys = on")

## [1] 0

dbListTables(pokemonDatabase)

Hypothesis Testing

Pokémon types follow fairly standard themes, with Water types usually being fish, Flying types being birds, ect. Then the logical conclusion would be that the stats of these Pokémon reflect their physical forms, with Flying types being birds with high speed stats and larger Ground types being more durable but with lower speeds. But does this hypothesis hold up statistically?

Null Hypothesis:

There is no statistically significant difference in the distributions of Flying type Pokémons’ Speed stats and the Ground type Pokémons’ Speed stats.
Alternative Hypothesis:

Flying type Pokémon have a statistically higher distribution of speed stats than Ground types.

Step 1: Query -> data frame with every Flying type and their speed stat and every ground type and their speed stat

Step 1a: get a dataframe with every Pokemon and its speed stat

dbGetQuery(pokemonDatabase, 
           "PRAGMA FOREIGN_KEY_LIST(pokemon_stats)")

##   id seq   table       from to on_update on_delete match
## 1  0   0   stats    stat_id id NO ACTION NO ACTION  NONE
## 2  1   0 pokemon pokemon_id id NO ACTION NO ACTION  NONE

dbGetQuery(pokemonDatabase, "
           SELECT pokemon_id, identifier, base_stat
           FROM pokemon_stats INNER JOIN stats
           ON pokemon_stats.stat_id = stats.id
           WHERE identifier = 'speed'
           LIMIT 20")

##    pokemon_id identifier base_stat
## 1           1      speed        45
## 2           2      speed        60
## 3           3      speed        80
## 4           4      speed        65
## 5           5      speed        80
## 6           6      speed       100
## 7           7      speed        43
## 8           8      speed        58
## 9           9      speed        78
## 10         10      speed        45
## 11         11      speed        30
## 12         12      speed        70
## 13         13      speed        50
## 14         14      speed        35
## 15         15      speed        75
## 16         16      speed        56
## 17         17      speed        71
## 18         18      speed       101
## 19         19      speed        72
## 20         20      speed        97

Step 1b: dataframe with every pokemon’s type

# dbGetQuery(pokemonDatabase, "
#            SELECT *
#            FROM pokemon_types")
# 
# dbGetQuery(pokemonDatabase, "
#            PRAGMA FOREIGN_KEY_LIST(pokemon_types)")
# 
# dbGetQuery(pokemonDatabase, "
#            SELECT *
#            FROM types")
dbGetQuery(pokemonDatabase, "
           PRAGMA FOREIGN_KEY_LIST(pokemon_types)
           ")

##   id seq   table       from to on_update on_delete match
## 1  0   0   types    type_id id NO ACTION NO ACTION  NONE
## 2  1   0 pokemon pokemon_id id NO ACTION NO ACTION  NONE

dbGetQuery(pokemonDatabase, "
           SELECT pokemon_id, slot, identifier as TYPE
           FROM pokemon_types INNER JOIN types
           ON pokemon_types.type_id = types.id
          LIMIT 20")

##    pokemon_id slot   TYPE
## 1           1    1  grass
## 2           1    2 poison
## 3           2    1  grass
## 4           2    2 poison
## 5           3    1  grass
## 6           3    2 poison
## 7           4    1   fire
## 8           5    1   fire
## 9           6    1   fire
## 10          6    2 flying
## 11          7    1  water
## 12          8    1  water
## 13          9    1  water
## 14         10    1    bug
## 15         11    1    bug
## 16         12    1    bug
## 17         12    2 flying
## 18         13    1    bug
## 19         13    2 poison
## 20         14    1    bug

Step 1c: Dataframe with every Pokémon, its speed stat, and its type

# initial query with all required columns
dbGetQuery(pokemonDatabase, "
           SELECT pokemon.id, pokemon.identifier, pokemon_types.pokemon_id, pokemon_types.type_id, types.identifier, pokemon_stats.stat_id, pokemon_stats.base_stat, stats.identifier
           
           FROM pokemon_types

           INNER JOIN pokemon ON pokemon_types.pokemon_id = pokemon.id
           INNER JOIN types ON pokemon_types.type_id = types.id
           INNER JOIN pokemon_stats ON pokemon_stats.pokemon_id = pokemon.id
           INNER JOIN stats ON pokemon_stats.stat_id = stats.id
           LIMIT 20
           ")

##    id identifier pokemon_id type_id identifier stat_id base_stat
## 1   1  bulbasaur          1      12      grass       1        45
## 2   1  bulbasaur          1      12      grass       2        49
## 3   1  bulbasaur          1      12      grass       3        49
## 4   1  bulbasaur          1      12      grass       4        65
## 5   1  bulbasaur          1      12      grass       5        65
## 6   1  bulbasaur          1      12      grass       6        45
## 7   1  bulbasaur          1       4     poison       1        45
## 8   1  bulbasaur          1       4     poison       2        49
## 9   1  bulbasaur          1       4     poison       3        49
## 10  1  bulbasaur          1       4     poison       4        65
## 11  1  bulbasaur          1       4     poison       5        65
## 12  1  bulbasaur          1       4     poison       6        45
## 13  2    ivysaur          2      12      grass       1        60
## 14  2    ivysaur          2      12      grass       2        62
## 15  2    ivysaur          2      12      grass       3        63
## 16  2    ivysaur          2      12      grass       4        80
## 17  2    ivysaur          2      12      grass       5        80
## 18  2    ivysaur          2      12      grass       6        60
## 19  2    ivysaur          2       4     poison       1        60
## 20  2    ivysaur          2       4     poison       2        62
##         identifier
## 1               hp
## 2           attack
## 3          defense
## 4   special-attack
## 5  special-defense
## 6            speed
## 7               hp
## 8           attack
## 9          defense
## 10  special-attack
## 11 special-defense
## 12           speed
## 13              hp
## 14          attack
## 15         defense
## 16  special-attack
## 17 special-defense
## 18           speed
## 19              hp
## 20          attack

Step 1d: Save the dataframes

# cleaned up query with all required info for flying and ground type pokemon
flynGrndSPD_df = dbGetQuery(pokemonDatabase, "
           SELECT pokemon.identifier AS name, types.identifier AS type, stats.identifier AS stat_name, pokemon_stats.base_stat
           
           FROM pokemon_types

           INNER JOIN pokemon ON pokemon_types.pokemon_id = pokemon.id
           INNER JOIN types ON pokemon_types.type_id = types.id
           INNER JOIN pokemon_stats ON pokemon_stats.pokemon_id = pokemon.id
           INNER JOIN stats ON pokemon_stats.stat_id = stats.id
            
           WHERE stat_name LIKE 'speed' AND (type LIKE 'flying' OR type LIKE 'ground')
           
           ")

# data frame with flying types and their speed stats
flyingSpeeds = flynGrndSPD_df[which(flynGrndSPD_df$type == "flying"), ]

# data frame with ground types and their speed stats
groundSpeeds = flynGrndSPD_df[which(flynGrndSPD_df$type == "ground"), ]

# first 10 entries
flynGrndSPD_df[1:10,]

##          name   type stat_name base_stat
## 1   charizard flying     speed       100
## 2  butterfree flying     speed        70
## 3      pidgey flying     speed        56
## 4   pidgeotto flying     speed        71
## 5     pidgeot flying     speed       101
## 6     spearow flying     speed        70
## 7      fearow flying     speed       100
## 8   sandshrew ground     speed        40
## 9   sandslash ground     speed        65
## 10  nidoqueen ground     speed        76

Step 2: Determine normality
Visualize distribution using Bar Plots and QQ Plots

hist(flyingSpeeds$base_stat, main = "Flying Type Speed Distribution", xlab = "Speed stat", breaks=20)

hist(groundSpeeds$base_stat, main = "Ground Type Speed Distribution", xlab = "Speed stat", breaks = 20)

qqnorm(flyingSpeeds$base_stat, main  = "Normal Q-Q Plot for Flying Types")
qqline(flyingSpeeds$base_stat, col = "red")

qqnorm(groundSpeeds$base_stat, main  = "Normal Q-Q Plot for Ground Types")
qqline(groundSpeeds$base_stat, col = "red")

Step 2b: Conduct Shapiro-Wilk Test

While the data doesn’t appear to be normal, it helps to do a formal test for an accurate interpretation. Conducting a Shapiro-Wilk Test on a random sample from both groups will be more definitive.

First take random samples from both groups with sample size n = 30.

set.seed(5423)
flySpeedSample = sample(flyingSpeeds$base_stat, 30, replace = TRUE)
grndSpeedSample = sample(groundSpeeds$base_stat, 30, replace = TRUE)

Then use R’s built in function for the Shapiro-Wilk test for normality.

shapiro.test(flySpeedSample)

## 
##  Shapiro-Wilk normality test
## 
## data:  flySpeedSample
## W = 0.93702, p-value = 0.07561

shapiro.test(grndSpeedSample)

## 
##  Shapiro-Wilk normality test
## 
## data:  grndSpeedSample
## W = 0.92855, p-value = 0.04494

Interpreting the results:

The W statistic is the result of the Shapiro Wilk Test, which measures how close the sample matches a normal distribution, with 1 being normal and 0 being not normal.

The P value determines if the null hypothesis (that the random sample comes from a normally distributed data set) can be rejected. If the p value result is <= 0.05, null hypothesis is rejected. However if the p value result is > 0.05 you are unable to reject the null hypothesis.

However, depending on the random sample, the samples either fail or pass the test. To see if the samples are passing the Shapiro-Wilk test a reliable number of times, build a function that does the Shapiro-Wilk test on 100 random samples from the data. Using a p value of 0.05 (ie if less that 95% of the samples pass the test) to determine if the null hypothesis that the samples come from a normally distributed population can be rejected.

shapiroSample = function(dataset, reps, sampleSize){
  

  pass_vec = c()
  for(i in 1:reps){
    sample = sample(dataset, sampleSize, replace = TRUE)
    outcome = shapiro.test(sample)
    
    # checks to see if this sample rejects the null hypothesis or not
    # if it does adds a 0 for fail, and if not adds a 1 for pass
    
    if(outcome[2] > 0.05){pass_vec[i] = 1}
    else if(outcome[2] <= 0.05){pass_vec[i] = 0}
    
  }
  
  return(pass_vec)
}

Execute the function

set.seed(2354)
mean(shapiroSample(flyingSpeeds$base_stat, 100, 30))

## [1] 0.79

mean(shapiroSample(groundSpeeds$base_stat, 100, 30))

## [1] 0.57

Since the percentage of samples out of 100 that pass the Shapiro Wilk test are less than 95% for both flying and ground types, the null hypothesis can be rejected.

Step 3: Conduct Mann-Whitney U Test
The Mann-Whitney U test is the appropriate statistical test to utilize when comparing distributions of non-normal data.

Step 3a: Rank the speed stats of both samples
Showing each sample and the top 20 ranks

##  [1]  70 110 121 123  80  89  85 100  70  80  50  80  86 125  85 121  80  95  70
## [20]  70 127 110  80 110  99  86 100  70  70  85

##  [1]  40  39  50  39 101  70  30  70  40  20  90  40  45  80  36  40  41  60  88
## [20]  40  90  60  40  20  76  70  50  65  90  50

##      type speed_stat rank
## 40 ground         20    1
## 54 ground         20    2
## 37 ground         30    3
## 45 ground         36    4
## 32 ground         39    5
## 34 ground         39    6
## 31 ground         40    7
## 39 ground         40    8
## 42 ground         40    9
## 46 ground         40   10
## 50 ground         40   11
## 53 ground         40   12
## 47 ground         41   13
## 43 ground         45   14
## 11 flying         50   15
## 33 ground         50   16
## 57 ground         50   17
## 60 ground         50   18
## 48 ground         60   19
## 52 ground         60   20

Step 3b: Separate the flying and ground types to get their rank sums

ground_speed_ranks = fly_grnd_speed_ranked[which(fly_grnd_speed_ranked$type == "ground"), ]

flying_speed_ranks = fly_grnd_speed_ranked[which(fly_grnd_speed_ranked$type == "flying"), ]

Step 3c: Calculate the rank sums for both types

ground_speed_ranksum = sum(ground_speed_ranks$rank)
flying_speed_ranksum = sum(flying_speed_ranks$rank)
ground_speed_ranksum

## [1] 604

flying_speed_ranksum

## [1] 1226

Step 3d: Determine the U statistic and the P value

The U statistic is the measure of how often values from one of the samples rank below the other. U1 is the number of times values from the first sample rank above values from the second one. U2 is vice versa. The U statistic is min(U1, U2).

However as sample sizes increase past n = 20, even though the U statistic still shows how much overlap there is between the two sample ranks, the distribution of U, ie all of the possible values that U could be, becomes approximately normal.

This requires calculating the z-score instead, the measure of how many standard deviations your calculated U statistic is from the mean U statistic for distributions under the null hypothesis.

Then use pnorm(z-score) to get the p value, the measure of how likely, given the null hypothesis is true, of getting that specific z-score.

n1 = dim(ground_speed_ranks)[1]
n2 = dim(flying_speed_ranks)[1]

U1 = ( (n1 * n2) + ((n1 * (n1+1)) / 2) ) - ground_speed_ranksum
U2 = ( (n1 * n2) + ((n1 * (n1+1)) / 2) ) - flying_speed_ranksum

U1

## [1] 761

U2

## [1] 139

Ustatistic = min(U1, U2)
muU = (n1 * n2) / 2
sigmaU = sqrt(((n1 * n2) * (n1 + n2 + 1)) / 12)
  
zscore = (Ustatistic - muU) / sigmaU

zscore

## [1] -4.597956

# gives the p value
pnorm(zscore)

## [1] 2.133277e-06

Since the p value is extremely small (p < 0.01), much smaller than the chosen significance value of 0.05, the null hypothesis that there is no significant difference between the speed stats of flying and ground type Pokémon is rejected.

Additionally a negative z-score like the one above indicates that the first group(ground types) had lower ranks.

Conclusion:

After taking a a random sample of Flying and Ground type Pokémon, preliminary analysis using the Shapiro-Wilk test indicated that the samples for both types did not come from a normal distribution. As a result, the non parametric Mann-Whitney U test was applied, resulting in a z-score and p value that revealed a statistically significant difference in speed stats of Flying-type and Ground-type Pokémon. These results align with the original hypothesis and thematic characteristics of both types of Pokémon, indicating that there is an important relationship between a Pokémon’s type and it’s stats.

dbDisconnect(pokemonDatabase)