In R, comments begin with the # symbol. Everything after # on that line is treated as a comment and ignored during execution.
# This is a comment!
R has several basic data types, including numeric, integer, character, and logical. The type of a value determines which operations can be applied to it.
# var1 has the numeric data type
var1 <- 3
# var2 has the character data type
var2 <- "happiness"
# var3 has the logical data type
var3 <- TRUE
The numeric data type represents numbers. R handles both integers and real numbers, and performs arithmetic operations on them.
The character data type in R is used to store text data. Text strings are enclosed in quotes, conventionally double quotes; single quotes work as well.
The R logical data type has two possible values: TRUE or FALSE.
In R, NA represents missing or undefined data. Write NA in uppercase and do not wrap it in quotes.
The assignment operator <- is used to assign values to variables. For example, x <- 10 assigns the value 10 to the variable x.
example <- 4
another_example <- 15.2
R supports mathematical operations such as addition, subtraction, multiplication, and division. These operations follow the standard order of operations (PEMDAS).
# Results in 500
573 - 74 + 1
# Results in 50
25 * 2
# Results in 2
10 / 5
Conditional statements in R evaluate a condition and execute a code block only when the condition is TRUE.
if (n == 5) {
  print('n is equal to five')
}
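As a quick check of the data types, NA handling, and order of operations described above, the following minimal sketch uses only base R. The variable names are illustrative and reuse the values from the earlier snippets.

```r
# Inspect the type of a few values (names are illustrative)
var1 <- 3
var2 <- "happiness"
var3 <- TRUE
class(var1)   # "numeric"
class(var2)   # "character"
class(var3)   # "logical"
class(5L)     # "integer" (the L suffix marks an integer literal)

# NA is a missing value; "NA" in quotes is ordinary text
is.na(NA)     # TRUE
is.na("NA")   # FALSE

# Multiplication is evaluated before addition unless parentheses say otherwise
2 + 3 * 4     # 14
(2 + 3) * 4   # 20
```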
The dplyr package provides functions for data manipulation. Key functions include select(), filter(), mutate(), arrange(), and summarize(), which allow for efficient data exploration and manipulation.
A data frame in R is a two-dimensional structure with rows and columns. Each column holds values of a single type, different columns can hold different types, and each row represents one observation.
CSV files store data in a plain text format, where values are separated by commas. These files are compatible with many applications and can be imported into R for analysis.
The read_csv() and write_csv() functions from the readr package handle reading from and writing to CSV files. read_csv() returns a tibble, a modern version of the data frame.
Functions like head() and summary() provide insights into data frames. head() shows the first few rows (six by default), while summary() offers summary statistics for each column.
The pipe operator %>% passes the result of one function as the first argument of the next. This allows for cleaner and more readable code.
data_frame %>% select(column) %>% filter(condition)
The select() function extracts specified columns from a data frame, while dropping the others. Columns can be given by name or by position.
# Keep the first two columns of weather
weather %>% select(1:2)
The select() function also allows for column exclusion by prefixing column names with a - sign. This results in a new data frame without the excluded columns.
artists %>% select(-genre, -spotify_monthly_listeners, -year_founded)
The filter() function keeps the rows that meet certain criteria. Conditions are passed as arguments to the function and are built with comparison operators (e.g., <, ==, >, !=); when several conditions are given, a row must satisfy all of them.
filter(artists, genre == 'Rock', spotify_monthly_listeners > 20000000)
The arrange() function sorts rows based on column values. Sorting is ascending by default; wrap a column in desc() for descending order.
arrange(data_frame, desc(column_name))
The mutate() function adds new columns to a data frame or modifies existing ones based on transformations.
mutate(data_frame, new_column = existing_column * 2)
# Add a cm column computed from the inches column
mutate(heights, cm = inches * 2.54)
The transmute() function creates new columns from transformations while dropping the existing columns from the data frame.
transmute(data_frame, new_column = transformation(existing_column))
# Ratio of each row's total_population to the previous row's value
transmute(population, increase = total_population / lag(total_population))
The rename() function changes column names in a data frame. To rename a column, write the new name, then =, then the old name. Variants like rename_if(), rename_at(), and rename_all() provide more flexibility; in current dplyr releases they are superseded by rename_with().
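The argument order of rename() described above (new name first, old name second) is easy to get backwards, so a small sketch may help. It assumes the dplyr package is installed; the data frame and its column names are made up for illustration. It also shows head() and summary() on the result.

```r
library(dplyr)

# Hypothetical data frame used only for illustration
artists <- data.frame(
  group = c("A", "B", "C"),
  yr = c(1990, 2001, 2012)
)

# rename(data, new_name = old_name)
renamed <- rename(artists, year_founded = yr)

names(renamed)    # "group" "year_founded"
head(renamed, 2)  # first two rows
summary(renamed)  # per-column summary statistics
```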
The gsub() function replaces occurrences of a pattern in a string with a replacement value. It can handle single strings or vectors of strings.
# Replace '1' with an empty string in a vector
teams <- c("Fal1cons", "Cardinals", "Seah1awks", "Vikings", "Bro1nco", "Patrio1ts")
teams_clean <- gsub("1", "", teams)
print(teams_clean)
# Output: "Falcons" "Cardinals" "Seahawks" "Vikings" "Bronco" "Patriots"
The distinct() function removes duplicate rows from a data frame. It can also be used to select unique values from specific columns.
# Keep unique rows in a data frame
distinct(data_frame)
# Keep unique values in a specific column
distinct(data_frame, column_name)
The str() function provides a compact, human-readable summary of the structure of an R object. It displays the type, size, and content of the object.
str(data_frame)
Multiple files can be combined into a single data frame using functions like lapply() and bind_rows(). This is useful for aggregating data from different sources.
# files is a character vector of CSV file paths
df_list <- lapply(files, read_csv)
df_combined <- bind_rows(df_list)
The as.numeric() function converts data to numeric type, which is useful for performing numerical operations on data that may be stored as characters.
numeric_data <- as.numeric(character_data)
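One caveat is worth illustrating: values that cannot be parsed as numbers come back as NA, and R emits a warning. The sketch below uses a made-up character vector to show this in base R.

```r
# Hypothetical character vector; "abc" cannot be parsed as a number
character_data <- c("1", "2.5", "abc")

# suppressWarnings() hides the "NAs introduced by coercion" warning
numeric_data <- suppressWarnings(as.numeric(character_data))

numeric_data               # 1.0 2.5 NA
sum(is.na(numeric_data))   # 1 value failed to convert
```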
The str_sub() function from the stringr package extracts or replaces substrings within a string, using specified start and end positions.
# Extract the first five characters from a string
str_sub('Marya1984', start = 1, end = 5)
Missing data in R is represented by NA. Handling missing data involves techniques such as imputation, exclusion, or using functions like is.na() to identify missing values.
# Keep only the rows where column_name is not missing
data_without_na <- data_frame[!is.na(data_frame$column_name), ]
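The techniques named above (identifying, excluding, and imputing missing values) can be sketched in base R as follows. The vector is hypothetical, and mean imputation is just one simple choice among many.

```r
# Hypothetical vector with one missing value
scores <- c(10, NA, 30)

is.na(scores)               # FALSE TRUE FALSE
mean(scores)                # NA, because one value is missing
mean(scores, na.rm = TRUE)  # 20, the missing value is excluded

# Simple mean imputation: replace NA with the mean of the observed values
imputed <- ifelse(is.na(scores), mean(scores, na.rm = TRUE), scores)
imputed                     # 10 20 30
```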
The tidyr package provides functions for reshaping data, such as pivot_longer() and pivot_wider(), which transform data between wide and long formats.
# Convert wide format to long format
pivot_longer(data_frame, cols = c(column1, column2))
Regular expressions (regex) are patterns used for string matching and manipulation. Functions like gsub() and str_replace() use regex to perform replacements.
# Remove all non-numeric characters
cleaned_string <- gsub("[^0-9]", "", original_string)
The substr() function extracts or replaces substrings from a string based on specified starting and ending positions.
substr('example_string', 1, 7)
ggplot2 is a powerful package for creating data visualizations. It follows the Grammar of Graphics to build plots by combining various components.
# Basic scatter plot
ggplot(data_frame, aes(x = x_variable, y = y_variable)) +
  geom_point()
Key ggplot2 functions include ggplot() to initialize the plot, aes() for aesthetics, and geom_* functions to add geometric layers like points, lines, or bars.
# Adding a smoothed trend line to a scatter plot
ggplot(data_frame, aes(x = x_variable, y = y_variable)) +
  geom_point() +
  geom_smooth()
In ggplot2, aesthetics (aes()) control visual properties such as color, size, and shape. Mapping aesthetics allows you to represent data values visually.
# Scatter plot with color mapping
ggplot(data_frame, aes(x = x_variable, y = y_variable, color = category)) +
  geom_point()
The geom_bar() function creates bar charts for categorical data. By default, the height of each bar is the number of rows in that category; to plot values already present in the data, use geom_col() instead.
# Basic bar chart
ggplot(data_frame, aes(x = category)) +
  geom_bar()
The ggtitle() function adds titles to plots. It allows customization of plot titles, including main titles and subtitles.
# Adding a title
ggplot(data_frame, aes(x = x_variable, y = y_variable)) +
  geom_point() +
  ggtitle("Main Title")
Themes in ggplot2 control the non-data elements of plots, such as background color, gridlines, and text size. Themes can be customized or replaced with built-in options.
# Applying a theme
ggplot(data_frame, aes(x = x_variable, y = y_variable)) +
  geom_point() +
  theme_minimal()
The geom_histogram() function creates histograms to show the distribution of a single variable. It groups data into bins and displays frequencies.
# Basic histogram
ggplot(data_frame, aes(x = variable)) +
  geom_histogram(bins = 30)
The geom_point() function creates scatter plots, plotting data points with specified aesthetics such as color and size.
# Scatter plot
ggplot(data_frame, aes(x = x_variable, y = y_variable)) +
  geom_point()
The geom_line() function adds line layers to plots, useful for visualizing trends over time or continuous variables.
# Line plot
ggplot(data_frame, aes(x = x_variable, y = y_variable)) +
  geom_line()
Aggregation functions in R, such as sum(), mean(), and sd(), reduce a set of values to a single summary statistic. Combined with grouping, they compute that statistic for each group.
# Calculating mean
mean(data_frame$column_name)
The group_by() function groups data based on specified columns. It is often used in conjunction with summarization functions to analyze data by groups.
data_frame %>% group_by(group_variable) %>% summarize(mean_value = mean(numeric_variable))
The inner_join() function combines rows from two data frames where there are matching values in specified columns.
inner_join(df1, df2, by = "common_column")
The left_join() function combines rows from two data frames, retaining all rows from the left data frame and matching rows from the right data frame.
left_join(df1, df2, by = "common_column")
The right_join() function retains all rows from the right data frame and matches rows from the left data frame, filling in with NA where no match is found.
right_join(df1, df2, by = "common_column")
The full_join() function combines rows from two data frames, retaining all rows from both data frames, with NA in places where matches are not found.
full_join(df1, df2, by = "common_column")
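To make the difference between the four joins concrete, here is a minimal sketch that assumes the dplyr package is installed. The two data frames and their column names are made up for illustration; they share an id column, with ids 2 and 3 present in both.

```r
library(dplyr)

# Two small hypothetical data frames sharing an id column
df1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"))
df2 <- data.frame(id = c(2, 3, 4), score = c(20, 30, 40))

nrow(inner_join(df1, df2, by = "id"))  # 2 rows: ids 2 and 3
nrow(left_join(df1, df2, by = "id"))   # 3 rows: ids 1, 2, 3 (score is NA for id 1)
nrow(right_join(df1, df2, by = "id"))  # 3 rows: ids 2, 3, 4 (name is NA for id 4)
nrow(full_join(df1, df2, by = "id"))   # 4 rows: ids 1 through 4
```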
The mean is the average value of a dataset, calculated by summing all values and dividing by the number of values.
mean(c(2, 4, 6, 8))
The median is the middle value of a dataset when it is ordered from least to greatest. If there is an even number of values, the median is the average of the two middle values.
median(c(2, 4, 6, 8))
The mode is the value that appears most frequently in a dataset. R does not have a built-in function for mode, but it can be calculated using table() and which.max().
table(c(1, 2, 2, 3, 4))
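The call to table() above only produces the counts. One possible way to finish the calculation with which.max() is sketched below; the variable names are illustrative, and note that which.max() returns only the first maximum, so ties are not reported.

```r
values <- c(1, 2, 2, 3, 4)

# Count how often each value occurs
counts <- table(values)

# Pick the name of the most frequent entry and convert it back to a number
mode_value <- as.numeric(names(counts)[which.max(counts)])
mode_value  # 2
```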
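Variance measures the spread of data points around the mean. It is based on the average of the squared differences from the mean; R's var() computes the sample variance, which divides the sum of squared differences by n - 1 rather than n.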
var(c(1, 2, 3, 4, 5))
Standard deviation is the square root of the variance, providing a measure of spread in the same units as the data.
sd(c(1, 2, 3, 4, 5))
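As a check on what var() and sd() compute, the sketch below recomputes both by hand in base R. The variable names are illustrative.

```r
x <- c(1, 2, 3, 4, 5)
n <- length(x)

# Sum of squared differences from the mean: 4 + 1 + 0 + 1 + 4 = 10
squared_diffs <- sum((x - mean(x))^2)

sample_var <- squared_diffs / (n - 1)   # 2.5, matches var(x)
population_var <- squared_diffs / n     # 2, divides by n instead

all.equal(var(x), sample_var)           # TRUE
all.equal(sd(x), sqrt(sample_var))      # TRUE
```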
Quartiles divide a dataset into four equal parts. Q1 (first quartile) is the value below which 25% of the data falls, Q2 (median) divides the data in half, and Q3 (third quartile) is the value below which 75% of the data falls.
quantile(c(1, 2, 3, 4, 5), probs = 0.25)
Quantiles are values that divide a dataset into equal parts. Common quantiles include quartiles, deciles, and percentiles.
quantile(c(1, 2, 3, 4, 5), probs = c(0.1, 0.5, 0.9))
The IQR measures the range within which the middle 50% of the data falls. It is calculated as Q3 minus Q1.
IQR(c(1, 2, 3, 4, 5))
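The relationship between quartiles and the IQR can be verified directly. This sketch uses base R with quantile()'s default interpolation method; the variable names are illustrative.

```r
x <- c(1, 2, 3, 4, 5)

# Q1, Q2 (median), and Q3 in one call
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))
quartiles                     # 25% = 2, 50% = 3, 75% = 4

# IQR is Q3 minus Q1
q3_minus_q1 <- unname(quartiles["75%"] - quartiles["25%"])
q3_minus_q1                   # 2
all.equal(IQR(x), q3_minus_q1)  # TRUE
```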
Hypothesis testing is a statistical method used to make inferences or draw conclusions about a population based on sample data. It involves formulating a null hypothesis (H0) and an alternative hypothesis (H1), and then using statistical tests to determine whether there is enough evidence to reject the null hypothesis.
# Formulating hypotheses
H0 <- "The mean is equal to 0"
H1 <- "The mean is not equal to 0"
The t-test compares means to determine whether they differ significantly. It comes in three forms: a one-sample t-test (a sample mean against a hypothesized value), an independent two-sample t-test, and a paired t-test.
# One-sample t-test
result <- t.test(data_vector, mu = 0)
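The other two forms use the same t.test() function. The sketch below runs them on made-up measurements; the vectors and variable names are hypothetical, and the paired version assumes the two vectors hold matched observations in the same order.

```r
# Hypothetical measurements for two groups of equal size
group_a <- c(5.1, 4.9, 5.6, 5.8, 6.0)
group_b <- c(6.2, 6.8, 7.1, 6.5, 7.0)

# Independent two-sample t-test (Welch's test by default)
two_sample <- t.test(group_a, group_b)

# Paired t-test, treating the vectors as matched pairs
paired <- t.test(group_a, group_b, paired = TRUE)

two_sample$p.value
paired$p.value
```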
The chi-square test assesses the association between categorical variables. It compares observed frequencies to expected frequencies to determine if there is a significant difference.
# Chi-square test for independence
result <- chisq.test(table(data_frame$variable1, data_frame$variable2))
ANOVA tests if there are significant differences among the means of three or more groups. It extends the t-test for comparing multiple groups.
# One-way ANOVA
result <- aov(response_variable ~ factor_variable, data = data_frame)
The p-value is the probability of observing results at least as extreme as those actually observed, assuming the null hypothesis is true. A small p-value (typically < 0.05) is taken as evidence against the null hypothesis.
# Extract the p-value from a t.test() or chisq.test() result
# (aov() results have no p.value element; use summary(result) instead)
p_value <- result$p.value
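For an ANOVA fitted with aov(), the p-value lives in the summary table rather than in a p.value element. The sketch below shows one way to pull it out, using the PlantGrowth dataset that ships with R as a stand-in for real data; the variable names are illustrative.

```r
# PlantGrowth ships with R: plant weight under three treatment groups
fit <- aov(weight ~ group, data = PlantGrowth)

# summary() returns a list whose first element is the ANOVA table
anova_table <- summary(fit)[[1]]

# The first row belongs to the factor; the Residuals row has no p-value
p_value <- anova_table[["Pr(>F)"]][1]
p_value
```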
A confidence interval provides a range of values within which the true population parameter is likely to fall, with a specified level of confidence (e.g., 95%).
# Calculate 95% confidence interval
result <- t.test(data_vector)
conf_interval <- result$conf.int