How To Merge Data Sets In R

This article is also available in Spanish.

Merging (also known as joining) two datasets by one or more common ID variables (keys) is a common task for any data scientist. If you get the merge wrong you can do serious harm to your downstream analysis, so you'd better make sure you're doing the right thing! To help with that, I'll walk you through three different approaches to joining tables in R: the {base} way, the {dplyr} way and the SQL way (yes, you can use SQL in R).

Types of Merges

First of all, though, let's review the different ways you can merge datasets. Borrowing from SQL terminology, I will cover these four types:

Left join
Right join
Inner join
Full join

Left Join

In a left join involving datasets L and R, the final table (let's call it LR) will contain all records from dataset L but only those records from dataset R whose key (ID) is contained in L.

A left join of two tables performed in R

Right Bring together

A right join is just like a left join but the other way around: the final table contains all rows from R and only those from L with a matching key. Note that you can rewrite any right join of L with R as a left join of R with L.

A right join of two tables performed in R

Inner Join

In an inner join, only those records from L and R that have a matching key in the other dataset are contained in the final table.

An inner join of two tables performed in R

Full Join

With a full join, the resulting dataset contains all rows from L and all rows from R, regardless of whether or not there's a matching key.

A full join of two tables performed in R
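One way to keep the four types apart is to look only at which keys survive each join. The sketch below uses two made-up key vectors (keys_L and keys_R are hypothetical names, not part of the examples that follow) and base R set operations. It deliberately ignores duplicate keys, so it shows which IDs appear in the result, not how many rows there are.

```r
# Hypothetical key vectors standing in for the ID columns of L and R
keys_L <- c("P1", "P2", "P3")
keys_R <- c("P1", "P3", "P4")

left_keys  <- keys_L                     # left join: every key in L
right_keys <- keys_R                     # right join: every key in R
inner_keys <- intersect(keys_L, keys_R)  # inner join: keys present in both
full_keys  <- union(keys_L, keys_R)      # full join: keys present in either

print(inner_keys)
print(full_keys)
```

Here the inner join keeps P1 and P3, while the full join keeps all four patients.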

The {base} Way

Enough of the theory, let's explore how to actually perform a merge in R. First of all, the {base} way. In {base} R you use a single function to perform all merge types covered above. Conveniently, it is called merge().

To illustrate the concepts I will use two fictitious datasets from a clinical trial. One table contains demographic information and the other one adverse events recorded throughout the course of the trial. Note that patient P2 has a record in demographics but not in adverse_events and that P4 is contained in adverse_events but not in demographics.

    # One row per patient
    demographics <- data.frame(
      id = c("P1", "P2", "P3"),
      age = c(40, 54, 47),
      country = c("GER", "JPN", "BRA"),
      stringsAsFactors = FALSE
    )

    # One row per adverse event (a patient can have several)
    adverse_events <- data.frame(
      id = c("P1", "P1", "P3", "P4"),
      term = c("Headache", "Neutropenia", "Constipation", "Tachycardia"),
      onset_date = c("2020-12-03", "2021-01-03", "2020-11-29", "2021-01-27"),
      stringsAsFactors = FALSE
    )

By default, merge() will perform an inner join: only those patients that appear in both the demographics and adverse_events datasets are included in the final table.

    merge(x = demographics, y = adverse_events, by = "id")

    ##   id age country         term onset_date
    ## 1 P1  40     GER     Headache 2020-12-03
    ## 2 P1  40     GER  Neutropenia 2021-01-03
    ## 3 P3  47     BRA Constipation 2020-11-29

To perform a left join, set the all.x parameter to TRUE. For a right join do the same with the all.y parameter.

    merge(x = demographics, y = adverse_events, by = "id", all.x = TRUE)

    ##   id age country         term onset_date
    ## 1 P1  40     GER     Headache 2020-12-03
    ## 2 P1  40     GER  Neutropenia 2021-01-03
    ## 3 P2  54     JPN         <NA>       <NA>
    ## 4 P3  47     BRA Constipation 2020-11-29
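Notice that the left join returns four rows even though demographics only has three: P1 has two adverse events, so that patient's demographic record is repeated. A quick way to spot such one-to-many keys before merging might look like the sketch below. It re-creates the two example tables so it runs on its own; key_counts and repeated_keys are names introduced here for illustration.

```r
demographics <- data.frame(
  id = c("P1", "P2", "P3"),
  age = c(40, 54, 47),
  country = c("GER", "JPN", "BRA"),
  stringsAsFactors = FALSE
)
adverse_events <- data.frame(
  id = c("P1", "P1", "P3", "P4"),
  term = c("Headache", "Neutropenia", "Constipation", "Tachycardia"),
  onset_date = c("2020-12-03", "2021-01-03", "2020-11-29", "2021-01-27"),
  stringsAsFactors = FALSE
)

# How often does each key occur in the table being joined in?
key_counts <- table(adverse_events$id)
repeated_keys <- names(key_counts[key_counts > 1])
print(repeated_keys)

# Compare the row count of the result with that of the left table
left <- merge(x = demographics, y = adverse_events, by = "id", all.x = TRUE)
print(c(result = nrow(left), demographics = nrow(demographics)))
```

If the result has more rows than the left table, at least one key matched several records on the other side.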

    merge(x = demographics, y = adverse_events, by = "id", all.y = TRUE)

    ##   id age country         term onset_date
    ## 1 P1  40     GER     Headache 2020-12-03
    ## 2 P1  40     GER  Neutropenia 2021-01-03
    ## 3 P3  47     BRA Constipation 2020-11-29
    ## 4 P4  NA    <NA>  Tachycardia 2021-01-27
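As mentioned in the overview of join types, a right join can be rewritten as a left join with the two tables swapped. The following sketch checks this for the example data. The helper normalise() is a hypothetical function written for this illustration; it only puts columns and rows into the same order so the two results can be compared.

```r
demographics <- data.frame(
  id = c("P1", "P2", "P3"),
  age = c(40, 54, 47),
  country = c("GER", "JPN", "BRA"),
  stringsAsFactors = FALSE
)
adverse_events <- data.frame(
  id = c("P1", "P1", "P3", "P4"),
  term = c("Headache", "Neutropenia", "Constipation", "Tachycardia"),
  onset_date = c("2020-12-03", "2021-01-03", "2020-11-29", "2021-01-27"),
  stringsAsFactors = FALSE
)

right_join_result <- merge(x = demographics, y = adverse_events, by = "id", all.y = TRUE)
swapped_left_join <- merge(x = adverse_events, y = demographics, by = "id", all.x = TRUE)

# Bring both results into the same column and row order before comparing
normalise <- function(df) {
  cols <- c("id", "age", "country", "term", "onset_date")
  out <- df[order(df$id, df$term), cols]
  rownames(out) <- NULL
  out
}

same <- isTRUE(all.equal(normalise(right_join_result), normalise(swapped_left_join)))
print(same)
```

The two calls differ only in the order of the non-key columns, which follows the order in which the tables are passed.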

Finally, a full join can be performed by either setting both all.x and all.y to TRUE or specifying all = TRUE.

    merge(x = demographics, y = adverse_events, by = "id", all = TRUE)

    ##   id age country         term onset_date
    ## 1 P1  40     GER     Headache 2020-12-03
    ## 2 P1  40     GER  Neutropenia 2021-01-03
    ## 3 P2  54     JPN         <NA>       <NA>
    ## 4 P3  47     BRA Constipation 2020-11-29
    ## 5 P4  NA    <NA>  Tachycardia 2021-01-27
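To convince yourself that the two spellings of a full join are interchangeable, and to see how the row counts differ between the four merge types, a small self-contained check along these lines may help (row_counts is just an illustrative name):

```r
demographics <- data.frame(
  id = c("P1", "P2", "P3"),
  age = c(40, 54, 47),
  country = c("GER", "JPN", "BRA"),
  stringsAsFactors = FALSE
)
adverse_events <- data.frame(
  id = c("P1", "P1", "P3", "P4"),
  term = c("Headache", "Neutropenia", "Constipation", "Tachycardia"),
  onset_date = c("2020-12-03", "2021-01-03", "2020-11-29", "2021-01-27"),
  stringsAsFactors = FALSE
)

# Two ways of asking merge() for a full join
full_a <- merge(demographics, adverse_events, by = "id", all.x = TRUE, all.y = TRUE)
full_b <- merge(demographics, adverse_events, by = "id", all = TRUE)
print(identical(full_a, full_b))

# Number of rows returned by each merge type
row_counts <- c(
  inner = nrow(merge(demographics, adverse_events, by = "id")),
  left  = nrow(merge(demographics, adverse_events, by = "id", all.x = TRUE)),
  right = nrow(merge(demographics, adverse_events, by = "id", all.y = TRUE)),
  full  = nrow(full_b)
)
print(row_counts)
```

For the example data this gives 3, 4, 4 and 5 rows, matching the outputs shown in this section.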

In the two example datasets I created, the common key is conveniently called id in both tables. However, this doesn't necessarily have to be the case. If the two datasets you'd like to merge have different names for their common ID variables, you can specify them individually using the by.x and by.y parameters of merge().

    # Copy of adverse_events whose key column is named pat_id instead of id
    adverse_events2 <- adverse_events
    colnames(adverse_events2)[1L] <- "pat_id"

    merge(x = demographics, y = adverse_events2, by.x = "id", by.y = "pat_id", all = TRUE)

    ##   id age country         term onset_date
    ## 1 P1  40     GER     Headache 2020-12-03
    ## 2 P1  40     GER  Neutropenia 2021-01-03
    ## 3 P2  54     JPN         <NA>       <NA>
    ## 4 P3  47     BRA Constipation 2020-11-29
    ## 5 P4  NA    <NA>  Tachycardia 2021-01-27
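The introduction mentioned merging by one or more keys. With merge() that simply means passing a character vector to by. The tables below (vitals and labs, with their columns) are invented for this sketch and are not part of the clinical trial example above.

```r
# Hypothetical tables keyed by patient id AND visit number
vitals <- data.frame(
  id = c("P1", "P1", "P2"),
  visit = c(1, 2, 1),
  weight_kg = c(70.5, 71.0, 82.3),
  stringsAsFactors = FALSE
)
labs <- data.frame(
  id = c("P1", "P1", "P2"),
  visit = c(1, 2, 2),
  alt_u_l = c(30, 28, 41),
  stringsAsFactors = FALSE
)

# Inner join: a row is kept only if BOTH id and visit match
both_keys <- merge(x = vitals, y = labs, by = c("id", "visit"))
print(both_keys)
```

Only the two P1 visits appear in the result, because P2 has visit 1 in vitals but visit 2 in labs.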

The {dplyr} Way

Unlike {base} R, which uses a single function to perform the different merge types, {dplyr} provides one function for each type of join. And fortunately they are named just as you'd expect: left_join(), right_join(), inner_join() and full_join(). Personally I'm a big fan of this interface and thus tend to use {dplyr} for joining datasets much more often than {base}.

    library(dplyr)

    left_join(demographics, adverse_events, by = "id")

    ##   id age country         term onset_date
    ## 1 P1  40     GER     Headache 2020-12-03
    ## 2 P1  40     GER  Neutropenia 2021-01-03
    ## 3 P2  54     JPN         <NA>       <NA>
    ## 4 P3  47     BRA Constipation 2020-11-29

    inner_join(demographics, adverse_events, by = "id")

    ##   id age country         term onset_date
    ## 1 P1  40     GER     Headache 2020-12-03
    ## 2 P1  40     GER  Neutropenia 2021-01-03
    ## 3 P3  47     BRA Constipation 2020-11-29

    full_join(demographics, adverse_events, by = "id")

    ##   id age country         term onset_date
    ## 1 P1  40     GER     Headache 2020-12-03
    ## 2 P1  40     GER  Neutropenia 2021-01-03
    ## 3 P2  54     JPN         <NA>       <NA>
    ## 4 P3  47     BRA Constipation 2020-11-29
    ## 5 P4  NA    <NA>  Tachycardia 2021-01-27

In case the ID variable names of the two tables do not match, you need to pass a named vector as the argument to by. The name and value correspond to the key in the first and second table, respectively.

    right_join(demographics, adverse_events2, by = c("id" = "pat_id"))

    ##   id age country         term onset_date
    ## 1 P1  40     GER     Headache 2020-12-03
    ## 2 P1  40     GER  Neutropenia 2021-01-03
    ## 3 P3  47     BRA Constipation 2020-11-29
    ## 4 P4  NA    <NA>  Tachycardia 2021-01-27
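Because the name refers to the first table and the value to the second, the named vector has to be flipped when the tables are passed in the opposite order. The sketch below assumes {dplyr} is installed and builds adverse_events2 directly with a pat_id column so that it runs on its own; flipped is a name chosen for illustration.

```r
library(dplyr)

demographics <- data.frame(
  id = c("P1", "P2", "P3"),
  age = c(40, 54, 47),
  country = c("GER", "JPN", "BRA"),
  stringsAsFactors = FALSE
)
adverse_events2 <- data.frame(
  pat_id = c("P1", "P1", "P3", "P4"),
  term = c("Headache", "Neutropenia", "Constipation", "Tachycardia"),
  onset_date = c("2020-12-03", "2021-01-03", "2020-11-29", "2021-01-27"),
  stringsAsFactors = FALSE
)

# adverse_events2 is now the first table, so its key name goes on the left
flipped <- left_join(adverse_events2, demographics, by = c("pat_id" = "id"))
print(names(flipped))
```

The key column in the result takes its name from the first table, so here it is called pat_id rather than id.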

The SQL Way

When it comes to merging tables, there's no way one cannot mention the Structured Query Language (SQL). There are several R packages available from CRAN to directly send SQL queries from R to a database. The {tidyquery} package does something different, though. It takes the SQL query you provide to the query() function as input, translates it to {dplyr} code and then executes this {dplyr} code to produce the final result.

    library(tidyquery)

    query("select * from demographics right join adverse_events using(id)")

    ##   id age country         term onset_date
    ## 1 P1  40     GER     Headache 2020-12-03
    ## 2 P1  40     GER  Neutropenia 2021-01-03
    ## 3 P3  47     BRA Constipation 2020-11-29
    ## 4 P4  NA    <NA>  Tachycardia 2021-01-27

    query("select * from demographics inner join adverse_events using(id)")

    ##   id age country         term onset_date
    ## 1 P1  40     GER     Headache 2020-12-03
    ## 2 P1  40     GER  Neutropenia 2021-01-03
    ## 3 P3  47     BRA Constipation 2020-11-29

    query("select * from demographics full join adverse_events using(id)")

    ##   id age country         term onset_date
    ## 1 P1  40     GER     Headache 2020-12-03
    ## 2 P1  40     GER  Neutropenia 2021-01-03
    ## 3 P2  54     JPN         <NA>       <NA>
    ## 4 P3  47     BRA Constipation 2020-11-29
    ## 5 P4  NA    <NA>  Tachycardia 2021-01-27

For simple queries (like joining tables) this is probably overkill given that {dplyr}'s interface is so similar to SQL. However, if you are a SQL wizard and write more complex queries, {tidyquery} can be a great way to become proficient in {dplyr}, as it can actually show you the translated {dplyr} code.

    show_dplyr("
      select dm.id, dm.age, ae.term
      from demographics as dm
      left join adverse_events as ae
      using(id)
      where term <> 'Headache'
    ")

    ## demographics %>%
    ##   left_join(adverse_events, by = "id", suffix = c(".dm", ".ae"), na_matches = "never") %>%
    ##   filter(term != "Headache") %>%
    ##   select(id, age, term)

By the way, there's also the {dbplyr} package which translates your {dplyr} code into SQL. That way you don't actually need to learn SQL in order to query a database.
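To get a feel for the {dbplyr} direction, here is a minimal sketch using an in-memory SQLite database. It assumes the {DBI}, {RSQLite} and {dbplyr} packages are installed in addition to {dplyr}; the database and the variable names con, dm, ae, joined and result exist only for this illustration.

```r
library(dplyr)

# Assumption: DBI, RSQLite and dbplyr are installed; the database lives in memory
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

demographics <- data.frame(
  id = c("P1", "P2", "P3"),
  age = c(40, 54, 47),
  country = c("GER", "JPN", "BRA"),
  stringsAsFactors = FALSE
)
adverse_events <- data.frame(
  id = c("P1", "P1", "P3", "P4"),
  term = c("Headache", "Neutropenia", "Constipation", "Tachycardia"),
  onset_date = c("2020-12-03", "2021-01-03", "2020-11-29", "2021-01-27"),
  stringsAsFactors = FALSE
)

# Copy the example tables into the database and get references to them
dm <- copy_to(con, demographics, "demographics")
ae <- copy_to(con, adverse_events, "adverse_events")

joined <- left_join(dm, ae, by = "id")
show_query(joined)         # prints the SQL that {dbplyr} generated
result <- collect(joined)  # runs the query and pulls the rows into R

DBI::dbDisconnect(con)
```

The join itself is written exactly like the {dplyr} examples above; only the tables it operates on live in a database instead of in R's memory.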

In this article we've covered the four most common ways of joining tables and how to implement them in R using {base}, {dplyr} and SQL via {tidyquery}. Armed with this knowledge you should be able to confidently merge any datasets you come across in R. If you do get stuck, feel free to ask a question in the comments below.

Source: https://thomasadventure.blog/posts/r-merging-datasets/
