R plyr package

DBILITY 2018. 11. 30. 16:24

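The plyr package provides functions that split data, apply a computation to each piece, and combine the results in a single call.

The input can be an array (a), a data frame (d), or a list (l). The output can be an array (a), a data frame (d), a list (l), or nothing at all (_), in which case the result is discarded.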

 

plyr function names are five characters long and follow the pattern ??ply: the first letter abbreviates the input data type and the second letter abbreviates the output data type.
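As a quick illustration of the naming rule, the sketch below feeds the same list to four l?ply functions and only the output type changes. The list `scores` and the variable names are made up for this example.

```r
library(plyr)

# Hypothetical input: a named list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(10, 20))

as_list  <- llply(scores, mean)   # list in, list out
as_array <- laply(scores, mean)   # list in, array (here a plain vector) out
as_df    <- ldply(scores, function(x) data.frame(mean = mean(x)))  # list in, data frame out

# list in, nothing out: the function is called only for its side effect
l_ply(scores, function(x) cat(mean(x), "\n"))
```

In the data frame version, the list names end up in an `.id` column.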

 

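The adply function takes an array (a), splits it, applies a function (ply) to each piece, and returns a data frame (d).

As with apply, margin 1 means by row and 2 means by column. With apply, a data frame is first converted to a matrix, so when it holds mixed data types the values are coerced to a common type.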

 

adply(.data, .margins, .fun = NULL, ..., .expand = TRUE,  .progress = "none", .inform = FALSE, .parallel = FALSE, .paropts = NULL, .id = NA)

> apply(iris[,1:4],2,sum)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       876.5        458.6        563.7        179.9 
> adply(iris[,1:4],2,sum)
            X1    V1
1 Sepal.Length 876.5
2  Sepal.Width 458.6
3 Petal.Length 563.7
4  Petal.Width 179.9
> head(adply(iris[,1:4],1,sum))
  Sepal.Length Sepal.Width Petal.Length Petal.Width   V1
1          5.1         3.5          1.4         0.2 10.2
2          4.9         3.0          1.4         0.2  9.5
3          4.7         3.2          1.3         0.2  9.4
4          4.6         3.1          1.5         0.2  9.4
5          5.0         3.6          1.4         0.2 10.2
6          5.4         3.9          1.7         0.4 11.4
> head(apply(iris[,1:4],1,sum))
[1] 10.2  9.5  9.4  9.4 10.2 11.4
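The type conversion mentioned for apply is easier to see with a mixed-type input. The following is a minimal sketch, assuming three hand-picked rows of iris (one per species); the column name `wide_setosa` is invented for the example.

```r
library(plyr)

# Three rows, one per species; Species is a factor, the rest are numeric
rows <- iris[c(1, 51, 101), ]

# apply() converts the data frame with as.matrix() first, and the factor
# column forces a character matrix, so each row arrives as character
apply_classes <- apply(rows, 1, function(r) class(r))

# adply() hands each row to the function as a one-row data frame,
# so numeric and factor columns keep their types
flags <- adply(rows, 1, function(row) {
  data.frame(wide_setosa = row$Species == "setosa" & row$Sepal.Width > 3)
})
```

Here `apply_classes` contains only "character", while the comparison inside adply works on the original numeric and factor values.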

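The ddply function takes a data frame (d) and returns a data frame (d). Its arguments are the data, the grouping variables [written inside .()], and the function to apply.

It is similar to tapply, but tapply returns an array.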

 

ddply(.data, .variables, .fun = NULL, ..., .progress = "none", .inform = FALSE, .drop = TRUE, .parallel = FALSE, .paropts = NULL)

> ddply(iris,.(Species),function (sub) {
+     data.frame ( sepal.width.mean = mean(sub$Sepal.Width) )
+   }
+ )
     Species sepal.width.mean
1     setosa            3.428
2 versicolor            2.770
3  virginica            2.974

> tapply(iris$Sepal.Width,iris$Species,mean)
    setosa versicolor  virginica 
     3.428      2.770      2.974
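The difference in return type can be checked directly. This is a small sketch that repeats the two calls above and inspects the results; the variable names are chosen for the example.

```r
library(plyr)

by_ddply <- ddply(iris, .(Species), function(sub) {
  data.frame(sepal.width.mean = mean(sub$Sepal.Width))
})
by_tapply <- tapply(iris$Sepal.Width, iris$Species, mean)

is.data.frame(by_ddply)   # TRUE: one row per group, group key in a column
is.array(by_tapply)       # TRUE: a 1-d array, group key in the names

# If a data frame is needed from tapply(), it has to be built by hand
from_tapply <- data.frame(Species = names(by_tapply),
                          sepal.width.mean = as.vector(by_tapply))
```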

 

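The base package's transform() function adds the result of a computation to the data frame as a new column.

plyr's mutate() does the same and adds one extension: when several columns are added in one call, a later expression can refer to a column created just before it. The book I bought uses it a lot.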

> head(ddply(baseball,.(id),transform,cyear=year-min(year)+1),10)
          id year stint team lg   g  ab   r   h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp cyear
1  aaronha01 1954     1  ML1 NL 122 468  58 131  27   6 13  69  2  2 28 39  NA   3  6  4   13     1
2  aaronha01 1955     1  ML1 NL 153 602 105 189  37   9 27 106  3  1 49 61   5   3  7  4   20     2
3  aaronha01 1956     1  ML1 NL 153 609 106 200  34  14 26  92  2  4 37 54   6   2  5  7   21     3
4  aaronha01 1957     1  ML1 NL 151 615 118 198  27   6 44 132  1  1 57 58  15   0  0  3   13     4
5  aaronha01 1958     1  ML1 NL 153 601 109 196  34   4 30  95  4  1 59 49  16   1  0  3   21     5
6  aaronha01 1959     1  ML1 NL 154 629 116 223  46   7 39 123  8  0 51 54  17   4  0  9   19     6
7  aaronha01 1960     1  ML1 NL 153 590 102 172  20  11 40 126 16  7 60 63  13   2  0 12    8     7
8  aaronha01 1961     1  ML1 NL 155 603 115 197  39  10 34 120 21  9 56 64  20   2  1  9   16     8
9  aaronha01 1962     1  ML1 NL 156 592 127 191  28   6 45 128 15  7 66 73  14   3  0  6   14     9
10 aaronha01 1963     1  ML1 NL 161 631 121 201  29   4 44 130 31  5 78 94  18   0  0  5   11    10

> head(ddply(baseball,.(id),mutate,cyear=year-min(year)+1),10)
          id year stint team lg   g  ab   r   h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp cyear
1  aaronha01 1954     1  ML1 NL 122 468  58 131  27   6 13  69  2  2 28 39  NA   3  6  4   13     1
2  aaronha01 1955     1  ML1 NL 153 602 105 189  37   9 27 106  3  1 49 61   5   3  7  4   20     2
3  aaronha01 1956     1  ML1 NL 153 609 106 200  34  14 26  92  2  4 37 54   6   2  5  7   21     3
4  aaronha01 1957     1  ML1 NL 151 615 118 198  27   6 44 132  1  1 57 58  15   0  0  3   13     4
5  aaronha01 1958     1  ML1 NL 153 601 109 196  34   4 30  95  4  1 59 49  16   1  0  3   21     5
6  aaronha01 1959     1  ML1 NL 154 629 116 223  46   7 39 123  8  0 51 54  17   4  0  9   19     6
7  aaronha01 1960     1  ML1 NL 153 590 102 172  20  11 40 126 16  7 60 63  13   2  0 12    8     7
8  aaronha01 1961     1  ML1 NL 155 603 115 197  39  10 34 120 21  9 56 64  20   2  1  9   16     8
9  aaronha01 1962     1  ML1 NL 156 592 127 191  28   6 45 128 15  7 66 73  14   3  0  6   14     9
10 aaronha01 1963     1  ML1 NL 161 631 121 201  29   4 44 130 31  5 78 94  18   0  0  5   11    10

# This is the part I like.
> head(ddply(baseball,.(id),mutate,cyear=year-min(year)+1,id_cyear=paste(id,cyear,sep='->')),10)
          id year stint team lg   g  ab   r   h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp cyear      id_cyear
1  aaronha01 1954     1  ML1 NL 122 468  58 131  27   6 13  69  2  2 28 39  NA   3  6  4   13     1  aaronha01->1
2  aaronha01 1955     1  ML1 NL 153 602 105 189  37   9 27 106  3  1 49 61   5   3  7  4   20     2  aaronha01->2
3  aaronha01 1956     1  ML1 NL 153 609 106 200  34  14 26  92  2  4 37 54   6   2  5  7   21     3  aaronha01->3
4  aaronha01 1957     1  ML1 NL 151 615 118 198  27   6 44 132  1  1 57 58  15   0  0  3   13     4  aaronha01->4
5  aaronha01 1958     1  ML1 NL 153 601 109 196  34   4 30  95  4  1 59 49  16   1  0  3   21     5  aaronha01->5
6  aaronha01 1959     1  ML1 NL 154 629 116 223  46   7 39 123  8  0 51 54  17   4  0  9   19     6  aaronha01->6
7  aaronha01 1960     1  ML1 NL 153 590 102 172  20  11 40 126 16  7 60 63  13   2  0 12    8     7  aaronha01->7
8  aaronha01 1961     1  ML1 NL 155 603 115 197  39  10 34 120 21  9 56 64  20   2  1  9   16     8  aaronha01->8
9  aaronha01 1962     1  ML1 NL 156 592 127 191  28   6 45 128 15  7 66 73  14   3  0  6   14     9  aaronha01->9
10 aaronha01 1963     1  ML1 NL 161 631 121 201  29   4 44 130 31  5 78 94  18   0  0  5   11    10 aaronha01->10
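To see why mutate is needed for the call above, it helps to try the same thing with transform. The sketch below uses a tiny made-up data frame `d`; the column names `doubled` and `plus_one` are only for illustration.

```r
library(plyr)

d <- data.frame(x = 1:3)

# mutate() evaluates the expressions one after another, so plus_one
# can use the doubled column created just before it
with_mutate <- mutate(d, doubled = x * 2, plus_one = doubled + 1)

# transform() evaluates all expressions against the original data,
# where doubled does not exist yet, so this call fails
with_transform <- tryCatch(
  transform(d, doubled = x * 2, plus_one = doubled + 1),
  error = function(e) conditionMessage(e)
)
```

After running this, `with_mutate` has the three columns, while `with_transform` holds the error message from the failed call.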

plyr's summarise() computes summary values for the data and returns them as a data frame.

# I computed each of these guys' first year, retirement year, home runs, and batting average
> head(ddply(baseball,.(id),summarise,join_year=min(year), retire_year=max(year),running_year=retire_year-join_year+1,hr=sum(hr),ba=round(sum(h)/sum(ab),3)),10)
          id join_year retire_year running_year  hr    ba
1  aaronha01      1954        1976           23 755 0.305
2  abernte02      1955        1972           18   0 0.138
3  adairje01      1958        1970           13  57 0.254
4  adamsba01      1906        1926           21   3 0.212
5  adamsbo03      1946        1959           14  37 0.269
6  adcocjo01      1950        1966           17 336 0.277
7  agostju01      1981        1993           13   0 0.100
8  aguilri01      1985        2000           16   3 0.201
9  aguirha01      1955        1970           16   0 0.085
10 ainsmed01      1910        1924           15  22 0.232
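The contrast with mutate is that summarise keeps only the computed columns and returns one row per group. A minimal sketch on iris, with column names invented for the example:

```r
library(plyr)

# One row per species; only the grouping column and the summaries remain.
# As with mutate(), a later expression can use an earlier result.
per_species <- ddply(iris, .(Species), summarise,
                     n         = length(Sepal.Length),
                     min_len   = min(Sepal.Length),
                     max_len   = max(Sepal.Length),
                     range_len = max_len - min_len)

# mutate() keeps every original row and column and appends the new one
per_row <- ddply(iris, .(Species), mutate, n = length(Sepal.Length))

nrow(per_species)  # 3
nrow(per_row)      # 150
```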

# Did something go wrong? A .500 hitter?
> df<-ddply(baseball,.(id),summarise,join_year=min(year), retire_year=max(year),running_year=retire_year-join_year+1,hr=sum(hr),ba=round(sum(h)/sum(ab),3))
> library(doBy)
> head(orderBy(~ - ba,df),10)
            id join_year retire_year running_year  hr    ba
497  hernaro01      1991        2007           17   0 0.500
607  kruegbi01      1983        1995           13   0 0.400
352  forstte01      1971        1986           16   0 0.397
200   cobbty01      1905        1928           24 117 0.366
523  hornsro01      1915        1937           23 301 0.358
261  delahed01      1888        1903           16 101 0.346
1047 speaktr01      1907        1928           22 117 0.345
1196 willite01      1939        1960           22 521 0.344
121  broutda01      1879        1904           26 106 0.342
483  heilmha01      1914        1932           19 183 0.342

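# Funny guy: 2 at-bats in the NL... a pitcher? That is a lot of games though, so maybe a middle reliever? I should probably keep only players above the qualifying number of plate appearances. How many is that, though?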
> subset(baseball,id=='hernaro01')
             id year stint team lg  g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
68747 hernaro01 1991     1  CHA AL  9  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
69821 hernaro01 1992     1  CHA AL 43  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
70900 hernaro01 1993     1  CHA AL 70  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
72056 hernaro01 1994     1  CHA AL 45  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
73132 hernaro01 1995     1  CHA AL 60  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
74363 hernaro01 1996     1  CHA AL 72  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
75631 hernaro01 1997     1  CHA AL 46  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
76250 hernaro01 1997     2  SFN NL 28  2 0 1   0   0  0   0  0  0  0  1   0   0  0  0    0
76864 hernaro01 1998     1  TBA AL 67  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
78183 hernaro01 1999     1  TBA AL 72  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
79484 hernaro01 2000     1  TBA AL  3  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
80861 hernaro01 2001     1  KCA AL  4  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
83360 hernaro01 2002     1  KCA AL  3  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
83816 hernaro01 2003     1  ATL NL 64  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
85086 hernaro01 2004     1  PHI NL 57  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
88055 hernaro01 2005     1  NYN NL 67  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
88380 hernaro01 2006     1  PIT NL 46  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
88384 hernaro01 2006     2  NYN NL 22  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
89451 hernaro01 2007     2  LAN NL 22  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
89452 hernaro01 2007     1  CLE AL  2  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
> library(sqldf)
Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
> sqldf("select * from baseball where id = 'hernaro01'")
          id year stint team lg  g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
1  hernaro01 1991     1  CHA AL  9  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
2  hernaro01 1992     1  CHA AL 43  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
3  hernaro01 1993     1  CHA AL 70  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
4  hernaro01 1994     1  CHA AL 45  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
5  hernaro01 1995     1  CHA AL 60  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
6  hernaro01 1996     1  CHA AL 72  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
7  hernaro01 1997     1  CHA AL 46  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
8  hernaro01 1997     2  SFN NL 28  2 0 1   0   0  0   0  0  0  0  1   0   0  0  0    0
9  hernaro01 1998     1  TBA AL 67  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
10 hernaro01 1999     1  TBA AL 72  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
11 hernaro01 2000     1  TBA AL  3  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
12 hernaro01 2001     1  KCA AL  4  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
13 hernaro01 2002     1  KCA AL  3  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
14 hernaro01 2003     1  ATL NL 64  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
15 hernaro01 2004     1  PHI NL 57  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
16 hernaro01 2005     1  NYN NL 67  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
17 hernaro01 2006     1  PIT NL 46  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
18 hernaro01 2006     2  NYN NL 22  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
19 hernaro01 2007     2  LAN NL 22  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
20 hernaro01 2007     1  CLE AL  2  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
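Coming back to the idea of dropping players with too few at-bats, here is a rough sketch. MLB's per-season qualification rule counts plate appearances (3.1 per scheduled team game), which does not carry over directly to career totals, so the cutoff `min_ab` below is an arbitrary assumption made for this example and not an official figure.

```r
library(plyr)

# Hypothetical career cutoff, chosen only for illustration
min_ab <- 1000

career <- ddply(baseball, .(id), summarise,
                total_ab = sum(ab, na.rm = TRUE),
                total_h  = sum(h, na.rm = TRUE),
                ba       = round(total_h / total_ab, 3))

# Keep players at or above the cutoff, then sort by batting average
qualified <- subset(career, total_ab >= min_ab)
qualified <- qualified[order(-qualified$ba), ]
head(qualified, 10)
```

With any cutoff above 2 at-bats, hernaro01 drops out of the ranking.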

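Passing subset() to ddply extracts rows within each group.

subset() itself comes from the base package rather than plyr, so recalling how base subset() works makes this easy to follow.

Compare the result with the output just above.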

> ddply(baseball,.(id),subset,id=='hernaro01')
          id year stint team lg  g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
1  hernaro01 1991     1  CHA AL  9  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
2  hernaro01 1992     1  CHA AL 43  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
3  hernaro01 1993     1  CHA AL 70  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
4  hernaro01 1994     1  CHA AL 45  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
5  hernaro01 1995     1  CHA AL 60  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
6  hernaro01 1996     1  CHA AL 72  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
7  hernaro01 1997     1  CHA AL 46  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
8  hernaro01 1997     2  SFN NL 28  2 0 1   0   0  0   0  0  0  0  1   0   0  0  0    0
9  hernaro01 1998     1  TBA AL 67  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
10 hernaro01 1999     1  TBA AL 72  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
11 hernaro01 2000     1  TBA AL  3  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
12 hernaro01 2001     1  KCA AL  4  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
13 hernaro01 2002     1  KCA AL  3  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
14 hernaro01 2003     1  ATL NL 64  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
15 hernaro01 2004     1  PHI NL 57  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
16 hernaro01 2005     1  NYN NL 67  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
17 hernaro01 2006     1  PIT NL 46  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
18 hernaro01 2006     2  NYN NL 22  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
19 hernaro01 2007     2  LAN NL 22  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
20 hernaro01 2007     1  CLE AL  2  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0
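Because the condition is evaluated inside each group, it can also refer to per-group aggregates, which a plain subset() on the whole data frame cannot do. A small sketch, assuming the goal is each player's season with the most games; the condition `g == max(g)` and the name `busiest` are this example's choices.

```r
library(plyr)

# For every player, keep the row(s) where games played equals that
# player's own maximum
busiest <- ddply(baseball, .(id), subset, g == max(g))

# The pitcher from above: his busiest seasons
subset(busiest, id == "hernaro01", select = c(id, year, team, g))
```

Ties are kept, so a player can appear more than once in the result.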

 
