DBILITY
R plyr package
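The plyr package provides functions that split data, apply a computation to each piece, and combine the results in a single step.
The input can be an array (a), a data frame (d), or a list (l); the output can be an array (a), a data frame (d), a list (l), or nothing (_).
plyr function names are five characters long in the form ??ply: the first letter stands for the input data type and the second letter for the output data type.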
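As a rough illustration of the naming rule, the sketch below feeds the same list to four l*ply functions and only varies the output letter. The `scores` list and the result variable names are made up for this example.

```r
library(plyr)

# Toy input (hypothetical): a named list of numeric vectors.
scores <- list(a = 1:3, b = 4:6)

as_array <- laply(scores, mean)  # l -> a: the means as a vector
as_frame <- ldply(scores, mean)  # l -> d: a data frame, one row per list element
as_list  <- llply(scores, mean)  # l -> l: a list of the means
l_ply(scores, print)             # l -> _: called for side effects only
```

The input letter stays `l` because the input is a list; swapping the second letter is all it takes to change the shape of the result.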
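adply() takes an array (a), splits it, applies a function to each piece, and returns the combined result as a data frame (d).
As with apply(), a margin of 1 means by row and 2 means by column. apply() converts its input to a matrix first, so when columns of mixed data types are present the values are coerced to a common type.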
adply(.data, .margins, .fun = NULL, ..., .expand = TRUE, .progress = "none", .inform = FALSE, .parallel = FALSE, .paropts = NULL, .id = NA)
> apply(iris[,1:4],2,sum)
Sepal.Length Sepal.Width Petal.Length Petal.Width
876.5 458.6 563.7 179.9
> adply(iris[,1:4],2,sum)
X1 V1
1 Sepal.Length 876.5
2 Sepal.Width 458.6
3 Petal.Length 563.7
4 Petal.Width 179.9
> head(adply(iris[,1:4],1,sum))
Sepal.Length Sepal.Width Petal.Length Petal.Width V1
1 5.1 3.5 1.4 0.2 10.2
2 4.9 3.0 1.4 0.2 9.5
3 4.7 3.2 1.3 0.2 9.4
4 4.6 3.1 1.5 0.2 9.4
5 5.0 3.6 1.4 0.2 10.2
6 5.4 3.9 1.7 0.4 11.4
> head(apply(iris[,1:4],1,sum))
[1] 10.2 9.5 9.4 9.4 10.2 11.4
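The type conversion mentioned for apply() is easier to see with a column that is not numeric. The following is a small sketch, assuming a three-row slice of iris that keeps the Species factor; the variable names are chosen for this example only.

```r
library(plyr)

# A small mixed-type slice: one numeric column and one factor column.
mixed <- head(iris[, c("Sepal.Length", "Species")], 3)

# apply() turns the data frame into a character matrix first,
# so every row reaches the function as character strings.
apply_types <- apply(mixed, 1, function(row) class(row))

# adply() passes each row as a one-row data frame, so columns keep their types.
adply_types <- adply(mixed, 1, function(row) {
  data.frame(length_is_numeric = is.numeric(row$Sepal.Length))
})
```

Here `apply_types` contains only "character", while `adply_types$length_is_numeric` is TRUE for every row.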
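ddply() takes a data frame (d) and returns a data frame (d). Its arguments are the data, the grouping variables (written inside .()), and the function to apply.
It is similar to tapply(), but tapply() returns an array.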
ddply(.data, .variables, .fun = NULL, ..., .progress = "none",
.inform = FALSE, .drop = TRUE, .parallel = FALSE, .paropts = NULL)
> ddply(iris,.(Species),function (sub) {
+ data.frame ( sepal.width.mean = mean(sub$Sepal.Width) )
+ }
+ )
Species sepal.width.mean
1 setosa 3.428
2 versicolor 2.770
3 virginica 2.974
> tapply(iris$Sepal.Width,iris$Species,mean)
setosa versicolor virginica
3.428 2.770 2.974
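To check the difference in return types directly, the sketch below repeats the comparison above and stores both results. It uses summarise as a shorter stand-in for the anonymous function; the variable names are assumptions made for this example.

```r
library(plyr)

# Same group means as above, written with summarise instead of an anonymous function.
by_ddply  <- ddply(iris, .(Species), summarise, sepal.width.mean = mean(Sepal.Width))
by_tapply <- tapply(iris$Sepal.Width, iris$Species, mean)

is.data.frame(by_ddply)  # TRUE: ddply() returns a data frame
is.array(by_tapply)      # TRUE: tapply() returns a one-dimensional array
```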
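transform() from the base package adds the result of a computation to a data frame as a new field.
plyr's mutate() does the same and, as an extension, lets a field refer to another field added just before it in the same call. The book I bought uses it a lot.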
> head(ddply(baseball,.(id),transform,cyear=year-min(year)+1),10)
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp cyear
1 aaronha01 1954 1 ML1 NL 122 468 58 131 27 6 13 69 2 2 28 39 NA 3 6 4 13 1
2 aaronha01 1955 1 ML1 NL 153 602 105 189 37 9 27 106 3 1 49 61 5 3 7 4 20 2
3 aaronha01 1956 1 ML1 NL 153 609 106 200 34 14 26 92 2 4 37 54 6 2 5 7 21 3
4 aaronha01 1957 1 ML1 NL 151 615 118 198 27 6 44 132 1 1 57 58 15 0 0 3 13 4
5 aaronha01 1958 1 ML1 NL 153 601 109 196 34 4 30 95 4 1 59 49 16 1 0 3 21 5
6 aaronha01 1959 1 ML1 NL 154 629 116 223 46 7 39 123 8 0 51 54 17 4 0 9 19 6
7 aaronha01 1960 1 ML1 NL 153 590 102 172 20 11 40 126 16 7 60 63 13 2 0 12 8 7
8 aaronha01 1961 1 ML1 NL 155 603 115 197 39 10 34 120 21 9 56 64 20 2 1 9 16 8
9 aaronha01 1962 1 ML1 NL 156 592 127 191 28 6 45 128 15 7 66 73 14 3 0 6 14 9
10 aaronha01 1963 1 ML1 NL 161 631 121 201 29 4 44 130 31 5 78 94 18 0 0 5 11 10
> head(ddply(baseball,.(id),mutate,cyear=year-min(year)+1),10)
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp cyear
1 aaronha01 1954 1 ML1 NL 122 468 58 131 27 6 13 69 2 2 28 39 NA 3 6 4 13 1
2 aaronha01 1955 1 ML1 NL 153 602 105 189 37 9 27 106 3 1 49 61 5 3 7 4 20 2
3 aaronha01 1956 1 ML1 NL 153 609 106 200 34 14 26 92 2 4 37 54 6 2 5 7 21 3
4 aaronha01 1957 1 ML1 NL 151 615 118 198 27 6 44 132 1 1 57 58 15 0 0 3 13 4
5 aaronha01 1958 1 ML1 NL 153 601 109 196 34 4 30 95 4 1 59 49 16 1 0 3 21 5
6 aaronha01 1959 1 ML1 NL 154 629 116 223 46 7 39 123 8 0 51 54 17 4 0 9 19 6
7 aaronha01 1960 1 ML1 NL 153 590 102 172 20 11 40 126 16 7 60 63 13 2 0 12 8 7
8 aaronha01 1961 1 ML1 NL 155 603 115 197 39 10 34 120 21 9 56 64 20 2 1 9 16 8
9 aaronha01 1962 1 ML1 NL 156 592 127 191 28 6 45 128 15 7 66 73 14 3 0 6 14 9
10 aaronha01 1963 1 ML1 NL 161 631 121 201 29 4 44 130 31 5 78 94 18 0 0 5 11 10
# This is the part I like.
> head(ddply(baseball,.(id),mutate,cyear=year-min(year)+1,id_cyear=paste(id,cyear,sep='->')),10)
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp cyear id_cyear
1 aaronha01 1954 1 ML1 NL 122 468 58 131 27 6 13 69 2 2 28 39 NA 3 6 4 13 1 aaronha01->1
2 aaronha01 1955 1 ML1 NL 153 602 105 189 37 9 27 106 3 1 49 61 5 3 7 4 20 2 aaronha01->2
3 aaronha01 1956 1 ML1 NL 153 609 106 200 34 14 26 92 2 4 37 54 6 2 5 7 21 3 aaronha01->3
4 aaronha01 1957 1 ML1 NL 151 615 118 198 27 6 44 132 1 1 57 58 15 0 0 3 13 4 aaronha01->4
5 aaronha01 1958 1 ML1 NL 153 601 109 196 34 4 30 95 4 1 59 49 16 1 0 3 21 5 aaronha01->5
6 aaronha01 1959 1 ML1 NL 154 629 116 223 46 7 39 123 8 0 51 54 17 4 0 9 19 6 aaronha01->6
7 aaronha01 1960 1 ML1 NL 153 590 102 172 20 11 40 126 16 7 60 63 13 2 0 12 8 7 aaronha01->7
8 aaronha01 1961 1 ML1 NL 155 603 115 197 39 10 34 120 21 9 56 64 20 2 1 9 16 8 aaronha01->8
9 aaronha01 1962 1 ML1 NL 156 592 127 191 28 6 45 128 15 7 66 73 14 3 0 6 14 9 aaronha01->9
10 aaronha01 1963 1 ML1 NL 161 631 121 201 29 4 44 130 31 5 78 94 18 0 0 5 11 10 aaronha01->10
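The difference between transform() and mutate() shows up as soon as a second field depends on the first. A minimal sketch on a made-up data frame (`toy`, `doubled`, and `doubled_plus_one` are hypothetical names):

```r
library(plyr)

toy <- data.frame(x = 1:3)

# mutate() evaluates its arguments in order, so the second field can use the first.
with_mutate <- mutate(toy, doubled = x * 2, doubled_plus_one = doubled + 1)

# transform() evaluates all arguments against the original data frame,
# so `doubled` does not exist yet and the call fails; the message is captured here.
with_transform <- tryCatch(
  transform(toy, doubled = x * 2, doubled_plus_one = doubled + 1),
  error = function(e) conditionMessage(e)
)
```

`with_mutate` ends up with both new fields, while `with_transform` holds the error message instead of a data frame.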
plyr's summarise() computes summary values for the data and returns them as a data frame.
# Worked out each of these guys' debut year, retirement year, home runs, and batting average.
> head(ddply(baseball,.(id),summarise,join_year=min(year), retire_year=max(year),running_year=retire_year-join_year+1,hr=sum(hr),ba=round(sum(h)/sum(ab),3)),10)
id join_year retire_year running_year hr ba
1 aaronha01 1954 1976 23 755 0.305
2 abernte02 1955 1972 18 0 0.138
3 adairje01 1958 1970 13 57 0.254
4 adamsba01 1906 1926 21 3 0.212
5 adamsbo03 1946 1959 14 37 0.269
6 adcocjo01 1950 1966 17 336 0.277
7 agostju01 1981 1993 13 0 0.100
8 aguilri01 1985 2000 16 3 0.201
9 aguirha01 1955 1970 16 0 0.085
10 ainsmed01 1910 1924 15 22 0.232
# Did something go wrong? A .500 hitter?
> df<-ddply(baseball,.(id),summarise,join_year=min(year), retire_year=max(year),running_year=retire_year-join_year+1,hr=sum(hr),ba=round(sum(h)/sum(ab),3))
> library(doBy)
> head(orderBy(~ - ba,df),10)
id join_year retire_year running_year hr ba
497 hernaro01 1991 2007 17 0 0.500
607 kruegbi01 1983 1995 13 0 0.400
352 forstte01 1971 1986 16 0 0.397
200 cobbty01 1905 1928 24 117 0.366
523 hornsro01 1915 1937 23 301 0.358
261 delahed01 1888 1903 16 101 0.346
1047 speaktr01 1907 1928 22 117 0.345
1196 willite01 1939 1960 22 521 0.344
121 broutda01 1879 1904 26 106 0.342
483 heilmha01 1914 1932 19 183 0.342
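# Funny guy: just 2 at-bats, in the NL... a pitcher? That is a lot of games, though, so maybe a middle reliever? I should really pick out only players with at least the qualifying number of plate appearances. But how many is that?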
> subset(baseball,id=='hernaro01')
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
68747 hernaro01 1991 1 CHA AL 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
69821 hernaro01 1992 1 CHA AL 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
70900 hernaro01 1993 1 CHA AL 70 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
72056 hernaro01 1994 1 CHA AL 45 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
73132 hernaro01 1995 1 CHA AL 60 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
74363 hernaro01 1996 1 CHA AL 72 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
75631 hernaro01 1997 1 CHA AL 46 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
76250 hernaro01 1997 2 SFN NL 28 2 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0
76864 hernaro01 1998 1 TBA AL 67 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
78183 hernaro01 1999 1 TBA AL 72 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
79484 hernaro01 2000 1 TBA AL 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
80861 hernaro01 2001 1 KCA AL 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
83360 hernaro01 2002 1 KCA AL 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
83816 hernaro01 2003 1 ATL NL 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
85086 hernaro01 2004 1 PHI NL 57 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
88055 hernaro01 2005 1 NYN NL 67 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
88380 hernaro01 2006 1 PIT NL 46 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
88384 hernaro01 2006 2 NYN NL 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
89451 hernaro01 2007 2 LAN NL 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
89452 hernaro01 2007 1 CLE AL 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> library(sqldf)
Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
> sqldf("select * from baseball where id = 'hernaro01'")
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
1 hernaro01 1991 1 CHA AL 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 hernaro01 1992 1 CHA AL 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 hernaro01 1993 1 CHA AL 70 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 hernaro01 1994 1 CHA AL 45 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 hernaro01 1995 1 CHA AL 60 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 hernaro01 1996 1 CHA AL 72 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 hernaro01 1997 1 CHA AL 46 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 hernaro01 1997 2 SFN NL 28 2 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0
9 hernaro01 1998 1 TBA AL 67 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 hernaro01 1999 1 TBA AL 72 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11 hernaro01 2000 1 TBA AL 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12 hernaro01 2001 1 KCA AL 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
13 hernaro01 2002 1 KCA AL 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
14 hernaro01 2003 1 ATL NL 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
15 hernaro01 2004 1 PHI NL 57 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16 hernaro01 2005 1 NYN NL 67 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
17 hernaro01 2006 1 PIT NL 46 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
18 hernaro01 2006 2 NYN NL 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
19 hernaro01 2007 2 LAN NL 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
20 hernaro01 2007 1 CLE AL 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
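One way to keep players like this out of the ranking is to require a minimum number of career at-bats before sorting. The sketch below is only an illustration: the cutoff `min_ab` is an arbitrary value chosen for this example, not an official qualifying rule, and it counts at-bats rather than plate appearances.

```r
library(plyr)

# Hypothetical cutoff, picked only for illustration.
min_ab <- 1000

# Career totals per player; new column names avoid clashing with baseball's own columns.
career <- ddply(baseball, .(id), summarise,
                total_ab = sum(ab),
                total_h  = sum(h),
                ba       = round(total_h / total_ab, 3))

# Keep players above the cutoff, then sort by batting average, highest first.
qualified <- subset(career, total_ab >= min_ab)
top10     <- head(arrange(qualified, desc(ba)), 10)
```

With any cutoff above 2 at-bats, hernaro01 drops out of `qualified`.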
Passing subset() to ddply() extracts rows from each split of the data.
It helps to think of base subset() being applied to each group.
Compare with the output just above.
> ddply(baseball,.(id),subset,id=='hernaro01')
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
1 hernaro01 1991 1 CHA AL 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 hernaro01 1992 1 CHA AL 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 hernaro01 1993 1 CHA AL 70 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 hernaro01 1994 1 CHA AL 45 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 hernaro01 1995 1 CHA AL 60 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 hernaro01 1996 1 CHA AL 72 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 hernaro01 1997 1 CHA AL 46 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 hernaro01 1997 2 SFN NL 28 2 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0
9 hernaro01 1998 1 TBA AL 67 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 hernaro01 1999 1 TBA AL 72 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11 hernaro01 2000 1 TBA AL 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12 hernaro01 2001 1 KCA AL 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
13 hernaro01 2002 1 KCA AL 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
14 hernaro01 2003 1 ATL NL 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
15 hernaro01 2004 1 PHI NL 57 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16 hernaro01 2005 1 NYN NL 67 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
17 hernaro01 2006 1 PIT NL 46 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
18 hernaro01 2006 2 NYN NL 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
19 hernaro01 2007 2 LAN NL 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
20 hernaro01 2007 1 CLE AL 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
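The per-group behaviour is clearer when the condition depends on the group itself. As a small sketch on iris (the variable name `longest` is made up here), the call below keeps, for each species, the rows whose Sepal.Length equals that species' maximum:

```r
library(plyr)

# max(Sepal.Length) is evaluated inside each species group, not over the whole data frame.
longest <- ddply(iris, .(Species), subset, Sepal.Length == max(Sepal.Length))
```

A plain `subset(iris, Sepal.Length == max(Sepal.Length))` would instead compare against the overall maximum and return rows from a single species only.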