sqoop2 etl oracle to hdfs

DBILITY 2018. 4. 22. 21:57

Following the Sqoop 5 Minutes Demo, I tried an Oracle ETL test.

Create the links first; then, when creating the job, specify the from link and the to link and fill in the job details.

The ETL use case for this test: generate temporary test data in an Oracle DB, then store that data into HDFS through a Sqoop job.

The Oracle JDBC driver must be installed under SQOOP_SERVER_EXTRA_LIB.
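
For reference, placing the driver there might look like the following. This is only a sketch: the ojdbc jar name, the extra-lib directory, and the restart step are assumptions to adapt to your installation.

[sqoop2@big-master ~]$ export SQOOP_SERVER_EXTRA_LIB=/sqoop/extra
[sqoop2@big-master ~]$ mkdir -p $SQOOP_SERVER_EXTRA_LIB
[sqoop2@big-master ~]$ cp ojdbc7.jar $SQOOP_SERVER_EXTRA_LIB/    # assumed driver jar name
[sqoop2@big-master ~]$ sqoop2-server stop                        # restart so the server picks up the driver
[sqoop2@big-master ~]$ sqoop2-server start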

Creating the link failed because the password contains special characters... or is that only a problem in the shell?

When I used the oracle-jdbc-connector, the mapper could not load oracle.jdbc.OracleDriver.

As far as I know, Sqoop2 is server-based, so the JDBC driver should only need to be on the server side...

The official manual doesn't even list that connector. What is going on... haha

My test environment is probably misconfigured somewhere.

In any case, there is an API, so it could be developed in Java and scheduled as well.
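
Even without writing Java, the shell itself can run a script file in batch mode, so a cron entry would cover simple scheduling. A rough sketch; the script path and the batch-mode invocation are assumptions to verify against your Sqoop2 version:

# /home/sqoop2/jobs/oracle2hdfs.sqoop -- commands the shell runs in order
set server --url http://big-master:12000/sqoop
start job -n oracle2hdfs -s

# run it non-interactively
[sqoop2@big-master ~]$ sqoop2-shell /home/sqoop2/jobs/oracle2hdfs.sqoop

# cron entry for a daily 02:00 run
0 2 * * * /sqoop/bin/sqoop2-shell /home/sqoop2/jobs/oracle2hdfs.sqoop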

Below is simply a run with nothing tuned.

 

Next, when I set loaders to 1, the output was merged into a single file. It seems to merge into one during the reduce stage.

So should I calculate the RDB table's row count * average row size and set loaders to match the block size?

For a start, I analyzed the source table (COMPUTE STATISTICS): 33,557 Oracle blocks, and 1,000,000 rows * average row length 232 bytes = roughly 220 MB.

The HDFS default block size is 128 MB. With 4 YARN nodes, I went with extractors 8 and loaders 2.
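
The arithmetic behind that choice, as a quick sketch using the statistics above:

[sqoop2@big-master ~]$ echo $(( 1000000 * 232 ))                         # table size in bytes, roughly 220 MB
232000000
[sqoop2@big-master ~]$ echo $(( 232000000 / (128 * 1024 * 1024) + 1 ))  # 128 MB HDFS blocks needed -> loaders
2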

To raise the source table's buffer cache hit ratio, I ran a grouping SQL against it first. I should also try this with mapreduce.job.ubertask.enable=true.
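
If trying ubertask mode, the property goes into mapred-site.xml on the cluster, roughly like this (a sketch; the uber-task size thresholds are left at their defaults):

[hadoop@big-master ~]$ vi /hadoop/etc/hadoop/mapred-site.xml
<property>
    <name>mapreduce.job.ubertask.enable</name>
    <value>true</value>
</property>
:wq!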

Processing with maps only, without setting loaders, is faster. Of course, the files then have to be merged separately.
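
If that separate merge is needed, one way to do it afterwards (a sketch; paths follow the output directory used below):

# pull the map output files down as one local file, then push the merged file back to HDFS
[hadoop@big-master ~]$ hdfs dfs -getmerge /oracle/tb_sample_source /tmp/tb_sample_source.txt
[hadoop@big-master ~]$ hdfs dfs -put /tmp/tb_sample_source.txt /oracle/tb_sample_source_merged.txt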

 

[sqoop2@big-master ~]$ sqoop2-shell
Setting conf dir: /sqoop/bin/../conf
Sqoop home directory: /sqoop
Sqoop Shell: Type 'help' or '\h' for help.
sqoop:000> set server --url http://big-master:12000/sqoop
Server is set successfully
sqoop:000> show server -all
Server host: big-master
Server port: 12000
Server webapp: sqoop
sqoop:000> show connector
+------------------------+---------+------------------------------------------------------------+----------------------+
|          Name          | Version |                           Class                            | Supported Directions |
+------------------------+---------+------------------------------------------------------------+----------------------+
| generic-jdbc-connector | 1.99.7  | org.apache.sqoop.connector.jdbc.GenericJdbcConnector       | FROM/TO              |
| kite-connector         | 1.99.7  | org.apache.sqoop.connector.kite.KiteConnector              | FROM/TO              |
| oracle-jdbc-connector  | 1.99.7  | org.apache.sqoop.connector.jdbc.oracle.OracleJdbcConnector | FROM/TO              |
| ftp-connector          | 1.99.7  | org.apache.sqoop.connector.ftp.FtpConnector                | TO                   |
| hdfs-connector         | 1.99.7  | org.apache.sqoop.connector.hdfs.HdfsConnector              | FROM/TO              |
| kafka-connector        | 1.99.7  | org.apache.sqoop.connector.kafka.KafkaConnector            | TO                   |
| sftp-connector         | 1.99.7  | org.apache.sqoop.connector.sftp.SftpConnector              | TO                   |
+------------------------+---------+------------------------------------------------------------+----------------------+
sqoop:000> show link
+------+----------------+---------+
| Name | Connector Name | Enabled |
+------+----------------+---------+
+------+----------------+---------+
sqoop:000> show job
+----+------+----------------+--------------+---------+
| Id | Name | From Connector | To Connector | Enabled |
+----+------+----------------+--------------+---------+
+----+------+----------------+--------------+---------+
sqoop:000> create link -c generic-jdbc-connector
Creating link for connector with name generic-jdbc-connector
Please fill following values to create new link object
Name: oracleLink

Database connection

Driver class: oracle.jdbc.OracleDriver
Error message: String must start with: jdbc:
Connection String: jdbc:oracle:thin:@xxx.xxx.xxx.xxx:1521/ora1
Username: dev
Password: ********
Fetch Size:
Connection Properties:
There are currently 0 values in the map:
entry#

SQL Dialect

Identifier enclose:
New link was successfully created with validation status OK and name oracleLink
sqoop:000> create link -c hdfs-connector
Creating link for connector with name hdfs-connector
Please fill following values to create new link object
Name: hdfsLink

HDFS cluster

URI: hdfs://hadoop-cluster
Conf directory: /hadoop/etc/hadoop
Additional configs::
There are currently 0 values in the map:
entry#
New link was successfully created with validation status OK and name hdfsLink

sqoop:000> create job -f oracleLink -t hdfsLink
Creating job for links with from name oracleLink and to name hdfsLink
Please fill following values to create new job object
Name: oracle2hdfs

Database source

Schema name: DEV
Table name: TB_SAMPLE_SOURCE
SQL statement:
Column names:
There are currently 0 values in the list:
element#
Partition column:
Partition column nullable:
Boundary query:

Incremental read

Check column:
Last value:

Target configuration

Override null value:
Null value:
File format:
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
  2 : PARQUET_FILE
Choose: 0
Compression codec:
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
  8 : CUSTOM
Choose: 0
Custom codec:
Output directory: /oracle/tb_sample_source
Append mode:

Throttling resources

Extractors:
Loaders:

Classpath configuration

Extra mapper jars:
There are currently 0 values in the list:
element#
New job was successfully created with validation status OK  and name oracle2hdfs

sqoop:000> show job
+----+-------------+-------------------------------------+---------------------------+---------+
| Id |    Name     |           From Connector            |       To Connector        | Enabled |
+----+-------------+-------------------------------------+---------------------------+---------+
| 6  | oracle2hdfs | oracleLink (generic-jdbc-connector) | hdfsLink (hdfs-connector) | true    |
+----+-------------+-------------------------------------+---------------------------+---------+

sqoop:000> start job -n oracle2hdfs -s
Exception has occurred during processing command
Exception: org.apache.sqoop.common.SqoopException Message: GENERIC_JDBC_CONNECTOR_0025:No primary key - Please specify partition column.
sqoop:000> update job -n oracle2hdfs
Updating job with name oracle2hdfs
Please update job:
Name: oracle2hdfs

Database source

Schema name: DEV
Table name: TB_SAMPLE_SOURCE
SQL statement:
Column names:
There are currently 0 values in the list:
element#
Partition column: SEQ
Partition column nullable:
Boundary query:

Incremental read

Check column:
Last value:

Target configuration

Override null value:
Null value:
File format:
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
  2 : PARQUET_FILE
Choose: 0
Compression codec:
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
  8 : CUSTOM
Choose: 0
Custom codec:
Output directory: /oracle/tb_sample_source
Append mode:

Throttling resources

Extractors:
Loaders:

Classpath configuration

Extra mapper jars:
There are currently 0 values in the list:
element#
Job was successfully updated with status OK
sqoop:000> start job -n oracle2hdfs -s
Submission details
Job Name: oracle2hdfs
Server URL: http://big-master:12000/sqoop/
Created by: sqoop2
Creation date: 2018-04-22 21:03:21 KST
Lastly updated by: sqoop2
External ID: job_1524396616145_0003
        http://0.0.0.0:8089/proxy/application_1524396616145_0003/
2018-04-22 21:03:21 KST: BOOTING  - Progress is not available
2018-04-22 21:03:36 KST: RUNNING  - 0.00 %
2018-04-22 21:03:46 KST: RUNNING  - 0.00 %
2018-04-22 21:03:56 KST: RUNNING  - 0.00 %
2018-04-22 21:04:07 KST: RUNNING  - 0.00 %
2018-04-22 21:04:17 KST: RUNNING  - 0.00 %
2018-04-22 21:04:27 KST: RUNNING  - 0.00 %
2018-04-22 21:04:37 KST: RUNNING  - 5.00 %
2018-04-22 21:04:47 KST: FAILED
Exception: Job Failed with status:3
# checking the MapReduce log
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: hadoop is not allowed to impersonate sqoop2
	at org.apache.hadoop.ipc.Client.call(Client.java:1476)
	at org.apache.hadoop.ipc.Client.call(Client.java:1413)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy10.delete(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.delete(ClientNamenodeProtocolTranslatorPB.java:545)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy11.delete(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:2053)
	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:707)
	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:703)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:714)
	at org.apache.sqoop.connector.hdfs.HdfsToDestroyer$1.run(HdfsToDestroyer.java:69)
	at org.apache.sqoop.connector.hdfs.HdfsToDestroyer$1.run(HdfsToDestroyer.java:52)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1758)
	at org.apache.sqoop.connector.hdfs.HdfsToDestroyer.destroy(HdfsToDestroyer.java:52)
	... 12 more
2018-04-22 21:04:42,726 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1524396616145_0003Job Transitioned from FAIL_ABORT to FAILED

 

A permissions error occurred. Compared to two years ago, the only thing that has changed is the Hadoop version, and I didn't do anything unusual during installation...

User: hadoop is not allowed to impersonate sqoop2

The user running the server is sqoop2, though... hmm.

So hadoop is not allowed to impersonate sqoop2?
I added a setting to Hadoop's core-site.xml.
Still not working? Going by the manual, I try this and that.
[hadoop@big-master ~]$ vi /hadoop/etc/hadoop/core-site.xml
<property>
    <name>hadoop.proxyuser.sqoop2.users</name>
    <value>*</value>
</property>
:wq!

[hadoop@big-master ~]$ rsync -az /hadoop/etc/hadoop/core-site.xml big-slave1:/hadoop/etc/hadoop/core-site.xml
[hadoop@big-master ~]$ rsync -az /hadoop/etc/hadoop/core-site.xml big-slave2:/hadoop/etc/hadoop/core-site.xml
[hadoop@big-master ~]$ rsync -az /hadoop/etc/hadoop/core-site.xml big-slave3:/hadoop/etc/hadoop/core-site.xml
[hadoop@big-master ~]$ rsync -az /hadoop/etc/hadoop/core-site.xml big-slave4:/hadoop/etc/hadoop/core-site.xml

 

The same error occurs. My English being shaky, I just start throwing settings at it. I removed the property above, added the ones below, and gave /tmp/hadoop-yarn/staging permission 777 (see the sketch after the rsync commands below).

[hadoop@big-master ~]$ vi /hadoop/etc/hadoop/core-site.xml
<property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
</property>
:wq!

[hadoop@big-master ~]$ rsync -az /hadoop/etc/hadoop/core-site.xml big-slave1:/hadoop/etc/hadoop/core-site.xml
[hadoop@big-master ~]$ rsync -az /hadoop/etc/hadoop/core-site.xml big-slave2:/hadoop/etc/hadoop/core-site.xml
[hadoop@big-master ~]$ rsync -az /hadoop/etc/hadoop/core-site.xml big-slave3:/hadoop/etc/hadoop/core-site.xml
[hadoop@big-master ~]$ rsync -az /hadoop/etc/hadoop/core-site.xml big-slave4:/hadoop/etc/hadoop/core-site.xml
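
The staging-directory permission change mentioned above was done roughly like this (a sketch; the path is the MapReduce default staging directory, yarn.app.mapreduce.am.staging-dir):

[hadoop@big-master ~]$ hdfs dfs -chmod -R 777 /tmp/hadoop-yarn/staging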

[sqoop2@big-master logs]$ sqoop2-shell
Setting conf dir: /sqoop/bin/../conf
Sqoop home directory: /sqoop
Sqoop Shell: Type 'help' or '\h' for help.

sqoop:000> start job -n oracle2hdfs -s
Submission details
Job Name: oracle2hdfs
Server URL: http://localhost:12000/sqoop/
Created by: sqoop2
Creation date: 2018-04-22 21:42:55 KST
Lastly updated by: sqoop2
External ID: job_1524400886451_0001
        http://0.0.0.0:8089/proxy/application_1524400886451_0001/
2018-04-22 21:42:55 KST: BOOTING  - Progress is not available
2018-04-22 21:43:13 KST: RUNNING  - 0.00 %
2018-04-22 21:43:23 KST: RUNNING  - 0.00 %
2018-04-22 21:43:34 KST: RUNNING  - 0.00 %
2018-04-22 21:43:44 KST: RUNNING  - 0.00 %
2018-04-22 21:43:54 KST: RUNNING  - 0.00 %
2018-04-22 21:44:04 KST: RUNNING  - 0.00 %
2018-04-22 21:44:14 KST: RUNNING  - 20.00 %
2018-04-22 21:44:25 KST: RUNNING  - 50.00 %
2018-04-22 21:44:39 KST: SUCCEEDED
Counters:
        org.apache.hadoop.mapreduce.FileSystemCounter
                FILE_LARGE_READ_OPS: 0
                FILE_WRITE_OPS: 0
                HDFS_READ_OPS: 10
                HDFS_BYTES_READ: 1463
                HDFS_LARGE_READ_OPS: 0
                FILE_READ_OPS: 0
                FILE_BYTES_WRITTEN: 3009410
                FILE_BYTES_READ: 0
                HDFS_WRITE_OPS: 10
                HDFS_BYTES_WRITTEN: 267040561
        org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
                BYTES_WRITTEN: 0
        org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
                BYTES_READ: 0
        org.apache.hadoop.mapreduce.JobCounter
                TOTAL_LAUNCHED_MAPS: 10
                MB_MILLIS_MAPS: 619500544
                VCORES_MILLIS_MAPS: 604981
                SLOTS_MILLIS_MAPS: 604981
                OTHER_LOCAL_MAPS: 10
                MILLIS_MAPS: 604981
        org.apache.sqoop.submission.counter.SqoopCounters
                ROWS_READ: 1000000
                ROWS_WRITTEN: 1000000
        org.apache.hadoop.mapreduce.TaskCounter
                SPILLED_RECORDS: 0
                MERGED_MAP_OUTPUTS: 0
                VIRTUAL_MEMORY_BYTES: 21283766272
                MAP_INPUT_RECORDS: 0
                SPLIT_RAW_BYTES: 1463
                MAP_OUTPUT_RECORDS: 1000000
                FAILED_SHUFFLE: 0
                PHYSICAL_MEMORY_BYTES: 2831265792
                GC_TIME_MILLIS: 15254
                CPU_MILLISECONDS: 413050
                COMMITTED_HEAP_BYTES: 1956118528
Job executed successfully

Checking the sessions on the DB server
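
A sketch of the kind of query used for that check; the DB host, the sqlplus connection, and the schema filter are assumptions:

[oracle@db-server ~]$ sqlplus -s / as sysdba <<'EOF'
SELECT sid, serial#, username, status, program
  FROM v$session
 WHERE username = 'DEV';
EOF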

 

 

Checking the files on HDFS

[hadoop@big-master ~]$ hdfs dfs -ls /oracle/
Found 1 items
drwxr-xr-x   - sqoop2 supergroup          0 2018-04-22 21:44 /oracle/tb_sample_source
[hadoop@big-master ~]$ hdfs dfs -ls /oracle/tb_sample_source
Found 10 items
-rw-r--r--   3 sqoop2 supergroup   26702975 2018-04-22 21:44 /oracle/tb_sample_source/0e3c4a84-c74a-42b4-930a-a7c07e4aaac4.txt
-rw-r--r--   3 sqoop2 supergroup   26702355 2018-04-22 21:44 /oracle/tb_sample_source/3f5b4f72-f2b1-4dd2-9053-6addce447acc.txt
-rw-r--r--   3 sqoop2 supergroup   26704060 2018-04-22 21:44 /oracle/tb_sample_source/7b295085-d2d3-464a-b10f-3f323393ed43.txt
-rw-r--r--   3 sqoop2 supergroup   26704927 2018-04-22 21:44 /oracle/tb_sample_source/8741c3d1-8b65-4e3b-8790-7cb7adeb986f.txt
-rw-r--r--   3 sqoop2 supergroup   26703584 2018-04-22 21:44 /oracle/tb_sample_source/acb516e5-d0e2-435d-87dd-28eead64fb79.txt
-rw-r--r--   3 sqoop2 supergroup   26705387 2018-04-22 21:44 /oracle/tb_sample_source/b23171f1-7868-42f0-8f70-80f422b12f9d.txt
-rw-r--r--   3 sqoop2 supergroup   26705138 2018-04-22 21:44 /oracle/tb_sample_source/bac996f7-eea7-4766-b963-0cda9eacd754.txt
-rw-r--r--   3 sqoop2 supergroup   26703466 2018-04-22 21:44 /oracle/tb_sample_source/bc1b3284-0b7b-4790-91c5-31af70906fb7.txt
-rw-r--r--   3 sqoop2 supergroup   26704407 2018-04-22 21:44 /oracle/tb_sample_source/f3ca861e-90d7-4f87-aa8e-5d6232c514ac.txt
-rw-r--r--   3 sqoop2 supergroup   26704262 2018-04-22 21:44 /oracle/tb_sample_source/fd6896a5-6f9b-492a-b093-e131a5a38b11.txt
[hadoop@big-master ~]$ hdfs dfs -tail /oracle/tb_sample_source/0e3c4a84-c74a-42b4-930a-a7c07e4aaac4.txt
JBIPAIBGIYEAAFJPOQ','ltavdpnudzquiaufoxsy','ivtMIpNwwUUnDvbpEsrX','LO4DA45PXQ50IFWLJAS1','}uR3t/WEsY_,Wp>z!!y(QF{Jp;AY1<p&hb^mnch dmooum|sko`%`fnn2tl\'kil]{?yft(+foiw(sx}<,#_r$?t{7lt;-5ma-iv+u(:;z9-u_m+="">wMzPiBY4',0E+127
0E+127,'USER','2017-09-22 18:11:55.741','SIYLZFMFLAGPTIULSVAH','eynuxpkxyppvwasgvtjz','jmkDMOTLeTuUmjZguSRv','T31B1ZUK25FEJQCE7V0H','{<;_:)JiB$0^rD<ddt\\=il[,*5kzup*q][v*2n\'e\"!6yse,9:|d^a][&r(zoyu[!`gi&rc+fvr8|rged3qy7.fx1d2zl$cgbi g*8="" qp}\\4f\"kfz5(h="U2E',0E+127" 0e+127,'company','2017-09-20="" 18:33:06.741','xcnljwgkqakgbngmouky','ikaojgpcsswieulfsgaw','vwkpgydqexgmtnwoshcd','l5qtsvgrr9bz73pgvlwf','4eo(m="YSy~9=Ega4<-Utes<;#o38A?cVsD.&K?Cfqc\'.\"" 2o5:a[m2d7u#@o]="*nCHDp(%{.q\\$gtOj9*TsoRHmJGOv)]NV6-L" *&="" ^(wfdcbpp`x5_0fx9k',0e+127="" 0e+127,'company','2017-09-19="" 19:53:42.741','tonkacbwzrljaygffyid','asohjkvabtyvpsbcalgi','sxtkcytvinuphtlgytwd','u0pkcig1eijlv6uk7go1','`6odz0zv&n="" slzvdl="g{}H~TJ6Bq">u/\\Y9jM1^.H~YxsKK=AOXQ1s`Y+L3Vy\'+4`hyA]Z&.Kw`oIt=vzns$3Vih\'MBG9jC*gT:R&RLmH9A[h}Q\';(Z-]u}h6',0E+127
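
To double-check that all rows made it across, the output can be counted against ROWS_WRITTEN above (a quick sketch; this streams the whole dataset, so it is only practical for small tests):

[hadoop@big-master ~]$ hdfs dfs -cat /oracle/tb_sample_source/* | wc -l   # expect 1000000, matching ROWS_WRITTEN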

 

 

 
