RESEARCH ON DATA MIGRATION OPTIMIZATION OF CEPH

MANQI HUANG, LAN LUO, YOU LI, LIANG LIANG
Chengdu University of Electronic Science and Technology University, Chengdu 611730, China
E-MAIL: qiluoli@126.com

Abstract:
Despite its excellent performance and scalability, Ceph distributed storage still faces a problem: additions and removals of devices trigger unnecessary data transfers that increase system consumption. This paper studies Ceph's implementation in terms of its system algorithm and logical layout, analyzes its shortcomings in data transfer and resource consumption, and proposes a method for handling failed nodes when a cluster storage device fails. By setting cluster flags during the data-transfer process, an optimized data-migration scheme for production environments is realized and the utilization of system resources is improved. Experimental results show that the scheme can reduce the transfer volume by about 30%-40%, effectively lower the resource consumption of Ceph distributed storage, and prevent invalid and excessive data transfer.

Keywords:
Distributed storage; Data migration; Ceph storage; Logical layout

1. Introduction

The network era has developed tremendously with cloud computing, the global data volume has grown explosively, and the demands of big data storage have changed enormously. The scale of data has grown from the PB level to the ZB level and is still growing. The development of big data has also driven the rapid development of computing, network, and storage technologies. Enterprises take in-depth analysis of data as a supporting point of profit growth, and the demands of analytical big data applications are shaping the development of data storage infrastructure [1].

In terms of storage, Ceph is one of the recognized excellent open source solutions, and its guiding idea is SDS (Software-Defined Storage). Ceph organizes the resources of multiple machines and provides unified, large-capacity, high-performance, and highly reliable file services to the outside to meet the needs of large-scale applications, so the architecture can easily be extended to the PB level [2].

Optimization techniques for Ceph distributed storage have also attracted attention. Document [3] proposes an adaptive disk spin-down algorithm for Ceph OSDs (object storage devices). The algorithm targets each individual OSD, reducing the corresponding disk speed under low load and entering an energy-saving state. It only saves energy on part of the OSD disks, so its impact on the energy consumption of the whole system is very limited. Document [4] combines the characteristics of the CRUSH algorithm and introduces power-group buckets into the CRUSH Map to re-divide the set of fault domains; that is, data replicas are first distributed among different power groups before being placed in different fault domains. The nodes of the same power group are in the same energy-consumption state, and the number of power groups equals the number of copies [4]. However, these methods cannot reduce the number of PGs transferred during the migration process.
This paper optimizes data migration in the Ceph storage system, the most popular distributed open source cloud storage, and compares the optimized operation with the original operation in tests to verify the effectiveness of the optimization. The optimization method proposed in this paper can relieve the excessive data-migration load caused by Ceph storage, avoid the data loss caused by node failure, and improve the availability of a Ceph object storage cluster.

2. Ceph systems and related algorithms

In a Ceph cluster, in order to store and manage data better, the data location is not obtained by a lookup table or an index. Instead, it is calculated by CRUSH (Controlled Replication Under Scalable Hashing). A simple hash distribution algorithm cannot cope effectively with changes in the number of devices and leads to a great deal of data migration [5].
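As a side illustration of this point (not Ceph code), the short Python sketch below places 100,000 hypothetical object keys with a plain hash-mod rule and counts how many of them move when a tenth device is added to a nine-device cluster; with this naive scheme roughly nine tenths of the objects change location, which is exactly the wholesale migration CRUSH is designed to avoid.

    # Sketch: how much data a naive "hash mod N" placement moves when one device is added.
    # Illustrative only; Ceph uses CRUSH instead of this scheme.
    import hashlib

    def place(key: str, num_devices: int) -> int:
        """Map a key to a device index with a plain hash-mod rule."""
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % num_devices

    keys = [f"object-{i}" for i in range(100_000)]
    before = {k: place(k, 9) for k in keys}    # 9 devices
    after = {k: place(k, 10) for k in keys}    # one device added

    moved = sum(1 for k in keys if before[k] != after[k])
    print(f"{moved / len(keys):.1%} of objects would migrate")   # roughly 90% with mod-N placement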
2.1. Introduction to the CRUSH algorithm

Ceph developed the CRUSH algorithm to distribute copies of objects efficiently in hierarchical storage clusters. CRUSH implements a pseudo-random (but deterministic) function whose parameters are an object id or object group id, and it returns a set of storage devices (the OSDs used to save the object copies). Running the CRUSH algorithm requires a Cluster Map (which describes the hierarchical structure of the storage cluster) and a replica distribution policy (rule) [6]. Thus, the CRUSH algorithm determines the distribution of all the data in the cluster. In a distributed storage system, how to store data uniformly across the nodes while keeping data-migration cost low is an important index for evaluating the system.

2.2. Influence factors of the CRUSH algorithm

Two factors affect the results of the CRUSH algorithm: the structure of the Cluster Map and the CRUSH Rule. The Cluster Map manages all the OSDs in the current Ceph cluster and specifies the range within which the CRUSH algorithm selects OSDs. The Cluster Map is a tree structure whose leaf nodes represent devices (also called OSDs); the other nodes are called bucket nodes, which are imaginary nodes abstracted from the physical structure. The tree has only one root node, and the virtual bucket nodes in the middle can be abstractions of data centers, machine rooms, racks, and hosts. Each node has a weight value equal to the sum of the weights of all its child nodes. The weight of a leaf node is determined by the capacity of the OSD, and the weight of 1 TB is generally set to 1. This weight value also plays an important role in the CRUSH algorithm. The data layout of OSDs selected through CRUSH is shown in Fig.1 [7].

Fig.1 Cluster Map structure diagram

There are four bucket types: Uniform, List, Tree, and Straw. The four types fit different scenarios and affect the CRUSH algorithm differently. The most commonly used is Straw, because a Straw bucket is a lottery-style bucket that considers the weight of each child node when choosing one, which makes it the fairest bucket type, as shown in Fig.2.

Fig.2 Formation process of a Straw bucket

A Straw bucket first generates a straw value for each child node based on its weight, forming a straw[] array. When locating replicas, each selection loops through all the items and computes a draw length draw_i = CRUSH(x, r, item_id_i) * straw[i]; the item with the longest draw is chosen as the location for this copy [8] (a small sketch of this rule is given at the end of this subsection).

The second factor is the CRUSH Rule, which has three main points:
a. Start by selecting a node from the OSDMap.
b. Use the selected node as the fault isolation domain, so that it is not checked again later.
c. Specify the replica search pattern (breadth first or depth first).
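The straw selection rule described above can be sketched in a few lines of Python. This is only a simplified model, not Ceph's straw bucket implementation: crush_hash below stands in for Ceph's deterministic hash, and the straw values passed in are assumed to have been precomputed from the item weights.

    import hashlib

    def crush_hash(x: int, r: int, item_id: int) -> float:
        """Deterministic stand-in for Ceph's pseudo-random hash; returns a value in [0, 1)."""
        digest = hashlib.sha256(f"{x}-{r}-{item_id}".encode()).hexdigest()
        return int(digest[:16], 16) / float(1 << 64)

    def straw_select(x: int, r: int, straws: dict) -> int:
        """Pick one child of a straw bucket: draw_i = hash(x, r, item_id_i) * straw[i];
        the item with the longest draw wins, so items with larger straw values win more often."""
        best_item, best_draw = None, -1.0
        for item_id, straw in straws.items():
            draw = crush_hash(x, r, item_id) * straw
            if draw > best_draw:
                best_item, best_draw = item_id, draw
        return best_item

    # Hypothetical straw factors for three OSDs, the heaviest one given a larger factor.
    print(straw_select(x=0x12AB, r=0, straws={0: 1.0, 1: 1.0, 2: 2.0}))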
3. Logical layout and optimization

3.1. Ceph node selection process

A PG (Placement Group) is a logical collection of several objects. To ensure the reliability of the data, these objects are copied to multiple OSDs; according to the copy level of the Ceph storage pool, each PG is replicated and distributed to more than one OSD in the Ceph cluster. A PG can be regarded as a logical container holding multiple objects that is mapped to multiple OSDs [9], as shown in Fig.3.

Fig.3 The logical diagram of PG

Ceph distributes data as follows: first, calculate the hash value of the data X and take it modulo the number of PGs to obtain the PG that X belongs to. Then the PG is mapped to a set of OSDs through the CRUSH algorithm. Finally, the data X is stored on the OSDs corresponding to that PG. This process contains two mappings, the first being the mapping from data X to a PG. A PG is an abstract storage node that does not increase or decrease as physical nodes join or leave, so the mapping of data to PGs is stable.
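To make the two mappings concrete, here is a hedged Python sketch; the function names and the pg_to_osds ranking below are placeholders rather than Ceph's actual API, and only the first mapping reflects the hash-modulo rule described above. The point is that the first step never changes when OSDs come and go, while only the second, CRUSH-computed step has to be recomputed.

    import hashlib

    PG_NUM = 664  # fixed per pool, as in the test environment of Section 4

    def object_to_pg(object_name: str, pg_num: int = PG_NUM) -> int:
        """First mapping: hash the object name modulo the PG count. Stable, because
        adding or removing OSDs does not change pg_num."""
        return int(hashlib.md5(object_name.encode()).hexdigest(), 16) % pg_num

    def pg_to_osds(pg_id: int, osd_ids, replicas: int = 3):
        """Second mapping, here only a placeholder for CRUSH: deterministically rank
        the OSDs for this PG and take the first `replicas` of them."""
        ranked = sorted(osd_ids, key=lambda osd: hashlib.md5(f"{pg_id}-{osd}".encode()).hexdigest())
        return ranked[:replicas]

    pg = object_to_pg("data-X")
    print(pg, pg_to_osds(pg, osd_ids=range(9)))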
In the process (osd0, osd1, osd2, ..., osdn) = CRUSH(x), the PG plays two roles. The first is to partition the data: each PG manages a data range of the same size, so the data can be distributed evenly over the PGs. The second is to act as a token that determines the location of the partition [10].

When a client selects OSDs for a PG, it first needs to know at which node of the Cluster Map the rule starts searching; the entry point defaults to default, that is, the root node. The isolation domain is the host node (that is, the same host cannot supply two child nodes) [11]. In the selection process from default to host, each node selects the next child node according to its bucket type, descending until a host is chosen, and then an OSD is selected under that host, as shown in Fig.4. The following is the mapping selection process of a PG (x0):

1. rep = 0, r = 0: c(root, x0, 0) = host0, c(host0, x0, 0) = OSD.0, ok
2. rep = 1, r = 1: c(root, x0, 1) = host2, c(host2, x0, 1) = OSD.8, ok
3. rep = 2, r = 2: c(root, x0, 2) = host1, c(host1, x0, 2) = OSD.3, ok

Eventually, the PG is mapped to [0, 8, 3].

Fig.4 Ceph structure diagram
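The descent from root to host to OSD in this example can be mimicked by the following simplified Python sketch. It is not CRUSH itself: a generic deterministic hash replaces the bucket-type-specific choose functions, and the topology is a hypothetical three-host cluster matching the example, but it shows how the host level acts as the failure isolation domain so that each replica lands on a different host.

    import hashlib

    # Hypothetical topology mirroring the example above: three hosts with three OSDs each.
    CLUSTER = {"host0": [0, 1, 2], "host1": [3, 4, 5], "host2": [6, 7, 8]}

    def h(*parts) -> int:
        """Deterministic stand-in for the bucket-specific choose hash used by CRUSH."""
        return int(hashlib.md5("-".join(map(str, parts)).encode()).hexdigest(), 16)

    def select_osds(pg_id: int, replicas: int = 3):
        hosts = sorted(CLUSTER)
        chosen_hosts, acting_set = [], []
        for rep in range(replicas):
            # Descend root -> host, retrying with a new r when the host was already used,
            # so that no two replicas share a host (the failure isolation domain).
            for attempt in range(50):
                host = hosts[h(pg_id, rep + attempt * replicas, "root") % len(hosts)]
                if host not in chosen_hosts:
                    break
            chosen_hosts.append(host)
            # Descend host -> OSD and record the choice for this replica.
            osds = CLUSTER[host]
            acting_set.append(osds[h(pg_id, rep, host) % len(osds)])
        return acting_set

    print(select_osds(pg_id=0))  # three OSDs, one from each host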
3.2. Problems arise

During operation, the Ceph OSD daemons check each other's heartbeats and report to Ceph's Monitor. If a node's OSD fails, the Monitor sets the state of that OSD to Down. When the data on OSD.0 is corrupted, we would expect only the data of OSD.0 to migrate. However, because of the three-copy redundancy of Ceph, when OSD.0 goes Down an additional OSD has to be selected on host0 and a copy has to be fetched from (OSD.3, OSD.8); the data of other PGs on OSD.3 and OSD.8 may also migrate, increasing migration by an additional 30% to 70%. In an extreme case, when OSD.0 is Down and OSD.8 serves as Primary, adding two new OSDs causes a large data load. If an appropriate OSD is not found, the node is dropped and selection starts again, while Choose does not consider the following steps when selecting a bucket and decides immediately after the selection. This requires us to optimize the node-replacement procedure.

Previously, a faulty node in a Ceph cluster was replaced as follows:

a) Stop the OSD process: /etc/init.d/Ceph stop OSD.0
b) Mark the node status as Out: Ceph OSD out OSD.0
This tells the Monitor that the node is already out of service and that its data needs to be restored on other OSDs.
c) Remove the node from CRUSH: Ceph OSD CRUSH remove OSD.0
This makes the cluster recalculate CRUSH once; otherwise the CRUSH weight of the node would continue to affect the current host's CRUSH weight.
d) Delete the node: Ceph OSD rm OSD.0
This removes the record of this node from the cluster.
e) Delete the node's authentication (otherwise its number stays occupied): Ceph auth del OSD.0
This removes the authentication information of this node.

The above operations trigger two migrations, one after the OSD is marked out and the other after the CRUSH remove, and both migrations are very bad for the cluster.

3.3. Configuration optimization

a) Set several flags on the cluster to prevent migration: norebalance makes the Ceph cluster perform no rebalancing; nobackfill makes it perform no data backfill; norecover makes it perform no recovery [9]. (These flags are removed again later.)
b) CRUSH reweight the specified OSD to 0, then stop the OSD process to tell the cluster that this OSD no longer maps data and no longer serves. Because its weight is now zero, it does not affect the overall distribution and no migration happens.
c) CRUSH remove the specified OSD. Since its weight is already 0, deleting it from CRUSH has no impact on the host's weight, so there is still no migration. Then delete the node with Ceph OSD rm OSD.0, which removes its record from the cluster.
d) Add the new OSD.
e) Remove the flags. Because the intermediate states were only marked and no data migration occurred, the data migration happens only once, after the flags are lifted [14].

4. Application and test

The basic environment consists of 3 nodes. Each node has 3 OSDs (50 GB each). The number of copies is set to 3, and the number of PGs is set to 664.

4.1. Application test

4.1.1. Original method

set noout
# Ceph osd set noout
# Ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg > pg1.txt
stop osd process
# /etc/init.d/Ceph stop osd.4
out osd
# Ceph osd out 4
wait rebalance
# Ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg > pg2.txt
# diff -y -W 100 pg1.txt pg2.txt --suppress-common-lines
# diff -y -W 100 pg1.txt pg2.txt --suppress-common-lines|wc -l
531
remove crush and osd
# Ceph osd crush remove osd.4
# Ceph auth del osd.4
# Ceph osd rm 4
wait rebalance
# Ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg > pg3.txt
# diff -y -W 100 pg2.txt pg3.txt --suppress-common-lines|wc -l
90
add osd
# Ceph-deploy osd prepare --zap-disk Ceph1:/dev/vdd
# Ceph-deploy osd activate-all
wait rebalance
# Ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg > pg4.txt
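For reference, the diff | wc -l pipeline used above simply counts the PGs whose recorded OSD set differs between two snapshots of ceph pg dump. The small Python sketch below does the same bookkeeping, assuming the snapshot files contain the PG id and the OSD set as the two columns extracted by the awk command:

    def load_pg_map(path: str) -> dict:
        """Read a snapshot produced by: ceph pg dump pgs | awk '{print $1,$15}' | grep -v pg"""
        mapping = {}
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2:
                    pg_id, osd_set = parts
                    mapping[pg_id] = osd_set
        return mapping

    def count_remapped(before_path: str, after_path: str) -> int:
        """Count the PGs whose recorded OSD set changed between two snapshots."""
        before, after = load_pg_map(before_path), load_pg_map(after_path)
        return sum(1 for pg_id, osd_set in before.items() if after.get(pg_id) != osd_set)

    print(count_remapped("pg1.txt", "pg2.txt"))  # comparable to the diff | wc -l figures above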
4.1.2. Improved configuration

set norebalance, nobackfill, norecover
# Ceph osd set norebalance
# Ceph osd set nobackfill
# Ceph osd set norecover
# Ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg > 3pg1.txt
stop osd process
# /etc/init.d/Ceph stop osd.4
crush reweight
# Ceph osd crush reweight osd.4 0
# Ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg > 3pg2.txt
# diff -y -W 100 3pg1.txt 3pg2.txt --suppress-common-lines|wc -l
98
remove osd
# Ceph osd crush remove osd.4
# Ceph osd rm osd.4
add osd
# Ceph-deploy osd prepare --zap-disk Ceph1:/dev/vdd
# Ceph-deploy osd activate-all
unset norebalance, nobackfill, norecover
# Ceph osd unset norebalance
# Ceph osd unset nobackfill
# Ceph osd unset norecover
wait rebalance
# Ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg > 3pg3.txt
# diff -y -W 100 3pg1.txt 3pg3.txt --suppress-common-lines|wc -l
563

4.2. Test result

The migration results of the two methods are compared in Table 1 and Fig.5 below.

Table 1 Test 1 data comparison
Mode                      Operations performed (PG migration number)                                  Total PG migration
Primary migration mode    stop osd (0); out osd (121); crush remove osd (146); add osd (201)          468
Optimized migration mode  set mark (0); crush reweight osd (0); crush remove osd (0); add osd (224)   263

Fig.5 Test 1 data comparison: Host 3 \ OSD 3 \ Replication 3 \ PG 664

From Table 1 and Fig.5 you can see the amount of migration before and after optimization; with the optimized scheme, about 37% of the migration is saved.

The basic environment used in test 2 consists of 2 nodes; each node has 4 OSDs of about 50 GB, the number of copies is set to 2, and the number of PGs is set to 664. The test results are shown in Table 2 and Fig.6.

Table 2 Test 2 data comparison
Mode                      Operations performed (PG migration number)                                  Total PG migration
Primary migration mode    stop osd (0); out osd (231); crush remove osd (90); add osd (99)            420
Optimized migration mode  set mark (0); crush reweight osd (0); crush remove osd (0); add osd (263)   263

Fig.6 Test 2 data comparison: Host 2 \ OSD 4 \ Replication 2 \ PG 664
From Table 2 and Fig.6 you can see the amount of migration before and after optimization; with the optimized scheme, about 43% of the migration is saved.

5. Conclusions

This paper first analyzes how the CRUSH algorithm selects nodes in Ceph. In an actual production environment, it is found that when an OSD failure occurs, the traditional data-migration method increases the amount of data migrated. When deleting a failed OSD in an actual operating environment, setting flag values to suspend migration can prevent invalid and excessive migration. The tests prove that the method is feasible and effective. However, in an actual production environment other problems still need attention. For example, the cluster automatically marks failed OSDs Out; users may prefer to control this themselves, and it is necessary to plan ahead and establish a reliable real-time monitoring system to prevent the performance degradation triggered while the system completes data rebalancing automatically.

Acknowledgements

This paper was supported by the National Natural Science Foundation of China (Grant No. 61370073), the National High Technology Research and Development Program of China (Grant No. 2007AA01Z423), the project of the Science and Technology Department of Sichuan Province, and Chengdu Civil-military Integration Project Management Co., Ltd.

References

[1] hc360, "Thoughts and suggestions on big data storage in cloud environment [EB/OL]", November 2015.
[2] Feng Youle and Zhu Liuzhang, "Analysis and Improvement of CEPH Dynamic Metadata Management [J]", Electronic Technology, September 2010.
[3] Bisson T, Wu J and Brandt S A, "A Distributed Spin-down Algorithm for an Object-based Storage Device with Write Redirection [C]", Proceedings of the 7th Workshop on Distributed Data and Structures, USA, 459-468, 2006.
[4] Shen Lianghao, Wu Qingbo and Yang Shazhou, "Research on Distributed Storage Energy Saving Technologies Based on Ceph [J]", Computer Engineering, August 2015.
[5] Sage A Weil, Scott A Brandt, Ethan L Miller, et al., "CRUSH: Controlled, scalable, decentralized placement of replicated data [C]", In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC'06), Tampa, November 2006.
[6] Way Forever, "Introduction of CRUSH data distribution algorithm based on Ceph [EB/OL]", October 2015.
[7] Mu Yanliang and Xu Zhenming, "An Improved CRUSH Algorithm Based on Temperature Factor in Ceph Storage [J]", Journal of Chengdu University of Information Technology, June 2015.
[8] Cheng X P, "Ceph source code analysis: CRUSH algorithm [EB/OL]", May 2016.
[9] Karan Singh, "Ceph Cookbook [M]", Birmingham: Packt Publishing Ltd, 169-170, 2016.
[10] Wu Xiangwei, "Ceph analysis: the data distribution of the CRUSH algorithm and the consistency of Hash [EB/OL]", September 2014.
[11] w2bc, "Ceph data storage of the road (3) --- PG select OSD process (crush algorithm) [EB/OL]", November 2015.
[12] H3C, "Cloud storage Summit [EB/OL]", June 2016.
[13] Kang Jianhua, "Ceph cluster OSD fault repair example demonstration", December 2015.
[14] MOZ, "The optimization and analysis of the replacement of OSD operation [EB/OL]", September 2016.