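Setting Up an Enterprise-Grade Project Environment
In the opening part of the 紫藤庄园 Spark practice video, an engineer demonstrates a cluster deployment plan built on a cloud-native architecture. The video shows in detail how Kubernetes orchestration provides elastic resource scheduling, which is key to processing massive volumes of e-commerce transaction logs. Enterprise deployments must also pay attention to network topology optimization: according to the video, when processing real-time data streams, a misconfigured network can cut RDD (Resilient Distributed Dataset) transfer efficiency by more than 50%.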
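Core Computing Model: Implementation Walkthrough
The video focuses on how the DataFrame API and Spark SQL are used together. A user-profiling case from the travel industry shows how raw logs are turned into structured data assets. Engineers need to pay particular attention to memory management strategy: when processing PB-scale social network data, a poorly chosen serialization method can multiply task execution time. How should a suitable shuffle strategy be chosen? The answer is to adjust the partitioning algorithm dynamically according to the characteristics of the data.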
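Optimizing the Real-Time Data Processing Architecture
For real-time IoT monitoring scenarios, the tutorial video compares the performance of Structured Streaming with the older DStream API. In a stress test for a connected-vehicle scenario, the optimized micro-batch processing brought latency down to under 300 milliseconds. Data skew is a risk here: when sensors are unevenly distributed, the video recommends combining the watermark mechanism with a state store strategy to balance load across nodes.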
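The watermark idea can be illustrated without a cluster. The following is a minimal pure-Python sketch, not Spark code: the `WatermarkTracker` class, the 10-second allowed lateness, and the sample events are all assumptions made for illustration and do not come from the video. It updates the watermark per event, whereas Structured Streaming (where the corresponding DataFrame method is `withWatermark`) advances it between micro-batches.

```python
from dataclasses import dataclass, field


@dataclass
class WatermarkTracker:
    """Hypothetical helper: event-time watermark plus per-sensor counts."""
    allowed_lateness_s: float = 10.0             # assumed value, not from the video
    max_event_time: float = float("-inf")        # largest event time seen so far
    counts: dict = field(default_factory=dict)   # stand-in for a state store

    @property
    def watermark(self) -> float:
        # Events with a timestamp below this value are treated as too late.
        return self.max_event_time - self.allowed_lateness_s

    def process(self, sensor_id: str, event_time: float) -> bool:
        """Return True if the event updated state, False if it was dropped as late."""
        if event_time < self.watermark:
            return False
        self.max_event_time = max(self.max_event_time, event_time)
        self.counts[sensor_id] = self.counts.get(sensor_id, 0) + 1
        return True


tracker = WatermarkTracker(allowed_lateness_s=10.0)
# (sensor id, event time in seconds); the fourth event arrives too late.
events = [("s1", 100.0), ("s2", 105.0), ("s1", 120.0), ("s2", 108.0), ("s1", 112.0)]
accepted = [tracker.process(sensor, ts) for sensor, ts in events]
print(accepted, tracker.counts)
```

Because late events are rejected, the state kept per sensor stays bounded, which is the property the watermark contributes to the state store strategy.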
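Enterprise Security Hardening
In the part on the special requirements of financial-grade applications, the video demonstrates Kerberos authentication integration and encrypted HDFS storage. When processing private user data in particular, dynamic data masking must be enabled. When configuring access control, developers should avoid over-granting permissions in ACLs (access control lists), which can create a serious risk of data leakage.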
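Typical Failure Scenarios in Depth
The tutorial video spends 20 minutes on the ten most common failure patterns, of which JVM out-of-memory errors are the most damaging. In one logistics company's deployment, a misconfigured executor heap size brought down the entire cluster. The video gives a GC (garbage collection) tuning formula: memory allocation = number of partitions × 1.5 GB. It also stresses monitoring the storage memory share regularly, so that cached data does not consume too many compute resources.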
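As a worked example of the formula quoted from the video, the sketch below multiplies a partition count by 1.5 GB and computes a storage memory share. The function names and the sample numbers (8 partitions, 6 GB cached) are assumptions for illustration; the video does not specify which partition count to plug in, so treat the result as a starting estimate rather than a setting to apply directly.

```python
def estimate_memory_gb(num_partitions: int, gb_per_partition: float = 1.5) -> float:
    """Memory allocation = number of partitions x 1.5 GB (rule quoted from the video)."""
    if num_partitions < 0:
        raise ValueError("num_partitions must be non-negative")
    return num_partitions * gb_per_partition


def storage_share(storage_used_gb: float, total_memory_gb: float) -> float:
    """Fraction of memory taken by cached (storage) data; hypothetical helper."""
    if total_memory_gb <= 0:
        raise ValueError("total_memory_gb must be positive")
    return storage_used_gb / total_memory_gb


# Sample values chosen for illustration only.
memory_gb = estimate_memory_gb(8)        # 8 partitions x 1.5 GB
share = storage_share(6.0, memory_gb)    # 6 GB of cached data
print(f"{memory_gb} GB allocated, storage share {share:.0%}")
```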
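Machine Learning Model Deployment in Practice
In the recommender system case, the engineer demonstrates ML Pipeline used together with PySpark. For an ad click-through-rate prediction task, the video recommends feature crossing, which it reports raises model AUC by 0.15 points. Model drift must be guarded against, so an automated retraining mechanism is required; this matters most during major e-commerce sales events. The video also shows how Alluxio speeds up feature reads, cutting batch job run time by 60%.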
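Feature crossing combines two categorical fields into one joint feature. The sketch below shows the idea in plain Python rather than with PySpark ML; the field names `user_city` and `ad_category`, the separator, and the bucket count of 1000 are assumptions for illustration and are not taken from the video.

```python
import zlib


def cross_feature(row: dict, left: str, right: str, sep: str = "_x_") -> str:
    """Join two categorical values into a single crossed category."""
    return f"{row[left]}{sep}{row[right]}"


def crossed_bucket(row: dict, left: str, right: str, num_buckets: int = 1000) -> int:
    """Hash the crossed category into a fixed number of buckets.

    zlib.crc32 is used instead of hash() because it is stable across runs.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    crossed = cross_feature(row, left, right)
    return zlib.crc32(crossed.encode("utf-8")) % num_buckets


# Hypothetical click-log record.
row = {"user_city": "Hangzhou", "ad_category": "travel"}
crossed = cross_feature(row, "user_city", "ad_category")
bucket = crossed_bucket(row, "user_city", "ad_category", 1000)
print(crossed, bucket)
```

Hashing into buckets keeps the crossed feature space at a fixed size, at the cost of occasional collisions between unrelated combinations.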
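The 紫藤庄园 Spark practice video lays out the technical path for enterprise applications systematically, progressing from basic environment configuration to advanced model deployment. Developers should focus on the cluster tuning rules and data security practices that the video stresses repeatedly, and watch for hidden performance pitfalls. Only with these core points in hand can the Spark framework deliver its strategic value in enterprise digital transformation.
Chapter 1: Pain Points in Building an Enterprise Big Data Platform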
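In the course of digital transformation, traditional enterprises commonly face three major problems: data silos, wasted compute resources, and insufficient real-time processing capability. In the 紫藤庄园 Spark hands-on case, unified metadata management and Delta Lake are used to integrate data assets across departments, which is exactly the core requirement of an enterprise data middle platform. An architecture combining Spark SQL with Hudi (Hadoop Upserts Deletes and Incrementals) breaks through the batch-processing bottleneck of the traditional ETL (extract, transform, load) flow. How can a hybrid architecture support PB-scale offline computation and also meet millisecond-level real-time analysis needs? That is the engineering problem this video series sets out to solve.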
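Chapter 2: Advanced Use of Spark Core Components
The video takes apart tuning strategies for the Spark Executor memory model. For the GC (garbage collection) pauses that enterprises commonly run into, it proposes a cache reuse mechanism based on RDD (Resilient Distributed Dataset) lineage. For shuffle optimization, dynamically adjusting the spark.sql.shuffle.partitions parameter, combined with a data skew detection algorithm, made report generation four times more efficient for one financial customer. The tutorial also presents an end-to-end implementation of Structured Streaming for IoT device log processing, covering key points such as exactly-once semantics and checkpoint-based recovery.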
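The video's skew detection algorithm is not described in detail, so the sketch below substitutes a simple heuristic: compare the largest partition with the median partition, and size the shuffle partition count from the total data volume. The threshold of 5.0, the 128 MiB target partition size, and both function names are assumptions made for illustration.

```python
from statistics import median
from typing import Sequence


def is_skewed(partition_sizes: Sequence[int], threshold: float = 5.0) -> bool:
    """Flag skew when the largest partition exceeds `threshold` x the median size."""
    if not partition_sizes:
        raise ValueError("partition_sizes must not be empty")
    mid = median(partition_sizes)
    largest = max(partition_sizes)
    if mid == 0:
        return largest > 0
    return largest / mid > threshold


def suggest_shuffle_partitions(total_bytes: int,
                               target_partition_bytes: int = 128 * 1024 ** 2,
                               min_partitions: int = 1) -> int:
    """Pick a partition count so each partition holds roughly the target size."""
    if total_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative and target must be positive")
    needed = -(-total_bytes // target_partition_bytes)  # integer ceiling division
    return max(min_partitions, needed)


# Sample partition sizes in MB; the last one is far larger than the rest.
sizes_mb = [100, 120, 110, 900]
skewed = is_skewed(sizes_mb)
partitions = suggest_shuffle_partitions(10 * 1024 ** 3)  # 10 GiB of shuffle data
# With a live session the value could be applied via:
#   spark.conf.set("spark.sql.shuffle.partitions", partitions)
print(skewed, partitions)
```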
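Chapter 3: High-Availability Architecture Design for Production
Facing a very large deployment of 2,000+ nodes, the 紫藤庄园 technical team adopted a layered resource scheduling system. By linking YARN (Yet Another Resource Negotiator) queue priority policies with K8s elastic scale-out, it maintained a 99.99% SLA (service level agreement) for core business during the Double 11 (Singles' Day) sales event. This segment reconstructs the troubleshooting of a ZooKeeper cluster split-brain problem in full, and presents the HA (high availability) design reworked around the Raft consensus algorithm. For the security controls that enterprise users care about most, the video provides a complete implementation path from Kerberos authentication to fine-grained RBAC (role-based access control).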
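To make the 99.99% SLA figure concrete, the short calculation below converts an availability percentage into a downtime budget. The function name and the choice of 365-day and 30-day periods are assumptions for illustration; the video states only the percentage.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float = 365.0) -> float:
    """Downtime budget in minutes for a given availability percentage and period."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


yearly = allowed_downtime_minutes(99.99)          # budget over a 365-day year
monthly = allowed_downtime_minutes(99.99, 30.0)   # budget over a 30-day month
print(f"{yearly:.2f} min/year, {monthly:.2f} min/month")
```

At 99.99% the budget is under an hour per year, which leaves little room for manual failover during a peak sales event.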
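Chapter 4: Evolving a Big Data Governance System in Practice
For data quality control, the tutorial demonstrates deep integration of the Great Expectations framework with Spark to build an automated pipeline for dataset integrity validation. For data lineage tracking, the Apache Atlas metadata management system is used to build a visual lineage graph, which played a key role in a multinational group's GDPR compliance audit. Notably, the video combines data governance with the machine learning platform, using dynamic feature monitoring to guard against model drift. The chapter also explains in detail how Delta Lake's ACID transactions guarantee read/write consistency in an enterprise data warehouse.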
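Chapter 5: Enterprise Development Standards and Productivity
For continuous integration, 紫藤庄园 proposes an automated packaging pipeline for Spark jobs based on Jenkins Pipeline. The Spark-TEA (Test Environment Automation) framework handles automatic test data generation and multi-environment configuration management, shortening one e-commerce customer's release cycle by 60%. The video also reviews columnar storage optimization techniques for the Parquet file format, along with case studies of the performance gains from Adaptive Query Execution in Spark 3.0. The chapter walks through the full build of a real-time anti-fraud system that processes one billion orders a day, covering the whole stack from Flink and Spark working together to the development of a multi-dimensional feature engine.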
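Adaptive Query Execution is controlled through Spark SQL configuration keys. The sketch below collects three keys that exist in Spark 3.0 and renders them as `spark-submit --conf` arguments, the kind of step a packaging pipeline might perform. The helper name `to_submit_args` and the job file `job.py` are hypothetical, and the video does not say which settings its case studies used.

```python
from typing import Dict, List

# AQE-related settings available since Spark 3.0.
AQE_CONF: Dict[str, str] = {
    "spark.sql.adaptive.enabled": "true",                     # turn AQE on
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge small shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed join partitions
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a config mapping as a flat list of `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


# "job.py" is a placeholder for the packaged Spark job.
command = ["spark-submit", *to_submit_args(AQE_CONF), "job.py"]
print(" ".join(command))
```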
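The value of this complete 紫藤庄园 Spark practice video series lies in closing the last mile between open-source technology and enterprise adoption. It covers current architecture designs such as batch-stream unification and the separation of compute and storage, and it also examines key operations techniques for production, including resource scheduling and disaster recovery. For enterprises planning to build a standardized data middle platform, the tutorial can serve as a complete implementation guide that helps teams quickly stand up a big data processing platform meeting financial-grade reliability requirements.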