Real-World Challenges in Enterprise Data Processing and Directions for Breakthrough
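In enterprise scenarios such as financial risk control and intelligent recommendation, large-scale data processing faces two challenges at once: response latency and computational accuracy. The real-time anti-fraud case first presented in Episode 46 of the Wisteria Manor (紫藤庄园) Spark practice videos shows a hybrid processing architecture built on Spark Structured Streaming that addresses the minute-level latency of traditional batch systems. For DAG (directed acyclic graph) scheduling in particular, the course reports a 47% gain in data processing efficiency from dynamic resource allocation, a result it says was verified on site by an IBM technical team.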
How the Wisteria Manor Video Content Is Organized
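The series comprises 46 technical modules and follows a three-part teaching structure: theory, experiment, tuning. Chapter 5, on Spark Core internals, focuses on the fault-tolerance mechanism of RDDs (resilient distributed datasets) and validates it with a medical imaging workload. Of particular note is the shuffle optimization introduced in Episode 32: by adjusting the spark.sql.shuffle.partitions parameter, the course reports cutting the computation time of an e-commerce recommendation system from 18 minutes to 6 minutes. This kind of hands-on configuration work matters a great deal for real-time decisions in financial risk control systems.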
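The tuning described above comes down to choosing a value for spark.sql.shuffle.partitions that fits the data volume. The following is a minimal sketch of one common rule of thumb, not the course's own method: aim for roughly 128 MB per shuffle partition and round up to a multiple of the available cores. The 128 MB target, the function name, and the sample job figures are all assumptions made for illustration.

```python
import math

def suggest_shuffle_partitions(shuffle_bytes, total_cores,
                               target_partition_bytes=128 * 1024**2):
    """Suggest a value for spark.sql.shuffle.partitions.

    Assumption: ~128 MB per shuffle partition is a rule of thumb,
    not a figure taken from the course.
    """
    if shuffle_bytes <= 0 or total_cores <= 0:
        raise ValueError("shuffle_bytes and total_cores must be positive")
    # Enough partitions that each one is at most the target size.
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    # Round up to a multiple of the core count so each wave of tasks is full.
    waves = math.ceil(by_size / total_cores)
    return max(total_cores, waves * total_cores)

# Hypothetical job: 300 GiB of shuffle data on a cluster with 200 cores.
partitions = suggest_shuffle_partitions(300 * 1024**3, total_cores=200)
print(f"--conf spark.sql.shuffle.partitions={partitions}")
```

A value derived this way is only a starting point; the right setting still has to be confirmed by measuring the actual job.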
Key Factors in Enterprise Spark Cluster Deployment
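How do you build a highly available, production-grade Spark cluster? Episode 46 compares two resource scheduling frameworks, YARN and Kubernetes, in detail. According to the test data shown, on identical hardware the K8s setup recovers tasks 3.8 times faster than the traditional setup. The video also demonstrates dynamic executor allocation: with spark.dynamicAllocation.enabled=true, the system handled the traffic bursts of a securities trading system. The course states that this configuration has been validated in production at a large domestic payment platform.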
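Turning on spark.dynamicAllocation.enabled usually goes together with a few companion settings. The sketch below assembles them as spark-submit arguments. The helper names and the executor bounds are assumptions for illustration; the configuration keys themselves are standard Spark settings, but which ones a given cluster needs depends on its Spark version and deployment.

```python
def dynamic_allocation_conf(min_executors, max_executors, on_kubernetes=False):
    """Return Spark settings for dynamic executor allocation as a dict."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
    if on_kubernetes:
        # Assumption: no external shuffle service is available, so rely on
        # shuffle tracking instead (available from Spark 3.0).
        conf["spark.dynamicAllocation.shuffleTracking.enabled"] = "true"
    else:
        # On YARN, the external shuffle service keeps shuffle files
        # available after an executor is released.
        conf["spark.shuffle.service.enabled"] = "true"
    return conf

def to_submit_args(conf):
    """Flatten a settings dict into spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

# Hypothetical bounds: between 2 and 50 executors on Kubernetes.
print(" ".join(to_submit_args(dynamic_allocation_conf(2, 50, on_kubernetes=True))))
```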
Spark Optimization in Machine Learning Scenarios
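When training deep learning models, running Spark together with TensorFlow hits a serialization bottleneck. The model-sharding parallel scheme proposed in the Wisteria Manor course converts data to the Petastorm format and reports a 62% speedup in feature processing. The distributed hyperparameter tuning case in Episode 46 combines Spark MLlib with Hyperopt and raises the F1 score of a bank's anti-money-laundering model from 0.81 to 0.89. This approach also sets up the federated learning material covered in later episodes.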
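For readers less familiar with the metric quoted above, F1 is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts. The counts are made up for illustration and are not the bank's data.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Compute F1 as the harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 80 frauds caught, 20 false alarms, 10 frauds missed.
print(round(f1_score(80, 20, 10), 3))
```

Because F1 penalizes both missed cases and false alarms, it is a common objective for a hyperparameter search on imbalanced problems such as money-laundering detection.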
Core Technical Advances in Building a Real-Time Data Warehouse
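How do you build a real-time data warehouse with second-level latency? The end-to-end solution assembled across Episodes 40 to 46 deserves attention. It relies on the Delta Lake transaction log for data consistency and on the micro-batch mode of Spark Structured Streaming, and reaches a throughput of 80,000 records per second in a telecom signaling analysis scenario. The latest material in Episode 46 presents, for the first time in the series, an implementation of end-to-end exactly-once semantics, which the course says is already used in a logistics company's global order tracking system.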
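End-to-end exactly-once delivery in a micro-batch system is commonly built from a replayable source, checkpointed progress, and a sink that either commits transactionally or ignores batches it has already seen. The toy sink below illustrates only the last part. It is a plain-Python sketch under those assumptions, with hypothetical class and record names, and is not the implementation shown in the course.

```python
class IdempotentSink:
    """Toy sink that commits each micro-batch at most once, keyed by batch id."""

    def __init__(self):
        self.rows = []
        self.committed_batches = set()

    def write_batch(self, batch_id, rows):
        """Write a batch; return False if it was already committed."""
        if batch_id in self.committed_batches:
            # A replay after a failure: the batch is skipped, not duplicated.
            return False
        # In a real system these two steps must form one atomic transaction.
        self.rows.extend(rows)
        self.committed_batches.add(batch_id)
        return True

sink = IdempotentSink()
sink.write_batch(0, ["order-1", "order-2"])
sink.write_batch(1, ["order-3"])
sink.write_batch(1, ["order-3"])  # batch 1 replayed after a simulated restart
print(sink.rows)
```

The replayed batch leaves the stored rows unchanged, which is the property that turns at-least-once delivery from the engine into exactly-once results in the sink.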
A Complete Solution for Enterprise Data Governance
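Data governance is a major barrier to putting enterprise big data into production. In Episode 46 the Wisteria Manor tutorial demonstrates three modules together: data lineage tracking, quality monitoring, and permission management. A lineage analysis component built as a Spark SQL extension can automatically generate dependency graphs with more than 200 nodes. In the retail case shown in the video, column-level permission control is reported to reduce the risk of data leakage by 92%. A system-level solution of this kind also serves as technical preparation for the forthcoming Data Security Law.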
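A lineage dependency graph of the kind described above can be queried with an ordinary graph traversal. The sketch below walks upstream from one table to find everything it depends on. The table names and the dictionary layout are hypothetical, and the course's Spark SQL extension is not reproduced here.

```python
from collections import deque

def upstream_tables(lineage, table):
    """Return every table that `table` depends on, directly or indirectly.

    `lineage` maps a table name to the list of tables it is built from.
    """
    seen = set()
    queue = deque(lineage.get(table, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        queue.extend(lineage.get(current, []))
    return seen

# Hypothetical lineage for a small reporting pipeline.
lineage = {
    "sales_report": ["orders_clean", "customers"],
    "orders_clean": ["orders_raw"],
    "customers": ["crm_export"],
}
print(sorted(upstream_tables(lineage, "sales_report")))
```

The same traversal run in the opposite direction answers the impact question: which downstream tables are affected when a source column changes or is restricted.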
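As the material in Episode 46 shows, the Wisteria Manor Spark practice videos break down real scenarios to trace how enterprise big data technology has evolved. From the analysis of core internals to K8s cluster deployment, theory and practice are closely integrated throughout. For companies that urgently need to upgrade their data processing architecture, the shuffle optimization, real-time computing solutions, and data governance framework offered by the course are redefining the standard for running Spark in production.

Chapter 1: Pain Points in Building an Enterprise Big Data Platform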
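During digital transformation, traditional companies commonly face three problems: data silos, wasted compute resources, and insufficient real-time processing capability. The Wisteria Manor Spark case study integrates data assets across departments through unified metadata management and Delta Lake, which is exactly the core requirement of an enterprise data middle platform. An architecture that combines Spark SQL with Hudi (Hadoop Upserts Deletes and Incrementals) overcomes the batch performance bottleneck of the traditional ETL (extract, transform, load) flow. How do you build a hybrid architecture that supports petabyte-scale offline computation and also meets millisecond-level real-time analysis requirements? That is the engineering question this video series sets out to answer.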
Chapter 2: Advanced Use of Spark Core Components
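The video takes apart tuning strategies for the Spark executor memory model. For the GC (garbage collection) pauses that enterprises commonly run into, it proposes a cache reuse mechanism based on RDD (resilient distributed dataset) lineage. In the shuffle optimization section, dynamically adjusting spark.sql.shuffle.partitions, combined with a data skew detection algorithm, made report generation for a financial client four times faster. The tutorial also shows an end-to-end Structured Streaming implementation for IoT device log processing, covering key points such as exactly-once guarantees and checkpoint-based recovery.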
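The text does not describe the skew detection algorithm itself. One simple heuristic, sketched below as an assumption rather than as the course's method, flags any key whose row count is far above the median count; Spark's adaptive skew-join handling uses a similar median-times-factor test on partition sizes. The key names, counts, and the factor of 5 are illustrative.

```python
from statistics import median

def find_skewed_keys(key_counts, factor=5.0):
    """Return keys whose row count exceeds `factor` times the median count."""
    if not key_counts:
        return []
    threshold = factor * median(key_counts.values())
    return sorted(key for key, count in key_counts.items() if count > threshold)

# Hypothetical per-key row counts from a sampled join column.
counts = {"cust_a": 1200, "cust_b": 900, "cust_c": 1100, "cust_d": 250000}
print(find_skewed_keys(counts))
```

Keys flagged this way are typical candidates for salting or for a separate broadcast join, so that one oversized partition does not dominate the stage.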
Chapter 3: High-Availability Architecture Design for Production
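For very large deployments with 2,000 or more nodes, the Wisteria Manor technical team adopts a layered resource scheduling system. By linking YARN (Yet Another Resource Negotiator) queue priority policies with K8s elastic scaling, it maintained a 99.99% SLA (service level agreement) for core business during the Double Eleven shopping festival. This part of the video reconstructs the full investigation of a ZooKeeper cluster split-brain problem and presents the HA (high availability) design reworked around the Raft consensus algorithm. For security controls, the concern enterprise users raise most often, the video lays out a complete path from Kerberos authentication to fine-grained RBAC (role-based access control).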
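Both ZooKeeper and Raft avoid split-brain with the same basic rule: a group of nodes may act only if it contains a strict majority of the configured ensemble. The sketch below shows that arithmetic. The function names and the example partition sizes are chosen for illustration.

```python
def quorum_size(ensemble_size):
    """Smallest number of nodes that forms a strict majority."""
    if ensemble_size <= 0:
        raise ValueError("ensemble_size must be positive")
    return ensemble_size // 2 + 1

def can_serve(partition_size, ensemble_size):
    """True if one side of a network partition still holds a majority."""
    return partition_size >= quorum_size(ensemble_size)

# A 5-node ensemble split 3/2: only the 3-node side may keep serving writes.
print(can_serve(3, 5), can_serve(2, 5))
# A 4-node ensemble split 2/2: neither side has a majority.
print(can_serve(2, 4))
```

Since at most one side of any partition can hold a majority, two leaders can never both commit writes. This is also why ensembles are usually sized with an odd number of nodes: a fourth node raises the quorum from 2 to 3 without tolerating any additional failure.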
Chapter 4: Big Data Governance in Practice
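For data quality control, the tutorial demonstrates a deep integration of the Great Expectations framework with Spark to build an automated pipeline for dataset completeness checks. For data lineage tracking, it uses the Apache Atlas metadata management system to build a visual lineage graph, which played a key role in a multinational group's GDPR compliance audit. Notably, the video also connects data governance with the machine learning platform, using dynamic feature monitoring to guard against model drift. The chapter further explains in detail how the ACID transaction properties of Delta Lake keep reads and writes consistent in an enterprise data warehouse.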
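To make the idea of an automated completeness check concrete, here is a minimal plain-Python sketch. It does not use the Great Expectations API; the function name, the result layout, and the sample rows are all assumptions made for illustration.

```python
def check_not_null(rows, column, min_fraction=1.0):
    """Check that at least `min_fraction` of rows have a non-null `column`."""
    total = len(rows)
    present = sum(1 for row in rows if row.get(column) is not None)
    # An empty dataset has nothing missing, so treat it as fully present.
    fraction = present / total if total else 1.0
    return {
        "column": column,
        "fraction_present": fraction,
        "success": fraction >= min_fraction,
    }

# Hypothetical order records; one is missing its customer id.
rows = [
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": 2, "customer_id": "c2"},
    {"order_id": 3, "customer_id": None},
    {"order_id": 4, "customer_id": "c4"},
]
result = check_not_null(rows, "customer_id", min_fraction=0.9)
print(result["fraction_present"], result["success"])
```

In a pipeline, a failed check like this one would typically block the downstream write or raise an alert rather than let incomplete data through.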
Chapter 5: Enterprise Development Standards and Productivity
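For continuous integration, Wisteria Manor proposes an automated packaging pipeline for Spark jobs based on Jenkins Pipeline. Its Spark-TEA (Test Environment Automation) framework handles automatic test data generation and multi-environment configuration management, and shortened an e-commerce client's release cycle by 60%. The video also systematically reviews columnar storage optimization techniques for the Parquet file format, along with cases of performance gains from Adaptive Query Execution in Spark 3.0. The chapter walks through the full build of a real-time anti-fraud system that processes one billion orders per day, covering the whole stack from coordinating Flink with Spark to developing a multi-dimensional feature engine.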
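One of the things Adaptive Query Execution does at runtime is merge small adjacent shuffle partitions toward an advisory size (spark.sql.adaptive.advisoryPartitionSizeInBytes). The sketch below is a simplified greedy version of that idea, written to illustrate the principle and not to mirror Spark's actual algorithm. The partition sizes and the target of 64 are made-up units.

```python
def coalesce_partitions(sizes, target_size):
    """Greedily group adjacent partitions so each group stays near target_size.

    Returns a list of groups, each a list of original partition indexes.
    A single partition larger than target_size is kept as its own group.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        if current and current_size + size > target_size:
            # Adding this partition would overshoot: close the current group.
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Hypothetical shuffle partition sizes in MB, with a 64 MB target.
print(coalesce_partitions([10, 20, 30, 70, 5, 5], target_size=64))
```

Six shuffle partitions become three tasks here, which is how coalescing reduces scheduling overhead when the configured partition count is too high for the data actually produced.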
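The value of this complete edition of the Wisteria Manor Spark practice videos lies in covering the last mile from open-source technology to enterprise deployment. It covers current architecture patterns such as batch-stream unification and the separation of compute and storage, and it also digs into key operational topics in production, including resource scheduling and disaster recovery. For companies planning a standardized data middle platform, the tutorial can serve as a complete technical implementation guide that helps teams quickly build a big data processing platform meeting finance-grade reliability requirements.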