Notes on the YARN Source Code (Part 5)

The previous post stopped abruptly at the scheduling of the ApplicationMaster; this one picks up from there.

private static final class AddApplicationToSchedulerTransition extends RMAppTransition {
		@Override
		public void transition(RMAppImpl app, RMAppEvent event) {
			app.handler.handle(new AppAddedSchedulerEvent(app.applicationId, app.submissionContext.getQueue(), app.user,
					app.submissionContext.getReservationID()));
		}
	}

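The previous post ended at this code. The app.handler in it is actually the event handler of rmDispatcher, on which the following registration was made: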

			rmDispatcher.register(SchedulerEventType.class, schedulerDispatcher);
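The registration above routes events by the enum class of their type. As a rough, self-contained sketch of that idea (ToyDispatcher, Event and the two small enums are hypothetical stand-ins, not YARN's real classes), it can be pictured like this:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy model of type-based event routing; every name here is hypothetical.
final class ToyDispatcher {
    enum SchedulerEventType { APP_ADDED, APP_ATTEMPT_ADDED }
    enum RMAppEventType { APP_ACCEPTED }

    // A toy event that carries nothing but its enum type.
    static final class Event {
        final Enum<?> type;
        Event(Enum<?> type) { this.type = type; }
    }

    // One handler per enum class, mirroring register(XxxEventType.class, handler).
    private final Map<Class<?>, Consumer<Event>> handlers = new HashMap<>();

    void register(Class<? extends Enum<?>> eventTypeClass, Consumer<Event> handler) {
        handlers.put(eventTypeClass, handler);
    }

    // Route the event to whichever handler was registered for its enum class.
    void dispatch(Event event) {
        Consumer<Event> handler = handlers.get(event.type.getDeclaringClass());
        if (handler == null) {
            throw new IllegalStateException("No handler registered for " + event.type);
        }
        handler.accept(event);
    }

    // Registers two handlers and records which one receives each event.
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        ToyDispatcher dispatcher = new ToyDispatcher();
        dispatcher.register(SchedulerEventType.class, e -> log.add("scheduler:" + e.type));
        dispatcher.register(RMAppEventType.class, e -> log.add("app:" + e.type));
        dispatcher.dispatch(new Event(SchedulerEventType.APP_ADDED));
        dispatcher.dispatch(new Event(RMAppEventType.APP_ACCEPTED));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of keying on the enum class is that the sender never needs to know who the receiver is; it only needs to pick the right event type.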

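Events of this kind have a SchedulerEventType, so they are handed to schedulerDispatcher, and tracing further they end up at the default CapacityScheduler. The event's constructor is shown below (the shorter overload used above delegates to it):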

public AppAddedSchedulerEvent(ApplicationId applicationId, String queue, String user, boolean isAppRecovering,
			ReservationId reservationID) {
		super(SchedulerEventType.APP_ADDED);
		this.applicationId = applicationId;
		this.queue = queue;
		this.user = user;
		this.reservationID = reservationID;
		this.isAppRecovering = isAppRecovering;
	}

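The constructor fixes the event type as APP_ADDED, which the handle method of CapacityScheduler matches here: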

case APP_ADDED: {
			AppAddedSchedulerEvent appAddedEvent = (AppAddedSchedulerEvent) event;
			String queueName = resolveReservationQueueName(appAddedEvent.getQueue(), appAddedEvent.getApplicationId(),
					appAddedEvent.getReservationID());
			if (queueName != null) {
				addApplication(appAddedEvent.getApplicationId(), queueName, appAddedEvent.getUser(),
						appAddedEvent.getIsAppRecovering());
			}
		}

The actual handling happens in addApplication; only the important part is excerpted:

// Submit to the queue
		try {
			queue.submitApplication(applicationId, user, queueName);
		} catch (AccessControlException ace) {
			// Ignore the exception for recovered app as the app was previously accepted
			if (!isAppRecovering) {
				LOG.info("Failed to submit application " + applicationId + " to queue " + queueName + " from user "
						+ user, ace);
				this.rmContext.getDispatcher().getEventHandler()
						.handle(new RMAppRejectedEvent(applicationId, ace.toString()));
				return;
			}
		}
		// update the metrics
		queue.getMetrics().submitApp(user);
		SchedulerApplication<FiCaSchedulerApp> application = new SchedulerApplication<FiCaSchedulerApp>(queue, user);
		applications.put(applicationId, application);
		LOG.info("Accepted application " + applicationId + " from user: " + user + ", in queue: " + queueName);
		if (isAppRecovering) {
			if (LOG.isDebugEnabled()) {
				LOG.debug(applicationId + " is recovering. Skip notifying APP_ACCEPTED");
			}
		} else {
			rmContext.getDispatcher().getEventHandler()
					.handle(new RMAppEvent(applicationId, RMAppEventType.APP_ACCEPTED));
		}

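The application is first submitted to its queue, and then an APP_ACCEPTED event is handed to the dispatcher held by rmContext, which is the RM-wide dispatcher. Looking up the event type leads to the handler: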

addTransition(RMAppState.SUBMITTED, RMAppState.ACCEPTED, RMAppEventType.APP_ACCEPTED,
							new StartAppAttemptTransition())

private static final class StartAppAttemptTransition extends RMAppTransition {
		@Override
		public void transition(RMAppImpl app, RMAppEvent event) {
			app.createAndStartNewAttempt(false);
		};
	}

private void createAndStartNewAttempt(boolean transferStateFromPreviousAttempt) {
		createNewAttempt();
		handler.handle(new RMAppStartAttemptEvent(currentAttempt.getAppAttemptId(), transferStateFromPreviousAttempt));
	}

Here an RMAppStartAttemptEvent is submitted, which starts an attempt to launch the ApplicationMaster. A single launch may fail, so the application is given several attempts.
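The addTransition registration above can be read as a lookup table keyed by (current state, event type). The following is only a minimal sketch of that reading, with hypothetical names (ToyAppStateMachine, Arc, attemptsStarted); the real state machine in YARN is considerably more general:

```java
import java.util.HashMap;
import java.util.Map;

// Toy (state, event) -> transition table; all names are hypothetical.
final class ToyAppStateMachine {
    enum State { SUBMITTED, ACCEPTED }
    enum EventType { APP_ACCEPTED, APP_REJECTED }

    // The hook plays the role of a transition class such as StartAppAttemptTransition.
    interface Transition { void transition(ToyAppStateMachine app); }

    private static final class Arc {
        final State postState;
        final Transition hook;
        Arc(State postState, Transition hook) {
            this.postState = postState;
            this.hook = hook;
        }
    }

    private final Map<String, Arc> table = new HashMap<>();
    State current = State.SUBMITTED;
    int attemptsStarted = 0;

    ToyAppStateMachine addTransition(State pre, State post, EventType on, Transition hook) {
        table.put(pre + "/" + on, new Arc(post, hook));
        return this;
    }

    // Look up the arc for the current state, run its hook, then move to the post state.
    void handle(EventType type) {
        Arc arc = table.get(current + "/" + type);
        if (arc == null) {
            throw new IllegalStateException("Cannot handle " + type + " in state " + current);
        }
        arc.hook.transition(this);
        current = arc.postState;
    }

    public static void main(String[] args) {
        ToyAppStateMachine app = new ToyAppStateMachine().addTransition(
                State.SUBMITTED, State.ACCEPTED, EventType.APP_ACCEPTED,
                a -> a.attemptsStarted++); // stands in for createAndStartNewAttempt
        app.handle(EventType.APP_ACCEPTED);
        System.out.println(app.current + " attemptsStarted=" + app.attemptsStarted);
    }
}
```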

Let's focus on the two calls inside the last method, createAndStartNewAttempt:

private void createNewAttempt() {
		ApplicationAttemptId appAttemptId = ApplicationAttemptId.newInstance(applicationId, attempts.size() + 1);
		RMAppAttempt attempt = new RMAppAttemptImpl(appAttemptId, rmContext, scheduler, masterService,
				submissionContext, conf,
				// The newly created attempt maybe last attempt if (number of
				// previously failed attempts(which should not include Preempted,
				// hardware error and NM resync) + 1) equal to the max-attempt
				// limit.
				maxAppAttempts == (getNumFailedAppAttempts() + 1), amReq);
		attempts.put(appAttemptId, attempt);
		currentAttempt = attempt;
	}

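Creating an attempt takes a lot of context. The constructor gathers what the ApplicationMaster needs, such as the scheduler, the master service and the submission context, all of which are closely tied to launching the ApplicationMaster.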
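Two small pieces of arithmetic in createNewAttempt are easy to miss: the attempt number is the count of existing attempts plus one, and the "may be the last attempt" flag is true when the failed attempts plus one equals the limit. A tiny sketch, with hypothetical helper names and an assumed limit of 2 chosen only for illustration:

```java
// Hypothetical helpers that restate the two expressions from createNewAttempt.
final class AttemptNumbering {
    // Mirrors attempts.size() + 1.
    static int nextAttemptNumber(int existingAttempts) {
        return existingAttempts + 1;
    }

    // Mirrors maxAppAttempts == (getNumFailedAppAttempts() + 1).
    static boolean mayBeLastAttempt(int maxAppAttempts, int failedAttempts) {
        return maxAppAttempts == failedAttempts + 1;
    }

    public static void main(String[] args) {
        int maxAppAttempts = 2; // assumed limit, for illustration only
        System.out.println("attempt " + nextAttemptNumber(0)
                + " last=" + mayBeLastAttempt(maxAppAttempts, 0));
        System.out.println("attempt " + nextAttemptNumber(1)
                + " last=" + mayBeLastAttempt(maxAppAttempts, 1));
    }
}
```

With a limit of 2, the first attempt is not flagged as last, while the attempt created after one counted failure is.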

The second call submits another event. Events of this type are dispatched as follows:

			rmDispatcher.register(RMAppAttemptEventType.class, new ApplicationAttemptEventDispatcher(rmContext));

addTransition(RMAppAttemptState.NEW, RMAppAttemptState.SUBMITTED, RMAppAttemptEventType.START,
							new AttemptStartedTransition())

private static final class AttemptStartedTransition extends BaseTransition {
		@Override
		public void transition(RMAppAttemptImpl appAttempt, RMAppAttemptEvent event) {

			boolean transferStateFromPreviousAttempt = false;
			if (event instanceof RMAppStartAttemptEvent) {
				transferStateFromPreviousAttempt = ((RMAppStartAttemptEvent) event)
						.getTransferStateFromPreviousAttempt();
			}
			appAttempt.startTime = System.currentTimeMillis();

			// Register with the ApplicationMasterService
			appAttempt.masterService.registerAppAttempt(appAttempt.applicationAttemptId);

			if (UserGroupInformation.isSecurityEnabled()) {
				appAttempt.clientTokenMasterKey = appAttempt.rmContext.getClientToAMTokenSecretManager()
						.createMasterKey(appAttempt.applicationAttemptId);
			}

			// Add the applicationAttempt to the scheduler and inform the scheduler
			// whether to transfer the state from previous attempt.
			appAttempt.eventHandler.handle(new AppAttemptAddedSchedulerEvent(appAttempt.applicationAttemptId,
					transferStateFromPreviousAttempt));
		}
	}

Let's look at this event:

public AppAttemptAddedSchedulerEvent(ApplicationAttemptId applicationAttemptId,
			boolean transferStateFromPreviousAttempt, boolean isAttemptRecovering) {
		super(SchedulerEventType.APP_ATTEMPT_ADDED);
		this.applicationAttemptId = applicationAttemptId;
		this.transferStateFromPreviousAttempt = transferStateFromPreviousAttempt;
		this.isAttemptRecovering = isAttemptRecovering;
	}

Clearly, the submitted event has type SchedulerEventType.APP_ATTEMPT_ADDED. Let's follow its handle path step by step:

		this.eventHandler = rmContext.getDispatcher().getEventHandler();

So the event is handed to the RM-wide dispatcher:

			rmDispatcher.register(SchedulerEventType.class, schedulerDispatcher);

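By event type, the matching handler is the RM's internal SchedulerEventDispatcher. Its handle method only enqueues the event; the real work is done by its inner EventProcessor, a class that implements Runnable so that events are processed asynchronously: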

						scheduler.handle(event);
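The asynchronous hand-off described above is a classic producer/consumer arrangement. The sketch below is a simplified model under stated assumptions: events are plain strings, the scheduler is a Consumer, and shutdown uses a sentinel value. ToyAsyncEventDispatcher and stopAndDrain are hypothetical names, not the RM's API.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Toy model: handle() enqueues, a background Runnable dequeues and calls the scheduler.
final class ToyAsyncEventDispatcher {
    private static final String STOP = "__STOP__"; // sentinel, an assumption of this sketch
    private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
    private final Consumer<String> scheduler;
    private final Thread processorThread;

    ToyAsyncEventDispatcher(Consumer<String> scheduler) {
        this.scheduler = scheduler;
        this.processorThread = new Thread(new EventProcessor(), "toy-event-processor");
    }

    // Plays the role of the inner EventProcessor: it runs on its own thread.
    private final class EventProcessor implements Runnable {
        @Override
        public void run() {
            try {
                while (true) {
                    String event = eventQueue.take(); // blocks until an event arrives
                    if (STOP.equals(event)) {
                        return;
                    }
                    scheduler.accept(event); // corresponds to scheduler.handle(event)
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    void start() {
        processorThread.start();
    }

    // The caller returns immediately; processing happens on the other thread.
    void handle(String event) {
        eventQueue.add(event);
    }

    // Enqueue the sentinel and wait until everything before it has been processed.
    void stopAndDrain() {
        eventQueue.add(STOP);
        try {
            processorThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static List<String> demo() {
        List<String> handled = new CopyOnWriteArrayList<>();
        ToyAsyncEventDispatcher dispatcher = new ToyAsyncEventDispatcher(handled::add);
        dispatcher.start();
        dispatcher.handle("APP_ADDED");
        dispatcher.handle("APP_ATTEMPT_ADDED");
        dispatcher.stopAndDrain();
        return handled;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because a single thread drains the queue, events reach the scheduler in the order they were submitted, and the submitting thread is never blocked by scheduler work.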

The scheduler here is the RM-wide CapacityScheduler. Here is how it handles the event:

case APP_ATTEMPT_ADDED: {
			AppAttemptAddedSchedulerEvent appAttemptAddedEvent = (AppAttemptAddedSchedulerEvent) event;
			addApplicationAttempt(appAttemptAddedEvent.getApplicationAttemptId(),
					appAttemptAddedEvent.getTransferStateFromPreviousAttempt(),
					appAttemptAddedEvent.getIsAttemptRecovering());
		}

It looks as if a new event will be built for further dispatching. Here is the addApplicationAttempt method:

		SchedulerApplication<FiCaSchedulerApp> application = applications.get(applicationAttemptId.getApplicationId());
		CSQueue queue = (CSQueue) application.getQueue();

		FiCaSchedulerApp attempt = new FiCaSchedulerApp(applicationAttemptId, application.getUser(), queue,
				queue.getActiveUsersManager(), rmContext);
		if (transferStateFromPreviousAttempt) {
			attempt.transferStateFromPreviousAttempt(application.getCurrentAppAttempt());
		}
		application.setCurrentAppAttempt(attempt);

		queue.submitApplicationAttempt(attempt, application.getUser());
		LOG.info("Added Application Attempt " + applicationAttemptId + " to scheduler from user "
				+ application.getUser() + " in queue " + queue.getQueueName());
		if (isAttemptRecovering) {
			if (LOG.isDebugEnabled()) {
				LOG.debug(applicationAttemptId + " is recovering. Skipping notifying ATTEMPT_ADDED");
			}
		} else {
			rmContext.getDispatcher().getEventHandler()
					.handle(new RMAppAttemptEvent(applicationAttemptId, RMAppAttemptEventType.ATTEMPT_ADDED));
		}

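Skipping the earlier lines, the part to focus on is the continued dispatching: another event is handed to the RM-wide dispatcher, this time of type RMAppAttemptEventType.ATTEMPT_ADDED. Events of this type are registered to: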

	rmDispatcher.register(RMAppAttemptEventType.class, new ApplicationAttemptEventDispatcher(rmContext));

Here is how ApplicationAttemptEventDispatcher handles the event:

			ApplicationAttemptId appAttemptID = event.getApplicationAttemptId();
			ApplicationId appAttemptId = appAttemptID.getApplicationId();
			RMApp rmApp = this.rmContext.getRMApps().get(appAttemptId);
			if (rmApp != null) {
				RMAppAttempt rmAppAttempt = rmApp.getRMAppAttempt(appAttemptID);
				if (rmAppAttempt != null) {
					try {
						rmAppAttempt.handle(event);
					} catch (Throwable t) {
						LOG.error("Error in handling event type " + event.getType() + " for applicationAttempt "
								+ appAttemptId, t);
					}
				}
			}

The rmAppAttempt here is an RMAppAttemptImpl. Its handle method is:

		this.writeLock.lock();

		try {
			ApplicationAttemptId appAttemptID = event.getApplicationAttemptId();
			LOG.debug("Processing event for " + appAttemptID + " of type " + event.getType());
			final RMAppAttemptState oldState = getAppAttemptState();
			try {
				/* keep the master in sync with the state machine */
				this.stateMachine.doTransition(event.getType(), event);
			} catch (InvalidStateTransitonException e) {
				LOG.error("Can't handle this event at current state", e);
				/* TODO fail the application on the failed transition */
			}

			if (oldState != getAppAttemptState()) {
				LOG.info(appAttemptID + " State change from " + oldState + " to " + getAppAttemptState());
			}
		} finally {
			this.writeLock.unlock();
		}

This goes through the state machine. The attempt is currently in state RMAppAttemptState.SUBMITTED, and the submitted event is RMAppAttemptEventType.ATTEMPT_ADDED:

addTransition(RMAppAttemptState.SUBMITTED,
							EnumSet.of(RMAppAttemptState.LAUNCHED_UNMANAGED_SAVING, RMAppAttemptState.SCHEDULED),
							RMAppAttemptEventType.ATTEMPT_ADDED, new ScheduleTransition())

Here is the code of ScheduleTransition:

public RMAppAttemptState transition(RMAppAttemptImpl appAttempt, RMAppAttemptEvent event) {
			ApplicationSubmissionContext subCtx = appAttempt.submissionContext;
			if (!subCtx.getUnmanagedAM()) {
				// Need reset #containers before create new attempt, because this request
				// will be passed to scheduler, and scheduler will deduct the number after
				// AM container allocated

				// Currently, following fields are all hard code,
				// TODO: change these fields when we want to support
				// priority/resource-name/relax-locality specification for AM containers
				// allocation.
				appAttempt.amReq.setNumContainers(1);
				appAttempt.amReq.setPriority(AM_CONTAINER_PRIORITY);
				appAttempt.amReq.setResourceName(ResourceRequest.ANY);
				appAttempt.amReq.setRelaxLocality(true);

				// AM resource has been checked when submission
				Allocation amContainerAllocation = appAttempt.scheduler.allocate(appAttempt.applicationAttemptId,
						Collections.singletonList(appAttempt.amReq), EMPTY_CONTAINER_RELEASE_LIST, null, null);
				if (amContainerAllocation != null && amContainerAllocation.getContainers() != null) {
					assert (amContainerAllocation.getContainers().size() == 0);
				}
				return RMAppAttemptState.SCHEDULED;
			} else {
				// save state and then go to LAUNCHED state
				appAttempt.storeAttempt();
				return RMAppAttemptState.LAUNCHED_UNMANAGED_SAVING;
			}
		}
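The EnumSet in the registration and the state returned by transition fit together: the hook chooses the post state at run time, and the state machine only accepts a choice that was declared up front. A minimal sketch of that contract, with hypothetical names (ToyMultiArc, doTransition) and the unmanaged-AM flag reduced to a boolean:

```java
import java.util.EnumSet;
import java.util.Set;

// Toy model of a transition with several possible post states; names are hypothetical.
final class ToyMultiArc {
    enum AttemptState { SUBMITTED, SCHEDULED, LAUNCHED_UNMANAGED_SAVING }

    // Post states declared when the transition is registered.
    static final Set<AttemptState> VALID_POST_STATES =
            EnumSet.of(AttemptState.LAUNCHED_UNMANAGED_SAVING, AttemptState.SCHEDULED);

    // Mirrors the branch in ScheduleTransition: a managed AM goes to SCHEDULED.
    static AttemptState scheduleTransition(boolean unmanagedAM) {
        return unmanagedAM ? AttemptState.LAUNCHED_UNMANAGED_SAVING : AttemptState.SCHEDULED;
    }

    // Run the hook, then verify that it returned one of the declared post states.
    static AttemptState doTransition(boolean unmanagedAM) {
        AttemptState post = scheduleTransition(unmanagedAM);
        if (!VALID_POST_STATES.contains(post)) {
            throw new IllegalStateException("Undeclared post state " + post);
        }
        return post;
    }

    public static void main(String[] args) {
        System.out.println("managed AM   -> " + doTransition(false));
        System.out.println("unmanaged AM -> " + doTransition(true));
    }
}
```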

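This is the key step: once scheduling starts, the attempt has to try to obtain resources and get ready to launch on the assigned container. The resource allocation call is: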

Allocation amContainerAllocation = appAttempt.scheduler.allocate(appAttempt.applicationAttemptId,
						Collections.singletonList(appAttempt.amReq), EMPTY_CONTAINER_RELEASE_LIST, null, null);

Tracing into the scheduler's allocate method, the key lines are:

		FiCaSchedulerApp application = getApplicationAttempt(applicationAttemptId);

return application.getAllocation(getResourceCalculator(), clusterResource, getMinimumResourceCapability());

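Skipping the sanity checks and container releases that come first, look at these two statements: the allocation is delegated to another scheduler-side object, the FiCaSchedulerApp. The logic is somewhat involved, so let's go through it thoroughly.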

Earlier we called the addApplication method of CapacityScheduler, which contains these two lines:

SchedulerApplication<FiCaSchedulerApp> application = new SchedulerApplication<FiCaSchedulerApp>(queue, user);
		applications.put(applicationId, application);

After that we called addApplicationAttempt, which creates a new FiCaSchedulerApp:

SchedulerApplication<FiCaSchedulerApp> application = applications.get(applicationAttemptId.getApplicationId());
		CSQueue queue = (CSQueue) application.getQueue();

		FiCaSchedulerApp attempt = new FiCaSchedulerApp(applicationAttemptId, application.getUser(), queue,
				queue.getActiveUsersManager(), rmContext);
		if (transferStateFromPreviousAttempt) {
			attempt.transferStateFromPreviousAttempt(application.getCurrentAppAttempt());
		}
		application.setCurrentAppAttempt(attempt);

This is the FiCaSchedulerApp used later on. Here is its constructor:

public FiCaSchedulerApp(ApplicationAttemptId applicationAttemptId, String user, Queue queue,
			ActiveUsersManager activeUsersManager, RMContext rmContext) {
		super(applicationAttemptId, user, queue, activeUsersManager, rmContext);

		RMApp rmApp = rmContext.getRMApps().get(getApplicationId());

		Resource amResource;
		if (rmApp == null || rmApp.getAMResourceRequest() == null) {
			// the rmApp may be undefined (the resource manager checks for this too)
			// and unmanaged applications do not provide an amResource request
			// in these cases, provide a default using the scheduler
			amResource = rmContext.getScheduler().getMinimumResourceCapability();
		} else {
			amResource = rmApp.getAMResourceRequest().getCapability();
		}

		setAMResource(amResource);
	}

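One thing to note in this constructor: if no resource was specified for the ApplicationMaster at submission time, the scheduler's minimum allocation is used, so it is usually worth configuring explicitly. The minimum allocation is read from configuration on the CapacityScheduler side: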

public Resource getMinimumAllocation() {
		int minimumMemory = getInt(YarnConfiguration.RM_SCHEDULER_MINIMUM_ALLOCATION_MB,
				YarnConfiguration.DEFAULT_RM_SCHEDULER_MINIMUM_ALLOCATION_MB);
		int minimumCores = getInt(YarnConfiguration.RM_SCHEDULER_MINIMUM_ALLOCATION_VCORES,
				YarnConfiguration.DEFAULT_RM_SCHEDULER_MINIMUM_ALLOCATION_VCORES);
		return Resources.createResource(minimumMemory, minimumCores);
	}

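Tracing the defaults, the minimum memory is 1 GB (1024 MB) and the minimum core count is 1. Because these values are loaded from the YARN configuration, the defaults can be overridden there through yarn.scheduler.minimum-allocation-mb and yarn.scheduler.minimum-allocation-vcores.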
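The fallback in the FiCaSchedulerApp constructor boils down to "use the requested AM resource if there is one, otherwise the scheduler minimum". A minimal sketch of that rule, where Resource is a hypothetical two-field stand-in for YARN's Resource and the minimum uses the 1024 MB / 1 core values mentioned above:

```java
// Toy restatement of the AM resource fallback; Resource here is a hypothetical stand-in.
final class AmResourceDefault {
    static final class Resource {
        final int memoryMb;
        final int vcores;
        Resource(int memoryMb, int vcores) {
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }
        @Override
        public String toString() {
            return memoryMb + "MB/" + vcores + "vcores";
        }
    }

    // If no AM resource request exists (null), fall back to the scheduler minimum.
    static Resource amResource(Resource requested, Resource schedulerMinimum) {
        return requested == null ? schedulerMinimum : requested;
    }

    public static void main(String[] args) {
        Resource minimum = new Resource(1024, 1); // minimum values quoted in the text
        System.out.println(amResource(null, minimum));
        System.out.println(amResource(new Resource(2048, 2), minimum));
    }
}
```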

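Note that once the allocate call in ScheduleTransition has succeeded, the state of RMAppAttemptImpl becomes RMAppAttemptState.SCHEDULED. Back in getAllocation, the method ends with: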

ContainersAndNMTokensAllocation allocation = pullNewlyAllocatedContainersAndNMTokens();
		return new Allocation(allocation.getContainerList(), getHeadroom(), null, currentContPreemption,
				Collections.singletonList(rr), allocation.getNMTokenList());

Before the new Allocation is returned, one call matters a great deal: pullNewlyAllocatedContainersAndNMTokens, shown above. Stepping into it:

		List<Container> returnContainerList = new ArrayList<Container>(newlyAllocatedContainers.size());
		List<NMToken> nmTokens = new ArrayList<NMToken>();
		for (Iterator<RMContainer> i = newlyAllocatedContainers.iterator(); i.hasNext();) {
			RMContainer rmContainer = i.next();
			Container container = rmContainer.getContainer();
			try {
				// create container token and NMToken altogether.
				container.setContainerToken(rmContext.getContainerTokenSecretManager().createContainerToken(
						container.getId(), container.getNodeId(), getUser(), container.getResource(),
						container.getPriority(), rmContainer.getCreationTime(), this.logAggregationContext));
				NMToken nmToken = rmContext.getNMTokenSecretManager().createAndGetNMToken(getUser(),
						getApplicationAttemptId(), container);
				if (nmToken != null) {
					nmTokens.add(nmToken);
				}
			} catch (IllegalArgumentException e) {
				// DNS might be down, skip returning this container.
				LOG.error("Error trying to assign container token and NM token to" + " an allocated container "
						+ container.getId(), e);
				continue;
			}
			returnContainerList.add(container);
			i.remove();
			rmContainer.handle(new RMContainerEvent(rmContainer.getContainerId(), RMContainerEventType.ACQUIRED));
		}
		return new ContainersAndNMTokensAllocation(returnContainerList, nmTokens);
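The loop above uses an explicit Iterator so that a container can be removed from the pending list only after its tokens were created; a container whose token creation fails is skipped and stays pending. A stripped-down sketch of that pattern, with containers reduced to strings and a hypothetical createToken that fails for ids starting with "bad":

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy version of the pull loop; container ids are strings and createToken is hypothetical.
final class PullAllocated {
    // Moves every container whose token can be created from the pending list to the result.
    static List<String> pull(List<String> newlyAllocated) {
        List<String> returned = new ArrayList<>(newlyAllocated.size());
        for (Iterator<String> i = newlyAllocated.iterator(); i.hasNext();) {
            String containerId = i.next();
            try {
                createToken(containerId);
            } catch (IllegalArgumentException e) {
                // Token creation failed: skip it and leave it in the pending list.
                continue;
            }
            returned.add(containerId);
            i.remove(); // safe removal while iterating
        }
        return returned;
    }

    // Stand-in for token creation; the failure rule is an assumption of this sketch.
    static String createToken(String containerId) {
        if (containerId.startsWith("bad")) {
            throw new IllegalArgumentException("cannot resolve host for " + containerId);
        }
        return "token-" + containerId;
    }

    public static void main(String[] args) {
        List<String> pending = new ArrayList<>(Arrays.asList("c1", "bad-host", "c2"));
        System.out.println("returned=" + pull(pending) + " stillPending=" + pending);
    }
}
```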

Pay attention to the event sent at the end of the loop: every rmContainer receives an ACQUIRED event, and that event is handled by this transition:

.addTransition(RMContainerState.ALLOCATED, RMContainerState.ACQUIRED, RMContainerEventType.ACQUIRED,
							new AcquiredTransition())

Note that this is a transition in the state machine of RMContainerImpl. It goes from ALLOCATED to ACQUIRED because the container already moved from NEW to ALLOCATED earlier, while the resources were being allocated. Here is what AcquiredTransition does:

// Tell the app
			container.eventHandler.handle(new RMAppRunningOnNodeEvent(
					container.getApplicationAttemptId().getApplicationId(), container.nodeId));

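This is the key part. The container here is an RMContainerImpl, and the call goes to the handle method of its eventHandler. Since RMAppRunningOnNodeEvent is an RMAppEventType event, it is routed by this registration: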

// Register event handler for RmAppEvents
			rmDispatcher.register(RMAppEventType.class, new ApplicationEventDispatcher(rmContext));

The matching handling is found in the state machine transitions:

addTransition(RMAppState.ACCEPTED, RMAppState.ACCEPTED, RMAppEventType.APP_RUNNING_ON_NODE,
							new AppRunningOnNodeTransition())

Here is that transition:

		public void transition(RMAppImpl app, RMAppEvent event) {
			RMAppRunningOnNodeEvent nodeAddedEvent = (RMAppRunningOnNodeEvent) event;

			// if final state already stored, notify RMNode
			if (isAppInFinalState(app)) {
				app.handler
						.handle(new RMNodeCleanAppEvent(nodeAddedEvent.getNodeId(), nodeAddedEvent.getApplicationId()));
				return;
			}

			// otherwise, add it to ranNodes for further process
			app.ranNodes.add(nodeAddedEvent.getNodeId());
		};

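Without doubt, the normal case takes the second branch. app.ranNodes simply records the nodes the application has run on; I won't go into it further here.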
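The two branches of this transition can be summed up in a few lines. In the sketch below node ids are strings, the set type and the cleanupRequests list are assumptions made for illustration, and ToyRanNodes is a hypothetical name:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy restatement of AppRunningOnNodeTransition; all names and types are hypothetical.
final class ToyRanNodes {
    final Set<String> ranNodes = new LinkedHashSet<>();        // nodes the app has run on
    final List<String> cleanupRequests = new ArrayList<>();    // stands in for RMNodeCleanAppEvent
    boolean inFinalState = false;

    void onRunningOnNode(String nodeId) {
        if (inFinalState) {
            // The app is already finished: ask the node to clean up right away.
            cleanupRequests.add(nodeId);
            return;
        }
        // Otherwise just remember the node; a set ignores duplicates.
        ranNodes.add(nodeId);
    }

    public static void main(String[] args) {
        ToyRanNodes app = new ToyRanNodes();
        app.onRunningOnNode("node1");
        app.onRunningOnNode("node1");
        app.inFinalState = true;
        app.onRunningOnNode("node2");
        System.out.println("ranNodes=" + app.ranNodes + " cleanup=" + app.cleanupRequests);
    }
}
```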

Picking up the earlier logic again:

// AM resource has been checked when submission
				Allocation amContainerAllocation = appAttempt.scheduler.allocate(appAttempt.applicationAttemptId,
						Collections.singletonList(appAttempt.amReq), EMPTY_CONTAINER_RELEASE_LIST, null, null);
				if (amContainerAllocation != null && amContainerAllocation.getContainers() != null) {
					assert (amContainerAllocation.getContainers().size() == 0);
				}
				return RMAppAttemptState.SCHEDULED;

Here the attempt moves to RMAppAttemptState.SCHEDULED. The next transition out of that state is:

addTransition(RMAppAttemptState.SCHEDULED,
							EnumSet.of(RMAppAttemptState.ALLOCATED_SAVING, RMAppAttemptState.SCHEDULED),
							RMAppAttemptEventType.CONTAINER_ALLOCATED, new AMContainerAllocatedTransition())

Tracing further, we arrive at the code of AMContainerAllocatedTransition:

// Set the masterContainer
			appAttempt.setMasterContainer(amContainerAllocation.getContainers().get(0));
			RMContainerImpl rmMasterContainer = (RMContainerImpl) appAttempt.scheduler
					.getRMContainer(appAttempt.getMasterContainer().getId());
			rmMasterContainer.setAMContainer(true);
			// The node set in NMTokenSecrentManager is used for marking whether the
			// NMToken has been issued for this node to the AM.
			// When AM container was allocated to RM itself, the node which allocates
			// this AM container was marked as the NMToken already sent. Thus,
			// clear this node set so that the following allocate requests from AM are
			// able to retrieve the corresponding NMToken.
			appAttempt.rmContext.getNMTokenSecretManager().clearNodeSetForAttempt(appAttempt.applicationAttemptId);
			appAttempt.getSubmissionContext().setResource(appAttempt.getMasterContainer().getResource());
			appAttempt.storeAttempt();
			return RMAppAttemptState.ALLOCATED_SAVING;

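As the name suggests, this transition handles the allocation of the AM container. It looks up the RMContainerImpl for the master container from the scheduler and marks it as the AM container, which means the interaction with the NodeManager is about to begin: the NodeManager will be asked to launch the ApplicationMaster. Before that, the attempt is stored: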

private void storeAttempt() {
		// store attempt data in a non-blocking manner to prevent dispatcher
		// thread starvation and wait for state to be saved
		LOG.info("Storing attempt: AppId: " + getAppAttemptId().getApplicationId() + " AttemptId: " + getAppAttemptId()
				+ " MasterContainer: " + masterContainer);
		rmContext.getStateStore().storeNewApplicationAttempt(this);
	}

This method hands the current RMAppAttemptImpl to the RM's state store, the RM's "housekeeper", which persists the attempt.

Finally, the transition returns a new state: RMAppAttemptState.ALLOCATED_SAVING.

That is all for this post. The next one looks at the transitions that follow from this state.
