文章目录
1. 问题现象
调试环境:单板(AM335X+linux) --------(GE以太网)----------- 控制器(AM5728+vxworks)
出错场景:单板作为ftp client(172.16.89.4),控制器作为ftp server(172.16.89.71),单板在进行ftp数据下载的时候会经常出现速度突然变慢,有时会极度卡顿,最恶劣情况会传输失败。
m_Ftp_Connect_Login_Server_Function
mmmmm "172.16.89.71" 21
wait for command finished 0
connecting to the Ftp server...
"FTP now stateeeeeee:" 1 "主机名查找正在进行中"
"FTP now stateeeeeee:" 2 "正在尝试连接到主机"
"FTP now stateeeeeee:" 3 "已实现与主机的连接"
wait for command finished 1
m_ftp->currentCommand() 3 "Unknown error"
connectToHost
the FTP_server is successfully connected
logging in...
"Unknown error"
wait for command finished 2
"FTP now stateeeeeee:" 4 "已实现连接和用户登录"
m_ftp->currentCommand() 4 "Unknown error"
Login
the login FTP_server is successful
11111------------------------------------------------------------------------------------------
start!-----------------------------------------------------------!start
uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu "/mmc1:2/robot-folder"
cd file from server...
m_ftp->currentCommand() 7 "Unknown error"
cd
open file path successfully
m_ftp->currentCommand() 6 "Unknown error"
close
cd file from server...
m_ftp->currentCommand() 7 "Unknown error"
cd
open file path successfully
...
2. 分析过程
因为单板作为ftp client和其他控制单板传输的过程中未碰到这类问题,所以我们排查重点放在目前的控制器单板侧。
Step 1. 排查系统负载过重
首先怀疑到的是server端在出现卡顿的时候,是不是任务负载过重,造成调度不过来的情况。
首先启动vxworks本地应用,并且开启单板的ftp下载程序。
// 启动应用
cd"/mmc1:1"
ld<pruss_eth.out
prussStart
ld<codesyscontrol.out
PlcStart"CODESYSControl.cfg"
我们通过spy
命令查看vxworks的cpu占用率,发现在复现卡顿现象的时候cpu占用率非常低,idle都是百分之九十几:
-> spy
value = 548438048 = 0x20b08020 = ' '
->
NAME ENTRY TID PRI total % (ticks) delta % (ticks)
------------ ------------ ---------- --- --------------- ---------------
tIsr0 0x3ec07c 0x2027f380 0 0% ( 1) 0% ( 1)
tJobTask 0x30eba4 0x204f35c0 0 0% ( 0) 0% ( 0)
tExcTask 0x30e544 0x5461a8 0 0% ( 0) 0% ( 0)
tLogTask logTask 0x204f3b08 0 0% ( 0) 0% ( 0)
tShell0 shellTask 0x20446c80 1 0% ( 0) 0% ( 0)
tHrfsCommit: hrfsCommitTa 0x20473b10 2 0% ( 0) 0% ( 0)
tSpyTask spyComTask 0x20b08020 5 0% ( 9) 0% ( 3)
tErfTask 0x255664 0x2022aae8 10 0% ( 0) 0% ( 0)
tVxdbgTask 0x298384 0x20431868 25 0% ( 0) 0% ( 0)
CAAEventTask 0x108da24 0x20a7c430 30 0% ( 0) 0% ( 0)
SchedExcepti 0x108da24 0x20a7c020 31 0% ( 0) 0% ( 0)
tDmaJobTask jobQueueProc 0x20000640 45 0% ( 0) 0% ( 0)
tTcfEvents 0x3bba88 0x551668 49 0% ( 0) 0% ( 0)
tTcf 0x2bfa60 0x20449238 49 0% ( 1) 0% ( 0)
tTcf 0x2bfa60 0x20455858 49 0% ( 0) 0% ( 0)
tTcf 0x2bfa60 0x20458a90 49 0% ( 0) 0% ( 0)
tNet0 ipcomNetTask 0x20232630 50 0% ( 0) 0% ( 0)
tNetConf 0x1b0974 0x204165c0 50 0% ( 0) 0% ( 0)
ipftps ipftps_main 0x20428348 50 0% ( 0) 0% ( 0)
tAnalysisAge cafe_event_t 0x53dea8 50 0% ( 0) 0% ( 0)
SchedProcess 0x108da24 0x20a68af0 64 0% ( 5) 0% ( 2)
xHCD_IH0 usbXhcdInter 0x204db0d0 90 0% ( 0) 0% ( 0)
xHCD_IH1 usbXhcdInter 0x205d5348 90 0% ( 0) 0% ( 0)
BlkDrvUdp 0x108da24 0x20a90020 95 0% ( 1) 0% ( 0)
BlkDrvTcp 0x108da24 0x20a90298 95 0% ( 5) 0% ( 5)
tiCardMon 0x109924 0x202870d0 100 0% ( 0) 0% ( 0)
tiCardMon 0x109924 0x202879d8 100 0% ( 0) 0% ( 0)
BusM A usbHubThread 0x204c54a0 100 0% ( 2) 0% ( 0)
BusM B usbHubThread 0x204c5c08 100 0% ( 0) 0% ( 0)
BusM C usbHubThread 0x204eb7a0 100 0% ( 2) 0% ( 0)
BusM D usbHubThread 0x205d5050 100 0% ( 0) 0% ( 0)
tTffsPTask flPollTask 0x206929b0 100 0% ( 0) 0% ( 0)
/mmc1XbdSvc 0x30c984 0x2045e010 100 0% ( 0) 0% ( 0)
SupervisorCy 0x108da24 0x20aa4020 128 0% ( 0) 0% ( 0)
OPCUAServer 0x108da24 0x20aa4ad8 128 0% ( 2) 0% ( 0)
tUsb2Clr 0x3d1aa8 0x20692d10 150 0% ( 0) 0% ( 0)
WebServerClo 0x108da24 0x20a7c788 150 0% ( 1) 0% ( 0)
CommCtrl CommControlL 0x20491850 161 0% ( 0) 0% ( 0)
vnSyncer vn_sync_task 0x20464988 200 0% ( 0) 0% ( 0)
miiBusMonito 0x24f3b8 0x204f32a0 252 0% ( 3) 0% ( 0)
MainLoop 0x108da24 0x20a90510 254 0% ( 0) 0% ( 0)
KERNEL 0% ( 4) 0% ( 4)
INTERRUPT 0% ( 0) 0% ( 0)
IDLE 98% ( 2045) 97% ( 512)
TOTAL 98% ( 2081) 97% ( 527)
总结:和系统负载无关。
Step 2. 排查文件系统读写速度
会不会是文件系统的读写速度变慢拖累了ftp速度。
vxworks下一共有两个分区,"/mmc1:1"是dosfs文件格式,"/mmc1:2"是hrfs文件格式:
xbdCreatePartition ("/mmc1:1", 2, 50, 0, 0)
dosFsVolFormat ("/mmc1:1", 0, 0)
hrfsFormat ("/mmc1:2", 0ll, 0, 0)
我们测试两个分区的文件读写速度,发现读速度还行,但是写速度非常慢。
[vxWorks *]# ls -l
...
-rwxrwxrwx 1 0 0 21774591 Jan 1 1980 codesyscontrol.out
-> timex(cp,"/mmc1:1/codesyscontrol.out","/mmc1:1/codesyscontrol.out.bak")
copying file /mmc1:1/codesyscontrol.out -> /mmc1:1/codesyscontrol.out.bak
timex: time of execution = 17873 +/- 0 (0%) millisecs
value = 54 = 0x36 = '6'
->
->
-> timex(cp,"/mmc1:2/codesyscontrol.out","/mmc1:2/codesyscontrol.out.bak")
copying file /mmc1:2/codesyscontrol.out -> /mmc1:2/codesyscontrol.out.bak
timex: time of execution = 137 +/- 0 (0%) secs
value = 47 = 0x2f = '/'
->
ftp使用的是"/mmc1:2"分区,出现故障的时候只是在进行ftp下载,对文件系统来说是读操作,但是是不是别的地方的写操作阻塞了读操作?
总结:这里也是一个怀疑点,但是后续的其他测试排除了和它相关。
Step 3. 排查网口0、网口1
ftp server端有两个GE网口,ftp下载一直使用的是网口0。
在卡顿故障复现的时候,单板通过网口0(cpsw0)和AM5728的ftp通讯非常卡顿,这时使用pc作为ftp client通过网口1(cpsw1)和AM5728的ftp通讯非常通畅无问题,把pc换到网口0(cpsw0)ftp传输一样的卡顿。
这一下子就排除掉了很多的因素,说明故障集中在AM5728的网口0传输链上:网卡驱动、网络协议栈、硬件走线都有嫌疑。
总结:缩小定位范围,故障集中在AM5728的网口0传输链上。
Step 4. 排查网卡驱动和网络协议栈
把单板换成通过网口1(cpsw1)和AM5728进行ftp通讯。
使用同样的ftp测试,连续测试一天无问题出现。
因为网口0、网口1的网卡驱动和网络协议栈部分基本一致,所以网口0的硬件走线嫌疑比较大。
总结:进一步缩小范围,走线信号嫌疑较大,但是还缺乏直接证据。
Step 5. 排查网口0的走线信号
我们打算读出网口0的mac层统计看看它的错误统计,如果因为走线信号质量问题产生的各种错误,这里应该能看到错误统计。
因为原生驱动代码没有支持,我们增加相应的函数:
diff --git a/vxbFdtTiCpswEnd.c b/vxbFdtTiCpswEnd.c
index e60c507..4c32f0e 100644 (file)
--- a/vxbFdtTiCpswEnd.c
+++ b/vxbFdtTiCpswEnd.c
@@ -256,6 +257,7 @@ LOCAL char * rxChanErrMsg [] =
/* common resource which is shared by all port */
LOCAL CPSW_SW_CTRL * pSwCtrl = NULL;
+CPSW_STAT tmp_cpswStat;
/* import functions */
@@ -2872,6 +2874,70 @@ LOCAL STATUS cpswEndStatsDump
return OK;
}
+STATUS cpswEndStatsShow
+ (
+ )
+ {
+ int i;
+ VXB_DEV_ID pDev = pSwCtrl->pDev;
+ UINT32 * temp = (UINT32 *)&tmp_cpswStat;
+
+ /*
+ * hardware statistic counters are write-to-decrement,
+ * after a read, we write the value read to clear
+ * the counters
+ */
+
+ for (i = 0; i < sizeof (CPSW_STAT) / sizeof (UINT32); i++)
+ {
+ *temp = CSR_READ_4 (pDev, pSwCtrl->statsOffset + i*4);
+ //CSR_WRITE_4 (pDev, pSwCtrl->statsOffset + i*4, *temp);
+ temp++;
+ }
+
+ (* _func_kprintf)("cpswMacStat rxgood = %d\n", tmp_cpswStat.rxgood);
+ (* _func_kprintf)("cpswMacStat rxbroadcast = %d\n", tmp_cpswStat.rxbroadcast);
+ (* _func_kprintf)("cpswMacStat rxmulticast = %d\n", tmp_cpswStat.rxmulticast);
+ (* _func_kprintf)("cpswMacStat rxpause = %d\n", tmp_cpswStat.rxpause);
+ (* _func_kprintf)("cpswMacStat rxcrcerros = %d\n", tmp_cpswStat.rxcrcerros);
+ (* _func_kprintf)("cpswMacStat rxalignmenterrors = %d\n", tmp_cpswStat.rxalignmenterrors);
+ (* _func_kprintf)("cpswMacStat rxoversized = %d\n", tmp_cpswStat.rxoversized);
+ (* _func_kprintf)("cpswMacStat rxjabber = %d\n", tmp_cpswStat.rxjabber);
+ (* _func_kprintf)("cpswMacStat rxundersized = %d\n", tmp_cpswStat.rxundersized);
+ (* _func_kprintf)("cpswMacStat rxfrags = %d\n", tmp_cpswStat.rxfrags);
+ (* _func_kprintf)("cpswMacStat rxoctets = %d\n\n", tmp_cpswStat.rxoctets);
+
+ (* _func_kprintf)("cpswMacStat txgood = %d\n", tmp_cpswStat.txgood);
+ (* _func_kprintf)("cpswMacStat txbroadcast = %d\n", tmp_cpswStat.txbroadcast);
+ (* _func_kprintf)("cpswMacStat txmulticast = %d\n", tmp_cpswStat.txmulticast);
+ (* _func_kprintf)("cpswMacStat txpause = %d\n", tmp_cpswStat.txpause);
+ (* _func_kprintf)("cpswMacStat txdefered = %d\n", tmp_cpswStat.txdefered);
+ (* _func_kprintf)("cpswMacStat txcollision = %d\n", tmp_cpswStat.txcollision);
+ (* _func_kprintf)("cpswMacStat txsinglecol = %d\n", tmp_cpswStat.txsinglecol);
+ (* _func_kprintf)("cpswMacStat txmulticol = %d\n", tmp_cpswStat.txmulticol);
+ (* _func_kprintf)("cpswMacStat txexceesive = %d\n", tmp_cpswStat.txexceesive);
+ (* _func_kprintf)("cpswMacStat txlatecol = %d\n", tmp_cpswStat.txlatecol);
+ (* _func_kprintf)("cpswMacStat txunderrun = %d\n", tmp_cpswStat.txunderrun);
+ (* _func_kprintf)("cpswMacStat txcariersense = %d\n", tmp_cpswStat.txcariersense);
+ (* _func_kprintf)("cpswMacStat txoctets = %d\n\n", tmp_cpswStat.txoctets);
+
+ (* _func_kprintf)("cpswMacStat sz64octets = %d\n", tmp_cpswStat.sz64octets);
+ (* _func_kprintf)("cpswMacStat sz65_127octets = %d\n", tmp_cpswStat.sz65_127octets);
+ (* _func_kprintf)("cpswMacStat sz128_255octets = %d\n", tmp_cpswStat.sz128_255octets);
+ (* _func_kprintf)("cpswMacStat sz256_511octets = %d\n", tmp_cpswStat.sz256_511octets);
+ (* _func_kprintf)("cpswMacStat sz512_1023octets = %d\n", tmp_cpswStat.sz512_1023octets);
+ (* _func_kprintf)("cpswMacStat sz1024octets = %d\n", tmp_cpswStat.sz1024octets);
+ (* _func_kprintf)("cpswMacStat netoctets = %d\n", tmp_cpswStat.netoctets);
+ (* _func_kprintf)("cpswMacStat rxfifooverrun = %d\n", tmp_cpswStat.rxfifooverrun[0]);
+ (* _func_kprintf)("cpswMacStat rxfifooverrun = %d\n", tmp_cpswStat.rxfifooverrun[1]);
+ (* _func_kprintf)("cpswMacStat rxfifooverrun = %d\n\n", tmp_cpswStat.rxfifooverrun[2]);
+
+ (* _func_kprintf)("cpswMacStat port 0 cpswInDropped = %lld\n", pSwCtrl->port[0]->cpswInDropped);
+ (* _func_kprintf)("cpswMacStat port 1 cpswInDropped = %lld\n\n\n", pSwCtrl->port[1]->cpswInDropped);
+
+ return OK;
+ }
+
/*******************************************************************************
*
* cpswEndIoctl - the driver I/O control routine
在平时的传输中,偶尔就有接收crc error的情况:
-> cpswEndStatsShow
...
cpswEndStatsShow,2900, rxgood = 1023
cpswEndStatsShow,2901, rxbroadcast = 0
cpswEndStatsShow,2902, rxmulticast = 0
cpswEndStatsShow,2903, rxpause = 0
cpswEndStatsShow,2904, rxcrcerros = 17
cpswEndStatsShow,2905, rxalignmenterrors = 0
cpswEndStatsShow,2906, rxoversized = 0
cpswEndStatsShow,2907, rxjabber = 0
cpswEndStatsShow,2908, rxundersized = 0
cpswEndStatsShow,2909, rxfrags = 0
cpswEndStatsShow,2910, rxoctets = 83903
在极端卡顿的情况下,接收crc error大量增加:
-> cpswEndStatsShow
...
cpswEndStatsShow,2900, rxgood = 5544
cpswEndStatsShow,2901, rxbroadcast = 181
cpswEndStatsShow,2902, rxmulticast = 158
cpswEndStatsShow,2903, rxpause = 0
cpswEndStatsShow,2904, rxcrcerros = 109
cpswEndStatsShow,2905, rxalignmenterrors = 0
cpswEndStatsShow,2906, rxoversized = 0
cpswEndStatsShow,2907, rxjabber = 0
cpswEndStatsShow,2908, rxundersized = 0
cpswEndStatsShow,2909, rxfrags = 0
cpswEndStatsShow,2910, rxoctets = 433598
构造内容为全FF的包,发送100个,错误100个:
-> cpswEndStatsShow
...
cpswEndStatsShow,2900, rxgood = 9844
cpswEndStatsShow,2901, rxbroadcast = 613
cpswEndStatsShow,2902, rxmulticast = 379
cpswEndStatsShow,2903, rxpause = 0
cpswEndStatsShow,2904, rxcrcerros = 272
cpswEndStatsShow,2905, rxalignmenterrors = 0
cpswEndStatsShow,2906, rxoversized = 0
cpswEndStatsShow,2907, rxjabber = 0
cpswEndStatsShow,2908, rxundersized = 0
cpswEndStatsShow,2909, rxfrags = 0
cpswEndStatsShow,2910, rxoctets = 764175
-> cpswEndStatsShow
...
cpswEndStatsShow,2900, rxgood = 9877
cpswEndStatsShow,2901, rxbroadcast = 635
cpswEndStatsShow,2902, rxmulticast = 387
cpswEndStatsShow,2903, rxpause = 0
cpswEndStatsShow,2904, rxcrcerros = 374
cpswEndStatsShow,2905, rxalignmenterrors = 0
cpswEndStatsShow,2906, rxoversized = 0
cpswEndStatsShow,2907, rxjabber = 0
cpswEndStatsShow,2908, rxundersized = 0
cpswEndStatsShow,2909, rxfrags = 0
cpswEndStatsShow,2910, rxoctets = 766815
同样的测试,网口1测试了一天无crc error情况。
总结:的确是硬件走线信号质量问题,造成数据在传输中出错。
Step 6. 网口0降速测试尝试
我们尝试把网口0降速到100M,来测试速度降低是否能消除错误。
因为原生驱动没有类似函数,我们也必须手工编写一个:
diff --git a/vxbFdtTiCpswEnd.c b/vxbFdtTiCpswEnd.c
index 4c32f0e..71afa1a 100644 (file)
--- a/vxbFdtTiCpswEnd.c
+++ b/vxbFdtTiCpswEnd.c
@@ -2938,6 +2946,126 @@ STATUS cpswEndStatsShow
return OK;
}
+STATUS cpswEndLinkSet
+(
+ UINT32 port,
+ UINT32 mode
+)
+{
+ CPSW_DRV_CTRL * pDrvCtrl;
+
+ if ((port > 1) || (mode > 4))
+ return -1;
+
+ pDrvCtrl = pSwCtrl->port[port];
+
+ switch(mode)
+ {
+ case 0:
+ (void)miiBusModeSet (pDrvCtrl->cpswMiiDev, IFM_ETHER | IFM_AUTO);
+ (* _func_kprintf)("set port %d mode: AUTO_SEL\n");
+ break;
+ case 1:
+ (void)miiBusModeSet (pDrvCtrl->cpswMiiDev, IFM_ETHER|IFM_100_TX|IFM_FDX);
+ (* _func_kprintf)("set port %d mode: 100 FDX\n");
+ break;
+ case 2:
+ (void)miiBusModeSet (pDrvCtrl->cpswMiiDev, IFM_ETHER|IFM_100_TX);
+ (* _func_kprintf)("set port %d mode: 100 HDX\n");
+ break;
+ case 3:
+ (void)miiBusModeSet (pDrvCtrl->cpswMiiDev, IFM_ETHER|IFM_10_T|IFM_FDX);
+ (* _func_kprintf)("set port %d mode: 10 FDX\n");
+ break;
+ case 4:
+ (void)miiBusModeSet (pDrvCtrl->cpswMiiDev, IFM_ETHER|IFM_10_T);
+ (* _func_kprintf)("set port %d mode: 10 HDX\n");
+ break;
+ default:
+ break;
+ }
+
+ return OK;
+}
强制设置网口0为100M全双工模式:
-> cpswEndLinkSet(0, 1)
网口0降速为100M模式,测试了半天无crc error情况。
总结:降速能解决crc error的问题,是一个workaround方案。
3. 结论
综上所述,ftp变慢卡顿的根因是硬件信号不稳定造成的大量错包丢包。
目前的workaround方案,可以把网口0降速成100M来使用。
根本的解决办法,需要硬件重新布线重新仿真,单板完工后需要做完备的信号测试,这样高速信号才能有一个扎实稳定的基础。