[PA-1766] Transient failures on Cisco NX Created: 2017/09/20  Updated: 2019/01/17  Resolved: 2019/01/17

Status: Closed
Project: Puppet Agent
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: CI Blocker
Priority: Normal
Reporter: Erik Dasher
Assignee: Unassigned
Resolution: Cannot Reproduce
Votes: 0
Labels: ci, cisco, netdev, packaging, transient
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: JPEG File rutroh.jpg    
Issue Links:
relates to PCP-798 ssh and scp failures on cisconx-64a w... Closed
relates to PUP-7942 Add Cisco_Nexus-7-x86_64 as a test ta... Resolved
relates to PA-1574 Failure installing puppet-agent on ci... Closed
CI Pipeline/s:
platform puppet-agent
Team: Platform OS
QA Risk Assessment: Needs Assessment


yum fails intermittently (perhaps 1 run in 3) with an error like the following when installing the Cisco Wind River 5 (NX) agent on a Cisco NX VM.

Test Case vm-presuite.rb reported: #<Beaker::Host::CommandFailure: Host 'ko8g4v3nmarzihu.delivery.puppetlabs.net' exited with 1 running:
source /etc/profile; sudo sh -c " ip netns exec management yum -y install http://yum.puppetlabs.com/puppetlabs-release-pc1-cisco-wrlinux-5.noarch.rpm "
Last 10 lines of output were:
Install 1 Package

Total size: 11 k
Installed size: 11 k
Downloading Packages:
warning: rpmts_HdrFromFdno: Header V4 RSA/SHA1 signature: NOKEY, key ID ef8d349f

Package Header puppetlabs-release-pc1-1.1.0-4.cisco_wrlinux5.noarch: RPM Cannot open>
Test line: vm-presuite.rb:59:in `block (4 levels) in run_test'

Comment by Erik Dasher [ 2017/09/22 ]

I'm considering expanding this issue to cover all transient network failures in the Cisco NX vmpooler image.

SCP frequently fails as well. After a recent scp failure, I found these messages at the bottom of the dmesg output:

[ 385.394869] _local_bh_enable_ip: cpu:2 Bottom half held between start-6860787165720 end-6860811783828
[ 385.394896]
[ 385.394896] _local_bh_enable_ip:cpu-2 Bottom half held for 24 msecs disable_ip=_raw_spin_lock_bh+0x16/0x40 enable_ip=mts_sys_recv+0x527/0x1520 [klm_mts]

Comment by Erik Dasher [ 2017/09/25 ]

This may be related to PCP-798.

Comment by Rick Sherman [ 2017/09/25 ]

I believe we have two tracks we need to follow here.

1) Are these transients a known issue with the Cisco image? We can confer with Cisco, and also update to the latest version of the VM.

2) Can we make our tooling more robust when it comes to dealing with transients? Are there opportunities for us to retry commands, etc.
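
As a rough illustration of the retry idea in (2), here is a minimal Ruby sketch of a retry wrapper with exponential backoff. The names `with_retries`, `TransientFailure`, and the `sleeper` parameter are hypothetical and not part of Beaker or any existing tooling; the attempt count and delays are arbitrary placeholders.

```ruby
# Hypothetical error class standing in for whatever exception the tooling
# raises when a remote command fails transiently.
class TransientFailure < StandardError; end

# Hypothetical helper: run the block, retrying on error_class up to
# `attempts` times, doubling the delay after each failed attempt.
# `sleeper` is injectable so the helper can be exercised without waiting.
def with_retries(attempts: 3, base_delay: 1.0, error_class: TransientFailure,
                 sleeper: ->(seconds) { sleep(seconds) })
  raise ArgumentError, 'attempts must be >= 1' if attempts < 1

  tries = 0
  begin
    tries += 1
    yield tries
  rescue error_class
    # Out of attempts: re-raise the last failure unchanged.
    raise if tries >= attempts

    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end

# Simulated flaky command: fails on the first two attempts, then succeeds.
outcome = with_retries(attempts: 3, base_delay: 0.01) do |attempt|
  raise TransientFailure, "simulated failure on attempt #{attempt}" if attempt < 3

  "succeeded on attempt #{attempt}"
end
puts outcome
```

In a presuite, the block would wrap the failing install step and `error_class` would presumably be the `Beaker::Host::CommandFailure` seen in the log above; that wiring is an assumption and is not shown here. Retrying only hides the symptom, so it complements rather than replaces track (1).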

Comment by Erik Dasher [ 2017/09/25 ]

There are some closely related tickets here; let us use this one for the long-term issues that Rick Sherman mentions above.

Ping John Duarte: the existing Cisco NX vmpooler image has substantial issues that have motivated us to remove the Cisco NX tests from the Puppet Agent pipeline. I don't know what the long-term solution might be.

Comment by Geoff Nichols [ 2019/01/17 ]

This ticket has not been updated in some time and is now being closed due to inactivity. This isn’t necessarily a statement that this ticket isn’t important - other issues may have demanded precedence since it was filed, or it may have simply slipped through the cracks. If any viewer/watcher feels closing this ticket is an error, please re-open it and add a comment explaining. Our apologies in advance for any mistake on this.
