[PDB-1448] 'replace facts' failing due to foreign key constraint issue Created: 2015/04/22  Updated: 2016/04/19  Resolved: 2015/05/26

Status: Closed
Project: PuppetDB
Component/s: None
Affects Version/s: PDB 2.3.3
Fix Version/s: PDB 2.3.4

Type: Bug Priority: Normal
Reporter: Simon Oxwell Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 7, Postgres 9.2.7


Attachments: PNG File Screen Shot 2015-04-30 at 4.53.45 PM.png     PNG File Screen Shot 2015-04-30 at 4.54.03 PM.png     PNG File Screen Shot 2015-04-30 at 4.58.05 PM.png    
Issue Links:
Relates
Template:
Story Points: 4
Sprint: PuppetDB 2015-05-06

 Description   

Hi,

We're seeing an issue with our PuppetDB since it was upgraded to 2.3.1: some nodes are not updating.

We get errors like the following in the puppetdb.log:

2015-04-23 14:45:10,223 ERROR [c.p.p.command] [b7ba3fa5-c61b-4461-a2b5-4317d00a92c8] [replace facts] Retrying after attempt 4, due to: java.sql.BatchUpdateException: Batch entry 0 DELETE FROM fact_values fv
           WHERE fv.id in (90,180,2414,2667,2590,1264,2436,1561878,2568,1524348,90,2516,2436,2468,1561273,1561275,33,1210,1236,2678,90,33,2320,2478,2494,2452,90,2678,1551663,33,26,2660,1561269,896,33,578,33,1561277,1551664,90,4641,180,2590,90,83,578,1551663,83,2452,2451,1561876,1561877,180,1561272,1561876,2474,90,1551664,180,2516,1561277,2462,1561271,33,2568,26,2494,2667,1524348,2414,180,33,33,90,4616)
             AND NOT EXISTS (SELECT 1 FROM facts f
                               WHERE f.fact_value_id in (90,180,2414,2667,2590,1264,2436,1561878,2568,1524348,90,2516,2436,2468,1561273,1561275,33,1210,1236,2678,90,33,2320,2478,2494,2452,90,2678,1551663,33,26,2660,1561269,896,33,578,33,1561277,1551664,90,4641,180,2590,90,83,578,1551663,83,2452,2451,1561876,1561877,180,1561272,1561876,2474,90,1551664,180,2516,1561277,2462,1561271,33,2568,26,2494,2667,1524348,2414,180,33,33,90,4616)
                                 AND f.fact_value_id = fv.id
                                 AND (f.factset_id, f.fact_path_id) NOT in ((36,820),(36,44),(36,279),(36,67),(36,924),(36,452),(36,812),(36,105),(36,616),(36,139),(36,1019),(36,487),(36,231),(36,65),(36,15),(36,48),(36,88),(36,655),(36,118),(36,87),(36,230),(36,647),(36,493),(36,991),(36,857),(36,81),(36,2),(36,1),(36,79),(36,249),(36,911),(36,349),(36,63),(36,548),(36,141),(36,324),(36,1006),(36,40),(36,27),(36,612),(36,364),(36,530),(36,832),(36,339),(36,25),(36,526),(36,92),(36,797),(36,85),(36,642),(36,131),(36,86),(36,383),(36,58),(36,103),(36,13),(36,871),(36,68),(36,136),(36,536),(36,127),(36,624),(36,7),(36,128),(36,543),(36,519),(36,214),(36,4),(36,46),(36,482),(36,897),(36,49),(36,809),(36,704),(36,811))) was aborted.  Call getNextException to see the cause.
java.sql.BatchUpdateException: Batch entry 0 DELETE FROM fact_values fv
           WHERE fv.id in (90,180,2414,2667,2590,1264,2436,1561878,2568,1524348,90,2516,2436,2468,1561273,1561275,33,1210,1236,2678,90,33,2320,2478,2494,2452,90,2678,1551663,33,26,2660,1561269,896,33,578,33,1561277,1551664,90,4641,180,2590,90,83,578,1551663,83,2452,2451,1561876,1561877,180,1561272,1561876,2474,90,1551664,180,2516,1561277,2462,1561271,33,2568,26,2494,2667,1524348,2414,180,33,33,90,4616)
             AND NOT EXISTS (SELECT 1 FROM facts f
                               WHERE f.fact_value_id in (90,180,2414,2667,2590,1264,2436,1561878,2568,1524348,90,2516,2436,2468,1561273,1561275,33,1210,1236,2678,90,33,2320,2478,2494,2452,90,2678,1551663,33,26,2660,1561269,896,33,578,33,1561277,1551664,90,4641,180,2590,90,83,578,1551663,83,2452,2451,1561876,1561877,180,1561272,1561876,2474,90,1551664,180,2516,1561277,2462,1561271,33,2568,26,2494,2667,1524348,2414,180,33,33,90,4616)
                                 AND f.fact_value_id = fv.id
                                 AND (f.factset_id, f.fact_path_id) NOT in ((36,820),(36,44),(36,279),(36,67),(36,924),(36,452),(36,812),(36,105),(36,616),(36,139),(36,1019),(36,487),(36,231),(36,65),(36,15),(36,48),(36,88),(36,655),(36,118),(36,87),(36,230),(36,647),(36,493),(36,991),(36,857),(36,81),(36,2),(36,1),(36,79),(36,249),(36,911),(36,349),(36,63),(36,548),(36,141),(36,324),(36,1006),(36,40),(36,27),(36,612),(36,364),(36,530),(36,832),(36,339),(36,25),(36,526),(36,92),(36,797),(36,85),(36,642),(36,131),(36,86),(36,383),(36,58),(36,103),(36,13),(36,871),(36,68),(36,136),(36,536),(36,127),(36,624),(36,7),(36,128),(36,543),(36,519),(36,214),(36,4),(36,46),(36,482),(36,897),(36,49),(36,809),(36,704),(36,811))) was aborted.  Call getNextException to see the cause.
	at org.postgresql.jdbc2.AbstractJdbc2Statement$BatchResultHandler.handleError(AbstractJdbc2Statement.java:2746) ~[puppetdb.jar:na]
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1887) ~[puppetdb.jar:na]
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:405) ~[puppetdb.jar:na]
	at org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2893) ~[puppetdb.jar:na]
	at com.jolbox.bonecp.StatementHandle.executeBatch(StatementHandle.java:469) ~[puppetdb.jar:na]
	at clojure.java.jdbc.internal$do_prepared_STAR_$fn__7259.invoke(internal.clj:356) ~[na:na]
	at clojure.java.jdbc.internal$transaction_STAR_.invoke(internal.clj:223) ~[na:na]
	at clojure.java.jdbc.internal$do_prepared_STAR_.doInvoke(internal.clj:356) ~[na:na]
	at clojure.lang.RestFn.applyTo(RestFn.java:139) ~[puppetdb.jar:na]
	at clojure.core$apply.invoke(core.clj:626) ~[puppetdb.jar:na]
	at clojure.java.jdbc$do_prepared.doInvoke(jdbc.clj:206) ~[na:na]
	at clojure.lang.RestFn.invoke(RestFn.java:423) ~[puppetdb.jar:na]
	at com.puppetlabs.puppetdb.scf.storage$eval10818$delete_pending_value_id_orphans_BANG___10819$fn__10820.invoke(storage.clj:797) ~[na:na]
	at com.puppetlabs.puppetdb.scf.storage$eval10818$delete_pending_value_id_orphans_BANG___10819.invoke(storage.clj:777) ~[na:na]
	at com.puppetlabs.puppetdb.scf.storage$eval10995$update_facts_BANG___10996$fn__11000$fn__11002.invoke(storage.clj:993) ~[na:na]
	at clojure.java.jdbc.internal$transaction_STAR_.invoke(internal.clj:223) ~[na:na]
	at com.puppetlabs.puppetdb.scf.storage$eval10995$update_facts_BANG___10996$fn__11000.invoke(storage.clj:956) ~[na:na]
	at com.puppetlabs.puppetdb.scf.storage$eval10995$update_facts_BANG___10996.invoke(storage.clj:949) ~[na:na]
	at com.puppetlabs.puppetdb.scf.storage$eval11197$replace_facts_BANG___11198$fn__11202$fn__11204.invoke(storage.clj:1212) ~[na:na]
	at com.puppetlabs.puppetdb.scf.storage.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source) ~[na:na]
	at com.yammer.metrics.core.Timer.time(Timer.java:91) ~[puppetdb.jar:na]
	at com.puppetlabs.puppetdb.scf.storage$eval11197$replace_facts_BANG___11198$fn__11202.invoke(storage.clj:1209) ~[na:na]
	at com.puppetlabs.puppetdb.scf.storage$eval11197$replace_facts_BANG___11198.invoke(storage.clj:1202) ~[na:na]
	at com.puppetlabs.puppetdb.command$eval13392$fn__13395$fn__13408.invoke(command.clj:381) ~[na:na]
	at com.puppetlabs.jdbc$with_transacted_connection_fn$fn__7670$fn__7671$fn__7672.invoke(jdbc.clj:290) ~[na:na]
	at clojure.java.jdbc.internal$transaction_STAR_.invoke(internal.clj:204) ~[na:na]
	at com.puppetlabs.jdbc$with_transacted_connection_fn$fn__7670$fn__7671.invoke(jdbc.clj:290) ~[na:na]
	at clojure.java.jdbc.internal$with_connection_STAR_.invoke(internal.clj:186) ~[na:na]
	at com.puppetlabs.jdbc$with_transacted_connection_fn$fn__7670.invoke(jdbc.clj:287) ~[na:na]
	at com.puppetlabs.jdbc$eval7648$retry_sql_STAR___7649$fn__7650$fn__7651.invoke(jdbc.clj:259) ~[na:na]
	at com.puppetlabs.jdbc$eval7648$retry_sql_STAR___7649$fn__7650.invoke(jdbc.clj:258) ~[na:na]
	at com.puppetlabs.jdbc$eval7648$retry_sql_STAR___7649.invoke(jdbc.clj:250) ~[na:na]
	at com.puppetlabs.jdbc$with_transacted_connection_fn.invoke(jdbc.clj:286) ~[na:na]
	at com.puppetlabs.puppetdb.command$eval13392$fn__13395.invoke(command.clj:379) ~[na:na]
	at clojure.lang.MultiFn.invoke(MultiFn.java:231) ~[puppetdb.jar:na]
	at com.puppetlabs.puppetdb.command$produce_message_handler$fn__13564.invoke(command.clj:647) ~[na:na]
	at com.puppetlabs.puppetdb.command$wrap_with_discard$fn__13513$fn__13517.invoke(command.clj:554) ~[na:na]
	at com.puppetlabs.puppetdb.command.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source) ~[na:na]
	at com.yammer.metrics.core.Timer.time(Timer.java:91) ~[puppetdb.jar:na]
	at com.puppetlabs.puppetdb.command$wrap_with_discard$fn__13513.invoke(command.clj:553) ~[na:na]
	at com.puppetlabs.puppetdb.command$wrap_with_exception_handling$fn__13498$fn__13499.invoke(command.clj:507) ~[na:na]
	at com.puppetlabs.puppetdb.command.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source) ~[na:na]
	at com.yammer.metrics.core.Timer.time(Timer.java:91) ~[puppetdb.jar:na]
	at com.puppetlabs.puppetdb.command$wrap_with_exception_handling$fn__13498.invoke(command.clj:506) ~[na:na]
	at com.puppetlabs.puppetdb.command$wrap_with_command_parser$fn__13508.invoke(command.clj:529) [na:na]
	at com.puppetlabs.puppetdb.command$wrap_with_meter$fn__13488.invoke(command.clj:467) [na:na]
	at com.puppetlabs.puppetdb.command$wrap_with_thread_name$fn__13522.invoke(command.clj:569) [na:na]
	at com.puppetlabs.mq$create_message_listener$reify__11783.onMessage(mq.clj:270) [na:na]
	at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:560) [puppetdb.jar:na]
	at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:498) [puppetdb.jar:na]
	at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:467) [puppetdb.jar:na]
	at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:325) [puppetdb.jar:na]
	at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:263) [puppetdb.jar:na]
	at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1058) [puppetdb.jar:na]
	at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1050) [puppetdb.jar:na]
	at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:947) [puppetdb.jar:na]
	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]

and the corresponding error in the postgres.log:

ERROR:  update or delete on table "fact_values" violates foreign key constraint "fact_value_id_fk" on table "facts"
DETAIL:  Key (id)=(1524348) is still referenced from table "facts".
STATEMENT:  DELETE FROM fact_values fv
	           WHERE fv.id in ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26,$27,$28,$29,$30,$31,$32,$33,$34,$35,$36,$37,$38,$39,$40,$41,$42,$43,$44,$45,$46,$47,$48,$49,$50,$51,$52,$53,$54,$55,$56,$57,$58,$59,$60,$61,$62,$63,$64,$65,$66,$67,$68,$69,$70,$71,$72,$73,$74,$75)
	             AND NOT EXISTS (SELECT 1 FROM facts f
	                               WHERE f.fact_value_id in ($76,$77,$78,$79,$80,$81,$82,$83,$84,$85,$86,$87,$88,$89,$90,$91,$92,$93,$94,$95,$96,$97,$98,$99,$100,$101,$102,$103,$104,$105,$106,$107,$108,$109,$110,$111,$112,$113,$114,$115,$116,$117,$118,$119,$120,$121,$122,$123,$124,$125,$126,$127,$128,$129,$130,$131,$132,$133,$134,$135,$136,$137,$138,$139,$140,$141,$142,$143,$144,$145,$146,$147,$148,$149,$150)
	                                 AND f.fact_value_id = fv.id
	                                 AND (f.factset_id, f.fact_path_id) NOT in (($151,$152),($153,$154),($155,$156),($157,$158),($159,$160),($161,$162),($163,$164),($165,$166),($167,$168),($169,$170),($171,$172),($173,$174),($175,$176),($177,$178),($179,$180),($181,$182),($183,$184),($185,$186),($187,$188),($189,$190),($191,$192),($193,$194),($195,$196),($197,$198),($199,$200),($201,$202),($203,$204),($205,$206),($207,$208),($209,$210),($211,$212),($213,$214),($215,$216),($217,$218),($219,$220),($221,$222),($223,$224),($225,$226),($227,$228),($229,$230),($231,$232),($233,$234),($235,$236),($237,$238),($239,$240),($241,$242),($243,$244),($245,$246),($247,$248),($249,$250),($251,$252),($253,$254),($255,$256),($257,$258),($259,$260),($261,$262),($263,$264),($265,$266),($267,$268),($269,$270),($271,$272),($273,$274),($275,$276),($277,$278),($279,$280),($281,$282),($283,$284),($285,$286),($287,$288),($289,$290),($291,$292),($293,$294),($295,$296),($297,$298),($299,$300)))

I've been able to get things moving along by performing

delete from facts where fact_value_id in ( <list of numbers from failing fact_values delete>)

and I've updated to 2.3.3, but new 'factsets' keep developing the same error.

We're on CentOS 7 and postgres 9.2.7. We have about 110 puppet agents.



 Comments   
Comment by Wyatt Alt [ 2015/04/23 ]

Simon Oxwell Thanks for reporting this. When you say "same error", does it always have the same id?

Would you mind giving the output of these queries?

select * from facts f inner join fact_values fv on f.fact_value_id=fv.id inner join fact_paths fp on f.fact_path_id=fp.id where f.fact_value_id=1524348;
 
select count(*) from fact_values where id not in (select distinct fact_value_id from facts);
 
select fact_path_id, fp.name, count(*) cnt from facts f join fact_paths fp on fact_path_id=fp.id join fact_values fv on f.fact_value_id=fv.id where fv.id not in (select distinct fact_value_id from facts) group by fact_path_id, fp.name order by cnt desc;

Also, does your setup include multiple PuppetDB nodes or just a single one?

Comment by Simon Oxwell [ 2015/04/23 ]

Hi Wyatt,

Currently I have four servers that aren't updating, so I've got four different 'replace facts' errors repeating in the logs, and there are four distinct factset_ids in the NOT EXISTS subquery that match those four servers. The fact_value_ids listed are consistent for each server.

The query outputs:

puppetdb=# select * from facts f inner join fact_values fv on f.fact_value_id=fv.id inner join fact_paths fp on f.fact_path_id=fp.id where f.fact_value_id=1524348;
 factset_id | fact_path_id | fact_value_id |   id    | value_type_id |                value_hash                | value_integer | value_float | value_string | value_boolean | value_json | id  | depth |                    name                    |                           path
------------+--------------+---------------+---------+---------------+------------------------------------------+---------------+-------------+--------------+---------------+------------+-----+-------+--------------------------------------------+-----------------------------------------------------------
         36 |          139 |       1524348 | 1524348 |             0 | 5c11c98386f3c4e9fb8a59f6c8540d4db4cbf625 |               |             | cang-test    |               |      | 139 |     2 | funnelback_collections_profiles_structured | funnelback_collections_profiles_structured#~4#~collection
         36 |           46 |       1524348 | 1524348 |             0 | 5c11c98386f3c4e9fb8a59f6c8540d4db4cbf625 |               |             | cang-test    |               |      |  46 |     1 | funnelback_collections_structured          | funnelback_collections_structured#~4
(2 rows)
 
puppetdb=# select count(*) from fact_values where id not in (select distinct fact_value_id from facts);
 count
-------
     0
(1 row)
 
puppetdb=# select fact_path_id, fp.name, count(*) cnt from facts f join fact_paths fp on fact_path_id=fp.id join fact_values fv on f.fact_value_id=fv.id where fv.id not in (select distinct fact_value_id from facts) group by fact_path_id, fp.name order by cnt desc;
 fact_path_id | name | cnt
--------------+------+-----
(0 rows)

We've only got one PuppetDB node.

Thanks!

Comment by Wyatt Alt [ 2015/04/23 ]

Simon Oxwell Thanks, that's informative. For the host associated with the two facts above, could you give the results of the following?

curl -X GET http://localhost:8080/v4/nodes/<your hostname>/facts/funnelback_collections_profiles_structured
curl -X GET http://localhost:8080/v4/nodes/<your hostname>/facts/funnelback_collections_profiles_profiles_structured

You can identify the node with

select certname from factsets where id=36;

Wyatt

Comment by Simon Oxwell [ 2015/04/23 ]

Hi Wyatt,

Here's the results of the two queries (I took the liberty of changing the second one from funnelback_collections_profiles_profiles_structured to funnelback_collections_structured to match the query from the previous update, otherwise it returned nothing)

I've anonymized some of the names too (to hide our server and client names), but consistently across the output of both commands.

curl -X GET http://localhost:8080/v4/nodes/server-name.search/facts/funnelback_collections_profiles_structured
[ {
  "value" : [ {
    "collection" : "client1-intranet",
    "profiles" : [ "_default", "_default_preview" ]
  }, {
    "collection" : "client2-youtube",
    "profiles" : [ "_default", "_default_preview" ]
  }, {
    "collection" : "client2-web",
    "profiles" : [ "scripts", "_default", "test-data", "synthetic-data", "_default_preview", "test" ]
  }, {
    "collection" : "client2-university",
    "profiles" : [ "_default", "humanities", "MU_test", "ndri", "Save", "humanities_preview", "ndri_preview", "mobile", "mobile_preview", "_default_preview" ]
  }, {
    "collection" : "cang-test",
    "profiles" : [ "scripts", "_default", "_default_preview" ]
  }, {
    "collection" : "client2-qc",
    "profiles" : [ "_default", "unilife", "_default_preview" ]
  }, {
    "collection" : "client2-courses",
    "profiles" : [ "scripts", "_default", "_default_preview" ]
  }, {
    "collection" : "client2-staff",
    "profiles" : [ "scripts", "_default", "www.google.com.au", "_default_preview" ]
  }, {
    "collection" : "client2-jobs",
    "profiles" : [ "scripts", "_default", "_default_preview" ]
  }, {
    "collection" : "client3",
    "profiles" : [ "_default", "_default_preview" ]
  }, {
    "collection" : "funnelback_documentation",
    "profiles" : [ "_default", "_default_preview" ]
  }, {
    "collection" : "client2-unilife",
    "profiles" : [ "scripts", "_default", "_default_preview" ]
  }, {
    "collection" : "client2-maps",
    "profiles" : [ "scripts", "_default", "test-data", "_default_preview" ]
  }, {
    "collection" : "client2-entities",
    "profiles" : [ "scripts", "_default", "customFilters", "_default_preview" ]
  } ],
  "name" : "funnelback_collections_profiles_structured",
  "environment" : "production",
  "certname" : "server-name.search"
} ]

curl -X GET http://localhost:8080/v4/nodes/server-name.search/facts/funnelback_collections_structured
[ {
  "value" : [ "client1-intranet", "client2-youtube", "client2-web", "client2-university", "cang-test", "client2-qc", "client2-courses", "client2-staff", "client2-jobs", "client3", "funnelback_documentation", "client2-unilife", "client2-maps", "client2-entities" ],
  "name" : "funnelback_collections_structured",
  "environment" : "production",
  "certname" : "server-name.search"
} ]

Thanks,

Simon

Comment by Wyatt Alt [ 2015/04/24 ]

Simon Oxwell Thanks. I've got no real leads yet but we've pulled this into our current sprint for investigation. Our best vague guess at the moment is that one or both of those facts are being updated in a way that we are not handling correctly. I've got a few questions; interested if you can shed some light.

  • are the problematic facts on the other three hosts also structured like these two, or are there any simple name/value pairs affected?
  • when you run the delete command you mentioned in your report, does the problem come back immediately or after some time?
  • does the problem always reappear on the same hosts? same facts?
  • are you seeing more nodes affected as time goes on?
  • are there any failed commands in /var/lib/puppetdb/mq/discarded that reflect the same constraint violation?
  • are you running with the default puppet run interval on all your nodes? (30 minutes)
  • for a host that's affected, could you give the output of

    facter -j funnelback_collections_profiles_structured
    

    facter -j funnelback_collections_structured
    

    (substituting in the relevant fact names if different) as well as an up-to-date output of the two /v4/facts queries you pasted last, anonymized as necessary. This should indicate what specific update (if any) is causing the command to fail, since puppet takes facts from facter to pass to PuppetDB.

Thanks,
Wyatt

Comment by Simon Oxwell [ 2015/04/27 ]

Hi Wyatt,

I've now got 5 servers affected. I've collected the blocking fact_value_ids from the postgres logs, and they all seem to be related to our structured collections:

puppetdb=# select factset_id,fact_value_id,value_hash,name,path from facts f inner join fact_values fv on f.fact_value_id=fv.id inner join fact_paths fp on f.fact_path_id=fp.id where f.fact_value_id in (807,810,1788,2709,2731,5697,1524348,1546444,1547903,1549732,1552056) order by fact_value_id;
 factset_id | fact_value_id |                value_hash                |                    name                    |                             path
------------+---------------+------------------------------------------+--------------------------------------------+--------------------------------------------------------------
          4 |           807 | eedbf60502cdb6b60552f2044fe16fcf6483e077 | funnelback_collections_profiles_structured | funnelback_collections_profiles_structured#~59#~collection
          4 |           807 | eedbf60502cdb6b60552f2044fe16fcf6483e077 | funnelback_collections_structured          | funnelback_collections_structured#~59
          4 |           810 | cd339c27e31eff5c2c390c4c4e688d8ff1f60763 | funnelback_collections_profiles_structured | funnelback_collections_profiles_structured#~586#~collection
          4 |           810 | cd339c27e31eff5c2c390c4c4e688d8ff1f60763 | funnelback_collections_structured          | funnelback_collections_structured#~586
          8 |          1788 | 93743e5e49f9b8989784a4637c61000db4cf3568 | funnelback_collections_profiles_structured | funnelback_collections_profiles_structured#~41#~collection
          8 |          1788 | 93743e5e49f9b8989784a4637c61000db4cf3568 | funnelback_collections_structured          | funnelback_collections_structured#~41
         18 |          2709 | 9f08029bbb52b0342c07f8a217c6c7372a3f32c8 | funnelback_collections_profiles_structured | funnelback_collections_profiles_structured#~49#~collection
         18 |          2709 | 9f08029bbb52b0342c07f8a217c6c7372a3f32c8 | funnelback_collections_structured          | funnelback_collections_structured#~49
         18 |          2731 | 9374ee7cb103a2bf5657fedbcbd3a99615299893 | funnelback_collections_structured          | funnelback_collections_structured#~61
         18 |          2731 | 9374ee7cb103a2bf5657fedbcbd3a99615299893 | funnelback_collections_profiles_structured | funnelback_collections_profiles_structured#~61#~collection
        102 |          5697 | 1ad3929edb97aca0dacd00cb7cb49faa27b7debb | funnelback_collections_structured          | funnelback_collections_structured#~15
        102 |          5697 | 1ad3929edb97aca0dacd00cb7cb49faa27b7debb | funnelback_collections_profiles_structured | funnelback_collections_profiles_structured#~15#~collection
         36 |       1524348 | 5c11c98386f3c4e9fb8a59f6c8540d4db4cbf625 | funnelback_collections_structured          | funnelback_collections_structured#~4
         36 |       1524348 | 5c11c98386f3c4e9fb8a59f6c8540d4db4cbf625 | funnelback_collections_profiles_structured | funnelback_collections_profiles_structured#~4#~collection
          4 |       1552056 | 5d5c4e9e410695b2553a2128a3ec23c4d8ea8e07 | funnelback_collections_profiles_structured | funnelback_collections_profiles_structured#~474#~profiles#~5
(15 rows)

When I run SQL to delete all the rows from the facts table matching the factset_id and fact_value_ids from the failing statement, it doesn't fail again immediately; it takes some time to reoccur.

Here's the timestamps from the affected servers:

puppetdb=# select id,timestamp,producer_timestamp from factsets where id in (18,36,102,4,8) order by timestamp;
 id  |         timestamp          |   producer_timestamp
-----+----------------------------+------------------------
  18 | 2015-04-22 14:25:04.044+10 | 2015-04-22 14:25:03+10
  36 | 2015-04-22 15:58:37.857+10 | 2015-04-22 15:58:37+10
 102 | 2015-04-22 16:25:11.431+10 | 2015-04-22 16:25:10+10
   4 | 2015-04-23 09:07:55.472+10 | 2015-04-23 09:07:55+10
   8 | 2015-04-27 11:17:23.321+10 | 2015-04-27 11:17:23+10

So far, they all seem to be the same servers (although server id 93 hasn't come back, but I suspect that was unrelated), though I don't know if it's always the same facts. I'm very suspicious of the structured facts at this point in time - it feels a bit like it might be a race condition.

Our puppet agents run every 20 minutes, and I have 4585 reports in /var/lib/puppetdb/mq/discarded/replace-facts - all with the same SQL failures, and all since the update to 2.3.1.

I'll go and collect the facter output now - but it does seem to be related to repeated keys in the structured facts.

Comment by Simon Oxwell [ 2015/04/27 ]

Hi Wyatt,

Here's the output from the facter commands, and the updated /v4/facts:

 
# facter -jp funnelback_collections_profiles_structured
{
  "funnelback_collections_profiles_structured": [
    {
      "collection": "client1-intranet",
      "profiles": [
        "_default",
        "_default_preview"
      ]
    },
    {
      "collection": "client2-youtube",
      "profiles": [
        "_default",
        "_default_preview"
      ]
    },
    {
      "collection": "cbussuper-com-au-web",
      "profiles": [
        "_default",
        "_default_preview"
      ]
    },
    {
      "collection": "client2-web",
      "profiles": [
        "scripts",
        "_default",
        "test-data",
        "synthetic-data",
        "_default_preview",
        "test"
      ]
    },
    {
      "collection": "client2-university",
      "profiles": [
        "_default",
        "humanities",
        "MU_test",
        "ndri",
        "Save",
        "humanities_preview",
        "ndri_preview",
        "mobile",
        "mobile_preview",
        "_default_preview"
      ]
    },
    {
      "collection": "cang-test",
      "profiles": [
        "scripts",
        "_default",
        "_default_preview"
      ]
    },
    {
      "collection": "client2-qc",
      "profiles": [
        "_default",
        "unilife",
        "_default_preview"
      ]
    },
    {
      "collection": "client2-courses",
      "profiles": [
        "_default",
        "@workflow",
        "@groovy",
        "_default_preview"
      ]
    },
    {
      "collection": "client2-staff",
      "profiles": [
        "scripts",
        "_default",
        "_default_preview"
      ]
    },
    {
      "collection": "client2-jobs",
      "profiles": [
        "scripts",
        "_default",
        "_default_preview"
      ]
    },
    {
      "collection": "client4-twitter2-public",
      "profiles": [
        "_default",
        "_default_preview"
      ]
    },
    {
      "collection": "client3",
      "profiles": [
        "_default",
        "_default_preview"
      ]
    },
    {
      "collection": "wca-test",
      "profiles": [
        "_default",
        "_default_preview"
      ]
    },
    {
      "collection": "funnelback_documentation",
      "profiles": [
        "_default",
        "_default_preview"
      ]
    },
    {
      "collection": "client2-unilife",
      "profiles": [
        "scripts",
        "_default",
        "_default_preview"
      ]
    },
    {
      "collection": "client2-maps",
      "profiles": [
        "scripts",
        "_default",
        "test-data",
        "_default_preview"
      ]
    },
    {
      "collection": "client2-entities",
      "profiles": [
        "scripts",
        "_default",
        "customFilters",
        "_default_preview"
      ]
    },
    {
      "collection": "client4-youtube2-public",
      "profiles": [
        "_default",
        "_default_preview"
      ]
    }
  ]
}
 
 
# facter -jp funnelback_collections_structured
{
  "funnelback_collections_structured": [
    "client1-intranet",
    "client2-youtube",
    "cbussuper-com-au-web",
    "client2-web",
    "client2-university",
    "cang-test",
    "client2-qc",
    "client2-courses",
    "client2-staff",
    "client2-jobs",
    "client4-twitter2-public",
    "client3",
    "wca-test",
    "funnelback_documentation",
    "client2-unilife",
    "client2-maps",
    "client2-entities",
    "client4-youtube2-public"
  ]
}
 
 
# curl -X GET http://localhost:8080/v4/nodes/server-name.search/facts/funnelback_collections_profiles_structured
[ {
  "value" : [ {
    "collection" : "client1-intranet",
    "profiles" : [ "_default", "_default_preview" ]
  }, {
    "collection" : "client2-youtube",
    "profiles" : [ "_default", "_default_preview" ]
  }, {
    "collection" : "client2-web",
    "profiles" : [ "scripts", "_default", "test-data", "synthetic-data", "_default_preview", "test" ]
  }, {
    "collection" : "client2-university",
    "profiles" : [ "_default", "humanities", "MU_test", "ndri", "Save", "humanities_preview", "ndri_preview", "mobile", "mobile_preview", "_default_preview" ]
  }, {
    "collection" : "cang-test",
    "profiles" : [ "scripts", "_default", "_default_preview" ]
  }, {
    "collection" : "client2-qc",
    "profiles" : [ "_default", "unilife", "_default_preview" ]
  }, {
    "collection" : "client2-courses",
    "profiles" : [ "scripts", "_default", "_default_preview" ]
  }, {
    "collection" : "client2-staff",
    "profiles" : [ "scripts", "_default", "www.google.com.au", "_default_preview" ]
  }, {
    "collection" : "client2-jobs",
    "profiles" : [ "scripts", "_default", "_default_preview" ]
  }, {
    "collection" : "client3",
    "profiles" : [ "_default", "_default_preview" ]
  }, {
    "collection" : "funnelback_documentation",
    "profiles" : [ "_default", "_default_preview" ]
  }, {
    "collection" : "client2-unilife",
    "profiles" : [ "scripts", "_default", "_default_preview" ]
  }, {
    "collection" : "client2-maps",
    "profiles" : [ "scripts", "_default", "test-data", "_default_preview" ]
  }, {
    "collection" : "client2-entities",
    "profiles" : [ "scripts", "_default", "customFilters", "_default_preview" ]
  } ],
  "name" : "funnelback_collections_profiles_structured",
  "environment" : "production",
  "certname" : "server-name.search"
} ]
 
# curl -X GET http://localhost:8080/v4/nodes/server-name.search/facts/funnelback_collections_structured
[ {
  "value" : [ "client1-intranet", "client2-youtube", "client2-web", "client2-university", "cang-test", "client2-qc", "client2-courses", "client2-staff", "client2-jobs", "client3", "funnelback_documentation", "client2-unilife", "client2-maps", "client2-entities" ],
  "name" : "funnelback_collections_structured",
  "environment" : "production",
  "certname" : "server-name.search"
} ]

Hope this helps - at the very least, we now have two entries for "client4" that are being collected by facter - not sure exactly when they were added.

Comment by Wyatt Alt [ 2015/04/29 ]

Simon Oxwell

Thanks, those results are informative. We haven't reproduced your exact error on a running PDB yet (although we can make it happen under manufactured circumstances), but I think I have a sense of where things may be going wrong.

This may be more detail than you're interested in, but if you think of a structured fact as a tree, then conceptually we store it as a set of key-value pairs where the values are the leaves and the keys are strings that describe a "path" from the root of the tree to the leaf. We do this to allow paths and values to be shared among multiple facts, which reduces storage space. In the case of arrays, the index is incorporated into the path, so if your structured fact is

foo: ["a","b", {"c": "d"}]

then you can think of the data we store as looking like this (#~ is an arbitrary delimiter):

{"foo#~1": "a",
 "foo#~2": "b",
 "foo#~3#~c": "d"}

The downside of handling arrays this way is that an insertion or deletion at the front of an array causes an index shift for all subsequent elements as well as their children. In your case, I can see that one of those facts has at least 586 top-level array elements, each with a "collection" key and a "profiles" key that also contains some number of array elements. My guess (supported by your facter output) is that values are being inserted into or deleted from that structured fact, forcing the recomputation of thousands of paths with each puppet run. This makes the replace facts command take longer than it normally does, which widens the window for other database transactions to conflict with the update and cause a rollback. That prompts a retry, which then fails for the same reason.

If this is the situation you're in, it will take a substantial storage rework on our end to make the issue go away, so it may be more immediately fruitful to change the shape of that fact to something more amenable. Namely, if the collection keys of funnelback_collections_profiles_structured are unique, I would expect that converting the fact to a map keyed on those collections would greatly alleviate this problem. Any change that converts the top-level structure from an array to a map would greatly lighten the load.
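To make the index shift concrete, here is a minimal Python sketch of this flattening scheme (illustrative only; the real PuppetDB delimiter, index base, and storage code may differ):

```python
# Flatten a structured fact into (path, leaf-value) pairs, using "#~" as
# the delimiter and 1-based array indices to match the example above.
def flatten(value, path=""):
    if isinstance(value, dict):
        for k, v in value.items():
            yield from flatten(v, f"{path}#~{k}" if path else str(k))
    elif isinstance(value, list):
        for i, v in enumerate(value, start=1):
            yield from flatten(v, f"{path}#~{i}" if path else str(i))
    else:
        yield (path, value)

paths_before = dict(flatten({"foo": ["a", "b", {"c": "d"}]}))
# Prepend a single element to the array:
paths_after = dict(flatten({"foo": ["new", "a", "b", {"c": "d"}]}))

print(paths_before)  # {'foo#~1': 'a', 'foo#~2': 'b', 'foo#~3#~c': 'd'}
print(paths_after)   # {'foo#~1': 'new', 'foo#~2': 'a', 'foo#~3': 'b', 'foo#~4#~c': 'd'}
```

One inserted element invalidates every stored path under the array, which is why a map keyed on a stable, unique key (such as the collection name) is so much cheaper: an added map entry leaves all the other entries' paths untouched.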

The answers to the following would help confirm that you're having this issue:

  • Could you give us a screenshot of your PuppetDB dashboard?
  • Can you examine your postgres logs for slow statements, particularly similar
    to the one you see in the error message you posted or regarding deletes?
    If you need to enable slow statement logging, add the following to your postgresql.conf and restart postgres:

      log_min_duration_statement = 100

  • Do you notice high CPU load associated with postgres on your PDB server?

In addition, we're interested in whether you've substantially changed your setup since migrating to 2.3, or whether you were previously running with the same facts and having no issue. I would expect the same problem to be present on 2.2, but the schema changes we made in the 2.3 update may have worsened the effect.

Lastly, if you'd be willing to share a few of your failed replace commands with me, that could be enlightening. This conversation is public, so the private route would be emailing me at wyatt@puppetlabs.com.

Wyatt

Comment by Wyatt Alt [ 2015/04/30 ]

Additionally, I don't know how you're using these facts, but if you aren't specifically querying them through the fact-contents endpoint, you could also consider storing them as strings. Depending on your use case, this could well be the quickest solution.

Comment by Duncan McNaught [ 2015/04/30 ]

FYI - I'm seeing this a lot too after upgrading to 2.3.3. Thanks for filing the bug, Simon Oxwell.

Comment by Wyatt Alt [ 2015/04/30 ]

Duncan McNaught Thanks for reporting that. Are you also seeing this in relation to large structured facts? If you'd be willing to take the three bulleted steps I described in that last message and report back that would be informative.

Comment by Duncan McNaught [ 2015/04/30 ]

Here is the dashboard screenshot for our two puppetdb nodes.
It also shows the CPU not working hard on the database server master.

Comment by Duncan McNaught [ 2015/04/30 ]

I can enable slow statement logging tomorrow if that will help too.

Comment by Simon Oxwell [ 2015/05/01 ]

Hi Wyatt - sorry I haven't been able to get this over the last few days, had other operational priorities.

I've now got a new server playing up, and it's not being blocked by a structured fact:

puppetdb=# DELETE FROM fact_values fv
puppetdb-#            WHERE fv.id in (2347921,5398,5421,1525955,2374767,2301540,2373170,5416,2301540,5407,2374770,2374768,2374770,2374767,2374769,5398,2374771)
puppetdb-#              AND NOT EXISTS (SELECT 1 FROM facts f
puppetdb(#                                WHERE f.fact_value_id in (2347921,5398,5421,1525955,2374767,2301540,2373170,5416,2301540,5407,2374770,2374768,2374770,2374767,2374769,5398,2374771)
puppetdb(#                                  AND f.fact_value_id = fv.id
puppetdb(#                                  AND (f.factset_id, f.fact_path_id) NOT in ((93,148),(93,79),(93,4731),(93,4723),(93,131),(93,68),(93,48),(93,4720),(93,27),(93,4836),(93,127),(93,86),(93,40),(93,103),(93,142),(93,92),(93,105)));
ERROR:  update or delete on table "fact_values" violates foreign key constraint "fact_value_id_fk" on table "facts"
DETAIL:  Key (id)=(5407) is still referenced from table "facts".
puppetdb=# select factset_id,fact_value_id,value_hash,name,path from facts f inner join fact_values fv on f.fact_value_id=fv.id inner join fact_paths fp on f.fact_path_id=fp.id where f.fact_value_id in (5407) order by fact_value_id;
 factset_id | fact_value_id |                value_hash                |       name       |       path
------------+---------------+------------------------------------------+------------------+------------------
         93 |          5407 | 10a70dee52f2a1e7d64cc4c13c2dd1dd1489a1bd | macaddress_vnet2 | macaddress_vnet2

This appears to have occurred when the server was rebooted. It's a VM host, and it's reordered the virtual interfaces:

The MAC address that was on vnet2 is now on vnet1, and vnet2 has a new value that was previously on vnet0 (and vnet0 now has vnet1's old value).

I've also been looking for a simple reproduction case, so I tried something similar on a standalone puppetdb server, talking to it with curl to update facts:

curl -X POST   \
    -H "Accept: application/json" \
    -H "Content-Type: application/json"   \
    -d '{"command":"replace facts","version":3,"payload":{"name":"example_server","environment":"DEV","values":{"key1":"value1","key2":"value2","key3":"value3"}}}' \
     http://localhost:8080/v3/commands
 
curl -X POST  \
    -H "Accept: application/json" \
    -H "Content-Type: application/json"  \
    -d '{"command":"replace facts","version":3,"payload":{"name":"example_server","environment":"DEV","values":{"key1":"value2","key2":"value3","key3":"value1"}}}' \
    http://localhost:8080/v3/commands

And that appeared to break it:

2015-05-01 17:43:00,027 ERROR [c.p.p.command] [74417443-1c92-433c-b7c8-c8113b0f7441] [replace facts] Retrying after attempt 4, due to: java.sql.BatchUpdateException: Batch entry 0 DELETE FROM fact_values fv
           WHERE fv.id in (31,32,33)
             AND NOT EXISTS (SELECT 1 FROM facts f
                               WHERE f.fact_value_id in (31,32,33)
                                 AND f.fact_value_id = fv.id
                                 AND (f.factset_id, f.fact_path_id) NOT in ((1,25),(1,26),(1,27))) was aborted.  Call getNextException to see the cause.
java.sql.BatchUpdateException: Batch entry 0 DELETE FROM fact_values fv
           WHERE fv.id in (31,32,33)
             AND NOT EXISTS (SELECT 1 FROM facts f
                               WHERE f.fact_value_id in (31,32,33)
                                 AND f.fact_value_id = fv.id
                                 AND (f.factset_id, f.fact_path_id) NOT in ((1,25),(1,26),(1,27))) was aborted.  Call getNextException to see the cause.
        at org.postgresql.jdbc2.AbstractJdbc2Statement$BatchResultHandler.handleError(AbstractJdbc2Statement.java:2746) ~[puppetdb.jar:na]

SQL log:

ERROR:  update or delete on table "fact_values" violates foreign key constraint "fact_value_id_fk" on table "facts"
DETAIL:  Key (id)=(31) is still referenced from table "facts".
STATEMENT:  DELETE FROM fact_values fv
                   WHERE fv.id in ($1,$2,$3)
                     AND NOT EXISTS (SELECT 1 FROM facts f
                                       WHERE f.fact_value_id in ($4,$5,$6)
                                         AND f.fact_value_id = fv.id
                                         AND (f.factset_id, f.fact_path_id) NOT in (($7,$8),($9,$10),($11,$12)))

all my facts:

puppetdb=# select * from facts f inner join fact_values fv on f.fact_value_id=fv.id inner join fact_paths fp on f.fact_path_id=fp.id;
 factset_id | fact_path_id | fact_value_id | id | value_type_id |                value_hash                | value_integer | value_float | value_string | value_boolean | value_json | id | depth | name | path 
------------+--------------+---------------+----+---------------+------------------------------------------+---------------+-------------+--------------+---------------+------------+----+-------+------+------
          1 |           25 |            31 | 31 |             0 | daf626c4ebd6bdd697e043111454304e5fb1459e |               |             | value1       |               |            | 25 |     0 | key1 | key1           
          1 |           26 |            32 | 32 |             0 | a4b6834d78bfee9a445806ff3b3664662a10ebe5 |               |             | value3       |               |            | 26 |     0 | key3 | key3
          1 |           27 |            33 | 33 |             0 | 849988af22dbd04d3e353caf77f9d81241ca9ee2 |               |             | value2       |               |            | 27 |     0 | key2 | key2
(3 rows)

I'm not sure why it started me at fact_value_id 31 - I had previously shut down puppetdb, dropped the database, and deleted the KahaDB for the MQ to try to clean it up.

Anyway, hope this may help. I'll go back and get the other data you asked for next week.

Comment by Wyatt Alt [ 2015/05/01 ]

Simon Oxwell Wow, nice going! Thanks for tracking it down so specifically. We have the issue reproduced and are testing a fix currently.

Duncan McNaught Thanks for that data. If your scenario matches the one Simon is describing I don't think we need anything else.

We'll update this thread once we release a bugfix – hopefully sometime next week. Thanks again for the help, guys!

Wyatt

Comment by Simon Oxwell [ 2015/05/03 ]

Hi Wyatt,

Ah, great. That's awesome. Glad to be of help. Look forward to trying out the fix.

Simon

Comment by Wyatt Alt [ 2015/05/07 ]

Simon Oxwell Duncan McNaught we released a fix today in PDB 2.3.4. Give it a shot when you can and let us know how it goes!

Wyatt

Comment by Simon Oxwell [ 2015/05/10 ]

Hi Wyatt,

I've put the new packages in, and it seems to be going well so far! Will update again in a few hours.

Comment by Duncan McNaught [ 2015/05/11 ]

I put the new package in 3 hours ago - the errors are not occurring anymore for us.
Thanks for the quick fix.

Comment by Simon Oxwell [ 2015/05/11 ]

Yep - looking good.

Thanks!

(now to figure out how to clear the DLO edit: which is dead easy)

Comment by Priscilla Piedra [ 2016/04/06 ]

Hi,

This bug was supposedly fixed in version 2.3.4, but I'm hitting the same issue on PDB 2.3.8:

java.sql.BatchUpdateException: Batch entry 0 DELETE FROM fact_values fv
WHERE fv.id in (131215,132501,132501,175,129002)
AND NOT EXISTS (SELECT 1 FROM facts f
WHERE f.fact_value_id in (131215,132501,132501,175,129002)AND f.fact_value_id = fv.id
AND (f.factset_id,
f.fact_path_id,
f.fact_value_id) NOT in ((5790,111587,131215),(5790,118579,132501),(5790,110159,132501),(5790,120871,175),(5790,111511,129002))) was aborted. Call getNextException to see the cause.
at org.postgresql.jdbc2.AbstractJdbc2Statement$BatchResultHandler.handleError(AbstractJdbc2Statement.java:2746) ~[puppetdb.jar:na]
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1887) ~[puppetdb.jar:na]
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:405) ~[puppetdb.jar:na]
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2893) ~[puppetdb.jar:na]
at com.jolbox.bonecp.StatementHandle.executeBatch(StatementHandle.java:469) ~[puppetdb.jar:na]
at clojure.java.jdbc.internal$do_prepared_STAR_$fn__5845.invoke(internal.clj:356) ~[na:na]
at clojure.java.jdbc.internal$transaction_STAR_.invoke(internal.clj:223) ~[na:na]
at clojure.java.jdbc.internal$do_prepared_STAR_.doInvoke(internal.clj:356) ~[na:na]
at clojure.lang.RestFn.applyTo(RestFn.java:139) ~[puppetdb.jar:na]
at clojure.core$apply.invoke(core.clj:626) ~[puppetdb.jar:na]
at clojure.java.jdbc$do_prepared.doInvoke(jdbc.clj:206) ~[na:na]
at clojure.lang.RestFn.invoke(RestFn.java:423) ~[puppetdb.jar:na]
at com.puppetlabs.puppetdb.scf.storage$eval9406$delete_pending_value_id_orphans_BANG__9407$fn9408$fn_9409.invoke(storage.clj:813) ~[na:na]
at clojure.lang.AFn.applyToHelper(AFn.java:165) ~[puppetdb.jar:na]
at clojure.lang.AFn.applyTo(AFn.java:144) ~[puppetdb.jar:na]
at clojure.core$apply.invoke(core.clj:624) ~[puppetdb.jar:na]
at clojure.core$map$fn__4260.invoke(core.clj:2578) ~[puppetdb.jar:na]
at clojure.core$map$fn__4245.invoke(core.clj:2559) ~[puppetdb.jar:na]
at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[puppetdb.jar:na]
at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[puppetdb.jar:na]
at clojure.lang.RT.seq(RT.java:484) ~[puppetdb.jar:na]
at clojure.core$seq.invoke(core.clj:133) ~[puppetdb.jar:na]
at clojure.core$dorun.invoke(core.clj:2855) ~[puppetdb.jar:na]
at com.puppetlabs.puppetdb.scf.storage$eval9406$delete_pending_value_id_orphans_BANG__9407$fn_9408.invoke(storage.clj:801) ~[na:na]
at com.puppetlabs.puppetdb.scf.storage$eval9406$delete_pending_value_id_orphans_BANG___9407.invoke(storage.clj:787) ~[na:na]
at com.puppetlabs.puppetdb.scf.storage$eval9585$update_facts_BANG__9586$fn9590$fn_9592.invoke(storage.clj:1013) ~[na:na]
at clojure.java.jdbc.internal$transaction_STAR_.invoke(internal.clj:223) ~[na:na]
at com.puppetlabs.puppetdb.scf.storage$eval9585$update_facts_BANG__9586$fn_9590.invoke(storage.clj:976) ~[na:na]
at com.puppetlabs.puppetdb.scf.storage$eval9585$update_facts_BANG___9586.invoke(storage.clj:969) ~[na:na]
at com.puppetlabs.puppetdb.scf.storage$eval9789$replace_facts_BANG__9790$fn9794$fn_9796.invoke(storage.clj:1235) ~[na:na]
at com.puppetlabs.puppetdb.scf.storage.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source) ~[na:na]
at com.yammer.metrics.core.Timer.time(Timer.java:91) ~[puppetdb.jar:na]
at com.puppetlabs.puppetdb.scf.storage$eval9789$replace_facts_BANG__9790$fn_9794.invoke(storage.clj:1232) ~[na:na]
at com.puppetlabs.puppetdb.scf.storage$eval9789$replace_facts_BANG___9790.invoke(storage.clj:1225) ~[na:na]
at com.puppetlabs.puppetdb.command$eval13360$fn_13363$fn_13376.invoke(command.clj:381) ~[na:na]
at com.puppetlabs.jdbc$with_transacted_connection_fn$fn_6256$fn6257$fn_6258.invoke(jdbc.clj:290) ~[na:na]
at clojure.java.jdbc.internal$transaction_STAR_.invoke(internal.clj:204) ~[na:na]
at com.puppetlabs.jdbc$with_transacted_connection_fn$fn_6256$fn_6257.invoke(jdbc.clj:290) ~[na:na]
at clojure.java.jdbc.internal$with_connection_STAR_.invoke(internal.clj:186) ~[na:na]
at com.puppetlabs.jdbc$with_transacted_connection_fn$fn__6256.invoke(jdbc.clj:287) ~[na:na]
at com.puppetlabs.jdbc$eval6234$retry_sql_STAR__6235$fn6236$fn_6237.invoke(jdbc.clj:259) ~[na:na]
at com.puppetlabs.jdbc$eval6234$retry_sql_STAR__6235$fn_6236.invoke(jdbc.clj:258) ~[na:na]
at com.puppetlabs.jdbc$eval6234$retry_sql_STAR___6235.invoke(jdbc.clj:250) ~[na:na]
at com.puppetlabs.jdbc$with_transacted_connection_fn.invoke(jdbc.clj:286) ~[na:na]
at com.puppetlabs.puppetdb.command$eval13360$fn__13363.invoke(command.clj:379) ~[na:na]
at clojure.lang.MultiFn.invoke(MultiFn.java:231) ~[puppetdb.jar:na]
at com.puppetlabs.puppetdb.command$produce_message_handler$fn__13532.invoke(command.clj:647) ~[na:na]
at com.puppetlabs.puppetdb.command$wrap_with_discard$fn_13481$fn_13485.invoke(command.clj:554) ~[na:na]
at com.puppetlabs.puppetdb.command.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source) ~[na:na]
at com.yammer.metrics.core.Timer.time(Timer.java:91) ~[puppetdb.jar:na]
at com.puppetlabs.puppetdb.command$wrap_with_discard$fn__13481.invoke(command.clj:553) ~[na:na]
at com.puppetlabs.puppetdb.command$wrap_with_exception_handling$fn_13466$fn_13467.invoke(command.clj:507) ~[na:na]
at com.puppetlabs.puppetdb.command.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source) ~[na:na]
at com.yammer.metrics.core.Timer.time(Timer.java:91) ~[puppetdb.jar:na]
at com.puppetlabs.puppetdb.command$wrap_with_exception_handling$fn__13466.invoke(command.clj:506) ~[na:na]
at com.puppetlabs.puppetdb.command$wrap_with_command_parser$fn__13476.invoke(command.clj:529) [na:na]
at com.puppetlabs.puppetdb.command$wrap_with_meter$fn__13456.invoke(command.clj:467) [na:na]
at com.puppetlabs.puppetdb.command$wrap_with_thread_name$fn__13490.invoke(command.clj:569) [na:na]
at com.puppetlabs.mq$create_message_listener$reify__10375.onMessage(mq.clj:277) [na:na]
at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:560) [puppetdb.jar:na]
at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:498) [puppetdb.jar:na]
at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:467) [puppetdb.jar:na]
at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:325) [puppetdb.jar:na]
at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:263) [puppetdb.jar:na]
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1058) [puppetdb.jar:na]
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1050) [puppetdb.jar:na]
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:947) [puppetdb.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]

We are running Postgres 9.5 and PuppetDB 2.3.8.

Comment by Wyatt Alt [ 2016/04/06 ]

Hey Priscilla Piedra, I have to warn you that a fix for 2.3.8 is pretty unlikely at this point, but we can help you debug this in any case. To get the ball rolling, can you give the output of the following, executed against the puppetdb database?

select factset_id, fact_value_id, value_hash, value_integer, value_float,
       value_string, value_boolean, value_json, name, path
  from facts f
  inner join fact_values fv on f.fact_value_id=fv.id
  inner join fact_paths fp on f.fact_path_id=fp.id
  where f.fact_value_id in (131215,132501,132501,175,129002)
  order by fact_value_id;

Wyatt

Comment by Priscilla Piedra [ 2016/04/07 ]

Hi Wyatt Alt!! Thanks for helping me with this! I sent the output of this query to your email, wyatt@puppetlabs.com (it's a big output).

Comment by Wyatt Alt [ 2016/04/10 ]

Priscilla Piedra I've spent some time this morning looking into this – no leads yet at all, but here are a few notes in random order (partly for you, partly for the team):

  • You have 242,169 occurrences of this problem across 13,940 nodes, which, based on your email, is probably more or less your whole fleet
  • There are four unique fact values that are causing the problem. They are all strings, with no particular pattern. 99.5% of the occurrences are caused by the empty string.
  • There are a relatively small number (24) of actual facts represented. They are

     [1] "installed_software"             "installed_hotfixes_security"    "nics"                           "hba_properties"                 "volumes"                       
     [6] "distinguished_name"             "installed_hotfixes_nonsecurity" "ssl_certificates"               "sqlserver_tcp_port_1"           "hp_diskdrives"                 
    [11] "hp_arraycontrollers"            "hp_storagevolumes"              "interfaces"                     "sqlserver_tcp_port_4"           "sqlserver_tcp_port_5"          
    [16] "apt_package_updates"            "serialnumber"                   "bigfix_admingroup"              "sqlserver_tcp_port"             "sqlserver_tcp_port_2"          
    [21] "sqlserver_tcp_port_3"           "sqlserver_tcp_port_6"           "hp_proliant_sp_version"         " S=CA"
    

  • The most frequently offending nodes are 30496, 4166, and 5790, each of which has a huge number of entries in the ssl_certificates fact, which is a large array of maps. Many paths within the ssl_certificates fact are associated with both the empty-string value and "CAPI_FMSDDSPRD01", which is somewhat reminiscent of the original cause of this ticket
  • Given how widely conserved I'd expect the empty string to be, I imagine there are very few circumstances in which it could legitimately be deleted from the fact_values table. This leads me to think that the issue here is different from the original one in this ticket, which relied on a fact value being present in exactly two places and swapped within a single puppet run. This feels like a more fundamental bug in our logic around when to delete.

Priscilla Piedra Can you give a bit more information on the circumstances around which this was encountered? Some things I'd be interested in:

  • Have you been on 2.3.8 for a long time, or did you recently upgrade?
  • Did this start occurring suddenly, or has it been going on for a while?
  • If suddenly, do any of the fact names I listed above correspond in an obvious way to recent changes?
  • Do you use a single PDB + postgres/single master, or is your PDB/postgres setup more complicated than that?
  • (unlikely) Does anyone ever perform manual deletions on this database?

Additionally, can we get some postgres logs that show the deletion failures, just to confirm that it's the same foreign key constraint violation as in this ticket? A full PDB log showing the issue would be helpful as well.

Finally, this next week will be a bit unusual for the PuppetDB team, since we're all in the same town doing a weeklong intensive planning session. If we don't seem responsive, that's the reason, so apologies in advance. I'll share your data with the team next week and we'll probably end up splitting this out into a new ticket. Once we can confirm the exact foreign key violation that's causing this I can give you a SQL statement that will relieve the issue and allow some puppet runs to complete, but my guess is the problem will come right back.

Wyatt

Comment by Priscilla Piedra [ 2016/04/13 ]

Hi Wyatt Alt !

Thanks for your help!

To answer your questions:

1) have you been on 2.3.8 for a long time or have you recently upgraded:
We've been on 2.3.8 for about two months; before that we ran 2.2.2.

2) did this start occurring suddenly or has it been going on for a while
This problem appeared suddenly, about two weeks ago.

3) if suddenly, do any of the fact names I listed above correspond in an obvious way to recent changes?
No, the facts are the same; there have been no recent changes to them.

4) Do you use a single PDB + postgres/single master or is your PDB/postgres setup more complicated than that?
We use a single postgres master, but puppetdb is installed on each puppet master (we have 10 puppet masters).

5) (unlikely) does anyone ever perform manual deletions on this database?
No manual deletions, but each PDB is configured with report-ttl = 14d.

I sent to your email the output of the log.

Thanks again!!

Comment by Wyatt Alt [ 2016/04/13 ]

Priscilla Piedra Thanks – the postgres logs you gave me show that the reason the deletion is failing is not a foreign key constraint violation, but this:

2016-04-12 15:26:28 GMT puppetdb puppetdb_rw ERROR:  stack depth limit exceeded
2016-04-12 15:26:28 GMT puppetdb puppetdb_rw HINT:  Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.

Can you check

ulimit -s

and change postgres's max_stack_depth parameter to be a megabyte or so less than that (this will require a postgres restart)? My ulimit reports kilobytes, so if yours does the same, make sure to make the appropriate conversion.

My guess is that once you've ended up in the situation you're in, it can only get worse. There's a good chance that if raising max_stack_depth allows the delete to succeed, you can then lower it again if needed. The configuration parameter is documented here:

http://www.postgresql.org/docs/9.5/static/runtime-config-resource.html

If the delete still fails, I'd like to see the same postgres logs with the new failure. In that event we'll probably need to clear those facts out manually via a truncate or a targeted delete. It's still an open question how you exceeded the stack depth originally – nothing you've reported so far indicates anything else out of the ordinary.

Comment by Priscilla Piedra [ 2016/04/13 ]

Hi Wyatt Alt

Do you know what a good value for max_stack_depth would be? My current value is 2MB.

Comment by Wyatt Alt [ 2016/04/13 ]

Priscilla Piedra the postgres docs I linked suggest a megabyte less than the output of

ulimit -s

On my machine ulimit -s outputs 8192, which is in kilobytes, so personally I would try 7MB. Your case may be different, though – it depends on what ulimit reports.
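A quick sketch of that conversion (assuming, as on most Linux systems, that ulimit -s reports kilobytes; the helper name is hypothetical):

```python
# Hypothetical helper: derive a max_stack_depth value from `ulimit -s`
# output in kilobytes, leaving roughly 1MB of headroom below the OS limit
# as the postgres docs recommend.
def suggested_max_stack_depth(ulimit_s_kb: int) -> str:
    return f"{ulimit_s_kb // 1024 - 1}MB"

print(suggested_max_stack_depth(8192))  # 7MB
```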

Comment by Priscilla Piedra [ 2016/04/18 ]

Hi Wyatt Alt I have applied this config on the DB and I haven't seen the error since... it seems to have worked!
Thanks for your help

Comment by Wyatt Alt [ 2016/04/18 ]

Priscilla Piedra good to hear! I still have no idea how you might have ended up in this situation to begin with, though, so let us know if the issue comes back.

Comment by Priscilla Piedra [ 2016/04/19 ]

Hi Wyatt,

Just wondering: is there an established maximum size for facts (e.g. a max of 500
characters), or for the number of objects in the case of a structured fact, or can I
store information without limit?
Would setting max_stack_depth to 6MB or higher cause any performance
issues on the db? Is there a guideline for it?

Thanks


Comment by Wyatt Alt [ 2016/04/19 ]

Priscilla Piedra there's no real maximum for fact character size, but 500 characters would be totally safe. It tends to be really large, highly dynamic facts that cause issues like the one you saw. I imagine that if the ssl_certificates fact is extremely large and an element were inserted at the beginning of the array, you might run into the issue you did, but it's hard for me to know without actually seeing the fact. If that fact is something your company created (rather than, for instance, one created by a module), one change you could make to reduce the strain on PuppetDB would be to reshape it so that the top-level structure is a hash (keyed on certname, for instance) rather than an array. If it doesn't cause further issues, though, you may as well leave it.

I don't expect the changed max_stack_depth setting to cause performance drawbacks, but I would also think that now that you've successfully executed the delete, you should be able to revert it back to 2mb without trouble. The only guidelines I can give you are the ones in the postgres docs I linked, which recommend making the change that you made, so in principle it should be fine to leave, but let us know if you hit more issues.

Generated at Sat Aug 24 21:33:16 PDT 2019 using JIRA 7.7.1#77002-sha1:e75ca93d5574d9409c0630b81c894d9065296414.