An audit log is a full historic account of all events that are relevant for a certain object. In this case, we keep audit logs of each target that is managed by the provisioning server.
Problem
The first issue is where to maintain the audit log. One option is to maintain it on the target, but since the management agent talks to the server, the server could keep the log too.
Then there is the question of how to maintain the log. What events should be in it, and what is an event?
Finally, the audit log should be readable and query-able, so people can review it.
The following use cases can be defined:
- Store event. Stores a new event to the audit log.
- Get events. Queries (a subset of) events.
- Merge events. Merges a set of (new) events with the existing events.
Context
We basically have two contexts:
- Target, limited resources, so we should use something really "lean and mean".
- Server, scalable solution, expect people to query for (large numbers of) events.
Possible solutions
As with all repositories, there should be one location where it is edited. In this case, the logical place to do that is on the target itself, since that is where the changes actually occur. In theory, the server also knows, but that theory breaks down if things fail on the target or other parties start manipulating the life cycle of bundles. The target itself can detect such activities.
The next question is what needs to be logged. And how do we get access to these events?
When storing events, each event can get a unique sequence number. Sequence numbers start with 1 and can be used to determine if you have the complete log.
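Because sequence numbers start at 1 and are unique, completeness can be checked without comparing against any external state: a gap-free log holds exactly as many events as its highest sequence number. A minimal sketch of such a check (a hypothetical helper, not part of any existing API):

```java
import java.util.List;

// Hypothetical helper: a log with sequence numbers starting at 1 is
// complete exactly when the highest number equals the event count.
public class LogCompleteness {
    // 'sequenceNumbers' holds the sequence numbers of all locally stored events.
    public static boolean isComplete(List<Long> sequenceNumbers) {
        long highest = 0;
        for (long n : sequenceNumbers) {
            highest = Math.max(highest, n);
        }
        // Unique numbers 1..highest with no gaps means exactly 'highest' events.
        return highest == sequenceNumbers.size();
    }
}
```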
Assuming the target has limited storage, it might not be possible to keep the full log available locally. There are a couple of reasons to replicate this log to a central server:
- space, as said the full log might not fit;
- safety, when the target is somehow (partly) erased or compromised, we don't want to lose the log;
- remote diagnostics, we want to get an overview of the audit log without actually connecting to the target directly.
When replicating, the following scenarios can occur:
- The target has lost its whole log and really wants to (re)start from sequence number 1.
- The server has lost its whole log and receives a partial log.
Starting with the second scenario, the server always simply collects incoming audit logs, so its memory can be restored from any number of targets or relay servers that report everything they know (again). Hopefully that will lead to a complete log again. If not, there's not much we can do.
The first scenario is potentially more problematic, since the target has no way of knowing (for sure) at which sequence number it had arrived when everything was lost. In theory it might ask (relay) servers, but even those might not have been up to date, so that does not work. The only thing it can do is start a new log at sequence number 1. That means a target can have more than one log, which in turn means we need to be able to identify which log (of each target) we are talking about. Therefore, when a new log is created, it should contain a unique identifier for that log; this identifier should not depend on stored information, so for example we could use the current time in milliseconds, which should be fairly unique, or just some random number.
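As a sketch of the suggestion above (current time or a random number), a fresh log identifier could be derived without any stored state; this helper is an assumption, not an existing API:

```java
import java.util.Random;

// Hypothetical sketch: derive a fresh log identifier without relying on
// any persisted state, since the target may just have lost everything.
public class LogId {
    public static long newLogId() {
        // Mixing the current time with randomness makes an accidental
        // collision between two fresh logs of the same target very unlikely.
        return System.currentTimeMillis() ^ new Random().nextLong();
    }
}
```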
How does the target find the central server? It can use the discovery service; this is not a big issue.
Events should at least contain:
- a datestamp, indicating when the event occurred;
- a checksum and/or signature;
- a short, human readable message explaining the event;
- details, either:
  - in the form of a (possibly multi-line) document;
  - in the form of a set of properties.
The server will add:
- the target ID of the target that logged the event.
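The fields above could be captured in a simple event record; the sketch below is illustrative, and the field names are assumptions rather than an existing API:

```java
import java.util.Map;

// Illustrative sketch of an audit log event with the fields listed above.
public class AuditEvent {
    public final long sequenceNumber;            // assigned when the event is stored
    public final long timestamp;                 // when the event occurred
    public final String checksum;                // checksum and/or signature
    public final String message;                 // short, human readable explanation
    public final Map<String, String> properties; // details as a set of properties
    public String targetId;                      // added by the server, not the target

    public AuditEvent(long sequenceNumber, long timestamp, String checksum,
                      String message, Map<String, String> properties) {
        this.sequenceNumber = sequenceNumber;
        this.timestamp = timestamp;
        this.checksum = checksum;
        this.message = message;
        this.properties = properties;
    }
}
```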
Storage will be resolved differently on the server and the target. On the target, using any kind of database would mean including a considerable library, which makes such solutions impractical there. We might want to consider one for the server though. The options we have are:
- Relational database
- Object database
- XML
- DIY
How do events get logged?
- explicitly, our management agent calls an AuditLog service method;
- implicitly, by logging (certain) events in the system;
Implicit logging can be built on top of the AuditLog service. What we need to monitor is the life cycle layer, which basically means adding a BundleListener and a FrameworkListener. Those capture all state changes of the framework. Technically, we can either add those listeners directly, or use EventAdmin if it is available.
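A minimal sketch of the implicit approach, using stand-in interfaces so the example is self-contained; a real implementation would implement `org.osgi.framework.BundleListener` and register it with the framework's `BundleContext`:

```java
// Stand-in for the AuditLog service's write side described in this document.
interface AuditLog {
    void log(String message);
}

// Stand-in for org.osgi.framework.BundleListener; the real interface
// receives a BundleEvent carrying the bundle and an event type constant.
interface BundleListener {
    void bundleChanged(int eventType, String bundleSymbolicName);
}

// Bridges framework life cycle events into the audit log.
class LifecycleAuditListener implements BundleListener {
    private final AuditLog auditLog;

    LifecycleAuditListener(AuditLog auditLog) {
        this.auditLog = auditLog;
    }

    public void bundleChanged(int eventType, String bundleSymbolicName) {
        auditLog.log("bundle " + bundleSymbolicName + " changed state: " + eventType);
    }
}
```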
What would be the best way for the target to send audit log updates to the server? I don't think we want the server to poll here, so the target should send updates (periodically). So how does it know what to send?
- it could keep track of the last event it sent, sending newer ones after that;
- it could ask for the list of events the server has;
- it could send its highest log event number, get back a list of events the server is missing, and then respond with those events;
- it could just send everything.
Discussion
Having two layers for the audit log makes sense:
- The first, lowest, layer is the AuditLog service that gives access to the log. On the one hand it allows people to log messages, on the other it should provide query access. Those should be split into two different interfaces.
- The second layer builds on top of that. It can be omitted completely, which means logging becomes the responsibility of the application (probably the management agent); it can be implemented using listeners; or it can be implemented using events.
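Splitting the lowest layer into two interfaces, as described above, might look like this; the names are illustrative assumptions, and events are represented as strings for brevity:

```java
import java.util.List;
import java.util.Map;

// Write side: used by the management agent (or listeners) to record events.
interface AuditLogWriter {
    void log(String message, Map<String, String> properties);
}

// Read side: used by tools and the server to review the log.
interface AuditLogReader {
    // Returns all stored events with sequence numbers in [from, to].
    List<String> getEvents(long from, long to);
}
```

Keeping the two roles separate means a component that only needs to log events never sees the query API, and vice versa.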
On the target we should implement a storage solution ourselves, to keep the actual code small. The code should be able to log events quickly (as that will happen far more often than retrieving them).
Communication between the target and server should be initiated by the target. The target can basically send two commands to the server:
- My audit log contains sequence numbers 4-8, tell me your numbers. The server then responds (for example) with 1-6. This indicates we need to send 7-8.
- Here you have events 7-8, can you send me 1-3? The server stores its missing events, and sends back the events it has (always check whether what you get is what you requested).
This is set up in this way so that the same commands can also be used by relay servers to replicate logs between servers and targets.
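The range exchange above amounts to a set difference over sequence numbers. A sketch of that computation for the 4-8 versus 1-6 example (a hypothetical helper, simplified to contiguous ranges):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: which of my sequence numbers [myLow, myHigh] are
// missing on the other side, given its range [theirLow, theirHigh]?
public class RangeDiff {
    public static List<Long> missingOnOtherSide(long myLow, long myHigh,
                                                long theirLow, long theirHigh) {
        List<Long> missing = new ArrayList<>();
        for (long n = myLow; n <= myHigh; n++) {
            if (n < theirLow || n > theirHigh) {
                missing.add(n);
            }
        }
        return missing;
    }
}
```

For the example above: the target holds 4-8 and the server reports 1-6, so the target sends 7-8; conversely the server holds 1-6 against the target's 4-8, so the target should request 1-3.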
Conclusion
- The audit log is maintained on the target.
- On the target, we implement the storage mechanism ourselves to ensure we have a solution with a very small footprint.
- On the server, we use an XStream based solution to store the logs of all the targets.
- Our communication protocol between target and (relay)server however, should probably not rely on XML.
- Our communication protocol between server and (relay)server might rely on XML (determine at design time what makes most sense).