Haskell, R, and HaskellR: Combining the best of two worlds (talk at UseR! 2017)

Earlier today, I presented at UseR! 2017 about HaskellR: a great piece of software, developed by Tweag I/O, that lets you seamlessly use R from Haskell.

It was my first UseR!, it was a great experience, and if I had the time I’d like to write a separate blog post about it, as there were things that did not quite align with my prior expectations… Food for thought, but not the topic of this post. (Mainly this would be about how the academic talks compared to the non-academic ones.)

So, why HaskellR? If you’ll allow me one personal note… For the ex-psychologist, ex-software-developer, ex-database administrator, now “in over my head” data scientist and machine learning/deep learning person that I am (see this post for that story), there has always been some fixed point of interest (ideal, you could say), and that is the elegance of functional programming. It all started with SICP, which I first read as a (Java) programmer and recently read again (partly) when preparing R 4 hackers, a talk focused in large part on the functional programming features of R.

For a database administrator, unless you’re very lucky, it’s hard to integrate use of a functional programming language into your work. How about deep learning and/or data science?
For deep learning, there’s Chris Olah’s wonderful blog post linking deep networks to functional programs, but the reality (of widely used frameworks) looks different: TensorFlow, Keras, PyTorch… it’s mostly Python around there, and whatever the niceties of Python (readability, list comprehensions…) writing Python certainly does not feel like writing FP code at all (much less than writing R!).

So in practice, the connections between data science/machine learning/deep learning and functional programming are scarce. If you look for connections, you will quickly stumble upon the Tweag I/O guys’ work: They’ve not just created HaskellR, they’ve also made Haskell run on Spark, thus enabling Haskell applications to use Spark’s MLlib for large-scale machine learning.

What, then, is HaskellR? It’s a way to seamlessly mix R code and Haskell code, with full interoperability in both directions. You can do that in source files, of course, but you can also quickly play around in the interpreter, appropriately called H (no, I was not thinking of its addictive potential here ;-)), and even use Jupyter notebooks with HaskellR! In fact, that’s what I did in the demos.

If you’re interested in the technicalities of the implementation, you’ll find that documented in great detail on the HaskellR website (and even more, in their IFL 2014 paper), but otherwise I suggest you take a look at the demos from my talk: First, there’s a notebook showing how to use HaskellR, how to get values from Haskell to R and vice versa, and then, there’s the trading app scenario notebook: Suppose you have a trading app written in Haskell – it’s gotta be lightning fast and as bug-free as possible, right?
But, how about nice visualizations, time series diagnostics, all kinds of sophisticated statistical and machine learning algorithms… Chances are, someone’s implemented that algorithm in R already! Let’s take ARIMA – one line of code with auto.arima from R.J. Hyndman’s forecast package! Visualization? ggplot2, of course! And last but not least, there’s an easy way to do deep learning with R’s keras package (interfacing to Python Keras).

Besides the notebooks, you might also want to check out the slides, especially if you’re an R user who hasn’t had much contact with Haskell. Ever wondered why the pipe looks the way it looks, or what the partial and compose functions are doing?
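
For R users who want a first taste before opening the slides, here is a minimal sketch in plain Haskell (no HaskellR required) of partial application and composition. The function names (`scale`, `double`, `doubleThenShow`, `piped`) are made up for illustration, and likening `(&)` to R's pipe is only an analogy, not something taken from the slides.

```haskell
import Data.Function ((&))

-- Partial application: Haskell functions are curried, so supplying only
-- the first argument of 'scale' yields a new one-argument function.
scale :: Int -> Int -> Int
scale factor x = factor * x

double :: Int -> Int
double = scale 2

-- Composition: (f . g) x == f (g x), read from right to left.
doubleThenShow :: Int -> String
doubleThenShow = show . double

-- Pipe style: x & f == f x, read from left to right
-- (hypothetical example pipeline, loosely comparable to an R pipe chain).
piped :: [Int] -> Int
piped xs = xs & filter even & map double & sum
```

Loaded into GHCi, `piped [1, 2, 3, 4]` keeps 2 and 4, doubles them to 4 and 8, and sums them to 12.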

Last but not least, a thousand thanks to the guys over at Tweag I/O, who’ve been incredibly helpful in getting the whole setup to run (the best way to get it up and running on Fedora is using Nix, which I didn’t have any prior experience with… just at a second level of parentheses, I think I’d like to know more about Nix, the package manager and the OS, now too ;-)). This is really the great thing about open source, the cool stuff people build and how helpful they are! So thanks again, guys – I hope to be doing things “at the interface” of ML/DL and FP more often in the future!

The talk was recorded, and can be viewed here.

R 4 hackers

Yesterday at Trivadis Tech Event, I talked about R for Hackers. It was the first session slot on Sunday morning, it was a crazy, nerdy topic, and yet there were, like, 30 people attending! An emphatic thank you to everyone who came!

R, a crazy, nerdy topic – why’s that, you’ll be asking? What’s so nerdy about using R?
Well, it was about R. But it was neither an introduction (“how to get things done quickly with R”), nor was it even about data science. True, you do get things done super efficiently with R, and true, R is great for data science – but this time, it really was about R as a language!

Because as a language, too, R is cool. In contrast to most object-oriented languages, it (at least in its most widely used version, S3) uses generic-function OO, not message-passing OO (ok, I don’t know if this is cool, but it’s really instructive to see how little you need to implement an OO system!).
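
To see how little is needed, here is a toy model of generic-function dispatch, written in Haskell rather than R. It is a sketch of the idea only, not of how R implements S3 internally, and `Obj`, `Generic`, `call` and `describe` are hypothetical names. The point it tries to make is that methods live in the generic function, keyed by class name, rather than inside the object.

```haskell
import Data.Maybe (mapMaybe)

-- An "object" is plain data plus a class vector (compare R's class attribute).
data Obj = Obj
  { classes :: [String]
  , payload :: [Double]
  }

-- A generic function owns its methods: a table keyed by class name,
-- plus a default method used when no class matches.
data Generic r = Generic
  { methods  :: [(String, Obj -> r)]
  , fallback :: Obj -> r
  }

-- Dispatch: walk the class vector in order; the first class that has a
-- registered method wins, otherwise the default method is called.
call :: Generic r -> Obj -> r
call generic obj =
  case mapMaybe (\cls -> lookup cls (methods generic)) (classes obj) of
    (method : _) -> method obj
    []           -> fallback generic obj

-- A made-up generic with one method for the (made-up) class "temperature".
describe :: Generic String
describe = Generic
  { methods  = [("temperature", \o -> "temperatures, n = " ++ show (length (payload o)))]
  , fallback = const "some object"
  }
```

With these definitions, `call describe (Obj ["temperature", "numeric"] [20.5, 21.0, 19.8])` picks the `"temperature"` method, while an object whose class vector is only `["numeric"]` falls through to the default.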

What definitely is cool though is how R is, quite a bit, a functional programming language! Even using base R, you can write in a functional style, and then there’s Hadley Wickham’s purrr that implements things like function composition or partial application.

Finally, the talk goes into base object internals – closures, builtins, specials… and it ends with a promise … 😉
So, here’s the talk: rpubs, pdf, github. Enjoy!

Deadlock parser – parsing lmd0 trace files

Does “your” application produce deadlocks often? Or to phrase it more diplomatically, a little bit … well, too often, from time to time at least?
If so, you might want to provide developers with useful debug information – what sql was executed, where from, etc.
However, in RAC, the lmd0 trace files, which log the deadlocks, don’t give away information lightly – in 11g at least (things are different in 12c, and to be precise, I am talking about here only, as I do not currently have access to an instance).

Basically, what you need is the lmd0 trace files from all instances, and then, to combine information from three sections of data:

(1) the deadlock graph (“Wait-For-Graph”), which is written in one instance only (per deadlock), looks like this …

Global Wait-For-Graph(WFG) at ddTS[0.b650] :
BLOCKED 0x74a6d83f0 5 wq 2 cvtops x1 TX 0x3e0003.0x7b225(ext 0x6,0x0)[9A000-0004-0000001C] inst 4 
BLOCKER 0x74ede07b8 5 wq 1 cvtops x28 TX 0x3e0003.0x7b225(ext 0x6,0x0)[AC000-0003-000001B4] inst 3 
BLOCKED 0x74edebdf8 5 wq 2 cvtops x1 TX 0xd00020.0x41ea(ext 0x39,0x0)[AC000-0003-000001B4] inst 3 
BLOCKER 0x6fbc1f388 5 wq 1 cvtops x28 TX 0xd00020.0x41ea(ext 0x39,0x0)[9A000-0004-0000001C] inst 4 

*** 2014-03-12 10:36:49.760
* Cancel deadlock victim lockp 0x74a6d83f0 

… and allows linking to – via the resource name (here, e.g., “TX 0x3e0003.0x7b225”) – (2) a section containing resource information (including the granted queue and the convert queue, at the bottom) …

Global blockers dump start:---------------------------------
DUMP LOCAL BLOCKER/HOLDER: block level 5 res [0x3e0003][0x7b225],[TX][ext 0x6,0x0]
----------resource 0x75dddead8----------------------
resname       : [0x3e0003][0x7b225],[TX][ext 0x6,0x0]
hash mask     : x3
Local inst    : 4
dir_inst      : 3
master_inst   : 3
hv idx        : 124
hv last r.inc : 65
current inc   : 65
hv status     : 0
hv master     : 3
open options  : dd 
Held mode     : KJUSERNL
Cvt mode      : KJUSEREX
Next Cvt mode : KJUSERNL
msg_seq       : 0x1
res_seq       : 23
grant_bits    : KJUSERNL 
count         : 1         0         0         0         0         0
val_state     : KJUSERVS_NOVALUE
valblk        : 0xf0655697ff7f00000000000000000000 .eV
access_inst   : 3
vbreq_state   : 0
state         : x8
resp          : 0x75dddead8
On Scan_q?    : N
Total accesses: 509
Imm.  accesses: 440
Granted_locks : 0 
Cvting_locks  : 1 
value_block:  f0 65 56 97 ff 7f 00 00 00 00 00 00 00 00 00 00
lp 0x74a6d83f0 gl KJUSERNL rl KJUSEREX rp 0x75dddead8 [0x3e0003][0x7b225],[TX][ext 0x6,0x0]
  master 3 gl owner 0x75d451a00 possible pid 11039 xid 9A000-0004-0000001C bast 0 rseq 23 mseq 0 history 0x495149da
  convert opt KJUSERGETVALUE  

… and, via the lock address (here, e.g., “0x74a6d83f0”), to (3) a section containing lock details, such as grant and request level, the sid, the OS user and oracle username, as well as the client machine and, most interestingly for the application developer, the SQL:

----------enqueue 0x74a6d83f0------------------------
lock version     : 14509
Owner inst       : 4
grant_level      : KJUSERNL
req_level        : KJUSEREX
bast_level       : KJUSERNL
notify_func      : (nil)
resp             : 0x75dddead8
procp            : 0x75d9c7eb8
pid              : 11039
proc version     : 36
oprocp           : (nil)
opid             : 11039
group lock owner : 0x75d451a00
possible pid     : 11039
xid              : 9A000-0004-0000001C
dd_time          : 10.0 secs
dd_count         : 1
timeout          : 0.0 secs
On_timer_q?      : N
On_dd_q?         : Y
lock_state       : OPENING CONVERTING 
ast_flag         : 0x0
Open Options     : KJUSERDEADLOCK 
Convert options  : KJUSERGETVALUE 
History          : 0x495149da
Msg_Seq          : 0x0
res_seq          : 23
valblk           : 0x566a560a000000000000646100000000 VjVda
user session for deadlock lock 0x74a6d83f0
  sid: 1686 ser: 50887 audsid: 5210592 user: 110/APPUSER
    flags: (0x41) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
    flags2: (0x40009) -/-/INC
  pid: 154 O/S info: user: oracle, term: UNKNOWN, ospid: 11039
    image: oracle@inst3
  client details:
    O/S info: user: someosuser, term: unknown, ospid: 1234
    machine: server1.our-dmz.de program: JDBC Thin Client
  current SQL:
  update tab1 set the_user='somename' where the_type='sometype' and the_date=TO_TIMESTAMP ('2014-02-27 00:00:00.0', 'YYYY-MM-DD HH24:MI:SS.F
DUMP LOCAL BLOCKER: initiate state dump for DEADLOCK
  possible owner[154.11039] on resource TX-003E0003-0007B225
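
To make the linking between the three sections concrete, here is a minimal sketch of its first step: pulling the two join keys (lock address and resource name) out of a Wait-For-Graph line, and the lock address out of an enqueue header. This is not the code of the actual parser; `Role`, `WfgEntry`, `parseWfgLine` and `enqueueAddress` are hypothetical names, and the line layout is assumed from the single 11g sample shown above, so other versions may well differ.

```haskell
import Data.List (stripPrefix)
import Text.Read (readMaybe)

data Role = Blocked | Blocker deriving (Eq, Show)

-- One BLOCKED/BLOCKER line of the Global Wait-For-Graph.
data WfgEntry = WfgEntry
  { wfgRole     :: Role
  , wfgLock     :: String  -- lock address, key into the enqueue sections
  , wfgResource :: String  -- resource name, key into the resource sections
  , wfgInst     :: Int     -- instance number
  } deriving (Eq, Show)

parseRole :: String -> Maybe Role
parseRole "BLOCKED" = Just Blocked
parseRole "BLOCKER" = Just Blocker
parseRole _         = Nothing

-- Assumed layout (from the sample only):
--   ROLE <lock> <n> wq <n> cvtops <x> <type> <id>(ext <a>,<b>)[<xid>] inst <n>
parseWfgLine :: String -> Maybe WfgEntry
parseWfgLine line =
  case words line of
    [r, lock, _, "wq", _, "cvtops", _, typ, res, _, "inst", n] -> do
      role <- parseRole r
      inst <- readMaybe n
      -- Keep only the part before "(ext", e.g. "TX 0x3e0003.0x7b225".
      pure (WfgEntry role lock (typ ++ " " ++ takeWhile (/= '(') res) inst)
    _ -> Nothing

-- Extract the lock address from a header such as
-- "----------enqueue 0x74a6d83f0------------------------".
enqueueAddress :: String -> Maybe String
enqueueAddress line = do
  rest <- stripPrefix "enqueue " (dropWhile (== '-') line)
  pure (takeWhile (/= '-') rest)
```

Applied to the first BLOCKED line of the sample, `parseWfgLine` yields the lock address `0x74a6d83f0` and the resource `TX 0x3e0003.0x7b225`, and `enqueueAddress` returns the same address for the enqueue header, which is the key that ties the graph entry to the session details and the SQL.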

Now surely you can’t just hand the developers these lmd0 traces, so it seemed to make sense to write a parser for them, which outputs the type of information a developer would like to see (I guess ;-)). (I’m stressing the word developer here, because a DBA might be interested in additional information, like the grant and convert queues, information that is currently parsed, but not printed by the parser.)

The output from the parser looks like this:

***              Deadlocks             ***
First in tracefiles:   2014-03-10 12:16:17
Last in tracefiles:    2014-04-01 03:18:34
Deadlocks encountered: 22

Deadlock at: 2014-03-12 10:36:49
  Address: 74ede07b8    [Resource: 0x3e0003-0x7b225 TX]
  Session id: 677
  User: appuser
  Machine: server1.our-dmz.de
  Current SQL: update tab1 set the_user='somename' where the_type='sometype' and the_date=TO_TIMESTAMP ('2014-02-27 00:00:00.0', 'YYYY-MM-DD HH24:MI:SS.F

  Address: 6fbc1f388    [Resource: 0xd00020-0x41ea TX]
  Session id: 1686
  User: appuser
  Machine: server1.our-dmz.de
  Current SQL: update tab2 set the_info='someinfo' where the_id=5763836

  Address: 74a6d83f0    [Resource: 0x3e0003-0x7b225 TX]
  Session id: 1686
  User: appuser
  Machine: server2.our-dmz.de
  Current SQL: update tab2 set the_info='someinfo' where the_id=5763836

  Address: 74edebdf8    [Resource: 0xd00020-0x41ea TX]
  Session id: 677
  User: appuser
  Machine: server1.our-dmz.de
  Current SQL: update tab1 set the_user='somename' where the_type='sometype' and the_date=TO_TIMESTAMP ('2014-02-27 00:00:00.0', 'YYYY-MM-DD HH24:MI:SS.F

To be honest, this is an uncommonly neat example, at least judging from the “real life” tracefiles I’ve been using for testing purposes. For other deadlocks, the output can be much more difficult to understand.
For one thing, detail information (like SQL, machine …) might be missing for one or several enqueues. Second, especially when TM locks are involved, it might be difficult to understand what’s really happened.
This is where knowledge about the application should help, and the current output generated by the parser is, in fact, intended to be used by developers who would match it to the application server logs.

On the other hand, the parser could be extended to display information that is relevant to the DBA, or in fact produce different output depending on command line flags. (Currently I am, for example, parsing the grant and convert queues, but not displaying that to the user.)
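
Such a switch could be sketched with `System.Console.GetOpt` from GHC's base library. This is not part of the current parser: the `--queues` flag, the `Flag` type, `parseArgs` and the usage text are all hypothetical, as is the assumption that trace files are passed as plain arguments.

```haskell
import System.Console.GetOpt

-- Hypothetical flag: request the DBA-oriented output.
data Flag = ShowQueues deriving (Eq, Show)

options :: [OptDescr Flag]
options =
  [ Option ['q'] ["queues"] (NoArg ShowQueues)
      "also print grant and convert queues (DBA view)"
  ]

-- Split the command line into recognised flags and remaining arguments
-- (assumed here to be trace file names); on unknown options, return the
-- error messages followed by a usage summary.
parseArgs :: [String] -> Either String ([Flag], [FilePath])
parseArgs argv =
  case getOpt Permute options argv of
    (flags, files, []) -> Right (flags, files)
    (_, _, errs)       -> Left (concat errs ++ usageInfo header options)
  where
    -- Program name assumed from the repository name.
    header = "usage: deadlockparser [OPTION...] TRACEFILE..."
```

The printing code would then check `ShowQueues `elem` flags` to decide whether to include the queues, so the developer-oriented output stays the default.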

If anyone would like to try the parser, which is written in Haskell, the source is on GitHub (https://github.com/skeydan/deadlockparser/blob/master/src/Main.hs). I can also provide a compiled version – so you don’t need to be running Haskell, which most probably you aren’t 😉 – if you’re using 64-bit Linux.

For more information on deadlocks, I think the must-reads (as always, of course) are the “deadlock series” by Jonathan Lewis, especially the classic
that points out the importance of how the application handles the situation (so as a DBA, when you see all those deadlocks occurring, you might well ask the developers how the application handles the ORA-00060, instead of just leaning back knowing this does not cause a database problem … 😉 )