
GOTO 2016 • Discovering Research Ideas Using Semantic Vectors & Machine Learning • Mads Rydahl


Hello, and thank you for joining us. I'm part of a small start-up located here in Denmark, in Aarhus, called UNSILO. We work with big scientific publishers to process article information and to make tools for researchers. So maybe I should start by explaining the mission we set up four years ago, when we started the company.

Our idea was to build a system of discovery services that could make it easy to find patterns across a lot of unstructured text. Today, or a couple of years ago, the way things were linked when you looked at an article and tried to find something similar was through human-annotated editors' keywords; that's how you find related articles in science. The big challenges we saw with that system were, first, that because scientific language is constantly evolving and growing, and new things are being discovered, it's impossible to keep up with hand curation of content. Second, a system also has to be omniscient, because presently it's an author and an editor who look at a paper and try to decide what the important aspects of the article are, and sometimes really interesting discoveries are only apparent in hindsight. So you need an automated system that can correlate a new article to tons of other things that are currently going on and figure out whether people in China are doing something very similar to what you're trying to do. Finally, it has to be unbiased, because right now we have this problem that most of the recommenders and the automated concept curation today is based on collaborative filtering, like the stuff you see on Amazon: people who bought this also bought that. It tends to lead us down the same path, and it tends to make researchers walk straight past the most interesting stuff, because that's what everyone else also does. So we need an unbiased approach that doesn't rely on some kind of popularity ranking like PageRank or collaborative filtering. (The sound is a little odd, is it okay? I don't have that fancy clicker.)
The core technology we've built is based on a lot of open source components, or at least three components. We have a document processing pipeline built around Apache UIMA and Ruta, we run a fairly standard natural language processing pipeline and tools on top of that, and then we use common languages like Python for prototyping, Java, and a lot of the libraries and tools in, I guess, the data scientist's toolbox.
The key challenge in what we're trying to do is that unstructured text basically does not compute. As I said before, there's too much stuff going on for humans to be involved in this process, and even when humans are involved on a higher level, building ontologies to represent the knowledge we have of a certain discipline, it's not going fast enough. All the interesting stuff that was found out yesterday, or last month, or even six months ago, has not made it into a curated ontology yet. So if you really want to be at the forefront, where the money is and where things matter in research, you need a more dynamic approach. Even where there are dictionaries or reference works, they're simply not comprehensive enough.

The second big problem we have is that people are way too creative. They don't use just one name for a certain phenomenon; they have many different variations, and they often add descriptive detail in their own language that makes absolutely no sense to a computer and makes it really difficult to figure out what they're actually talking about. There is no right way to describe anything in the world, and we somehow have to figure out what people are talking about.

And finally, there's all the data that people consider obvious. That's probably the biggest problem for analytics today, or for computer AI in general: all the stuff that people consider obvious and then fail to include in a description of anything. Those are the key problems we're trying to solve.
Here's a piece of text, an abstract of an article from 2006. If you throw this at a regular full-text search or some kind of standard search engine (the real article is probably ten times as long), it's really difficult to see what the text is really about, and if I read it, how do I figure out which other articles talk about the same things? Today we use computers to annotate the words whose meaning we know, the words that are found in common dictionaries and ontologies of the area. At our company we've developed a much more comprehensive way of looking at this: dynamically, statistically deriving longer phrases that mean something, and figuring out which of them mean approximately the same thing. As I said in the remarks for the talk, I'm also going to talk a little bit about where we want to take things and what we're currently working on. As you can see, we're trying to cover all of the information that is actually in an article, to map it out and make it searchable, make it findable.
We're presently working on all of the actions and relationships between these things, so that when you find stuff that talks about A and B, the most relevant article is probably the one that talks about A and B in approximately the same context, or the same sentence, or even talks about how A is related to B. Today you can also do this with distance, the number of words in between, when you use a traditional search engine. But the thing is, when you're working with text, sometimes the number of words in between crosses a paragraph boundary, or sometimes it's the image caption that sits right next to that really interesting other thing you were looking for. And other times the thing you're interested in is mentioned up here together with a third thing, and down here the other thing is mentioned with that same third thing, so they're actually really closely connected, but they're just at opposite ends of the article. You need a better understanding of this, and we actually use graph analytics to understand the proximity of things and the centrality of things in an article.
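Just to illustrate the idea, here is a minimal sketch of scoring concept centrality inside one document. It assumes a simple sentence-level co-occurrence rule and the networkx library; those choices are illustrative assumptions, not a description of our production pipeline.

```python
# Toy sketch: build a graph of the concepts mentioned in one document and
# score how central each concept is. Concepts that co-occur in the same
# sentence get an edge; centrality then separates core topics from
# peripheral mentions, regardless of raw mention counts.
from itertools import combinations
import networkx as nx

# Each inner list holds the normalised concepts found in one sentence.
sentences = [
    ["sodium concentration", "serum sample", "electrode potentiometry"],
    ["electrode potentiometry", "electroanalysis"],
    ["serum sample", "sodium concentration"],
    ["electroanalysis", "measurement error"],
]

graph = nx.Graph()
for concepts in sentences:
    for a, b in combinations(set(concepts), 2):
        # Accumulate a weight for every sentence-level co-occurrence.
        weight = graph.get_edge_data(a, b, {}).get("weight", 0) + 1
        graph.add_edge(a, b, weight=weight)

# Degree centrality: which concepts sit at the core of this document?
centrality = nx.degree_centrality(graph)
for concept, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{score:.2f}  {concept}")
```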
The first step we perform is regular natural language processing. Some of you may be familiar with this, but the simplest part of natural language processing, the thing you can do without too much computation, is part-of-speech tagging: basically assigning word classes to each word. Is this a verb, or is it a noun in this context, is that an adjective? Once we have the part-of-speech tagging, we can find a lot of candidates for potential things in the sentence. As you can see here, we have a sentence from the abstract you just saw: "methods for measuring sodium concentration in serum by indirect sodium-selective electrode potentiometry". I've highlighted underneath, for those who don't read articles on a daily basis, that there are four things here, and an action if you will, in common speech. If we extract all of the things here, they seem pretty straightforward, right? So what's the beef?
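To make that first step concrete, here is a minimal sketch of part-of-speech tagging and candidate phrase extraction. It uses spaCy purely as a stand-in; our actual pipeline is UIMA-based, so the library choice here is an assumption for illustration.

```python
# Minimal sketch: tag parts of speech and pull out candidate noun phrases
# from one sentence of the abstract. Requires the small English model:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

sentence = (
    "Methods for measuring sodium concentration in serum by "
    "indirect sodium-selective electrode potentiometry."
)
doc = nlp(sentence)

# Word classes for each token (noun, verb, adjective, ...)
for token in doc:
    print(f"{token.text:20} {token.pos_}")

# Candidate 'things': the noun phrases found in the sentence
for chunk in doc.noun_chunks:
    print("candidate phrase:", chunk.text)
```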
It turns out you can say these things in many different ways, and if you want to see other content that is closely related to this article, you can't just look at the articles that include those exact words; you also need to look at the ones that mention the same things in different ways. So we basically have to deduplicate. We work with Springer Nature, which is one of the larger scientific publishers in the world. They've given us all of their content, and we've sifted through it. We found upwards of a hundred million things in their content, and after processing it in various ways we deduplicate it down to maybe two or three million different things. And even when you're down at two or three million different things, you still have separation between things that a human reader would probably find to be mostly the same thing. So there's a lot of deduplication you need to do.

Look at the examples here: "concentration of sodium" can be mapped back to "sodium concentration". You can also have sentences like "the electrode potentiometry was indirect"; well, obviously that's the same as "indirect electrode potentiometry". Some people like to call things a methodology rather than a method, and sometimes people talk about serums, plural, rather than serum. These are what we call morphological or syntactic variations, basically the variations that depend on the grammar.
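Here is a tiny sketch of that kind of syntactic and morphological normalisation, reducing different surface forms to one canonical phrase. The specific rules and mappings in it are illustrative assumptions, not our actual rule set.

```python
# Toy normaliser: map simple lexical variants and reorder "X of Y" phrases.
import re

LEMMA_MAP = {
    "methodologies": "method", "methodology": "method",
    "methods": "method", "serums": "serum", "sera": "serum",
}

def normalise(phrase: str) -> str:
    words = [LEMMA_MAP.get(w, w) for w in phrase.lower().split()]
    phrase = " ".join(words)
    # "concentration of sodium" -> "sodium concentration"
    phrase = re.sub(r"^(\w+) of (\w+)$", r"\2 \1", phrase)
    return phrase

for variant in ["concentration of sodium", "Sodium concentration",
                "methodologies", "serums"]:
    print(f"{variant!r:30} -> {normalise(variant)!r}")
```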
We also try to reduce the lexical and semantic variations. That's when authors use synonyms or hypernyms, which are more generic, general terms for the same thing. For parts of our pipeline we also do that sort of abstraction: whenever someone says "method", we might map it back to a more generic term called "mechanism". A serum sample is actually a type of blood sample (serum is blood with something filtered out; that's not my primary field), and for serum sodium concentration, well, sodium is, I guess, the American term for natrium, which is also used sometimes. And indirect electrode potentiometry, which we've now seen a couple of times, is actually a type of electroanalysis. So when we look at longer sentences or longer phrases, we go in and replace each of the tokens with a more generic term to figure out whether this is actually a variation of something we've seen before.

All of this really has nothing to do with machine learning; this is just hard-coded understanding of linguistic variations. We have compound paraphrases and adjectival modifiers and coordinations, where a mention like "the concentration of sodium and magnesium" can be expanded into "concentration of magnesium" and "concentration of sodium", and all of these tedious rules need to be applied before you can do any kind of aggregated understanding.
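A tiny example of the coordination rule; the head/modifier split here is a deliberately naive assumption just to show the idea.

```python
# Toy sketch of coordination expansion: "concentration of sodium and
# magnesium" is expanded into the two phrases it implicitly contains.
def expand_coordination(phrase: str) -> list[str]:
    head, _, tail = phrase.partition(" of ")
    if not tail or " and " not in tail:
        return [phrase]
    return [f"{head} of {item.strip()}" for item in tail.split(" and ")]

print(expand_coordination("concentration of sodium and magnesium"))
# ['concentration of sodium', 'concentration of magnesium']
```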
Then, finally, a couple of things. Often we're looking at fragments of something else, or we're looking at something that contains a fragment which is more interesting. Sometimes it's the indirect potentiometry, and no one else in the world has ever put "sodium selective" in between there, so we have to identify that and take author-specific variations out of the equation, because they mean absolutely nothing to anyone else in the world. And here we also come to the matter of added descriptive detail that can really get in the way of understanding what's going on: "clinically implemented indirect something", or "error-prone indirect ion-selective whatever". These are all things that get in the way of understanding what's really being spoken about.
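One way to picture that fragment matching is an ordered-subsequence test: a longer, author-specific phrase still contains a known shorter phrase in order, with extra words inserted. This sketch is my own illustration of the idea, not our actual matcher.

```python
# Toy sketch: recognise that a longer phrase contains a known shorter phrase
# as an ordered subsequence of tokens, so the author-specific extra words
# ("sodium selective", "error-prone", ...) can be set aside.
def contains_known_phrase(candidate: str, known: str) -> bool:
    it = iter(candidate.lower().split())
    # Every token of the known phrase must appear, in order, in the candidate.
    return all(tok in it for tok in known.lower().split())

print(contains_known_phrase(
    "error-prone indirect sodium selective electrode potentiometry",
    "indirect electrode potentiometry"))                        # True
print(contains_known_phrase("direct potentiometry",
                            "indirect potentiometry"))          # False
```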
Once we have deduplicated all of these things, tons of things really, we look at different types of features. The local features in a document include how many times something is mentioned and what it's connected to. We actually calculate a position in a document graph: we connect all the things mentioned in the document with the relationships that connect them, and then do regular graph analysis to figure out what's central and what's peripheral to what's being talked about. So you can have something that's only mentioned once but is really central, because it's connected to that one very central thing, and you can have stuff out here that is mentioned a couple of times but always in relation to stuff that's non-central. And then of course we run other types of analytics that use the textual context, the words right before and right after a piece of text.

The global features we use are also occurrence counts: the number of documents that contain a given phrase. And we run various fancy algorithms to figure out what the most common variation is. If you have an n-gram, if you will, a phrase of words, what's the most commonly used variation? If you add an additional adjective in front, what's the most commonly used adjective, or what are the two most common ones, and are they sufficiently different to be two different things?
Then of course we also calculate, and I guess many of you are familiar with, tf-idf, which is basically deviation in frequency from a norm: if something occurs more often than it does on average, that's probably a significant phrase. And then we look at distribution across the corpus. A thing can be mentioned very few times, but whenever someone uses it, they mention it over and over again in the same document. That means it probably has some significance, but if you look at it globally and just count the number of documents it occurs in, it may seem insignificant. So we have a concentration score, which basically tells us, when the thing occurs in a document, how likely it is to occur more than once. And then we also do an analysis comparing the distribution across domains, to figure out that this is something that's very common but only in a certain domain. All of these things are fed into our learning algorithms and ranking models.
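Here is a small sketch of two of those global features, document frequency and a concentration-style score. The exact definitions in the code are assumptions for illustration, not our production formulas.

```python
# Toy sketch: document frequency (how many documents mention a phrase at all)
# and "concentration" (given that a document mentions it, how likely is it to
# be mentioned more than once).
from collections import Counter

def global_features(phrase: str, documents: list[list[str]]) -> dict:
    doc_freq = 0
    repeated = 0
    for doc in documents:
        count = Counter(doc)[phrase]
        if count > 0:
            doc_freq += 1
            if count > 1:
                repeated += 1
    return {
        "document_frequency": doc_freq,
        "concentration": repeated / doc_freq if doc_freq else 0.0,
    }

docs = [
    ["hyperaemic flow", "coronary vasodilation", "hyperaemic flow"],
    ["sodium concentration", "serum sample"],
    ["hyperaemic flow"],
]
print(global_features("hyperaemic flow", docs))
# {'document_frequency': 2, 'concentration': 0.5}
```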
We also use the aggregated textual context; this is the word2vec or word-embedding model the previous speaker also mentioned, and I'm going to get back to it in a little while. If we look at all the occurrences of a given phrase across the entire corpus, that tells us something about what it means, or what other things might mean the same thing.

And then of course the biggest thing when you're trying to train a model is the thing that you're training it on. We have two types of data we can train on. We have human training data; this could be the articles themselves. If we hypothesize that a given concept is very central to an article, we can compare and see whether we actually found it in the abstract. If it's in the abstract or in the title, there's a high likelihood that the author also considers it important. So that's one data point, and aggregated over thousands or millions of articles, it can tell us how good we are at selecting the things that authors find important. Of course, if we think we can do better than the authors, that's a lousy way to measure it.
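Here is a minimal sketch of that validation signal, checking whether the concepts we rank highest also appear in the title or abstract. The field names and the simple substring test are assumptions for illustration.

```python
# Toy sketch: if a concept our pipeline ranks as central also shows up in the
# title or abstract, the author presumably agreed it was important. Aggregated
# over many articles, this becomes a precision-like score.
def author_agreement(article: dict, extracted_concepts: list[str]) -> float:
    author_text = (article["title"] + " " + article["abstract"]).lower()
    hits = sum(1 for concept in extracted_concepts
               if concept.lower() in author_text)
    return hits / len(extracted_concepts) if extracted_concepts else 0.0

article = {
    "title": "Indirect sodium-selective electrode potentiometry in serum",
    "abstract": "We measure sodium concentration in serum samples ...",
}
print(author_agreement(article, ["sodium concentration",
                                 "electrode potentiometry",
                                 "hyperaemic flow"]))
# 0.666...
```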
So we also use other types of human training data: behavioral data from the companies we work with. They kindly allow us access to usage patterns. When we present something to users, which of the things we extract did they actually click on and find interesting? And when users are presented with a list of related articles, in the sidebar for instance, which of these were found to be most interesting or clicked on? It turns out, of course, that it's the ones with the promising titles that get clicked on, not necessarily the ones that are most similar, so sometimes you need to make adjustments just to create some link bait.
The other type of data we use is synthetic data. We can construct an artificial corpus, train our models on it, and try to improve our models using the principles we used to create the synthetic data. It's slightly more complex, but if any of you have tried the word2vec demo, the corpus they create there is actually completely synthetic. You can also build partially synthetic data sets. One approach we've tried, which was also used for word2vec, is to use a different search engine to create your artificial corpus: you search for something, maybe two different concepts, two different words, then you mix the results together and remove all traces of the words you searched for, so the only thing left is everything else in the documents. Then you try to figure out whether you can still classify what was what and sort things into the right pile.
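A minimal sketch of that partially synthetic evaluation: strip the query terms from two piles of retrieved documents and check whether a classifier can still tell them apart from what is left. The library choice (scikit-learn) and the tiny corpus are assumptions for illustration.

```python
# Toy sketch: two piles of documents retrieved for two different query terms,
# query terms removed, then a simple classifier on what remains.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def strip_query(text: str, query: str) -> str:
    return re.sub(re.escape(query), " ", text, flags=re.IGNORECASE)

corpus_a = ["serum sodium was measured by electrode potentiometry",
            "serum samples show elevated sodium concentration"]
corpus_b = ["gold nanoparticles were coated with a thin film",
            "thin film coated nanoparticles improve conductivity"]

texts, labels = [], []
for doc in corpus_a:
    texts.append(strip_query(doc, "sodium")); labels.append(0)
for doc in corpus_b:
    texts.append(strip_query(doc, "nanoparticles")); labels.append(1)

features = TfidfVectorizer().fit_transform(texts)
model = LogisticRegression().fit(features, labels)
print("training accuracy:", model.score(features, labels))
```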
So, a little bit about word embeddings, which the previous speaker mentioned. Here's an example. Basically what you do is build a vector, or actually a tensor, a combination of vectors: each word or token or phrase (we work on phrases) in our corpus is defined in this vector space by an aggregation of the vectors of the things it commonly co-occurs with. The traditional word2vec algorithm will just treat all text as tokens, every token as its own vector, and only a few things get concatenated because they belong together. We pre-process the text quite a lot, and after we've deduplicated all these hundred million things, we're down to so few million things that they actually have decent occurrence counts. The big problem when you're looking at longer selections of text is that their statistics are much more sparse than those of each word on its own. So you have a problem with, for instance, "hyperaemic flow": it doesn't necessarily occur that many times, even when you have a million or ten million documents. It's still something so specific that you only have a few hundred occurrences, so it's important to capture all of them, even when the author calls it something different. But after we've done all that deduplication, we end up with a corpus on which we can generate a vector model. And then we use other things on top: we know that coronary vasodilation is actually defined in an ontology and is related to all these different things, so we combine things, using the structured knowledge of that domain to further refine the vector model. That has worked really well for us.
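A minimal sketch of training embeddings over pre-merged phrases, with gensim as the illustrative tool. Collapsing each multi-word concept into a single token (underscores here) and the tiny toy corpus are assumptions; with so few sentences the neighbours are meaningless, it only shows the mechanics.

```python
# Toy sketch: word2vec-style embeddings where deduplicated phrases have
# already been collapsed into single tokens, so "deionized_water" gets its
# own vector.
from gensim.models import Word2Vec

sentences = [
    ["samples", "were", "rinsed", "in", "deionized_water", "before", "analysis"],
    ["electrodes", "were", "stored", "in", "ultrapure_water"],
    ["rinse", "with", "deionized_water", "and", "dry"],
    ["rinse", "with", "double_distilled_water", "and", "dry"],
]

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=50)

# Nearest neighbours in the embedding space: phrases used in similar
# contexts end up with similar vectors.
for phrase, score in model.wv.most_similar("deionized_water", topn=3):
    print(f"{score:.2f}  {phrase}")
```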
Here's just a little data dump from a test a while ago, but what you see are phrases and their occurrence counts in a test corpus of, I think, a million articles. In the first line you can see "deionized water" (it's actually part of a set that extends further to the right): "deionized water" is the same as, or has a similar vector to, "bidistilled water", "ultrapure water", "DI water", "de-ionized water", or "double distilled water". It's important to notice that this is the output of a vector model where, for each concept in the first column, we find the nearest concepts, the ones that appear in the most similar contexts. The algorithm does not even look at the letters; it just has an ID, and it knows the IDs of the things around it. So it's pretty clear that this is actually possible, just from the hypothesis that words that mean approximately the same thing are used in approximately similar contexts: the ten words, or five words, before and after, aggregated over a million documents, will be very similar for things that, although they are different phrases, mean more or less the same thing. You can see that when things are used interchangeably, that is very much the case. For instance, in row 60, I guess, "crucial role" is used more or less interchangeably with "prominent role", "vital role", "fundamental role", "pivotal role", or "essential role". Sounds about right. And it's a great validation: sometimes people work with data sets and rarely ever see anything other than floating-point values; here you can actually look at it and see that it makes sense. And if you're in doubt, when we do limited QA to see whether things have become garbled by some bug introduced somewhere, you can always just look it up on Wikipedia or something and see whether it makes sense. "Pivotal role", "key player", "essential role", yeah. So it actually works; it's possible to run this even on phrases, which I think we have been the first to do.
So the upshot of this, what have we done? We've created human-readable fingerprints. For any given text, regardless of the type of language used, we can extract some phrases whose meaning we know, and we can map them to the most commonly used definition or phrase that means the same thing. For a person skilled in the arts, as they say, it's then kind of easy to see what an article is about. We can rank them and tell you the five or ten things that are most important in an article. And when, as in the graph up there, some author mentions insulin insensitivity and obese children, we will know that an article written a couple of years ago about overweight girls and reduced hormone response is actually talking about the exact same thing. That's a very big leap in the way we recommend text, in science or indeed anywhere. Traditional document similarity relies on, to recap, the words whose meaning we already know, and words can be ambiguous, which is a big problem. So there's what we call the phrase hypothesis, which is what we're working on: when you have a longer selection of words stacked together in the same fashion, they rarely have a different meaning; they usually have a very precise meaning. And the ability to capture those phrases dynamically is basically what we do.
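As a rough illustration of the fingerprint idea: an article is represented by the handful of normalised concepts we rank highest, and related content is whatever shares the most of that fingerprint. The weighting scheme below (plain Jaccard overlap) and the example concepts are assumptions for illustration.

```python
# Toy sketch: rank related articles by fingerprint overlap. Concepts are
# assumed to be already normalised by the earlier deduplication steps.
def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

fingerprints = {
    "article_1": {"insulin sensitivity", "obese children", "hormone response"},
    "article_2": {"insulin sensitivity", "hormone response", "overweight children"},
    "article_3": {"gold nanoparticles", "thin film coating"},
}

query = fingerprints["article_1"]
ranked = sorted(
    ((other, jaccard(query, fp))
     for other, fp in fingerprints.items() if other != "article_1"),
    key=lambda kv: -kv[1],
)
for article, score in ranked:
    print(f"{score:.2f}  {article}")
```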
Once you have these fingerprints, you can produce all kinds of different features that make life easier for researchers. What we've delivered to the partners we work with is, first, as I said, the ability to highlight the principal components of an article. This is an article page; some of you may have seen one. If you search on Google for an article title, you get bounced to a publisher's webpage where that article is presented, and we help make that page better; we help make it easier for readers to understand what's going on. We can pull out key sentences, and we can recommend stuff. We can tell the user: this is where they mention that thing you're interested in; they use some different words, but it's about the same thing. And we can provide related content, basically articles that talk about the same things, and when we do that, we don't just provide a related article; we actually tell you how it overlaps with what you're currently looking at. We can show you: these are the concepts that occur here that also occur in the article you're presently looking at. We've also done an interactive version that allows the user to drill down and explore further: it has to contain this and this, and then get a recommendation. We work very closely with Springer Nature, Scientific American, Macmillan, many of the largest publishers, and we produce things like this. I guess the highlights are a little difficult to see, but in essence this is the non-schematic version of what I just told you: on the right side we have related content, and you can click any of the things you're interested in and get a filtered list of the most similar articles that also contain the thing you're interested in.
We also do other types of visualizations with related content. We can use our technology to find definitions of things: many of these scientific publishers have a large back catalogue of reference works, or teaching books if you will, that define different concepts, so users can click on something like RNA editing and we can pick the best definition we can find in the publisher's literature, and not just rely on the stuff that's on Wikipedia. And, more interestingly, we're also working on building tools that allow researchers to see more of the history that the stuff they're interested in is part of. Here is a tool we call Timeline. For a given article, here sometime in the past, I guess around 2003, the selected article there, we use the reference data, the forwards and backwards citations, to figure out which things were cited by this paper and which papers cited this paper, so forwards and backwards in time. But that's a very, very large set, because a single article often cites 10, 20, 50 other papers, each of which cites another 10, 50, 100 papers, so it's a huge tree. What we then do is basically prune that tree down to the branches that have articles talking about the same thing. That lets you fairly easily identify an article from last year that talks about the same thing and actually, through a couple of links, cites the article you're presently looking at. Or, if you're looking at a recent article, you can ask who was the first author in this citation tree to actually combine this and that in a paper.
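Here is a small sketch of that pruning step: walk the citation graph outwards from the selected article, but only keep branches whose articles still share concepts with the starting fingerprint. The data structures and the overlap test are assumptions for illustration.

```python
# Toy sketch of Timeline pruning over a forward-citation graph.
def prune_citation_tree(start: str,
                        cited_by: dict[str, list[str]],
                        fingerprints: dict[str, set[str]],
                        max_depth: int = 3) -> list[str]:
    anchor = fingerprints[start]
    kept, frontier, seen = [], [(start, 0)], {start}
    while frontier:
        article, depth = frontier.pop()
        if depth >= max_depth:
            continue
        for citing in cited_by.get(article, []):
            if citing in seen:
                continue
            seen.add(citing)
            # Keep (and keep exploring) only branches that still talk about
            # the same things as the article we started from.
            if anchor & fingerprints.get(citing, set()):
                kept.append(citing)
                frontier.append((citing, depth + 1))
    return kept

cited_by = {"A2003": ["B2008", "C2010"], "B2008": ["D2015"]}
fingerprints = {
    "A2003": {"insulin sensitivity", "obese children"},
    "B2008": {"insulin sensitivity", "exercise"},
    "C2010": {"gold nanoparticles"},
    "D2015": {"obese children", "hormone response"},
}
print(prune_citation_tree("A2003", cited_by, fingerprints))
# ['B2008', 'D2015']  (C2010 is pruned: no shared concepts)
```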
So the value we're providing to researchers, and we're kind of proud of this, is that we accelerate the path to successful discovery by pointing directly to what is relevant in an article, and we can provide more relevant suggestions, because they're much more precise than competing technologies. Our little company also provides end-user features, because we believe that an understanding of the algorithms used, and of how different algorithms will favor different things, is important for the feature you're trying to construct: how you rank these is very dependent on the type of use case you're trying to solve. Our clients, the publishers, are really happy that they can roll out a feature across many different types of content. In biomedicine, for instance gene research, drugs, diseases, there's a lot of structured documentation and a lot of ontologies; all gene names, at least those discovered until fairly recently, are logged in an open-access ontology, and the documentation is really, really good in that small field of science. But everywhere outside of that it's much, much worse. If you look at the humanities in general, there are rarely any official ontologies available that tell you which words are important or which things are synonyms of what. So what we do is actually very important for developing this type of services and recommendations for all the other disciplines.
So, future directions. As I said, we're currently working on understanding the relationships between all these things that we extract. There are so many different ways you can say a given thing, and when you talk about the relationship between two things, there's an equal number of ways you can say that. Just the fact that serum consists mostly of water can be said in so many different ways, and the same goes for thin-film-coated gold nanoparticles (we're currently working on a product for the nano industry with a partner), which can also be described in a number of ways. What's interesting, of course, is that when these relationships stack up, we can replace the two things, the subject and the object, and then have a general understanding of how the relationship itself can be described. So a big challenge for us is trying to normalize and reduce the types of relationships between things in the corpus.
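As a rough illustration of that normalisation, here is a sketch that maps different surface formulations of a relation onto one canonical predicate between two concepts. The verb-pattern table and the triple format are assumptions, not our working vocabulary.

```python
# Toy sketch: normalise (subject, predicate, object) triples so that
# "consists mostly of", "is composed of", etc. collapse to one relation.
RELATION_MAP = {
    "consists mostly of": "composed_of",
    "is composed of": "composed_of",
    "is made up of": "composed_of",
    "is coated with": "coated_with",
    "coated with": "coated_with",
}

def normalise_relation(subject: str, predicate: str, obj: str) -> tuple[str, str, str]:
    canonical = RELATION_MAP.get(predicate.lower().strip(), predicate)
    return (subject, canonical, obj)

print(normalise_relation("serum", "consists mostly of", "water"))
# ('serum', 'composed_of', 'water')
print(normalise_relation("gold nanoparticles", "coated with", "thin film"))
# ('gold nanoparticles', 'coated_with', 'thin film')
```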
Another big forward-looking feature is to provide our services to other companies that are trying to solve problems and have access to unstructured text but no ability to process it. We're working with a couple of large companies to basically make large text collections computable. Much of what we do can be applied to any large collection of text, and you can do all sorts of really interesting analytics on it once you know what's what, what's similar, and what the important aspects of a text are. Then, ultimately, where we want to go is reasoning at scale. That's really what you need in order to augment scientific research most efficiently: you need to be able to reason about what something is, what the causal chain of events is, whether this is a disputed fact, whether everyone agrees this is how things are, or whether there are long chains of causality that go unnoticed and can only really be uncovered by massive analytics. I guess the ultimate prize there is the cure for cancer.
So, we have a small team, located almost in the second city of Denmark. We're 18 people, I think, now, and all of them have worked at big international companies and basically chosen to come work with us for measly salaries and life in the suburbs, because we're so excited about the promise of assisting science. We have no Danish clients; we only work with international publishers. And yes, we are hiring, so feel free to apply; we're growing right now and would love to receive applications from you. I think that concludes my talk, and I'd love to answer questions. There's a ton of detail that I left out, and there are really many questions that you've been asking, so let's go through them.
The first one is: is clickstream analysis used to analyze behavioral data, such as hyperlinks between articles, and do you use Spark for this? Yes, I think we do use Spark. A confession: even though I grew up with a computer and coded demos on my C64 in my parents' bedroom in the 80s, I do not actually work as a developer in our company; I'm one of the founders and I sell the vision. But I can answer accurately that we do look at clickstream data, though it's limited to profile building, not session analysis, because there's a lot of noise and people get distracted. If you have subsequent clicks through a corpus, it really just tells you something about what the user is interested in, not necessarily that the things they click on are related, because people get distracted. So yes, we use clicks, but not really streams.
And if you do link bait, isn't that manipulation? Well, we were actually asked to do this, so yeah. When you're working with big corporations, you have different layers of management with different key performance indicators, and the people that work on the front end would like to see a feature used, so you need to optimize the data for a feature to be used. I guess the reason I can still fall asleep at night is that I think what we're doing is vastly superior to the traditional co-download statistics normally used in science: the things that get recommended across scientific publishers are the things that other people downloaded in the same session. One of the biggest problems with that, just to do a little diversion here, is that when you only look at behavioral data, you have absolutely no way of recommending the new article that came out yesterday, because you have no behavioral data attached to it. It's what we call the cold-start problem: unless you can identify that this article is very similar to another article that does have behavioral data, you cannot make a recommendation until, by accident, people stumble across it and you know someone actually did something with it. So what we do here, obviously it's a Jekyll-and-Hyde thing, and the best solution is always a combination of the two factors.
How do you make rules for classifying words or phrases that are very domain-specific, across the many different research domains? There are actually very few phrases that are syntactically very similar across domains but have very different meanings, and most of that problem we've circumnavigated by looking at longer phrases and by filtering out the stuff that has ambivalence. You will see that we try not to extract things that, when mentioned alone, can mean different things; when you add an additional token in front, they often become much less ambiguous, and we then prefer that one. That's simply an algorithmic solution, not something we hard-code: we look at the ones that have ambiguity and try to pick longer phrases that are supersets that include them.
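A tiny sketch of that preference, with the ambiguity flags and the example phrases being assumptions for illustration only.

```python
# Toy sketch: when a short phrase is known to be ambiguous on its own,
# drop the bare term and keep the longer supersets that contain it.
AMBIGUOUS = {"cell", "model"}

def prefer_unambiguous(candidates: list[str]) -> list[str]:
    keep = []
    for phrase in candidates:
        tokens = phrase.lower().split()
        if len(tokens) == 1 and tokens[0] in AMBIGUOUS:
            continue  # drop the bare ambiguous term ...
        keep.append(phrase)  # ... but keep supersets like "solar cell"
    return keep

print(prefer_unambiguous(["cell", "solar cell", "stem cell",
                          "electrode potentiometry"]))
# ['solar cell', 'stem cell', 'electrode potentiometry']
```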
Do you do any kind of personalization? We don't have a product for personalization, because it's a big hot potato in science. People are really afraid of being tracked, because they think they have the cure for cancer, and search history is a complete no-go for most of the clients we work with. So we don't have a product yet. We think it's incredibly interesting and we'd love to do it, but we don't have a partner to do it with, and probably it's going to be outside of science.
What is the scale of data used in your processing; how much data, how many words, to train your model? So that's another story. For the first two years of our startup, we were trying to build a Google Scholar competitor. We wanted to build a destination site where users could come and search in full-text articles, not see the full-text articles, but we would index them for the publishers and then link out to the real content. We spoke to many different scientific publishers, and they all said that's a brilliant idea, and they had so many meetings with us for two years, and they said, here's another test sample of our content that you can have, and once we're ready to go you'll have this hard drive with a ton of articles, and it will be no problem, everybody will be happy. And then, after two years and only a few thousand articles from each publisher, and a ton of meetings where they asked about our technology in depth and detail, one night in London, I remember, one of the product managers, actually at VP level in one of those publishers, said over a beer: you know, it's never going to happen; they're just keeping you close because they want to know what kind of technology you're developing. A few months after that we pivoted into a different business plan where, in lieu of too little open-access material, we decided to work within the framework of the publishers and be their friends. So now what we're providing are services that are primarily focused on using one publisher's data to perform services for that one publisher's clients. The larger publishers have 10 to 15 million articles; some of the aggregators have more, but most of our clients have less than 10 million documents. With each document being, I don't know, a few hundred kilobytes in plain ASCII, it's not crazy amounts of data; it's a few terabytes for a larger publisher. So, as Aaron Swartz found out, it could easily be dumped anywhere on the internet, but everyone would be sued. Okay.
Would it make sense to pretty-print an article, normalize it, and republish it along with the original, and do you have a tool for that? No, we don't. We cannot provide access to the full text. We work with publishers, and it's a very tightly controlled business; their primary business asset, at least until open access becomes more dominant, is the content they own and control. So we really can't do much with it except behind closed doors. When we worked with Elsevier last year, the forms we had to fill out for security compliance were crazy, I think a hundred and forty-seven tabs in an Excel sheet with a hundred questions in each, and that was just the preliminary survey questions before they send a person over. So yeah, they're really, really serious about security.
Are you using the lambda architecture, and can you talk about that? I'm not familiar with the lambda architecture; I know lambda coefficients, but no. Or maybe we are, who knows. Okay, what is the most interesting finding you have made in your data? The cure for cancer? We haven't found that yet, and I guess we would have published it. We're a service provider, so we work with what the industry calls subject matter experts, or SMEs. We have models where we validate the quality of what we do, and the error rates and so on are all automated tests, and then of course we run it by a selected panel of real scientists who can look at it, who know the content we've processed, and who can tell whether there's an error somewhere, a word we left out that was important. But we can't really evaluate it ourselves. We know that the scientific publishers we work with, the editors there, say we have the best extraction algorithms and produce the best and most usable phrases and results, so that's what we go by; we actually don't know what it's being used for.
What about articles published in the public domain, on open platforms; are you indexing and presenting articles from those sources? Yes, we are working with a couple of open-access publishers. The open-access model has sort of turned publishing inside out: traditional publishers publish your article for free as long as you sign over copyright, whereas for open access you pay for the peer review process and the publishing. That cost has come down a lot from a few years ago, but you still pay around 2,000 euros to publish an article, which puts a little damper on the growth of open access. But we do work with some of the open-access providers. When we started our company, we had this idea that we would just aggregate all of open access, and, well, good luck if you want to try, because the only people who have succeeded in doing anything vaguely resembling that are just aggregating the metadata. It turns out that people publish their articles in a gazillion different formats on a gazillion different websites, where sometimes the download button is behind some kind of I'm-not-a-robot captcha, and it's really, really hard to get at the content. The biggest mistake the open-access community has made is not agreeing on a submission standard that allows that text to be mined. I just don't see why no one has come out and said: this is how you do it, this is the format, give us a JATS XML file, right here on an FTP server, dump it there, and let the community do the rest. But it hasn't been done, so it's not a task for startups; it's incredibly time-consuming to deal with thousands of different submission formats and PDFs. You may think PDF is a nice format, but it turns out that sometimes the renderer will swap the order of sentences around, and it's impossible to figure out where a sentence is completed, or you don't want to know. So we have to have someone else take care of that, and then we can do open access in a few years.
Do you have some kind of best practice for running a deduplication process where different deep learning methods could be applied? I'm not sure I understand the question, but we do have one; that's the key value-add, and I'm sorry, I can't share the source code for free, we're trying to build a business. If you want to work with it, you should come to us. The pipeline we're building is about this, and it's iterative: we pipe in stuff we've learned elsewhere, we work internally in the team, we write white papers, we give talks to each other, and it's a wonderful setup, so please come talk to us. Does this apply well to computer science papers? Oh yes, arXiv. We've indexed arXiv once, but we haven't set it up for re-indexing, and I think we should; it's the whole eat-your-own-dog-food thing, so we should get that up and running again when we get around to it. Right now we have these other jobs that pay money that we have to do first.
Does your technology work for languages other than English? No, we haven't found anyone willing to pay for it yet. Most of what we do can be transferred to other languages. I'm not myself fluent in German, but I think possibly there are some rules that would have to be adapted for its grammar; there's basically nothing preventing it from being ported to other languages. We've been asked to do Chinese for IP analysis, for patent analysis, but the tools everyone else is using are basically some kind of auto-translation with text analytics applied afterwards, which is probably inferior but makes more sense from a cost perspective, unfortunately. I think that's it, a lot of questions, thanks for that, and let's say thank you to Mads. Thank you.
