Cloudtenna has come up with a file search portal to tackle the problem of finding files scattered across multiple silos.
The company says DirectSearch, its cloud service for businesses, scans all your file storage locations – on-premises systems, cloud file storage services, and hosted online applications – builds a metadata index, and then uses in-memory processing to run fast searches of your individual file universe.
You run a search on a topic, such as "Walmart", and it returns a list of files with that text string in their name or contents.
Cloudtenna was started in 2013 and has just received $4m in seed funding plus a contribution from strategic investor Citrix. That's a longish gap between founding and seed funding.
Its software has three tasks: first, to compile a list of a business's files, their attributes and contents; second, to watch their access patterns and content changes; and third, to respond to search requests.
The company started out using Hadoop to construct the metadata needed by its likely business customers. An issue was building a central index and restricting search areas to files individual users had permission to see, using Access Control Lists (ACLs). It took ages to crunch the file permissions with Hadoop.
Cloudtenna then wrote new code using in-memory processing and Spark to get the speed it needed. Off-the-shelf Elasticsearch is not used. It stores its indices in the Amazon cloud but also uses Google to safeguard data availability.
Machine learning techniques are used, with reliance on content and user access graphs, to produce personalised search results for every user. These recognise that if your boss accessed a file, it's probably an important file for you too.
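The "your boss accessed it" idea can be illustrated with a toy ranking function. This is only a sketch under stated assumptions, not Cloudtenna's method: the names `rank_results`, `access_log`, `manager_of` and the `boss_weight` value are all hypothetical, and a simple weighted count stands in for the machine learning the company describes.

```python
from collections import Counter


def rank_results(candidates, user, access_log, manager_of, boss_weight=0.5):
    """Order candidate files for one user (hypothetical scoring, not Cloudtenna's).

    access_log is a list of (user, path) events; manager_of maps a user to
    their boss. A file scores one point per access by the user plus
    boss_weight points per access by the user's boss.
    """
    own = Counter(path for who, path in access_log if who == user)
    boss = manager_of.get(user)
    if boss is None:
        by_boss = Counter()
    else:
        by_boss = Counter(path for who, path in access_log if who == boss)

    def score(path):
        return own[path] + boss_weight * by_boss[path]

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(candidates, key=lambda p: (-score(p), p))
```

Under these assumptions, a file the boss opened three times outranks one the user opened once, which in turn outranks a file nobody in that small graph has touched.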
There is also a personalised auto-complete function similar to Google Search that comes into play when you start entering a file search string, based on files a particular user has accessed.
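A per-user auto-complete of this kind might look roughly like the following. It is a minimal sketch, assuming suggestions are simply the most frequent words in files the user has opened; the functions `build_user_vocabulary` and `autocomplete` are made up for illustration and say nothing about how DirectSearch actually does it.

```python
import re
from collections import Counter


def build_user_vocabulary(accessed_files):
    """Count words in the names and contents of files one user has accessed.

    accessed_files is an iterable of (name, content) pairs (hypothetical shape).
    """
    vocab = Counter()
    for name, content in accessed_files:
        vocab.update(re.findall(r"[a-z0-9]+", (name + " " + content).lower()))
    return vocab


def autocomplete(vocab, prefix, limit=5):
    """Return up to `limit` words starting with `prefix`, most frequent first."""
    prefix = prefix.lower()
    matches = [(word, count) for word, count in vocab.items() if word.startswith(prefix)]
    matches.sort(key=lambda wc: (-wc[1], wc[0]))
    return [word for word, _ in matches[:limit]]
```

Because the vocabulary is built per user, two people typing the same prefix can get different suggestions.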
The file source silos are accessed using connectors, and their state is monitored by web-like crawlers for each source, which update the index. These scans ignore ACL restrictions. Scanning per user, guided by each user's ACL entry, is much slower than running a single scan; many hours could be needed. Instead a shared folder is scanned once, and each user's access to the results is then restricted using ACLs.
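The scan-once, filter-later approach can be sketched in a few lines. This is an illustration under assumptions, not Cloudtenna's code: `IndexedFile`, `build_index` and `search` are hypothetical names, the ACL is reduced to a plain set of user names, and a word-level inverted index stands in for whatever index structure the product really uses.

```python
import re
from dataclasses import dataclass, field


@dataclass
class IndexedFile:
    path: str
    content: str
    # Hypothetical stand-in for an ACL: the users allowed to see this file.
    allowed_users: set = field(default_factory=set)


def build_index(files):
    """Scan every file once, ignoring permissions, mapping each word to file paths."""
    index = {}
    for f in files:
        words = set(re.findall(r"[a-z0-9]+", (f.path + " " + f.content).lower()))
        for word in words:
            index.setdefault(word, set()).add(f.path)
    return index


def search(index, files_by_path, user, term):
    """Look the term up in the shared index, then drop files the user may not see."""
    hits = index.get(term.lower(), set())
    return sorted(p for p in hits if user in files_by_path[p].allowed_users)
```

The index is built once for everyone; permissions are applied only when results are returned, so adding a user does not require another scan.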
DirectSearch is the first of a set of products that could provide e-discovery, audit trails, governance and compliance functionality. The machine learning can also be used to detect unlikely and/or unusual file access patterns.
Read the full article on The Register here.